Guide To Reliable Internet Services And Applications (Computer Communications Networks)

Computer Communications and Networks

For other titles published in this series, go to
www.springer.com/series/4198

The Computer Communications and Networks series is a range of textbooks, monographs
and handbooks. It sets out to provide students, researchers and non-specialists alike with
a sure grounding in current knowledge, together with comprehensible access to the latest
developments in computer communications and networking.
Emphasis is placed on clear and explanatory styles that support a tutorial approach, so that
even the most complex of topics is presented in a lucid and intelligible manner.

Charles R. Kalmanek
Y. Richard Yang

Sudip Misra

Editors

Guide to Reliable Internet
Services and Applications

Editors
Charles R. Kalmanek
AT&T Labs Research
180 Park Ave.
Florham Park NJ 07932
USA
crk@research.att.com

Y. Richard Yang
Yale University
Dept. of Computer Science
51 Prospect St.
New Haven CT 06511
USA
yry@cs.yale.edu

Sudip Misra
Indian Institute of Technology Kharagpur
School of Information Technology
Kharagpur-721302, India
smisra.editor@gmail.com
Series Editor
Professor A.J. Sammes, BSc, MPhil, PhD, FBCS, CEng
Centre for Forensic Computing
Cranfield University
DCMT, Shrivenham
Swindon SN6 8LA
UK

ISSN 1617-7975
ISBN 978-1-84882-827-8
e-ISBN 978-1-84882-828-5
DOI 10.1007/978-1-84882-828-5
Springer London Dordrecht Heidelberg New York
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Control Number: 2010921296
© Springer-Verlag London Limited 2010

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced,
stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the
Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to
the publishers.
The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a
specific statement, that such names are exempt from the relevant laws and regulations and therefore free
for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the information
contained in this book and cannot accept any legal responsibility or liability for any errors or omissions
that may be made.
Cover design: SPi Publisher Services
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Foreword

An oft-repeated adage among telecommunication providers goes, “There are five
things that matter: reliability, reliability, reliability, time to market, and cost. If you
can’t do all five, at least do the first three.”
Yet, designing and operating reliable networks and services is a Herculean task.
Building truly reliable components is unacceptably expensive, forcing us to construct reliable systems out of unreliable components. The resulting systems are
inherently complex, consisting of many different kinds of components running a
variety of different protocols that interact in subtle ways. Inter-networks such as the
Internet span multiple regions of administrative control, from campus and corporate networks to Internet Service Providers, making good end-to-end performance
a shared responsibility borne by sometimes uncooperative parties. Moreover, these
networks consist not only of routers, but also lower-layer devices such as optical
switches and higher-layer components such as firewalls and proxies. And, these
components are highly configurable, leaving ample room for operator error and
buggy software. As if that were not difficult enough, end users understandably care
about the performance of their higher-level applications, which has a complicated
relationship with the behavior of the underlying network.
Despite these challenges, researchers and practitioners alike have made tremendous strides in improving the reliability of modern networks and services. Their
efforts have laid the groundwork for the Internet to evolve into a worldwide communications infrastructure – one of the most impressive engineering artifacts ever
built. Yet, much of the amassed wisdom of how to design and run reliable networks
has been spread across a variety of papers and presentations in a diverse array of
venues, in tools and best-common practices for managing networks, and sometimes
only in the minds of the many engineers who design networking equipment and
operate large networks.
This brings us to this book, which captures the state-of-the-art for building reliable networks and services. Like the topic of reliability itself, the book is broad,
ranging from reliability modeling and planning, to network monitoring and network configuration, to disaster preparedness and reliable applications. A diverse
collection of experts, from both industry and the academe, have come together to
distill the collective wisdom. The book is both grounded in practical challenges and
forward looking to put the design and operation of reliable networks on a strong
foundation. As such, the book can help us build more reliable networks and services
today, and face the many challenges of achieving even greater reliability in the
years ahead.
Jennifer Rexford
Princeton University

Preface

Overview
This book arose from a conversation at the Internet Network Management workshop (INM) in 2007. INM’07 was subtitled “The Five Nine’s Workshop” because it
focused on raising the availability of Internet services to “Five Nine’s” or 99.999%,
an availability metric traditionally associated with the telephone network. During
our conversation, we vehemently agreed that there was a need for
a comprehensive book on reliable Internet services and applications – a guide that
would collect in one volume the accumulated wisdom of leading researchers and
practitioners in the field.
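To make the “Five Nine’s” target concrete, the following minimal Python sketch converts an availability figure into the downtime it permits per year. The function name and the 365-day year are assumptions made for this illustration.

```python
# Assumption for this sketch: downtime is budgeted over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability: float) -> float:
    """Return the downtime per year, in minutes, permitted by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {annual_downtime_minutes(target):.2f} minutes/year")
```

Under these assumptions, 99.999% availability leaves a budget of roughly five and a quarter minutes of downtime per year.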
Networks and networked application services using the Internet Protocol have
become a critical part of society. Service disruptions can have significant impact
on people’s lives and business. In fact, as the Internet has grown, application requirements have become more demanding. In the early days of the Internet, the
typical applications were non-real-time applications, where packet retransmission
and application layer retry would hide underlying transient network disruptions.
Today, applications such as online stock trading, online gaming, Voice over IP
(VoIP), and video are much more sensitive to small perturbations in the network.
For example, following one undersea cable failure in the Pacific, AT&T restored
the service on an alternate route, which introduced 5 ms of additional packet delay.
This seemingly small additional delay was sufficient to cause problems for an enterprise customer that operated an application between a call center in India and a data
center in Canada. This problem led to subsequent re-engineering of the customer’s
end-to-end connection.
In addition, networked application services have become an increasingly important part of people’s lives. The Internet and virtual private networks support many
mission critical business services. Ten years ago, it would have been just an inconvenience if someone lost their IP service. Today, people and businesses depend on
Internet applications. Online stock trading companies are not in business if people cannot execute their trades. The Department of Defense cannot operate its
information-based programs if its information infrastructure is not operating. Call
centers with VoIP services cannot serve their customers without their IP network.

Although we started work on this book with a focus on network reliability, it
should be obvious from the preceding description that it is important to consider
both reliability and performance, and to consider both networks and networked application services. Examples of networked applications include email, VoIP, search
engines, e-commerce sites, news sites, and content delivery networks.

Features
This book has a number of features that make it a unique and valuable guide to
reliable Internet services and applications.
Systematic, interdisciplinary approach: Building and operating reliable network
services and applications requires a systematic approach. This book provides
comprehensive, systematic, and interdisciplinary coverage of the important technical topics, including areas such as networking; performance and reliability
modeling; network measurement; configuration, fault, and security management;
and software systems. The book provides an introduction to all of the topics,
while at the same time going into enough depth for interested readers who already
understand the basics.
Specifically, the book is divided into seven parts. Part I provides an introduction
to the challenges of building reliable networks and applications, and presents an
overview of the structure of a large Internet Service Provider (ISP) network. Part II
introduces reliability modeling and network capacity planning. Part III extends the
discussion beyond a single network administrative domain, covering interdomain
reliability and overlay networks. Part IV provides an introduction to an important aspect of reliability: configuration management. Part V introduces network
measurements, which provide the underpinning of network management. Part VI
covers network and security management, and disaster preparedness. Part VII describes techniques for building application services, and provides a comprehensive
overview of capacity and performance engineering for these services. Taken in total,
the book provides a comprehensive introduction to an important topic.
Coverage of pragmatic problems arising in real, operational deployments: Building and operating reliable networks and applications requires an understanding of
the pragmatic challenges that arise in an operational setting. This book is written
by leading practitioners and researchers, and provides a unique perspective on the
subject matter arising from their experience. Several chapters provide valuable “best
practices” to help readers translate ideas into practice.
Content and structure allow reference reading: Although the book can be read
from cover to cover, each chapter is designed to be largely self-contained, allowing
readers to jump to specific topics that they may be interested in. The necessary
overlap across a few of the chapters is minimal.

Audience
The goal of this book is to present a comprehensive guide to reliable Internet
services and applications in a form that will be of broad interest to educators and
researchers. The material is covered in a level of detail that would be suitable for an
advanced undergraduate or graduate course in computer science. It can be used as
the basis or supplemental material for a one- or two-semester course, providing a
solid grounding in both theory and practice. The book will also be valuable to researchers seeking to understand the challenges faced by service providers and to
identify areas that are ripe for research.
The book is also intended to be useful to practitioners who want to broaden their
understanding of the field, and/or to deepen their knowledge of the fundamentals.
By focusing our attention on a large ISP network and associated application services, we consider a problem that is large enough to expose the real challenges
and yet broad enough to expose guidelines and best practices that will be applicable in other domains. For example, though the book does not discuss access or
wireless networks, we believe that the principles and approaches to reliability that
are presented in this book apply to them and are, in fact, broadly applicable to any
large network or networked application. We hope that you will find the book to be
informative and useful.
Charles R. Kalmanek, Florham Park, NJ
Sudip Misra, India
Y. Richard Yang, New Haven, CT

Acknowledgments

The credit for this book goes first and foremost to the authors of the individual
chapters. It takes a great deal of effort to crystallize one’s understanding of a topic
into an overview that is self-contained, technically deep, and interesting. The authors
of this volume have done an outstanding job.
The editors acknowledge the contributions of many reviewers, whose comments
clearly improved the quality of the chapters. Simon Rees and Wayne Wheeler, our
editors at Springer, have been helpful and supportive.
The editors also acknowledge the support that they have been given by their
families and loved ones during the long evenings and weekends spent developing this book.

Contents

Part I Introduction and Reliable Network Design

1 The Challenges of Building Reliable Networks and Networked Application Services
Charles R. Kalmanek and Y. Richard Yang

2 Structural Overview of ISP Networks
Robert D. Doverspike, K.K. Ramakrishnan, and Chris Chase

Part II Reliability Modeling and Network Planning

3 Reliability Metrics for Routers in IP Networks
Yaakov Kogan

4 Network Performability Evaluation
Kostas N. Oikonomou

5 Robust Network Planning
Matthew Roughan

Part III Interdomain Reliability and Overlay Networks

6 Interdomain Routing and Reliability
Feng Wang and Lixin Gao

7 Overlay Networking and Resiliency
Bobby Bhattacharjee and Michael Rabinovich

Part IV Configuration Management

8 Network Configuration Management
Brian D. Freeman

9 Network Configuration Validation
Sanjai Narain, Rajesh Talpade, and Gary Levin

Part V Network Measurement

10 Measurements of Data Plane Reliability and Performance
Nick Duffield and Al Morton

11 Measurements of Control Plane Reliability and Performance
Lee Breslau and Aman Shaikh

Part VI Network and Security Management, and Disaster Preparedness

12 Network Management: Fault Management, Performance Management, and Planned Maintenance
Jennifer M. Yates and Zihui Ge

13 Network Security – A Service Provider View
Brian Rexroad and Jacobus Van der Merwe

14 Disaster Preparedness and Resiliency
Susan R. Bailey

Part VII Reliable Application Services

15 Building Large-Scale, Reliable Network Services
Alan L. Glasser

16 Capacity and Performance Engineering for Networked Application Servers: A Case Study in E-mail Platform Planning
Paul Reeser

Index

Part I

Introduction and Reliable Network Design

Chapter 1

The Challenges of Building Reliable Networks
and Networked Application Services
Charles R. Kalmanek and Y. Richard Yang

1.1 Introduction
In the decades since the ARPANET interconnected four research labs in 1969
[1], computer networks have become a critical infrastructure supporting our
information-based society. Our dependence on this infrastructure is similar to
our dependence on other basic infrastructures such as the world’s power grids and
the global transportation systems. Failures of the network infrastructure or major
applications running on top of it can have an enormous financial and social cost
with serious consequences to the organizations and consumers that depend on these
services.
Given the importance of this communications and applications infrastructure to
the economy and society as a whole, reliability is a major concern of network and
service providers. After a survey of major network carriers including AT&T, BT,
and NTT, Telemark [7] concludes that, “The three elements which carriers are most
concerned about when deploying communication services are network reliability,
network usability, and network fault processing capabilities. The top three elements
all belong to the reliability category.” Unfortunately, the challenges associated with
running reliable, large-scale networks are not well documented in the research literature. Moreover, while networking and software-educational curricula provide a good
theoretical foundation, there is little training in the techniques used by experienced
practitioners to address reliability challenges. Another issue is that while traditional
telecommunications vendors gained extensive experience in building reliable software, the pace of change has accelerated as the Internet has grown and Internet
system vendors do not meet the level of reliability traditionally associated with “carrier grade” systems. Newer vendors accustomed to building consumer software are
entering the service provider market, but they do not have a culture that focuses
on the higher level of required reliability. This places a greater burden on service
providers who integrate their software to help these vendors “raise the bar” on reliability to offer reliable services.

C.R. Kalmanek
AT&T Labs, 180 Park Ave., 07932, Florham Park, NJ, USA
e-mail: crk@research.att.com
Y.R. Yang
Yale University, 51 Prospect Street, New Haven, CT, USA
e-mail: yry@cs.yale.edu

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_1,
© Springer-Verlag London Limited 2010

Although we emphasize network reliability in the foregoing section, it is important to consider both reliability and performance and to consider both networks and
networked application services. Users are interested in the performance of an end-to-end service. When a user is unable to access his e-mail, he does not particularly
care whether the network or the application is at fault. Examples of networked applications include e-mail, Voice over IP, search engines, e-commerce sites, news sites,
and content delivery networks.

1.2 Why Is Reliability Hard?
Supporting reliable networks and networked application services involves some of
the most complex engineering and operational challenges that are dealt with in any
industry. Much of this complexity is intentionally transparent to the end users, who
expect things to “just work.” Moreover, the end users are typically not exposed to
the root causes of network or service problems when their service is degraded or
interrupted. As a result, it is natural for end users to assume that network and service reliability are not hard. In part, users get this impression because most service
providers and Internet-facing web services operate at very high levels of reliability.
Though it may look easy, this level of reliability is a result of solid engineering and
“constant vigilance.” The best service providers engage in a process of continuous
improvement, similar to the Japanese “Kaizen” philosophy that was popularized by
Deming [2]. In this book, we address the challenges faced by service providers and
the approaches that they use to deliver reliable services to their users. Before delving into the solution, we ask ourselves, why is it so hard to build highly reliable
networks and networked application services?
We can characterize the difficulty as resulting from three primary causes. The
first challenge is scale and complexity; the second is that the services operate in the
presence of constant change. These challenges are inherent to large-scale networks.
The third challenge is less fundamental but still important. It relates to challenges
with measurement and data.

1.2.1 Scale and Complexity Challenges
Scale and complexity challenges are fundamental to any large network or service
infrastructure. As Steve Bellovin remarked, “Things break. Complex systems break
in complex ways” [8]. In particular, large service provider networks contain hundreds of thousands of network elements distributed around the world, and tens of
thousands of different models of equipment. These network elements are interconnected and must interoperate correctly to offer services to the network users.
Failures in one part of the network can impact other parts of the network. Even if
we consider only the infrastructure needed to provide basic IP connectivity services,
it consists of a vast number of complex building blocks: routers, multiplexers, transmission equipment, servers, systems software, load balancers, storage, firewalls,
application software, etc. At any given point in time, some network elements have
failed, have been taken out of service, or will be operating at a degraded performance
level.
The preceding description only hints at the challenges. Despite the careful engineering and modeling that is done through all stages of the service life cycle, if
we look at the service infrastructure as a system, we note that the system does not
always behave as expected. There are many reasons for this, including:
• Software defects in network elements;
• Inadequate modeling of dependencies;
• Complex software-support systems.

The vast majority of the elements involved in providing a network service contain
software, which can be buggy, particularly when the software function is complex.
If a bug is triggered, a piece of equipment can behave in unexpected ways. Even
though the correct operation of router software is critical to service, we have seen
design flaws in the way that the router operating system handles resource management and scheduling, which manifest themselves as latent outages. The history
of the telephone network contains examples of major network outages caused by
software faults, such as the famous “crash” of the AT&T long-distance telephone
network in 1990 [3]. Similarly, the network elements that make up the IP network
infrastructure contain complex control-plane software implementing distributed
protocols that must interoperate properly for the network to work. When compared
to the telephone switching software, control-plane software of IP networks changes
more frequently and is far more likely to be subject to undetected software faults.
These faults occasionally result in unexpected behaviors that can lead to outages or
degraded performance.
In a large complex infrastructure, operators do not have a comprehensive model
of all of the dependencies between systems supporting a given service: they rely on
simplifying abstractions such as network layering and administrative separation of
concerns. These abstractions can break down in unexpected ways. For example,
there are complex interactions between network layers, such as the transport and IP
layers, that affect reliability. Consider a link between two routers that is transported
over a SONET ring. Networks are typically designed so that protection switching at
the SONET layer is transparent to the IP layer. However, several years ago, AT&T
experienced problems in the field, whereby a SONET “protection switching event”
triggered a router-software bug that caused several minutes of unexpected customer
downtime. Since the protection switch occurred correctly, the problem did not trigger an alarm and was only uncovered by correlating customer trouble tickets with
network event data. This cross-layer interaction is an example of the kinds of dependency that can be difficult to anticipate and troubleshoot.
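The kind of correlation described above, matching customer trouble tickets against network event data, can be sketched in a few lines of Python. This is a minimal illustration and not the tooling used in the incident: the record fields (circuit, opened, time, type), the sample data, and the 15-minute window are all hypothetical.

```python
from datetime import datetime, timedelta


def correlate(tickets, events, window=timedelta(minutes=15)):
    """Pair each ticket with events on the same circuit that occurred
    no more than `window` before the ticket was opened."""
    matches = []
    for ticket in tickets:
        for event in events:
            lag = ticket["opened"] - event["time"]
            if ticket["circuit"] == event["circuit"] and timedelta(0) <= lag <= window:
                matches.append((ticket["id"], event["type"], lag))
    return matches


# Hypothetical sample data, invented for this illustration.
tickets = [
    {"id": "T1", "circuit": "ckt-42", "opened": datetime(2009, 3, 1, 10, 9)},
    {"id": "T2", "circuit": "ckt-77", "opened": datetime(2009, 3, 1, 14, 0)},
]
events = [
    {"type": "sonet_protection_switch", "circuit": "ckt-42", "time": datetime(2009, 3, 1, 10, 5)},
    {"type": "sonet_protection_switch", "circuit": "ckt-77", "time": datetime(2009, 3, 1, 9, 0)},
]

matches = correlate(tickets, events)
for ticket_id, event_type, lag in matches:
    print(f"ticket {ticket_id} opened {lag} after {event_type}")
```

Only the first ticket is matched, because the second was opened hours after the event on its circuit. A real system would need a mapping from customers to the circuits and layers that carry their traffic, which is exactly the dependency information that is hard to maintain.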
In addition to the scale of the network and the complexity of the network equipment, correct operation depends on the operation of complex software systems that
manage the network and support customer care. Router-configuration files contain a
large number of parameters that must be configured correctly. Incorrect configuration of an access control list can create security vulnerabilities, or alternatively, can
cause traffic to be “blackholed” by blocking legitimate traffic. If there is a mismatch
between the Quality of Service settings on a customer-edge router and those on
the provider-edge router that it connects, some applications may experience performance problems under heavy load. An inconsistency between the network inventory
database and the running network can lead to stranded network capacity, service
degradations, network outages, etc. These problems sometimes manifest themselves
weeks or months after the inconsistency appeared – for this reason, they are sometimes referred to as “time bombs.”
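One way to catch such inconsistencies before they turn into “time bombs” is a periodic audit that compares the inventory database with what the network reports. The Python sketch below assumes, purely for illustration, that both views can be reduced to a mapping from interface name to state; the interface names and state labels are hypothetical.

```python
def audit_inventory(inventory, running):
    """Compare the inventory database's view with the running network's view.

    Both arguments map an interface name to a state label such as
    'in_service' or 'spare'. Returns a sorted list of (interface, finding).
    """
    findings = []
    for name in sorted(set(inventory) | set(running)):
        recorded = inventory.get(name)
        observed = running.get(name)
        if recorded is None:
            findings.append((name, "missing from inventory"))
        elif observed is None:
            findings.append((name, "missing from network"))
        elif recorded != observed:
            findings.append((name, f"inventory says {recorded}, network says {observed}"))
    return findings


# Hypothetical data, invented for this illustration.
inventory = {"r1:ge-0/0/1": "in_service", "r1:ge-0/0/2": "spare", "r2:ge-0/0/1": "in_service"}
running = {"r1:ge-0/0/1": "in_service", "r1:ge-0/0/2": "in_service", "r3:ge-0/0/1": "in_service"}

for name, finding in audit_inventory(inventory, running):
    print(f"{name}: {finding}")
```

The audit itself is simple; the difficult part in practice is extracting a trustworthy view of the running network to compare against.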

1.2.2 Constant Change
The second challenge relates to the fact that any large-scale service infrastructure
undergoes constant change. Maintenance and customer-provisioning activities in
a large global network are ongoing, spanning multiple time zones. On a typical
workday, new customers are being provisioned, service for departing customers is
being turned down, and change orders that modify some service characteristic are being processed for existing customers. Capacity augmentation and traffic grooming,
whereby private-line connections are rearranged to use network resources more efficiently, take place daily. Routine maintenance activities such as software upgrades
also take place during predefined maintenance “windows.” More complex maintenance activities, such as network migrations, also occur periodically. Examples of
network migration include moving a customer connection from one access router
to another, replacing a backbone router, or consolidating all of a regional network’s
traffic onto a national backbone network in order to retire an older backbone. Replacing a backbone router in a service provider network requires careful planning
and execution of a sequence of moves of the “uplinks” from access routers in order
to minimize the amount of traffic that is dropped. Decision-support tools are used to
model the traffic that impinges on all of the affected links at every step of the move
to ensure that links are not congested.
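The step-by-step check performed by such decision-support tools can be illustrated with a deliberately simplified Python sketch. It assumes that each access-router uplink carries a fixed demand and terminates on exactly one backbone link, and it ignores rerouting and failure scenarios. The names, capacities, and the 80% utilization limit are hypothetical.

```python
def check_move_plan(capacity, placement, demand, moves, max_utilization=0.8):
    """Apply uplink moves one at a time and return the first unsafe step.

    capacity:  backbone link name -> capacity
    placement: uplink name -> backbone link it currently terminates on
    demand:    uplink name -> traffic demand (same units as capacity)
    moves:     ordered list of (uplink, new backbone link)
    Returns (step, link, utilization) for the first step at which a link
    exceeds max_utilization, or None if every step is safe.
    """
    placement = dict(placement)  # work on a copy so the caller's data is untouched
    for step, (uplink, new_link) in enumerate(moves, start=1):
        placement[uplink] = new_link
        load = {link: 0.0 for link in capacity}
        for name, link in placement.items():
            load[link] += demand[name]
        for link, total in load.items():
            if total > max_utilization * capacity[link]:
                return step, link, total / capacity[link]
    return None


# Hypothetical scenario: three uplinks move off a router that is being retired.
capacity = {"old": 20.0, "new_a": 10.0, "new_b": 10.0}
placement = {"ar1": "old", "ar2": "old", "ar3": "old"}
demand = {"ar1": 4.0, "ar2": 3.0, "ar3": 3.0}

risky_plan = [("ar1", "new_a"), ("ar2", "new_a"), ("ar3", "new_a")]
safe_plan = [("ar1", "new_a"), ("ar2", "new_a"), ("ar3", "new_b")]

print(check_move_plan(capacity, placement, demand, risky_plan))
print(check_move_plan(capacity, placement, demand, safe_plan))
```

In this scenario the first plan is rejected at its third step, when all three uplinks land on the same link, while the second plan passes every step.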
In the midst of these day-to-day changes, network failures can occur at any time.
The network is designed to automatically restore service after a failure. However,
during planned maintenance activities, it is possible that some network capacity
has been removed from service temporarily, potentially leaving the network more
vulnerable to specific failures. Under normal conditions, maintenance to repair the
failed network element is scheduled to occur later at a convenient time, after which
the network traffic may revert back to its original path.

Finally, in addition to the day-to-day changes of new customers, or the occasional changes that come from major network migrations, there are also architectural
changes. These changes might result from the introduction of new features and
services, or new protocols. An example might be the addition of a new “class of
service” in the backbone. Another example might be turning up support for multicast services in MPLS-based VPNs. The first example (class of service) involves
configuration changes that may touch every router in the network. The second example involves introducing a new architectural element (i.e., a PIM rendezvous point),
enabling a new protocol (i.e., PIM), validating the operation of multicast monitoring
tools, etc. All of these changes would have been tested in the lab prior to the First
Field Application (FFA), which is typically the first time that everything comes together in an operational network carrying live customer traffic. If there are problems
during the FFA with the new feature that is being deployed, network operations will
execute procedures to gracefully back out of the change until the root cause of the
problem is analyzed and corrected.

1.2.3 Measurement and Data Challenges
The third challenge in building reliable networks relates to
measurement and data. Vendor products deployed by service providers often suffer
from an inadequate implementation of basic telemetry functions that are necessary
to monitor and manage the equipment. In addition, because of the complexity of
the operating environment described earlier, there are many, diverse data sources,
with highly variable data quality. We present two examples. Despite the maturity of
SNMP [4], AT&T has seen an implementation of a commercial SNMP poller that
did not correctly handle the data impacts of router reboots or loss of data in transit.
Ideally, problems like this are discovered in the lab, but occasionally they are not discovered until the equipment is deployed and supporting live service. Data problems
are not limited to network layer equipment: vendor-developed software components
running on servers may not support monitoring agents that export the data necessary to implement a comprehensive performance-monitoring infrastructure. When
these software components are combined in a complex, multitiered application, the
workflow and dependencies among the components may not be fully understood
even by the vendor. When such a system is deployed, even with well-designed
server instrumentation, it may be difficult to determine exactly which component is
the bottleneck limiting system throughput.
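The router-reboot and data-loss cases from the SNMP example above can be made concrete with a short Python sketch. It is not the poller discussed in the text. It assumes each poll yields a pair of sysUpTime (in hundredths of a second) and a 32-bit counter value, with None standing in for a lost sample; the function name and sample layout are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # an SNMP Counter32 wraps back to zero after 2**32 - 1


def counter_delta(prev, curr):
    """Return the counter increase between two consecutive polls.

    Each sample is a (sys_uptime_ticks, counter_value) pair, or None if the
    poll was lost. Returns None when the interval cannot be trusted.
    """
    if prev is None or curr is None:
        # A sample was lost in transit: do not guess a value for this interval.
        return None
    prev_uptime, prev_value = prev
    curr_uptime, curr_value = curr
    if curr_uptime < prev_uptime:
        # Uptime went backwards, so the device restarted and its counters
        # restarted with it; the difference would be meaningless.
        return None
    if curr_value < prev_value:
        # No restart but a smaller value: assume the counter wrapped once.
        return curr_value + COUNTER32_MODULUS - prev_value
    return curr_value - prev_value


print(counter_delta((1000, 100), (4000, 250)))        # normal interval
print(counter_delta((1000, 4294967290), (4000, 10)))  # counter wrapped
print(counter_delta((500000, 900), (3000, 40)))       # device rebooted
```

This sketch has known limits: sysUpTime itself wraps after roughly 497 days, and a counter that wraps more than once between polls cannot be detected, which is one reason fast interfaces are usually polled through 64-bit counters.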
Another issue is that data are often “locked up” in management system “silos.”
This can result from selecting a vendor’s proprietary element-management system.
Typically, proprietary systems are not designed to make data export easy, since the
vendor seeks to lock the service provider into a complete “solution.” Data silos
can also result from internal implementations. These often result from organizational silos: a management system is specified and built to address a specific set of
functions, without the involvement of subject matter experts from other domains.

Whatever the cause, the end result is that the data necessary to monitor and manage
the infrastructure may not exist or may be difficult to access by analysts who are
trying to understand the system.

1.3 Toward Network and Service Reliability
The examples in Section 1.2 give only a glimpse into the complex challenges faced
by service providers who seek to provide reliable services. Despite these complexities, the vast majority of users receive good service. How is this achieved? At the
highest level, network and service reliability involve both good engineering design
and good operational practices. These practices are inextricably linked: no matter
how good the operations team is, good operational practices cannot make up for a
poorly thought out design. Likewise, a good design that is implemented or operated
poorly will not result in reliable service.
It should be obvious that reliable services start with good design and engineering.
The service design process relies on extensive domain knowledge and a good understanding of the business and service-level objectives. Network engineers develop
detailed requirements for each network element in light of the end-to-end objectives
for reliability, availability, and operability. Network elements are selected carefully.
After a detailed paper and lab evaluation, an engineering team selects a specific
product to meet a particular need. Once the product is selected, it enters a change
control process where differences between the requirements and the product’s capabilities are managed by the service provider in conjunction with the vendor. The
service designers, working closely with test engineers, develop comprehensive engineering rules for each of the network elements, including safe operating limits for
resources such as bandwidth or CPU utilization. Detailed engineering documents
are developed that describe how the network element is to be used, its engineering limits, etc. Network management requirements for the new network element
are developed in conjunction with operations personnel and delivered to the IT
team responsible for the operations-support systems (OSSs). Before the FFA of the
new element, the element and the OSSs undergo an Operations Readiness Test (ORT),
which verifies that the element and the associated OSSs work as expected, and can
be managed by network operations.
The preceding paragraph gives a brief overview of some of the engineering “best
practices” involved in building a reliable network. In addition, reliability and capacity modeling must be done for the network as a whole. The network architecture
includes the appropriate recovery mechanisms to address potential failures. Reliability modeling tools are used to model the impact on the network of failures in light
of both current and forecast demands. Where possible, the tools model cross-layer
dependencies between IP layer links and the underlying transport or physical layer
network, such as the existence of “shared risk groups” – links or elements that may
be subject to simultaneous failure. By simulating all possible failure scenarios, these
tools allow the network designers to trade off network cost against survivability. The
network design also includes a comprehensive security design that considers the important threats to the network and its customers, and implements appropriate access
controls and other security detection and mitigation strategies.
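The shared-risk-group idea described above can be illustrated with a small sketch. This is only a toy model under stated assumptions: the topology, the link names, and the SRG names are hypothetical, and the check covers single-SRG failures and simple connectivity rather than the full set of failure scenarios, demands, and capacities that a real reliability modeling tool would consider.

```python
from collections import defaultdict

# Hypothetical IP-layer links (name -> endpoints); not taken from any real network.
LINKS = {
    "A-B": ("A", "B"),
    "B-C": ("B", "C"),
    "A-C": ("A", "C"),
    "C-D": ("C", "D"),
    "B-D": ("B", "D"),
}

# Hypothetical shared risk groups: IP links that fail together because they
# share an underlying transport or physical resource.
SRGS = {
    "fiber-span-1": {"A-B", "A-C"},  # two IP links riding the same fiber span
    "fiber-span-2": {"B-C"},
    "conduit-3": {"C-D", "B-D"},     # both links to D share one conduit
}


def is_connected(nodes, links):
    """Return True if every node is reachable from an arbitrary start node."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == set(nodes)


def partitioning_srgs(links, srgs):
    """List the SRGs whose failure, taken alone, disconnects the network."""
    nodes = {node for pair in links.values() for node in pair}
    bad = []
    for name, failed in sorted(srgs.items()):
        surviving = [pair for link, pair in links.items() if link not in failed]
        if not is_connected(nodes, surviving):
            bad.append(name)
    return bad


print(partitioning_srgs(LINKS, SRGS))
```

In this toy topology each node has two IP links, so the network looks survivable at the IP layer, yet two of the three shared risk groups isolate a node when they fail. That is the kind of cross-layer dependency the modeling tools are meant to expose.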
An operations organization is typically responsible for managing the network or
service on a day-to-day basis. The operations team is supported by the operations-support systems mentioned earlier. These include configuration-management
systems responsible for maintaining network inventory data and configuring the
network elements, and service assurance systems that collect telemetry data from
the network to support fault and performance management functions. The fault
and performance management systems are the “eyes” of the operations team into
the service infrastructure to figure out, in the case of problems, what needs to be
repaired. We can consider fault and performance management systems as involving
the following areas:
• Instrumentation layer;
• Data management layer;
• Management application layer.

We start thinking about the instrumentation layer by asking what telemetry or
measurement data need to be collected to validate that the service is meeting its
service-level objectives (or to troubleshoot problems if it is not). Standardized
router MIB data provide a base level of information, but additional instrumentation is needed to manage large networks supporting complex applications. Passive
monitoring techniques support collection of data directly from network elements
and dedicated passive monitoring devices, but active monitoring, involving the
injection and monitoring of synthetic traffic, is also required and is commonly
used. Since the correct operation of the IP forwarding layer (data plane) critically
depends on the correct operation of the IP control plane, both data-plane and
control-plane monitoring are important. In software-based application services,
the telemetry frequently does not adequately capture “soft” failure modes, such as
transaction timeouts between devices or errors in software settings and parameters.
Both the servers supporting application software and the applications themselves
need to be instrumented and monitored for both faults and key performance
parameters.
Large service providers typically have a significant number of data sources that
are relevant to service management, and the data management layer needs to be able
to handle large volumes of telemetry and alarm data. As a result, the data-collection
and data-management infrastructure presents challenging systems design problems.
A good design allows data-source-specific collectors to be easily integrated. It also
provides a framework for data normalization: common fields such as timestamps and router names are normalized to a common key during data ingest,
sparing application developers some of the complexity of understanding the
details of the raw data streams. Ideally, the design of the data management layer
supports a common real-time and archival data store that is accessed by a range of
applications.
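As a rough illustration of normalization during ingest, the sketch below maps records from two hypothetical collectors onto common `router` and `ts` keys. The field names (`host`, `time`, `sysName`, `epoch`), the choice of a lowercase short hostname as the router key, and the treatment of timezone-free timestamps as UTC are all assumptions made for the example, not conventions described in the text.

```python
from datetime import datetime, timezone


def normalize_router(name):
    """Map a raw router name to a common key: the lowercase short hostname."""
    return name.strip().lower().split(".")[0]


def normalize_timestamp(value):
    """Return epoch seconds from either an epoch number or an ISO 8601 string."""
    if isinstance(value, (int, float)):
        return int(value)
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption for this sketch: timestamps without a zone are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def normalize(record, router_field, time_field):
    """Copy a raw record, replacing source-specific fields with common keys."""
    out = {k: v for k, v in record.items() if k not in (router_field, time_field)}
    out["router"] = normalize_router(record[router_field])
    out["ts"] = normalize_timestamp(record[time_field])
    return out


# Two hypothetical sources describing the same router at the same instant.
log_record = {"host": "NYC-CR1.example.net",
              "time": "2009-01-01T00:00:00+00:00",
              "msg": "link down"}
counter_record = {"sysName": "nyc-cr1", "epoch": 1230768000, "ifInErrors": 3}

a = normalize(log_record, "host", "time")
b = normalize(counter_record, "sysName", "epoch")
print(a["router"] == b["router"], a["ts"] == b["ts"])
```

Once both records carry the same `router` and `ts` keys, an application can join them without knowing how each collector formats its raw data.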

The management applications built on top of the data management layer
support routine operations functions such as fault and performance management,
as well as more complex analyses. Given the vast quantity of event
data generated by the network, the event management system must filter the information presented to the operations team so that they see only what must be acted upon and
are not flooded with spurious information. The impact of alarm storms (and
the importance of alarm filtering) can be illustrated by the story of Three Mile Island, in which the computer system noted 700 distinct error conditions within the
first minute of the problem, followed by thousands of error reports and updates [5].
The operators were drowning in a sea of information at a time when they needed a
small number of actionable items to work on.
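One simple form of alarm filtering is duplicate suppression with a hold-down timer. The sketch below is a minimal example of that single technique, assuming events arrive sorted by time as hypothetical `(timestamp, element, condition)` tuples; production event management systems typically combine it with correlation and prioritization rules that are not shown here.

```python
def filter_alarms(events, holddown=300):
    """Report at most one alarm per (element, condition) per hold-down window.

    events: iterable of (timestamp_seconds, element, condition), sorted by time.
    holddown: seconds during which repeats of a reported alarm are suppressed.
    """
    last_reported = {}
    actionable = []
    for ts, element, condition in events:
        key = (element, condition)
        if key not in last_reported or ts - last_reported[key] >= holddown:
            actionable.append((ts, element, condition))
            last_reported[key] = ts
    return actionable


# Hypothetical event stream: router r1 repeats the same alarm several times.
events = [
    (0, "r1", "link-down"),
    (10, "r1", "link-down"),
    (20, "r2", "link-down"),
    (299, "r1", "link-down"),
    (300, "r1", "link-down"),
]
print(filter_alarms(events))
```

Here five raw events are reduced to three reports: the first alarm from each router, plus one reminder for `r1` once the hold-down window has expired.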
Management applications also enable operations personnel to control the network, including performing routine tasks such as resetting a line card on a router as
well as more complex tasks. Standard tasks are handled through an operations interface to an operations-support system. Ad hoc tasks that involve a complex workflow
may require operations staff to use a scripting language that accesses the network inventory database and sends commands to network elements or element-management
systems. Ideally, the operations-support systems automate most routine tasks,
audit the results of these tasks, and back them out if there are
problems.
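The apply, audit, and back-out pattern can be sketched in a few lines. In this illustration a plain dictionary stands in for a network element's configuration, and the `audit` callable is a hypothetical check supplied by the caller; a real operations-support system would talk to the element or its element-management system instead.

```python
def apply_with_rollback(device, changes, audit):
    """Apply changes to a device config, audit the result, back out on failure.

    device: dict standing in for a network element's configuration (hypothetical).
    changes: dict of settings to apply.
    audit: callable that takes the device dict and returns True if acceptable.
    Returns True if the change was kept, False if it was backed out.
    """
    snapshot = dict(device)   # saved state used to back out the change
    device.update(changes)
    if audit(device):
        return True
    device.clear()
    device.update(snapshot)   # restore the pre-change configuration
    return False


def audit(config):
    """Hypothetical audit rule: interface must stay up with a sane MTU."""
    return config["admin_state"] == "up" and config["mtu"] <= 9216


router = {"mtu": 1500, "admin_state": "up"}
kept = apply_with_rollback(router, {"mtu": 9000}, audit)
backed_out = not apply_with_rollback(router, {"admin_state": "down"}, audit)
print(kept, backed_out, router)
```

The first change passes the audit and is kept; the second fails it, so the configuration returns to its state before that change.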
It is useful to note that operations personnel are typically organized in multiple response tiers. The lower tiers of operations staff work on immediate problems,
following established procedures. The tools that they use have constrained functionality, targeted at the functions that they are expected to perform. The highest tier
of operations personnel consists of senior operations staff charged with diagnosing
complex problems in real time or performing postmortem analysis of complex, unresolved problems that occurred in the past. These investigations may take more
time than lower-tier operations staff can afford to spend on a specific problem.
When there are serious problems affecting major customers or the network as a
whole, engineers from the network engineering team are also called upon to assist. In these cases, one or more analysts do exploratory data mining (EDM) using
data exploration tools [6] that support data drill down, statistical data analysis, and
data visualization. Well-designed data exploration tools can make a huge difference
when analysts are faced with the “needle in the haystack” problem – trying to sort
through huge quantities of telemetry data to draw meaningful conclusions. When
analysts uncover the root cause of a particular problem, this information can be
used to eliminate the problem, e.g., by pressing a vendor to fix a software bug, by
repairing a configuration error, etc.
As we mentioned in Section 1.2, a broad goal of both the network designers
and network operations is to maintain and continuously improve network reliability,
availability, and performance, despite the challenges. “Holding the gains” or staying
flat on network performance is insufficient to meet increasingly tight customer and
application requirements. There is evidence that the principles and best practices
presented in this book produce results.

[Figure 1.1: plot of unplanned DPM (linear scale) versus year, 1999–2008]
Fig. 1.1 Unplanned DPM for AT&T IP Backbone

Figure 1.1 shows measured Defects-per-Million
(DPM) on the AT&T IP Backbone since the AT&T Managed Internet Service was
first offered in 1999. This chart plots the total number of minutes of port outages
during a year (i.e., the number of minutes each customer port was out of service),
divided by the number of port minutes in that year (i.e., the number of ports times the
number of minutes each was in service), times a normalization factor of 1,000,000.
The points are measured data; the smooth curve resembles a classic improvement
curve. Over the first 2 years of the service, DPM was reduced significantly as vendor problems were addressed, architectural improvements were put in place, and
operations processes were matured. Further improvements continue to be achieved.
While DPM is only one of the many fault and performance metrics that must be
tracked and managed, this chart illustrates how good design and good operations
pay off.
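The DPM definition above reduces to a one-line calculation. The sketch below is a worked example with made-up numbers (10,000 ports, each in service for a full 365-day year); the function name and the inputs are illustrative assumptions, not measured AT&T data.

```python
def defects_per_million(outage_minutes, port_minutes):
    """DPM = total port outage minutes / total port minutes in service * 1,000,000."""
    if port_minutes <= 0:
        raise ValueError("port_minutes must be positive")
    return outage_minutes / port_minutes * 1_000_000


# Hypothetical year: 10,000 ports, each in service for all 525,600 minutes.
ports = 10_000
minutes_per_year = 365 * 24 * 60
port_minutes = ports * minutes_per_year

# Suppose customer ports accumulated 52,560 minutes of outage in total.
dpm = defects_per_million(52_560, port_minutes)
print(round(dpm, 3))
```

With these numbers the result is about 10 DPM, meaning that roughly 10 of every million port minutes were lost to outages.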
The principles that underlie design and operation of reliable networks are also
critical to the design and operation of reliable application services. However, there
are also many differences between these two domains, including wide differences
in the domain knowledge of the typical network engineer and the typical software
developer. The life cycle of reliable software starts with understanding the requirements, and involves every step of the development process, including field support
and application monitoring. As in networks, capacity and performance engineering
of application services rely on both modeling and data collection.
This section has described some of the design and network management practices
that are performed by large service providers that run reliable networks and services.
In Section 1.4, we provide an overview of the material that is covered in the book.

1.4 A Bird’s Eye View of the Book
The book consists of seven parts, covering both reliable networks and reliable network
application services.

1.4.1 Part I: Reliable Network Design
Part I introduces the challenges of building reliable networks and services, and provides background for the rest of the book. Following this chapter, Chapter 2 presents
an overview of the structure of a large ISP backbone network. Since IP network reliability is tied intimately to the underlying transport network layers, Chapter 2 also
presents an overview of these technologies. Section 2.4 provides an overview of the
IP control plane, and introduces Multi-Protocol Label Switching (MPLS), a routing
and forwarding technology that is used by most large ISPs to support Internet and
Virtual Private Network (VPN) services on a shared backbone network. Section 2.5
introduces network restoration, which allows the network to rapidly recover from
failures. This section provides a performance analysis of the limitations of OSPF
failure detection and recovery to motivate the deployment of MPLS Fast Reroute.
The chapter concludes with a case study of an IP network supporting IPTV services
that links together many of the concepts.

1.4.2 Part II: Reliability Modeling and Network Planning
Part II of the book covers network reliability modeling, and its close cousin, network
planning. Chapter 3 starts with an overview of the main router elements (e.g., routing
processors, line cards, switching fabric, power supply, and cooling system), and their
failure modes. Section 3.2 introduces redundancy mechanisms for router elements,
as they are important for availability modeling. Section 3.3 shows how to compute
the reliability metrics of a single router with and without redundancy mechanisms.
Section 3.4 extends the reliability model from a single router to a large network
of edge routers and presents reliability metrics that consider device heterogeneity.
The chapter also provides an overview of the challenges in measuring end-to-end
availability, which is the focus of Chapter 4.
Chapter 4 provides a theoretical grounding in performance and reliability (performability) modeling in the context of a large-scale network. A fundamental
challenge is that the size of the state space is exponential in the number of network
elements. Section 4.2 presents a hierarchical network model used for performability
modeling. Section 4.3 discusses the performability evaluation problem in general
and presents the state-generation approach. The chapter also introduces the nperf
network performability analyzer, a software package developed at AT&T Labs
Research. Section 4.4 concludes by presenting two case studies that illustrate the
material of this chapter, the first involving an IPTV distribution network, and the
second dealing with architecture choices for network access.
Chapter 5 focuses on network planning. Because capacity planning depends on
utilization and traffic data, the chapter takes a systems view. Network measurements are of varying quality, so the modeling process must be robust to data-quality
problems while still giving estimates that are useful for planning. The guiding maxim is “Essentially,
all models are wrong, but some are useful.” This chapter is organized around the
key steps in network planning. Sections 5.2 and 5.3 cover measurements, analysis,
and modeling of network traffic. Section 5.4 covers prediction, including both incremental planning and green-field planning. Section 5.5 presents optimal network
planning. Section 5.6 covers robust planning.

1.4.3 Part III: Interdomain Reliability and Overlay Networks
Part III extends beyond the design of a large backbone network to interdomain
and overlay networks. Chapter 6 provides an overview of interdomain routing.
Section 6.3 highlights the limitations of the BGP routing protocol. For example,
the protocol design does not guarantee that routing will converge to a stable route.
Section 6.4 presents measurement results that quantify the impact of interdomain
routing impairments on end-to-end path performance. Section 6.5 presents a detailed overview of the existing solutions to achieve reliable interdomain routing,
and Section 6.6 points out possible future research directions.
Overlay networks are discussed in Chapter 7 as a way of providing end-to-end
reliability at the application or service layer. The overlay topology can be tailored to
application requirements; overlay routing may choose application-specific policies;
and overlay networks can emulate functionality not supported by the underlying
network. This chapter surveys overlay applications with a focus on how they are
used to increase network resilience. The chapter considers how overlay networks
can make a distributed application more resilient to flash crowds, to component
failures and churn, to network failures and congestion, and to denial-of-service attacks.

1.4.4 Part IV: Configuration Management
Network design is just one part of building a reliable network or service infrastructure; configuration management is another critical function. Part IV discusses this
topic.
Chapter 8 discusses network configuration management, presenting a high-level
view of the software system involved in managing a large network of routers in support of carrier class services. Section 8.2 reviews key concepts to structure the types
of data items that the system must deal with. Section 8.3 describes the subcomponents of the system and the requirements of each subcomponent. This section also
discusses two approaches that are commonly used for router configuration, policy-based and template-based, and highlights the different requirements associated with
provisioning consumer and enterprise services. Section 8.4 gives an overview of
one of the key challenges in designing a configuration-management system, which
is handling changes. Finally, the chapter presents a step-by-step overview of the
subscriber provisioning process.
While a well-designed configuration-management system does configuration auditing, Chapter 9 looks at auditing from a different perspective, describing the need
for bottom-up, network-wide configuration validation. Section 9.2 provides a case
study of the challenges of configuring a multi-organization “collaboration network,”
the types of vulnerabilities caused by configuration errors, the reasons these arise,
and the benefits derived from using a configuration validation system. Section 9.3
abstracts from experience and proposes a reference design of a validation system.
Section 9.4 discusses the IPAssure system and the design choices it has made to realize this design. Section 9.5 surveys related technologies for realizing this design.
Section 9.6 discusses the experience with using IPAssure to assist a US government
agency with compliance with FISMA requirements.

1.4.5 Part V: Network Measurement
While measurement was not a priority in the original design of the Internet, the complexity of networks, traffic, and the protocols that mediate them now requires detailed
measurements to manage the network, to verify that performance meets the required
goals, and to diagnose performance degradations when they occur. Part V covers
network measurement, with a focus on reliability and performance monitoring.
Chapter 10 covers data plane measurements. Sections 10.2–10.5 describe a spectrum of passive traffic measurement methods that are currently employed in provider
networks, and also describe some newer approaches that have been proposed or may
even be deployed in the medium term. Section 10.6 covers active measurement tools.
Sections 10.7–10.8 review IP performance metrics and their usage in service-level
agreements. Section 10.9 presents multiple approaches to deploy active measurement systems.
The control plane in an IP network controls the overall flow of traffic in the network, and is critical to its operation. Chapter 11 covers control-plane measurements.
Section 11.2 gives an overview of the key protocols that make up the “unicast” control plane (OSPF and BGP), describes how they are monitored, and surveys key
applications of the measurement data. Section 11.3 presents the additional challenges that arise in performing multicast monitoring.

1.4.6 Part VI: Network and Security Management, and Disaster Preparedness
Chapter 12 focuses on the network management systems and the tasks involved
in supporting the day-to-day operations of an IP network. The goal of network
operations is to keep the network up and running, and performing at or above
designed levels of service performance. Section 12.2 covers fault and performance
management – detecting, troubleshooting, and repairing network faults and performance impairments. Section 12.3 examines how process automation is incorporated
in fault and performance management to automate many of the tasks that were originally executed by humans. Process automation is the key ingredient that enables
a relatively small Operations group to manage a rapidly expanding number of network elements and customer ports, along with growing complexity. Section 12.4 discusses tracking and
managing network availability and performance over time, looking across larger
numbers of network events to identify opportunities for performance improvements.
Section 12.5 then focuses on planned maintenance. The chapter also presents areas
for innovation and a set of best practices.
Chapter 13 presents a service provider’s view of network security. Section 13.2
provides an exposition of the network security threats and their causes. A fundamental concern is that in the area of network security, the economic balance is heavily
skewed in favor of bad actors. Section 13.3 presents a framework for network security, including the means of detecting security incidents. Section 13.4 deals with the
importance of developing good network security intelligence. Section 13.5 presents
a number of operational network security systems used for the detection and mitigation of security threats. Finally, Section 13.6 summarizes important insights and
then briefly considers important new and developing directions and concerns in network security as an indication of where resources should be focused both tactically
and strategically.
Chapter 14 discusses disaster preparedness as the critical factor that determines
an operator’s ability to recover from a network disaster. For network operators to
effectively recover from a disaster, a significant investment must be made to prepare
before the disaster occurs, so that network operations are prepared to act quickly
and efficiently. This chapter describes the creation, exercise, and management of
disaster recovery plans. With good disaster preparedness, disaster recovery becomes
the disciplined management of the execution of disaster recovery plans.

1.4.7 Part VII: Reliable Application Services
Large-scale networks exist to connect users to applications. Part VII expands the
scope of the book to the software and servers that support network applications.
Chapter 15 presents an approach to the design and development of reliable network application software. This chapter presents the entire life cycle of what it
takes to build reliable network applications, including the software development process, requirements development, architecture, design and implementation, testing
methodology, support, and reporting. This chapter also discusses techniques that
aid in troubleshooting failed systems as well as techniques that tend to minimize
the duration of a failure. The chapter presents best practices for building reliable
network applications.
Chapter 16 provides a comprehensive overview of capacity and performance
engineering (C/PE), which is especially critical to the successful deployment of a
networked service platform. At the highest level, the goal is to ensure that the service meets all performance and reliability requirements in the most cost-effective
manner, where “cost” encompasses such areas as hardware/software resources, delivery schedule, and scalability. The chapter uses e-mail as an illustrating example.
Section 16.4 covers the architecture assessment phase of the C/PE process, including
the flow of critical transactions. Section 16.5 covers the workload/metric assessment phase, including the workload placed on platform elements and the service-level performance/reliability metrics that the platform must meet. Sections 16.6
and 16.7 develop analytic models to predict how a proposed platform will handle
the workload while meeting the requirements (reliability/availability assessment
and capacity/performance assessment). Sections 16.8 and 16.9 develop engineering
guidelines to size the platform initially (scalability assessment) and to maintain service capacity, performance, and reliability post deployment (capacity/performance
management). Best practices of C/PE are given at the end of the chapter.

1.5 Conclusion
With our society’s increasing dependence on networks and networked application
services, the importance of reliability and performance engineering has never been
greater. Unfortunately, large-scale networks and services present significant challenges: scale and complexity, the need for correct operation in the presence of
constant change, as well as measurement and data challenges. Addressing these
challenges requires good design and sound operational practices. Network and
service engineers start with a firm understanding of the design objectives, the technology, and the operational environment for the service; follow a comprehensive
service design process; and develop capacity and performance engineering models.
Network and service management rely on a well-thought-out measurement design,
a data collection and storage infrastructure, and a suite of management tools and applications. When done right, the end result is a network or service that works well.
As customers and applications become more demanding, this “raises the bar” for
reliability and performance, ensuring that this field will continue to provide opportunities for research and improvements in practice.

References
1. A History of the ARPANET. Bolt, Beranek, and Newman, 1981.
2. Deming, W. E. (2000). The new economics for government, industry and education (2nd ed.).
Cambridge, MA: MIT Press. ISBN 0–262–54116–5.
3. AT&T statement (1990). The Risks Digest, 9(63).
4. Wilson, A. M. (1998). Alarm management and its importance in ensuring safety, Best practices
in alarm management, Digest 1998/279.
5. Stallings, W. (1999). SNMP, SNMPv2, SNMPv3, and RMON 1 and 2 (3rd ed.). Reading, MA:
Addison-Wesley.
6. Mahimkar, A., Yates, J., Zhang, Y., Shaikh, A., Wang, J., Ge, Z., et al. (December 2008). Troubleshooting chronic conditions in large IP networks. Proceedings of the 4th ACM international
conference on emerging Networking Experiments and Technologies (CoNEXT).
7. Telemark Survey. http://www.telemarkservices.com/
8. Schwartz, J. (2007). Who needs hackers? New York Times, September 12, 2007.

Chapter 2

Structural Overview of ISP Networks
Robert D. Doverspike, K.K. Ramakrishnan, and Chris Chase

2.1 Introduction
An Internet Service Provider (ISP) is a telecommunications company that offers its
customers access to the Internet. This chapter specifically covers the design of a
large Tier 1 ISP that provides services to both residential and enterprise customers.
Our primary focus is on a large IP backbone network in the continental USA, though
similarities arise in smaller networks operated by telecommunication providers in
other parts of the world. This chapter is principally motivated by the observation that
in large carrier networks, the IP backbone is not a self-contained entity; it co-exists
with numerous access and transport networks operated by the same or other service providers. In fact, how the IP backbone interacts with its neighboring networks
and the transport layers is fundamental to understanding its structure, operation, and
planning. This chapter is a hands-on description of the practical structure and implementation of IP backbone networks. Our goal is complicated by the complexity of
the different network layers, each of which has its own nomenclature and concepts.
Therefore, one of our first tasks is to define the nomenclature we will use, classifying the network into layers and segments. Once this partitioning is accomplished,
we identify where the IP backbone fits and describe its key surrounding layers and
networks.
This chapter is motivated by three aspects of the design of large IP networks.
The first aspect is that the design of an IP backbone is strongly influenced by
the details of the underlying network layers. We will illustrate how the evolution
of customer access through the metro network has influenced the design of the
backbone. We also show how the evolution of the Dense Wavelength-Division
Multiplexing (DWDM) layer has influenced core backbone design.

R.D. Doverspike (✉)
Executive Director, Network Evolution Research, AT&T Labs Research,
200 S. Laurel Ave, Middletown, NJ 07748, USA
e-mail: rdd@research.att.com
K.K. Ramakrishnan
Distinguished Member of Technical Staff, Networking Research, AT&T Labs Research,
Shannon Labs, 180 Park Avenue, Florham Park, NJ 07932, USA
C. Chase
AT&T Labs, 9505 Arboretum Blvd, Austin, TX 78759, USA
e-mail: chase@labs.att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_2,
© Springer-Verlag London Limited 2010

The second aspect presents the use of Multiprotocol Label Switching (MPLS) in
large ISP networks. The separation of routing and forwarding provided by MPLS
allows carriers to support Virtual Private Networks (VPNs) and Traffic Engineering
(TE) on their backbones much more simply than with traditional IP forwarding.
The third aspect is how network outages manifest in multiple network layers and
how the network layers are designed to respond to such disruptions, usually through
a set of processes called network restoration. This is of prime importance because
a major objective of large ISPs is to provide a known level of quality of service to
their customers through Service Level Agreements (SLAs). Network disruptions occur
from two major sources: failure of network components and maintenance activity.
Network restoration is accomplished through preplanned network design processes
and real-time network control processes, as provided by an Interior Gateway Protocol (IGP) such as Open Shortest Path First (OSPF). We present an overview
of OSPF reconvergence and the factors that affect its performance. As customers
and applications place more stringent requirements on restoration performance in
large ISPs, the assessment of OSPF reconvergence motivates the use of MPLS Fast
Reroute (FRR).
Beyond the motivations described above, the concepts defined in this chapter lay
useful groundwork for the succeeding chapters. Section 2.2 provides a structural
basis by providing a high-level picture of the network layers and segments of a
typical, large nationwide terrestrial carrier. It also provides nomenclature and technical background about the equipment and network structure of some of the layers
that have the largest impact on the IP backbone. Section 2.3 provides more details
about the architecture, network topology, and operation of the IP backbone (the IP
layer) and how it interacts with the key network layers identified in Section 2.2.
Section 2.4 discusses routing and control protocols and their application in the IP
backbone, such as MPLS. The background and concepts introduced in Sections 2.2–
2.4 are utilized in Section 2.5, where we describe network restoration and planning.
Finally, Section 2.6 describes a “case study” of an IPTV backbone. This section
unifies many of the concepts presented in the earlier sections and shows how they come
together to allow network operators to meet their network performance objectives.
Section 2.7 provides a summary, followed by a reference list, and a glossary of
acronyms and key terms.

2.2 The IP Backbone Network in Its Broader Network Context
2.2.1 Background and Nomenclature
From the standpoint of large telecommunication carriers, the USA and most large
countries are organized into metropolitan areas, which are colloquially referred to as
metros. Large intrametro carriers place their transmission and switching equipment
in buildings called Central Offices (COs). Business and residential customers typically obtain telecommunication services by connecting to a designated first CO
called a serving central office. This connection occurs over a feeder network that
extends from the CO toward the customer plus a local loop (or last mile) segment
that connects from the last equipment node of the feeder network to the customer
premises. Equipment in the feeder network is usually housed in above-ground huts,
on poles, or in vaults. The feeder and last-mile segments usually consist of copper,
optical fiber, coaxial cable, or some combination thereof. Coaxial cable is typical
of a cable company, also called a Multiple System Operator (MSO). While we will
not discuss metro networks in detail in this chapter, we do discuss those
aspects that affect the IP backbone. The metro networks we describe correspond
mostly to those of carriers that originated as large telephone companies
(sometimes called “Telcos”).
Almost all central offices today are interconnected by optical fiber. Once a customer’s data or voice enters the serving central office, if it is destined outside that
serving central office, it is routed to other central offices in the same metro area. If
the service is bound for another metro, it is routed to one or more gateway COs.
If it is bound for another country, it eventually routes to an international gateway.
A metro gateway CO is often called a Point of Presence (POP). While POPs were
originally defined for telephone service, they have evolved to serve as intermetro
gateways for almost all telecommunication services. Large intermetro carriers have
one or more POPs in every large city.
Given this background, we now employ some visualization aids. Networks are
organized into network layers, which we depict as two network graphs
stacked vertically on top of one another in Fig. 2.1. Each of the network layers
can be considered to be an overlay network with respect to the network below.

Fig. 2.1 Conceptual network layers and segmentation (an inter-metro network connected to metro segments Metro 1 through Metro 5)

We can further organize these layers into access, metro, and core network segments. Figure 2.1 shows the core segment connected to multiple metro segments.
Each metro segment represents the network layers of the equipment located in
the central offices of a given metropolitan area. The access segment represents the
feeder network and loop network associated with a given metro segment. The core
segment represents the equipment in the POPs and network structures that connect
them for intermetro transport and switching.
In this chapter, we focus on the ISP backbone network, which is primarily
associated with the core segment. We refer only briefly to access architectures
and will discuss portions of the metro segment to the extent to which they interact and connect to the core segment. Also, in this chapter we will not discuss
broader telecommunication contexts, such as international networks (including undersea links), satellite, and wireless networks. More detail on the various network
segments and their network layers and a historical description of how they arose can
be found in [11].
Unfortunately, there is a wide variety of terminology used in the industry, which
presents a challenge for this chapter because of our broad scope. Some of the terminology is local to an organization, application, or network layer and, thus, when used
in a broader context can be confused with other applications or layers. Within the
context of network-layering descriptions, we will use the term IP layer. However,
we use the term “IP backbone” interchangeably with “IP layer” in the context of the
core network segment. The terms Local Area Network (LAN), Metropolitan Area
Network (MAN), and Wide Area Network (WAN) are also sometimes used and correlate roughly with the access, metro, and core segments defined earlier; however,
LAN, MAN, and WAN are usually applied only in the context of packet-based networks. Therefore, in this chapter, we will use the terms access, metro, and core, since
they apply to a broader context of different network technologies and layers. Other
common terms for the various layers within the core segment are long-distance and
long-haul networks.

2.2.2 Simple Graphical Model of Network Layers
The following simple graph-oriented model is helpful when modeling routing and
network design algorithms, to understand how network layers interact and, in particular, how to classify and analyze the impact of potential network disruptions. This
model applies to most connection-oriented networks and, thus, will apply to some
higher-layer protocols that sit on top of the IP layer. The IP layer itself is connectionless and does not fit exactly in this model. However, this model is particularly
helpful to understand how lower network layers and neighboring network layers
interact.
In the layered model, a network layer consists of nodes, links (also called edges),
and connections. The nodes represent types of switches or cross-connect equipment that exchange data in either digital or analog form via the links that connect
them. Note that at the lowest layer (such as fiber) nodes represent equipment, such
as fiber-optic patch panels, in which connections are switched manually by cross-connecting fiber patch cords from one interface to another. Links can be modeled
as directed (unidirectional) or undirected (bidirectional). Connections are cross-connected (or switched) by the nodes onto the links, and thus form paths over the
nodes and links of the graph. Note that the term connection often has different names
at different layers and segments. For example, in most telecommunication carriers,
a connection (or portions thereof) is called a circuit in many of the lower network layers, often referred to as transport layers. Connections can be point-to-point
(unidirectional or bidirectional), point-to-multipoint or, more rarely, multipoint-to-multipoint. Generally, connections arise from two sources. First, telecommunication
services can arise “horizontally” (relative to our conceptual picture of Fig. 2.1) from
a neighboring network segment. Second, connections in a given layer can originate from edges of a higher network layer. In this way, each layer provides
a connection “service” for the layer immediately above it to provide connectivity. Sometimes, a “client/server” model is referenced, such as the User-Network
Interface (UNI) model [29] of the Optical Internetworking Forum (OIF), wherein
the links of higher-layer networks are “clients” and the connections of lower-layer
networks are “servers”. For example, see G.7713.2 [19] for more discussion of connection management in lower-layer transport networks.
Recall that the technology layers we define are differentiated by the nodes,
which represent actual switching or cross-connect equipment, rather than more abstract entities, such as protocols within each of these technology layers that can
create multiple protocol sublayers. An early manifestation of protocol layering is
the OSI model developed by the ISO standards organization [37] and the resulting classification of packet layering, such as Layer 1, Layer 2, Layer 3, which
subsequently emerged in the industry. Although these layering definitions can be
somewhat strained in usage, the industry generally associates IP with Layer 3 and
MPLS or Ethernet VLANs with Layer 2 (both are described later in the chapter). Layer 1, or the Physical Layer (PHY layer) of the OSI stack, covers multiple
technology layers that we will cover in the next section.
We illustrate this graphical network-layering model in Fig. 2.2, which depicts
two layers. Note that for simplicity, we depict the edges in Fig. 2.2 as undirected.
The cross-connect equipment represented by the nodes of Layer U (“upper layer”)
connect to their counterpart nodes in Layer L (“lower layer”) by interlayer links,
depicted as lightly dashed vertical lines. While this model has no specific geographical correlation, we note that the switching or cross-connect equipment represented
in Layer U usually are colocated in the same buildings/locations (central offices in
carrier networks) as their lower-layer counterparts in Layer L. In such representations, the interlayer links are called intra-office links. The links of Layer U are
transported as connections in lower Layer L. For example, Fig. 2.2 highlights a link
between nodes 1 and 6 of Layer U. This link is transported via a connection between
nodes 1 and 6 of Layer L. The path of this connection is shown through nodes (1, 2,
3, 4, 5, 6) at Layer L.

Fig. 2.2 Example of network layering. Nodes of Layer U and Layer L are co-located (same central office); each Layer U link is transported as a connection in Layer L.

Another example is given by the link between nodes 3 and 5 of Layer U. This
routes over nodes (3, 4, 5) in Layer L. As this layered model illustrates, the concept
of a “link” is a logical construct, even in lower “physical layer(s)”. Along these
lines, we make some observations about Fig. 2.2:
1. There are more nodes in Layer L than in Layer U.
2. When viewed as separate abstract graphs, the degree of logical connectivity in
Layer L is less than that for Layer U. For example, there are at most three
edge-diverse paths between nodes 1 and 6 in Layer U. However, there are at
most two edge-diverse paths between the corresponding pair of nodes in
Layer L.
3. When we project the links of Layer U onto their connection paths in Layer L,
we see some overlap. For example, the two logical links highlighted in Layer U
overlap on links (3, 4) and (4, 5) of Layer L.
These observations generalize to the network layers associated with the IP backbone
and affect how network layers are designed and how network failures at various layers affect higher-layer networks. The second observation says that while the logical
topology of an upper-layer network, such as the IP layer, looks like it has many
alternate paths to accommodate network disruptions, this can be deceiving unless
one incorporates the lower-layer dependencies. For example, if link 3–4 of Layer
L fails, then both links 1–6 and 3–5 of Layer U fail. Put more generally, failures
of links of lower-layer networks usually cause multiple link failures in higher-layer
networks. Specific examples will be described in Section 2.3.2.
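The dependency between layers described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original model: the function and variable names are hypothetical, and only the two Layer U links highlighted in Fig. 2.2 are encoded, using the Layer L paths given in the text.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[int, int]


def norm(a: int, b: int) -> Link:
    """Store an undirected link as an ordered pair so (4, 3) equals (3, 4)."""
    return (a, b) if a <= b else (b, a)


# Each Layer U link is carried by a connection whose Layer L path is a node list.
# These are the two links highlighted in Fig. 2.2 (paths taken from the text).
upper_link_paths: Dict[Link, List[int]] = {
    norm(1, 6): [1, 2, 3, 4, 5, 6],
    norm(3, 5): [3, 4, 5],
}


def lower_links_on_path(path: List[int]) -> Set[Link]:
    """Layer L links traversed by a connection that follows `path`."""
    return {norm(a, b) for a, b in zip(path, path[1:])}


def failed_upper_links(failed_lower: Link) -> Set[Link]:
    """Layer U links that go down when a single Layer L link fails."""
    failed = norm(*failed_lower)
    return {
        upper
        for upper, path in upper_link_paths.items()
        if failed in lower_links_on_path(path)
    }


def shared_lower_links(u1: Link, u2: Link) -> Set[Link]:
    """Layer L links on which two Layer U links overlap."""
    p1 = lower_links_on_path(upper_link_paths[norm(*u1)])
    p2 = lower_links_on_path(upper_link_paths[norm(*u2)])
    return p1 & p2


print(failed_upper_links((3, 4)))          # both highlighted Layer U links
print(shared_lower_links((1, 6), (3, 5)))  # the overlap noted in observation 3
```

Projecting every upper-layer link onto its lower-layer path in this way is what exposes shared risk: two links that look diverse in Layer U may still fail together.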

2.2.3 Snapshot of Today’s Core Network Layers
Figure 2.3 provides a representation of the set of services that might be provided by
a large US-based carrier, and how these services map onto different network layers
in the core segment. This figure is borrowed from [11] and depicts a mixture of
legacy network layers (i.e., older technologies slowly being phased out) and current
or emerging network layers. For a connection-oriented network layer (call it layer
L), demand for connections comes from two sources: (1) links of higher network
layers that route over layer L and (2) demand for telecommunications services provided by layer L but which originate outside layer L’s network segment. The second
source of demand is depicted by rounded rectangles in Fig. 2.3. Note that Fig. 2.3
is a significant simplification of reality; however, it does capture most predominant
layers and principal interlayer relationships relevant to our objectives. Note that an
important observation in Fig. 2.3 is that links of a given layer can be spread over
multiple lower layers including “skipping” over intermediate lower layers.
Before we describe these layers, we provide some preliminary background on
Time Division Multiplexing (TDM), whose signals are often used to transport links
of the IP layer. Table 2.1 summarizes the most common TDM transmission rates.

Fig. 2.3 Example of core-segment network layers. Services shown: Frame Relay & ATM; Private Line (DS3 to OC-12); Residential IPTV; Voice over IP; ISP & Business VPN; Ethernet Services; Circuit-switched Voice; DS1 Private Line; Ethernet Private Line; Gigabit Ethernet Private Line; Wavelength Services. Network layers shown: Ethernet; IP; ATM; Circuit-Switched; W-DCS; DCS-3/3; Intelligent Optical Switch (IOS); SONET Ring; ROADM / Pt-to-pt DWDM; Fiber; Pre-SONET Transmission. Key: service; layer-to-layer service connections; network layer; legacy layer.

Table 2.1 Time division multiplexing (TDM) digital hierarchy (partial list)
Approximate rate  DS-n    Plesiosynchronous  SONET    SDH       OTN wrapper
64 Kb/s           DS-0    E-0                -        -         -
1.5 Mb/s          DS-1    -                  -        -         -
2.0 Mb/s          -       E-1                -        -         -
34 Mb/s           -       E-3                -        -         -
45 Mb/s           DS-3    -                  -        -         -
51.84 Mb/s        -       -                  STS-1    VC-3      -
155.5 Mb/s        -       -                  OC-3     STM-1     -
622 Mb/s          -       -                  OC-12    STM-4     -
2.5 Gb/s          -       -                  OC-48    STM-16    ODU-1
10 Gb/s           -       -                  OC-192   STM-64    ODU-2
40 Gb/s           -       -                  OC-768   STM-256   ODU-3
100 Gb/s          -       -                  -        -         ODU-4
Kb/s = kilobits per second; Mb/s = megabits per second; Gb/s = gigabits per second.
OTN line rates are higher than payload. ODU-2 includes 10 GigE and ODU-3 includes 40 GigE
(under development). ODU-4 only includes 100 GigE

The Synchronous Optical Network (SONET) digital-signal standard [35], pioneered
by Bellcore (now Telcordia) in the early 1990s, is shown in the fourth column
of Table 2.1. SONET is the existing higher-rate digital-signal hierarchy of North
America. Synchronous Digital Hierarchy (SDH) is a similar digital-signal standard
later pioneered by the International Telecommunication Union (ITU-T) and adopted
by most of the rest of the world. The DS-n column represents the North American
pre-SONET digital-signal rates, most of which originated in the Bell System. The
Plesiosynchronous column represents the pre-SDH rates used mostly in Europe.
However, after nearly 30 years, both DS-n and Plesiosynchronous are still quite
abundant and their related private-line services are still sold actively. Finally, in the
last column, we show the more recent Optical Transport Network (OTN) signals,
also standardized by the ITU-T [18]. Development of the OTN signal standards
was originally motivated by the need for a more robust standard to achieve very
high bit rates in DWDM technologies; for example, it was needed to incorporate
and standardize various bit-error recovery techniques, such as Forward Error Correction (FEC). As such, the OTN rates were originally termed “digital wrappers” to
contain high rate SONET, SDH, or Ethernet signals, plus provide the extra fault notification information needed to reliably transport the high rates. Although there are
many protocol layers in OTN, we just show the Optical channel Data Unit (ODU)
rates in Table 2.1. To minimize confusion, in the rest of this chapter, we will mostly
give examples in terms of DS-n and SONET rates.
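The SONET and SDH rates follow a simple arithmetic pattern that is easy to check: an OC-n signal runs at n times the STS-1 rate of 51.84 Mb/s, and OC-n corresponds to STM-(n/3) in SDH. The following is a small sketch of that arithmetic; the function names are illustrative and not taken from any standard.

```python
STS1_MBPS = 51.84  # STS-1 rate, as listed in Table 2.1


def oc_rate_mbps(n: int) -> float:
    """Line rate of SONET OC-n in Mb/s: n multiples of the STS-1 rate."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n * STS1_MBPS


def sdh_equivalent(n: int) -> str:
    """SDH signal with the same rate as OC-n (defined when n is a multiple of 3)."""
    if n < 3 or n % 3 != 0:
        raise ValueError("OC-n maps to STM-(n/3) only when n is a multiple of 3")
    return f"STM-{n // 3}"


for n in (3, 12, 48, 192, 768):
    print(f"OC-{n}: {oc_rate_mbps(n):.2f} Mb/s, {sdh_equivalent(n)}")
```

The printed rates (155.52 Mb/s, 622.08 Mb/s, and so on) are the exact values behind the rounded "approximate rate" figures commonly quoted for these signals.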
Referring back to the layered network model of the previous section, Table 2.2
gives some examples of the nodes, links, and connections in Fig. 2.3. We only list
those layers that have relevance to the IP layer. We will briefly describe these layers
in the following sections.

Table 2.2 Examples of nodes, links, and connections for network layers of Fig. 2.3
Core layer      Typical node                  Typical link                  Typical connection
IP              Router                        SONET OC-n, 1/10 gigabit      IP is connection-less
                                              Ethernet, ODU-n
Ethernet        Ethernet switch or router     1/10 gigabit Ethernet or      Ethernet can refer to both
                with Ethernet functionality   rate-limited Ethernet         connection-less and
                                              private line                  connection-oriented services
Asynchronous    ATM switch                    SONET OC-12/48                Permanent virtual circuit (PVC),
transfer mode                                                               switched virtual circuit (SVC)
(ATM)
W-DCS           Wideband digital              SONET STS-1 (channelized)     DS1
                cross-connect system (DCS)
SONET Ring      SONET add-drop multiplexer    SONET OC-48/192               SONET STS-n, DS-3
                (ADM)
IOS             Intelligent optical switch    SONET OC-48/192               SONET STS-n
                (IOS) or broadband digital
                cross-connect system (DCS)
DWDM            Point-to-point DWDM terminal  DWDM signal                   SONET, SDH, or 1/10/100
                or reconfigurable optical                                   gigabit Ethernet
                add-drop multiplexer (ROADM)
Fiber           Fiber patch panel or          Fiber optic strand            DWDM signal or SONET, SDH,
                cross-connect                                               or Ethernet signal

2.2.4 Fiber Layer
The commercial intercity fiber layer of the USA is privately owned by multiple
carriers. In addition to owning fiber, carriers lease bundles of fiber from one another using various long-term Indefeasible Right of Use (IROU) contracts to cover
needed connectivity in their networks. Fiber networks differ significantly between
metro and rural areas. In particular, in carrier metro networks, optical fiber cables are
usually placed inside PVC pipes, which are in turn placed inside concrete conduits.
Additionally, fiber for core networks is often corouted in conduit or along rights-of-way with metro fiber. Generally, in metro areas, optical cables are routed and
spliced between central offices. In the central office, most carriers prefer to connect
the fibers to a fiber patch panel. Equipment that uses (or will eventually use) the interoffice fibers is also cross-connected into the patch panels. This gives the carrier
flexibility to connect equipment by simply connecting fiber patch cords on the patch
panels. Rural areas differ in that there are often long distances between central offices and, as such, intermediate huts are used to splice fibers and place equipment,
such as optical amplifiers.

2.2.5 DWDM Layer
Although many varieties of DWDM systems exist, we show a simplified view of
a (one-way) point-to-point DWDM system in Fig. 2.4. Here, Optical Transponders
(OTs) are Optical-Electrical-Optical (O-E-O) converters that input optical digital
signals from routers, switches, or other transmission equipment using a receive device, such as a photodiode, on the add/drop side of the OT. The input signal has a
standard intra-office wavelength, denoted by λ0. The OT converts the signal to electrical form. Various other physical layer protocols may be applied at this point, such
as incorporating various handshaking called Link Management Protocols (LMPs)
between the transmitting equipment and the receiving OT. A transponder is in clear
channel mode if it does not change the transport protocols of the signal that it
receives and essentially remains invisible to the equipment connecting to it. For
example, Gigabit Ethernet (GigE) protocols from some routers or switches sometimes incorporate signaling messages to the far-end switch in the interframe gaps. If
clear channel transmission is employed by the OT, such messages will be preserved
as they are routed over the DWDM layer.
After conversion to electrical form, the signal is retransmitted using a laser on
the network or line-side of the OT. However, typical of traditional point-to-point
systems, the wavelength of the laser is fixed to correspond to the wavelength assigned to a specific channel of the DWDM system, λk. The output light pulses from
multiple OTs at different wavelengths are then multiplexed into a single fiber by
sending them through an optical multiplexer, such as an Arrayed Waveguide Grating
(AWG) or similar device. If the distance between the DWDM terminals is sufficiently long, optical amplifiers are used to boost the power of the signal. However,
power balancing among the DWDM channels is a major concern of the design of the
DWDM system, as are other potential optical impairments. These topics are beyond
the scope of this chapter. On the right side of Fig. 2.4, typically, the same (or similar)
optical multiplexer is used in reverse, in which case, it becomes an optical demultiplexer. The OTs on the right side (the receive direction of the DWDM system)
basically work in reverse to the transmit direction described above, by receiving the
specific interoffice wavelength, λk, converting to electrical, and then using a laser
to generate the intra-office wavelength, λ0.

Fig. 2.4 Simplified view of point-to-point DWDM system. Client signals (SONET, Ethernet) enter the transmit-side OTs at λ0. Optical Transponder (OT): inputs standard intra-office wavelength (λ0), electrically regenerates signal, and outputs specific wavelength for long-distance transport (λk over channel k). Optical multiplexer: combines input optical signals with different wavelengths (from one optical fiber each) to output on a single optical fiber; can be implemented with an optical grating. An optical amplifier sits between the multiplexer and the optical demultiplexer. Receive-side OT: inputs λk, electrically regenerates signal, and outputs λ0.

Carrier-based DWDM systems are usually deployed in bidirectional configurations. To see this, the reader can visually reproduce the entire system in Fig. 2.4 and
then flip it (mirror it) right to left. The multiplexed DWDM signal in the opposite
direction is transmitted over a separate fiber. Therefore, even though the electronics
and lasers of the one-way DWDM system in the reverse direction operate separately
from the shown direction, they are coupled operationally. For example, the two fiber
ports (receive and transmit) of the OT are usually deployed on the same line card
and arranged next to one another.
Optical amplification is used to extend the distance between terminals of a
DWDM system. However, multiple systems are required to traverse the continental USA. Connections can be established between different point-to-point DWDM
systems in an intermediate CO via an intermediate-regenerator OT (not pictured in
Fig. 2.4). An intermediate-regenerator OT has the same effect on a signal as back-to-back OTs. Since the signal does not have to be cross-connected elsewhere in
the intermediate central office, cost savings can be achieved by omitting the intermediate lasers and receivers of back-to-back OTs. However, we note that most
core DWDM networks have many vintages of point-to-point systems from different
equipment suppliers. Typically, an intermediate-regenerator OT can only be used to
connect between DWDM systems of the same equipment supplier.
A difficulty with deploying point-to-point DWDM systems is that in central
offices that interface multiple fiber spans (i.e., the node in the fiber layer has degree
> 2), all connections demultiplex in that office and pass through OTs. OTs are typically expensive and it is advantageous to avoid their deployment where possible.
A better solution is the Reconfigurable Optical Add-Drop Multiplexer (ROADM).
We show a simplified diagram of a ROADM in Fig. 2.5. The ROADM allows for
multiple interoffice fibers to connect to the DWDM system. Appropriately, it is often called a multidegree ROADM or n-degree ROADM. As Fig. 2.5 illustrates, the
ROADM is able to optically (i.e., without use of OTs) cross-connect channel k
(transmitting at wavelength λk) arriving on one fiber to channel k (wavelength λk)
outgoing on another fiber. Note that the same wavelength must be used on the two
fibers. This is called the wavelength continuity constraint. The ROADM can also be
configured to terminate (or “drop”) a connection at that location, in which case it
is cross-connected to an OT to connect to routers, switches, or transmission equipment. A “dropped” connection is illustrated by λ2 on the second fiber from the top
on the left in Fig. 2.5 and an “added” connection is illustrated by λn on the bottom
fiber on the left. As with the point-to-point DWDM system, optical properties of the
system impose distance (also called reach) constraints.

Fig. 2.5 Simplified view of Reconfigurable Optical Add-Drop Multiplexer (ROADM). Optical Transponders (OTs) are also provided in bidirectional mode for regeneration at intermediate nodes.

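The wavelength continuity constraint can be illustrated with a short sketch. Assuming a connection is cross-connected optically at every intermediate ROADM, it needs one channel that is free on every fiber along its path. The code below uses a simple first-fit rule, which is one common heuristic and not necessarily what any particular carrier or vendor uses; the node names and channel occupancy are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Fiber = Tuple[str, str]  # one-way fiber from the first node to the second


def first_fit_wavelength(
    path: List[str],
    in_use: Dict[Fiber, Set[int]],
    num_channels: int,
) -> Optional[int]:
    """Return the lowest channel k that is free on every fiber of `path`, or None."""
    fibers = list(zip(path, path[1:]))
    for k in range(1, num_channels + 1):
        if all(k not in in_use.get(fiber, set()) for fiber in fibers):
            return k
    return None


# Hypothetical occupancy: channels already assigned on each fiber.
in_use = {("A", "B"): {1, 2}, ("B", "C"): {1, 3}}

print(first_fit_wavelength(["A", "B", "C"], in_use, num_channels=4))
print(first_fit_wavelength(["A", "B", "C"], in_use, num_channels=3))
```

With three channels the second call finds nothing, even though each fiber individually has a free channel (3 on A-B, 2 on B-C). Without regeneration at B, where an OT could change the wavelength, the connection is blocked.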
Many transmission technologies, including optical amplification, are used to
extend the distance between the optical add/drop points of a DWDM system.
Today, this separation is designed to be about 1,500 km for a long-distance DWDM
system, as a trade-off between cost and the all-optical distance for a US-wide
network. Longer connections have to regenerate their signals, usually with an
intermediate-regenerator OT. As with point-to-point DWDM systems, connections
crossing ROADMs from different equipment suppliers usually must add/drop and
connect through OTs.
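As a rough illustration of the reach constraint, the sketch below places regeneration points greedily along a path: the signal travels as far as the reach allows and is regenerated at the last node before the limit would be exceeded. The 1,500 km default comes from the text; the span lengths are hypothetical, and the sketch assumes regeneration is possible at every intermediate node. Actual designs also account for optical impairments and equipment availability.

```python
from typing import List


def regeneration_points(span_km: List[float], reach_km: float = 1500.0) -> List[int]:
    """Indices of intermediate nodes where the signal is regenerated.

    Node i is the node at which span i begins; node 0 is the source.
    """
    points: List[int] = []
    travelled = 0.0
    for i, length in enumerate(span_km):
        if length > reach_km:
            raise ValueError(f"span {i} ({length} km) exceeds the reach on its own")
        if travelled + length > reach_km:
            points.append(i)  # regenerate at the node where span i begins
            travelled = 0.0
        travelled += length
    return points


# Hypothetical path with five fiber spans (lengths in km).
print(regeneration_points([600, 700, 500, 900, 400]))
```

For these spans the signal is regenerated at nodes 2 and 4, splitting the 3,100 km path into all-optical segments of 1,300 km, 1,400 km, and 400 km.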
We illustrate a representative ROADM layer for the continental USA in Fig. 2.6.
The links represent fiber spans between ROADMs. As described above, routing
a connection over the network of Fig. 2.6 may require points of regeneration. We
also note, though, that today’s core transport carriers usually have many vintages
of DWDM technology and, thus, there may be several ROADM networks from different equipment suppliers, plus several point-to-point DWDM networks. All this
complexity must be managed when routing higher-layer links, such as those of the
IP backbone, over the DWDM layer.

Fig. 2.6 Example of ROADM Layer topology. Nodes are ROADMs (labeled cities include Seattle, Portland, Salt Lake City, and Chicago). Note: This figure is a simplified illustration. It does not represent the specific design of any commercial carrier.

We finish this introduction of the DWDM layer with a few observations. While
most large carriers have DWDM technology covering their core networks, this
is not generally true in the metro segment. The metro segment typically consists of a mixture of DWDM spans and fiber spans (i.e., spans with no DWDM).
In fact, in metro areas usually only a fraction of central office fiber spans have
DWDM technology routed over them. This affects how customers interface to the
IP backbone network for higher-rate interfaces. Finally, we note that while most
of the connections for the core DWDM layer arise from links of the IP layer,
many of the connections come from what many colloquially call “wavelength services” (denoted by the rounded rectangle in Fig. 2.3). These come from high-rate
private-line connections emanating from outside the core DWDM layer. Examples are links between switches of large enterprise customers that are connected
by leased-line services.

2.2.6 TDM Cross-Connect Layers
In this section, we will briefly describe the TDM cross-connect layers. TDM
cross-connect equipment can be basically categorized into two common types: a
SONET/SDH Add-Drop Multiplexer (ADM) or a Digital Cross-Connect System
(DCS). Consistent with our earlier remark about the use of terminology, the latter
often goes by a variety of colloquial or outmoded model names of equipment suppliers, such as DCS-3/1, DCS-3/3, DACS, and DSX. A TDM cross-connect device
interfaces multiple high-rate digital signals, each of which uses time division multiplexing to break the signal into lower-rate channels. These channels carry lower-rate
TDM connections and the TDM cross-connect device cross-connects the lower-rate
signals among the channels of the different high-rate signals. Typically, an ADM
only interfaces two high-rate signals, while a DCS interfaces many. However, over
time these distinctions have blurred. Telcordia classified DCSs into three layers:
a narrowband DCS (N-DCS) cross-connects at the DS-0 rate, a wideband-DCS
(W-DCS) cross-connects at the DS-1 rate, and a broadband-DCS (B-DCS) cross-connects at the DS-3 rate or higher. ADMs are usually deployed in SONET/SDH
self-healing rings. The IOS and SONET Ring layers are shown in Fig. 2.3, encircled by the (broader) ellipse that represents the TDM cross-connect devices. More
details on these technologies can be found in [11]. Self-healing rings and DCSs will
be relevant when we illustrate how services access the wide-area ISP network layer
later in this chapter.
Despite the word “optical” in its name, an Intelligent Optical Switch (IOS) is
a type of B-DCS. Examples can be found in [6, 34]. The major differentiator of
the IOS over older B-DCS models is its advanced control plane. An IOS network
can route connection requests under distributed control, usually instigated by the
source node. This requires mechanisms for distributing topology updates and internodal messaging to set up connections. Furthermore, an IOS usually can restore
failed connections by automatically rerouting them around failed links. More detail
is given when we discuss restoration methods.
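The idea of rerouting a failed connection around a failed link can be sketched with a breadth-first search. This is a simplified, centralized illustration under stated assumptions: a real IOS network computes routes under distributed control from flooded topology updates and must also check spare capacity, which is ignored here. The four-node ring and the function name are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Set

# Hypothetical four-node ring A-B-C-D-A, used only for illustration.
ADJ: Dict[str, Set[str]] = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"A", "C"},
}


def reroute(
    adj: Dict[str, Set[str]],
    src: str,
    dst: str,
    failed: Set[FrozenSet[str]],
) -> Optional[List[str]]:
    """Fewest-hop path from src to dst that avoids every failed link, or None."""
    prev: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path: List[str] = []
            cur: Optional[str] = node
            while cur is not None:  # walk predecessors back to the source
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for nbr in sorted(adj[node]):
            if nbr not in prev and frozenset((node, nbr)) not in failed:
                prev[nbr] = node
                queue.append(nbr)
    return None


print(reroute(ADJ, "A", "C", failed=set()))                    # working path
print(reroute(ADJ, "A", "C", failed={frozenset(("B", "C"))}))  # restoration path
```

When link B-C fails, the connection from A to C moves from A-B-C to A-D-C. If C-D fails as well, no path remains and the function returns None.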
Many of the connections for the core TDM-cross-connect layers (ring layers,
DCS layers, IOS layer) come from higher layers of the core network. For example,
many connections of the IOS layer are links between W-DCSs, ATM networks, or
lower-rate portions of IP layer networks. However, much of their demand for connections comes from subwavelength private-line services, shown by the rounded
rectangle in Fig. 2.3. A portion of this private-line demand is in the form of
Ethernet Private Line (EPL) services. These services usually represent links between Ethernet switches or routers of large enterprise customers. For example, the
Gigabit Ethernet signal from an enterprise customer’s switch is transported over the
metro network and then interfaces an Ethernet card either residing on the IOS itself
or on an ADM that interfaces directly onto the IOS. The Ethernet card encapsulates the Ethernet frames inside concatenated n × STS-1 signals that are transported
over the IOS layer. The customer can choose the rate of transport, and hence the
value of n to purchase. The ADM Ethernet card polices the incoming
Ethernet frames to the transport rate of n × STS-1.
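A back-of-the-envelope sketch of how the value of n relates to the purchased rate follows. It assumes the gross STS-1 rate of 51.84 Mb/s from Table 2.1 as the capacity per STS-1; the usable payload of an STS-1 is somewhat lower, so the per-STS-1 capacity is left as a parameter. The function names are illustrative only.

```python
import math

STS1_MBPS = 51.84  # gross STS-1 rate from Table 2.1; usable payload is lower


def sts1_count(customer_mbps: float, per_sts1_mbps: float = STS1_MBPS) -> int:
    """Smallest n such that n STS-1 signals carry at least `customer_mbps`."""
    if customer_mbps <= 0 or per_sts1_mbps <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(customer_mbps / per_sts1_mbps)


def policed_rate_mbps(n: int, per_sts1_mbps: float = STS1_MBPS) -> float:
    """Transport rate to which incoming Ethernet frames are policed."""
    return n * per_sts1_mbps


n = sts1_count(600)  # customer wants roughly 600 Mb/s of transport
print(n, round(policed_rate_mbps(n), 2))
```

Under these assumptions, a customer who wants about 600 Mb/s between two Gigabit Ethernet ports would purchase n = 12, and traffic offered above the resulting transport rate is policed by the Ethernet card.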

2.2.7 IP Layer
The nodes of the IP layer shown in Fig. 2.3 represent routers that transport packets among metro area segments. IP generally defines pairwise adjacencies between
ports of the routers. In the IP backbone, these adjacencies are typically configured
over SONET, SDH, Ethernet, or OTN interfaces on the routers. As described
above, these links are then transported as connections over the interoffice lower-layer networks shown in Fig. 2.3. Note that different links can be carried in different
lower-layer networks. For example, lower-rate links may be carried over the TDM
cross-connect layers (IOS or SONET Ring), while higher-rate links may be carried
directly over the DWDM layer, thus “skipping” the TDM cross-connect layers. We
will describe the IP layer in more detail in subsequent sections.
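The way IP links can ride on different lower layers, sometimes skipping intermediate ones, can be sketched as a small "carried by" relation. The relation below is an assumption: it is a simplified reading of the text and Table 2.2, not a complete transcription of Fig. 2.3, and the names are illustrative.

```python
from typing import Dict, List

# ASSUMPTION: simplified layer-to-layer relation inferred from the text and
# Table 2.2. Each layer lists the lower layers that can carry its links.
carried_by: Dict[str, List[str]] = {
    "IP": ["IOS", "SONET Ring", "DWDM"],  # higher-rate links go straight to DWDM
    "IOS": ["DWDM"],
    "SONET Ring": ["DWDM"],
    "DWDM": ["Fiber"],
    "Fiber": [],
}


def layer_stacks(layer: str) -> List[List[str]]:
    """Every top-down chain of layers that a link of `layer` could ride on."""
    lowers = carried_by[layer]
    if not lowers:
        return [[layer]]
    return [[layer] + rest for lower in lowers for rest in layer_stacks(lower)]


for stack in layer_stacks("IP"):
    print(" -> ".join(stack))
```

Each printed chain is one possible stack beneath an IP link. The chain IP -> DWDM -> Fiber is the case in which a higher-rate link skips the TDM cross-connect layers.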

2.2.8 Ethernet Layer
The Ethernet layer in Fig. 2.3 refers to several applications of Ethernet technology.
For example, Ethernet supports a number of physical layer standards that can be
used for Layer 1 transport. Ethernet also refers to connection-oriented Layer 2 pseudowire services [16] and connection-less transparent LAN services. For example,
intra-office links between routers often use an Ethernet physical layer riding on
optical fiber.
An important application of Ethernet today is providing wide-area Layer 2 Virtual Private Network (VPN) services for enterprise customers. Although many
variations exist, these services generally support enterprise customers that have
Ethernet LANs at multiple locations and need to interconnect their LANs within
a metro area or across the wide area. Most large carriers provide these services as
an overlay on their IP layer, which is why we show the layered design in Fig. 2.3.
Prior to the ability to provide such services over the IP layer, Ethernet private lines
were supported by TDM cross-connect layers (i.e., Ethernet frames encapsulated
over Layer 1 TDM private lines as described in Section 2.2.6). However, analogous
to why wide-area Frame Relay displaced wide-area DS-0 private lines in the 1990s,
wide-area packet networks are often more efficient than private lines to connect
LANs of enterprise customers.
The principal approach that intermetro carriers use to provide wide-area Ethernet private network services is Virtual Private LAN Service (VPLS) [24, 25]. In
this approach, carriers provide such Ethernet services with routers augmented with
appropriate Ethernet capabilities. The reason for this approach is to provide the robust carrier-grade network capabilities provided by routers. With wide-area VPLS,
the enterprise customer is connected via the metro network to the edge routers on
the edge of the core IP layer. We describe how the metro network connects to the
core IP layer network in the next section. The VPLS architecture is described in
more detail in Section 2.4.2 when we describe MPLS.
We conclude this section with the comment that standards organizations and industry forums (e.g., IEEE, IETF, and Metro Ethernet Forum) have explored the
use of Ethernet switches with upgraded carrier-grade network control protocols
rather than using routers as nodes in the IP layer. For example, see Provider Backbone Transport (PBT) [27] and Provider Backbone Bridge – Traffic Engineering
(PBB-TE) [15]. However, most large ISPs are deploying MPLS-based solutions.
Therefore, we concentrate on the layering architecture shown in Fig. 2.3 in the remainder of this chapter.

2.2.9 Miscellaneous/Legacy Layers
For completeness, we depict other “legacy” network layers with dashed ovals
in Fig. 2.3. These technologies have been around for decades in most carrier-based core networks. They include network layers whose nodes represent ATM
switches, Frame-Relay switches, DCS-3/3s (a B-DCS that cross-connects DS3s),
voice switches (DS-0 circuit switches), and pre-SONET ADMs. Most of these layers are not material to the spirit of this chapter and we do not discuss them here.

2.3 Structure of Today’s Core IP Layer
2.3.1 Hierarchical Structure and Topology
In this chapter, we further break the IP layer into Access Routers (ARs) and
Backbone Routers (BRs). Customer equipment homes to access routers, which in
turn home onto backbone routers. An AR is either colocated with its backbone
routers or not; the latter is called a Remote Access Router (RAR). Of course, there are
alternate terminologies. For example, the IETF defines concepts similar to customer
equipment, access routers, and backbone routers: respectively,
Customer-Edge (CE) equipment, Provider-Edge (PE) routers, and Provider (P)
routers. A simplified picture of a typical central office containing both ARs and BRs
is shown in Fig. 2.7. Access routers are dual-homed to two backbone routers to enable higher levels of service availability. The links between routers in the same office
are typically Ethernet links over intra-office fiber. While we show only two ARs in
Fig. 2.7, note that typically there are many ARs in large offices. Also, due to scaling
and sizing limitations, there may be more than two backbone routers or switches per
central office used to further aggregate AR traffic before it enters the BRs.

Fig. 2.7 Legacy central office interconnection diagram (Layer 3)
Figure labels: channelized OC-12; m-GigE; SONET OC-n (e.g., n = 768); intra-office fiber; core ROADM layer network; intra-office TDM layers; access/metro TDM layers; example of DS1 access circuits multiplexed over a channelized OC-12 interface.
Legend: BR = Backbone Router; (R)AR = (Remote) Access Router; symbols mark IP layer logical links, IP layer access links, router line cards, and central offices.

Fig. 2.8 Example of IP layer switching hierarchy
Legend: core router; intra-building access/edge router; remote access/edge router.
Note: This figure is a simplified illustration. It does not represent the specific design of any commercial carrier.

Moreover, we show a remote access router that homes to one of the BRs.
Figure 2.8 illustrates this homing arrangement in a broader network example, where
small circles represent ARs, diamonds represent RARs, and large squares represent BRs. Note that remote ARs are homed to BRs in different offices. Homing
remote ARs to BRs in different central offices raises network availability. However,
a stronger motivation for doing this is that RAR–BR links are usually routed over
the DWDM layer, which generally does not offer automatic restoration, and so the
dual-homing serves two purposes: (1) protect against BR failure or maintenance
activity and (2) protect against failure or maintenance of a RAR–BR link.
While the homing scheme described here is typical of large ISPs, other variations
exist. For example, there are dual-homing architectures where (nonremote) ARs
are homed to a BR colocated in the same central office and then a second BR in
a different central office. While this latter architecture provides a slightly higher
level of network availability against broader central office failure, it can be more
costly owing to the need to transport the second AR–BR link. However, the latter
architecture allows more load balancing across BRs because of the extra flexibility
in homing ARs.
Improved load balancing can offer other advantages, including lower BR costs.
Also, for ISPs with many scattered locations, but less total traffic, this latter
architecture may be more cost-effective than colocating two BRs in each BR-office.
The right side of Fig. 2.7 also shows the metro/access network-layer clouds
that connect customer equipment to the ARs. In particular, we illustrate DS1 customer
interfaces. The left side of Fig. 2.7 shows the lower-layer DWDM clouds that
carry the interoffice links between BRs. We will expand these clouds in the next
sections.
The reasons for segregating the IP topology into access and backbone routers are
manifold:
• Access routers aggregate lower-rate interfaces from various customers or other
carriers. This function requires significant equipment footprint and processor resources for customer-related protocols. As a result, major central offices contain
many access routers to accommodate the low-rate customer interfaces. Without the aggregation function of the backbone router, each such office would be a
myriad of tie links between access routers and interoffice links.
• Access routers are often segregated by service or function. For
example, general residential ISP service can be segregated from high-priority
enterprise private VPN service. As another example, some access routers are
set aside as peering points with other carriers.
• Backbone routers are primarily designed to be IP-transport switches equipped
only with the highest speed interfaces. This segregation allows the backbone
routers to be optimally configured for interoffice IP forwarding and transport.

2.3.2 Interoffice Topology
Figure 2.9 expands the core lower ROADM Layer cloud of Fig. 2.7. It shows ports
of interoffice links between BRs connecting to ports on ROADMs. These links are
transported as connections in the ROADM network. For example, today these links
go up to 40 gigabits per second (Gb/s) or SONET OC-768. These connections are
routed optically through intermediate ROADMs and regenerated where needed, as
described in Section 2.2.5. Also, we note that the links between the remote ARs and
BRs are routed over the same ROADM network, although the rate of an RAR–BR link
may be lower, such as 10 Gb/s. Figure 2.10 shows a network-wide example of
the IP layer interoffice topology. There are some network-layering principles illustrated in Fig. 2.10 that we will describe. First, if we compare the IP layer topology
of Fig. 2.8 with that of the DWDM layer (ROADM layer) of Fig. 2.10, we note that
there is more connectivity in the IP layer graph than the DWDM layer. The reason
for this is the existence of what many IP layer planners call express links. If we
examine the link labeled “direct link” between Seattle and Portland, we find that
when we route this link over the DWDM layer topology, there are no intermediate
ROADMs. In fact, there are two types of direct links. The first type connects through
no intermediate ROADMs, as illustrated by the Seattle–Portland link. The second
type connects through intermediate ROADMs, but encounters no BRs in those intermediate central offices, as illustrated by the Seattle–Chicago link.

Fig. 2.9 Core ROADM Layer diagram
Figure labels: central offices CO A through CO D containing BRs and ROADMs, with an RAR and amplifiers (AMP) in the core ROADM layer network; OT for transport of links of the IOS layer or high-rate private line service.
Legend: ROADM = Reconfigurable Optical Add-Drop Multiplexer; (R)AR = (Remote) Access Router; BR = Backbone Router; symbols mark central offices (CO), ROADM optical transponders (OT), router line cards, and ROADM layer connections transporting IP layer links.

Fig. 2.10 Example of IP layer interbackbone topology
Figure labels: direct link; express link; Seattle; Portland; Chicago; Salt Lake City. Legend: core router; aggregate link.
Note: This figure is a simplified illustration. It does not represent the specific design of any commercial carrier.

In contrast, if we examine the express link between Portland and Salt Lake City,
we find that any path in the DWDM layer connecting the routers in that city pair
bypasses routers in at least one of its intermediate central offices. Express links
are primarily placed to minimize network costs. For example, it is more efficient
to place express links between well-chosen router pairs with high network traffic
(enough to raise the link utilization above a threshold level); otherwise the traffic
will traverse multiple routers. Router interfaces can be the most expensive
single component in a multilayered ISP network; therefore, costs can usually be
minimized by optimal placement of express links.
It is also important to consider the impact of network layering on network reliability. Referring to the generic layering example of Fig. 2.2, we note that the
placement of express links can cause a single DWDM link to be shared by different IP layer links. This gives rise to complex network disruption scenarios, which
must be modeled using sophisticated network survivability modeling tools. This is
covered in more detail in Section 2.5.3.
Returning to Fig. 2.10, we also note the use of aggregate links. Aggregate links
also go by other names, such as bundled links and composite links. An aggregate
link bundles multiple physical links between a pair of routers into a single virtual
link from the point of view of the routers. For example, an aggregate link could be
composed of five OC-192 (or 10 GigE) links. Such an aggregate link would appear
as one link with 50 Gb/s of capacity between the two routers. Generally, aggregate
links are implemented by a load-balancing algorithm that transparently switches
packets among the individual links. Usually, to reduce jitter or packet reordering,
packets of a given IP flow are routed over the same component link. The main advantage of aggregate links is that as IP networks grow large, they tend to contain
many lower-speed links between a pair of routers. It simplifies routing and topology
protocols to aggregate all these links into one. If one of the component links of
an aggregate link fails, the aggregate link remains up; consequently, the number of
topology updates due to failure is reduced and network rerouting (called reconvergence) is less frequent. Network operators seek to achieve network stability, and
therefore shy away from many network reconvergence events; aggregate links result
in fewer network reconvergence events.
On the downside, if only one component link of a multilink aggregate link fails, the
aggregate link remains “up”, but with reduced capacity. Since many network routing
protocols are capacity-insensitive, packet congestion could occur over the aggregate
link. To avoid this situation, router software is designed with capacity thresholds for
aggregate links that the network operator can set. If the aggregate capacity falls
below the threshold, the entire aggregate link is taken out of service. While the
network “loses” the capacity of the surviving links in the bundle when the aggregate
link is taken out of service, the alternative is potentially significant packet loss due
to congestion on the remaining links.
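The aggregate-link behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any vendor's router software: the class name AggregateLink, the field min_capacity_gbps, and the use of a CRC-32 hash over the flow identifier are all hypothetical choices made for the example.

```python
import zlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AggregateLink:
    """Toy model of an aggregate (bundled) link between two routers."""
    member_gbps: List[float]        # capacity of each component link
    min_capacity_gbps: float        # operator-set capacity threshold (assumed name)
    up: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.up:
            self.up = [True] * len(self.member_gbps)

    def capacity(self) -> float:
        """Capacity of the surviving component links."""
        return sum(c for c, ok in zip(self.member_gbps, self.up) if ok)

    def in_service(self) -> bool:
        """The whole bundle is withdrawn once capacity falls below the threshold."""
        return self.capacity() >= self.min_capacity_gbps

    def pick_member(self, flow: Tuple) -> Optional[int]:
        """Hash a flow identifier onto one live component link, so that all
        packets of the flow follow the same link (avoids reordering)."""
        if not self.in_service():
            return None
        live = [i for i, ok in enumerate(self.up) if ok]
        key = zlib.crc32(repr(flow).encode())
        return live[key % len(live)]
```

For instance, a bundle built as AggregateLink([10.0] * 5, 30.0) models five 10 Gb/s links with a 30 Gb/s threshold. It keeps forwarding after two component failures, but a third failure drops its capacity to 20 Gb/s, after which pick_member returns None to model the bundle being taken out of service.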

2.3.3 Interface with Metro Network Segment
Figure 2.11 is a blowup of the clouds on the right side of Fig. 2.7. It provides a
simplified example of how three business ISP customers gain access to the IP backbone. These could be enterprise customers with multiple branches who subscribe
to a VPN service. Each access method consists of a DS1 link encapsulating IP
packets that is transported across the metro segment. In carrier vernacular, using
packet/TDM links to access the IP backbone is often called TDM backhaul. We do
not show the inner details of the metro network here. Detailed examples can be
found in [11]. Even suppressing the details of the complex metro network, the TDM
backhaul is clearly a complicated architecture. As an aid, we
suggest that the reader refer back to the TDM hierarchy shown in Table 2.1.
The customer’s DS-1 (which carries encapsulated IP packets) interfaces to a
low-speed multiplexer located in the customer building, such as a small SONET
ADM. This ADM typically serves as one node of a SONET ring (usually a 2-node
ring). Each link of the ring is routed over diverse fiber, usually at OC-3 or OC-12
rate. Eventually, the DS-1 is routed to a SONET OC-48 or OC-192 ring that has
one of its ADMs in the POP. The DS-1 is transported inside an STS-1 signal that
is divided into 28 time slots called channels (a channelized STS-1), as specified by
the SONET standard. The ADM routes all the SONET STS-1s carrying DS-1 traffic bound for the core carrier to a metro W-DCS. Note that there are often multiple
core carriers in a POP, and hence, the metro W-DCS cross-connects all the DS-1s
destined for a given core carrier into channelized STS-1s and hands them off to the
core W-DCS(s) of that core carrier. However, note that this handoff does not occur
directly between the two W-DCSs, but rather passes through a higher-rate B-DCS,
in this case the Intelligent Optical Switch (IOS) introduced in Section 2.2.6. The
IOS cross-connects most of the STS-1s (multiplexed into OC-n interfaces) in a central office. Also, notice that the IOS is fronted with Multi-Service Platforms (MSPs).
An MSP is basically an advanced form of SONET ADM that gathers many types of
lower-speed TDM interfaces and multiplexes them up to OC-48 or OC-192 for the
IOS. It usually also has Ethernet interfaces that encapsulate IP packets into TDM
signals (e.g., for Ethernet private line discussed earlier). The purpose of such a configuration is to minimize the cost and scale of the IOS by avoiding using its interface
bay capacity for low-speed interfaces.

Fig. 2.11 Legacy central office interconnection diagram (intra-office TDM layers)
Figure labels: customer locations with ADMs (DS1/DS3); SONET ADM (Metro) with OC-12; W-DCS (Metro); MSPs fronting the IOS (Core); W-DCS (Core); ARs with channelized OC-12 interfaces; example of 3 DS1 access circuits; access/metro TDM layers; intra-office TDM layers.
Legend: (R)AR = (Remote) Access Router; MSP = Multi-service Platform (multiplexes low-rate TDM circuits); W-DCS = Wideband Digital Cross-Connect System; ADM = Add-Drop Multiplexer; symbols mark Layer 3 logical links, IP layer access links, and router line cards.

Finally, the core W-DCS cross-connects the DS1s destined for the access routers
in the central office onto channelized STS-1s. Again, these STS-1s are routed to the
AR via the IOS and its MSPs. The DS-1s finally reach a channelized SONET card
on the AR (typically OC-12). This card on the AR de-multiplexes the DS-1s from
the STS-1, de-encapsulates the packets, and creates a virtual interface for each of
our three example customer access links in Fig. 2.11. The channelized SONET card
is colloquially called a CHOC card (CHannelized OC-n).
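The channel arithmetic behind a channelized interface can be made concrete with a short calculation. The sketch below uses the 28 DS-1 channels per STS-1 stated above, together with the nominal DS-1 and STS-1 line rates. The flat channel numbering in ds1_slot is a hypothetical convention adopted only for illustration; real equipment addresses DS-1s through the SONET virtual tributary structure.

```python
from typing import Tuple

DS1_PER_STS1 = 28     # DS-1 channels per channelized STS-1 (from the text)
DS1_MBPS = 1.544      # nominal DS-1 line rate
STS1_MBPS = 51.84     # nominal STS-1 line rate


def ds1_capacity(oc_n: int) -> int:
    """Number of DS-1 channels a fully channelized OC-n can carry."""
    return oc_n * DS1_PER_STS1


def ds1_slot(index: int) -> Tuple[int, int]:
    """Map a 0-based DS-1 index to (STS-1 number, channel), both 1-based.
    The flat numbering is an assumption made for this example."""
    sts1, channel = divmod(index, DS1_PER_STS1)
    return sts1 + 1, channel + 1


def sts1_fill_fraction() -> float:
    """Fraction of the STS-1 line rate occupied by 28 DS-1 signals."""
    return DS1_PER_STS1 * DS1_MBPS / STS1_MBPS
```

Under these assumptions, a channelized OC-12 card can terminate up to 12 × 28 = 336 DS-1 access circuits, and the 28 DS-1s in one STS-1 account for about 43.2 Mb/s of its 51.84 Mb/s line rate.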
Note that the core and metro carriers depicted in Fig. 2.11 may be parts of the
same corporation. However, this complex architecture arose from the decomposition
of long-distance and local carriers that was dictated by US courts and the Federal
Communications Commission (FCC) at the breakup of the Bell System in 1984.
It persists to this day.
If we reexamine the above TDM metro access descriptions, we find that there
are many restoration mechanisms, such as dual homing of the ARs to the BRs and
SONET rings in the metro network. However, there is one salient point of potential
failure. If an AR customer-facing line card or entire AR fails or is taken out of service for maintenance in Fig. 2.11, then the customer’s service is also down. Carriers
offer service options to protect against this. The most common option provides two TDM
backhaul connections to the customer’s equipment, often called Customer Premise
Equipment (CPE), each of which terminates on a different access router. This architecture significantly raises the availability of the service, but does incur additional
cost. An example of such a service is given in [1].
To retain accuracy, we make a final technical comment on the example of
Fig. 2.11. Although we show direct fiber connections between the various TDM and
packet equipment, in fact, most of these usually occur via a fiber patch panel. This
enables a craftsperson to connect the equipment via a simple (and well-organized)
patch cord or cross-connect. This minimizes expense, simplifies complex wiring,
and expedites provisioning work orders in the CO.
Figure 2.12 depicts how customers access the AR via emerging metro packet
network layers instead of TDM. Here, instead of the traditional TDM network,
the customer accesses the packet core via Ethernet. The most salient difference is
the substantially simplified architecture. Although many different types of services
are possible, we describe two fundamental types of Ethernet service: Ethernet
virtual circuits and Ethernet VPLS. Most enterprise customers will use both types
of services.

Fig. 2.12 Central office interconnection diagram (metro Ethernet interface)
Figure labels: customer locations with NTEs (FE | GigE); RE (Metro); RE (Core); dual-role access router and Ethernet switch; ARs and BRs connected by n-GigE links; Ethernet Virtual Private Line (combo of VLAN & pseudowire); access/metro Ethernet layer.
Legend: AR = Access Router; RE = Router | Ethernet Switch; NTE = Network Terminating Equipment; symbols mark Layer 3 links, Layer 2-3 links, router | Ethernet line cards, and virtual access links to the IP layer.

There are three basic types of connectivity for Ethernet virtual circuits: (1) intrametro, (2) ISP access via establishment of Ethernet virtual circuits between
the customer location and IP backbone, and (3) intermetro. Since our main focus
is the core IP backbone, we discuss the latter two varieties. For ISP access, in the
example of Fig. 2.12, the customer’s CPE interfaces the metro network via Fast
Ethernet (FE) or GigE into a small Ethernet switch placed by the metro carrier
called Network Terminating Equipment (NTE). The NTE is the packet analog of the
small ADM in the TDM access model in Fig. 2.11. For most metro Ethernet services, the customer can usually choose which policed access rate to
purchase, in increments of 1 Mb/s or similar. For example, the customer may choose 100 Mb/s
for the Committed Information Rate (CIR) and select among various options for the Excess Information Rate (EIR). The EIR options control how the customer’s bandwidth bursts are
handled/shared when they exceed the CIR. The metro packet network uses Virtual Local Area Network (VLAN) identifiers [14] and pseudowires or MPLS LSPs
to route the customer’s Ethernet virtual circuit to the metro Ethernet switch/router
in the POP, as shown in Fig. 2.12. VLANs can also be used to segregate a particular customer’s services, such as the two fundamental services (VPLS vs Internet
access) described here. The metro Ethernet switch/router has high-speed links
(such as 10 Gb/s) to the core Ethernet switch/router. The core Ethernet
switch/router is fundamentally an access router with the features and
configurations needed to provide Ethernet and VPLS services, and thus homes to backbone
routers like any other access router. Thus, the customer’s virtual circuit is mapped to
a virtual port on the core AR/Ethernet-Switch and from that point onward is treated
similarly to the TDM DS-1 virtual port in Fig. 2.11. If an intermetro Ethernet virtual
circuit is needed, then an appropriate pseudowire or tunnel can be created between
the ARs in different metros. Such a service can eventually substitute for traditional
private-line service as metro packet networks are deployed.
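One common way to police a CIR/EIR pair is a two-rate token-bucket marker, and the following sketch shows the idea. It is an illustration only: the text does not say which policing algorithm a given carrier uses, and the burst-size parameters cbs_bytes and ebs_bytes, the class name, and the green/yellow/red labels are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class TwoRatePolicer:
    """Toy two-rate marker: green within CIR, yellow within EIR, else red."""
    cir_bps: float      # Committed Information Rate, bits per second
    eir_bps: float      # Excess Information Rate, bits per second
    cbs_bytes: float    # committed burst size (assumed parameter)
    ebs_bytes: float    # excess burst size (assumed parameter)

    def __post_init__(self) -> None:
        # Both buckets start full; time starts at zero seconds.
        self.c_tokens = self.cbs_bytes
        self.e_tokens = self.ebs_bytes
        self.last = 0.0

    def color(self, now: float, size: int) -> str:
        """Classify a frame of `size` bytes arriving at time `now` (seconds)."""
        elapsed = now - self.last
        self.last = now
        # Refill each bucket at its own rate, capped at its burst size.
        self.c_tokens = min(self.cbs_bytes, self.c_tokens + elapsed * self.cir_bps / 8)
        self.e_tokens = min(self.ebs_bytes, self.e_tokens + elapsed * self.eir_bps / 8)
        if self.c_tokens >= size:
            self.c_tokens -= size
            return "green"      # conforms to the committed rate
        if self.e_tokens >= size:
            self.e_tokens -= size
            return "yellow"     # excess burst, carried if capacity allows
        return "red"            # exceeds both rates, typically dropped
```

With a 100 Mb/s CIR, a 50 Mb/s EIR, and 3,000-byte bursts for each bucket, a back-to-back run of five 1,500-byte frames would be marked green, green, yellow, yellow, red in this model. How yellow frames are actually treated is a service option that the sketch does not attempt to capture.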
The second basic type of Ethernet service is generally provided through
the VPLS model described in Section 2.2.8. For example, the customer might
have two LANs in metro-1, one LAN in metro-2, and another LAN in metro-3.
Wide-area VPLS interconnects these LANs into a large transparent LAN. This is
achieved using pseudowires (tunnels) between the ARs in metros 1, 2, and 3. Since
the core access router has a dual role as access router and Ethernet VPLS switch, it
can switch customer Ethernet frames among the pseudowires to the
remote access routers.
Besides enterprise Ethernet services, connection of cellular base stations to the IP
backbone network is another important application of Ethernet metro access. Until
recently, this was achieved by installing DS-1s from cell sites to circuit switches in
Mobile Telephone Switching Offices (MTSOs) to provide voice service. However,
with the advent and rapid growth of cellular services based on 3G or 4G technology,
there is a growing need for high-speed packet-based transport from cell sites to the
IP backbone. The metro Ethernet structure for this is similar to that of the enterprise
customer access shown in Fig. 2.12. The major differences occur in the equipment
at the cell site, the equipment at the MTSO, and then how this equipment connects
to the access router/Ethernet switch of the IP backbone.

2.4 Routing and Control in ISP Networks
2.4.1 IP Network Routing
The IP/MPLS routing protocols are an essential part of the architecture of the IP
backbone, and are key to achieving network reliability. This section introduces these
control protocols.
An Interior Gateway Protocol (IGP) disseminates routing and topology information within an Autonomous System (AS). A large ISP will typically segment its
IP network into multiple autonomous systems. In addition, an ISP’s network interconnects with its customers and with other ISPs. The Border Gateway Protocol
(BGP) is used to exchange global reachability information with ASs operated by
the same ISP, by different ISPs, and by customers. In addition, IP multicast is becoming more widely deployed in ISP networks, using one of several variants of the
Protocol-Independent Multicast (PIM) routing protocol.

2.4.1.1 Routing with Interior Gateway Protocols
As described earlier, Interior Gateway Protocols are used to disseminate routing
and topology information within an AS. Since IGPs disseminate information about
topology changes, they play a critical role in network restoration after a link or node
failure. Because of the importance of restoration to the theme of this chapter, we
discuss this further in Section 2.5.2.
The two types of IGPs are distance vector and link-state protocols. In link-state
routing [32], each router in the AS maintains a view of the entire AS topology
and computes routes over it using a Shortest Path First (SPF) algorithm. Since link-state routing protocols such
as Open Shortest Path First (OSPF) [26] and Intermediate System–Intermediate
System (IS–IS) [30] are the most commonly used IGPs among large ISPs, we will
not discuss distance vector protocols further. For the purposes of this chapter, which
focuses on network restoration, the functionality of OSPF and IS–IS is similar.
We will use OSPF to illustrate how IGPs handle failure detection and recovery.
The view of network topology maintained by OSPF is conceptually a directed
graph. Each router represents a vertex in the topology graph and each link between neighboring routers represents a unidirectional edge. Each link also has an
associated weight (also called cost) that is administratively assigned in the configuration file of the router. Using the weighted topology graph, each router computes
a shortest path tree (SPT) with itself as the root, and applies the results to build its
forwarding table. This assures that packets are forwarded along the shortest paths in
terms of link weights to their destinations [26]. We will refer to the computation of
the shortest path tree as an SPF computation, and the resultant tree as an SPF tree.
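The SPF computation described above is commonly implemented with Dijkstra's algorithm. The sketch below is a minimal version that returns, for every reachable destination, the path cost and the first-hop neighbor that a forwarding table would store. The function name spf and the dictionary-based graph encoding are assumptions made for the example, and the four-router topology mentioned afterward is hypothetical rather than taken from any figure in this chapter.

```python
import heapq


def spf(graph, root):
    """Dijkstra's algorithm rooted at `root`.

    graph[u][v] is the administrative weight of the directed link u -> v.
    Returns (dist, next_hop): the cost of the shortest path to each node and
    the neighbor of `root` through which that path leaves the router.
    """
    dist = {root: 0}
    next_hop = {}
    heap = [(0, root, None)]        # (cost so far, node, first hop from root)
    done = set()
    while heap:
        cost, node, first = heapq.heappop(heap)
        if node in done:
            continue                # stale heap entry for a settled node
        done.add(node)
        if first is not None:
            next_hop[node] = first
        for nbr, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if nbr not in done and new_cost < dist.get(nbr, float("inf")):
                dist[nbr] = new_cost
                # Links leaving the root define the first hop; deeper links inherit it.
                heapq.heappush(heap, (new_cost, nbr, nbr if node == root else first))
    return dist, next_hop
```

As a usage example, take routers A, B, C, and D with link weights A–B = 1, A–C = 5, B–C = 1, B–D = 4, and C–D = 1, each entered in both directions. Running spf from A gives costs of 1, 2, and 3 to B, C, and D, all with B as the next hop, because the low-weight path A–B–C–D is preferred over the direct A–C link.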
As illustrated in Fig. 2.13, the OSPF topology may be divided into areas, typically resulting in a two-level hierarchy. Area 0, known as the “backbone area”,
resides at the top level of the hierarchy and provides connectivity to the nonbackbone areas (numbered 1, 2, etc.). OSPF typically assigns a link to exactly one area.
However, links may be in multiple areas; multi-area links are addressed in more detail in
Chapter 11 (Measurements of Control Plane Reliability and Performance by Aman
Shaikh and Lee Breslau). Routers that have links to multiple areas are called border
routers. For example, routers E, F and I are border routers in Fig. 2.13. Every router
maintains its own copy of the topology graph for each area to which it is connected.
The router performs an SPF computation on the topology graph for each area and
thereby knows how to reach nodes in all the areas to which it connects. To improve
scalability, OSPF was designed so that routers do not need to learn the entire topology of remote areas. Instead, routers only need to learn the total weight of the path
from one or more area border routers to each node in the remote area. Thus, after
computing the SPF tree for the area it is in, the router knows which border router to
use as an intermediate node for reaching each remote node.
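The border-router selection just described reduces to a simple minimization: add the intra-area cost of reaching each border router to the path weight that border router advertises for the remote node, and keep the smallest total. The following sketch illustrates this under assumed data structures; the function name and the example costs are hypothetical.

```python
def best_border_router(intra_dist, summaries, dest):
    """Choose the border router to use for a destination in a remote area.

    intra_dist: border router -> cost from this router, from the local SPF tree.
    summaries:  border router -> {remote destination -> advertised path weight}.
    Returns (border_router, total_cost), or None if no border router
    advertises the destination.
    """
    best = None
    for abr, advertised in summaries.items():
        if dest in advertised and abr in intra_dist:
            total = intra_dist[abr] + advertised[dest]
            if best is None or total < best[1]:
                best = (abr, total)
    return best
```

For example, if border routers E and F are at intra-area costs 2 and 1, and they advertise a remote node X at weights 3 and 6, respectively, the router selects E with a total cost of 5, even though F is the closer border router.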
Every router running OSPF is responsible for describing its local connectivity in
a Link-State Advertisement (LSA). These LSAs are flooded reliably to other routers
in the network, which allows them to build their local view of the topology. The
flooding is made reliable by each router acknowledging the receipt of every LSA it
receives from its neighbors. The flooding is hop-by-hop and hence does not depend
on routing. The set of LSAs in a router’s memory is called a Link-State Database
(LSDB) and conceptually forms the topology graph for the router.

Fig. 2.13 OSPF topology: areas and hierarchy
The figure shows routers A through L and X, Y, Z joined by weighted links and grouped into Area 0, Area 1, and Area 2.
Legend: internal IGP router; border router (between OSPF areas); AS border router.

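The hop-by-hop flooding described above can be simulated compactly. In this sketch, each router acknowledges every LSA copy it receives, installs the LSA in its LSDB only if it is newer than the stored copy, and then forwards it to all neighbors except the sender. The function name, the integer sequence numbers used to decide which copy is newer, and the adjacency-list encoding are simplifying assumptions; the real protocol has additional state such as LSA aging and retransmission timers.

```python
from collections import deque


def flood(neighbors, origin, lsa_id, seq):
    """Flood one LSA hop by hop from `origin`.

    neighbors: router -> list of adjacent routers.
    Returns (lsdb, acks): each router's database {lsa_id: seq} after
    flooding, and the number of acknowledgments sent.
    """
    lsdb = {router: {} for router in neighbors}
    lsdb[origin][lsa_id] = seq
    acks = 0
    queue = deque((origin, nbr) for nbr in neighbors[origin])
    while queue:
        sender, receiver = queue.popleft()
        acks += 1       # the receiver acknowledges every LSA it receives
        if lsdb[receiver].get(lsa_id, -1) >= seq:
            continue    # duplicate or older copy: acknowledged but not re-flooded
        lsdb[receiver][lsa_id] = seq
        queue.extend((receiver, nbr) for nbr in neighbors[receiver] if nbr != sender)
    return lsdb, acks
```

Because forwarding depends only on adjacency and not on any forwarding table, the LSA reaches every router in a connected topology even before routes have been computed, which is the sense in which flooding does not depend on routing.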
OSPF uses several types of LSAs for describing different parts of topology. Every
router describes links to all its neighbor routers in a given area in a Router LSA.
Router LSAs are flooded only within an area and thus are said to have an area-level
flooding scope. Thus, a border router originates a separate Router LSA for every
area to which it is connected. Border routers summarize information about one area
and distribute this information to adjacent areas by originating Summary LSAs. It
is through Summary LSAs that other routers learn about nodes in the remote areas.
Summary LSAs have an area-level flooding scope like Router LSAs. OSPF also allows routing information to be imported from other routing protocols, such as BGP.
The router that imports routing information from other protocols into OSPF is called
an AS Border Router (ASBR). Routers A and B are ASBRs in Fig. 2.13. An ASBR
originates External LSAs to describe the external routing information. The External
LSAs are flooded in the entire AS irrespective of area boundaries, and hence have
an AS-level flooding scope. While the capability exists to import external routing
information from protocols such as BGP, the number of such routes that may be
imported may be very large. As a result, this can lead to overheads both in communication (flooding the external LSAs) as well as computation (SPF computation
scales with the number of routes). As a consequence of the scalability problems they
pose, the importing of external routes is rarely utilized.
Two routers that are neighbor routers have link-level connectivity between each
other. Neighbor routers form an adjacency so that they can exchange routing
information with each other. OSPF allows a link between the neighbor routers to be
used for forwarding only if these routers have the same view of the topology, i.e.,
the same link-state database. This ensures that forwarding data packets over the link
does not create loops. Thus, two neighbors have to make sure that their link-state
databases are synchronized, and they do so by exchanging parts of their link-state
databases when they establish an adjacency. The adjacency between a pair of routers
is said to be “full” once they have synchronized their link-state databases. While
sending LSAs to a neighbor, a router bundles them together into a Link-State Update packet. We will re-examine the OSPF reconvergence process in more detail
when we discuss network disruptions in Section 2.5.2.1.
Although elegant and simple, basic OSPF is insensitive to network capacity and
routes packets hop-by-hop along the SPF tree. As mentioned in Section 2.3.2, this
has some potential shortcomings when applied to aggregate links. While aggregate-link capacity thresholds can be tuned to minimize this potentially negative effect,
a better approach may be to use capacity-sensitive routing protocols, often called
Traffic Engineering (TE) protocols, such as OSPF-TE [21]. Alternatively, one may
use routing protocols with a greater degree of routing control, such as MPLS-based
protocols. Traffic Engineering and MPLS are discussed later in this chapter.

2.4.1.2 Border Gateway Protocol
The Border Gateway Protocol is used to exchange routing information between
autonomous systems, for example, between ISPs or between an ISP and its large
enterprise customers. When BGP is used between ASs, it is referred to as Exterior
BGP (eBGP). When BGP is used within an AS to distribute external reachability
information, it is referred to as Interior BGP (iBGP). This section provides a brief
summary of BGP. It is covered in much greater detail in Chapters 6 and 11.
BGP is a connection-oriented protocol that uses TCP for reliable delivery.
A router advertises Network Layer Reachability Information (NLRI) consisting of
an IP address prefix, a prefix length, a BGP next hop, along with path attributes, to
its BGP peer. Packets matching the route will be forwarded toward the BGP next
hop. Each route announcement can also have various attributes that can affect how
the peer will prioritize its selection of the best route to use in its routing table. One
example is the AS_PATH attribute, which is a list of ASes through which the route
has been relayed.
Withdrawal messages are sent to remove NLRI that are no longer valid. For example, in Fig. 2.14, A|z denotes an advertisement of NLRI for IP prefix z, and W|s,r
denotes that routes s and r are being withdrawn and should be removed from the
routing table. If an attribute of the route changes, the originating router announces it
again, replacing the previous announcement. Because BGP is connection-oriented,
there are no refreshes or reflooding of routes during the lifetime of the BGP connection, which makes BGP simpler than a protocol like OSPF. However, like OSPF,
BGP has various timers affecting behavior like hold-offs on route installation and
route advertisement.
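The advertise, replace, and withdraw behavior described above can be sketched in a few lines. This is a minimal illustration, not a BGP implementation: the `Route` and `PeerRoutes` names, the prefixes, and the AS numbers are all hypothetical, and timers, sessions, and message encoding are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Route:
    prefix: str      # IP prefix with its length, e.g. "203.0.113.0/24"
    next_hop: str    # BGP next hop
    as_path: tuple   # AS_PATH: the ASes the route has been relayed through


@dataclass
class PeerRoutes:
    """Routes learned from one BGP peer (hypothetical, simplified model)."""
    routes: dict = field(default_factory=dict)

    def advertise(self, route):
        # A new announcement for a prefix replaces the previous one.
        self.routes[route.prefix] = route

    def withdraw(self, *prefixes):
        # A withdrawal removes the listed prefixes if they are present.
        for prefix in prefixes:
            self.routes.pop(prefix, None)


rib = PeerRoutes()
rib.advertise(Route("198.51.100.0/25", "192.0.2.1", (65001,)))        # route s
rib.advertise(Route("198.51.100.128/25", "192.0.2.1", (65001,)))      # route r
rib.withdraw("198.51.100.0/25", "198.51.100.128/25")                  # W|s,r
rib.advertise(Route("203.0.113.0/24", "192.0.2.1", (65001, 65002)))   # A|z
rib.advertise(Route("203.0.113.0/24", "192.0.2.1", (65001, 65003)))   # attribute changed
```

Because the session is reliable, the receiver keeps each route until it is explicitly withdrawn or replaced, so no periodic refresh is modeled.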

[Figure 2.14: Routers R1 and R2, each with a BGP process and a RIB, connected by a BGP adjacency that carries a withdrawal W|s,r and an advertisement A|z.]
Fig. 2.14 BGP message exchange

BGP maintains tables referred to as Routing Information Bases (RIBs) containing
BGP routes and their attributes. The Loc-RIB table contains the router’s definitive
view of external routing information. Besides routes that enter the RIB from BGP
itself, routes enter the RIB via distribution from other sources, such as static or directly connected routes or routing protocols such as OSPF. While the notion of a
“route” in BGP originally meant an IPv4 prefix, with the standardization of Multiprotocol BGP (MP-BGP) it can represent other kinds of reachability information,
referred to as address families. For example, a BGP route can be an IPv6 prefix or
an IPv4 prefix within a VPN.
External routes advertised in BGP must be distributed to every router in an AS.
The hop-by-hop forwarding nature of IP requires that a packet address be looked
up and matched against a route at each router hop. Because the address information
may match external networks that are only known in BGP, every router must have
the BGP information. However, we describe later how MPLS removes the need for
every interior router to have external BGP route state.
Within an AS, the BGP next hop will be the IP address of the exit router or exit
link from the AS through which the packet must route, and BGP is used by the exit
router to distribute the routes throughout the AS. To avoid creating a full mesh of
iBGP sessions among the edge and interior routers, BGP can use a hierarchy of
Route Reflectors (RR). Figure 2.15 illustrates how BGP connections are constructed
using a Route Reflector.
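A quick count shows why the full mesh is avoided. The sketch below assumes a flat design in which every client peers with every route reflector and the reflectors are fully meshed among themselves; the function names and this session layout are assumptions for illustration, and a hierarchical design would be counted differently.

```python
def full_mesh_sessions(n_routers):
    # Every pair of routers needs its own iBGP session.
    return n_routers * (n_routers - 1) // 2


def route_reflector_sessions(n_clients, n_reflectors=1):
    # Assumed layout: each client peers with every reflector,
    # and the reflectors form a full mesh among themselves.
    return n_clients * n_reflectors + full_mesh_sessions(n_reflectors)


mesh = full_mesh_sessions(100)                 # 100 routers, no reflectors
with_rr = route_reflector_sessions(98, 2)      # 98 clients, 2 reflectors
```

For 100 routers the full mesh needs 4950 sessions, while the assumed two-reflector layout needs 197.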
BGP routes may have their attributes manipulated when received and before
sending to peers, according to policy design decisions of the operator. Of the BGP
routes received by a BGP router, BGP first determines the validity of a route (e.g., is
the BGP next hop reachable) and then chooses the best route among valid duplicates
with different paths. The best route is decided by a hierarchy of tiebreakers among
route attributes such as IGP metric to the next hop and BGP path attributes such as
AS_PATH length. The best route is then relayed to all peers except the originating
one. One variation of this relay behavior is that a router that is not a route reflector
does not relay any route received from an iBGP peer to any other iBGP peer.
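The selection and relay rules above can be sketched as follows. This is a deliberately reduced model: it checks only next-hop reachability and then compares AS_PATH length followed by IGP metric, whereas real BGP implementations evaluate more attributes in a fixed order. The function names, the dictionary layout, and the tiebreaker order used here are assumptions for illustration.

```python
def best_route(candidates, igp_metric):
    """Pick the best of several routes to the same prefix.

    candidates: list of dicts with "next_hop" and "as_path" keys.
    igp_metric: dict mapping reachable next hops to their IGP cost.
    """
    # Validity check: the BGP next hop must be reachable via the IGP.
    valid = [r for r in candidates if r["next_hop"] in igp_metric]
    if not valid:
        return None
    # Simplified tiebreakers: shorter AS_PATH first, then lower IGP metric.
    return min(valid, key=lambda r: (len(r["as_path"]), igp_metric[r["next_hop"]]))


def relay_targets(learned_from, peers, is_route_reflector):
    """Return the peers a best route is relayed to.

    peers: dict mapping peer name to "ibgp" or "ebgp".
    """
    targets = []
    for name, kind in peers.items():
        if name == learned_from:
            continue  # never relay back to the originating peer
        from_ibgp = peers[learned_from] == "ibgp"
        if from_ibgp and kind == "ibgp" and not is_route_reflector:
            continue  # iBGP-learned routes are not relayed to other iBGP peers
        targets.append(name)
    return targets


candidates = [
    {"next_hop": "pe1", "as_path": (65001, 65002, 65003)},
    {"next_hop": "pe2", "as_path": (65010, 65003)},
    {"next_hop": "pe3", "as_path": (65020, 65003)},
    {"next_hop": "pe4", "as_path": (65003,)},   # shortest path, but unreachable
]
igp = {"pe1": 10, "pe2": 30, "pe3": 20}
chosen = best_route(candidates, igp)

peers = {"ce1": "ebgp", "pe2": "ibgp", "pe3": "ibgp"}
```

In this example the route via `pe4` is discarded as invalid, and `pe3` wins over `pe2` on IGP metric after the AS_PATH lengths tie.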

[Figure 2.15: CE, PE, and RR routers connected by iBGP and eBGP sessions, with PE routers marked as iBGP clients of the Route Reflectors. Legend: PE = Provider Edge router (Access Router), CE = Customer Edge router, RR = Route Reflector, iBGP = Interior BGP, eBGP = Exterior BGP.]
Fig. 2.15 BGP connections in an ISP with Route Reflectors (RR)

2.4.1.3 Protocol-Independent Multicast
IP Multicast is very efficient when a source sends data to multiple receivers.
By using multicast at the network layer, a packet traverses a link only once, and
therefore the network bandwidth is utilized optimally. In addition, the processing at
routers (forwarding load) as well as at the end-hosts (discarding unwanted packets)
is reduced. Multicast applications generally use UDP as the underlying transport
protocol, since there is no unique context for the feedback received from the various receivers for congestion control purposes. We provide a brief overview of IP
Multicast in this section. It is covered in greater detail in Chapter 11.
IP Multicast uses group addresses from the Class “D” address space (in the
context of IPv4). The range of IP addresses that are used for IP Multicast group
addresses is 224.0.0.0 to 239.255.255.255. When a source sends a packet to an IP
Multicast group, all the receivers that have joined that group receive it. The typical protocol used between the end-hosts and routers is Internet Group Management
Protocol (IGMP). Receivers (end-hosts) announce their presence (join a multicast
group) by sending an IGMP report. From the first router, the indication of the intent of an end-host to join the multicast group is forwarded through
routers upwards along the shortest path to the root of the multicast tree. The root
for an IP Multicast tree can be a source in a source-based distribution tree, or it
may be a “rendezvous point” when the tree is a shared distribution tree. The routing
protocol used in conjunction with IP multicast is called Protocol-Independent Multicast (PIM). PIM has variants of the routing protocol used to form the multicast
tree to forward traffic from a source (or sources) to the receivers. A router forwards
a multicast packet only if it was received on the upstream interface to the source
or to a rendezvous point (in a shared tree). Thus, a packet sent by a source follows
the distribution tree. To avoid loops, if a packet arrives on an interface that is not
on the shortest path toward the source or rendezvous point, the packet is discarded
(and thus not forwarded). This is called Reverse Path Forwarding (RPF), a critical
aspect of multicast routing. RPF avoids loops by not forwarding duplicate packets.
PIM relies on the SPT created by the traditional routing protocols such as OSPF to
find the path back to the multicast source using RPF.
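The RPF check can be sketched with Python's standard `ipaddress` module. The unicast table below (a dictionary from prefix to interface name) and the function names are hypothetical stand-ins for the routes that an IGP such as OSPF would install.

```python
import ipaddress

# The IPv4 multicast range 224.0.0.0 to 239.255.255.255 is the /4 block below.
MULTICAST_RANGE = ipaddress.ip_network("224.0.0.0/4")


def is_multicast_group(address):
    return ipaddress.ip_address(address) in MULTICAST_RANGE


def rpf_accept(source, arrival_interface, unicast_routes):
    """Accept a multicast packet only if it arrived on the interface that the
    unicast routing table would use to reach its source (or rendezvous point)."""
    src = ipaddress.ip_address(source)
    matches = [net for net in unicast_routes if src in net]
    if not matches:
        return False                       # no route back to the source: discard
    best = max(matches, key=lambda net: net.prefixlen)   # longest prefix match
    return unicast_routes[best] == arrival_interface


unicast_routes = {
    ipaddress.ip_network("10.0.0.0/8"): "eth0",
    ipaddress.ip_network("10.1.0.0/16"): "eth1",
}
```

A packet from source 10.1.2.3 passes the check only when it arrives on `eth1`; a copy arriving on `eth0` is discarded, which is what prevents loops.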
IP Multicast uses soft-state to keep the multicast forwarding state at the routers
in the network. There are two broad approaches for maintaining multicast state. The
first is termed PIM-Dense Mode, wherein traffic is first flooded throughout the network, and the tree is “pruned” back along branches where the traffic is not wanted.
The underlying assumption is that there are multicast receivers for this group at
most locations, and hence flooding is appropriate. The flood and prune behavior is
repeated, in principle, once every 3 min. However, this results in considerable overhead (as the traffic would be flooded until it is pruned back) each time. Every router
also ends up keeping state for the multicast group. To avoid this, the router downstream of a source periodically sends a “state refresh” message that is propagated
hop-by-hop down the tree. When a router receives the state refresh message on the
RPF interface, it refreshes the prune state, so that it does not forward traffic received
subsequently, until a receiver joins downstream on an interface.
While PIM-Dense Mode is desirable in certain situations (e.g., when receivers are
likely to exist downstream of each of the routers – densely populated groups – hence
the name), PIM-Sparse Mode (PIM-SM) is more appropriate for wide-scale deployment of IP multicast for both densely and sparsely populated groups. With PIM-SM,
traffic is sent only where it is requested, and receivers are required to explicitly join
a multicast group to receive traffic. While PIM-SM uses both a shared tree (with a
rendezvous point, to allow for multiple senders) as well as a per-source tree, we describe a particular mode, PIM-Source Specific Multicast (PIM-SSM), which is more
commonly used for IPTV distribution. More details regarding PIM-SM, including
PIM using a shared tree, are described in Chapter 11. PIM-SSM is adopted when the
end-hosts know exactly which source and group, typically denoted (S,G), to join
to receive the multicast transmissions from that source. In fact, by requiring that receivers signal the combination of source and group to join, different sources could
share the same group address and not interfere with each other. Using PIM-SSM,
a receiver transmits an IGMP join message for the (S,G) and the first hop router
sends an (S,G) join message directly along the shortest path toward the source.
The shortest path tree is rooted at the source.
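The reason that sources can share a group address under PIM-SSM is that forwarding state is keyed by the pair (S,G) rather than by G alone. The sketch below is a hypothetical per-router state table, not PIM itself; the class name, addresses, and interface names are illustrative assumptions.

```python
from collections import defaultdict


class SsmRouterState:
    """Per-router multicast state keyed by (source, group)."""

    def __init__(self):
        self.outgoing = defaultdict(set)   # (S, G) -> downstream interfaces

    def join(self, source, group, interface):
        # A receiver (or downstream router) joined (S, G) on this interface.
        self.outgoing[(source, group)].add(interface)

    def forward_to(self, source, group):
        # Traffic is sent only where it was explicitly requested.
        return sorted(self.outgoing.get((source, group), ()))


router = SsmRouterState()
router.join("192.0.2.10", "232.1.1.1", "eth1")   # receiver wants source .10
router.join("192.0.2.20", "232.1.1.1", "eth2")   # same group, different source
```

Traffic from 192.0.2.10 to 232.1.1.1 is forwarded only on `eth1`, and traffic from a source that nobody joined is not forwarded at all.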
One of the key properties of IP Multicast is that the multicast routing operates
somewhat independently of the IGP routing. Changes to the network topology are
reflected in the unicast routing using updates that operate on short-time scales (e.g.,
transmission of LSAs in OSPF reflects a link or node failure immediately). However,
IP Multicast routing reflects the changed topology only when the multicast state
is refreshed. For example, with PIM-SSM, the updated topology is reflected only
when the join is issued periodically (which can be up to a minute or more) by the
receiver to refresh the state. We will examine the consequence of this for wide-area
IPTV distribution later in this chapter.

2.4.2 Multiprotocol Label Switching
2.4.2.1 Overview of MPLS
Multiprotocol Label Switching (MPLS) is a technology developed in the late 1990s
that added new capabilities and services to IP networks. It was the culmination of
various IP switching technology efforts such as multiprotocol over ATM, Ipsilon’s
IP Switching, and Cisco’s tag switching [7,20]. The key benefits provided by MPLS
to an ISP network are:
1. Separation of routing (the selection of paths through the network) from forwarding/switching via IP address header lookup
2. An abstract hierarchy of aggregation
To understand these concepts, we first consider how normal IP routing in an ISP network functions. In an IP network without MPLS, there is a topology hierarchy with
edge and backbone routers. There is also a routing hierarchy with BGP carrying external reachability information and an IGP like OSPF carrying internal reachability
information. BGP carries the information about which exit router (BGP next hop)
is used to reach external address space. OSPF picks the paths across the network
between the edges (see Fig. 2.16). It is important to note that every OSPF router
knows the complete path to reach all the edges. The internal paths that OSPF picks
and the exit routers from BGP are determined before the first packet is forwarded.
The connection-less and hop-by-hop forwarding behavior of IP routing requires that
every router have this internal and external routing information present.

[Figure 2.16: CE routers attach to PE routers around a provider router network of P routers. A packet addressed to A.1 is forwarded toward external network A using hop-by-hop route lookup, with routes chosen using OSPF interior routing protocols. Legend: P = Provider router (Backbone Router), PE = Provider Edge router (Access Router), CE = Customer Edge switch.]
Fig. 2.16 Traditional IP routing with external routes distributed throughout backbone

Consider the example in Fig. 2.16, where a packet enters on the left with
address A.1 destined to the external network A on the upper right. When the
first packet arrives, the receiving provider edge router (PE) looks up the destination
IP address. From BGP, it learns that the exit router for that address is the upper
right PE. From OSPF, the path to reach that exit PE is determined. Even though the
ingress PE knows the complete path to reach the exit PE, it simply forwards the
packet to the next-hop backbone router, labeled as a P-router (P) in the figure.
The backbone router then repeats the process: using the packet IP address, it determines the exit from BGP and the path to the exit from OSPF to forward the packet
to the next-hop BR. The process repeats again until the packet reaches the exit PE.
The repeated lookup of the packet destination to find the external exit and internal
path appears to be unnecessary. The lookup operation itself is not expensive, but the
issue is the unnecessary state and binding information that must be carried inside
the network. The ingress router knows the path to reach the exit. If the packet could
somehow be bound to the path itself, then the successive next-hop routers would
only need to know the path for the packet and not its actual destination. This is what
MPLS accomplishes.
Consider Fig. 2.17 where MPLS sets up an end-to-end Label Switched Path (LSP)
by assigning labels to the interior paths to reach exits in the network. The LSP
might look like the one shown in Fig. 2.18. The backbone routers are now called
Label Switch Routers (LSR). Via MPLS signaling protocols, the LSR knows how
to forward a packet carrying an incoming label for an LSP to an outgoing interface
and outgoing label; this is called a “swap” operation. The PE router also acts as an
LSR, but is usually at the head (start) or end (tail) of the LSP where, respectively,
the initial label is “pushed” onto the data or “popped” (removed) from the data.

[Figure 2.17: The network of Fig. 2.16 with the backbone routers shown as LSRs. The route for the packet addressed to A.1 is looked up once and the associated label is assigned to the packet, which then follows the LSP. Routes are chosen using OSPF interior routing protocols. Legend: LSR = Label Switch Router, PE = Provider Edge router (Access Router), CE = Customer Edge router.]
Fig. 2.17 Routing with MPLS creates Label Switched Paths (LSP) for routes across the network

[Figure 2.18: A Label Switched Path between its "head end" and its "tail end", showing a PUSH, two SWAP operations, and a POP applied to a data packet, with example labels 233, 666, and 417.]
Fig. 2.18 Within an LSP, labels are assigned at each hop by the downstream router

In the example of Fig. 2.17, external BGP routing information such as routes to
network A is only needed in the edges of the network. The interior LSRs only need
to know the interior path among the edges as determined by OSPF. When the packet
with address A.1 arrives at the ingress PE, the same lookup operation is done as
previously: the egress PE is determined from BGP and the interior path to reach the
egress is found from OSPF. But this time the packet is given a label for the LSP
matching the OSPF path to the egress. The internal LSRs now forward the packet
hop-by-hop based on the labels alone. At the exit PE, the label is removed and the
packet is forwarded toward its external destination.
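The push, swap, and pop sequence can be sketched as a walk over per-router label tables. The function name and table layout are hypothetical, and the label values are borrowed from Fig. 2.18 only as examples; the order in which they are applied here is an assumption.

```python
def forward_on_lsp(packet, ingress_label, swap_tables):
    """Carry a packet along an LSP and return it with the labels it carried.

    swap_tables: one dict per interior LSR, mapping incoming to outgoing label.
    """
    label = ingress_label                 # PUSH at the ingress PE
    seen = [label]
    for table in swap_tables:
        label = table[label]              # SWAP: only the label is examined
        seen.append(label)
    return packet, seen                   # POP at the egress PE


packet = {"dst": "A.1", "payload": "data"}
lsr_tables = [{233: 666}, {666: 417}]     # illustrative labels from Fig. 2.18
delivered, labels = forward_on_lsp(packet, 233, lsr_tables)
```

Note that the interior tables contain no entry for the destination A.1: the interior LSRs forward on labels alone, so they need no external BGP routes.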
In this example, the binding of a packet to paths through the network is only
done once – at the entrance to the network. The assignment of a packet to a path
through the network is separated from the actual forwarding of the packet through
the network (this is the first benefit that was identified above). Further, a hierarchy
of forwarding information is created: the external routes are only kept at the edge of
the network while the interior routers only know about interior paths. At the ingress
router all received packets needing to exit the same point of the network receive the
same label and follow the same LSP.
MPLS takes these concepts and generalizes them further. For example, the LSP to
the exit router could be chosen differently from the IGP shortest path. IPv4 provides
a method for explicit path forwarding in the IP header, but it is very inefficient.
With MPLS, explicit routing becomes very efficient and is the primary tool for traffic
engineering in IP backbones. In the previous example, if an interior link was heavily
utilized, the operator may desire to divert some traffic around that link by taking a
longer path as shown in Fig. 2.19. Normal IP shortest path forwarding does not allow
for this kind of traffic placement.
The forwarding hierarchy can be used to create provider-based VPNs. This is
illustrated in Fig. 2.20. Virtual private routing contexts are created at the PEs, one
per customer VPN. The core of the network does not need to maintain state information about individual VPN routes. The same LSPs for reaching the exits of the
network are used, but there are additional labels assigned for separating the different VPN states.

[Figure 2.19: The network of Fig. 2.17 with an LSP that follows a path other than the IGP shortest path. Routes are chosen using OSPF interior routing protocols. Legend: LSR = Label Switch Router, PE = Provider Edge router (Access Router), CE = Customer Edge router.]
Fig. 2.19 MPLS with Traffic Engineering can use an alternative to the IGP shortest path

[Figure 2.20: CE routers attached to PE routers that are interconnected through LSRs by LSPs. Legend: LSR = Label Switch Router, PE = Provider Edge router, CE = Customer Edge router.]
Fig. 2.20 MPLS VPNs support separated virtual routing contexts in PEs interconnected via LSPs

In summary, the advantages to the IP backbone of decoupling routing and
forwarding are:
• It achieves efficient explicit routing.
• Interior routers do not need any external reachability information.
• Packet header information is only processed at the head of the LSP (e.g., at the edges of the network).
• It is easy to implement nested or hierarchical identification (such as with VPNs).

2.4.2.2 Internet Route Free Core
The ability of MPLS to remove the external BGP information plus Layer 3 address
lookup from the interior of the IP backbone is sometimes referred to as an Internet
Route Free Core. The “interior” of the IP backbone starts at the left-side (BR-side)
port of the access routers in Fig. 2.7. Some of the advantages of Internet Route Free
Core include:
• Traffic engineering using BGP is much easier.
• Route reflectors no longer need to be in the forwarding plane, and thus can be dedicated to IP layer control plane functions or even placed on a server separate from the routers.
• Denial of Service (DoS) attacks and security holes are better controlled because BGP routing decisions only occur at the edges of the IP backbone.
• Enterprise VPN and other priority services can be better isolated from the “Public Internet”.
We provide more clarification for the last advantage. Many enterprise customers,
such as financial companies or government agencies, are concerned about mixing
their priority traffic with that of the public Internet. Of course, all packets are mixed
on links between backbone routers; however, VPN traffic can be functionally segregated via LSPs. In particular, since denial of service attacks from the compromised
hosts on the public Internet rely on reachability from the Internet, the private MPLS
VPN address space isolates VPN customers from this threat. Further, enterprise premium VPN customers are sometimes clustered onto access routers dedicated to the
VPN service. Furthermore, better performance (such as lower packet loss or latency) for
premium VPN services can be provided by implementing priority queueing or providing them bandwidth-sensitive LSPs (discussed later). A similar approach can be
used to provide other performance-sensitive services, such as Voice-over-IP (VoIP).
2.4.2.3 Protocol Basics
MPLS encapsulates IP packets in an MPLS header consisting of one or more MPLS
labels, known as a label stack. Figure 2.21 shows the most commonly used MPLS
encapsulation type. The first 20 bits are the actual numerical label. There are three
bits for inband signaling of class of service type, followed by an End-of-Stack bit
(described later) and a time-to-live field, which serves the same function as an IP
packet time-to-live field.
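The field layout just described (20-bit label, 3-bit class of service, 1-bit End-of-Stack, 8-bit time-to-live) fits in one 32-bit word. The sketch below packs and unpacks one such entry in network byte order; the function names are hypothetical.

```python
import struct


def pack_entry(label, cos, end_of_stack, ttl):
    """Encode one MPLS label stack entry as 4 bytes."""
    if not (0 <= label < 2**20 and 0 <= cos < 8 and 0 <= ttl < 256):
        raise ValueError("field out of range")
    word = (label << 12) | (cos << 9) | (int(end_of_stack) << 8) | ttl
    return struct.pack("!I", word)        # "!I": big-endian unsigned 32-bit


def unpack_entry(data):
    """Decode 4 bytes into (label, cos, end_of_stack, ttl)."""
    (word,) = struct.unpack("!I", data)
    return word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 1), word & 0xFF


entry = pack_entry(233, 0, True, 64)
```

For example, label 233 with class of service 0, the End-of-Stack bit set, and a TTL of 64 encodes to the bytes `00 0e 91 40`.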
MPLS encapsulation does not define a framing mechanism to determine the
beginning and end of packets; it relies on existing underlying link-layer technologies.

[Figure 2.21: Layer 2 Header | PID | MPLS Label 1 | MPLS Label 2 | … | MPLS Label n | Layer 3 Packet. Each MPLS label entry consists of Label (20 bits) | CoS (3 bits) | Stack (1 bit) | TTL (8 bits).]
Fig. 2.21 Generic MPLS encapsulation and header fields

Existing protocols such as Ethernet, Point-to-Point Protocol (PPP), ATM, and
Frame Relay have been given new protocol IDs or new link-layer control fields to
allow them to directly encapsulate MPLS-labeled packets.
Also, MPLS does not have a protocol ID field to indicate the type of packet
encapsulated, such as IPv4, IPv6, Ethernet, etc. Instead, the protocol type of the
encapsulated packet is implied by the label and communicated by the signaling protocol when the label is allocated.
MPLS defines the notion of a Forwarding Equivalence Class (FEC) (not to be
confused with Forward Error Correction (FEC) in lower network layers defined earlier). All packets with the same forwarding requirements, such as path and priority
queuing treatment, can belong to the same FEC. Each FEC is assigned a label. Many
FEC types have been defined by the MPLS standards: IPv4 unicast route, VPN IPv4
unicast route, IPv6 unicast route, Frame Relay permanent virtual circuit, ATM virtual circuit, Ethernet VLAN, etc.
Labels can be stacked, with the last (bottom) label of the stack marked by the end-of-stack bit. This allows hierarchical nesting of FECs, which permits VPNs, traffic
engineering, and hierarchical routing to be created simultaneously in the same
network. Consider the previous VPN example, where an outer label may represent the interior path to reach an exit and an inner label may represent a VPN context.
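A receiver finds the bottom of a label stack by reading 4-byte entries until it sees one with the end-of-stack bit set. The sketch below assumes the 32-bit entry layout of Fig. 2.21; the function name and the label values 100 and 200 are hypothetical.

```python
def parse_label_stack(data):
    """Split a byte string into (list of labels, remaining payload)."""
    labels = []
    offset = 0
    while True:
        if offset + 4 > len(data):
            raise ValueError("label stack ended without an end-of-stack bit")
        word = int.from_bytes(data[offset:offset + 4], "big")
        offset += 4
        labels.append(word >> 12)          # top 20 bits are the label
        if (word >> 8) & 1:                # end-of-stack bit marks the last label
            return labels, data[offset:]


# Outer label 100 (end-of-stack clear), inner label 200 (end-of-stack set), TTL 64.
frame = bytes.fromhex("00064040" "000c8140") + b"IP packet"
stack, payload = parse_label_stack(frame)
```

Here the outer label 100 could stand for the interior path to the exit and the inner label 200 for a VPN context, matching the nesting described above.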
MPLS is called “multiprotocol” because it can be carried over almost any
transport as mentioned above, ironically even IP itself, and because it can carry
the payload for many different packet types – all the FEC types mentioned above.
Signaling of MPLS FECs and their associated label among routers and switches
can be done using many different protocols. A new protocol, the Label Distribution Protocol (LDP), was defined specifically for MPLS signaling. However,
existing protocols have also been extended to signal FECs and labels: Resource
Reservation Protocol (RSVP) [3] and BGP, for example.

2.4.2.4 IP Traffic Engineering and MPLS
The purpose of IP traffic engineering is to enable efficient use of backbone capacity, that is, to ensure both that links and routers in the network are not congested
and that they are not underutilized. Traffic engineering may also mean ensuring that
certain performance parameters such as latency or minimum bandwidth are met.

To understand how MPLS traffic engineering plays a role in ISP networks, we first
explain the generic problem to be solved – the multicommodity flow problem – and
how it was traditionally solved in IP networks versus how MPLS can solve the problem.
Consider an abstract network topology with traffic demands among nodes.
There are:
Demands d(i, j) from node i to node j
Constraints: link capacity b(i, j) between nodes
Link costs C(i, j)
Path p(k) or route for each demand
The traffic engineering problem is to find paths for the demands that fit the link
constraints. The problem can be specified at different levels of difficulty:
1. Find any feasible solution, regardless of the path costs.
2. Find a solution that minimizes the costs for the paths.
3. Find a feasible or a minimum cost solution after deleting one or more nodes
and/or links.
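One way to write the first two levels down, using the symbols defined above, is given here. It is a sketch under stated assumptions: demands are indexed by k, demand k has volume d_k = d(s_k, t_k) between its end nodes s_k and t_k, each demand follows a single path p(k), and path cost is weighted by demand volume. None of these choices is prescribed by the text.

```latex
% Assumed notation: d_k = d(s_k, t_k) is the volume of demand k,
% p(k) is the set of links on its path, K is the number of demands.
\begin{align}
  \sum_{k \,:\, (i,j) \in p(k)} d_k &\le b(i,j)
      && \text{for every link } (i,j) \quad \text{(feasibility, level 1)} \\
  \min_{p(1),\dots,p(K)} \; \sum_{k=1}^{K} d_k \sum_{(i,j) \in p(k)} C(i,j) &
      && \text{subject to the constraint above (level 2)}
\end{align}
```

Level 3 asks for the same conditions to hold again after the failed nodes or links have been removed from the set of links over which paths may be chosen.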

Traffic Engineering an IP Network
In an IP network, the capacities represent link bandwidths between routers and the
costs might represent delay across the links. Sometimes, we only want to find a
feasible solution, such as in a multicast IPTV service. Sometimes, we want to minimize the maximum path delay, such as in a Voice-over-IP service. And sometimes,
we want to ensure a design that is survivable (meaning it is still feasible to carry the
traffic) for any single- or dual-link failure.
Consider how a normal ISP without traffic engineering might try to solve the
problem. The tools available on a normal IP network are:
• Metric manipulation, i.e., pick OSPF weights to create a feasible solution.
• Simple topology or link augmentation: this tends to overengineer the network and restrict the possible topology.
• Source or policy route using the IPv4 header option or router-based source routes. Source routes are very inefficient, resulting in tremendously lower router capacity, and they are not robust, making the network very difficult to operate.
Figure 2.22 illustrates a network with a set of demands and an example of the way
that particular demands might be routed using OSPF. Although the network has
sufficient total capacity to carry the demands, it is not possible to find a feasible
solution (with no congested links) by only setting OSPF weights. A small ISP facing
this situation without technology like MPLS would probably resort to installing
more link capacity on the A-D-C node path.
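The congestion in this example can be checked by adding up the demand carried on each link. The sketch below uses the capacities and demands listed for Fig. 2.22, but the link list is an assumption, since the figure's exact connectivity is not reproduced here: nodes 1 and 2 are taken to attach to A, nodes 3 and 4 to C, and A reaches C through either D or B. The function names are hypothetical.

```python
def link_loads(routing, demands):
    """Sum the demand volume placed on each (undirected) link."""
    loads = {}
    for demand, path in routing.items():
        for a, b in zip(path, path[1:]):
            link = frozenset((a, b))
            loads[link] = loads.get(link, 0.0) + demands[demand]
    return loads


def overloaded(loads, capacity):
    """Return the links whose load exceeds capacity, as sorted tuples."""
    return sorted(tuple(sorted(link)) for link, load in loads.items()
                  if load > capacity[link] + 1e-9)


# Assumed connectivity (not confirmed by the source figure).
links = [("1", "A"), ("2", "A"), ("A", "D"), ("D", "C"),
         ("A", "B"), ("B", "C"), ("C", "3"), ("C", "4")]
capacity = {frozenset(link): 1.0 for link in links}
capacity[frozenset(("C", "3"))] = 2.0
demands = {("2", "3"): 0.75, ("1", "3"): 0.4, ("1", "4"): 0.4}

# All demands follow the single shortest path through D.
shortest_path = {
    ("2", "3"): ["2", "A", "D", "C", "3"],
    ("1", "3"): ["1", "A", "D", "C", "3"],
    ("1", "4"): ["1", "A", "D", "C", "4"],
}
# An explicit placement that moves the two smaller demands through B.
explicit = {
    ("2", "3"): ["2", "A", "D", "C", "3"],
    ("1", "3"): ["1", "A", "B", "C", "3"],
    ("1", "4"): ["1", "A", "B", "C", "4"],
}
```

Under these assumptions the shortest-path placement loads A-D and D-C with 1.55 units against a capacity of 1, while the explicit placement overloads no link.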
[Figure 2.22: Nodes 1, 2, 3, and 4 attached to a core of routers A, B, C, and D. All link capacities = 1 unit, except C-3 = 2 units. Demand (2,3) = 0.75 units; Demand (1,3) = 0.4 units; Demand (1,4) = 0.4 units.]
Fig. 2.22 IP routing is limited in its ability to meet resource demands. It cannot successfully route the demands within the link bandwidths in this example

The generic solution to an arbitrary traffic engineering problem requires specifying the explicit route (path) for each demand. This is a complex problem that can
take an indeterminate time to solve. But there are other approaches that can solve
a large subset of problems. One suboptimal approach is Constraint-based Shortest
Path First (CSPF). CSPF has been implemented in networks with ATM Private
Network-to-Network Interface (P-NNI) and IP MPLS. For currently defined MPLS
protocols, the constraints can be bandwidths per class of service for each link. Also,
links can be assigned a set of binary values, which can be used to include or exclude
the links from routing a given demand.
CSPF is implemented in a distributed fashion where all nodes have a full
knowledge of network resource allocation. Then, each node routes its demands
independently by:
1. Pruning the network to only feasible paths
2. Picking the shortest of the feasible paths on the pruned network
Although CSPF routing is suboptimal when compared with a theoretical multicommodity flow solution, it is a reasonable compromise for solving many traffic
engineering problems in which the nodes route their demands independently of each
other. For more complex situations where CSPF is inadequate, network planners
must use explicit paths computed by an offline system. The next section discusses
explicit routing in more detail.
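The two CSPF steps can be sketched as a pruning pass followed by an ordinary shortest-path search. This is a minimal sketch, assuming bidirectional links described by a cost and a single available-bandwidth value; the function name, the data layout, and the example numbers are hypothetical, and per-class bandwidths and include/exclude attributes are left out.

```python
import heapq


def cspf(links, source, destination, bandwidth):
    """Return (cost, path) for the cheapest path with enough bandwidth, or None.

    links: dict mapping (u, v) to (cost, available_bandwidth).
    """
    # Step 1: prune links that cannot carry the requested bandwidth.
    neighbors = {}
    for (u, v), (cost, available) in links.items():
        if available >= bandwidth:
            neighbors.setdefault(u, []).append((v, cost))
            neighbors.setdefault(v, []).append((u, cost))

    # Step 2: Dijkstra's shortest path on the pruned topology.
    heap = [(0, source, [source])]
    visited = set()
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == destination:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, cost in neighbors.get(node, []):
            if nxt not in visited:
                heapq.heappush(heap, (dist + cost, nxt, path + [nxt]))
    return None                            # no feasible path remains


links = {
    ("A", "D"): (1, 0.25), ("D", "C"): (1, 0.25),   # short but nearly full
    ("A", "B"): (2, 1.0), ("B", "C"): (2, 1.0),     # longer with spare capacity
}
```

With these numbers a 0.2-unit request still takes the short path A-D-C, a 0.4-unit request is pushed onto A-B-C, and a 2.0-unit request finds no feasible path.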

Traffic Engineering Using MPLS
The main problems with traffic engineering an IP backbone with only a Layer 3
IGP routing protocol (such as OSPF) are (1) lack of knowledge of resource allocation and (2) no efficient explicit routing. The previous example of Fig. 2.22 shows
how OSPF would route all demands onto a link that does not have the necessary capacity. Another example problem is when a direct link is needed for a small demand
between nodes to meet certain delay requirements. But OSPF cannot prevent other
traffic demands from routing over this smaller link and causing congestion. MPLS
solves this with extensions to OSPF (OSPF-TE) [21] to provide resource allocation
knowledge and RSVP-TE [2] for efficient signaling of explicit routes to use those
resources.
[Figure 2.23: The topology of Fig. 2.22 with a PATH message requesting 0.4 Mbps sent downstream and a RESV message with labels returned upstream.]
Fig. 2.23 RSVP messaging to set up explicit paths

[Figure 2.24: The topology, link capacities, and demands of Fig. 2.22, with the demands placed on traffic-engineered paths. All link capacities = 1 unit, except C-3 = 2 units. Demand (2,3) = 0.75 units; Demand (1,3) = 0.4 units; Demand (1,4) = 0.4 units.]
Fig. 2.24 MPLS-TE enables efficient capacity usage through traffic engineering to solve the example in Fig. 2.22

See Fig. 2.23 for a simple example of how an explicit path is created. RSVP-TE
can create an explicit hop-by-hop path in the PATH message downstream. The PATH
message can request resources such as bandwidth. The return message is an RESV,
which contains the label that the upstream node should use at each link hop. In
this example, a traffic-engineered LSP is created along path A-B-C for 0.4 Mb/s.
These LSPs are referred to as traffic engineering tunnels. Tunnels can be created
and differentiated for many purposes (including restoration to be defined in later
sections). But in general, primary (service route) tunnels can be considered as a
routing mechanism for all packets of a given FEC between a given pair of routers or
router interfaces. Using this machinery, Fig. 2.24 illustrates how MPLS-TE can be
used to solve the capacity overload problem in the network shown in Fig. 2.22.
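The PATH and RESV exchange can be sketched as a downstream admission check followed by an upstream pass that reserves bandwidth and hands out labels. This is a toy model rather than RSVP-TE: the function name, the per-node label counters, and the starting label values are hypothetical, and message formats, soft state, and error handling are omitted.

```python
def signal_tunnel(explicit_route, bandwidth, available, next_label):
    """Set up a traffic engineering tunnel along an explicit route.

    available:  dict mapping directed hop (u, v) to unreserved bandwidth.
    next_label: dict mapping node to the next free label it can allocate.
    Returns the label to use on each hop, or None if admission fails.
    """
    hops = list(zip(explicit_route, explicit_route[1:]))

    # PATH travels downstream, requesting bandwidth on every hop.
    for hop in hops:
        if available[hop] < bandwidth:
            return None                    # request refused, nothing reserved

    # RESV travels upstream: each downstream node reserves the bandwidth and
    # tells its upstream neighbor which label to use on that hop.
    labels = {}
    for upstream, downstream in reversed(hops):
        available[(upstream, downstream)] -= bandwidth
        labels[(upstream, downstream)] = next_label[downstream]
        next_label[downstream] += 1
    return labels


available = {("A", "B"): 1.0, ("B", "C"): 1.0}
next_label = {"B": 100, "C": 200}
tunnel = signal_tunnel(["A", "B", "C"], 0.4, available, next_label)
refused = signal_tunnel(["A", "B", "C"], 0.7, available, next_label)
```

The first request reserves 0.4 units on A-B-C and leaves 0.6 units on each hop, so the second request for 0.7 units is refused and changes nothing.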
The explicit path used in RSVP-TE signaling can be computed by an offline
system and automatically configured in the edge routers or the routers themselves
can compute the path. In the latter case, the edge routers must be configured with
the IP prefixes and their associated bandwidth reservations that are to be traffic-engineered to other edges of the network. Because the routers do this without
knowledge of other demands being routed in the network, the routers must receive
periodic updates about bandwidth allocations in the network.

OSPF-TE provides a set of extensions to OSPF to advertise traffic engineering
resources in the network. For example, bandwidth resources per class of service can
be allocated to a link. Also, a link can be assigned binary attributes, which can be
used for excluding or including a link for routing an LSP. These resources are advertised in an opaque LSA via OSPF link-state flooding and are updated dynamically
as allocations change. Given the knowledge of link attributes in the topology and the
set of demands, the router performs an online CSPF to calculate the explicit paths.
The path outputs of the CSPF are given to RSVP-TE to signal in the network. As TE
tunnels are created in the network, the link resources change, i.e., available bandwidth is reduced on a link after a tunnel is allocated using RSVP-TE. Periodically,
OSPF-TE will advertise the changes to the link attributes so that all routers can have
an updated view of the network.

2.4.2.5 VPNs with MPLS
Figure 2.20 illustrates the key concept in how MPLS is used to create VPN services.
VPN services here refer to carrier-based VPN services, specifically the ability of the
service provider to create private network services on top of a shared infrastructure.
For the purposes of this text, VPNs are of two basic types: a Layer 3 IP routed VPN
or a Layer 2 switched VPN. Generalized MPLS (GMPLS) [19] can also be used for
creating Layer 1 VPNs, which will not be discussed here.
A Layer 3 IP VPN service looks to customers of the VPN as if the provider
built a router backbone for their own use – like having their own private ISP. VPN
standards define the PE routers, CE routers, and backbone P-routers interconnecting
the PEs. Although the packets share (are mixed over) the ISP’s IP layer links, routing
information and packets from different VPNs are virtually isolated from each other.
A Layer 2 VPN provides either point-to-point connection services or multipoint Ethernet switching services. Point-to-point connections can be used to support
end-to-end services such as Frame Relay permanent virtual circuits, ATM virtual
circuits, point-to-point Ethernet circuits (i.e., with no Media Access Control (MAC)
learning or broadcasting) and even a circuit emulation over packet service. Interworking between connection-oriented services, such as Frame Relay to ATM
interworking, is also defined. This kind of service is sometimes called a Virtual
Private Wire Service (VPWS).
Layer 2 VPN multipoint Ethernet switching services support a traditional Transparent LAN over a wide-area network called Virtual Private LAN Service (VPLS)
[24, 25].

Layer 3 VPNs over MPLS
As mentioned previously, Layer 3 VPNs maintain a separate virtual routing context
for each VPN on the PE routers at the edge of the network. External CEs connect to
the virtual routing context on a PE that belongs to a customer’s VPN.

Layer 3 VPNs implemented using MPLS are often referred to as BGP MPLS
VPNs because of the important role BGP has in the implementation. BGP is used
to carry VPN routes between the edges of the network. BGP keeps the potentially
overlapping VPN address spaces unique by prepending onto the routes a route distinguisher (RD) that is unique to each VPN. The RD + VPN IPv4 prefix combination
creates a new unique address space carried by BGP, sometimes called the VPNv4
address space.
VPN routes flow from one virtual routing instance into other virtual routing instances on PEs in the network using a BGP attribute called a Route Target (RT). An
RT is a value (carried as a BGP extended community) configured by the ISP to identify all virtual routing instances that
belong to a VPN. RTs constrain the distribution of VPN routes among the edges of
the network so that the VPN routes are only received by the virtual routing instances
belonging to the intended (targeted) VPN.
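The filtering role of RTs can be sketched as a simple set intersection. The route records, the RT names, and the function `routes_for_vrf` below are hypothetical and only illustrate the idea that a virtual routing instance imports a route when the route carries at least one RT that the instance is configured to accept.

```python
def routes_for_vrf(import_rts, advertised_routes):
    """Return the advertised routes whose RT set intersects the VRF's import RTs."""
    wanted = set(import_rts)
    return [r for r in advertised_routes if wanted & set(r["rts"])]

# Hypothetical VPNv4 advertisements received over BGP.
advertised = [
    {"rd": "65000:1", "prefix": "10.1.0.0/16", "rts": {"RT1"}},
    {"rd": "65000:2", "prefix": "10.1.0.0/16", "rts": {"RT2"}},
]

# vr1 is configured to import RT1 only, so it sees only the first route.
vr1_routes = routes_for_vrf({"RT1"}, advertised)
```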
We note that RDs and RTs are only used in the BGP control plane – they are
not values that are somehow applied to user packets themselves. Rather, for every
advertised VPNv4 route, BGP also carries a label assignment that is unique to a
particular virtual router on the advertising PE.
Every VPN packet that is forwarded across the network receives two labels at
the ingress PE: an inner label associated with the advertised VPNv4 route and an
outer label associated with the LSP to reach the egress advertising PE (dictated by
the BGP next-hop address). See Fig. 2.25 for a simplified example. In this example,

[Figure 2.25 (diagram not reproduced): CE1 attaches to PE1 and CE2 to PE2, with LSR1, LSR2, and LSR3 in the network interior. Labels shown: vr1 on PE1 (RT1, RD1) maps route Z to inner label L4 via PE2, and PE2 to outer label L1 via LSR1; LSR1 swaps L1 to L2; LSR3 pops L2; vr1 on PE2 (RT1, RD1) maps route Z with label L4 to CE2 over LNK2. The packet leaves PE1 as L1|L4|Z.]

Fig. 2.25 In this VPN example, a virtual routing context (vr1) in the PEs contains the VPN label
and routing information such as route target (RT1) and route distinguisher (RD1), attached CE
interfaces, and next-hop lookup and label binding. VPN traffic is transported using a label stack of
VPN label and interior route label


there is a VPN advertising a route Z, which enters the receiving virtual router (vr1)
and is distributed by BGP to other PE virtual routers using RTs. A packet entering the VPN destined toward Z is looked up in the virtual routing instance, where
the two labels are found – the outer label to reach the egress PE and the inner label
for the egress virtual routing instance.
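The two-label forwarding just described can be traced with a small sketch. The label names (L1, L2, L4) follow Fig. 2.25; the hop sequence PE1, LSR1, LSR3, PE2 and the table layouts are assumptions made for illustration.

```python
# Ingress PE (PE1), virtual routing instance vr1: route -> (inner VPN label, egress PE)
VR1_ROUTES = {"Z": ("L4", "PE2")}
# PE1 LSP table: egress PE -> (outer label, next hop)
PE1_LSPS = {"PE2": ("L1", "LSR1")}
# Interior LSRs: incoming outer label -> (action, outgoing label, next hop)
LSR_TABLES = {
    "LSR1": {"L1": ("swap", "L2", "LSR3")},
    "LSR3": {"L2": ("pop", None, "PE2")},
}
# Egress PE (PE2): inner label -> (attached CE, link)
PE2_VPN_LABELS = {"L4": ("CE2", "LNK2")}

def forward(dest):
    """Trace a packet for `dest` from PE1 to the egress CE, recording the label stack."""
    inner, egress_pe = VR1_ROUTES[dest]
    outer, node = PE1_LSPS[egress_pe]
    stack = [outer, inner]            # outer label sits on top of the stack
    trace = [("PE1", list(stack))]
    while node in LSR_TABLES:         # interior LSRs look only at the outer label
        action, out_label, next_hop = LSR_TABLES[node][stack[0]]
        if action == "swap":
            stack[0] = out_label
        else:                         # pop the outer label before the egress PE
            stack.pop(0)
        trace.append((node, list(stack)))
        node = next_hop
    ce, link = PE2_VPN_LABELS[stack[0]]   # egress PE uses the inner label
    return trace, ce, link

trace, ce, link = forward("Z")
```

Note that the interior LSRs never consult the inner label or the VPN route, which is why the backbone does not need to hold any VPN routing state.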

Layer 2 VPNs over MPLS
The implementation of Layer 2 VPNs over MPLS is similar to Layer 3 VPNs.
Because there is no IP routing in the VPN service, there is instead a virtual
switching context created on the edge PEs to isolate different VPNs. These virtual
switching contexts keep the address spaces of the edge services from conflicting
with each other across different VPNs.
Layer 2 VPNs use a two-label stack approach that is similar to Layer 3 VPNs.
Reaching an egress PE from an ingress PE is done using the same network interior
LSPs that the Layer 3 VPN service would use. And then, there is an inner label
associated with either the VPWS or VPLS context at the egress PE. This inner label
can be signaled using either LDP or BGP. The inner label and the packet encapsulation comprise a pseudowire, as defined in the PWE3 standards [16]. The pseudowire
connects an ingress PE to an egress PE switching context and is identified by the
inner label. The VPWS service represents a single point-to-point connection, so
there will only be a single pseudowire set up in each direction. For VPLS, however,
carriers typically set up a full mesh of pseudowires/LSPs among all PEs belonging
to that VPLS.
Forwarding for a VPWS is straightforward: the CE connection is associated
with the appropriate pseudowires in each direction when provisioned. For VPLS,
forwarding is determined by the VPLS forwarding table entry for the destination
Ethernet MAC address. Populating the forwarding table is based on source MAC
address learning. The forwarding table records the inbound interface on which a
source MAC was seen. If the destination MAC is not in the table, then the packet
is flooded to all interfaces attached to the VPLS. Flooding of unknown destination
MACs and broadcast MACs follows some special rules within a VPLS. All PEs
within a backbone are assumed to be full mesh connected with pseudowires. So,
packets received from the backbone are not flooded again into the backbone, but are
only flooded onto CE interfaces. On the other hand, packets from a CE to be flooded
are sent to all attached CE interfaces and all pseudowire interfaces toward the other
backbone PEs.
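The learning and flooding rules above can be sketched as follows. The class `VplsInstance`, the port names, and the short MAC strings are hypothetical; the sketch assumes a full mesh of pseudowires, as the text does, and ignores aging of learned entries.

```python
class VplsInstance:
    """One PE's view of a single VPLS: CE ports, pseudowires, and a MAC table."""

    def __init__(self, ce_ports, pseudowires):
        self.ce_ports = set(ce_ports)
        self.pseudowires = set(pseudowires)
        self.mac_table = {}  # MAC address -> interface on which it was learned

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source MAC, then return the set of output interfaces."""
        self.mac_table[src_mac] = in_port
        out = self.mac_table.get(dst_mac)
        if out is not None:
            return set() if out == in_port else {out}
        # Unknown destination: flood, but never from the backbone back into it.
        if in_port in self.pseudowires:
            targets = set(self.ce_ports)
        else:
            targets = self.ce_ports | self.pseudowires
        return targets - {in_port}

vpls = VplsInstance(ce_ports={"ce1", "ce2"}, pseudowires={"pw1", "pw2"})
from_ce = vpls.receive("ce1", "aa", "bb")    # unknown MAC from a CE
reply = vpls.receive("pw1", "bb", "aa")      # "aa" was learned on ce1
from_core = vpls.receive("pw2", "cc", "dd")  # unknown MAC from the backbone
```

The first frame is flooded to the other CE and to every pseudowire, the reply is sent only to the learned interface, and the unknown frame from the backbone is flooded to CE interfaces only.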
There is also a VPLS variation called Hierarchical VPLS, which constrains the
potential explosion of mesh point-to-point LSPs needed among the PE routers.
In this variation, a PE can act as a spoke with a single pseudowire attached to a core of meshed PEs. A flooded packet received at a
mesh-connected PE from a spoke PE pseudowire is sent to all attached CEs and
pseudowires. In such a model, the PE interconnectivity must be guaranteed to be
loop-free, or a spanning tree protocol may be run among the PEs for that VPLS.


2.5 Network Restoration and Planning
The design of an IP backbone is driven by the traffic demands that need to be
supported, and network availability objectives. The network design tools model the
traffic carried over the backbone links not only in a normal “sunny day” scenario,
but also in the presence of network disruptions.
Many carriers offer Service Level Agreements (SLAs). SLAs will vary across
different types of services. For example, SLAs for private-line services are quite
different from those for packet services. SLAs also usually differ among different
types of packet services. The SLAs for general Internet, VPN, and IPTV services
will generally differ. A packet-based SLA might be expressed in terms of Quality
of Service (QoS) metrics. For example, the SLA for a premium IP service may cover
up to three QoS metrics: latency, jitter, and packet loss. An example of a packet-loss objective is
“averaged over time period Y, the customer will receive at least X% of his/her
packets transmitted.” Some of these packet services may be further differentiated by
offering different levels of service, also called Class of Service (CoS).
To provide its needed SLAs, an ISP establishes internal network objectives. Network availability is a key internal metric used to control packet loss.
Furthermore, network availability is also sometimes used as the key QoS metric for
private-line services. Network availability is often stated colloquially in “9s”. For
example, “four nines” of availability means the service is available at least 0.9999
of the time. Put another way, the service should not be down more than
0.0001 of the time (approximately 50 min per year). Given its prime importance,
we will concentrate on network availability in the remainder of this section.
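The arithmetic behind the "9s" shorthand can be checked with a few lines. The function name is hypothetical, and the sketch assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(nines):
    """Maximum yearly downtime allowed by an availability of `nines` nines."""
    unavailability = 10 ** (-nines)   # e.g., four nines -> 0.0001
    return unavailability * MINUTES_PER_YEAR

four_nines = downtime_minutes_per_year(4)   # about 52.6 minutes
five_nines = downtime_minutes_per_year(5)   # about 5.3 minutes
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why the step from four to five nines is so demanding operationally.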
The most important factors in designing and operating the IP backbone such that
it achieves its target network availability are modeling its potential network disruptions and modeling the response of the network to those disruptions. Network disruptions most
typically are caused by network failures and maintenance activities. Maintenance
activities include upgrading of equipment software, replacement of equipment, and
reconfiguration of network topologies or line cards. Because of the complex layering and segmentation of networks surrounding the IP backbone and because of the
variety and vintage of equipment that accumulates over the years, network planners,
architects, network operators, and engineers spend considerable effort to maintain
network availability. In this section, we will briefly describe the types of restoration
methods we find at the various network layers. Then, we will describe how network
disruptions affect the IP backbone, the types of restoration methods used to handle
them, and finally how the network is designed to meet the needed availability.
Table 2.3 summarizes typical restoration methods used in some of today’s
network core layers that are most relevant to the IP backbone. See [11] for descriptions of restoration methods used in other layers shown in Fig. 2.3. In the next
sections, we will describe the rows of this table. Note that the table is approximate
and does not apply universally to all telecommunication carriers.


Table 2.3 Example of core-segment restoration methods

Network layer   Restoration method(s) against network failures     Exemplary restoration
                that originate at that layer or lower layers       time scale
Fiber           No automatic rerouting                             Hours (manual)
DWDM            1) Manual                                          1) Hours (manual)
                2) 1+1 restoration (also called dedicated          2) 3–20 ms
                   protection)
SONET Ring      Bidirectional Line-Switched Rings (BLSR)           50–100 ms
IOS (DCS)       Distributed path-based mesh restoration            Sub-second to seconds
W-DCS           No automatic rerouting                             Hours
IP backbone     1) IGP reconfiguration                             1) 10–60 s
                2) MPLS Fast Reroute (FRR)                         2) 50–100 ms

2.5.1 Restoration in Non-IP Layers
2.5.1.1 Fiber Layer
As we described earlier, in most central offices today, optical interfaces on switching
or transport equipment connect to fiber patch panels. Some carriers have installed
an automated fiber patch panel, also called a Fiber Cross-Connect (FXC), which
has the ability for an operator to remotely control the cross-connects. Some
of the enabling technologies include physical crossbars using optical collimators and Micro-Electro-Mechanical Systems (MEMS). A good overview of these
technologies can be found in [12]. When disruptions occur to the fiber layer, most
commonly from construction activity, network operators can reroute around the
failed fiber by using a patch panel to cross-connect the equipment onto undamaged fibers. This may require coordination of cross-connects at intermediate central
offices to patch a path through alternate COs if an entire cable is damaged. Of
course, this is typically a slow manual process, as reflected in Table 2.3, and so
higher-layer restoration is usually utilized for disruptions to the fiber layer.

2.5.1.2 DWDM Layer
Some readers may be surprised to learn that carriers have deployed few (if any)
automatic restoration methods in their DWDM layers (in either the metro or the core
segment). The one type of restoration occasionally deployed is one-by-one (1:1)
or one-plus-one (1+1) tail-end protection switching, which switches at the endpoints of the DWDM layer connection. With 1+1 switching, the signal is duplicated
and transmitted across two (usually) diversely routed connections. The path of the
connection during the nonfailure state is usually called the working path (also called
the primary or service path); the path of the connection during the failure state is
called the restoration path (also called protection path or backup path). The receiver
consists of a simple detector and switch that detects failure of the signal on the
working path (more technically, detects performance errors such as average BER
threshold crossings) and switches to the restoration path upon alarm. Once adequate
signal performance is again achieved on the signal along the working path (including
a time-out threshold to avoid link “flapping”), it switches back to the working path.
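The receiver behavior described above, including the time-out that guards against flapping, can be sketched as a small state machine. The class name, the BER threshold, and the wait-to-restore value are hypothetical; real equipment uses its own alarm definitions and timers.

```python
class TailEndSelector:
    """1+1 receiver: selects the working or restoration copy of the signal."""

    def __init__(self, ber_threshold, wait_to_restore):
        self.ber_threshold = ber_threshold      # alarm level on the working path
        self.wait_to_restore = wait_to_restore  # seconds of good signal before reverting
        self.active = "working"
        self.good_since = None

    def update(self, now, working_ber):
        """Feed one BER measurement of the working path; return the selected path."""
        if working_ber > self.ber_threshold:
            self.active = "restoration"         # switch upon alarm
            self.good_since = None              # any new alarm restarts the timer
        elif self.active == "restoration":
            if self.good_since is None:
                self.good_since = now
            if now - self.good_since >= self.wait_to_restore:
                self.active = "working"         # revert only after a stable period
                self.good_since = None
        return self.active

sel = TailEndSelector(ber_threshold=1e-6, wait_to_restore=5.0)
states = [
    sel.update(0, 1e-9),   # healthy
    sel.update(1, 1e-3),   # failure detected
    sel.update(2, 1e-9),   # signal back, timer starts
    sel.update(4, 1e-3),   # flap: timer is cancelled
    sel.update(5, 1e-9),   # signal back, timer restarts
    sel.update(10, 1e-9),  # stable for 5 s: revert
]
```

The flap at time 4 does not cause a switch back and forth, because the selector stays on the restoration path until the working path has been clean for the whole wait-to-restore period.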
In 1:1 protection switching, there is no duplication of signal, and thus the restoration connection can be used for other transport in nonfailure states. The transmitted
signal is switched to the restoration path upon detection of failure of the service path
and/or notification from the far end.
Technically speaking, in ROADM or point-to-point DWDM systems, 1+1
or 1:1 protection switching is usually implemented electronically via the optical
transponders. Consequently, these methods can be implemented at other transport
layers, such as DCS, IOS, and SONET. The major advantage of 1+1 or 1:1 methods is that they can trigger in as little as 3–20 ms. However, because these methods
require restoration paths that are dedicated (one-for-one) for each working connection, the resulting restoration capacity cannot be shared among other working
connections for potential failures. Furthermore, the restoration paths are diversely
routed and are often much longer than their working paths. Consequently, 1+1 and
1:1 protection switching tend to be the costliest forms of restoration.

2.5.1.3 SONET Ring Layer
The two most common types of deployed SONET or SDH self-healing ring
technology are Unidirectional Path Switched Ring (UPSR-2F) and Bidirectional
Line-Switched Ring (BLSR-2F). The “2F” stands for “2-Fibers”. For simplicity, we
will limit our discussion to SONET rings, but there is a very direct analogy for SDH
rings. However, note that ADM-ADM ring links are sometimes transported over a
lower DWDM layer, thus forming a “connection” that is routed over channels of
DWDM systems, instead of direct fiber. Although there is no inherent topographical
orientation in a ring, many people conceptually visualize each node of a SONET
self-healing ring as an ADM with an east bidirectional OC-n interface (i.e., a transmit port and a receive port) and a west OC-n interface. Typically, n = 48 or 192.
An STS-k SONET-Layer connection enters at an add/drop port of an ADM, routes
around the ring on k STS-1 channels of the ADM–ADM links and exits the ring
at an add/drop port of another ADM. The UPSR is the simplest of the devices and
works similarly to the 1+1 tail-end switch described in Section 2.5.1.2, except
that each direction of transmission of a connection routes counterclockwise on the
“outer” fiber around the ring (west direction) and therefore an STS-k connection
uses the same k STS-1 channels on all links around the ring. At each add/drop
transmit port, the signal is duplicated in the opposite direction on the “inner” fiber.
The selector responds to a failure as described above.
The BLSR-2F partitions the bidirectional channels of its East and West high-speed links in half. The first half is used for the working (nonfailure) state, and
the second half is reserved for restoration. When a failure to a link occurs,
the surrounding ADMs loop back that portion of the connection paths onto the
restoration channels around the opposite direction of the ring. The UPSR has
very rapid restoration, but suffers from the dedicated-capacity limitation described in
Section 2.5.1.2; as a consequence, UPSRs are now confined mostly to the
metro network, in particular to the portion closest to the customer, often extending into the feeder network. Because BLSR signaling is used to advertise failures
among ADMs and real-time intermediate cross-connections have to be made, a
BLSR restores more slowly than a UPSR. However, the BLSR is capable of having
multiple connections share restoration channels over nonsimultaneous potential
network failures, and is thus almost always deployed in the middle of the metro
network or parts of the core network. Rings are described in more detail in [11].

2.5.1.4 IOS Layer
The equipment that typically makes up today’s IOS layer uses distributed control
to provision (set up) connections. Here, links of the IOS network (SONET bidirectional OC-n interfaces) are assigned routing weights. When a connection is
provisioned over the STS-1 channels of an IOS network, its source node (IOS) computes its working path (usually along a minimum-weight path) and also computes
its restoration path that is diversely routed from the working path. After the connection is set up along its working path, the restoration path is stored for future
use. The nodes communicate the state of the network connectivity via topology
update messages transmitted over the SONET overhead on the links between the
nodes. When a failure occurs, the nodes flood advertisement messages to all nodes
indicating the topology change. The source node for each affected connection then
instigates the restoration process for its failed connections by sending connection request messages along the links of the (precalculated) restoration path, seeking spare
STS-1 channels to reroute its connections. Various handshakes among the nodes of
the restoration paths are implemented to complete the rerouting of the connections.
Note that in contrast to the dedicated and ring methods, the restoration channels are
not prededicated to specific connections and, therefore, connections from a varied
set of source/destination pairs can potentially use them. Such a method is called
shared restoration because a given spare channel can be used by different connections across nonsimultaneous failures. Shared mesh restoration is generally more
capacity-efficient than SONET rings in mesh networks (i.e., networks with average
connectivity greater than 2).
We now delve a little more into IOS restoration to make a key point that will
become relevant to the IP backbone, as well. The example in Fig. 2.2 shows two
higher-layer connections routing over the same lower-layer link. In light of the discussion above about the restoration path being diverse from the working path in the
IOS layer, the astute reader may ask “diverse relative to what?” The answer is that,
in general, the path should be diverse all the way down through the DWDM and
Fiber Layers. This requires that the IOS links contain information about how they
share these lower-layer links. Often, this is accomplished via a mechanism called
“bundle groups”. That is, a bundle group is created for each lower-layer link, but
is expressed as a group of IOS links that share (i.e., route over) that link. Diverse
restoration paths can be discovered by avoiding IOS links that belong to the same
bundle group of a link on the working path. Of course, the equipment in the IOS Layer cannot “see” its lower layers, and consequently has no idea how to define
and create the bundle groups. Therefore, bundle groups are provisioned in the IOSs
using an Operations Support System (OSS) that contains a database describing the
mapping of IOS links to lower-layer networks. This particular example illustrates
the importance of understanding network layering; otherwise, we will not have a reliable
method to plan and engineer the network to meet the availability objective. This
point will be equally important to the IP backbone. A set of bundled links is also
referred to as a Shared Risk Link Group (SRLG) in the telecommunications industry,
since it refers to a group of links that are subject to a shared risk of disruption.
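A minimal sketch of bundle-group (SRLG) aware path selection follows. It computes a minimum-weight working path and then a restoration path that avoids every link sharing a bundle group with the working path. The topology, link names, and the group `fiber1` are invented for illustration, and this two-step approach is a simplification: it is not guaranteed to find a diverse pair in every topology where one exists.

```python
import heapq

def shortest_path(links, src, dst, banned=frozenset()):
    """Dijkstra over undirected links {name: (a, b, weight)}; returns link names or None."""
    adj = {}
    for name, (a, b, w) in links.items():
        if name in banned:
            continue
        adj.setdefault(a, []).append((b, w, name))
        adj.setdefault(b, []).append((a, w, name))
    heap = [(0, src, [])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for nxt, w, name in adj.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (cost + w, nxt, path + [name]))
    return None

def diverse_pair(links, bundle_groups, src, dst):
    """Working path plus a restoration path sharing no bundle group with it."""
    working = shortest_path(links, src, dst)
    if working is None:
        return None, None
    banned = set(working)
    for group in bundle_groups.values():
        if group & set(working):      # group shares a lower-layer link with the working path
            banned |= group
    return working, shortest_path(links, src, dst, banned)

# Hypothetical IOS-layer links; AB and AC ride the same lower-layer fiber.
links = {
    "AB": ("A", "B", 1), "AC": ("A", "C", 1), "CB": ("C", "B", 1),
    "AD": ("A", "D", 2), "DB": ("D", "B", 2),
}
bundle_groups = {"fiber1": {"AB", "AC"}}
working, restoration = diverse_pair(links, bundle_groups, "A", "B")
naive = shortest_path(links, "A", "B", banned={"AB"})
```

Without the bundle-group information, the cheaper route over AC and CB would be chosen as the restoration path even though a single fiber cut would take down both AB and AC.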

2.5.1.5 W-DCS Layer and Ethernet Layer
There are few restoration methods provided at the W-DCS layer itself. This is because most disruptions to a W-DCS link result from a disruption of (1) a W-DCS
line card or (2) a component in a lower layer over which the link routes. Disruptions of
type (1) are usually handled by providing 1:1 restorable intra-office links between
the W-DCS and TDM node (IOS or ADM). Disruptions of type (2) are restored
by the lower TDM layers. This only leaves failure or maintenance of the W-DCS
itself as an unrestorable network disruption. However, a W-DCS is much less sophisticated than a router and less subject to failure.
Restoration of Layer 2 VPNs in an IP/MPLS backbone is discussed in
Section 2.5.2. We note here that restoration in enterprise Ethernet networks is typically based on the Rapid Spanning Tree Protocol (RSTP). When enterprise Ethernet
VPNs are connected over the IP backbone (such as VPLS), an enterprise customer
who employs routing methods such as RSTP expects them to work in the extended network. Encapsulating the customer’s Ethernet frames inside pseudowires ensures
that the client’s RSTP control packets are transported transparently across the wide
area. For example, a client VPN may choose to restore local link disruptions by
routing across other central offices or even distant metros. Since all this appears as
one virtual network to the customer, such applications may be useful.

2.5.2 IP Backbone
There are two main restoration methods we describe for the IP layer: IGP reconfiguration and MPLS Fast Reroute (FRR).


2.5.2.1 OSPF Failure Detection and Reconvergence
In a formal sense, the IGP reconvergence process responds to topology changes.
Such topology changes are usually caused by four types of events:
1. Maintenance of an IP layer component
2. Maintenance of a lower-layer network component
3. Failure of an IP layer component (such as a router line card or common
component)
4. Failure of a lower-layer network component (such as a link)
When network operations staff perform planned maintenance on an IP layer link,
it is typical to raise the OSPF administrative weight of the link to ensure that all
traffic is diverted from the link (this is often referred to as “costing out” the link).
In the second case, most carriers have a maintenance procedure where organizations
that manage the lower-layer networks schedule their daily maintenance events and
inform the IP layer operations organization. The IP layer operations organization
responds by costing out all the affected links before the lower-layer maintenance
event is started.
In the first two cases (planned maintenance activity), the speed of the reconvergence process is usually not an issue. This is because the act of changing an IGP
routing weight on a link causes LSAs to be issued. During the process of updating
the link status and recomputation of the SPF tree, the affected links remain in service
(i.e., “up”). Therefore, once the IGP reconfiguration process has settled, the routers
can redirect packets to their new paths. While there may be a transient impact during the “costing out” period, in terms of transient loops and packet loss, the service
impact is kept to a minimum by using this costing out technique to remove a link
from the topology for performing maintenance.
In the last two cases (failures), once the affected links go down, packets may be
lost or delayed until the reconvergence process completes. Such a disruption may
be unacceptable to delay or loss-sensitive applications. This motivates us to examine
how to reduce the time required for OSPF to converge from unexpected outages.
This is the focus of the remainder of this section.
While most large IP backbones route over lower layers, such as DWDM, those layers do
not provide restoration. Layer 1 failure detection is a key component of the IP layer
restoration process, and the failure detection time is a key component of the overall failure recovery time in OSPF-based networks. However, lower-layer failure detection
mechanisms sometimes do not coordinate well with higher-layer mechanisms and
do not detect disruptions that originate in the IP layer control plane. As a result,
OSPF routers periodically exchange Hello messages to detect the loss of a link
adjacency with a neighbor.
If a router does not receive a Hello message from its neighbor within a
RouterDeadInterval, it assumes that the link to its neighbor has failed, or the
neighbor router itself is down, and generates a new LSA to reflect the changed topology. All such LSAs generated by the routers affected by the failure are flooded
throughout the network. This causes the routers in the network to redo the SPF
calculation and update the next-hop information in their respective forwarding
tables. Thus, the time required to recover from a failure consists of: (1) the failure detection time, (2) LSA flooding time, (3) the time to complete the new SPF
calculations and update the forwarding tables.
To avoid a false indication that an adjacency is down because of congestion-related loss of Hello messages, the RouterDeadInterval is usually set to be four
times the HelloInterval – the interval between successive Hello messages sent
by a router to its neighbor. With the RFC-suggested default values for these timers
(HelloInterval value of 10 s and RouterDeadInterval value of 40 s), the
failure detection time can take anywhere between 30 and 40 s. LSA flooding times
consist of propagation delay and additional pacing delays inserted by the router.
These pacing delays serve to rate-limit the frequency with which LSUpdate packets are sent on an interface. Once a router receives a new LSA, it schedules an SPF
calculation. Since the SPF calculation using Dijkstra’s algorithm (see e.g., [8]) constitutes a significant processing load, a router typically waits for additional LSAs to
arrive for a time interval corresponding to spfDelay (typically 5 s) before doing
the SPF calculation on a batch of LSAs. Moreover, routers place a limit on the frequency of SPF calculations (governed by a spfHoldTime, typically 10 s, between
successive SPF calculations), which can introduce further delays.
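The 30 to 40 s detection window follows from the timer arithmetic: the dead timer is restarted by each received Hello, so a failure is declared one RouterDeadInterval after the last Hello that arrived, and the failure itself may occur anywhere within the following HelloInterval. The sketch below reproduces that range; the function name is hypothetical, and the multiplier of four is the usual setting described above.

```python
def detection_time_range(hello_interval, dead_multiplier=4):
    """Return (shortest, longest) Hello-based failure detection time in seconds.

    Longest: the neighbor fails just after sending a Hello.
    Shortest: the neighbor fails just before its next Hello was due.
    """
    dead_interval = dead_multiplier * hello_interval
    return (dead_interval - hello_interval, dead_interval)

default_range = detection_time_range(10)     # RFC-suggested defaults
fast_range = detection_time_range(0.25)      # an aggressive setting
```

With the default timers the range is 30 to 40 s, and in general it is three to four times the HelloInterval.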
From the description above, it is clear that reducing the HelloInterval
can substantially reduce the Hello protocol’s failure detection time. However,
there is a limit to which the HelloInterval can be safely reduced. As the
HelloInterval becomes smaller, there is an increased chance that network
congestion will lead to loss of several consecutive Hello messages and thereby
cause a false alarm that an adjacency between routers is lost, even though the routers
and the link between them are functioning. The LSAs generated because of a false
alarm will lead to new SPF calculations by all the routers in the network. This
false alarm would soon be corrected by a successful Hello exchange between the
affected routers, which then causes a new set of LSAs to be generated and possibly
new path calculations by the routers in the network. Thus, false alarms cause an
unnecessary processing load on routers and sometimes lead to temporary changes
in the path taken by network traffic. If false alarms are frequent, routers have to
spend considerable time doing unnecessary LSA processing and SPF calculations,
which may significantly delay important tasks such as Hello processing, thereby
leading to more false alarms.
False alarms can also be generated if a Hello message gets queued behind a
burst of LSAs and thus cannot be processed in time. The possibility of such an event
increases with the reduction of the RouterDeadInterval. Large LSA bursts
can be caused by a number of factors such as simultaneous refresh of a large number of LSAs or several routers going down/coming up simultaneously. Choudhury
[5] studies this issue and observes that reducing the HelloInterval lowers the
threshold (in terms of number of LSAs) at which an LSA burst will lead to generation of false alarms. However, the probability of LSA bursts leading to false alarms
is shown to be quite low.


Since the loss and/or delayed processing of Hello messages can result in false
alarms, there have been proposals to give such packets prioritized treatment at the
router interface as well as in the CPU processing queue [5]. An additional option
is to consider the receipt of any OSPF packet (e.g., an LSA) from a neighbor as an
indication of the good health of the router’s adjacency with the neighbor. This provision can help avoid false loss of adjacency in the scenarios where Hello packets
get dropped because of congestion, caused by a large LSA burst, on the link between
two routers. Such mechanisms may help mitigate the false alarm problem significantly. However, it will take some time before these mechanisms are standardized
and widely deployed.
It is useful to make a realistic assessment regarding how small the
HelloInterval can be, to achieve faster detection and recovery from network
failures while limiting the occurrence of false alarms. We summarize below the key
results from [13]. This assessment was done via simulations on the network topologies of commercial ISPs using a detailed implementation of the OSPF protocol in
the NS2 simulator. The work models all the important OSPF protocol features as
well as various standard and vendor-introduced delays in the functioning of the
protocol. These are shown in Table 2.4.
Goyal [13] observes that with the current default settings of the OSPF parameters, the network takes several tens of seconds before recovering from a failure.
Since the main component in this delay is the time required to detect a failure using
the Hello protocol, Goyal [13] examines the impact of lower HelloInterval
values on failure detection and recovery times.
Table 2.5 shows typical results for failure detection and recovery times after a
router failure. As expected, the failure detection time is within the range of three
to four times the value of HelloInterval. Once a neighbor detects the router
failure, it generates a new LSA about 0.5 s after the failure detection. The new LSA
is flooded throughout the network and will lead to scheduling of an SPF calculation
5 s (spfDelay) after the LSA receipt. This is done to allow one SPF calculation
to take care of several new LSAs. Once the SPF calculation is done, the router
takes about 200 ms more to update the forwarding table. After including the LSA
propagation and pacing delays, one can expect the failure recovery to take place
about 6 s after the ‘earliest’ failure detection by a neighbor router.
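The roughly 6 s figure can be reproduced by adding up the delays named above. The LSA generation delay (0.5 s), spfDelay (5 s), and forwarding table update (0.2 s) come from the text, and the SPF calculation formula is the one reported in Table 2.4 for Cisco 3600 series routers. The flooding allowance and the topology size are assumptions chosen for illustration.

```python
def spf_calc_delay(nodes):
    """SPF calculation time in seconds, per the formula reported in Table 2.4."""
    return 0.00000247 * nodes ** 2 + 0.000978

def recovery_after_detection(nodes, flooding=0.1):
    """Approximate time from failure detection to updated forwarding tables.

    `flooding` lumps LSA propagation and pacing delays; 0.1 s is an assumed value.
    """
    lsa_generation = 0.5   # delay before the detecting router emits its LSA
    spf_delay = 5.0        # wait for more LSAs before running SPF
    route_install = 0.2    # forwarding table update after SPF
    return lsa_generation + flooding + spf_delay + spf_calc_delay(nodes) + route_install

budget = recovery_after_detection(nodes=100)   # about 5.8 s
```

For a topology of this size the SPF calculation itself contributes only tens of milliseconds; the fixed spfDelay dominates the budget.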
Notice that many entries in Table 2.5 show the recovery to take place much
sooner than 6 s after failure detection. This is partly an artifact of the simulation
because the failure detection times reported by the simulator are the “latest” ones
rather than the “earliest”. In one interesting case (seed 2, HelloInterval 0.75 s),
the failure recovery takes place about 2 s after the ‘latest’ failure detection. This happens because the SPF calculation scheduled by an earlier false alarm takes care of
the LSAs generated because of router failure. There are also many cases in which
failure recovery takes place more than 6 s after failure detection (notice entries
for HelloInterval 0.25 s, seeds 1 and 3). Failure recovery can be delayed because of several factors. The SPF calculation frequency of the routers is limited by
spfHoldTime (typically 10 s), which can delay the new SPF calculation in response to the router failure. The delay caused by spfDelay also contributes.


Table 2.4 Various delays affecting the operation of OSPF protocol

Standard configurable delays
  RxmtInterval          The time delay before an un-acked LSA is retransmitted.
                        Usually 5 s.
  HelloInterval         The time delay between successive Hello packets.
                        Usually 10 s.
  RouterDeadInterval    The time delay since the last Hello before a neighbor is
                        declared to be down. Usually four times the
                        HelloInterval.

Vendor-introduced configurable delays
  Pacing delay          The minimum delay enforced between two successive
                        Link-State Update packets sent down an interface.
                        Observed to be 33 ms. Not always configurable.
  spfDelay              The delay between the shortest path calculation and the first
                        topology change that triggered the calculation. Used to
                        avoid frequent shortest path calculations. Usually 5 s.
  spfHoldTime           The minimum delay between successive shortest path
                        calculations. Usually 10 s.

Standard fixed delays
  LSRefreshTime         The maximum time interval before an LSA needs to be
                        reflooded. Set to 30 min.
  MinLSInterval         The minimum time interval before an LSA can be
                        reflooded. Set to 5 s.
  MinLSArrival          The minimum time interval that should elapse before a new
                        instance of an LSA can be accepted. Set to 1 s.

Router-specific delays
  Route install delay   The delay between the shortest path calculation and update
                        of forwarding table. Observed to be 0.2 s.
  LSA generation delay  The delay before the generation of an LSA after all the
                        conditions for the LSA generation have been met.
                        Observed to be around 0.5 s.
  LSA processing delay  The time required to process an LSA including the time
                        required to process the Link-State Update packet before
                        forwarding the LSA to the OSPF process. Observed to
                        be less than 1 ms.
  SPF calculation delay The time required to do shortest path calculation. Observed
                        to be 0.00000247x^2 + 0.000978 s on Cisco 3600 series
                        routers; x being the number of nodes in the topology.

Finally, the routers with a low degree of connectivity may not get the LSAs in the
first try because of loss due to congestion. Such routers may have to wait for 5 s
(RxmtInterval) for the LSAs to be retransmitted.
The results in Table 2.5 show that a smaller value of HelloInterval speeds
up the failure detection but is not effective in reducing the failure recovery times
beyond a limit because of other delays like spfDelay, spfHoldTime, and
RxmtInterval. Failure recovery times improve as the HelloInterval is reduced to about 0.5 s. Beyond that, as a result of more false alarms, we find
that the recovery times actually go up. While it may be possible to further speed up
the failure recovery by reducing the values of these delays, eliminating such delays
altogether is not prudent. Eliminating spfDelay and spfHoldTime will potentially result
in additional SPF calculations in a router in response to a single failure (or false alarm) as the different LSAs generated because of the failure arrive
one after the other at the router. The resulting overload on the router CPUs may
have serious consequences for routing stability, especially when there are several
simultaneous changes in the network topology. Failure recovery below the range of
1–5 s is difficult with OSPF.

Table 2.5 Failure detection time (FDT) and failure recovery time (FRT) for a router failure
with different HelloInterval values

Hello          Seed 1            Seed 2            Seed 3
interval (s)   FDT (s)  FRT (s)  FDT (s)  FRT (s)  FDT (s)  FRT (s)
10             32.08    36.60    39.84    46.37    33.02    38.07
2               7.82    11.68     7.63    12.18     7.79    12.02
1               3.81     9.02     3.80     8.31     3.84    10.11
0.75            2.63     7.84     2.97     5.08     2.81     7.82
0.5             1.88     6.98     1.82     6.89     1.79     6.85
0.25            0.95    10.24     0.84     6.08     0.99    13.41

In summary, OSPF recovery time can be lowered by reducing the value of
HelloInterval. However, too small a value of HelloInterval will lead
to many false alarms in the network, which cause unnecessary routing changes
and may lead to routing instability. The optimal value for the HelloInterval
that will lead to fast failure recovery in the network, while keeping the false
alarm occurrence within acceptable limits for a network, is strongly influenced
by the expected congestion levels and the number of links in the topology. While
the HelloInterval can be much lower than the current default value of tens of
seconds, it is not advisable to reduce it to the millisecond range because of potential false alarms. Further, it is difficult to prescribe a single HelloInterval
value that will perform optimally in all cases. The network operator needs to set the
HelloInterval conservatively, taking into account both the expected congestion
and the number of links in the network topology.

2.5.2.2 MPLS Fast Reroute
MPLS Fast Reroute (FRR) was designed to improve restoration performance using
the additional protocol layer provided by MPLS LSPs [17]. Primary and alternate
(backup) LSPs are established. Fast rerouting over the alternate paths after a network disruption is achieved using preestablished router forwarding table entries.
Equipment suppliers have developed many flavors of FRR, some of which are not
totally compliant with standardized MPLS FRR. This section provides an overview
of the basic concept.
There are two basic varieties of backup path restoration in MPLS FRR, called
next-hop and next-next-hop. The next-hop approach identifies a unidirectional link
to be protected and a backup (or bypass) unidirectional LSP that routes around the
link if it fails. The protected link can be a router–router link adjacency or even
another layer of LSP tunnel itself. The backup LSP routes over alternate links. The
top graph in Fig. 2.26 illustrates a next-hop backup path for the potential failure of
a given link (designated with an “X”). For now, ignore the top path labeled “MPLS
secondary LSP tunnel”, which will be discussed later. With the next-next-hop approach, the primary entities to protect are two-link working paths. The backup path
is an alternate path over different links and routers than the protected entity. In general, a next-hop path is constructed to restore against individual link failures while
next-next-hop paths are constructed to restore against both individual link failures
and node failures. The trade-off is that next-hop paths are simpler to implement
because all flows routing over the link can be rerouted similarly, whereas next-next-hop requires more LSPs and routing combinations. This is illustrated in the lower
example of Fig. 2.26, wherein the first router along the path carries flows that terminate on different second hop routers, and therefore must create multiple backup
LSPs that originate at that node.

Fig. 2.26 Example of Fast Reroute backup paths
[Figure labels. Top graph: MPLS primary LSP tunnels, MPLS secondary LSP tunnel, MPLS next-hop backup path, PHY layer links, failed link marked “X”. Lower graph: MPLS next-next-hop backup paths, failed link marked “X”.]

We will briefly describe an implementation of the next-hop approach to FRR.
A primary end-to-end path is chosen by RSVP. This path is characterized by the
Forwarding Equivalence Class (FEC) discussed earlier and reflects packets that are
to be corouted and have similar CoS queuing treatment and ability to be restored
with FRR. Often, a mesh of fully connected end-to-end LSPs between the backbone
routers (BRs) is created.


As discussed in earlier sections, an LSP is identified in forwarding tables by
mappings of pairs of label and interface: (In-Label, In-Interface) → (Out-Label,
Out-Interface). An end-to-end LSP is provisioned (set up) by choosing and populating these entries at each intermediate router along the path by a protocol such
as RSVP-TE. For the source router of the LSP, the “In-Label” variable is equivalent
to the FEC. As a packet hops along routers, the labels are replaced according to
the mapping until it reaches the destination router, in which case, the MPLS shim
headers are popped and packets are placed on the final output port. With next-hop,
facility-based FRR, a backup (or bypass) LSP is set up for each link. For example,
consider a precalculated backup path to protect a link between routers A and B,
say (A-1, B-1), where A-1 is the transmit interface at router A, B-1 is the receive
interface at router B, and L-1 is the MPLS label for the path over this link. The
forwarding table entries are of the form (L-i, A-k) → (L-1, A-1) at router A and (L-1,
B-1) → (L-j, B-s) at router B. When this link fails, a Layer 1 alarm is generated
and forwarded to the router controller or line card at A and B. For packets arriving
at router A, mapping entries in the forwarding table with the Out-Interface = A-1
have another (outer) layer of label pushed on the MPLS stack to coincide with the
backup path. This action is preloaded into the forwarding table and triggered by the
alarm. Forwarding continues along the routers of this backup LSP by processing the
outer layer labels as with any MPLS packet. The backup path ends at router B and,
therefore, when the packets arrive at router B, their highest (exterior) layer label is
popped. Then, from the point of view of router B, after the outer label is popped,
the MPLS header is left with (In-Label, In-Interface) = (L-1, B-1) and therefore
the packets continue their journey beyond router B just as they would if link (A-1,
B-1) were up. In this way, all LSPs that route over the particular link are rerouted
(hence the term “facility based”). Various other specifications can be made to segregate the backup path to be pushed on given classes of LSPs, for example to provide
restoration for some IP CoSs rather than others.
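The label operations in this facility-based, next-hop example can be traced with a small simulation. This is a minimal sketch, not a router implementation: the Router class, the intermediate router C, the bypass labels BP-1 and BP-2, and the interfaces A-2, C-1, C-2, and B-2 are hypothetical names added for illustration. Following the description above, router B is modeled as popping the outer label and then looking up the inner label as if the packet had arrived on interface B-1.

```python
class Router:
    def __init__(self, name):
        self.name = name
        self.table = {}    # (in_label, in_interface) -> (out_label, out_interface)
        self.bypass = {}   # protected out_interface -> (bypass_label, bypass_out_interface)
        self.pop = {}      # (bypass_label, in_interface) -> protected in_interface
        self.down = set()  # local interfaces with an active Layer 1 alarm

    def forward(self, stack, in_if):
        """Process one packet; stack[-1] is the top label. Returns (stack, out_if)."""
        stack = list(stack)
        top = stack[-1]
        if (top, in_if) in self.pop:
            # Tail of the bypass LSP: pop the outer label and treat the packet
            # as if it had arrived on the protected interface.
            stack.pop()
            in_if = self.pop[(top, in_if)]
            top = stack[-1]
        out_label, out_if = self.table[(top, in_if)]
        stack[-1] = out_label                      # ordinary label swap
        if out_if in self.down:
            # Preloaded FRR action, triggered by the alarm: push the bypass label.
            bypass_label, out_if = self.bypass[out_if]
            stack.append(bypass_label)
        return stack, out_if


def build_example():
    a, b, c = Router("A"), Router("B"), Router("C")
    a.table[("L-i", "A-k")] = ("L-1", "A-1")       # primary entry at router A
    a.bypass["A-1"] = ("BP-1", "A-2")              # bypass LSP protecting link (A-1, B-1)
    c.table[("BP-1", "C-1")] = ("BP-2", "C-2")     # bypass transit at router C
    b.table[("L-1", "B-1")] = ("L-j", "B-s")       # primary entry at router B
    b.pop[("BP-2", "B-2")] = "B-1"                 # bypass LSP ends at router B
    links = {("A", "A-1"): ("B", "B-1"),
             ("A", "A-2"): ("C", "C-1"),
             ("C", "C-2"): ("B", "B-2")}
    return {"A": a, "B": b, "C": c}, links


def deliver(routers, links, node, stack, in_if):
    """Follow a packet hop by hop; return (routers visited, final stack, final out_if)."""
    visited = []
    while True:
        visited.append(node)
        stack, out_if = routers[node].forward(stack, in_if)
        if (node, out_if) not in links:
            return visited, stack, out_if
        node, in_if = links[(node, out_if)]
```

With the link up, a packet entering A with label L-i visits A and B and leaves B with label L-j on interface B-s. After adding "A-1" to the down set of router A, the same packet visits A, C, and B and leaves B identically, which is the behavior the text describes: router B forwards the packet as if link (A-1, B-1) were still up.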
Another common implementation of next-hop FRR defines 1-hop pseudowires
for each key link. Each pseudowire has a defined primary LSP and backup LSP
(a capability found in most routers). If the link fails, a similar alarm mechanism
causes the pseudowire to reroute over the backup LSP. When the primary LSP is
again declared up, the pseudowire switches back to the primary path. An advantage
of this method is that the pseudowire appears as a link to the IGP routing algorithm.
Weights can be used to control how packets route over it or the underlying Layer
1 link. Section 2.6 illustrates this method for an IPTV backbone network.
MPLS FRR has been demonstrated to work very rapidly (less than 100 ms) in
response to single-link (IP layer PHY link) failures by many vendors and carriers.
Most FRR implementations behave similarly during the small interval immediately
after the failure and before IGP reconvergence. However, implementations differ
in what happens after IGP reconvergence. We describe two main approaches in
the context of next-hop FRR here. In the first approach, the backup LSP stays in
place until the link goes back into service and IGP reconverges back to its nonfailure state. This is most common when a separate LSP or pseudowire is associated
with each link in next-hop FRR. In this case, the link-LSP is rerouted onto its backup
LSP and stays that way until the primary LSP is repaired.


In the second approach, FRR provides rapid restoration and then, after a short
settling period, the network recomputes its paths [4]. Here, each primary end-to-end LSP is recomputed during the first IGP reconfiguration process after the failure.
Since the IGP knows about the failed link(s), it reroutes the primary end-to-end
LSPs around them and the backup LSPs become moot. This is illustrated in the
three potential paths in the topmost diagram of Fig. 2.26. The IP flow routes along
the primary LSP during the nonfailure state. Then, the given link fails and the path
of the flow over the failed link deviates along the backup LSP, as shown by the lower
dashed line. After the first IGP reconfiguration process, the end-to-end LSP path is
recomputed, illustrated by the topmost dashed line.
When a failed component is repaired or a maintenance procedure is completed,
the disrupted links are put back into service. The process to return the network
to its nonfailure state is often called normalization. During the normalization process, LSAs are broadcast by the IGP and the forwarding tables are recalculated.
The normalization process is often controlled by an MPLS route mechanism/timer.
A similar procedure would occur for next-next hop.
The reason for the second approach is that while FRR enables rapid restoration,
because these paths are segmental “patches” to the primary paths, the alternate route
is often long and capacity-inefficient. With the first approach, IP flows continue routing over the backup paths until the repair is completed and alarms clear, which may
span hours or days. Another reason is that if multiple link failures occur, then some
of the backup FRR paths may fail; some response is needed to address this situation.
These limitations of the first approach were early key inhibitors to implementation
of FRR in large ISPs.
The key to implementing this second FRR strategy is that the switch from FRR
backup paths to new end-to-end paths is hitless (i.e., negligible packet loss), else
we may suffer three hits from each single failure (the failure itself, the process to
reroute the end-to-end paths immediately after the failure, and then the process to
revert to the original paths after repair). If the alternate end-to-end LSPs are set up in advance
and the forwarding table changes are implemented efficiently for most routers (often using pointers), this process is essentially hitless for most IP unicast (point-to-point)
applications. However, we note that today’s multicast does not typically enjoy hitless switchover to the new forwarding table because most multicast trees are
built via join and prune request messages issued backwards (upstream) from the
destination nodes. It is expected, though, that different implementations of multicast will fix this problem in the future. We discuss this again in Section 2.6 and refer
the reader to [36] for more discussion of hitless multicast.
For the network design phase of implementing next-hop FRR, each link
(say L) along the primary path needs a predefined backup path whose routing is
diverse in lower layers. That is, the paths of all lower-layer connections that support
the links of the backup path are disjoint from the path of the lower-layer connection for link L. The key is in predefining the backup tunnels. While next-next-hop
paths can also be used to restore against single-link failures, the network becomes
more complex to design if there is a high degree of lower-layer link overlap. More
generally, the major difficulty for the FRR approach is defining the backup LSPs so
that the service paths can be rerouted, given a predefined set of lower-layer failures.
Furthermore, when multiple lower-layer failures occur and MPLS backup paths fail,
FRR does not work and the network must revert to the slower primary path recalculation approach (the second approach described above).

2.5.3 Failures Across Multiple Layers
Now that the reader is armed with background on network layering and restoration
methods, we are poised to delve deeper into the factors and carrier decision variables
that shape the availability of the IP backbone.
Let us briefly revisit Fig. 2.9, which gives a simple example of the core ROADM
Layer Diagram. Consider a backbone router (BR) in central office B with a link to
one of the backbone routers in central office A. Furthermore, consider the remote
access router (RAR) that is homed to the backbone router in office A. However,
let us add a twist wherein the link between the RAR and BR routes over the IOS
layer instead of directly onto the ROADM (DWDM layer) as pictured in Fig. 2.9.
This can occur for RAR–BR links with lower bandwidth. This modification will
illustrate more of the potential failure modes. In particular, we have constructed this
simple example to illustrate several key points:
• Computing an estimate of the availability of the IP backbone involves analysis
of many network layers.
• Network disruptions can originate from many different sources within each layer.
• Some lower layers may provide restoration and others do not; how does this
affect the IP backbone?
Figure 2.27 gives examples of the types of individual component disruptions (“down
events”) that might cause links to fail in this network example, but still only shows a
few of the many disruptions that can originate at these layers. As one can see, this is
a four-layer example, and some of the layers are skipped. Note that for simplicity,
we illustrate point-to-point DWDM systems at the DWDM layer; however, the concepts apply equally well for ROADMs. Some readers perhaps may think that the
main source of network failures is fiber cuts and, therefore, the entire area of multilayer restoration can be reduced to analyzing fiber cuts. However, this oversimplifies
the problem. For example, an amplifier failure can often be as disruptive as a fiber
cable cut and will likely result in the failure of multiple IP layer links. Furthermore,
amplifier failures are more frequent. Let us examine the effect of some of the failures
illustrated in Fig. 2.27.
Fig. 2.27 Example of component disruptions (failure or maintenance activity) at multiple layers
[Figure: layers shown are the IP Layer, IOS Layer, ROADM/Point-to-point DWDM Layer, and Fiber Layer. Labeled disruption sources: router common component, router line card, IOS common component, IOS line card, intra-office fiber, DWDM common component or amplifier, OTs, and fiber cable. Routers (two BRs and an AR) are connected by OC-n links.]
BR = Backbone Router; AR = Access Router; ROADM = Reconfigurable Optical Add/Drop Multiplexer; OT = Optical Transponder; DWDM = (Dense) Wavelength Division Multiplexer; IOS = Intelligent Optical Switch

IOS interface failure: The IOS network has restoration capability, as described
in earlier sections. Consequently, the IOS layer reroutes its failed SONET STS-n
connection that supports the RAR–BR link onto its restoration path. In this case,
once the SONET alarms are detected by the two routers (the RAR and BR), they
take the link out of service and generate appropriate LSAs to the correct IGP
administrative areas or control domains to announce the topology change. Assuming that the IOS-layer restoration is successful, the AR–BR link comes back after a
short time (as specified in the IOS layer of Table 2.3) and the SONET alarm clears.
After, perhaps, an appropriate time-out on the routers to avoid link flapping, the link
is brought back up by the router and the topology change is announced via LSAs.
We note that in a typical AR/BR homing architecture, the LSAs from an AR–BR
link are only announced in subareas and so do not affect other ARs or BRs.
Fiber cut: In the core network, the probability of a fiber cut is roughly proportional to the length of the fiber. Fiber cuts are less frequent than many of the other failures but are highly
disruptive: usually many IP layer links fail simultaneously because of the concentration of capacity enabled by DWDM.
Optical Transponder: OT failure is the most common of the failures shown in
Fig. 2.27. However, a single OT failure only affects individual IP backbone links.
Some of the more significant problems with OT failures are (1) performance degradation, where bit errors occasionally trip BER threshold crossing alerts, and (2) a
nonnegligible probability of multiple failures in the network, in which an OT
fails while another major failure is in progress or vice versa.
DWDM terminal or amplifier: Amplifier failure is usually the most disruptive of
failures because of its impact (multiple wavelengths) and the sheer quantity of amplifiers, which are often
placed every 50–100 miles, depending on the vintage and bit rate of the wavelengths of the DWDM equipment. Failure of the DWDM terminal equipment not
associated with amplifiers and OTs is less probable because of the increased use of
passive (nonelectrical, unpowered) components. Note that in Fig. 2.27, for the OT,
fiber cut, and amplifier failures, the affected connections at their respective layers are
unrestored. Thus, the IP layer must reroute around its lost link capacity.
Intra-office fiber: These disruptions usually occur from maintenance, reconfiguration, and provisioning activity in the central office. This has been minimized
over the years due to the use of fiber patch panels; however, when significant network capacity expansion or reconfiguration occurs, especially for the deployment of
new technologies, architectures, or services, downtime from this class of failures
typically spikes. However, it is typical to lump the intra-office fiber disruptions into
the downtime for a linecard or port and model them as one unit.
Router: These network disruptions include failure of router line cards, failure of
router common equipment, and maintenance or upgrade of all or parts of the router.
Note that for these disruptions that originate at the IP layer, no lower-layer restoration method can help because rerouting the associated connections at the lower
layers will not bring the affected link back up. However, in the dual-homing AR–BR
architecture, all the ARs that home to the affected router can alternatively reroute
through the mate BR.
The method of rerouting the AR traffic to the surviving AR–BR links differs
per carrier. Usually, IGP reconfiguration is used. However, this can be unacceptably slow for some high-priority services, as evidenced by Table 2.3. Therefore,
other faster techniques are sometimes used, such as Ethernet link load balancing or
MPLS FRR.
We generalize some simple observations on multilayer restoration illustrated by
Fig. 2.27 and its subsequent discussion:
1. Because of the use of express links, a single network failure or disruption at a
lower layer usually results in multiple link failures at higher layers.
2. Failures that originate at an upper layer cannot be restored at a lower layer.
3. To meet most ISP network availability objectives, some form of restoration (even
if rudimentary) must be provided in upper layers.

2.5.4 IP Backbone Network Design
Network design is covered in more detail in Chapter 5. However, to tie together the
concepts of network layering, network failure modeling, and restoration, we provide
a brief description of IP network design here to illustrate its importance in meeting
network availability targets and to show how
these factors are accommodated in the network design. To illustrate this, we describe a very simplified network design (or network planning) process as follows.
This process would occur every planning period or whenever major changes to the
network occur:


1. Derive a traffic matrix.
2. Input the existing IP backbone topology and compute any needed changes. That
is, determine the homing of AR locations to the BR locations and determine
which BR pairs are allowed to have links placed between them.
3. Determine the routing of BR–BR links over the lower-layer networks (e.g.,
DWDM, IOS, fiber).
4. Route the traffic matrix over the topology and size the links. This results in an
estimate of network cost across all the needed layers.
5. Resize the links by finding their maximum needed capacity over all possible
events in the Failure Set, which models potential network disruptions (both
component failures and maintenance activity). This step simulates each failure
event, determining which IP layer link or nodes fail after lower-layer restoration,
if it exists, is applied and determining the capacity needed after traffic is rerouted
using IP layer restoration.
6. Re-optimize the topology by going back to step 2 and iterating with the objective
of lowering network cost.
Note in steps 2 and 3 that most carriers are reluctant to make large changes to the
existing IP backbone topology, since these can be very disruptive and costly events.
Therefore, steps 2 and 3 usually incur small topology changes from one planning
period to the next. We do not describe detailed algorithms for the
above here. Approaches to the above problem can be found in [22, 23].
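Steps 4 and 5 of the process can be sketched in a few lines. This is a simplified illustration and not the algorithms of [22, 23]: it assumes that traffic follows a single IGP-style shortest path, that each event in the Failure Set is given directly as the set of IP layer links that remain down after any lower-layer restoration, and that a demand with no surviving path is simply dropped. The function names and the three-node example are hypothetical.

```python
import heapq
from collections import defaultdict


def shortest_path(links, src, dst, failed=frozenset()):
    """Dijkstra over undirected weighted links, skipping the failed ones."""
    adj = defaultdict(list)
    for (u, v), weight in links.items():
        if (u, v) in failed:
            continue
        adj[u].append((v, weight))
        adj[v].append((u, weight))
    heap, seen = [(0, src, [src])], set()
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in adj[node]:
            if nxt not in seen:
                heapq.heappush(heap, (dist + weight, nxt, path + [nxt]))
    return None  # destination unreachable in this failure state


def link_loads(links, traffic, failed=frozenset()):
    """Route every demand in the traffic matrix and total the load per link."""
    loads = defaultdict(float)
    for (src, dst), demand in traffic.items():
        path = shortest_path(links, src, dst, failed)
        if path is None:
            continue  # demand is lost; a real design tool would flag this
        for u, v in zip(path, path[1:]):
            key = (u, v) if (u, v) in links else (v, u)
            loads[key] += demand
    return loads


def size_links(links, traffic, failure_set):
    """Step 4: size for the nonfailure state. Step 5: take the maximum over all events."""
    required = dict(link_loads(links, traffic))
    for failed in failure_set:
        for link, load in link_loads(links, traffic, failed).items():
            required[link] = max(required.get(link, 0.0), load)
    return required


# Hypothetical three-router example: link weights and demands in Gb/s.
links = {("A", "B"): 1, ("A", "C"): 1, ("B", "C"): 1}
traffic = {("A", "B"): 10.0, ("A", "C"): 5.0, ("B", "C"): 4.0}
failure_set = [frozenset({link}) for link in links]   # single-link events only
required = size_links(links, traffic, failure_set)
```

In this example the nonfailure loads are 10, 5, and 4 Gb/s, while the required capacities after considering every single-link failure rise to 15, 15, and 14 Gb/s, which shows how restoration rather than normal load can drive link sizing.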
The traffic matrix can come in a variety of forms, such as the peak 5-min average
loads between AR pairs or average loads. Unfortunately, many organizations
responsible for IP network design have little or no data about their current
or future traffic matrices. In fact, many engineers who manage IP networks expand their network by simply observing link loads. When a link load exceeds some
threshold, they add more capacity. Given no knowledge or high uncertainty of the
true, stochastic traffic matrix, this may be a reasonable approach. However, network
failures and their subsequent restorations are the phenomena that cause the greatest
challenges with such a simple approach. Because of the extensive rerouting that can
occur after a network failure, there is no simple or intuitive parameter to determine
the utilization threshold for each link. Traffic matrix estimation is discussed in detail
in Chapter 5.
A missing ingredient in the above network design algorithm is a description of how to model the needed network availability for an ISP to achieve its
SLAs. Theoretically, even if we assume the traffic matrix (present and/or future)
is completely accurate, to achieve the network design availability objective, all the
component failure modes and all the network layering must be modeled to design
the IP backbone. The decision variables are the layers where we provide restoration
(including what type of restoration should be used) and how much capacity should
be deployed at each layer to meet the QoS objectives for the IP layer. This is further
complicated by the fact that while network availability objectives for transport layers
are often expressed in worst-case or average-case connection uptimes, IP backbone
QoS objectives often use packet-loss metrics.


However, we can approximate the packet loss constraints in large IP layer
networks by establishing maximum link utilization targets. For example, through
separate analysis it might be determined that every flow can achieve the objective maximum packet loss target by not exceeding 90% utilization on any 40 Gb/s
link, with perhaps lower utilization maxima needed on lower-rate links. Then, one
can model when this utilization condition is met over the set of possible failures, including subsequent restoration procedures. By modeling the probabilities
of the failure set, one can compute a network availability metric appropriate for
packet networks. The probabilities of events in the failure set can be computed using Markov models and the Mean Time Between Failures (MTBF) and the Mean
Time to Repair (MTTR) of the component disruptions. These parameters are usually
obtained from a combination of equipment-supplier specifications, network observation/data, and carrier policies and procedures.
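As a small worked illustration of this calculation, the sketch below treats each component as an independent two-state (up/down) Markov process, for which the steady-state probability of being down is MTTR / (MTBF + MTTR) when MTBF is taken as the mean up time between failures. It then adds up the probabilities of the nonfailure state and of those single-component failure states in which the utilization target is still met. The independence assumption, the truncation to single failures, the assumption that the nonfailure state meets the target, and the component names are simplifications introduced here; because multiple-failure states are counted as misses, the result is only a lower bound.

```python
def unavailability(mtbf_hours, mttr_hours):
    """Steady-state probability that a two-state (up/down) component is down."""
    return mttr_hours / (mtbf_hours + mttr_hours)


def state_probabilities(components):
    """Return P(no component down) and P(exactly one given component down).

    components maps a name to (MTBF, MTTR) in hours; independence is assumed.
    """
    down = {name: unavailability(mtbf, mttr)
            for name, (mtbf, mttr) in components.items()}
    p_none = 1.0
    for q in down.values():
        p_none *= 1.0 - q
    p_single = {name: p_none * q / (1.0 - q) for name, q in down.items()}
    return p_none, p_single


def availability_lower_bound(components, tolerated):
    """Probability mass of the states known to meet the utilization target.

    tolerated is the set of components whose single failure, after restoration,
    still keeps every link within its utilization target.
    """
    p_none, p_single = state_probabilities(components)
    return p_none + sum(p_single[name] for name in tolerated)
```

The set passed as tolerated would come from a failure simulation of the kind used to size the links; this sketch only combines its outcome with the state probabilities.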
A major stumbling block with this theoretical approach is that the failure event
space is exponential in size. Even for very small networks and a few layers, it is
intractable to compute all potential failures, let alone the subsequent restoration
and network loss. An approach to probabilistic modeling to solve this problem is
presented in more detail in Chapter 4 and in [28].
Armed with this background, we conclude this section by revisiting the issue
of why we show the IP backbone routing over an unrestorable DWDM layer in
the network layering of Fig. 2.3. This at first may seem counterintuitive because it
is generally true that, per unit of capacity, the cost of links at lower layers is less
than that of higher layers. Some of the reasons for this planning decision, which
is consistent with most large ISPs, were hinted at in Section 2.5.3. We summarize
them here.
1. Backbone router disruptions (failures or maintenance events) originate within
the IP layer and cannot be restored at lower layers. Extra link capacity must be
provided at the IP layer for such disruptions. Once placed, this extra capacity can
then also be used for IP layer link failures that originate at lower layers. This
obviates most of the cost advantages of lower-layer restoration.
2. Under nonfailure conditions, there is spare capacity available in the IP layer to
handle uncertain demand. For example, restoration requirements aside, to handle
normal service demand, IP layer links could be engineered to run below 80% utilization during peak intervals of the traffic matrix and well below that at off-peak
intervals. If we allow higher utilization levels during network disruption events,
then this provides an existing extra buffer during those events. Furthermore, there
may be little appreciable loss during network disruptions during off-peak periods.
As QoS and CoS features are deployed in the IP backbone, there is yet another
advantage to IP layer restoration. Namely, the IP layer can assign different QoS
objectives to different service classes. For example, one such distinction might be to
plan network restoration so that premium services receive better performance than
best-effort services during network disruptions. In contrast, the DWDM layer cannot
make such fine-grain distinctions; it either restores or does not restore the entire IP
layer link, which carries a mixture of different classes of services.


2.6 IPTV Backbone Example
Some major carriers now offer nationwide digital television, high-speed Internet,
and Voice-over-IP services over an IP network. These services typically include
hundreds of digital television channels. Video content providers deliver their content to the service provider in digital format at select locations called super hub
offices (SHOs). This in turn requires that the carrier have the ability to deliver
high-bandwidth IP streaming to its residential customers on a nationwide basis. If
such content is delivered all the way to residential set-top boxes over IP, it is commonly called IPTV. There are two options for providing such an IPTV backbone.
The first option is to create a virtual network on top of the IP backbone. Since video
service consists mostly of streaming channels that are broadcast to all customers, IP
multicast is usually the most cost-effective protocol to transport the content. However, users have high expectations for video service and even small packet losses
negatively impact video quality. This requires the IP backbone to be able to transport multicast traffic at a very high level of network availability and efficiency. The
first option results in a mixture of best-effort traffic and traffic with very high quality
of service on the same IP backbone, which in turn requires comprehensive mechanisms for restoration and priority queuing.
Consequently, some carriers have followed the second option, wherein they create a separate overlay network on top of the lower-layer DWDM or TDM layers. In
reality, this is another (smaller) IP layer network, with specialized traffic, network
structure, and restoration mechanisms. We describe such an example in this section.
Because of the high QoS objectives needed for broadcast TV services, the reader
will find that this section builds on most of the previous material in this chapter.

2.6.1 Multicast-Based IPTV Distribution
Meeting the stringent QoS required to deliver a high-quality video service (such as
low latency and loss) requires careful consideration of the underlying IP-transport
network, network restoration, and video and packet recovery methods.
Figure 2.28 (borrowed from [9]) illustrates a simplified architecture for a network providing IPTV service. The SHO gathers content from the national video
content providers, such as TV networks (mostly via satellite today) and distributes
it to a large set of receiving locations, called video hub offices (VHOs). Each VHO in
turn feeds a metropolitan area. IP routers are used to transport the IPTV content in
the SHO and VHOs. The combination of SHO and VHO routers plus the links that
connect them comprise the IPTV backbone. The VHO combines the national feeds
with local content and other services and then distributes the content to each metro
area. The long-distance backbone network between the SHO and the VHO includes
a pair of redundant routers that are associated with each VHO. This allows for protection against router component failures, router hardware maintenance, or software
upgrades. IP multicast is used for delivery as it provides economic advantages for
the IPTV service to distribute video. With multicast, packets traverse each link at
most once.

Fig. 2.28 Example nationwide IPTV network
[Figure labels: SHO and VHOs joined by the edges of the multicast tree; dashed links used for restoration; router; metro intermediate office; video serving office; DSLAM; RG; set-top box; metro and access segments.]
S/VHO = Super/Video Hub Office; DSLAM = Digital Subscriber Loop Access Multiplexer; RG = Residential Gateway

The video content is encoded using an encoding standard such as H.264. Video
frames are packetized and are encapsulated in the Real-Time Transport Protocol
(RTP) and UDP. In this example, PIM-SSM is used to support IP multicast of the
video content. Each channel from the national live feed at the SHO is assigned
a unique multicast group. There are typically hundreds of channels assigned to
standard-definition (SD) (1.5 to 3 Mb/s) and high-definition (HD) (6 to 10 Mb/s)
video signals plus other multimedia signals, such as “picture-in-picture” channels
and music. So, the live feed can be multiple gigabits per second in aggregate
bandwidth.
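The aggregate figure follows from simple arithmetic, sketched below. The channel counts (300 SD and 200 HD) are hypothetical values chosen only to represent "hundreds of channels"; the per-channel rates are the ranges quoted above, and picture-in-picture and music channels are left out.

```python
def aggregate_mbps(sd_channels, hd_channels, sd_rate, hd_rate):
    """Total live-feed bandwidth in Mb/s for the given channel mix."""
    return sd_channels * sd_rate + hd_channels * hd_rate


# Hypothetical channel mix; rates in Mb/s from the ranges quoted in the text.
low = aggregate_mbps(300, 200, sd_rate=1.5, hd_rate=6.0)    # 1650.0 Mb/s
high = aggregate_mbps(300, 200, sd_rate=3.0, hd_rate=10.0)  # 2900.0 Mb/s
```

Under these assumptions the live feed falls between roughly 1.65 and 2.9 Gb/s, consistent with an aggregate of multiple gigabits per second.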

2.6.2 Restoration Mechanisms
The IPTV network can use various restoration methods to deliver the needed
video QoS to end-users. For example, it can recover from relatively infrequent and
short bursts of loss using a combination of video and packet recovery mechanisms
and protocols, including the Society of Motion Picture and Television Engineers
(SMPTE; www.smpte.org/standards) 2022-1 Forward Error Correction (FEC)
standard, retransmission approaches based on RTP/RTCP [33] and Reliable UDP
(R-UDP) [31], and video player loss-concealment algorithms in conjunction with
set-top box buffering. R-UDP supports retransmission-based packet-loss recovery.
In addition to protecting against video impairments due to last-mile (loop) transmission problems in the access segment, a combination of these methods can recover
from a network failure (e.g., fiber link or router line card) of 50 ms or less. Repairing
network failures usually takes far more than 50 ms (potentially several hours), but
when combined with link-based FRR, this restoration methodology could meet the
stringent requirements needed for video against single-link failures.
Figure 2.29 (borrowed from [9]) illustrates how we might implement link-based
FRR in an IPTV backbone by depicting a network segment with four node pairs
that have defined virtual links (or pseudowires). This method is the pseudowire,
next-hop FRR approach described in Section 2.5.2.2. For example, node pair E-C
has a lower-layer link (such as SONET OC-n or Gigabit Ethernet) in each direction
and a pseudowire in each direction (a total of four unidirectional logical links) used
for FRR restoration. The medium dashed line shows the FRR backup path for the
pseudowire E→C. Note that links such as E-A are for restoration and, hence, have
no pseudowires defined. Pseudowire E→C routes over a primary path that consists
of the single lower-layer link E→C (see the solid line in Fig. 2.29). If a failure occurs
on a lower-layer link in the primary path, such as C-E, then the router at node E
attempts to switch to the backup path using FRR. The path from the root to node A
will switch to the backup path at node E (E-A-B-C). Once it reaches node C, it will
continue on its previous (primary) path to node A (C-B-F-A). The entire path from
E to A during the failure is shown by the outside dotted line. Although the path
retraces itself between the routers B and C, the multicast traffic does not overlap
because of the links’ unidirectionality. Also, although the IGP view of the topology
realizes that the lower-layer links between E and C have gone “down,” because the
pseudowire E→C is still “up” and has the least weight, the shortest path tree
remains unchanged. Consequently, the multicast tree remains unchanged. The IGP
is unaware of the actual routing over the backup path. Note that these backup paths
are precomputed a priori by comprehensively analyzing all possible link failures.

[Figure: two views of a network segment with nodes A to F and the Root. One shows the IGP view of the multicast tree and the backup path for pseudowire E→C; the other shows the path of the flow from the Root to node A, with the failed link marked X. Legend: Layer 1 link (high weight); Layer 1 link (high weight, used for restoration only); pseudowire (low weight, sits on top of Layer 1 solid black link).]
Fig. 2.29 Fast Reroute in IPTV backbone

If we route the pseudowire FRR backup path on a lower-layer path that is diverse
from its primary path, if FRR operates rapidly (in around 50 ms), and if we set the
hold-down timers appropriately, the IGP will not detect the effect of any single fiber
or DWDM layer link failure. Therefore, the multicast tree will remain unaffected,
reducing the outage time of any single-link failure from tens of seconds to approximately 50 ms. This order of restoration time is needed to achieve the stringent IPTV
network availability objectives.

2.6.3 Avoiding Congestion from Traffic Overlap
A drawback of restoration using next-hop FRR is that since it reroutes traffic on
a link-by-link basis, it can suffer traffic overlap during link failures, thus requiring
more link capacity to meet the target availability. Links are deployed bidirectionally,
and traffic overlap means that the packets of the same multicast flows travel over the
same link (in the same direction) two or more times. If we avoid overlap, we can
run the links at higher utilization and thus design more cost-effective networks. This
requires that the multicast tree and backup paths be constructed so that traffic does
not overlap.
To illustrate traffic overlap, Fig. 2.30a shows a simple network topology with
node S as the source and nodes d1 to d8 as the destinations. Here, each router is
connected by a pair of directed links (in opposite directions). The two links of the
pair are assigned the same IGP weight and the multicast trees are derived from these
weights. Figure 2.30a shows two sets of link weights. Figure 2.30b shows the
multicast tree derived from the first set of weights. In this case, there exists a single-link failure that causes traffic overlap. For example, the dotted line shows the backup
route for link d1–d4. If link d1–d4 fails, then the rerouted traffic will overlap with
other traffic on links S–d2 and d2–d6, thereby resulting in congestion on those
links. Client routers downstream of d2 and d6 will see impairments as a result of
this congestion. It is desirable to avoid this congestion wherever possible by constructing a multicast tree such that the backup path for any single-link failure does
not overlap with any downstream link on the multicast tree. This is achieved by
choosing OSPF link weights suitably.
The tree derived from the second set of weights is shown in Fig. 2.30c. In this
case, the backup paths do not cause traffic overlap in response to any single-link
failure. The multicast tree link is now from d6 to d2. The backup path for link
d1–d4 is the same as in Fig. 2.30b. Observe that traffic on this backup path does not
travel in the same direction as any link of the multicast tree. An algorithm to define
FRR backup paths and IGP weights so that multicast traffic does not overlap after
any single failure can be found in [10].

[Figure: (a) Topology with source S and destinations d1 to d8; each link is labeled with a pair of weights (1,10 or 1,100). (b) Multicast tree with the first weights, with a failed link marked X. (c) Multicast tree with the second weights, with a failed link marked X.]
Fig. 2.30 Example of traffic overlap from single-link failure
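
The overlap condition can be stated as a simple check: a directed link is overlapped if it lies both on the multicast tree and on a backup path. The following minimal sketch uses the backup path d1-S-d2-d6-d5-d4 that the chapter gives for link d1–d4. The two trees are assumptions, chosen only to be consistent with the description (tree link d2→d6 under the first weights, d6→d2 under the second); they are not read from Fig. 2.30, and the function names are illustrative.

```python
def directed_links(path):
    """Turn a node sequence into its list of directed links."""
    return list(zip(path, path[1:]))


def overlapping_links(tree_links, backup_path):
    """Directed links that would carry the flow twice:
    once as a multicast tree link and once on the backup path."""
    tree = set(tree_links)
    return [link for link in directed_links(backup_path) if link in tree]


# Backup path for link d1-d4, as stated in the chapter.
BACKUP_D1_D4 = ["d1", "S", "d2", "d6", "d5", "d4"]

# Hypothetical multicast trees (directed parent -> child links).
TREE_FIRST_WEIGHTS = [
    ("S", "d1"), ("S", "d2"), ("S", "d3"), ("d1", "d4"),
    ("d2", "d6"), ("d3", "d8"), ("d4", "d5"), ("d6", "d7"),
]
TREE_SECOND_WEIGHTS = [
    ("S", "d1"), ("S", "d3"), ("d1", "d4"), ("d3", "d8"),
    ("d4", "d5"), ("d5", "d6"), ("d6", "d2"), ("d8", "d7"),
]

overlap_first = overlapping_links(TREE_FIRST_WEIGHTS, BACKUP_D1_D4)
overlap_second = overlapping_links(TREE_SECOND_WEIGHTS, BACKUP_D1_D4)
print("first weights:", overlap_first)
print("second weights:", overlap_second)
```

With the assumed first tree, the check reports links S→d2 and d2→d6, matching the congestion described for Fig. 2.30b; with the assumed second tree, the backup path only crosses tree links in the opposite direction, so nothing is reported.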

2.6.4 Combating Multiple Concurrent Failures
The algorithm and protocol in [10] help avoid traffic overlap on the multicast tree during single-link failures. However, multiple link failures can still cause
overlap. An example is shown in Fig. 2.31. Assume that links d1–d4 and d3–d8
are both down. If the backup path for edge d1–d4 is d1-S-d2-d6-d5-d4 (as shown in
Fig. 2.30b and in Fig. 2.31) and the backup path for edge d3–d8 is d3-S-d2-d6-d7-d8,
traffic on the two backup paths will overlap on edges S–d2 and d2–d6. There would
be significant traffic loss due to congestion if the links of the network are sized to
handle only a single stream of multicast traffic.
This situation arises because MPLS FRR operates at Layer 2 and therefore the IGP is unaware of the FRR backup paths. Furthermore, the FRR backup
paths are precalculated and there is no real-time (dynamic) accommodation for
different combinations of multiple-link failures. In reality, multiple (double and even
triple) failures can happen. When they occur, they can have a large impact on the
performance of the network.

[Figure: the topology with source S and destinations d1 to d8, with two failed links marked X.]
Fig. 2.31 Example of traffic overlap from multiple link failures
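
The double-failure example can be reproduced by counting how many backup paths use each directed link. This is a minimal sketch: the two backup paths are the ones stated in the text, while the capacity of one stream per link and the helper name are assumptions that mirror the "sized to handle only a single stream" condition.

```python
from collections import Counter

STREAMS_PER_LINK = 1  # assumed sizing: each link carries one multicast stream


def link_usage(paths):
    """Count how many of the given paths traverse each directed link."""
    usage = Counter()
    for path in paths:
        usage.update(zip(path, path[1:]))
    return usage


# Backup paths stated in the text for the two failed links.
backup_d1_d4 = ["d1", "S", "d2", "d6", "d5", "d4"]
backup_d3_d8 = ["d3", "S", "d2", "d6", "d7", "d8"]

usage = link_usage([backup_d1_d4, backup_d3_d8])
overloaded = sorted(link for link, n in usage.items() if n > STREAMS_PER_LINK)
print("links carrying more than one stream:", overloaded)
```

Both backup paths share S→d2 and d2→d6, so those two links carry two copies of the stream even before any traffic of the multicast tree itself is counted.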
Yuksel [36] describes an approach that builds on the FRR mechanism but limits
its use to a short period. When a single link fails and a pseudowire’s primary path
fails, the traffic is rapidly switched over to the backup path as described above.
However, soon afterwards, the router sets the virtual link weight to a high value and
thus triggers the IGP reconvergence process – this is colloquially called “costing
out” the link. Once IGP routing converges, a new PIM tree is rebuilt automatically.
This avoids long periods where routing occurs over the FRR backup paths, which
are unknown to the IGP. This ensures rapid restoration from single-link failures
while allowing the multicast tree to dynamically adapt to any additional failures
that might occur during a link outage. Only during the short, transient period between
the start of FRR and the completion of IGP reconvergence could another failure expose
the network to paths overlapping on the same link. The potential downside of this
approach is that it incurs two more network reconvergence processes: one right
after FRR has occurred and another when the failure is repaired.
If it is not carefully executed, this alternative approach can cause many new video
interruptions due to small “hits” after single failures.
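
The effect of "costing out" a link on the IGP shortest path tree can be illustrated with a small Dijkstra computation. This is a sketch under stated assumptions: the three-node topology (R, A, B), the weights 1, 10, and 1000, and the function name are all hypothetical. Pseudowires R–A and A–B carry a low weight, and the restoration-only Layer 1 link R–B carries a high weight.

```python
import heapq


def shortest_path_tree(weights, root):
    """Dijkstra over an undirected weighted graph; returns {node: parent}."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {root: 0}
    parent = {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent


# Hypothetical IGP view: two low-weight pseudowires, one high-weight Layer 1 link.
igp = {("R", "A"): 1, ("A", "B"): 1, ("R", "B"): 10}
tree_before = shortest_path_tree(igp, "R")

# During FRR the pseudowire A-B stays "up", so the IGP weights do not change.
tree_during_frr = shortest_path_tree(igp, "R")

# Costing out: the router raises the pseudowire weight to a high value.
costed_out = dict(igp)
costed_out[("A", "B")] = 1000
tree_after = shortest_path_tree(costed_out, "R")

print(tree_before, tree_during_frr, tree_after)
```

In this toy case node B keeps parent A while FRR is active and moves to parent R only after the weight change, which is the reconvergence step that lets the tree adapt to further failures.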
Yuksel [36] proposes a careful multicast recovery methodology to accomplish
this approach, yet avoid such drawbacks. A key component of the method is the
make-before-break change of the multicast tree – that is, the requirement to hitlessly
switch traffic from the old multicast tree to the new multicast tree. When the failure
is repaired, the method normalizes the multicast tree to its original shortest path tree
again in a hitless manner. The key modification to the multicast tree-building process
(pruning and joining nodes) is that the prune message to remove the branch to the
previous parent is not sent until the router receives PIM-SSM data packets from its
new parent for the corresponding (S,G) group. Another motivation for this modification is that current PIM-SSM multicast does not have an explicit acknowledgement to a join request. It is only through the receipt of a data packet on that
interface that the node knows that the join request was successfully received and
processed at the upstream node. The soft-state approach of IP Multicast (refresh the
state by periodically sending join requests) is also used to ensure consistency. This
principle is used to guide the tree reconfiguration process at a node in reaction to a
failure. In this way, routers do not lose data packets during the switchover period.
Of course, this primarily works in the PIM-SSM case, where there is a single source.
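
The make-before-break rule described above can be sketched as a small piece of per-(S,G) state in a router. The class, method names, and message log below are hypothetical and do not correspond to any PIM implementation; the sketch only captures the ordering constraint that the prune to the old parent waits for the first data packet from the new parent.

```python
class MulticastRouter:
    """Hypothetical per-(S,G) state for make-before-break parent switching."""

    def __init__(self, parent):
        self.parent = parent          # current upstream neighbor
        self.pending_parent = None    # joined, but not yet confirmed by data
        self.sent = []                # log of (message, neighbor), for illustration

    def on_new_upstream(self, new_parent):
        """IGP reconvergence selected a different upstream neighbor."""
        if new_parent == self.parent:
            return
        self.pending_parent = new_parent
        self.sent.append(("JOIN", new_parent))  # join first, do not prune yet

    def on_data(self, from_neighbor):
        """A data packet for the (S,G) group arrived from a neighbor."""
        if self.pending_parent is not None and from_neighbor == self.pending_parent:
            # Data from the new parent is the only confirmation that the join
            # was processed upstream, so the old branch can now be pruned.
            self.sent.append(("PRUNE", self.parent))
            self.parent = self.pending_parent
            self.pending_parent = None


router = MulticastRouter(parent="B")
router.on_new_upstream("C")   # tree change: join toward C, keep receiving from B
router.on_data("B")           # still fed by the old parent, nothing is pruned
router.on_data("C")           # first packet from C triggers the prune toward B
print(router.sent, router.parent)
```

Because the old branch stays in place until data arrives on the new one, there is no interval in which the router has no upstream feed.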
As we can observe from the description above, building an IPTV backbone with
high network availability builds on most of the protocols, multilayer failure models,
and restoration machinery we have described in the previous sections of the chapter.
In particular, given the underlying probabilities of network failures plus these complex failure and restoration mechanisms, such an approach must include a network
design methodology to evaluate and estimate the theoretical network availability of
the IPTV backbone. Without such a methodology, a carrier would run the
risk of having its video customers dissatisfied with their video service because of
inadequate network availability.

2.7 Summary
This chapter presented an overview of the layered network design that is typical in a
large ISP backbone. We emphasized three aspects that influence the design of an IP
backbone. The first aspect is that the IP network design is strongly influenced by its
relationship with the underlying network layers (such as DWDM and TDM layers)
and the network segments (core, metro, and access). ISP networks use a hierarchy
of specialized routers, generally called access and backbone routers. At the edge of
the network, the location of access routers, and the types of interfaces that they need
to support are strongly influenced by the way the customers connect to the backbone through the metro network. In the core of a large carrier network, backbone
routers are interconnected using DWDM transmission technology. As IP traffic is
the dominant source of demand for the DWDM layer, the backbone demands drive
requirements for the DWDM layer. The need for multiple DWDM links has driven
the evolution of aggregate links in the core.
The second aspect is that ISP networks have evolved from traditional IP forwarding to support MPLS. The separation of routing and forwarding and the ability to
support a routing hierarchy allow ISPs to support new functionality including Layer
2 and Layer 3 VPNs and flexible traffic engineering that could not be as easily supported in a traditional IP network.
Finally, this chapter provided an overview of the issues that affect IP network reliability, including the impact of network disruptions at multiple network layers and,
conversely, how different network layers respond to disruptions through network
restoration. We described how failures and maintenance events originate at various
network layers and how they impact the IP backbone. We presented an overview
of the performance of OSPF failure recovery to motivate the need for MPLS Fast
Reroute. We summarized the interplay between network restoration and the network
design process.
To tie these concepts together, we presented a “case study” of an IPTV backbone.
An IPTV network can be thought of as an IP layer with a requirement for very
high performance, that is, high network availability and low packet loss. This
requires the interlacing of multiple protocols, such as R-UDP, MPLS Fast Reroute,
IP Multicast, and Forward Error Correction. We described how lower-layer failures
(including multiple failures) affect the IP layer and how these IP layer routing and
control protocols respond. Understanding the performance of network restoration
protocols and the overall availability of the given network design requires careful
modeling of the types and likelihood of network failures, as well as the behavior
of the restoration protocols. This chapter endeavored to lay a good foundation for
reading the remaining chapters of this book.
We conclude by alerting the reader to an important observation about IP network
design. Telecommunications and its technologies undergo constant change. Therefore, this chapter describes a point in time. The contents of this chapter are different
from what they would have been 5 years ago. There will be further changes over
the next 5 years and, consequently, the chapter written 5 years from now may look
quite different.

References
1. AT&T (2003). Managed Internet Service Access Redundancy Options, from http://www.pnetcom.com/AB-0027.pdf. Accessed 15 April 2009.
2. Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., & Swallow, G. (2001). RSVP-TE:
Extensions to RSVP for LSP Tunnels. IETF RFC 3209, Dec. http://tools.ietf.org/html/rfc3209.
Accessed 29 January 2010.
3. Braden, R., Zhang, L., Berson, S., Herzog, S., & Jamin, S. (1997). Resource ReSerVation Protocol (RSVP) – Version 1 Functional Specification. IETF RFC 2205, Sept.
http://tools.ietf.org/html/rfc2205. Accessed 29 January 2010.
4. Chiu, A., Choudhury, G., Doverspike, R., & Li, G. (2007). Restoration design in IP over reconfigurable all-optical networks. NPC 2007, Dalian, P.R. China, September 2007.
5. Choudhury, G. (Ed.) (2005). Prioritized Treatment of Specific OSPF Version 2 Packets and
Congestion Avoidance. IETF RFC 4222, Oct.
6. Ciena Core Director. http://www.ciena.com/products/products coredirector product overview.
htm. Accessed 13 April 2009.
7. Cisco (1999). Tag Switching in Internetworking Technology Handbook, Chapter 23, http://www.cisco.com/en/US/docs/internetworking/technology/handbook/Tag-Switching.pdf, accessed 12/26/09.
8. Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms,
second edition (pp. 595–601). Cambridge: MIT Press, New York: McGraw-Hill. ISBN 0–262–03293–7. Section 24.3: Dijkstra’s algorithm.
9. Doverspike R., Li, G., Oikonomou, K. N., Ramakrishnan, K. K., Sinha, R. K., Wang, D., et al.
(2009). Designing a reliable IPTV network. IEEE Internet Computing Magazine May/June,
pp. 15–22.
10. Doverspike, R., Li, G., Oikonomou, K., Ramakrishnan, K. K., & Wang, D. (2007). IP backbone design for multimedia distribution: architecture and performance. INFOCOM-2007,
Anchorage Alaska April 2007.
11. Doverspike, R., & Magill, P. (2008). Commercial optical networks, overlay networks and
services. In I. Kaminow, T. Li, & A. Willner, (Eds), Chapter 13 in Optical fiber telecommunications VB. San Diego, CA: Academic.
12. Feuer, M., Kilper, D., & Woodward, S. (2008). ROADMs and their system applications. In
I. Kaminow, T. Li, & A. Willner, (Eds), Chapter 8 in Optical fiber telecommunications VB.
San Diego, CA: Academic.
13. Goyal, M., Ramakrishnan, K. K., & Feng, W. (2003). Achieving faster failure detection in
OSPF networks. IEEE International Conference on Communications (ICC 2003), Alaska,
May 2003.
14. IEEE 802.1Q-2005 (2005) Virtual Bridged Local Area Networks; ISBN 0–7381–3662-X.
15. IEEE: 802.1Qay – Provider Backbone Bridge Traffic Engineering. http://www.ieee802.org/1/pages/802.1ay.html. Accessed 7 October 2008.
16. IETF PWE3: Pseudo Wire Emulation Edge to Edge (PWE3) Working Group. http://www.ietf.org/html.charters/pwe3-charter.html. Accessed 7 Nov 2008.
17. IETF RFC 4090 (2005) Fast Reroute Extensions to RSVP-TE for LSP Tunnels. http://www.ietf.org/rfc/rfc4090.txt. May 2005. Accessed 7 Nov 2008.
18. ITU-T G.709, “Interfaces for the Optical Transport Network,” March 2003.
19. ITU-T G.7713.2. Distributed Call and Connection Management: Signalling mechanism using
GMPLS RSVP-TE.
20. Kalmanek, C. (2002). A Retrospective View of ATM. ACM Sigcomm CCR, Vol. 32, Issue 5,
Nov, ISSN: 0146–4833.
21. Katz, D., Kompella, K., & Yeung, D. (2003). IETF RFC 3630: Traffic Engineering (TE) Extensions to OSPF Version 2. http://tools.ietf.org/html/rfc3630. Accessed 4 May 2009.
22. Klincewicz, J. G. (2005). Issues in link topology design for IP networks. SPIE Conference on
performance, quality of service and control of next-generation communication networks III,
SPIE Vol. 6011, Boston, MA.
23. Klincewicz, J. G. (2006). Why is IP network design so difficult? Eighth INFORMS telecommunications conference, Dallas, TX, March 30–April 1, 2006.
24. Kompella, K., & Rekhter, Y. (2007). IETF RFC 4761: Virtual private LAN service (VPLS)
using BGP for auto-discovery and signaling. http://tools.ietf.org/html/rfc4761, accessed
12/26/09.
25. Lasserre, M., & Kompella, V. (2007). IETF RFC 4762: Virtual private LAN service (VPLS)
using label distribution protocol (LDP) signaling. http://tools.ietf.org/html/rfc4762, accessed
12/26/09.
26. Moy, J. (1998). IETF RFC 2328: OSPF Version 2. http://tools.ietf.org/html/rfc2328, accessed
12/26/09.
27. Nortel. (2007). Adding scale, QoS and operational simplicity to Ethernet. http://www.nortel.com/solutions/collateral/nn115500.pdf, accessed 12/26/09.
28. Oikonomou, K., Sinha, R., & Doverspike, R. (2009). Multi-Layer Network Performance and
Reliability Analysis. The International Journal of Interdisciplinary Telecommunications and
Networking (IJITN), Vol. 1 (3), pp. 1–29, Sept.
29. Optical Internetworking Forum (OIF) (2008). OIF-UNI-02.0-Common–User Network Interface (UNI) 2.0 Signaling Specification: Common Part. http://www.oiforum.com/public/documents/OIF-UNI-02.0-Common.pdf.
30. Oran, D. (1990). IETF RFC 1142: OSI IS-IS intra-domain routing protocol. http://tools.ietf.org/html/rfc1142.
31. Partridge, C., & Hinden, R. (1990). Version 2 of the Reliable Data Protocol (RDP), IETF RFC
1151. April.
32. Perlman, R. (1999). Interconnections: Bridges, Routers, Switches, and Internetworking Protocols, 2e. Addison-Wesley Professional Computing Series.
33. Schulzrinne, H., Casner, S., Frederick, R., & Jacobson, V. (2003). RTP: A Transport Protocol for Real-Time Applications, IETF RFC 3550. http://www.ietf.org/rfc/rfc3550.txt, accessed
12/26/09.
34. Sycamore Intelligent Optical Switch. (2009). http://www.sycamorenet.com/products/sn16000.asp. Accessed 13 April 2009.
35. Telcordia GR-253-CORE (2000) Synchronous Optical Network (SONET) Transport Systems:
Common Generic Criteria.
36. Yuksel, M., Ramakrishnan, K. K., & Doverspike, R. (2008). Cross-layer failure restoration for
a robust IPTV service. LANMAN-2008, Cluj-Napoca, Romania September.
37. Zimmermann, H. (1980). OSI reference model – the ISO model of architecture for open
systems interconnection. IEEE Transactions on Communications, 28(Suppl. 4), 425–432.

Glossary of Acronyms and Key Terms

1:1: One-by-one (signal switched to restoration path on detection of failure)
1+1: One-plus-one (signal duplicated across both service path and restoration path; receiver chooses surviving signal upon detection of failure)
Access Network Segment: The feeder network and loop segments associated with a given metro segment
ADM: Add/Drop Multiplexer
Administrative Domain: Routing area in IGP
Aggregate Link: Bundles multiple physical links between a pair of routers into a single virtual link from the point of view of the routers. Also called bundled or composite link
AR: Access Router
AS: Autonomous System
ASBR: Autonomous System Border Router
ATM: Asynchronous Transfer Mode
AWG: Arrayed Waveguide Grating
B-DCS: Broadband Digital Cross-connect System (cross-connects at DS-3 or higher rate)
Backhaul: Using TDM connections that encapsulate packets to connect customers to packet networks
BER: Bit Error Rate
BGP: Border Gateway Protocol
BLSR: Bidirectional Line-Switched Ring
BR: Backbone Router
Bundled Link: See Aggregate Link
CE switch: Customer-Edge switch
Channelized: A TDM link/connection that multiplexes lower-rate signals into its time slots
CHOC Card: CHannelized OC-n card
CIR: Committed Information Rate
CO: Central Office
Composite Link: See Aggregate Link
Core Network Segment: Equipment in the POPs and network structures that connect them for intermetro transport and switching
CoS: Class of Service
CPE: Customer Premises Equipment
CSPF: Constraint-based Shortest Path First
DCS: Digital Cross-connect System
DDoS: Distributed Denial of Service (security attack on router)
DoS: Denial of Service (security attack on router)
DS-0: Digital Signal – level 0 (a pre-SONET signal carrying one voice-frequency channel at 64 kb/s)
DS-1: Digital Signal – level 1 (a 1.544 Mb/s signal). A channelized DS-1 carries 24 DS0s
DS-3: Digital Signal – level 3 (a 44.736 Mb/s signal). A channelized DS-3 carries 28 DS1s
DWDM: Dense Wavelength-Division Multiplexing
E-1: European plesiochronous (pre-SDH) rate of 2.0 Mb/s
eBGP: External Border Gateway Protocol
EGP: Exterior Gateway Protocol
EIGRP: Enhanced Interior Gateway Routing Protocol
EIR: Excess Information Rate
EPL: Ethernet Private Line
FCC: Federal Communications Commission
FE: Fast Ethernet (100 Mb/s)
FEC: Forward Error Correction – bit-error recovery technique in TDM transmission and some IPs
FEC: Forwarding Equivalence Class – classification of flows defined in MPLS
Feeder Network: The portion of the access network between the loop and first metro central office
FRR: Fast Re-Route
FXC: Fiber Cross-Connect
Gb/s: Gigabits per second (1 billion bits per second)
GigE: Gigabit Ethernet (nominally 1 Gb/s)
GMPLS: Generalized MPLS
HD: High definition (short for HDTV)
HDTV: High-definition TV (television with resolution exceeding 720 × 1280)
Hitless: Method of changing network connections or routes that incurs negligible loss
iBGP: Internal Border Gateway Protocol
IETF: Internet Engineering Task Force
IGP: Interior Gateway Protocol
Internet Route Free Core: Where MPLS removes external BGP information plus Layer 3 address lookup from the interior of the IP backbone
IGMP: Internet Group Management Protocol
Inter-office Links: Links whose endpoints are contained in different central offices
Intra-office Links: Links that are totally contained within the same central office
IOS: Intelligent Optical Switch
IP: Internet Protocol
IPTV: Internet Protocol television (i.e., entertainment-quality video delivered over IP)
IROU: Indefeasible Right of Use
IS-IS: Intermediate-System-to-Intermediate-System (IP routing and control plane protocol)
ISO: International Organization for Standardization (not an acronym)
ISP: Internet Service Provider
ITU: International Telecommunication Union
Kb/s: Kilobits per second (1,000 bits per second)
LAN: Local Area Network
LATA: Local Access and Transport Area
Layer n: A colloquial packet protocol layering model, with origins in the OSI reference model. Today, roughly Layer 3 corresponds to IP packets, Layer 2 to MPLS LSPs, pseudowires, or Ethernet-based VLANs, and Layer 1 to all lower-layer transport protocols
LDP: Label Distribution Protocol
LMP: Link Management Protocol
Local Loop: The portion of the access segment between the customer and feeder network. Also called “last mile”
LSA: Link-State Advertisement
LSDB: Link-State Database
LSP: Label Switched Path
LSR: Label Switch Router
MAC: Media Access Control
MAN: Metropolitan Area Network
Mb/s: Megabits per second (1 million bits per second)
MEMS: Micro-Electro-Mechanical Systems
Metro Network Segment: The network layers of the equipment located in the central offices of a given metropolitan area
MPEG: Moving Picture Experts Group
MPLS: Multiprotocol Label Switching
MSO: Multiple System Operator (typically coaxial cable companies)
MSP: Multi-Service Platform – A type of ADM enhanced with many forms of interfaces
MTBF: Mean Time Between Failure
MTSO: Mobile Telephone Switching Office
MTTR: Mean Time to Repair
Multicast: Point-to-multipoint flows in packet networks
N-DCS: Narrowband Digital Cross-connect System (cross-connects at DS0 rate)
n-degree ROADM: A ROADM that can fiber to more than three different ROADMs (also called multidegree ROADM)
Next-hop: Method in MPLS FRR that routes around a down link
Next-next-hop: Method in MPLS FRR that routes around a down node
Normalization: Step in network restoration after all failures are repaired to bring the network back to its normal state
NTE: Network Terminating Equipment
OC-n: Optical Carrier – level n (designation of optical transport of a SONET STS-n)
ODU: Optical channel Data Unit – protocol data unit in ITU OTN
O-E-O: Optical-to-Electrical-to-Optical
OIF: Optical Internetworking Forum
OL: Optical Layer
OSPF: Open Shortest Path First
OSPF-TE: Open Shortest Path First – Traffic Engineering
OSS: Operations Support System
OT: Optical Transponder
OTN: Optical Transport Network – ITU optical protocol
P Router: Provider Router
PBB-TE: Provider Backbone Bridge – Traffic Engineering
PBT: Provider Backbone Transport
PE Router: Provider-Edge Router
PIM: Protocol-Independent Multicast
PL: Private Line
P-NNI: Private Network-to-Network Interface (ATM routing protocol)
POP: Point Of Presence
PPP: Point-to-Point Protocol
PPPoE: Point-to-Point Protocol over Ethernet
Pseudowire: A virtual connection defined in the IETF PWE3 that encapsulates higher-layer protocols
PVC: Permanent Virtual Circuit
PWE3: Pseudo-Wire Emulation Edge-to-Edge
QoS: Quality of Service
RAR: Remote Access Router
RD: Route Distinguisher
Reconvergence: IGP process to update network topology and adjust routing tables
RIB: Routing Information Base
ROADM: Reconfigurable Optical Add/Drop Multiplexer
RR: Route Reflector
RSTP: Rapid Spanning Tree Protocol
RSVP: Resource Reservation Protocol
RT: Route Target (also Remote Terminal in metro TDM networks)
RTP: Real-Time Transport Protocol
SD: Standard Definition (television with resolution of about 640 × 480)
SDH: Synchronous Digital Hierarchy (a synchronous optical networking standard used outside North America, documented by the ITU in G.707 and G.708)
Serving CO: The first metro central office to which a given customer homes
SHO: Super Hub Office
SLA: Service Level Agreement
SRLG: Shared Risk Link Group
SONET: Synchronous Optical Network (a synchronous optical networking standard used in North America, documented in GR-253-CORE from Telcordia)
SONET/SDH self-healing rings: Typically UPSR or BLSR rings
SPF: Shortest Path First
STS-n: Synchronous Transport Signal – level n (a signal level of the SONET hierarchy with a data rate of n × 51.84 Mb/s)
SVC: Switched Virtual Circuit
TCP: Transmission Control Protocol
TDM: Time Division Multiplexing
UDP: User Datagram Protocol
UNI: User-Network Interface
Unicast: Point-to-point flows in packet networks
UPSR: Unidirectional Path-Switched Ring
VHO: Video Hub Office
VLAN: Virtual Local Area Network
VoD: Video on Demand
VoIP: Voice-over-Internet Protocol
VPLS: Virtual Private LAN Service (i.e., Transparent LAN Service)
VPN: Virtual Private Network
WAN: Wide Area Network
Wavelength continuity: A restriction in DWDM equipment that a through connection must be optically cross-connected to the same wavelength on both fibers
W-DCS: Wideband Digital Cross-connect System (cross-connects at DS-1, SONET VT-n or higher rate)
WDM: Wavelength-Division Multiplexing

Part II

Reliability Modeling and Network Planning

Chapter 3

Reliability Metrics for Routers in IP Networks
Yaakov Kogan

3.1 Introduction
As the Internet has become an increasingly critical communication infrastructure for
business, education, and society in general, the need to understand and systematically analyze its reliability has become more important. Internet Service Providers
(ISPs) face the challenge of needing to continuously upgrade the network and grow
network capacity, while providing a service that meets stringent customer-reliability
expectations. While telecommunication companies have long experience providing
reliable telephone service, the challenge for an ISP is more difficult because changes
in Internet technology, particularly router software, are significantly more frequent
and less rigorously tested than was the case in circuit-switched telephone networks.
ISPs cannot wait until router technology matures – a large ISP has to meet high reliability requirements for critical applications like financial transactions, Voice over
IP, and IPTV using commercially available technology. The need to use less mature
technology has resulted in a variety of redundancy solutions at the edge of the network, and in well-thought-out designs for a resilient core network that is shared by
traffic from all applications.
Y. Kogan
AT&T Labs, 200 S. Laurel Ave, Middletown, NJ 07748, USA
e-mail: yaakovkogan@att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_3,
© Springer-Verlag London Limited 2010

The reliability objective for circuit-switched telephone service of “no more
than 2 hours downtime in 40 years” has been applied to voice communication
since 1964 [1]. It has been achieved using expensive redundancy solutions for
both switches and transmission facilities. Though routers are less reliable than
circuit switches, commercial IP networks have three main advantages when designing for reliability, in comparison with legacy telephone networks. First, packet
switching is a far more economically efficient mechanism for multiplexing network
resources than circuit switching, given the bursty nature of data traffic. Second, protocols like Multi-Protocol Label Switching (MPLS) support a range of network
restoration options that restore service after failures of transmission facilities more economically than traditional 1:1 redundancy. Third, commercial
IP networks can provide different levels of redundancy to different commercial
customers, for example, by offering access diversity or multihoming options and pricing the service according to its reliability. This allows Internet service providers
to satisfy customers who are price-sensitive [2] while recovering the high cost of
redundancy from customers who require increased reliability to support mission-critical applications.
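
To put the telephone-service objective in more familiar terms, the quoted "2 hours downtime in 40 years" can be converted into an availability figure. This is a minimal sketch: the average year length of 365.25 days and the function name are assumptions of the example, not part of the objective itself.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed average year length


def availability(downtime_hours, period_years):
    """Fraction of time in service, given total downtime over a period."""
    total_hours = period_years * HOURS_PER_YEAR
    return 1.0 - downtime_hours / total_hours


objective = availability(downtime_hours=2, period_years=40)
minutes_per_year = 2 * 60 / 40  # the same objective expressed per year

print(f"availability: {objective:.7f}")
print(f"allowed downtime: {minutes_per_year:.0f} minutes per year")
```

The objective corresponds to an average of 3 minutes of downtime per year, which is somewhat stricter than the "five nines" (99.999%) level often quoted for carrier equipment.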
The reliability of modern provider edge routers, which have a large variety of interface cards, cannot be accurately characterized by a single downtime or reliability
metric, because such a metric averages the contributions of the various line cards
and may therefore hide the poor reliability of some components. We address this challenge by
introducing granular metrics for quantifying the reliability of IP routers. Section 3.2
provides an overview of the main router elements and redundancy mechanisms.
In Section 3.3, we use a simplified router reliability model to demonstrate the application of different reliability metrics. In Section 3.4, we define metrics for measuring
the reliability of IP routers in production networks. Section 3.5 provides an overview
of challenges with measuring end-to-end availability.

3.2 Redundancy Solutions in IP Routers
This section provides an overview of the primary elements of a modern router and
associated redundancy mechanisms, which are important for availability modeling
of services in IP networks. A high-speed IP router is a special multiprocessor system
with two types of processors, each with its own memory and CPU: Route Processors
(RPs) and Line-Cards (LCs). Each line-card receives packets from other routers via
one or more logical interfaces, and performs forwarding operations by sending them
to outbound logical interfaces using information in its local Forwarding Information
Base (FIB). The route processor controls the operation of the entire router, runs
the routing protocols, maintains the necessary databases for route processing, and
updates the FIB on each line-card. This separation implies that each LC can continue forwarding packets based on its copy of the FIB when the RP fails. Figure 3.1
provides a simplified illustration of router hardware architecture, where two route
processors (active and backup) and multiple line-cards are interconnected through
a switch fabric. The Monitor bus is used exclusively for transmission of error and
management messages that help one to isolate the fault when a component is faulty
and to restore the normal operation of the router, if the failed component is backed
up by a redundant unit. Data traffic never traverses the Monitor bus; it crosses the
switch fabric. These hardware (HW) components operate under the control of an
Operating System (OS). Additional details for Cisco and Juniper routers can be
found in [3, 4] and [5], respectively.
A typical Mean Time Between Failures (MTBF) for both RPs and LCs is about
100,000 h (see, e.g., Table 9.3 in [6]). This MTBF accounts only for hard failures
requiring replacement of the failed component, in contrast with soft failures, from
which the router can recover, for example, by card reset. A typical example of a soft

Reliability Metrics for Routers in IP Networks

[Fig. 3.1 Generic router hardware architecture. Components shown: line cards 1 to n, active RP, backup RP, switch fabric, monitor bus, cooling system, power supplies]

hardware failure is a parity error. Router vendors do not usually provide an MTBF
for the OS, as it varies over a wide range. In our experience, a new OS
version may have an MTBF well below 100,000 h as a result of undetected software
errors that are first encountered after the OS is deployed to the field, whereas
the MTBF for a stable OS is typically above 100,000 h. Even with a stable OS, however,
changes in the operating environment can trigger latent
software errors.
Without redundancy solutions at the edge of the network, component failures interrupt customer traffic until the failed component is recovered by reset, which may
take about a minute, or until it is replaced, which can take hours. To reduce failure
impacts, shared HW components whose failure would impact the entire router (e.g.,
RP, switch fabric, power supply, and cooling system) are typically redundant. In this
case, the restoration time (assuming a successful failover to the redundant component) is defined by the failover time. For example, in Cisco 12000 series routers
[3] and Juniper T640 router [7], the switch fabric consists of five cards, four of
which are active and one provides redundancy with a subsecond restoration time
when an active card fails. Failure of one power supply or cooling element does not
have any impact on service.
RP redundancy is provided by a configuration with two RP cards: primary and
backup. A first attempt at reducing the failover time has been made by running the
backup RP in standby mode with partial synchronization between the active and
standby RPs that enables the standby RP to maintain all Layer 1 and Layer 2 sessions and recover the routing database from adjacent nodes when the primary RP
fails. However, when a primary RP fails, BGP adjacencies with adjacent routers go
down. The loss of BGP adjacency has the same effect on network routing as failure of the entire router until the standby RP comes on-line and re-establishes
BGP adjacencies with its neighbors. During this time, the routing protocols will
reconverge to another route and then back again, which causes transient packet
loss, a phenomenon known as “route flapping.” (Route flapping occurs when a
router alternately advertises a network destination via one route, then another (or as
unavailable, and then available again) in quick sequence [8].)
To prevent the adjacent routers from declaring the failed router out of service
and removing it from their routing tables and forwarding databases, vendors have
developed high availability (HA) routing protocol extensions, which allow a router
to restart its routing software gracefully in such a way that packet forwarding is
not disrupted when the primary RP fails. If the routers adjacent to a given router
support these extensions, they will continue to advertise routes from the restarting
router during the grace period. Cisco’s and Juniper’s HA routing protocol extensions
are known under the name of Non-Stop Forwarding (NSF) [9] and Graceful Restart
(GR) [10], respectively. A detailed description of the Cisco NSF support for BGP,
OSPF, IS-IS, and EIGRP routing protocols as well as for MPLS-related protocols
can be found in [9]. Here, we describe the BGP protocol extension procedures that
follow the implementation specification provided in the IETF proposed standard
“Graceful Restart Mechanism for BGP” [11]. Let R1 be the restarting router and
R2 be a peer. The goal is to restart a BGP session between R1 and peering routers
without redirecting traffic around R1.
1. R1 and R2 signal each other that they understand Graceful Restart in their initial
exchange of BGP OPEN messages when the initial BGP connection is established between R1 and R2.
2. An RP failover occurs, and the router R1 BGP process starts on the newly active
RP. R1 does not have a routing information base and must reacquire it from its
peer routers. R1 will continue to forward IP packets destined for (or through)
peer routers (R2) using the last updated FIB.
3. When R2 detects that the TCP session with R1 is cleared, it marks routes, learned
from R1, as STALE, but continues to use them to forward packets. R2 also initializes a Restart-timer for R1. Router R2 will remove all STALE routes unless it
receives an OPEN message from R1 within the specified Restart-time.
4. R1 establishes a new TCP session with R2 and sends an OPEN message to R2,
indicating that its BGP software has restarted. When R2 receives this OPEN
message, it resets its own Restart-timer and starts a Stalepath-timer.
5. Both routers re-establish their session. R2 begins to send UPDATE messages
to R1. R1 starts an Update-delay timer and waits up to 120 s to receive
End-of-RIB (EOR) from all its peers.
6. When R1 receives EOR from all its peers, it will begin the BGP Route Selection
Process.
7. When this process is complete, it will begin to send UPDATE messages to R2.
R1 indicates completion of updates by EOR and R2 starts its Route Selection
Process.
8. While R2 waits for an EOR, it also monitors the Stalepath-timer. If the timer expires,
all STALE routes will be removed and the “normal” BGP process will be in effect.
When R2 has completed its Route Selection Process, then any STALE entries
will be refreshed with newer information or removed from the BGP RIB and
FIB. The network is now converged.
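The behavior of the peer R2 in steps 3 to 8 can be sketched as a small state machine. The following Python sketch is illustrative only: the class `HelperPeer`, its method and phase names, and the timer values used in the example are assumptions made here, not part of any vendor implementation or of the specification in [11].

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HelperPeer:
    """State that R2 keeps for a restarting peer R1 (hypothetical model)."""

    restart_time: float    # assumed value: seconds R2 waits for a new OPEN
    stalepath_time: float  # assumed value: seconds STALE routes survive after OPEN
    routes: Dict[str, str] = field(default_factory=dict)  # prefix -> FRESH/STALE
    phase: str = "ESTABLISHED"
    deadline: Optional[float] = None

    def on_session_down(self, now: float) -> None:
        # Step 3: mark routes from R1 as STALE but keep forwarding on them.
        self.routes = {prefix: "STALE" for prefix in self.routes}
        self.phase, self.deadline = "WAIT_OPEN", now + self.restart_time

    def on_open(self, now: float) -> None:
        # Step 4: OPEN arrived within Restart-time; start the Stalepath-timer.
        if self.phase == "WAIT_OPEN":
            self.phase, self.deadline = "WAIT_EOR", now + self.stalepath_time

    def on_update(self, prefix: str) -> None:
        # Step 7: an UPDATE from R1 refreshes (or adds) a route.
        self.routes[prefix] = "FRESH"

    def on_end_of_rib(self) -> None:
        # Step 8: after EOR, STALE entries that were not refreshed are removed.
        if self.phase == "WAIT_EOR":
            self._flush_stale("ESTABLISHED")

    def on_tick(self, now: float) -> None:
        # Steps 3 and 8: expiry of either timer removes all STALE routes.
        if self.deadline is not None and now >= self.deadline:
            next_phase = "IDLE" if self.phase == "WAIT_OPEN" else "ESTABLISHED"
            self._flush_stale(next_phase)

    def _flush_stale(self, next_phase: str) -> None:
        self.routes = {p: s for p, s in self.routes.items() if s != "STALE"}
        self.phase, self.deadline = next_phase, None


# Example run with assumed timer values (120 s and 360 s).
peer = HelperPeer(120.0, 360.0, {"10.0.0.0/8": "FRESH", "192.0.2.0/24": "FRESH"})
peer.on_session_down(now=0.0)   # R1 fails over; both routes become STALE
peer.on_open(now=30.0)          # R1 comes back within the Restart-time
peer.on_update("10.0.0.0/8")    # R1 re-advertises only one prefix
peer.on_end_of_rib()            # the prefix that was not refreshed is dropped
print(peer.routes)
```

The sketch keeps a single deadline because only one of the two timers is running at any time; a real implementation would also handle the restarting side (R1) and per-address-family state.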

One drawback of NSF/GR is that there is a potential for transient routing loops
or packet loss if a restarting router loses its forwarding state (e.g., owing to a
power failure). A second drawback of NSF/GR is that it can prolong delays of
network-layer re-routing in cases where the service is NOT restored by RP failover.
In addition, to be effective in a large ISP backbone, NSF/GR extensions would need
to be deployed on all of the peering routers. However, the OSPF NSF extension is
Cisco proprietary. The respective drafts were submitted to the IETF but not approved
as standards. Since most large ISP networks use routers from multiple vendors, the
lack of standardization and universal adoption by vendors limits the usefulness of
the NSF and GR extensions.
Another approach to router reliability, called Non-Stop Routing (NSR), is free
from the drawbacks of graceful restart. It is a self-contained solution that does not
require protocol extensions and has a faster failover time. With NSR, the standby
RP runs its own version of each protocol and there is continuous synchronization
between the active and standby RPs to the extent that it enables the standby RP to
take over when the active RP fails without any disruption in the existing peering
sessions. The first implementation of NSR was done by Avici Systems [12] in 2003
in the Terabit Switch Router (TSR) that was used in the AT&T core network.
Later, other router vendors implemented their versions of NSR (see, e.g., [13]).
It is important to note that router outages can be divided into two categories:
planned and unplanned outages. Much of the preceding discussion focused on RP
failures or unplanned outages. Planned outages are caused by scheduled maintenance activities, which include software and hardware upgrades as well as card
replacement and installation of additional line-cards. Router vendors are developing a software solution on top of NSR to support in-service software upgrade, or
ISSU (see, e.g., [13–15]). The goal of ISSU is a significant reduction in downtime
due to software upgrades, potentially eliminating this category of downtime if both
the old and new SW versions support ISSU.
We now turn our attention to line-card failures. Line-card failures are distinct
from link failures – while link failures can often be recovered by the underlying
transport technology, e.g., SONET ring (see Chapter 2), line-card failures require
traffic to be handled by a redundant line-card provisioned on the same or a different
router. Line-card redundancy is particularly important for reducing the outage duration of PE (provider-edge) routers that terminate thousands of low-speed customer
ports. The first candidate for redundancy is an uplink LC that is used for connection
to a P (core) router. Without redundancy, any uplink LC downtime will cause PE
router isolation. In addition, a redundant uplink LC allows us to connect a PE router
to two P routers using physically diverse transport links. This configuration results
in the near elimination of PE router downtime caused by periodic maintenance activities on P routers, under the assumption that maintenance is not performed on
these two P routers simultaneously. PE router downtime is nearly eliminated in this
case because the probability of PE isolation caused by the failure of the second uplink or the other P router is negligibly small if the maintenance window is short.
Restoration from an uplink LC failure is provided at the IP-Layer with restoration
time of the order of 10 s as described in Chapter 2.

SONET interfaces on IP routers may support the ability to automatically switch
traffic from a failed line-card to a redundant line-card, using a technique called
Automatic Protection Switching (APS) [16]. Implementation of APS requires installation of two identical line-cards; one card is designated as primary, the other
as secondary. A port on the primary LC is configured as the working interface and
the port with the same port number on the secondary LC as the protection interface. The ports form a single virtual interface. Ports on the secondary LC cannot
be configured with services; they can only be configured as protection ports for the
corresponding ports on the primary LC. The protection and working interfaces are
connected to a SONET ADM (Add-Drop Multiplexer), which sends the same signal
payload to the working and protection interfaces. When the working interface fails,
its traffic is switched to the protection interface. According to our experience, the
switchover time is of the order of 1 min. Hitless switchover requires protocol synchronization between the line-cards, which was not available at the time of writing
of this chapter. APS is only available in a 1:1 configuration. As a result, it is considered to be expensive. An alternative line-card redundancy approach developed at
AT&T [17] is based on a new ISP edge architecture called RouterFarm. RouterFarm
utilizes 1:N redundancy, in which a single PE backup router can support multiple
active routers. The RouterFarm architecture supports customer access links that connect to PE routers over a dynamically reconfigurable access network. When a PE
router fails or is taken out of service for planned maintenance, control software rehomes the customer access links from the affected router to a selected backup router
and copies the appropriate router configuration data to the backup router. Service is
provided by the backup router once the rehoming is complete. After the primary
router is repaired or required maintenance is performed, customers can be rehomed
back to the primary router.

3.3 Router Reliability Modeling
As described in Section 3.2, router outages can be divided into two categories:
planned and unplanned. Planned outages are caused by scheduled maintenance
activities. Customers with a single connection to an ISP edge router are notified in advance about planned maintenance. Outages outside of the maintenance
window are referred to as unplanned. The common practice is to evaluate router
reliability metrics for planned and unplanned outages separately. Table 3.1 provides
an example1 of downtime calculation for software (SW) and hardware (HW) upgrades that require the entire router to be taken out of service. The downtime is
calculated based on upgrade frequency per year in the second column and mean upgrade duration in the third column. The total mean downtime per year for planned
outages is 42 min.
1 All examples are for illustrative purposes only and are not meant to model or describe any network
or vendor’s product.

Table 3.1 Planned downtime for SW and HW upgrades
Activity      Freq/year   Duration (min)   Downtime (min/year)
SW upgrade    2           15               30
HW upgrade    0.2         60               12

The router downtime is close to 0 for unplanned outages if the router supports RP
and LC redundancy. If LC redundancy is not supported, unplanned router downtime
depends on the ratio $r_{LC}/m_{LC}$, where $r_{LC}$ and $m_{LC}$ denote LC MTTR (Mean Time To
Repair) and MTBF, respectively. Using the fact that $r_{LC} \ll m_{LC}$, one can approximate the downtime probability by $r_{LC}/m_{LC}$ and calculate the average unplanned
router downtime per year as
$$d_{LC} = (r_{LC}/m_{LC}) \times 525{,}600 \ \text{(min/year)}.$$
The factor $525{,}600 = 365 \times 24 \times 60$ is the number of minutes in a 365-day year.
With stable hardware and software, $r_{LC}/m_{LC} \approx 4 \times 10^{-5}$ and unplanned downtime
$d_{LC}$ is around 21 min, which is less than the planned downtime due to upgrades by
a factor of 2.
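The planned and unplanned downtime figures above follow from simple arithmetic, which the short Python sketch below reproduces. The function names are hypothetical and chosen for this illustration; the input values are those of Table 3.1 and the ratio quoted in the text.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def planned_downtime(activities):
    """Yearly planned downtime (min): sum of frequency/year times duration (min)."""
    return sum(freq * duration for freq, duration in activities)


def unplanned_downtime(mttr_over_mtbf):
    """Approximate yearly unplanned downtime (min), valid when MTTR << MTBF."""
    return mttr_over_mtbf * MINUTES_PER_YEAR


# Table 3.1: SW upgrade twice a year for 15 min, HW upgrade 0.2 times a year for 60 min.
planned = planned_downtime([(2, 15), (0.2, 60)])
# Ratio r_LC / m_LC of about 4e-5 for stable hardware and software.
unplanned = unplanned_downtime(4e-5)

print(f"planned:   {planned:.1f} min/year")
print(f"unplanned: {unplanned:.1f} min/year")
print(f"ratio:     {planned / unplanned:.2f}")
```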
The reliability improvement due to RP and LC redundancy for unplanned outages
can be evaluated using the following simplified router reliability model described
by a system consisting of two independent components representing the LC and RP.
Component 1 corresponds to the LC and component 2 corresponds to the RP. Each
component alternates between periods when it is up and periods when it is down.
The system is working if both components are up. For nonredundant component
$i$, $i = 1, 2$, denote MTBF and MTTR by $m_i$ and $r_i$, respectively. For a component
consisting of primary and backup units, we assume that once a primary unit fails, the
backup unit starts to function with probability $p_i$ after a random delay with mean
$\tau_i \ll r_i$. With probability $1 - p_i$, the switchover to the backup unit fails, in which
case the mean downtime is $r_i$. Thus, the MTTR for a redundant component is
$$b_i = p_i \tau_i + (1 - p_i) r_i. \tag{3.1}$$
Two important particular cases correspond to $p_i = 0$ (no redundancy) and $\tau_i = 0$
(instantaneous switchover). The MTBF for a redundant component is
$$c_i = \begin{cases} m_i & \text{if } \tau_i > 0, \\ m_i/(1 - p_i) & \text{if } \tau_i = 0. \end{cases} \tag{3.2}$$
The steady state probability that the system (component) is working is referred to as
availability. The complementary probability is referred to as unavailability. Based
on our assumptions, the availability of component $i$ is
$$A_i = \frac{c_i}{c_i + b_i} \tag{3.3}$$

and the system availability is
$$A = A_1 A_2. \tag{3.4}$$
In our case, $r_i \ll m_i$, which allows us to obtain the following simple approximation
for the system unavailability:
$$U = 1 - A_1 A_2 = 1 - (1 - U_1)(1 - U_2) \approx U_1 + U_2 \tag{3.5}$$
where $U_i = b_i/(c_i + b_i)$ is the unavailability of component $i$. Another important reliability metric is the rate $f_s$ at which the system fails. In our case (see, e.g., 7c
in [18])
$$f_s \approx 1/c_1 + 1/c_2. \tag{3.6}$$
Redundancy without instantaneous switchover decreases the mean component
downtime $b_i$ and the component and the system unavailability. However, the system
failure rate does not decrease because the component uptime $c_i = m_i$ remains
unchanged if $\tau_i > 0$. Instantaneous switchover decreases both the unavailability
and the system failure rate.
The availability of LCs and RPs with no redundancy is typically better than
0.9999 (four nines) but worse than 0.99999 (five nines). We can compute an estimate of the improvement due to redundancy using Eq. (3.1). If the redundancy of
component $i$ is characterized by a probability of successful switchover $p_i = 0.95$
and $\tau_i/r_i = 0.05$, then $b_i = (0.95 \times 0.05 + 0.05)\, r_i \approx 0.1\, r_i$, so the mean component downtime $b_i$ and therefore its unavailability would decrease by about a factor of 10, resulting in a component availability
exceeding five nines. The system availability would be limited by the availability of
any nonredundant component.
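The effect of redundancy in this simplified model can be checked numerically. The sketch below is a minimal Python illustration of Eqs. (3.1) to (3.3); the function names are hypothetical, and the MTBF of 100,000 h and MTTR of 4 h are assumed values chosen only so that the nonredundant availability falls between four and five nines. The sketch assumes a switchover probability below 1.

```python
def redundant_mttr(p, tau, r):
    """Eq. (3.1): mean downtime of a component with a backup unit."""
    return p * tau + (1 - p) * r


def redundant_mtbf(m, p, tau):
    """Eq. (3.2): mean uptime; only instantaneous switchover (tau == 0) extends it."""
    return m / (1 - p) if tau == 0 else m


def availability(c, b):
    """Eq. (3.3): steady-state probability that the component is up."""
    return c / (c + b)


m, r = 100_000.0, 4.0      # assumed MTBF and MTTR in hours
p, tau = 0.95, 0.05 * r    # switchover probability and mean switchover delay

b = redundant_mttr(p, tau, r)
c = redundant_mtbf(m, p, tau)

print(f"no redundancy:   A = {availability(m, r):.7f}")
print(f"with redundancy: A = {availability(c, b):.7f}")
print(f"downtime reduction factor r/b = {r / b:.2f}")
```

With these inputs the downtime reduction factor is slightly above 10, which matches the estimate in the text.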

3.4 Reliability Metrics for Routers in Access Networks
[Fig. 3.2 Access network elements: CE routers, provider-edge routers PE1, ..., PEn, core routers P1 and P2, backbone]
Figure 3.2 depicts a typical Layer 3 access topology for enterprise customers. It
includes n provider-edge routers PE1, ..., PEn and two core or backbone routers
P1 and P2, which are responsible for delivering traffic from customer edge (CE)
routers at a customer location into the commercial IP network backbone. The service
provided by an ISP to an enterprise customer is typically associated with a customer
“access port.” An access port is a logical interface on the line-card in a PE, where
the link from a customer’s CE router terminates. In general, a PE has a variety of
line-cards with different port densities depending on the port speed. For example,
a channelized OC-12 card provides up to 336 T1/E1 ports, while a channelized
OC-48 card can provide up to either 48 T3 ports, or 16 OC3 ports, or 4 OC12
ports. In Fig. 3.2, each PE is dual-homed to two different P (core) routers using
two physically diverse transport links terminating on different line-cards at the PE
router. (These transport links are referred to as uplinks.) The links that connect P
routers at different nodes are generally provided by an underlying transport network.
Dual-homing is used to reduce the impact on the customer due to outages – from
a potentially long repair interval to short-duration packet loss caused by protocol
reconvergence. Dual-homing is used to address the following outage scenarios:
• Outage of uplink transport equipment
• Outage of an uplink line-card at PE routers
• Outage of an uplink line-card at P routers
• Outage of one P router or its associated backbone links

Customer downtime can be caused by a failure in a PE component, such as a failed
interface or line-card, or by a total PE outage.
Our goal in this section is to provide a practical way of applying the traditional
reliability metrics like availability and MTBF to a large network of edge routers. The
calculation of these metrics is straightforward in the case of $K$ identical systems
$s_1, \ldots, s_K$, where each system alternates between periods when it is up and periods
when it is down. Assume that $k \le K$ different systems $s_{i_1}, \ldots, s_{i_k}$ failed during
a time interval of length $T$, and let $t_j$ be the total outage duration of system $j$. The
unavailability $U_j$ of system $j$ can be estimated as
$$U_j = t_j/T \quad \text{for } j = i_1, \ldots, i_k \tag{3.7}$$
and $U_j = 0$ otherwise. Then, the average unavailability is
$$U = \frac{\sum_{j=1}^{K} U_j}{K} = \frac{\sum_{j=1}^{k} t_{i_j}}{KT} \tag{3.8}$$
and the average availability is
$$A = 1 - U. \tag{3.9}$$
Finally, the average time between failures is estimated as $KT/L$, where $L \ge k$ is
the total number of failures during time interval $T$.
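As an illustration of Eqs. (3.7) to (3.9), the following Python sketch estimates the fleet-wide metrics from a log of outage durations. The function name, the dictionary layout of the log, and the sample values are assumptions made for this example.

```python
def fleet_estimates(outage_hours, num_systems, interval_hours):
    """Estimate average unavailability, availability, and time between failures.

    outage_hours maps a system id to the list of its outage durations (hours)
    observed during the interval; systems without outages may be omitted.
    """
    total_outage = sum(sum(durations) for durations in outage_hours.values())
    num_failures = sum(len(durations) for durations in outage_hours.values())  # L
    system_hours = num_systems * interval_hours                                # K * T
    unavailability = total_outage / system_hours                               # Eq. (3.8)
    availability = 1 - unavailability                                          # Eq. (3.9)
    time_between_failures = (
        system_hours / num_failures if num_failures else float("inf")
    )
    return unavailability, availability, time_between_failures


# Hypothetical log: K = 100 systems observed for T = 1,000 h; two systems failed,
# one of them twice, so k = 2 and L = 3.
log = {"s3": [2.0], "s7": [0.5, 1.5]}
u, a, tbf = fleet_estimates(log, num_systems=100, interval_hours=1000.0)
print(f"U = {u:.2e}, A = {a:.6f}, time between failures = {tbf:.0f} h")
```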
There are two main difficulties with extending these estimates to routers. First,
routers experience failures of a single line-card in addition to entire router failures.
Second, routers may not be identical. The initial approach to overcome these difficulties was to assign to each failure a weight that represents the fraction of the
access network impacted by the failure. Such an approach is adequate for access
networks consisting of routers of the same type and line-cards with port speeds in a
sufficiently narrow range, which was the case for early access networks with Cisco’s
7500 routers. Modern access networks may consist of several router platforms, and
high-speed routers may have line-cards with port speeds varying over a wide range. For
these networks, averaging failures over various router platforms and line-cards with
different port speeds is not sufficient. We start by presenting the existing averaging techniques and demonstrating their deficiencies, and then describe a granular
approach where availability is described by a vector with components representing
the availability for each type of access line-cards.
Two frequently used expressions for calculating the fraction of the impacted access network are based on different parameterizations of impacted access ports in
service and have the following forms [19]:
$$f = \frac{\text{Number of impacted access ports in service}}{\text{Total number of all access ports in service}} \tag{3.10}$$
and
$$f = \frac{\text{Total bandwidth of impacted access ports in service}}{\text{Total bandwidth of all access ports in service}} \tag{3.11}$$
Having the fraction $f_i$ of access ports impacted and failure duration $D_i$ for each
failure $i$, $i = 1, \ldots, L$, during a time interval of length $T$, we can estimate the average
access unavailability and availability as
$$U_{\text{access}} = \sum_{i=1}^{L} f_i \frac{D_i}{T} \quad \text{and} \quad A_{\text{access}} = 1 - U_{\text{access}} \tag{3.12}$$
respectively. Formally, one can use Eq. (3.12) with port-weighting or bandwidth-weighting fractions $f_i$ for estimating the average unavailability (availability) of any
access network with different router platforms. However, there are several problems
with these averaging techniques that limit their usefulness:
• Port-weighted fraction (3.10) emphasizes line-card failures with low-speed ports
  while failures of high-speed ports are heavily discounted because the port density
  on a line-card is inversely proportional to the port speed.
• Bandwidth-weighted fraction (3.11) assigns lower weight to failures of line-cards
  with low-speed ports because they do not utilize the entire bandwidth of the
  line-card.
• Any averaging over different router platforms or even for one router platform
  with a variety of line-cards that have different quality of hardware and software
  may hide defects.
These issues are illustrated by the following example. Consider an access network
consisting of 100 Cisco gigabit switch routers (GSRs) and assume that each router
has two access line-cards of each of the following three types:
• Channelized OC12 with up to 336 T1 ports
• Channelized OC48: one card with up to 48 T3 ports and another card with either
  up to 16 OC3 ports (50 routers) or up to 4 OC12 ports (50 routers)
• 1-port OC48

The total number of ports in service and their respective bandwidth (BW) are shown
in Table 3.2. The number of ports in the third column of Table 3.2 is obtained by
multiplying the number of ports in service given in the second column of Table 3.3
by the total number of cards with the respective port speed. For T1 and OC48, the
total number of cards of each type is $200 = 2 \times 100$. For T3, OC3, and OC12, the total number of cards is 100, 50, and 50, respectively. In Table 3.3, we use Eqs. (3.10)
and (3.11) to calculate port-weight and bandwidth-weight for failure of one line-card depending on the number of ports in service given in the second column. The
bandwidth of a line-card is obtained as a product of the number of ports in service, given in the second column of Table 3.3, and the respective speed given in the
second column of Table 3.2. One can see that port-weighting practically disregards
failures of line-cards with OC48 and OC12 ports, while contribution of failures of
line-cards with T3 and OC3 ports is discounted relative to T1 ports by a factor of
6.7 and 20, respectively. As a result, the availability of the access network is dominated by the availability of channelized OC12 card with T1 ports. As one could
expect, bandwidth-weighting is biased toward failures of line-cards with an OC48
port. However, failures of other line-cards, except for a channelized OC12 card with
T1 ports, become more visible in comparison with port-weighting.
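The entries of Tables 3.2 and 3.3 can be recomputed from the per-card port counts, port speeds, and card counts given in the example. The Python sketch below does so using Eqs. (3.10) and (3.11) for the failure of a single line-card; the dictionary `CARDS` and the function `weights` are names introduced here for illustration.

```python
# Port type -> (port speed in Mbps, ports in service per card, number of cards)
CARDS = {
    "T1": (1.5, 200, 200),
    "T3": (45, 30, 100),
    "OC3": (155, 10, 50),
    "OC12": (622, 3, 50),
    "OC48": (2400, 1, 200),
}

total_ports = sum(ports * cards for _, ports, cards in CARDS.values())
total_bw_mbps = sum(speed * ports * cards for speed, ports, cards in CARDS.values())


def weights(port_type):
    """Port-weight (3.10) and bandwidth-weight (3.11) for one failed line-card."""
    speed, ports, _ = CARDS[port_type]
    return ports / total_ports, speed * ports / total_bw_mbps


print(f"total ports: {total_ports}, total bandwidth: {total_bw_mbps / 1000:.1f} Gbps")
for port_type in CARDS:
    p_weight, bw_weight = weights(port_type)
    print(f"{port_type:5s} P-weight {p_weight:.2e}  BW-weight {bw_weight:.5f}")
```

The ratio of the T1 port-weight to the T3 and OC3 port-weights is 200/30 and 200/10, which gives the discount factors of about 6.7 and 20 quoted above.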
Table 3.2 Total number of ports in service and their bandwidth
Port    Speed (Mbps)   Number of ports   BW (Gbps)
T1      1.5            40,000            60.0
T3      45             3,000             135.0
OC3     155            500               77.5
OC12    622            150               93.3
OC48    2,400          200               480.0
Total                  43,850            845.8

Table 3.3 Port-weight and bandwidth-weight per line-card
Port    In service   P-weight   BW-weight
T1      200          0.00456    0.00035
T3      30           0.00068    0.00160
OC3     10           0.00023    0.00183
OC12    3            6.8E-05    0.00221
OC48    1            2.3E-05    0.00284

As a result of these problems with port and bandwidth-weighting techniques, a
more useful approach is to evaluate average availability for each router platform
and for each type of access LC separately. The increasing variety of edge routers
and access line-cards justifies such an approach, since it allows the ISP to track
the reliability with finer granularity. Consider a set of edge routers of the same
type with $J$ types of access line-cards, which are monitored for failures during
a time interval of length $T$. For each customer impacting failure $i$, $i = 1, \ldots, L$,
we record the number $n_{ij}$ of type $j$ cards affected and the respective failure duration $t_{ij}$. In the case of access line-card redundancy, only failures of the active (primary)
line-card are counted and then only if the failover to the backup line-card was not
hitless. The average unavailability of a type $j$ access line-card is calculated as
$$U_j = \frac{\sum_{i=1}^{L} n_{ij} t_{ij}}{N_j T} \tag{3.13}$$
where $N_j$ is the total number of type $j$ active cards. The average unavailability can
be expressed as
$$U_j = \frac{R_j}{M_j} \tag{3.14}$$
where
$$R_j = \frac{\sum_{i=1}^{L} n_{ij} t_{ij}}{\sum_{i=1}^{L} n_{ij}} \tag{3.15}$$
is the average repair time for an LC of type $j$, and
$$M_j = \frac{N_j T}{\sum_{i=1}^{L} n_{ij}} \tag{3.16}$$
can be interpreted as the average time between router failures impacting customers
on access line-cards of type $j$. Metric $M_j$ can be considered as an extension of
the traditional field hardware MTBF. For the field MTBF, only individual line-card
failures, which require card replacement, are counted in the denominator. In $M_j$,
we count all failures of type $j$ cards outside the maintenance window, including
those caused by reset, software bugs, and all impacted cards of type $j$ in case of
entire router failure. This distinction is important since we want a metric that accurately captures customer impact caused by all HW and SW failures. For example,
each reset of an active (primary) line-card can cause a protocol reconvergence event
resulting in short-duration packet loss. Metrics $R$, $M$, and $U$ can also be defined for
the entire population of access line-cards without differentiating failure by LC type.
Denote
$$N = \sum_{j=1}^{J} N_j, \quad n = \sum_{j=1}^{J} \sum_{i=1}^{L} n_{ij}, \quad t = \sum_{j=1}^{J} \sum_{i=1}^{L} n_{ij} t_{ij}. \tag{3.17}$$
Then
$$R = \frac{t}{n}, \quad M = \frac{NT}{n} \tag{3.18}$$
and the average unavailability
$$U = \frac{R}{M}. \tag{3.19}$$
The value of using $M_j$ in addition to the average unavailability is demonstrated by
the following example.

Example 3.1. Consider a set of 400 routers and let $T = 1{,}000$ h. Each router has
two cards of Type 1, three cards of Type 2, and five cards of Type 3. The number
of failures for the entire router and each card type with their duration is given in
Table 3.4. In case of single card failures, $n_{ij} = 1$ if an LC of type $j$ failed and $n_{ij} = 0$
otherwise. In the case of entire router failure, $(n_{i1}, n_{i2}, n_{i3}) = (2, 3, 5)$. In this
example, we assume constant failure duration $t_{ij} = t_j$ of type $j$ cards and a constant duration of the entire router failure. The failure duration is measured in hours.
The failure parameters in Table 3.4 are referred to as Scenario 1. We also consider
a Scenario 2, in which the only difference with Scenario 1 is that the number of
failures of entire routers is increased from 1 to 5.
The reliability metrics for two scenarios are given in Table 3.5. The results in
columns R and M for LC Type $j$, $j = 1, 2, 3$, and for All Cards are calculated using Eqs. (3.15), (3.16), and (3.18), respectively. The unavailability for LC
Type $j$, $j = 1, 2, 3$, and for All Cards is calculated using Eqs. (3.14) and (3.19),
respectively. The defects per million (DPM) is a commonly used metric that is obtained by multiplying the respective unavailability by 1,000,000.
Note that for All Cards, defects per million (DPM) are below 10 in both scenarios, implying a high availability exceeding 99.999% (five nines), while the average
time between customer impacting failures M in Scenario 2 is almost half of that in
Scenario 1. Therefore, DPM, in contrast with average time between customer impacting failures, is not sensitive to the frequency of short failures of the entire router.
Table 3.4 Failures and their duration: Scenario 1
Failure      # Failures   Duration (h)
Router       1            0.1
LC type 1    30           0.8
LC type 2    6            1.5
LC type 3    2            0.5

Table 3.5 Reliability metrics
             Scenario 1                    Scenario 2
LC type      R (h)   M (h)     DPM        R (h)   M (h)    DPM
1            0.76    25,000    30.25      0.63    20,000   31.25
2            1.03    133,333   7.75       0.50    57,143   8.75
3            0.21    285,714   0.75       0.13    74,074   1.75
All Cards    0.73    83,333    8.75       0.44    45,455   9.75

If an ISP were only tracking DPM and router outages increased from one outage per
1,000 h to five outages per 1,000 h, it might miss the significant decrease in reliability as seen from the customer’s perspective.
The metrics in the All Cards row hide a low average time between failures and
high DPM for LC Type 1 in both scenarios. The average time between customer
impacting failures by LC type amplifies the difference between the two scenarios.
For example, for LC Type 3, the average time between failures M3 decreased almost by a factor of 4 in Scenario 2, in comparison with Scenario 1. This example
illustrates the importance of measuring reliability metrics by the type of access linecards. It also illustrates the significant impact that even short-duration outages of
an entire router have on reliability. Furthermore, it shows why nonstop routing and
in-service software-upgrade capabilities described in Section 3.2 are considered to
be so important by ISPs.
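The metrics of Example 3.1 can be recomputed directly from the failure counts in Table 3.4. The following Python sketch applies Eqs. (3.13) to (3.19) to both scenarios; the constant and function names are hypothetical, and the only parameter varied between the scenarios is the number of entire-router failures.

```python
T_HOURS = 1000.0
ROUTERS = 400
CARDS_PER_ROUTER = {1: 2, 2: 3, 3: 5}                   # LC type -> cards per router
LC_FAILURES = {1: (30, 0.8), 2: (6, 1.5), 3: (2, 0.5)}  # LC type -> (count, hours)
ROUTER_FAILURE_HOURS = 0.1


def metrics(router_failures):
    """Return {LC type or 'all': (R in hours, M in hours, DPM)} for one scenario."""
    result = {}
    n_all = t_all = cards_all = 0.0
    for lc_type, per_router in CARDS_PER_ROUTER.items():
        count, hours = LC_FAILURES[lc_type]
        # Each entire-router failure impacts all cards of this type on that router.
        n = count + router_failures * per_router
        t = count * hours + router_failures * per_router * ROUTER_FAILURE_HOURS
        cards = ROUTERS * per_router
        result[lc_type] = (t / n, cards * T_HOURS / n, 1e6 * t / (cards * T_HOURS))
        n_all, t_all, cards_all = n_all + n, t_all + t, cards_all + cards
    result["all"] = (
        t_all / n_all,
        cards_all * T_HOURS / n_all,
        1e6 * t_all / (cards_all * T_HOURS),
    )
    return result


for scenario, router_failures in (("Scenario 1", 1), ("Scenario 2", 5)):
    print(scenario)
    for key, (r, m, dpm) in metrics(router_failures).items():
        print(f"  {str(key):4s} R = {r:.2f} h  M = {m:,.0f} h  DPM = {dpm:.2f}")
```

Changing the argument of `metrics` shows how quickly $M$ falls as short entire-router outages become more frequent, while DPM for All Cards changes only slightly.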

3.5 End-to-End Availability
Evaluation of the end-to-end availability requires evaluation of the backbone availability in addition to the access availability discussed in Section 3.4. Given the scale
and complexity of a large ISP backbone, there is no generally agreed upon approach for measuring and modeling end-to-end availability. Chapter 4 provides a
fairly general approach for performance and reliability (performability) evaluation
of networks consisting of independent components with a finite number of failure
modes. Its application involves the steady state probability distribution that is used
for calculation of the expected value of the measure F defined on the set of network states. This section presents a brief overview of some results related to state
aggregation and the selection of function F for evaluating the backbone availability.
Large ISP backbones are typically designed to ensure that the network stays
connected under all single-failure scenarios. Furthermore, the links are designed
with enough capacity to carry the peak traffic load under all single-failure scenarios. Therefore, the majority of failures do not cause loss of backbone connectivity.
Typically, when a failure happens, P routers detect the failure and trigger a failover
to a backup path. If the failover were hitless and the backup path did not increase
the end-to-end delay and also had enough capacity to carry all traffic, then the failure would not have any customer impact. Failures impacting customer traffic include
the following events:
1. Loss of connectivity
2. Increased end-to-end delay on the backup path
3. Packet loss due to insufficient capacity of the backup path
4. Routing reconvergence triggered by the original failure. Such a reconvergence
   may cause packet loss during several seconds.

Assume that the duration of each event can be measured. Two approaches to measuring the backbone availability are based on knowing the actual point-to-point
traffic demand matrix that allows us to calculate the amount of impacted traffic
for each event. In the first approach [20], only events 3 and 4 are included. The
backbone unavailability is defined as the fraction of traffic lost over a given time
period. In the second approach [21], all four events are included. Availability is
measured for each origin–destination pair as the percentage of time that the network
can satisfy a service-level agreement including 100% connectivity and thresholds
on packet loss and delay. The main complexity in the implementation of either
approach is in measuring event durations. The determination of event durations
requires specially designed network instrumentation involving synthetic (active)
measurements. Reference [22] describes a standardized point-to-point approach to
path-level measurements and reference [23] describes a novel approach that uses
a single measurement host to collect network-wide one-way performance data.
These approaches also require a well-thought-out data management infrastructure
and computationally intensive processing of their output [24]. Application of the edge-to-edge availability distribution to the evaluation of VoIP (Voice over IP) reliability [25]
is addressed in [26].
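The two measurement approaches can be contrasted in a small sketch. Everything below is hypothetical: the `Event` record, the demand values, the event list, and the simplifying assumptions that events do not overlap and that every listed event violates the service-level agreement. It is meant only to show how a traffic-weighted unavailability differs from a per-pair, time-based availability, not to reproduce the methods of [20] or [21].

```python
from dataclasses import dataclass


@dataclass
class Event:
    od_pair: tuple        # (origin, destination) of the impacted demand
    duration_h: float     # measured duration of the event, in hours
    lost_fraction: float  # share of the pair's traffic lost during the event


def traffic_unavailability(events, demand, period_h):
    """First approach: traffic lost divided by traffic offered over the period."""
    offered = sum(demand.values()) * period_h
    lost = sum(demand[e.od_pair] * e.lost_fraction * e.duration_h for e in events)
    return lost / offered


def pair_availability(events, od_pair, period_h):
    """Second approach: share of the period in which the pair meets its SLA.
    Assumes every listed event violates the SLA and events do not overlap."""
    violated_h = sum(e.duration_h for e in events if e.od_pair == od_pair)
    return 1.0 - violated_h / period_h


# Hypothetical demand matrix (Gb/s) and events measured over a 1,000 h period.
demand = {("A", "B"): 10.0, ("A", "C"): 30.0}
events = [Event(("A", "B"), 0.5, 1.0), Event(("A", "C"), 0.1, 0.2)]

u = traffic_unavailability(events, demand, period_h=1000.0)
a_ab = pair_availability(events, ("A", "B"), period_h=1000.0)
print(f"traffic-weighted unavailability: {u:.2e}")
print(f"availability of pair A-B: {a_ab:.4%}")
```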

References
1. Malec, H., (1998). Communications reliability: A historical perspective. IEEE Transactions on
Reliability, 47, 333–345.
2. Claffy, kc., Meinrath, S., & Bradner, S. (2007). The (un)economic Internet? IEEE Internet
Computing, 11, 53–58.
3. Bollapragada, V., Murphy, C., & White, R. (2000). Inside Cisco IOS software architecture.
Indianapolis, IN: Cisco Press.
4. Schudel, G., & Smith, D. (2008). Internet protocol operations fundamentals. In Router security
strategies. Indianapolis, IN: Cisco Press.
5. Garrett, A., Drenan, G., & Morris, C. (2002). Juniper networks field guide and reference. Reading, MA: Addison-Wesley.
6. Oggerino, C. (2001). High availability network fundamentals: A practical guide to predicting
network availability. Indianapolis, IN: Cisco Press.
7. T640 Internet router node overview, from http://www.juniper.net/techpubs/software/nog/noghardware/download/t640-router.pdf.
8. Route flapping, from http://en.wikipedia.org/wiki/Route_flapping.
9. Cisco nonstop forwarding with stateful switchover (2006). Deployment guide. Cisco Systems,
from http://www.cisco.com/en/US/technologies/tk869/tk769/technologies white paper0900
aecd801dc5e2.html.
10. Graceful restart concepts, from http://www.juniper.net/techpubs/software/junos/junos93/
swconfig-high-availability/graceful-restart-concepts.html#section-graceful-restart-concepts.
11. Sangli, S., Chen, E., Fernando, R., & Rekhter, Y. (2007). Graceful restart mechanism for BGP.
RFC 4724. Internet Official Protocol Standards, from http://www.ietf.org/rfc/rfc4724.txt.
12. Kaplan, H. (2002). NSR Non-stop routing technology. White paper. Avici Systems Inc., from
http://www.avici.com/technology/whitepapers/reliability series/NSRTechnology.pdf.
13. Router high availability for IP networks (2005). White paper. Alcatel, from http://www.
telecomreview.ca/eic/site/tprp-gecrt.nsf/vwapj/Router HA for IP.pdf/$FILE/Router HA for
IP.pdf.
14. ISSU: A planned upgrade tool (2009). White paper. Juniper Networks, from http://www.
juniper.net/us/en/local/pdf/whitepapers/2000280-en.pdf.
15. Cisco IOS XE In Service Software Upgrade process (2009). Cisco Systems, from http://
www.cisco.com/en/US/docs/ios/ios xe/ha/configuration/guide/ha-inserv updg xe.pdf.
16. Single-router APS for the Cisco 12000 series router, from http://www.cisco.com/
en/US/docs/ios/12 0s/feature/guide/12ssraps.pdf.
17. Agrawal, M., Bailey, S., Greenberg, A., et al. (2006). RouterFarm: Towards a dynamic manageable network edge. In: SIGCOMM’06 Workshops, Pisa, Italy.
18. Ross, S. (1989). Introduction to probability models. San Diego, CA: Academic.
19. Access availability of routers in IP-based networks (2003) Committee T1 tech rep
T1.TR.78–2003.
20. Kogan, Y., Choudhury, G., & Tarapore, P. (2004). Evaluation of impact of backbone outages in
IP networks. In ITCOM 2004, Philadelphia, PA.
21. Wang, H., Gerber, A., Greenberg, A., et al. (2007). Towards quantification of IP network reliability, from http://www.research.att.com/jiawang/rmodel-poster.pdf.
22. Ciavattone, L., Morton, A., & Ramachandran, G. (2003). Standardized active measurements
on a Tier 1 IP backbone. IEEE Communications Magazine, 41, 90–97.
23. Burch, L., & Chase, C. (2005). Monitoring link delays with one measurement host. ACM
SIGMETRICS Performance Evaluation Review 33, 10–17.
24. Choudhury, G., Eisenberg, M., Hoeflin, D., et al. (2007). New reliability metrics and measurement techniques for IP networks. Proceedings of Distributed computer and communication
networks, RAS, Moscow, 126–130.
25. Johnson, C., Kogan, Y., Levy, Y., et al. (2004). VoIP Reliability: A service provider perspective.
IEEE Communications Magazine, 42, 48–54.
26. Lai, W., Levy, Y., & Saheban, F. (2007). Characterizing IP network availability and VoIP
service reliability. Proceedings of Distributed computer and communication networks, RAS,
Moscow, 126–130.

Chapter 4

Network Performability Evaluation
Kostas N. Oikonomou

4.1 Introduction
This chapter is an introduction to the area of performability evaluation of networks.
The term performability, which stands for performance plus reliability, was introduced in the 1980s in connection with the performance evaluation of fault-tolerant, degradable computer systems [23].¹ In network performability evaluation,
we are interested in investigating a network’s performance not only in the “perfect”
state, where all network elements are operating properly, but also in states where
some elements have failed or are operating in a degraded mode (see, e.g., [8]). The
following example will introduce the main ideas.
Consider the network (graph) of Fig. 4.1. On the left, the network is in its perfect
state, and on the right one node and one edge have failed.² Node and edge failures occur independently, according to certain probabilities, which we assume to
be known. An assignment of “working” or “failed” states to the network elements
defines a state of the network. By the independence assumption, the probability of
that state is the product of the state probabilities of the elements.

[Figure 4.1: two drawings of the network, “Perfect state” (left) and “Failure of node 7 and edge (1,6)” (right)]
Fig. 4.1 A 7-node, 10-edge network with $2^{17}$ possible states. The performance metric is traffic
latency, measured in hops

There are two traffic flows in this network: one from node 1 to node 5, and the
other from 7 to 3. The flows are deterministic, of constant size, and there is no queuing at the nodes. Our interest is in the latency of each flow, defined as the minimum
number of hops (edges) that the flow must traverse to get to its destination when it is
routed on the shortest path. In each state of the network, a flow has a given latency:
in the perfect state, both flows have latency 2 (hops), but in the example failure state
the first flow has latency 3 and the second $\infty$, because its source node has failed. The simplest characterization of the
latency metric would be to find its expected value over the possible network states,
of which there are $2^{17} \approx 130{,}000$. A more complete characterization would be
to find its entire probability distribution. This would allow one to answer questions
such as “what is the probability that the latency of flow 1 does not exceed 3?”,
and “what upper bound on the latency of flow 2 can be guaranteed with probability
0.999?”. The answers to these questions (performability guarantees) are useful in
setting performance targets for the network, or SLAs.
This basic example illustrates several points, all of which will be covered in more
detail in later sections.

¹ Unfortunately, the terminology is not completely standard and some authors still use the term
“reliability” for what we call performability; see, e.g., [1]. One may also encounter other terms
such as “availability” or “dependability”.
² When a node fails, we consider that all edges incident to it also fail.

K.N. Oikonomou
200 Laurel Ave, Middletown, NJ, 07748
e-mail: ko@research.att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_4,
© Springer-Verlag London Limited 2010

Reliability/Performance Trade-Off in the Analysis
A fundamental fact is that the size of the state space is exponential in the number
of network elements. In the above example, if the number of network elements is
doubled, the number of network states becomes about $17 \times 10^{9}$, and this is still a small
network, with only 34 elements; a network model with several hundred elements
would be much more typical. This means that for any realistic network model the
state space is practically infinite, so the amount of work that can be done in each state
to compute the performance metrics is limited. In other words, in performability
analysis there is a fundamental trade-off between the reliability (state space) and
performance aspects. A consequence of this trade-off is that the performance model
cannot be as detailed as it would be in a pure performance analysis: in the example,
we assumed constant traffic flows and no queuing at nodes. Another aspect of the
trade-off is that only the investigation of the steady-state behavior of the model is,
in general, feasible: in the example, we treated the network elements as two-valued
random variables, not as two-state random processes. However, a mitigating factor is
that the network states generally have very different probabilities, so that we may be
able to calculate bounds on the performance metrics by computing their values only
on a reasonable number of states, those with high probability. With this fundamental
trade-off in mind, we now discuss ways in which the simple performability model
of the example can be extended.
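Both the growth of the state space and the mitigating concentration of probability can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that every element fails independently with the same probability q = 0.001; that value and the function names are not taken from the text.

```python
from math import comb


def num_states(n_elements: int) -> int:
    """Number of network states when each element is either working or failed."""
    return 2 ** n_elements


def states_with_at_most(n_elements: int, max_failed: int) -> int:
    """How many states have at most max_failed failed elements."""
    return sum(comb(n_elements, j) for j in range(max_failed + 1))


def mass_with_at_most(n_elements: int, q: float, max_failed: int) -> float:
    """Total probability of those states, for identical failure probability q."""
    return sum(
        comb(n_elements, j) * q**j * (1 - q) ** (n_elements - j)
        for j in range(max_failed + 1)
    )


print(num_states(17))  # the 17-element example
print(num_states(34))  # the doubled network
print(states_with_at_most(34, 2), mass_with_at_most(34, 0.001, 2))
```

With these assumed numbers, a few hundred of the roughly 17 billion states of the 34-element network already carry almost all of the probability, which is what makes bounding the metrics from a limited number of states plausible.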

Enhancements to the Simple Model
To make the model presented in the example more useful for a realistic analysis, we
could add capacities to the graph’s edges. We could also add sizes to the traffic flows,
and have more sophisticated routing that allows only shortest paths that have enough
capacity for a flow. Further, for a better latency measure, we could add lengths to
the graph edges. Another category of enhancements would be aimed at representing
failures more realistically. To begin with, the network elements could be allowed to
have more than one failure mode, e.g., an edge could operate at full capacity, half
capacity, or zero capacity (fail). We could separate the network elements from the
entities that fail by introducing “components” that have failure modes and affect the
graph elements in certain ways. For example, such a component could represent an
optical fiber over which two graph edges are carried, and whose failure (cut) would
fail both of these edges at the same time. In Section 4.2 we describe a hierarchical
network model that has all the features mentioned above, among others. Finally, we
could allow different types of routing for traffic flows, and also introduce the notion
of network restoration into the model. These additions are described in Section 4.3.

Network Performability in the Literature
A number of network performability studies have appeared in the literature. Levy
and Wirth [21] investigate the call completion rate in a communications network.
Alvarez et al. [4] study performability guarantees for the time required to satisfy a
web request in a network with up to 50 nodes, where only nodes can fail, but without
restoration. Levendovszky et al. [19] study the expected lost traffic in the Hungarian
backbone SDH network with 52 nodes and 59 links, and no restoration. Carlier et al.
[7] use a three-level network model, and study expected lost traffic in a 111-node,
180-link network using k-shortest path restoration. Gomes and Craveirinha [12]
study a 46-node, 693-link representation of the Lisbon urban network with a three-level performability model, and compute blocking probabilities for a Poisson model
of the network traffic, with no restoration. Finally, layered specification of a network
for the purposes of performability evaluation has been used in [7,12], which separate
the network into a “physical” and a “functional” layer, and in [22], which uses a
special-purpose separation into “node cluster” and “call-processing path” layers.
Some further references are given in Section 4.4.3.

Chapter Outline
In Section 4.2 we describe a four-level, hierarchical network model, suited for
performability analysis, and illustrate it with an IP-over-optical network example. In Section 4.3 we discuss the performability evaluation problem in general,
give a mathematical formulation, present the state-generation approach to the performability evaluation of networks, and discuss basic performance measures and
related issues. We also introduce the nperf network performability analyzer, a
software package developed in AT&T Labs Research. In Section 4.4 we conclude
by presenting two case studies that illustrate the material of this chapter, the first
involving an IPTV distribution network, and the second dealing with architecture
choices for network access.

4.2 Hierarchical Network Model
For the purpose of our performability modeling, we will think of a “real” network as
consisting of three layers³: a traffic layer, a transport layer, and a physical layer. On
the other hand, as shown in Fig. 4.2, our performability model is divided into four
levels: traffic, graph, component, and reliability. (In terms of the ISO OSI reference
model, both models address layers 1 through 3.) To illustrate the correspondence
between the three network layers and the four model levels, we use the case of an
IP-over-optical “real” network. The four-level performability model applies to many
other types of real networks as well: for example, Oikonomou et al. [25] describe
its application to a set of satellites that communicate among themselves and a set of
ground stations via microwave links, whereas the ground stations are interconnected
by a terrestrial network.

4.2.1 IP-Over-Optical Network Example
A modern commercial packet network typically consists of IP routers connected by
links, which are transported by an underlying optical network. We describe how we
model the traffic, transport, and physical layers of such a network, and how we map
them to the levels of the performability model in Fig. 4.2. (For more on this topic,
see Chapter 2.)

Traffic Layer
Based on an estimate of the peak or average traffic pattern, we create a matrix giving
the demand or “flow” between each pair of routers. (Methods for creating such a
traffic matrix from measurements are described in Chapter 5.) A demand has a rate,
a unit, and possibly a type or class associated with it.

³ We say “real” because any description is itself at some level of abstraction and omits aspects
which may be important if one adopts a different viewpoint.

[Figure 4.2: diagram with the labels “point-to-point demand”, “Traffic level”, “Routing and restoration”, “F”, “Graph level”, “Component level”, and “Reliability level”; at the reliability level, two-state processes with states W and F and rates λ1, μ1 through λ4, μ4]
Fig. 4.2 The four-level network performability model used by the nperf performability analyzer.
F is the performance measure, discussed in Section 4.3.3

Transport Layer Nodes
A network node represents an IP router. At the component level this node expands
into a data plane, a control plane, a hardware and software upgrade component, and
a number of networking interfaces (line cards/ports). The data plane, or switching fabric, is responsible for routing packets, while the control plane computes
routing tables and processes other network signaling protocols, such as OSPF or
BGP. When a data plane component fails, all the links incident to its router fail.
When a control plane component fails, the router continues to switch packets, but
cannot participate in rerouting, including restoration. Failure of a port component
fails the corresponding link(s). The “upgrade” component represents the fact that,
periodically, the router is effectively down because it is undergoing an upgrade of
its hardware or software. (This is by no means a very sophisticated router reliability model, see Chapter 3, but exemplifies the performance-reliability trade-off
discussed in Section 4.1.) Finally, fix one of the above classes of components, say
router cards. At the reliability level we think of all these components as independent
copies of a continuous-time Markov process (see, e.g., [5] or [6]) with failure transition rate $\lambda$ and repair transition rate $\mu$, which may be specified in terms of MTBF
(mean time between failures, $= 1/\lambda$) and MTTR (mean time to repair, $= 1/\mu$).
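For such a two-state process, the long-run probability of being in the working state follows from the standard balance argument for a two-state continuous-time Markov process: it equals the repair rate divided by the sum of the two rates, or equivalently MTBF/(MTBF + MTTR). A minimal sketch, in which the function name and the numerical values are illustrative assumptions rather than data from the chapter:

```python
def steady_state_availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state probability that a two-state (working/failed) component works."""
    lam = 1.0 / mtbf_h  # failure transition rate
    mu = 1.0 / mttr_h   # repair transition rate
    return mu / (lam + mu)  # algebraically equal to MTBF / (MTBF + MTTR)


# Hypothetical router card: MTBF of 50,000 h and MTTR of 4 h.
p_working = steady_state_availability(50_000.0, 4.0)
print(f"p = {p_working:.6f}, q = {1.0 - p_working:.2e}")
```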

Transport Layer Links
A link between routers fails if either of the port components at its endpoints fails,
if a data plane of one of the endpoint nodes fails, or if a lower-layer component
over which the link is routed fails (e.g., a network span, discussed next). Two network nodes may be connected by multiple parallel links. These parallel links may
be grouped into a type of virtual link called a composite or bundled link, whose
capacity is the sum of the capacities of its constituent links. For the purposes of IP
routing, the routers see only a single bundled link. When a constituent link fails, the
capacity of the bundled link is reduced accordingly. A bundled link fails (or more
precisely is “taken out of service”) when the aggregate capacity of its non-failed
constituent links falls below a specified threshold.
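The bundled-link rule can be written down directly. The sketch below is an assumption-laden illustration: the function names, the four 10 Gb/s constituents, and the 20 Gb/s threshold are ours, and the link is treated as in service as long as the remaining capacity has not fallen below the threshold.

```python
def bundle_capacity(constituents, failed):
    """Aggregate capacity of the constituent links that have not failed.
    `failed` is a set of indices into `constituents`."""
    return sum(cap for i, cap in enumerate(constituents) if i not in failed)


def bundle_in_service(constituents, failed, threshold):
    """The bundle is taken out of service when capacity falls below the threshold."""
    return bundle_capacity(constituents, failed) >= threshold


links = [10.0, 10.0, 10.0, 10.0]  # hypothetical constituents, Gb/s
print(bundle_capacity(links, {0}), bundle_in_service(links, {0}, 20.0))
print(bundle_capacity(links, {0, 1, 2}), bundle_in_service(links, {0, 1, 2}, 20.0))
```

With one constituent down the routers still see a single link of reduced capacity; with three down the bundle as a whole is removed from routing.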

Physical Layer Spans
We use the term “span” to refer to the network equipment and media (e.g., optical
fiber) at the physical layer that carries the transport-layer links. Failure of a span
component affects all transport-layer links which are routed over this span. When
modeling an IP-over-optical layered network, the physical layer usually uses dense
wavelength division multiplexing (DWDM), and a span consists of a concatenation
of point-to-point DWDM systems called optical transport systems (OTS).⁴ In turn,
an OTS is composed of many elements, such as optical multiplexers/demultiplexers,
optical amplifiers, and optical transponders. Also, a basic constraint in commercial
transport networks is that a span is considered to be working only if both of its
directions are working. With this assumption, it is not difficult to compute the failure
probability of a span based on the failure probabilities of its individual elements in
both directions. Thus, for simplicity, we generally represent a network span by a
single “lumped” component whose MTBF and MTTR are calculated as explained
in [28].

⁴ There are more complex DWDM systems with various optically-transparent “add/drop” capabilities, which, for simplicity, we do not discuss here.

Other Types of Components
A set of fibers that is likely to fail together because they are contained in a single
conduit/bundle can be represented by a fiber cut component that brings down all network spans (hence all the higher IP-layer links) that include this fiber bundle. Other
types of catastrophic failures of sets of graph nodes and edges may be similarly
represented.
So far we have mentioned only binary components, i.e., with just two modes
of operation, “working” or “failed”. We discuss components with more than two
modes in Section 4.2.2.2.

4.2.2 More on the Graph and Component Levels
4.2.2.1 Graph Element Attributes
The graph is the level of the performability model at which the network routing and
restoration algorithms operate. Graph edges have associated capacities and (routing)
costs. In general, an edge’s capacity can be a vector, and this vector has a capacity
threshold associated with it, such that the edge is considered failed if the sum of the
capacities of its non-failed elements falls below the threshold. An edge with vector
capacity can directly represent a bundled link. The nperf performability analyzer
presented in Section 4.3 also allows many other attributes for edges, such as lengths,
latencies, etc., as well as operations on these attributes. These operations are covered
in Section 4.2.2.3.

4.2.2.2 Multi-Mode Components
Each component, representing an independent failure or degradation mechanism,
has a single working mode and an arbitrary number of failure modes. If it has a
single failure mode it is referred to as a “binary” component, otherwise it is called
“multi-mode”. In the nperf analyzer a component is represented by a star Markov
process, as shown in Fig. 4.3.
At the reliability level, the $i$th failure mode of a particular component is defined by its mean time between failures and its mean time to repair by setting
$\lambda_i = 1/\mathrm{MTBF}_i$ and $\mu_i = 1/\mathrm{MTTR}_i$.
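The mode probabilities implied by these rates follow from the balance equations of a star-shaped Markov process: in steady state, the flow from the working mode into failure mode i equals the flow back, so the probability of mode i is the working-mode probability multiplied by the ratio of the failure rate to the repair rate, which is MTTR_i/MTBF_i. The sketch below applies that standard result; the function name and the numbers are illustrative assumptions, and this is not presented as the computation performed inside nperf.

```python
def star_steady_state(mtbf, mttr):
    """Steady-state probabilities of a star Markov process.

    mtbf[i] and mttr[i] describe failure mode i.
    Returns (p_working, [p_mode_1, ..., p_mode_m]).
    """
    # Ratio lambda_i / mu_i for each failure mode, i.e. MTTR_i / MTBF_i.
    rho = [repair / between for between, repair in zip(mtbf, mttr)]
    p_working = 1.0 / (1.0 + sum(rho))
    return p_working, [r * p_working for r in rho]


# Hypothetical component with two failure modes (times in hours).
p_w, p_modes = star_steady_state(mtbf=[1000.0, 5000.0], mttr=[2.0, 10.0])
print(p_w, p_modes)
```

For a single failure mode the result reduces to MTBF/(MTBF + MTTR), the familiar availability of a binary component.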
We now give some examples of using multi-mode components in network
modeling.
Router Upgrades We mentioned in Section 4.2.1 (binary) software and hardware
upgrade components for routers. Now suppose that there is an intelligent network
maintenance policy in place, by which router upgrades are scheduled so that only
one router in the network undergoes a software or hardware upgrade at any time.

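[Figure 4.3: state diagrams of a multi-mode component (left: working mode w connected to failure modes f1, f2, ..., fm with rates λ1, μ1, λ2, μ2, ..., λm, μm) and of a binary component (right: modes w and f with rates λ, μ)]
Fig. 4.3 A multi-mode component with $m$ failure modes $f_1, \ldots, f_m$ (left), and the special case of
a binary component (right). The components are continuous-time Markov processes of the “star”
form. The $i$th mode is entered with (failure) rate $\lambda_i$ and exited with (repair) rate $\mu_i$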

This policy cannot be modeled by using binary upgrade components associated with
the routers because, by the independence assumption, there is nothing to prevent more than one of
them failing at a time. However, for an $n$-router network, the mutually exclusive
upgrade events can be represented by defining an $(n+1)$-mode component whose
mode 1 corresponds to no upgrades occurring anywhere in the network, and each of
the remaining $n$ modes corresponds to the upgrade of a single router.
Traffic Matrix Suppose we want to take into account daily variations in traffic
patterns/levels, e.g., for 60% of a typical day the traffic is represented by matrix
$T_1$, for 20% by matrix $T_2$, and for another 20% by matrix $T_3$. This can be done by
letting the traffic matrix be controlled by a multi-mode component whose modes
$w, f_1, f_2$ have probabilities 0.6, 0.2, 0.2, respectively, and set the traffic matrix
to $T_1, T_2, T_3$, respectively.
Restoration Figure 4.2 implicitly assumes that network restoration happens at only
one level. However, multi-mode components afford the capability to model restoration occurring at more than one network layer. The details of how this is done, using
the example of IP over SONET, can be found in [25].

4.2.2.3 Failure Mapping
Recall that failure of a binary component may affect a whole set of graph-level
elements: the spans of Section 4.2.1 are an example. More generally, when a multimode component enters one of its failure modes, the effect on a graph element is
to change some of the element’s attributes. For example, the capacity of an edge
may decrease, or a node may become unable to perform routing. Depending on
the final values of the attributes, e.g., total edge capacity $\le$ some threshold, the
graph element may be considered “failed”. We refer to the effects of the components on the graph as the component-to-graph-level failure mapping. Some of the
ways that a component can affect a graph element attribute are to add a constant
to it, subtract a constant from it, multiply it by a constant, or set its value to a
constant.
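A failure mapping of this kind can be pictured as a list of (attribute, operation, constant) triples attached to a component mode. The sketch below is a hypothetical illustration of the four operations named above; the data layout, the names `OPS`, `apply_failure_mapping` and `edge_failed`, and the numbers are our assumptions and do not describe the input format of nperf.

```python
import operator

# The four ways a component mode may change a graph element attribute.
OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "set": lambda old, const: const,
}


def apply_failure_mapping(attrs, effects):
    """Return a copy of `attrs` after applying (attribute, op, constant) effects."""
    out = dict(attrs)
    for attribute, op, const in effects:
        out[attribute] = OPS[op](out[attribute], const)
    return out


def edge_failed(attrs, threshold):
    """An edge is considered failed when its capacity is at or below the threshold."""
    return attrs["capacity"] <= threshold


edge = {"capacity": 40.0, "cost": 1}                    # hypothetical edge
half = apply_failure_mapping(edge, [("capacity", "multiply", 0.5)])
minus = apply_failure_mapping(edge, [("capacity", "subtract", 10.0)])
print(half, edge_failed(half, 20.0))
print(minus, edge_failed(minus, 20.0))
```

Returning a modified copy keeps the perfect-state attributes intact, so the same graph can be re-evaluated in many network states.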

4.3 The nperf Network Performability Analyzer
In this section, we begin by discussing how the general, i.e., not specific to networks,
performability evaluation problem can be defined mathematically, and then discuss
various aspects of this definition. We then review the so-called state generation approach to performability evaluation, and some basic ingredients of the performance
measures used when evaluating the performability of networks. We finally present
an outline of the nperf network performability analyzer, a tool developed in AT&T
Labs Research.
Useful background on performability in general is in [16] and in [32]. A more
extensive reference on the nperf analyzer itself and the material of this section
is [28].

4.3.1 The Performability Evaluation Problem
It is useful to understand the mathematical formulation of the network performability evaluation problem. Let $C = \{c_1, \ldots, c_n\}$ be a set of “components”, each
of which is either working or failed. (As already mentioned in Section 4.2.2, components can be in more than two states, called “modes” to distinguish them from
network states, but to simplify the exposition here we restrict ourselves to two-mode,
or “binary”, components.) Abstractly, a component represents a failure or degradation mechanism; examples were given in Section 4.2.1.
Component $c_i$ is in its working mode with probability $p_i$ and in its failed mode
with probability $q_i = 1 - p_i$, both assumed known. Our basic assumption is that
all components are independent of one another, so that, e.g., the probability that $c_i$
is down, $c_j$ is up, and $c_k$ is down is $q_i p_j q_k$. A network state is an assignment of
a mode to every component in $C$ and can be represented by a binary $n$-vector. The
set of all network states $S(C)$ has size $2^n$, and the probability of a particular state
is the product of the mode probabilities of the $n$ components. Let $F$ be a vector-valued performance measure (a function) defined on $S(C)$, mapping each state to an
$m$-tuple of real numbers; examples are given in Section 4.3.3.
The performability evaluation problem consists in computing the expected value
of the measure F over the set S.C/ of network states:
\[
\bar{F} = \sum_{s \in S(C)} F(s) \Pr(s). \tag{4.1}
\]

There are various points to note here.
Complexity It is well known that the exact evaluation of (4.1) is difficult, even
if $F$ is very simple. Intuitively this is because the size of the state space $S(C)$ is
exponential in the size of the set of components $C$. For a more precise demonstration
of the complexity, suppose that each component corresponds to an edge of a graph,
the graph’s nodes do not fail, and we want to know the probability that there is a
path between two specific nodes $a$ and $b$ of the graph. This is known as the TWO
TERMINAL NETWORK RELIABILITY evaluation problem, and in this case $F$ takes
only two values: $F(s)$ is 1 if there is a path from $a$ to $b$ in the graph state $s$, and
0 otherwise. Despite the very simple $F$, this problem is known to be #P-complete
(see e.g., [15, 32], or [8]). A consequence of this computational complexity is that,
in general, only approximate performability evaluation is feasible. We will return
to this in Section 4.3.2.
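For a graph small enough to enumerate, the expectation (4.1) with this two-valued F can be computed exactly by visiting every state. The sketch below does so by brute force. The four-edge graph with two disjoint two-hop paths between a and b, the working probability of 0.9 per edge, and all function names are assumptions chosen for illustration; the loop visits 2^n states, which is exactly what becomes infeasible for realistic networks.

```python
from itertools import product


def connected(edges_up, a, b):
    """Depth-first search from a over the working edges; True if b is reached."""
    adjacency = {}
    for u, v in edges_up:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, stack = {a}, [a]
    while stack:
        for nxt in adjacency.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return b in seen


def expected_measure(edges, p_work, F):
    """Evaluate (4.1) exactly: sum of F(s) * Pr(s) over all 2^n edge states."""
    total = 0.0
    for state in product((True, False), repeat=len(edges)):
        prob = 1.0
        for up, p in zip(state, p_work):
            prob *= p if up else 1.0 - p  # independence: product of mode probabilities
        edges_up = [e for e, up in zip(edges, state) if up]
        total += F(edges_up) * prob
    return total


# Hypothetical graph: two disjoint two-hop paths from a to b.
edges = [("a", "c"), ("c", "b"), ("a", "d"), ("d", "b")]
p_work = [0.9, 0.9, 0.9, 0.9]
reliability = expected_measure(
    edges, p_work, lambda up: 1.0 if connected(up, "a", "b") else 0.0
)
print(f"two-terminal reliability: {reliability:.4f}")
```

For this particular graph the result can be checked by hand, since each path works with probability 0.81 and the two paths are independent, giving 1 - (1 - 0.81)^2.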
Performability Guarantees In practice, we are interested in computing more
sophisticated characteristics of $F$ than its expectation $\bar{F}$, such as the probability,
over the set of network states, that $F$ is less than some number $x$, or greater than
some number $y$. For example, we may want to claim that “with probability at least
99.9%, at most 2% of the total traffic is down, and with probability at least 90% at
most 10% of it is down”. Formally, such claims are statements of the type
\[
\Pr(F < x_1) \ge P_1,\ \Pr(F < x_2) \ge P_2,\ \ldots, \quad \text{or} \quad
\Pr(F > y_1) \le Q_1,\ \Pr(F > y_2) \le Q_2,\ \ldots \tag{4.2}
\]

that hold over the entire network state space; they are known as performability
guarantees, and they can, for example, be used to set SLAs. The important point is
that the computation of (4.2) reduces easily to just the computation of expectations
of the type (4.1); see, e.g., [28].
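One way to see the reduction is that the probability that F lies below x is the expectation of an indicator measure that equals 1 on states where F is below x and 0 elsewhere, so it is a sum of exactly the form (4.1). The sketch below illustrates this by brute-force enumeration. The three components, their failure probabilities, the fraction of traffic each one takes down, and the additive loss model are all hypothetical.

```python
from itertools import product


def prob_F_below(q, loss, x):
    """Pr(F < x), computed as the expectation of the indicator of {F < x}.

    q[i]    : failure probability of component i (independent components)
    loss[i] : fraction of total traffic lost when component i is failed
    F(s)    : total lost fraction in state s, capped at 1 (assumed additive)
    """
    total = 0.0
    for state in product((0, 1), repeat=len(q)):  # 1 means "failed"
        prob, lost = 1.0, 0.0
        for failed, qi, li in zip(state, q, loss):
            prob *= qi if failed else 1.0 - qi
            lost += li if failed else 0.0
        indicator = 1.0 if min(lost, 1.0) < x else 0.0
        total += indicator * prob  # same form as the sum in (4.1)
    return total


q = [0.001, 0.002, 0.01]   # hypothetical failure probabilities
loss = [0.5, 0.05, 0.01]   # hypothetical traffic fractions
print(prob_F_below(q, loss, 0.02))  # probability that less than 2% is lost
print(prob_F_below(q, loss, 0.10))  # probability that less than 10% is lost
```

Comparing each computed probability with a target such as 0.999 then decides whether the corresponding guarantee can be claimed for this toy model.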
Network When we are using the formalism leading to (4.1) to evaluate the performability of a network, all the complexity is in the measure F . As Fig. 4.2 shows,
F then includes the failure mapping from the component to the graph level, the
routing and restoration algorithms, and the traffic level.
Time Recalling the reliability level of Fig. 4.2, each $c_i$ is in reality a two-state
Markov process, whose state fluctuates in time. If so, what is the meaning of the
expectation $\bar{F}$ of the measure $F$? It can be shown that if we average $F$ over a long
time as the network moves through its states, this average will approach $\bar{F}$, if we
take the probabilities $p_i$ and $q_i$ associated with $c_i$ to be the steady-state probabilities
of the working and failed states of the Markov process representing $c_i$.
Steady State The reader familiar with the performance analysis of Markov reward
models (see, e.g., [5, 11]) will recognize that the definition (4.1) of the performability evaluation problem is based on steady state expectations of measures. In many
cases, however, it is transient (also known as finite-time) measures that are of interest. The
evaluation of such measures on very large state spaces is much more difficult than
that of steady state measures, and outside the scope of the treatment in this chapter,
but it is currently an area of further development of the nperf tool.

4.3.2 State Generation and Bounds
A number of approaches to computing the expectation $\bar{F}$ in (4.1) approximately
have been developed. Without attempting to be comprehensive, they can be classified into (a) upper and lower bounds for certain $F$ such as connectivity (using the
notions of cut and path sets), or special network/graph structures (see [16, 32]), (b)
“most probable states” methods ([13, 14, 16, 17, 31–33]), (c) Monte Carlo sampling
approaches ([7, 16]), and (d) probabilistic approximation algorithms for simple $F$,
e.g., [18]. Methods of types (a) and (b) produce algebraic bounds on $\bar{F}$ (i.e., not
involving any random sampling), while (c) and (d) yield statistical bounds.
Here we will discuss the “most probable states” methods, which are algorithms
for generating network states in order of decreasing probability. The rationale is that
if the component failure probabilities are small, most of the probability mass is concentrated on a relatively small fraction of the state space. Thus, as these methods
generate states one by one and evaluate $F$ on them, they are attempting to update $\bar{F}$
with terms of highest value first. The most probable states methods are particularly
well suited to evaluating the performability of complex networks because they make
no assumptions (at least to first order) about what the performance measure F might
be or what properties it might have, which is especially important in view of the fact
that the complexity of network routing and restoration schemes is included in F .
The classical algorithms of [13, 33] apply to systems of only binary components,
whereas the algorithms of [14, 17, 30] can handle arbitrary multi-mode components.
nperf uses a hybrid state-generation algorithm described in [28], which handles arbitrary multi-mode components and is suited especially to “mostly binary” systems,
that is, systems where the proportion of components with more than two modes
is small. We find that such systems dominate performability models for practical
networks.
To explain what we mean by “at least to first order”, let $\omega$ and $\alpha$ be the
smallest and largest values of $F$ over $S(C)$, and suppose we generate the $k$ highest-probability elements of $S(C)$. If these states have total probability $P$, we have the
algebraic lower and upper bounds on $\bar{F}$

\[
\bar{F}_l = \sum_{i=1}^{k} F(s_i)\Pr(s_i) + (1-P)\,\omega, \qquad
\bar{F}_u = \sum_{i=1}^{k} F(s_i)\Pr(s_i) + (1-P)\,\alpha, \tag{4.3}
\]

first pointed out in [20]. The bounds (4.3) are valid for arbitrary $F$, but may sometimes require the generation of a large number of states to achieve a small enough
$\bar{F}_u - \bar{F}_l = (1-P)(\alpha-\omega)$. Tighter bounds are possible, but only by requiring $F$ to
have some special property, such as monotonicity, limited growth, etc. See [27] for
further details.
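The bounds (4.3) can be tried out on a toy system. The sketch below is not the state-generation algorithm of [13, 28, 33]: it assumes independent binary components, enumerates every state by brute force, and sorts by probability, which is only feasible for a handful of components. The failure probabilities, the measure (fraction of failed components), and all function names are hypothetical.

```python
from itertools import product


def state_probability(state, q):
    """Probability of a state; state[i] == 1 means component i has failed."""
    prob = 1.0
    for failed, qi in zip(state, q):
        prob *= qi if failed else (1.0 - qi)
    return prob


def most_probable_state_bounds(q, measure, k, f_min, f_max):
    """Lower and upper bounds of (4.3) from the k most probable states."""
    # Brute-force enumeration: acceptable for a toy, not for a real network.
    states = sorted(product((0, 1), repeat=len(q)),
                    key=lambda s: state_probability(s, q), reverse=True)
    top = states[:k]
    covered = sum(state_probability(s, q) for s in top)            # P in (4.3)
    partial = sum(measure(s) * state_probability(s, q) for s in top)
    lower = partial + (1.0 - covered) * f_min
    upper = partial + (1.0 - covered) * f_max
    return lower, upper, covered


def fraction_failed(state):
    """Toy measure F: fraction of components that have failed."""
    return sum(state) / len(state)


# Hypothetical failure probabilities for six independent components.
q = [0.01, 0.02, 0.005, 0.01, 0.03, 0.02]
lower, upper, covered = most_probable_state_bounds(
    q, fraction_failed, k=7, f_min=0.0, f_max=1.0)
exact = sum(q) / len(q)  # exact expectation of this measure, by linearity
print(f"{lower:.6f} <= {exact:.6f} <= {upper:.6f}, P = {covered:.6f}")
```

With small failure probabilities, the seven most probable states here are the failure-free state and the six single failures, which already cover most of the probability mass.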


4.3.3 Performance Measures
There are two measures of fundamental importance in network performability
analysis, both having to do with lost traffic. These are
\[
\begin{aligned}
t_{\mathrm{lnr}}(s) &= \text{total traffic lost because of no route in } s,\\
t_{\mathrm{lcg}}(s) &= \text{total traffic lost because of congestion in } s.
\end{aligned} \tag{4.4}
\]

(We do not mean to imply that these are the only measures of importance. Depending on the application, the focus may shift to considerations other than lost traffic,
e.g., to latency, or to many others.) To define terms, we refer to the IP-over-optical
example of Section 4.2.1. A demand corresponds to a source-destination pair of
routers; we use traffic to mean the size (volume) of a demand, or of a set of demands.
The definition of $t_{\mathrm{lnr}}$ is straightforward: a demand fails if a link (multi-edge) on
its route fails, and a failed demand is lost because of no route if no path for it can
be found after the network restoration process completes. $t_{\mathrm{lnr}}(s)$ is the sum of the
volumes of all lost demands in state $s$.
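A minimal sketch of this definition follows. It makes a simplifying assumption that is not in the text: restoration is idealized, so a demand survives whenever any path of working edges connects its endpoints. Edges are treated as undirected, and the graph, demand volumes, and function names are hypothetical.

```python
from collections import deque


def reachable(adjacency, src, dst):
    """Breadth-first search: is dst reachable from src over working edges?"""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def traffic_lost_no_route(edges, failed_edges, demands):
    """Sum the volumes of demands whose endpoints are disconnected."""
    adjacency = {}
    for u, v in edges:
        if (u, v) in failed_edges or (v, u) in failed_edges:
            continue  # failed edges are left out of the surviving graph
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return sum(volume for (src, dst), volume in demands.items()
               if not reachable(adjacency, src, dst))


# Hypothetical four-node network and demand volumes.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
demands = {("a", "d"): 10.0, ("a", "b"): 4.0, ("b", "c"): 2.5}

lost_cd = traffic_lost_no_route(edges, {("c", "d")}, demands)  # d is cut off
lost_ab = traffic_lost_no_route(edges, {("a", "b")}, demands)  # a-c-b survives
print(lost_cd, lost_ab)
```

A real restoration scheme may fail to find a path that exists in the surviving graph, so this idealization gives a lower bound on the traffic lost because of no route.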
Our definition of $t_{\mathrm{lcg}}$ is more involved.5 If the network routing allows congestion,
a demand is congested if its route includes an edge with utilization that exceeds
a threshold $U_c$. $t_{\mathrm{lcg}}$ is a certain function (not the sum) of all congested demands.
Suppose we fix a routing $R$ in state $s$; then we define $t_{\mathrm{lcg}}$ to be the total traffic
offered to the network minus the maximum possible total flow $F$ that can be carried
in state $s$ using routing $R$ without congestion. Here “there is congestion under $R$”
means “there is a (working) edge with utilization above the threshold $U_c$”. Equation
(4.5) formalizes this definition. Note that if the network uses flow control, such as
TCP in an IP network, the flow control will “throttle” traffic as soon as it detects
congestion, so that few packets will be really lost; in that case it is more accurate to
call our measure loss in network bandwidth. Now using the “link-path” formulation
[29], let $D$ be the set of all subdemands (path flows) and $D(e,R)$ be the set of
subdemands using the non-failed edge $e$ under the routing $R$. Also let $f_d$ be the
flow corresponding to subdemand $d$. Then $F$ is the solution of the linear program

\[
F = \max \sum_{d \in D} f_d \tag{4.5}
\]

subject to

\[
\forall e:\ \sum_{d \in D(e,R)} f_d \le U_c\, c_e, \qquad f_d \le v_d,
\]

where $c_e$ is the capacity of edge $e$ and $v_d$ the volume of demand $d$.
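The congestion test that underlies this definition can be sketched without a linear-programming solver. The code below only computes edge utilizations under a fixed routing and reports which demands are congested with respect to the threshold U_c; it does not solve (4.5), which would need an LP solver. Routes are given as node lists, edges as directed node pairs, and all names and values are hypothetical.

```python
def edge_utilizations(routes, volumes, capacities):
    """Utilization of each edge: carried volume divided by capacity."""
    load = {edge: 0.0 for edge in capacities}
    for demand, path in routes.items():
        for edge in zip(path, path[1:]):  # consecutive node pairs on the route
            load[edge] += volumes[demand]
    return {edge: load[edge] / capacities[edge] for edge in capacities}


def congested_demands(routes, volumes, capacities, u_c):
    """Demands whose route includes an edge with utilization above u_c."""
    util = edge_utilizations(routes, volumes, capacities)
    hot = {edge for edge, u in util.items() if u > u_c}
    return {demand for demand, path in routes.items()
            if any(edge in hot for edge in zip(path, path[1:]))}


# Hypothetical routing of three demands over two edges.
capacities = {("a", "b"): 10.0, ("b", "c"): 10.0}
routes = {"d1": ["a", "b", "c"], "d2": ["a", "b"], "d3": ["b", "c"]}
volumes = {"d1": 6.0, "d2": 5.0, "d3": 2.0}

util = edge_utilizations(routes, volumes, capacities)
congested = congested_demands(routes, volumes, capacities, u_c=0.9)
print(util, sorted(congested))
```

In this example edge (a, b) carries 11 units against a capacity of 10, so d1 and d2 are congested while d3 is not. How much of their traffic counts as lost is what the linear program (4.5) determines.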

5 This definition is by no means unique; we claim only that it is useful in a wide variety of contexts.


Consistent with what we noted in Section 4.3.1, the above discussion centered
around steady-state expectations of measures as the quantities of interest. In the
context of the case study in Section 4.4.2 we will touch on one interesting sub-class
of finite-time measures, event counts.

4.3.4 Network Routing and Restoration
The presence of network routing and restoration in the performance measure makes
the performability analysis of networks different from other such analyses. The
nperf analyzer incorporates three main kinds of network routing methods:
Uncapacitated Minimum-Cost This is meant to represent routing by, e.g., the
OSPF (Open Shortest Path First) protocol [24]. Link costs correspond to OSPF
administrative weights. OSPF path computation does not take into account the capacities or utilizations of the links. Another main IP-layer routing protocol, IS-IS
(Intermediate System–Intermediate System) behaves similarly for our purposes.
“Optimal” Routing This routing is based on multi-commodity flows ([2, 29]).
nperf incorporates both integral and non-integral (“real”) multi-commodity flow
methods. These methods could be regarded as representing variants of OSPF-TE.
Details are in [28].
Multicast Routing This type of routing sends the traffic originating from a source
node on a shortest-path tree rooted at this node and spanning a set of destination
nodes. The shortest paths to the destinations are determined by so-called reverse-path forwarding.
These routing methods are not meant to be emulations of real network protocols;
they include only the features of these protocols that are important for the kind of
analysis that nperf is aimed at. In particular, a lot of details associated with timing
and signaling are absent (another instance of the reliability/performance trade-off
noted in Section 4.1).
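To make the first of these routing methods concrete, the sketch below computes an uncapacitated minimum-cost path with Dijkstra's algorithm, using link weights in the role of OSPF administrative weights. It is an illustration only, not nperf's router: links are treated as undirected, equal-cost multipath is ignored, and the topology and function name are hypothetical.

```python
import heapq


def min_cost_path(links, src, dst):
    """Dijkstra over link weights; capacities and utilizations play no role.

    links maps an undirected node pair to its administrative weight.
    Returns (path as a node list, total cost), or (None, inf) if unreachable.
    """
    graph = {}
    for (u, v), weight in links.items():
        graph.setdefault(u, []).append((v, weight))
        graph.setdefault(v, []).append((u, weight))
    best, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, ()):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if dst not in best:
        return None, float("inf")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], best[dst]


# Hypothetical topology with administrative weights.
links = {("a", "b"): 1, ("b", "d"): 1, ("a", "c"): 1,
         ("c", "d"): 3, ("b", "c"): 1}
path, cost = min_cost_path(links, "a", "d")
print(path, cost)
```

Running the same computation from a multicast source to every destination yields a shortest-path tree of the kind used by the multicast routing method.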

4.3.5 Outline of the nperf Analyzer
With the above material in mind, Fig. 4.4 depicts the structure of the core of the
nperf tool. At the top we have the most probable state generation algorithms
of [13, 28, 33], mentioned in Section 4.3.2. The “routers” at the bottom of the
figure are the routing methods discussed in Section 4.3.4: “iMCF” corresponds
to integral multi-commodity flow, “rMCF” to non-integral (“real” or “fractional”)
multi-commodity flow, and “USP” to uncapacitated shortest paths.
The four-level network model is specified by a set of plain text files, listed in
Table 4.1.


[Figure 4.4: block diagram. State generation algorithms (YK, GC, Hybrid) feed the hierarchical network model, which consists of the reliability level R, component level C, failure map C → G, graph level G, failure map G → D, and demand (traffic) level D, and includes the definition of F = (f1, ..., fm). The iMCF, rMCF, USP, and multicast tree routers sit at the bottom. The output is the measure F = (f1, ..., fm), with bounds on Pr(fi ≤ xi).]
Fig. 4.4 Structure of the core nperf software
Table 4.1 Network model specification files
File(s)              Purpose
net.graph            Specifies the network graph (nodes and edges)
net.dmd, net.units   Specify the traffic demands, if the network has a traffic layer
net.comp             Specifies the network components and the C → G failure mapping
net.rel              Lists (MTBF, MTTR) pairs for the modes of the components
net.perf             Parameters for the performance measure(s)

The MTBFs for the components are typically obtained from a combination of
manufacturer data and in-house testing. The MTTRs are usually determined by
network maintenance policies, except for some special types of repairs, such as a
software reboot. (Of course, one always has the freedom to use hypothetical values
when performing a “what-if” analysis.) Uncertainties in the MTBFs and MTTRs
may be dealt with by repeating an analysis with different values of MTBFs and/or
MTTRs, and nperf has some facilities to ease this task. A more sophisticated


Table 4.2 Publicly available tools that have some relation to nperf. Web sites valid as of 2009
Tool         Description and web site
Ptolemy      Modeling and design of concurrent, real-time, embedded systems
             http://ptolemy.eecs.berkeley.edu/
TANGRAM II   Computer and communication system modeling
             http://www.land.ufrj.br/tools/tangram2/tangram2.html
Mobius       Model-based environment for validation of system reliability, availability,
             security, and performance
             http://www.mobius.uiuc.edu/
PRISM        Probabilistic model checker
             http://www.prismmodelchecker.org/
TOTEM        Toolbox for Traffic Engineering Methods
             http://totem.run.montefiore.ulg.ac.be/

alternative is to assign uncertainties (prior probability distributions) to the MTBFs
and MTTRs and propagate them to posterior distributions on $\bar{F}$ via a Bayesian analysis. However, this is outside the scope of this chapter.

4.3.6 Related Tools
Performance and reliability analyses of systems are vast areas with many ramifications. At this point there exist a number of tools that are, in one way or another,
related to some of what nperf does. Table 4.2 mentions some of the author’s favorites, all in the public domain; the interested reader may pursue them further.
Vis-a-vis these tools, the main distinguishing features of nperf are that it
is geared toward networks (hierarchical model, routing, restoration), and represents them by large numbers of relatively simple independent (noninteracting)
components.

4.4 Case Studies
We conclude by presenting two case studies that, among other things, illustrate the
application of the nperf tool. The first study is on a multicast network for IPTV
distribution, and the second involves choosing among a set of topologies for network
access.

4.4.1 An IPTV Distribution Network
In this study we analyzed a design for an IPTV distribution network similar to
the one discussed in [9], but with 65 nodes distributed across the continental US.


These nodes are called VHOs (Video Head Offices), and there is an additional node
called an SHO (Super Hub Office), which is the source of all the traffic. The traffic stream from the SHO is sent to the VHOs by multicast6 : when a node receives
a packet, it puts a copy of it on each of its outgoing links. Thus traffic flows on
the edges of a multicast tree rooted at the SHO, and each VHO is a node on this
tree. The tree forms a sub-network of the provider’s overall network. The multicast
sub-network uses two mechanisms to deal with failures:
• fast re-route: each edge of the tree has a pre-defined backup path, which uses
edges of the encompassing network that are not on the tree.
• tree re-computation: if a tree edge fails, and fast re-route is unable to protect it
because the backup path itself has also failed, a new tree is computed. This computation is done by so-called reverse path forwarding: each VHO computes a
shortest path from it to the SHO, and the SHO then sends packets along each
such path in the reverse direction.
The advantage of fast re-route (FRR) is that it takes much less time, milliseconds instead of seconds, than tree re-computation. Given a properly designed FRR
capability, an interesting feature of the multicast network from the viewpoint of performability analysis is that it essentially tolerates any single link failure.7 Therefore,
interesting behavior appears only under failures of higher multiplicity. Indeed, it
turns out that multiple failures can result in congestion: the backup paths for different links are not necessarily disjoint and so when FRR is used to bypass a whole set
of failed links, a particular network link belonging to more than one backup path
may receive traffic belonging to more than one flow. If the link capacity is such that
this causes congestion, the congestion will last until the failure is repaired, which
may take time of the order of hours. One way to deal with this problem is to compute
a new multicast tree after FRR is done, and to begin using this new tree as soon as
the computation is complete, as suggested in [9]. This retains the speed advantage
of FRR and limits the duration of any congestion to the tree re-computation time.
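The overlap effect described above can be checked mechanically. The sketch below assumes that every tree edge carries one copy of the multicast stream, that backup links carry no other traffic, and that two simultaneous tree-edge failures therefore place two copies on any link shared by their backup paths. The topology, rates, capacities, and function names are hypothetical and are not taken from the network studied here.

```python
from itertools import combinations


def shared_backup_links(backup_paths):
    """Map each pair of tree edges to the links their backup paths share."""
    overlaps = {}
    for (e1, p1), (e2, p2) in combinations(sorted(backup_paths.items()), 2):
        common = set(p1) & set(p2)
        if common:
            overlaps[(e1, e2)] = common
    return overlaps


def congested_double_failures(backup_paths, stream_rate, capacity, u_c=1.0):
    """Double failures for which a shared backup link exceeds u_c * capacity."""
    result = {}
    for pair, links in shared_backup_links(backup_paths).items():
        # Two failed tree edges put two copies of the stream on a shared link.
        hot = {link for link in links
               if 2 * stream_rate > u_c * capacity[link]}
        if hot:
            result[pair] = hot
    return result


# Hypothetical tree edges and their pre-defined backup paths.
backup_paths = {
    ("s", "v1"): [("s", "x"), ("x", "v1")],
    ("v1", "v2"): [("v1", "x"), ("x", "y"), ("y", "v2")],
    ("v2", "v3"): [("v2", "x"), ("x", "y"), ("y", "v3")],
}
capacity = {("x", "y"): 1.5}  # only shared links are looked up
risky = congested_double_failures(backup_paths, stream_rate=1.0,
                                  capacity=capacity)
print(risky)
```

A check of this kind only flags candidate double failures; weighting them by their probabilities is what the performability analysis adds.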
For this network, performance must be guaranteed for every VHO (worst case),
not just overall. So, in the terms of Section 4.3.3, the multicast performability measures are two 65-element vectors, one for loss due to no path and one for loss due to
congestion, whose elements are computed on each network state.
We now summarize some of the results of this study. An initial network design,
known as design A, was carried out by experienced network designers. Its performance, after normalizing the expectations of the measures by the total traffic and
converting the result to time per year,8 is shown in Fig. 4.5, top. Since this was a
well-designed network to begin with, its levels of traffic loss were quite low, better than “five 9s”. Within these low levels, Fig. 4.5 shows that the loss due to no
path, the tlnr of (4.4), is dominant for most VHOs, but some of them also exhibit
6 Specifically by Protocol Independent Multicast (PIM).
7 By “link” here we mean an edge at the graph level of the model of Fig. 4.2.
8 For example, a traffic loss of 0.01% of the total translates to 1/10,000 of a year, i.e., about
52 min/year.


[Figure 4.5: two bar charts, top and bottom, each with panel title “τOSPF = 1 sec, τFRR = 0.05 sec”, y-axis “time / yr.” (0 to 2.5), x-axis “VHO #” (5 to 65), and series “No path” and “Congestion”.]

Fig. 4.5 Expected lost traffic, expressed in time per year, because of no path and congestion in
design A (top), and in design C (bottom). These are the tlnr and tlcg defined in (4.4). Design C is
A with tuned OSPF weights. For the purposes of comparing the two designs, the time unit of the
y-axis is irrelevant

significant loss because of congestion (tlcg ). Even though the performability of this
network was entirely acceptable, we decided to see if the loss due to congestion
could be reduced. A detailed study of the network states generated by nperf that
led to congestion in Fig. 4.5 top, revealed that they were double and triple failures.
Further, we found that for VHOs 30 to 41 congestion could be practically eliminated
by tuning a certain set of OSPF link weights. The result, known as design C, performed as shown in Fig. 4.5 bottom. It can be seen that a lot of congestion-induced


losses were eliminated while the loss due to no path remained at the same level
throughout, and this was achieved without adding to the cost of the network design
at all. See [10] for more details on the subject of reliable IPTV network design.

4.4.2 Access Topology Choices
An issue that arose for a major Internet service provider was that traffic in its network was increasing, but the backbone routers had limited expansion capability
(numbers of slots in the chassis). To get around this limitation it was proposed to
introduce intermediate aggregation routers in the access part of the network, and
the question was how this would affect the reliability of the access.
The configuration of the provider’s backbone offices before the introduction of
aggregation routers is shown in Fig. 4.6 top, and is referred to as “base”; there is a
“local” variant in which all routers are located within a single office, and a “remote”
variant in which the routers are in different offices. In reality there are many access
routers connecting to a pair of backbone routers, but showing just one in Fig. 4.6 is
enough for our purposes. There were two proposals for introducing the aggregation
routers, called the “box” and the “butterfly” designs, shown in Fig. 4.6 middle and
bottom. These had local and remote variants as well. Further, there was a premium
“diverse” option in the butterfly remote design in which the links between a backbone router and its two aggregation routers were carried on two separate underlying
optical transport (DWDM) systems, instead of the same transport (the “common”
option).
It was clear that the box alternative was cheaper because of fewer links, but what
was the reduction in availability relative to the costlier butterfly design? Also, how
did either of these options compare with the existing base design? The failure modes
of interest in all these designs were network spans, router ports, and software failures
or procedural errors; these failure modes are depicted as components in Fig. 4.7.
The metric chosen to compare the availabilities of the various designs was the mean
time between access disconnections, i.e., situations where the access router A had
no path to any backbone router BB. Note that network restoration is immaterial for
such events.
nperf models for the designs of Fig. 4.6 were constructed; given the metric of
interest, the models did not include a traffic layer. Typical values for the reliability
attributes of the components were selected as in Chapter 3. At a high level, note that
the longer links between the aggregation and backbone routers in the remote designs
are less reliable than the corresponding links in the local designs. The results of the
study are summarized in Table 4.3.
The mean access disconnection times are separated into two categories, of which
“hardware” includes the first three types of components listed in Fig. 4.7. The most
notable result in Table 4.3 is that irrespective of the architecture, software and procedural errors are by far the dominant cause for access router isolations. These
events are the ones that cannot be helped by redundancy. The second most important

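[Figure 4.6: three rows of topology diagrams. Base: local and remote variants, each with backbone routers BB1 and BB2 and access router A. Box: local and remote variants, each with BB1, BB2, aggregation routers AG1 and AG2, and A. Butterfly: local, remote diverse, and remote common variants, each with BB1, BB2, AG1, AG2, and A.]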

Fig. 4.6 “Base”, “box”, and “butterfly” access configurations. Each has a “local” and a “remote”
version. The remote versions have routers spread among different offices (the enclosing blue
boxes). BB are backbone routers, AG are aggregation routers, and A is an access router

feature is that compared to the base case, the introduction of aggregators doubles the
risk of access router isolation due to software and procedural errors, again irrespective of the design. With respect to hardware failures in the local case, the box design
increases the risk of isolation by a factor of 3 compared to the base case, but the
butterfly design is just as good as the base. In the remote case, the box design is
about twice as bad as the base, but the butterfly is in fact better, by at least a factor
of 2.75.

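[Figure 4.7: the “base” topology (Z, BB1, BB2, A) and the “butterfly remote common” topology (Z, BB1, BB2, AG1, AG2, A), overlaid with the component types given in the legend: network equipment (DWDM) span; router port (module) pair; router port (module); software failure or procedural error.]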

Fig. 4.7 Components for the simplest “base” and most complex “butterfly remote common”
topologies. A component affects the edges or nodes which it overlaps in the diagram (the connection to the Z router is fictitious, representing the part of the network beyond the backbone
routers, which is common to all alternatives)

Table 4.3 Mean access disconnection time (years), i.e., time between disconnections of access router A from both backbone routers BB, for the access topologies
of Fig. 4.6
                      Hardware   Software & procedural error
Local
  Base                700        10
  Box                 232        5
  Butterfly           699        5
Remote
  Base                120        10
  Box                 61         5
  Butterfly diverse   676        5
  Butterfly common    329        5

Summarizing availability by reporting only means makes comparisons easy, but
hides information that is important in assessing the risk. By making the reasonable
assumption that the isolation events occur as a Poisson process whose mean times
between events are as specified in Table 4.3, we see that the 5-year mean implies that in a single
year exactly one isolation event occurs with probability about 16% and exactly two events with probability about 2%.
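These figures can be reproduced with a few lines of arithmetic. The sketch below assumes, as above, that events follow a Poisson process, so that the number of events in one year is Poisson distributed with mean equal to one over the mean time between events; the function name is hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


mean_years_between_events = 5.0                  # from Table 4.3
rate_per_year = 1.0 / mean_years_between_events  # expected events per year

p_one = poisson_pmf(1, rate_per_year)
p_two = poisson_pmf(2, rate_per_year)
print(f"P(1 event) = {p_one:.3f}, P(2 events) = {p_two:.3f}")
```

The same function applied to the 10-year mean of the base design shows how the risk in a single year changes when aggregation routers are introduced.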


4.4.3 Other Studies
Besides what was presented above, nperf has been used in a variety of other
studies: the performability of a backbone network under two different types of routing was analyzed in [3], the performability of a multimedia distribution network that
tolerates any single link failure was studied in [9, 10], two-layer IP-over-SONET
restoration in a satellite network was investigated in [25], and techniques for setting
thresholds for bundled links in an IP backbone network were studied in [26].

4.5 Conclusion
This chapter presents an overview of analyzing the combined performance and reliability, known as performability, of networks. Performability analysis may be thought
of as repeating a performance analysis in many different states (failures or degradations) of the network, and is thus much more difficult than either reliability or
performance analysis on its own. Successful analysis rests on finding a point on the
reliability–performance spectrum appropriate to the problem at hand. Our particular
approach to network performability analysis is based on a four-level hierarchical
network model, and on the nperf software tool, which embodies a number of
methods known in the literature together with some new techniques developed by us, and which is
under active development at AT&T Labs Research (finite-time measures, quality-of-service additions to the traffic layer, etc.). We illustrated the ideas of performability
analysis with two case studies carried out with nperf and gave references to
other studies in the literature.

References
1. Aven, T., & Jensen, U. (1999). Stochastic models in reliability. New York: Springer.
2. Ahuja, R., Magnanti, T., & Orlin, J. (1998). Network flows. Englewood Cliffs, NJ: Prentice-Hall.
3. Agrawal, G., Oikonomou, K. N., & Sinha, R. K. (2007). Network performability evaluation for
different routing schemes. Proceedings of the OFC. Anaheim, CA.
4. Alvarez, G., Uysal, M., & Merchant, A. (2001). Efficient verification of performability guarantees. In PMCCS-5: The fifth international workshop on performability modelling of computer
and communication systems. Erlangen, Germany.
5. Bolch, G., Greiner, S., de Meer, H., & Trivedi, K. S. (2006). Queueing networks and
Markov chains. New Jersey: Wiley.
6. Bremaud, P. (2008). Markov chains, Gibbs fields, Monte Carlo simulation, and queues.
New York: Springer.
7. Carlier, J., Li, Y., & Lutton, J. (1997). Reliability evaluation of large telecommunication networks. Discrete Applied Mathematics, 76(1–3), 61–80.
8. Colbourn, C. J. (1999). Reliability issues in telecommunications network planning. In B. Sansó
(Ed.), Telecommunications network planning. Boston: Kluwer.


9. Doverspike, R. D., Li, G., Oikonomou, K. N., Ramakrishnan, K. K., & Wang, D. (2007).
IP backbone design for multimedia distribution: architecture and performance. In Proceedings of the IEEE INFOCOM, Alaska.
10. Doverspike, R. D., Li, G., Oikonomou, K. N., Ramakrishnan, K. K., Sinha, R. K., Wang, D., &
Chase, C. (2009). Designing a reliable IPTV network. IEEE Internet Computing, 13(3), 15–22.
11. de Souza e Silva, E., & Gail, R. (2000). Transient solutions for Markov chains. In
W. K. Grassmann (Ed.), Computational probability. Kluwer, Boston.
12. Gomes, T. M. S., & Craveirinha, J. M. F. (1997). A case study of reliability analysis of a
multiexchange telecommunication network. In C. G. Soares (Ed.), Advances in safety and
reliability. Elsevier Science.
13. Gomes, T. M. S., & Craveirinha J. M. F. (April 1998). Algorithm for sequential generation of
states in failure-prone communication network. IEE proceedings-communications, 145(2).
14. Gomes, T., Craveirinha, J., & Martins, L. (2002). An efficient algorithm for sequential generation of failures in a network with multi-mode components. Reliability Engineering & System
Safety, 77, 111–119.
15. Garey, M., & Johnson, D. (1978). Computers and intractability: a guide to the theory of
NP-completeness. San Francisco, CA: Freeman.
16. Harms, D. D., Kraetzl, M., Colbourn, C. J., & Devitt, J. S. (1995). Network reliability: experiments with a symbolic algebra environment. Boca Raton, FL: CRC Press.
17. Jarvis, J. P., & Shier, D. R. (1996). An improved algorithm for approximating the performance
of stochastic flow networks. INFORMS Journal on Computing, 8(4).
18. Karger, D. (1995). A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. In Proceedings of the 27th ACM STOC.
19. Levendovszky, J., Jereb, L., Elek, Zs., & Vesztergombi, Gy. (2002). Adaptive statistical algorithms in network reliability analysis. Performance Evaluation, 48(1–4), 225–236.
20. Li, V. K., & Silvester, J. A. (1984). Performance analysis of networks with unreliable components. IEEE Transactions on Communications, 32, 1105–1110.
21. Levy, Y. & Wirth, P. E. (1989). A unifying approach to performance and reliability objectives.
In Teletraffic science for new cost-effective systems, networks and services, ITC-12. Elsevier
Science.
22. Mendiratta, V. B. (2001). A hierarchical modelling approach for analyzing the performability
of a telecommunications system. In PMCCS-5: the fifth international workshop on performability modelling of computer and communication systems.
23. Meyer, J. F. (1995). Performability evaluation: where it is and what lies ahead. In First IEEE
computer performance and dependability symposium (IPDS), pp 334–343. Erlangen, Germany.
24. Moy, J. T. (1998). OSPF: anatomy of an internet routing protocol. Reading, MA: Addison
Wesley.
25. Oikonomou, K. N., Ramakrishnan, K. K., Doverspike, R. D., Chiu, A., Martinez Heath, M., &
Sinha, R. K. (2007). Performability analysis of multi-layer restoration in a satellite network.
Managing traffic performance in converged networks, ITC 20 (LNCS 4516). Springer.
26. Oikonomou, K. N., & Sinha, R. K. (2008). Techniques for probabilistic multi-layer network
analysis. In Proceedings of the IEEE Globecom, New Orleans.
27. Oikonomou, K. N., & Sinha, R. K. (February 2009). Improved bounds for performability
evaluation algorithms using state generation. Performance Evaluation, 66(2).
28. Oikonomou, K. N., Sinha, R. K., & Doverspike, R. D. (2009). Multi-layer network performance and reliability analysis. The International Journal of Interdisciplinary Telecommunications & Networking (IJITN), 1(3).
29. Pióro, M., & Medhi, D. (2004). Routing, flow, and capacity design in communication and
computer networks. Morgan-Kaufmann.
30. Rauzy, A. (2005). An m log m algorithm to compute the most probable configurations of a
system with multi-mode independent components. IEEE Transactions on Reliability, 54(1),
156–158.


31. Shier, D. R., Bibelnieks, E., Jarvis, J. P., & Lakin, R. J. (1990). Algorithms for approximating
the performance of multimode systems. In Proceedings of IEEE Infocom.
32. Shier, D. R. (1991). Network reliability and algebraic structures. Oxford: Clarendon.
33. Yang, C. L., & Kubat, P. (1990). An algorithm for network reliability bounds. ORSA Journal
on Computing, 2(4), 336–345.

Chapter 5

Robust Network Planning
Matthew Roughan

5.1 Introduction
Building a network encompasses many tasks: from network planning to hardware
installation and configuration, to ongoing maintenance. In this chapter, we focus on
the process of network planning. It is possible (though not always wise) to design a
small network by eye, but automated techniques are needed for the design of large
networks. The complexity of such networks means that any “ad hoc” design will
suffer from unacceptable performance, reliability, and/or cost penalties.
Network planning involves a series of quantitative tasks: measuring the current
network traffic and the network itself; predicting future network demands; determining the optimal allocation of resources to meet a set of goals; and validating the
implementation. A simple example is capacity planning: deciding the future capacities of links in order to carry forecast traffic loads, while minimizing the network
cost. Other examples include traffic engineering (balancing loads across our existing network) and choosing the locations of Points-of-Presence (PoPs), though we do
not consider this latter problem in detail in this chapter because of its dependence
on economic and demographic concerns rather than those of networking.
Many academic papers about these topics focus on individual components of
network planning: for instance, how to make appropriate measurements, or on particular optimization algorithms. In contrast, in this chapter we will take a system
view. We will present each part as a component of a larger system of network planning. In the process of describing how the various components of network planning
interrelate, we observe several recurring themes:
1. Internet measurements are of varying quality. They are often imperfect or incomplete and can contain errors or ambiguities. Measurements should not be
taken at face value, but need to be continually recalibrated [48], so that we have

M. Roughan
School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
e-mail: matthew.roughan@adelaide.edu.au

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_5,
© Springer-Verlag London Limited 2010



some understanding of the errors, and can take them into account in subsequent
processing. We will describe common measurement strategies in Section 5.2.
2. Analysis and modeling of data can allow us to estimate and predict otherwise
unmeasurable quantities. However, in the words of Box and Draper, “Essentially, all models are wrong, but some are useful” [9]. We must be continually
concerned with the quality of model-based predictions. In particular, we must
consider where they apply, and the consequences of using an inaccurate model.
A number of key traffic models are described in Section 5.3, and their use in
prediction is described in Section 5.4.
3. Decisions based on quantitative data are at best as good as their input data, but
can be worse. The quality of input data and resulting predictions are variable,
and this can have consequences for the type of planning processes we can apply.
Numerical techniques that are sensitive to such errors are not suitable for network
engineering. Discussion of robust, quantitative network engineering is the main
consideration of Sections 5.5 and 5.6.
Noting all of the above, it should not be surprising that a robust design process
requires validation. The strategy of “set and forget” is not viable in today’s rapidly
changing networking environment. The errors in initial measurements, predictions,
and the possibility for mistakes in deployment mean that we need to test whether
the implementation of our plan has achieved our goals.
Moreover, actions taken at one level of operations may impact others. For example, Qiu et al. [51] noted that attempts to balance network loads by changing routing
can cause higher-layer adaptive mechanisms such as overlay networks to change
their decisions. These higher-level changes alter traffic, leading to a change of the
circumstances that originally led us to reroute traffic.
Thus, the process of measure → analyze/predict → control → validate should not
stop. Once we complete this process, the cycle begins again, with our validation
measurements feeding back into the process as the input for the next round of network planning, as illustrated in Fig. 5.1. This cycle allows our planning process to
correct problems, leading to a robust process.
In many ways this resembles the more formal feedback used in control systems,
though robust planning involves a range of tasks not typically modeled in formal
control theory. For instance, the lead times for deploying network components such
as new routers are still quite long. It can take months to install, configure, and test
new equipment when done methodically. Even customers ordering access facilities
can experience relatively long intervals from order to delivery, despite the obvious
benefits to both parties of a quick startup. So if our network plan is incorrect, we
cannot wait for the planning cycle to complete to redress the problem.

Fig. 5.1 Robust network planning is cyclic (measurement, analysis/prediction, decision/control)

We need processes where the cycle time is shorter. It is relatively simple to
reroute traffic across a network. It usually requires only small changes to router
configurations, and so can be done from day to day (or even faster if automated).
Rebalancing traffic load in the short term – in the interim before the network capacities can be physically changed – can alleviate congestion caused by failures of
traffic predictions. This process is called traffic engineering.
Another aspect of robust planning is incorporation of reliability analysis. Internet switches and routers fail from time to time, and must sometimes be removed
from service for maintenance. The links connecting routers are also susceptible to
failures, given their vulnerability to natural or man-made accident (the canonical
example is the careless back-hoe driver). Most network managers plan for the possibility of node or link failures by including redundant routers and links in their
network. A network failure typically results in traffic being rerouted using these redundant pathways. Often, however, network engineers do not plan for overloads that
might occur as a result of the rerouted traffic. Again, we need a robust planning process that takes into account the potential failure loads. We call this approach network
reliability analysis.
We organize this chapter around the key steps in network planning. We first
consider the standard network measurements that are available today. Their characteristics determine much of what we can accomplish in network planning. We then
consider models and predictions, and finally the processes used in making decisions and controlling our network. As noted, robust planning does not stop there:
we must continue to monitor our network. There are also a number of additional steps
we can perform in order to achieve a robust network plan, and we consider them in
the final section of this chapter.
The focus of this chapter is backbone networks. Though many of the techniques
described here remain applicable to access networks, there are a number of critical
differences. For instance, access network traffic is often very bursty, and this affects
the approaches we should adopt for prediction and capacity planning. Nevertheless,
the fundamental ideas of robust planning that we discuss here remain valid.

5.2 Standard Network Measurements
Internet measurements are considered in more detail in Chapters 10 and 11, but a
significant factor in network planning is the type of measurements available, and
so we need some planning-specific discussions. In principle, it is possible to collect
extremely good data, but in practice the measurements are often flawed, and the
nature of the flaws is important when considering how to use the data.
The traffic data we might like to collect is a packet trace, consisting of a record of
all packets on a subsection of a network along with timestamps. There are various
mechanisms for collecting such a trace, for instance, placing a splitter into an optical
fiber, using a monitor port on a router, or simply running tcpdump on one of the
hosts on a shared network segment. A packet trace gives us all of the information we
could possibly need but is prohibitively expensive at the scale we require for planning. The problem with a packet trace (apart from the cost of installing dedicated
devices) is that the amount of data involved can be enormous: for example, on an
OC48 (2.5 Gbps) link, one might collect more than a terabyte of data per hour. More
importantly, a packet trace is overkill. For planning we do not need such detail, but
we do need good coverage of the whole network. Packet traces are only used on
lower speed networks, or for specific studies of larger networks.
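As a rough check on the volume quoted above, the following sketch computes how much data a full packet trace would produce in an hour. It assumes (our assumption, not a measured value) that the link runs at a given fraction of its nominal rate and that every byte is captured; the function name is illustrative only.

```python
# Back-of-envelope volume of a full packet trace on one link.
# Assumption: every byte on the link is captured, with no per-packet overhead.

def trace_bytes_per_hour(link_bps: float, utilization: float = 1.0) -> float:
    """Bytes captured in one hour on a link of the given bit rate."""
    return link_bps * utilization / 8 * 3600

oc48_bps = 2.5e9  # nominal OC48 rate quoted in the text
full = trace_bytes_per_hour(oc48_bps)
half = trace_bytes_per_hour(oc48_bps, utilization=0.5)
print(f"full load: {full / 1e12:.3f} TB/hour")
print(f"50% load:  {half / 1e12:.3f} TB/hour")
```

Even at partial load the volume per link is of the order of a terabyte per hour, which is why the data reduction techniques that follow are needed for network-wide coverage.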
There are several approaches we can use to reduce data to a more manageable
amount. Filtering, so that we view only a segment of the traffic (say the HTTP
traffic) is useful for some tasks, but not planning. A more useful approach is aggregation, where we only store records for some aggregated version of the traffic,
thereby reducing the number of such records needed. A common form of aggregation is at the flow level, where we aggregate the traffic through some common
characteristics. The definition of “flow” depends on the keys used for aggregation,
but we mean here flows aggregated by the five-tuple formed from IP source and
destination address, source and destination TCP/UDP port numbers, and protocol number. Flow data is typically
collected within some time frame, for instance, 15 min periods. What is more, flow-level collection is often a feature of a router, and so does not require additional
measurement infrastructure other than the Network Management Station (NMS) at
which the data is stored. However, the volume of data can still be large (one network under study collected 500 GB of data per day), and the collection process may
impact the performance of the router.
As a result, flow-level data is often collected in conjunction with a third method
for data reduction: sampling. Sampling can be used both before the flows are created, and afterward. Prior to flow aggregation, sampling is used at rates of around
1:100–1:500 packets. That is, at most 1% of packets are sampled. This has the
advantage that less processing is required to construct flow records (reducing the
load on the router collecting the flows) and typically fewer flow records will be
created (reducing memory and data transmission requirements). However, sampling
prior to flow aggregation does have flaws: most obviously, it biases the data collection toward long flows. These flows (involving many packets) are much more
likely to be sampled than short flows. However, this has rarely been seen as a problem in network planning where we are not typically concerned with the flow length
distribution.
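The combination of packet sampling and five-tuple aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any particular router: packets are sampled independently at random with probability 1/`sample_rate`, byte counts are scaled up by `sample_rate` to estimate the original volume, and the packet tuples and addresses are invented for the example.

```python
import random
from collections import defaultdict

def sampled_flows(packets, sample_rate=100, seed=0):
    """Aggregate packets into five-tuple flows after 1-in-sample_rate random sampling.

    packets: iterable of (src, dst, sport, dport, proto, nbytes).
    Returns {five_tuple: [sampled_packets, estimated_bytes]}; the byte count
    is scaled by sample_rate to estimate the unsampled volume.
    """
    rng = random.Random(seed)
    flows = defaultdict(lambda: [0, 0])
    for src, dst, sport, dport, proto, nbytes in packets:
        if rng.random() < 1.0 / sample_rate:  # keep roughly 1 packet in sample_rate
            rec = flows[(src, dst, sport, dport, proto)]
            rec[0] += 1
            rec[1] += nbytes * sample_rate
    return dict(flows)

# Hypothetical traffic: one long flow of 10,000 packets and 1,000 two-packet flows.
long_key = ("10.0.0.1", "10.0.1.1", 40000, 80, 6)
packets = [long_key + (1500,)] * 10_000
packets += [("10.0.0.2", "10.0.1.1", 1024 + k, 80, 6, 1500)
            for k in range(1000) for _ in range(2)]

flows = sampled_flows(packets, sample_rate=100)
short_seen = sum(1 for key in flows if key != long_key)
print("long flow estimated bytes:", flows[long_key][1])
print("short flows observed:", short_seen, "of 1000")
```

The long flow is almost certain to appear in the output with a volume estimate close to its true size, while most of the short flows are never seen. This is the bias toward long flows noted above; it distorts the flow-length distribution but has little effect on the total volumes used in planning.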
Sampling can also be used after flow aggregation to reduce the transmission and
storage requirements for such data. The degree of sampling depends on the desired trade-off between accuracy of measurements, and storage requirements for the
data. Good statistical approaches for this sampling, and for estimating the resulting
accuracy of the samples are available [16,17], though, as noted above, these are predominantly aimed at preserving details such as flow-length distributions, which are
largely inconsequential for the type of planning discussed here, so sampling prior to
flow construction is often sufficient for planning.


Of more importance here is the fact that any type of sampling introduces errors
into measurements. Any large-scale flow archives must involve significant sampling,
and so will contain errors.
An alternative to flow-level data is data collected via the Simple Network Management Protocol (SNMP) [39]. Its advantage over flow-level data collection is that
it is more widely supported, and less vendor specific. However, the data provided is
less detailed. SNMP allows an NMS to poll MIBs (Management Information Bases)
at routers. Routers maintain a number of counters in these MIBs. The widely supported MIB-II contains counters of the number of packets and bytes transmitted and
received at each interface of a router. In effect, we can see the traffic on each link
of a network. In contrast to flow-level data, SNMP can only see link volumes, not
where the traffic is going.
SNMP has a number of other issues with regard to data collection. The polling
mechanism typically uses UDP (the User Datagram Protocol), and SNMP agents
are given low priority at routers. Hence SNMP measurements are not reliable, and
it is difficult to ensure that we obtain uniformly sampled time series. The result is
missing and error-prone data.
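Because the counters are cumulative, a traffic rate is obtained by differencing successive polls, and irregular or missing polls change the interval over which the rate is averaged. The sketch below illustrates one way to do this. It assumes 32-bit octet counters that wrap at most once between polls, and the poll values are invented for the example; it is not the method of any specific NMS.

```python
def rates_from_counters(polls, counter_bits=32):
    """Turn (timestamp_seconds, octet_counter) polls into average bit rates.

    Returns a list of (start, end, bits_per_second) for consecutive polls.
    Assumes at most one counter wrap between polls; a missed poll simply
    yields a longer averaging interval rather than a gap.
    """
    modulus = 2 ** counter_bits
    rates = []
    for (t0, c0), (t1, c1) in zip(polls, polls[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order responses
        delta = (c1 - c0) % modulus  # corrects for a single wrap of the counter
        rates.append((t0, t1, 8 * delta / (t1 - t0)))
    return rates

# Hypothetical polls: the poll due at 600 s was lost, and the next one arrived
# late (at 905 s) after the counter had wrapped past 2**32.
polls = [(0, 4_000_000_000), (300, 4_150_000_000), (905, 5_032_704)]
for start, end, bps in rates_from_counters(polls):
    print(f"{start:4d}-{end:4d} s: {bps / 1e6:.2f} Mbps")
```

The second interval is about twice as long as the first, so its rate is an average over roughly 10 min rather than 5. Any later analysis needs to account for such non-uniform sampling.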
Flow-level data contains only flow start and stop times, not details of packet arrivals, and typically SNMP is collected at 5-min intervals. The limit on timescale
of both data sets is important in network planning. We can only see average traffic
rates over these periods, not the variations inside these intervals. However, congestion and subsequent packet loss often occur on much shorter timescales. The result
is that such average measurements must always be used with care. Typically some
overbuild of capacity is required to account for the sub-interval variations in traffic. The exact overbuild will depend on the network in question, and has typically
been derived empirically through ongoing performance and traffic measurements.
Values are usually fairly conservative in major backbones resulting in apparent underutilization (though this term is unfair as it concerns average utilizations not peak
loads), and more aggressive in smaller networks.
In addition to traffic data, network planning requires a detailed view of any existing network. We need to know
• The (layer 3) topology (the locations of, and the links between, routers)
• The network routing policies (for instance, link weights in a shortest-path protocol, areas in protocols such as OSPF, and BGP policies where multiple interdomain links exist)
• The mapping between current layer 3 links and physical facilities (WDM equipment and optical fibers), and the details of the available physical network facilities and their associated costs
The topology and routing data is principally needed to allow us to map traffic to
links. The mapping is usually expressed through the routing matrix. Formally, $A = \{A_{ir}\}$ is the matrix defined by

$$A_{ir} = \begin{cases} F_{ir}, & \text{if traffic for } r \text{ traverses link } i \\ 0, & \text{otherwise,} \end{cases} \tag{5.1}$$

where $F_{ir}$ is the fraction of traffic from source/destination pair $r = (s, d)$ that
traverses link $i$. A network with $N$ nodes and $L$ links will have an $L \times N(N-1)$
routing matrix.
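To make the definition concrete, the following sketch builds the routing matrix for a small, hypothetical three-node line network and uses it to map demands onto links. The topology, paths, and demand values are invented for illustration, and single-path routing is assumed, so every nonzero $F_{ir}$ equals 1. The product $y = Ax$ for link loads is the usual way such a matrix is applied.

```python
import numpy as np

nodes = ["A", "B", "C"]
links = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")]  # directed links
pairs = [(s, d) for s in nodes for d in nodes if s != d]   # N(N-1) = 6 pairs

# Hypothetical single-path routes for the line topology A - B - C.
paths = {
    ("A", "B"): ["A", "B"],
    ("A", "C"): ["A", "B", "C"],
    ("B", "A"): ["B", "A"],
    ("B", "C"): ["B", "C"],
    ("C", "A"): ["C", "B", "A"],
    ("C", "B"): ["C", "B"],
}

def routing_matrix(links, pairs, paths):
    """Return the L x N(N-1) matrix with A[i, r] = 1 if pair r crosses link i."""
    A = np.zeros((len(links), len(pairs)))
    link_index = {link: i for i, link in enumerate(links)}
    for r, pair in enumerate(pairs):
        hops = paths[pair]
        for u, v in zip(hops, hops[1:]):
            A[link_index[(u, v)], r] = 1.0  # F_ir = 1 under single-path routing
    return A

A = routing_matrix(links, pairs, paths)
x = np.array([10.0, 5.0, 8.0, 3.0, 4.0, 6.0])  # demand per pair, in the order of `pairs`
y = A @ x                                       # resulting load on each link
for link, load in zip(links, y):
    print(link, load)
```

With equal-cost multipath routing the entries would instead be fractions between 0 and 1, but the mapping from demands to link loads keeps the same form.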
Network data is also used to assess how changes in one component will affect the
network (e.g., how changes in OSPF link weights will impact link loads); determine
shared risk-of-failure between links; and determine how to improve our network
incrementally without completely rebuilding it in each planning cycle. The latter is
an important point because although it might be preferable to rebuild a network from
scratch, the capital value of legacy equipment usually prevents this option, except at
rare intervals.
For a small, static network, the network data may be maintained in a database;
however, best practice for large, complex, or dynamic networks is to use tools to
extract the network structure directly from the network. There are several methods available for discovering this information. SNMP can provide this information
through the use of various vendor tools (e.g., HP OpenView or Cisco NCM), but it is
not the most efficient approach. A preferable approach for finding layer 3 information is to parse the configuration files of routers directly, for instance, as described in
[22,24]. The technique has been applied in a number of networks [5,38]. The advantages of using configuration files are manifold. The detail of information available
is unparalleled in other data sources. For instance, we can see details of the links
(such as their composition should a single logical link be composed of more than
one physical link).
The other major approach for garnering topology and routing information is to
use a route monitor. Internet routing is built on top of distributed computations
supported by routing protocols. The distribution of these protocols is often considered a critical component in ensuring reliability of the protocols in the face of
network failures. The distribution also introduces a hook for topology discovery. If
any router must be able to build its routing table from the routing information distributed through these protocols, then it must have considerable information about
the network topology. Hence, we can place a dummy router into the network to collect such information. Such routing monitors have been deployed widely over the
last few years. Their advantage is that they can provide an up-to-date dynamic view.
Examples of such monitors exist for OSPF [61, 62], and IS-IS [1, 30], as well as for
BGP (the Border Gateway Protocol) [2, 3].

5.3 Analysis and Modeling of Internet Traffic
5.3.1 Traffic Matrices
We will now consider the analysis and modeling of Internet data, in particular, traffic
data. When considering inputs to network planning, we frequently return to the topic
of traffic matrices. These are the measurements needed for many network planning
tasks, and thus the natural structure around which we shall frame our analysis.


A Traffic Matrix (TM) describes the amount of traffic (the number of packets or
more commonly bytes) transmitted from one point in a network to another during
some time interval. TMs are naturally represented by a three-dimensional data
structure $T_t(i, j)$, which represents the traffic volume (in bytes or packets) from $i$ to
$j$ during a time interval $[t, t + \Delta t)$. The locations $i$ and $j$ are generally considered
to be physical geographic locations, making $i$ and $j$ spatial variables. However, in
the Internet, it is common to associate $i$ and $j$ with logical structures related to the
address structure of the Internet, i.e., IP addresses, or natural groupings of such by
common prefix corresponding to a subnet.
Origin/Destination Matrices One natural approach to describe traffic matrices is
with respect to traffic volumes between IP addresses or prefixes. We refer to this
as an origin/destination TM because the IP addresses represent the closest approximation we have for the end points of the network (though HTTP-proxies, firewalls,
and NAT and other middle-boxes may be obscuring the true end-to-end semantics).
IPv4 admits nearly $2^{32}$ potential addresses, so we cannot describe the full matrix
at this level of granularity. Typically, such a traffic matrix would be aggregated into
blocks of IP addresses (often using routing prefixes to form the blocks as these are
natural units for the control of traffic). The origin/destination matrix is our ideal
input for many network planning tasks, but the Internet is made up of many connected networks. Any one network operator only sees the traffic carried by its own
network. This reduced visibility means that our observed traffic matrix is only a
segment of the real network traffic. So we can’t really observe the origin/destination
TM. Instead we typically observe the ingress/egress traffic matrix.
Ingress/Egress versus Origin/Destination A more practical TM, the ingress/
egress TM provides traffic volumes from ingress link to egress link across a single network. Note that networks often interconnect at multiple points. The choice
of which route to use for egress from a network can profoundly change the nature of ingress/egress TMs, so these may have quite different properties to the
origin/destination matrix. Forming an ingress/egress TM from an origin/destination
TM involves a simple mapping of prefixes to ingress/egress locations in a network,
but in practice this mapping can be difficult unless we monitor traffic as it enters
the network. We can infer egress points of traffic using the routing data described
above, but inferring ingress is more difficult [22, 23], so it is better to measure this
directly.
Spatial Granularity of Traffic Matrices As we have started to see with origin/destination traffic matrices, we can measure them at various levels of granularity
(or resolution). The same is true of ingress/egress TMs. At the finest level, we measure traffic per ingress/egress link (or interface). However, it is common to aggregate
this data to the ingress/egress router. We can often group routers into larger subgroups. A common such group is a Point-of-Presence (PoP), though there are other
sub- and super-groupings (e.g., topologically equivalent edge routers are sometimes
grouped, or we may form a regional group). Given subsets $S$ and $D$ of locations,
we may simply aggregate a TM across these by taking

$$T_t(S, D) = \sum_{i \in S} \sum_{j \in D} T_t(i, j). \tag{5.2}$$

Typical large networks might have 10s of PoPs, and 100s of routers, and so such
TMs are of a more workable size. In addition, as we aggregate traffic into larger
groupings, statistical multiplexing reduces the relative variance of the traffic and
allows us to perform better estimates of traffic properties such as the mean and
variance.
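The aggregation in (5.2) is straightforward to apply to a measured matrix. The sketch below aggregates a router-level TM for a single time interval to PoP level. The matrix values and the assignment of routers to PoPs are hypothetical, and the function name is ours.

```python
import numpy as np

def aggregate_tm(tm, groups):
    """Aggregate a TM over groups of locations: T(S, D) = sum_{i in S} sum_{j in D} T(i, j).

    tm: square array of location-to-location volumes for one time interval.
    groups: list of lists of location indices, one list per group (e.g., per PoP).
    """
    tm = np.asarray(tm, dtype=float)
    out = np.zeros((len(groups), len(groups)))
    for a, S in enumerate(groups):
        for b, D in enumerate(groups):
            out[a, b] = tm[np.ix_(S, D)].sum()  # sub-block of rows S and columns D
    return out

# Hypothetical router-to-router TM for four routers in two PoPs.
router_tm = np.array([
    [0, 1, 2, 3],
    [1, 0, 4, 5],
    [2, 4, 0, 6],
    [3, 5, 6, 0],
])
pops = [[0, 1], [2, 3]]
pop_tm = aggregate_tm(router_tm, pops)
print(pop_tm)
```

Note that the diagonal of the aggregated matrix holds traffic exchanged between routers within the same PoP, and that the total volume is preserved when the groups partition the routers.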
Temporal Granularity of Traffic Matrices We cannot make instantaneous measurements of a traffic matrix. All such observations occur over some time interval
$[t, t + \Delta t)$. It would be useful to make the interval $\Delta t$ smaller (for instance, for
detecting anomalies), but typically we face a trade-off against the errors and uncertainties in our measurements. A longer time interval allows more “averaging-out”
of errors, and minimizes the impact of missing data. The best choice of time interval
for TMs is typically determined by the task at hand, and the network under study,
but a common choice is a 1 hour interval. In addition to being easily understood by
human operators, this interval integrates enough SNMP or flow-level data to reduce
the impact of (typical) missing data and errors, while allowing us to still observe
important diurnal patterns in the traffic.

5.3.2 Patterns in Traffic
It is useful to have some understanding of the typical patterns we see in network
traffic. Such patterns are only visible at a reasonable level of aggregation (otherwise
random temporal variation dominates our view of the traffic), but for high degrees
of aggregation (such as router-to-router traffic matrices on a large backbone network) the pattern can be very regular. There are two main types of patterns that have
been observed: patterns across time, and patterns in the spatial structure. Each is
discussed below.
Temporal Patterns Internet traffic has been observed to follow both daily (diurnal)
and weekly cycles [33–35,57,64]. The origin of these cycles is quite intuitive. They
arise because most Internet traffic is currently generated by humans whose activities follow such cycles. Typical examples are shown in Figs. 5.2 and 5.3. Figure 5.2
shows an RRDTool graph¹ of the traffic on a link of the Australian Academic Research Network (AARNet). Figure 5.3 shows the total traffic entering AT&T’s North
American backbone network at a Point of Presence (PoP) over two consecutive
weeks in May 2001. The figure illustrates the daily and weekly variations in the
traffic by overlaying the traffic from the 2 weeks. The striking similarity between
traffic patterns from week to week is a reflection of the high level of aggregation
that we see in a major backbone network.

¹ RRDTool (the Round Robin Database tool) [47] and its predecessor MRTG (the Multi-Router
Traffic Grapher [46]) are perhaps the most common tools for collecting and displaying SNMP
traffic data.

Fig. 5.2 Traffic on one link in the Australian Academic Research Network (AARNet) for just over
1 week. The two curves show traffic in either direction along the link. [Plot: bits per second, 0 to 20.4 M, against day of week, Sat to Sun]

Fig. 5.3 Total traffic into a region over 2 consecutive weeks. The solid line is the first week’s data
(starting on May 7), and the dashed line shows the second week’s data. The second figure zooms
in on the shaded region of the first. [Panels: “Traffic: 07-May-2001 (GMT)”, traffic rate against day of week, Mon to Mon; “Traffic: 08-May-2001 (GMT)”, traffic rate against time (GMT), 09:00 to 09:00]

The observation of cycles in traffic is not new. For many years they have been
seen in telephony [13]. Typically telephone service capacity planning has been
based on a “busy hour”, i.e., the hour of the day that has the highest traffic. The
time of the busy hour depends on the application and customer base. Access networks typically have many domestic consumers, and consequently their busy hour
is in the evening when people are at home. On the other hand, the busy hour of business customers is typically during the day. Obviously, time-zones have an effect on
the structure of the diurnal cycle in traffic, and so networks with a wide geographic
dispersion may experience different busy hours on different parts of their network.
In addition to cyclical patterns, Internet traffic has shown strong growth over
many years [45]. This long-term trend has often been approximated by exponential
growth, although care must be taken because sometimes such estimates have been
based on poor (short or erratic) data [45]. Long-term trends should be estimated
from multiple years of carefully collected data.

Fig. 5.4 ABS traffic measurements showing Australian Internet traffic, with an exponential fit to
the data from 2000 to 2005. Data is shown by ‘o’, and the fit by the straight line. Note that the line
continuing past 2005 is a prediction based on the pre-2005 data, showing also the 95th percentile
confidence bounds for the predictions. [Plot: traffic (PB/quarter) on a logarithmic axis, $10^0$ to $10^2$, against year, 2000 to 2009]

One public example is the data collected by the Australian Bureau of Statistics
(ABS),² which has gathered historical data on Australian ISP traffic for many years.
Figure 5.4 shows Australia’s network traffic in petabytes per quarter with a log-y
axis. Exponential growth appears as a straight line on the log-graph, so we can obtain simple predictions of traffic growth through linear regression. The figure shows
such a prediction based on pre-2005 data. It is interesting to note that the most recent
data point does not, as one might assume without analysis, represent a significant
drop in traffic growth. Relative to the long-term data the last point simply represents
a reversion to the long-term growth from rather exceptional traffic volumes over
2007. We will discuss such prediction in more detail in the following sections.
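The log-linear fit described above can be sketched in a few lines. The data here is synthetic (an assumed growth rate of 0.15 per quarter with multiplicative noise), not the ABS measurements, and the sketch produces only point forecasts. It does not compute the confidence bounds shown in Fig. 5.4.

```python
import numpy as np

def fit_exponential(t, volume):
    """Least-squares fit of log(volume) = log(v0) + rate * t; returns (v0, rate)."""
    rate, log_v0 = np.polyfit(t, np.log(volume), 1)  # slope first, then intercept
    return float(np.exp(log_v0)), float(rate)

def predict(v0, rate, t):
    """Extrapolate the fitted exponential trend to times t."""
    return v0 * np.exp(rate * np.asarray(t, dtype=float))

# Synthetic quarterly volumes: assumed exponential growth with 5% log-normal noise.
rng = np.random.default_rng(0)
quarters = np.arange(24)
true_v0, true_rate = 1.0, 0.15
observed = true_v0 * np.exp(true_rate * quarters) * np.exp(rng.normal(0.0, 0.05, quarters.size))

v0, rate = fit_exponential(quarters, observed)
forecast = predict(v0, rate, [24, 28])   # one quarter and one year past the data
doubling_time = np.log(2) / rate         # in quarters
print(f"rate per quarter: {rate:.3f}, doubling time: {doubling_time:.1f} quarters")
print("forecast:", forecast)
```

Fitting in the log domain treats errors as multiplicative, which suits data spanning orders of magnitude. As the text cautions, a fit over a short or erratic series can give a badly misleading growth rate.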
Standard time-series analysis [10] can be used to build a model of traffic containing long-term trends, cyclical components (often called seasonal components in
other contexts), and random fluctuations. We will use the following notation here:
$$S(t) = \text{seasonal (cyclical) component}, \tag{5.3}$$

$$L(t) = \text{long-term trend}, \tag{5.4}$$

$$W(t) = \text{random fluctuations}. \tag{5.5}$$

² www.abs.gov.au

The seasonal component is periodic, i.e., $S(t + kT_S) = S(t)$, for all integers $k$,
where $T_S$ is the period (which is either 24 hours or 1 week). Before we can consider how to estimate the seasonal (and trend) components of the traffic, we must
model these components.³ At the most basic level, consider the traffic to consist
of two components, a time-varying (but deterministic) mean $m(t)$ and a stochastic
component $W(t)$. At this level we could construct the traffic by addition or multiplication of these components (both methods are used in econometric and census
data). However, in traffic data, a more appropriate model [43, 56] is

$$x(t) = m(t) + \sqrt{a\,m(t)}\,W(t), \tag{5.6}$$

where $a$ is called the peakedness of the traffic, $W(t)$ is a stochastic process with
zero mean and unit variance, and $x(t)$ represents the average rate of some traffic
(say a particular traffic matrix element) at time $t$. More highly aggregated traffic is
smoother, and consequently would have a smaller value for $a$. The reason for this
choice of model lies in the way network traffic behaves when aggregated. When
multiple flows are aggregated onto a non-congested link, we should expect them to
obey the same model (though perhaps with different parameters). Our model has
this property: for instance, take $N$ traffic streams $x_i$ with mean $m_i$, peakedness $a_i$,
and stochastic components $W_i$, which are independent realizations of a (zero mean, unit
variance) Gaussian process. The multiplexed traffic stream is

$$x = \sum_{i=1}^{N} m_i + \sum_{i=1}^{N} \sqrt{a_i m_i}\, W_i. \tag{5.7}$$

The mean of the new process is $m = \sum_{i=1}^{N} m_i$, and the peakedness (derived from
the variance) is $a = \frac{1}{m} \sum_{i=1}^{N} a_i m_i$, which is a weighted average of the component
peakednesses. The relative variance becomes

$$V_x = \mathrm{Var}\{x\} / E\{x\}^2 = \frac{1}{m^2} \sum_{i=1}^{N} a_i m_i. \tag{5.8}$$

If we take identical streams, then the relative variance decreases as we multiplex
more together, which is to be expected. The result is that in network traffic the
level of aggregation is important in determining the relative variance: more highly
aggregated traffic exhibits less random behavior. The data in Fig. 5.3 from AT&T
shows an aggregate of a very large number of customers (an entire PoP of one of
North America’s largest networks). The consequence is that we can see the traffic is
very smooth. In contrast the traffic shown in Fig. 5.2 is much less aggregated, and
shows more random fluctuations.
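The effect of multiplexing on the relative variance can be checked numerically. The sketch below aggregates independent streams under the model above, taking the relative variance to be the variance divided by the squared mean. The stream parameters (mean 50, peakedness 2) are arbitrary illustrative values.

```python
def multiplex(means, peakedness):
    """Aggregate independent streams obeying x_i = m_i + sqrt(a_i * m_i) * W_i.

    Returns (m, a, V): the aggregate mean, the aggregate peakedness, and the
    relative variance Var{x} / E{x}^2 of the multiplexed stream.
    """
    m = sum(means)
    # W_i are independent with unit variance, so the variances a_i * m_i add.
    variance = sum(a_i * m_i for a_i, m_i in zip(peakedness, means))
    a = variance / m
    return m, a, variance / m ** 2

for n in (1, 10, 100):
    m, a, v = multiplex([50.0] * n, [2.0] * n)
    print(f"N={n:3d}: mean={m:7.1f} peakedness={a:.1f} relative variance={v:.5f}")
```

For identical streams the peakedness of the aggregate is unchanged, while the relative variance falls in proportion to $1/N$, consistent with the smoothing visible in highly aggregated traffic.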
The model described above is not perfect (none are), but it is useful because
(i) it allows us to calculate variances for aggregated traffic streams in a consistent way
and to use these when planning our network, and (ii) its parameters are relatively
easy to measure, and therefore to use in traffic analysis. To do so, however, we find
it useful to split the mean $m(t)$ into the cyclic component (which we denote $S(t)$)
and the long-term trend $L(t)$ by taking the product

$$m(t) = L(t)S(t). \tag{5.9}$$

³ The reader should beware of methods that do not explicitly model the data, because in these
methods there is often an implicit model.

We combine the two components through a product because as the overall load
increases the range of variation in the size of cycles also increases. When estimating
parameters of our models, it is important to allow for unusual or anomalous events,
for instance, a Denial of Service (DoS) attack. These events are rare (we hope), but it
is important to separate them from the normal traffic. Such terms can sometimes be
very large, but we do not plan network capacity to carry DoS attacks! The network is
planned around the paying customers. We separate them by including an impulsive
term, $I(t)$, in the model, so that the complete model is

$$x(t) = L(t)S(t) + \sqrt{a\,L(t)S(t)}\,W(t) + I(t). \tag{5.10}$$

We will further discuss this model in Section 5.4, where we will consider how to
estimate its parameters, and to use it in prediction.
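A short simulation helps to show how the terms of the complete model combine. Every parameter choice below is an assumption made purely for illustration: an exponential trend for $L(t)$, a sinusoidal 24-hour cycle for $S(t)$, Gaussian white noise for $W(t)$, and a single large spike for $I(t)$. Real seasonal components are not sinusoidal and must be estimated from data.

```python
import numpy as np

def simulate_traffic(hours, a=5.0, base=100.0, growth_per_hour=1e-4, seed=0):
    """Generate hourly traffic x(t) = L(t)S(t) + sqrt(a L(t)S(t)) W(t) + I(t).

    Returns (t, m, x), where m(t) = L(t)S(t) is the deterministic mean.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(hours, dtype=float)
    L = base * np.exp(growth_per_hour * t)          # assumed long-term trend
    S = 1.0 + 0.5 * np.sin(2 * np.pi * t / 24.0)    # assumed 24-hour cycle, mean 1
    W = rng.standard_normal(hours)                  # zero mean, unit variance
    I = np.zeros(hours)
    I[hours // 2] = 10 * base                       # one anomalous event (e.g., a DoS attack)
    m = L * S
    x = m + np.sqrt(a * m) * W + I
    return t, m, x

a = 5.0
t, m, x = simulate_traffic(hours=24 * 28, a=a)
spike = int(np.argmax(x - m))                       # where the impulsive term dominates
normalized = np.delete((x - m) / np.sqrt(a * m), spike)
print("anomaly at hour", spike)
print("std of normalized residuals:", normalized.std())
```

Once the anomaly is excluded, the residuals normalized by $\sqrt{a\,m(t)}$ have roughly unit variance. This is the property that parameter estimation for the model relies on.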
Spatial Patterns Temporal models are adequate for many applications: for instance, where we consider dimensioning of a single bottleneck link (perhaps in the
design of an access network). However, spatial patterns in traffic provide us with additional planning capabilities. For instance, if two traffic sources are active at different
times, then clearly we can carry them both with less capacity than if they activate
simultaneously.
Spatial patterns refer to the structure of a Traffic Matrix (TM) at a single time
interval. It is common that TM elements are strongly correlated because they show
similar diurnal (and weekly) patterns. For example, in a typical network (without
wide geographic distribution) one will find that the busy hour is almost the same for
all elements of the TM, but there is additional structure.
For a start, TMs often come from skewed distributions. A common example is
where the distribution follows a rough 80–20 law (80% of traffic is generated by
the largest 20% of TM elements). Similar distributions have often been observed,
though often even more skewed: for instance 90–10 laws are not uncommon. However, the distribution is not “heavy-tailed”. Observed distributions have shown a
lighter tail than the log-normal distribution [55]. Consequently, traffic matrix work
often concentrates on these larger flows, but traditional (rather than heavy-tailed)
statistical techniques are still applicable.
Another simple feature one might naively expect of TMs – symmetry – is not
present. Internet routing is naturally asymmetric, as is application traffic (a large
amount of traffic still follows a client–server model, which results in strongly
asymmetric traffic). Hence, the matrix will not (generally) be symmetric [21], i.e.,
$T(i, j) \neq T(j, i)$.
We observe some additional structure in these matrices. The simplest model
that describes some of the observed structure is the gravity model. In network

5

Robust Network Planning

149

applications, gravity models have been used to model the volume of telephone calls
in a network [31]. Gravity models take their name from Newton’s law of gravitation,
and are commonly used by social scientists to model the movement of people, goods
or information between geographic areas [49,50,63]. In Newton’s law of gravitation
the force is proportional to the product of the masses of the two objects divided by
the distance squared. Similarly, in gravity models for interactions between cities, the
relative strength of the interaction might be modeled as proportional to the product
of the cities’ populations, so a general formulation of a gravity model is given by
$$T(i, j) = \frac{R_i \cdot A_j}{f_{ij}}, \tag{5.11}$$

where $R_i$ represents the repulsive factors that are associated with leaving from $i$;
$A_j$ represents the attractive factors that are associated with going to $j$; and $f_{ij}$ is a
friction factor from $i$ to $j$. The gravity model was first used in the context of Internet
traffic matrices in [67], where we can naturally interpret the repulsion factor $R_i$ as
the volume of incoming traffic at location $i$, and the attractivity factor $A_j$ as the
outgoing traffic volume at location $j$. The friction matrix $(f_{ij})$ encodes the locality
information specific to different source–destination pairs; however, as locality is not
as large a factor in Internet traffic as in the transport of physical goods, it is common
to assume $f_{ij} = \text{const}$. The resulting gravity model simply states that the traffic
exchanged between locations is proportional to the volumes entering and exiting at
those locations.
Formally, let $T^{\text{in}}(i)$ and $T^{\text{out}}(j)$ denote the total traffic that enters the network
via $i$, and exits via $j$, respectively. The gravity model can then be computed by

$$T(i, j) = \frac{T^{\text{in}}(i)\, T^{\text{out}}(j)}{T^{\text{tot}}}, \tag{5.12}$$

where $T^{\text{tot}}$ is the total traffic across the network. Implicitly, this model relies on a
conservation assumption, i.e., traffic is neither created nor destroyed in the network,
so that $T^{\text{tot}} = \sum_k T^{\text{in}}(k) = \sum_k T^{\text{out}}(k)$. The assumption may be violated, for
instance, when congestion causes packet loss. However, in most backbones congestion is kept low, and so the assumption is reasonable.
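The simple gravity model in (5.12) amounts to an outer product of the ingress and egress totals. The sketch below computes it for a hypothetical three-location network. The totals are invented, and the explicit check of the conservation assumption is our addition.

```python
import numpy as np

def gravity_tm(t_in, t_out):
    """Simple gravity model: T(i, j) = T_in(i) * T_out(j) / T_tot."""
    t_in = np.asarray(t_in, dtype=float)
    t_out = np.asarray(t_out, dtype=float)
    total = t_in.sum()
    if not np.isclose(total, t_out.sum()):
        # The model assumes traffic is neither created nor destroyed in the network.
        raise ValueError("conservation violated: total in != total out")
    return np.outer(t_in, t_out) / total

t_in = [30.0, 50.0, 20.0]    # hypothetical traffic entering at each location
t_out = [40.0, 40.0, 20.0]   # hypothetical traffic exiting at each location
tm = gravity_tm(t_in, t_out)
print(tm)
print("row sums:", tm.sum(axis=1), "column sums:", tm.sum(axis=0))
```

The row and column sums of the result reproduce the measured ingress and egress totals, which is why the model can be built from link-level data alone. In this simple form it also assigns traffic to the diagonal elements $T(i, i)$.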
In the form just described, the gravity model has distinct limitations. For instance,
real traffic matrices may have non-constant $f_{ij}$ (perhaps as a result of different time
zones). Moreover, even if an origin/destination traffic matrix matches the gravity
model well, the ingress/egress TM may be systematically distorted [7]. Typically,
networks use hot-potato routing, i.e., they choose the egress point closest to the
ingress point, and this results in a systematic distortion of ingress/egress traffic
matrices away from the simple gravity model. These distortions and others related to the asymmetry of traffic and distance sensitivity may be incorporated in
generalizations of the gravity model where sufficient data exists to measure such
deviations [13, 21, 67].


The use of temporal patterns in planning is relatively obvious. The use of spatial
patterns such as the gravity model is more subtle. The spatial structure gives us the
capability to fill in missing values of the traffic matrix when our data is not perfect.
Hence we can still plan our network, even in the extreme case where we have no
data at all.

5.3.3 Application Profile
We have so far discussed network traffic along two dimensions: the temporal and
spatial. There is a third aspect of traffic to consider: its application breakdown, or
profile. Common applications on the Internet are email, web browsing (and other
server-based interactions), peer-to-peer file transfers, video, and voice. Each may
have a different traffic matrix, and as some networks move toward differentiated
Quality of Service (QoS) for different classes of traffic, we may have to plan networks based on these different traffic matrices.
Even where differentiated service is not going to be provided, a knowledge of the
application classes in our network can be very useful. For instance:
• Voice traffic is less variable than data, and so can require less overhead for
sub-measurement-interval variations.
• Peer-to-peer applications typically generate more symmetric traffic than web
traffic, and so the capacity needed downstream (toward customer eyeballs) and
upstream is likely to be more balanced when peer-to-peer applications dominate.
• We may be planning to eliminate some types of traffic in future networks (e.g.,
peer-to-peer traffic has often been considered to violate service agreements that
prohibit running servers).
The breakdown of traffic on a network is not trivial to measure. As noted, typical
flow-level data collection includes TCP/UDP port numbers, and these are often associated with applications using the IANA (Internet Assigned Numbers Authority)
list of registered ports (see footnote 4). However, the port numbers used today are often associated
with incorrect applications because:
• Ports are not defined with IANA for all applications, e.g., some peer-to-peer
applications.
• An application may use ports other than its well-known ports to circumvent
access control restrictions, e.g., nonprivileged users often run WWW servers on
ports other than port 80, which is restricted to privileged users on most operating
systems, while port 80 is often used for applications other than HTTP in order
to work around firewalls.
• In some cases server ports are dynamically allocated as needed. For example,
FTP allows the dynamic negotiation of the server port used for the data transfer.

4 http://www.iana.org/assignments/port-numbers

5

Robust Network Planning


This server port is negotiated on an initial TCP connection, which is established
using the well-known FTP control port, but which would appear as a separate
flow.
• Malicious traffic (e.g., DoS attacks) can generate a large volume of bogus traffic
that should not be associated with the applications that normally use the affected
ports.
In addition, there are some incorrect implementations of protocols, and ambiguous
port assignments that complicate the problem. Better approaches to classification of
traffic exist (e.g., [58]), but are not always implemented on commercial measurement systems.
Application profiles can be quite complex. Typical Internet providers will see
some hundreds of different applications. However, there are two major simplifications we can often perform. The first is a clustering of applications into classes. QoS
sometimes forms natural classes (e.g., real-time vs bulk-transfer classes), but regardless we can often group many applications into similarly structured classes, e.g., we
can group a number of protocols (IMAP, POP, SMTP, etc.) into one class “email”.
Common groupings are shown in Table 5.1, along with exemplar applications.
There may be a larger number of application classes, and often there is a significant group of unknown applications, but a typical application profile is highly
skewed. Again, it is common to see 80–20 or 90–10 rules. In these cases, it is common to focus attention on those applications that generate the most traffic, reducing
the complexity of the profile.
However, care must be taken because some applications that generate relatively
little traffic on average may be considered very important, and/or may generate high
volumes of traffic for short bursts. There are several such examples in enterprise
networks: consider, for instance, a CEO’s once-a-week company-wide broadcast, or
nightly backups. Both generate a large amount of traffic, but in a relatively short time
interval, so their proportion of the overall network traffic may be small. More generally, much of the control-plane traffic (e.g., routing protocol traffic) in networks is
relatively low volume, but of critical importance.

Table 5.1 Typical application classes grouped by typical use

Class              Example applications
Bulk-data          FTP, FTP-Data
Database access    Oracle, MySQL
Email              IMAP, POP, SMTP
Information        finger, CDDBP, NTP
Interactive        SSH, Telnet
Measurement        SNMP, ICMP, Netflow
Network control    BGP, OSPF, DHCP, RSVP, DNS
News               NNTP
Online gaming      Quake, Everquest
Peer-to-peer       Kazaa, Bit-torrent
Voice over IP      SIP, Skype
www                HTTP, HTTPS
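The port-based classification and the focus on dominant classes described above can be sketched as follows. This is only an illustration under stated assumptions: the port table is a small, partial selection of well-known ports mapped to the classes of Table 5.1, the flow records are hypothetical, and, for the reasons listed earlier, port-based labels can be wrong in practice.

```python
from collections import defaultdict

# Partial map from well-known server ports to classes of Table 5.1.
PORT_CLASS = {
    20: "Bulk-data", 21: "Bulk-data",          # FTP-Data, FTP
    25: "Email", 110: "Email", 143: "Email",   # SMTP, POP, IMAP
    22: "Interactive", 23: "Interactive",      # SSH, Telnet
    53: "Network control", 179: "Network control",  # DNS, BGP
    119: "News",                               # NNTP
    80: "www", 443: "www",                     # HTTP, HTTPS
}

def application_profile(flows):
    """flows: iterable of (src_port, dst_port, bytes); returns class -> traffic share."""
    totals = defaultdict(float)
    for src_port, dst_port, nbytes in flows:
        cls = "Unknown"
        # Try the lower port first, since server ports are usually low-numbered.
        for port in sorted((src_port, dst_port)):
            if port in PORT_CLASS:
                cls = PORT_CLASS[port]
                break
        totals[cls] += nbytes
    grand_total = sum(totals.values())
    return {cls: b / grand_total for cls, b in totals.items()}

def dominant_classes(profile, coverage=0.8):
    """Smallest set of classes, largest first, covering `coverage` of the traffic."""
    chosen, covered = [], 0.0
    for cls, share in sorted(profile.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(cls)
        covered += share
        if covered >= coverage:
            break
    return chosen

# Hypothetical flow records: (source port, destination port, bytes).
flows = [(51000, 80, 700.0), (443, 52000, 150.0),
         (53000, 25, 50.0), (6881, 54000, 100.0)]
profile = application_profile(flows)
top = dominant_classes(profile)
```

Flows matching no table entry fall into an "Unknown" class, mirroring the significant group of unknown applications mentioned above; low-volume but critical classes would still need to be tracked separately.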


5.4 Prediction
There are two common scenarios for network planning:
1. Incremental planning for network evolution
2. Green-fields planning
In the first case, we have an existing network. We can measure its current traffic, and
extrapolate trends to predict future growth. In combination with business data, quite
accurate assessments of future traffic are possible. Typically, temporal models are
sufficient for incremental network planning, though better results might be possible
with recently developed full spatio-temporal models [52].
In green-fields planning, we have the advantage that we are not constrained in
our network design. We may start with a clean slate, without concerning ourselves
with a legacy network. However, in such planning we have no measurements on
which to base predictions. All is not lost, however, as we may exploit the spatial
properties of traffic matrices in order to obtain predictions. We discuss each of these
cases below.
There are other scenarios of concern to the network planner. For example:
• Network mergers, for instance when two companies merge and subsequently
combine their networks.
• Network migrations, for instance, as significant services such as voice or
frame-relay are migrated to operate on a shared backbone.
• Addition (or loss) of a large customer (say a broadband access provider, a major
content provider, or a hosting center).
• A change in interdomain routing relationships. For instance, the conversion of
a customer to a peer would mean that traffic no longer transits from that peer,
altering traffic patterns.
The impact of these types of event is obviously dependent on the relative volume
of the traffic affected. Such events can be particularly significant for smaller networks, but it is not unheard of for them to cause unexpected demands on the largest
networks (for instance, the migration of an estimated half-million customers from
Excite@home to AT&T in 2002; see footnote 5). However, the majority of such cases can be covered by one or both of the techniques below.

5.4.1 Prediction for Incremental Planning
Incremental planning involves extending or evolving a current network to meet
changing patterns of demands, or changing goals. The problem involves prediction of future network demands, based on extrapolation of past and present network

5 http://news.cnet.com/ExciteHome-to-shut-down-ATT-drops-bid/2100-1033 3-276550.html


measurements. The planning problems we encounter are often constrained by the
fact that we can make only incremental changes to our network, i.e., we cannot
throw away the existing network and start from a clean slate, but let us first consider
the problem of making successful traffic predictions.
Obviously, our planning horizon (the delay between our planning decisions and
their implementation) is critical. The shorter this horizon, the more accurate our
predictions are likely to be, but the horizon is usually determined by external factors
such as delays between ordering and delivery of equipment, test and verification of
equipment, planned maintenance windows, availability of technical staff, and capital
budgeting cycles. These are outside the control of the network planner, so we treat
the planning horizon as a constant.
The planning horizon also suggests how much historical data is needed. It is
a good idea to start with historical data extending several planning horizons into
the past. Such a record allows not only better determination of trends, but also an
assessment of the quality of our prediction process through analysis of past planning
periods. If such data is unavailable, then we must consider green-fields planning (see
Section 5.4.2), though informed by what measurements are available.
Given such a historical record, our primary means for prediction is temporal
analysis of traffic data. That is, we consider the traffic measurements of interest
(often a traffic matrix) as a set of time-series.
However, as noted earlier the more highly we aggregate traffic, the smaller its
relative variance, and the easier it is to work with. As a result, it can be a good idea
to predict traffic at a high level of aggregation, and then use a spatial model to break
it into components. For instance, we might perform predictions for the total traffic
in each region of our network, and then break it into components using the current
traffic matrix percentages, rather than predicting each element of the traffic matrix
separately.
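The idea of predicting at a high level of aggregation and then splitting by current traffic matrix percentages can be illustrated with a short sketch. The matrix values and the predicted total are hypothetical; the only assumption is that the relative shares of the traffic matrix elements stay fixed over the prediction interval.

```python
import numpy as np

# Hypothetical current traffic matrix for one region
# (rows: ingress, columns: egress; arbitrary units).
current_tm = np.array([[0.0, 4.0, 6.0],
                       [3.0, 0.0, 2.0],
                       [5.0, 5.0, 0.0]])

def split_aggregate(predicted_total, tm):
    """Distribute a predicted aggregate over TM elements using current shares."""
    shares = tm / tm.sum()
    return predicted_total * shares

# Suppose the aggregate (currently 25 units) is predicted to reach 50 units.
predicted_tm = split_aggregate(50.0, current_tm)
```

Only one time series (the regional total) has to be forecast here, which is the low-variance quantity; the spatial model carries the rest.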
There are many techniques for prediction, but we concentrate here on just one,
which works reasonably well for a wide range of traffic. We should note that, as in all
of the work presented here, the key is not the individual algorithms but their robust
application through a process of measurement, planning, and validation.

5.4.1.1 Extracting the Long-Term Trend
We will exploit the previously presented temporal model for traffic, and note that
the key to providing predictions for use in planning is to estimate the long-term
trend in the data. We could form such an estimate simply by aggregating our time-series over periods of 1 week (to average away the diurnal and weekly cycles) and
then performing standard trend analysis. However, knowledge of the cycles in traffic
data is often useful. Sometimes we design networks to satisfy the demand during a
“busy hour.” More generally though, the busiest hours for different components of
the traffic may not match (particularly in international networks distributed over
several time-zones), and so we need to plan our network to have sufficient capacity
at all hours of the day or night.


Hence, the approach we present provides the capability to estimate both the long-term trend, and the seasonal components of the traffic. It also allows an estimate of
the peakedness, providing the ability to estimate the statistical variations around the
expected traffic behavior. The method is hardly the only applicable time-series algorithm for this type of analysis (for another example see [44]), but it has the advantage
of being relatively simple. The method is based on a simple signal processing tool,
the Moving Average (MA) filter, which we discuss in detail below.
The moving average can be thought of as a simple low-pass filter as it “passes”
low-frequencies, or long-term behavior, but removes short-term variations. As such
it is ideally suited to extracting the trend in our traffic data. Although there are many
forms of moving average, we shall restrict our attention to the simplest: a rectangular
moving average

$$\mathrm{MA}_x(t; n) = \frac{1}{2n+1} \sum_{s=t-n}^{t+n} x(s), \qquad (5.13)$$

where $n$ is the width of the filter, and $2n+1$ is its length. The length of the filter
must be longer than the period of the cyclic component in order to filter out that
component. Longer filters are often used to allow for averaging out of the stochastic
variation as well. The shortest filter we should consider for extracting the trend is
three times the period, which in Internet traffic data is typically 1 week. For example,
given traffic data $x(t)$, measured in 1 hour intervals, we could form our estimate
$\hat{L}(t)$ of the trend by taking a filter of length approximately 3 weeks ($24 \times 7 \times 3 = 504$ hours;
with $n = 252$ the filter length is $2n + 1 = 505$), i.e., we might take $\hat{L}(t) = \mathrm{MA}_x(t; 252)$ where $\mathrm{MA}_x$ is defined in (5.13).
Care must always be taken around the start and end of the data. Within n data
points of the edges the MA filter will be working with incomplete data, and so these
estimates should be discounted in further analysis.
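A minimal sketch of the rectangular moving average of Eq. (5.13), assuming NumPy and hourly samples, is given below. The function name and the synthetic series (a linear trend plus a weekly cycle) are illustrative assumptions; the values within $n$ points of either edge are set to NaN so that they are discounted automatically.

```python
import numpy as np

def moving_average(x, n):
    """Rectangular moving average of Eq. (5.13); edge values are returned as NaN."""
    x = np.asarray(x, dtype=float)
    length = 2 * n + 1
    out = np.full(x.shape, np.nan)
    if x.size >= length:
        # 'valid' keeps only positions where the full window fits in the data.
        out[n:x.size - n] = np.convolve(x, np.ones(length) / length, mode="valid")
    return out

# Synthetic hourly data: linear trend plus a weekly (168-hour) cycle.
t = np.arange(24 * 7 * 10)
x = 100.0 + 0.01 * t + 10.0 * np.sin(2 * np.pi * t / 168)
trend = moving_average(x, 252)   # window of 505 hours, roughly 3 weeks
```

On this synthetic series the filter recovers the linear trend closely in the interior, because the window spans almost exactly three weekly periods.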
Once we have obtained estimates for the long-term trend, we can model its behavior. Over the past decade, the Internet has primarily experienced exponential
growth (for instance, see Fig. 5.4 or [45]), i.e.,

$$L(t) = L(0)\, e^{\beta t}, \qquad (5.14)$$

where $L(0)$ is the starting value, and $\beta$ is the growth rate. If exponential growth is
suspected the standard approach is to transform the data using the log function so
that we see

$$\log L(t) = \log L(0) + \beta t, \qquad (5.15)$$

where we can now estimate $L(0)$ and $\beta$ from linear regression of the observed data.
Care should obviously be taken that this model is reasonable. Regression provides
diagnostic statistics to this end, but comparisons to other models (such as a simple
linear model) can also be helpful.
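The log-linear regression of Eq. (5.15) can be sketched with an ordinary least-squares line fit. The example below assumes NumPy; the function name and the synthetic trend are hypothetical, and no confidence bounds or model diagnostics are computed here.

```python
import numpy as np

def fit_exponential(t, trend):
    """Least-squares fit of log L(t) = log L(0) + beta * t; returns (L0, beta)."""
    t = np.asarray(t, dtype=float)
    trend = np.asarray(trend, dtype=float)
    # Skip discounted edge values (NaN) and non-positive values before taking logs.
    ok = np.isfinite(trend) & (trend > 0)
    beta, log_l0 = np.polyfit(t[ok], np.log(trend[ok]), 1)
    return np.exp(log_l0), beta

# Synthetic daily trend estimate with a known growth rate.
days = np.arange(0, 1000)
synthetic = 5.0 * np.exp(0.002 * days)
l0, beta = fit_exponential(days, synthetic)

# Extrapolate the fitted model to day 1500.
forecast = l0 * np.exp(beta * 1500)
```

Fitting a plain linear model to the same data and comparing residuals is one simple way to check whether the exponential assumption is reasonable.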
Such a model can be easily extrapolated to provide long-term predictions of traffic volumes. Standard diagnostics from the regression can also be used to provide
confidence bounds for the predictions, allowing us to predict “best” and “worst” case
scenarios for traffic growth, and an example of such predictions is given in Fig. 5.4
using the data from 2000 to 2004 to estimate the trend, and then extrapolating this


until 2009. The figure shows the extrapolated optimistic and pessimistic trend estimates. We can see that actual traffic growth from 2005 to 2007 was on the optimistic
side of growth, but that in 2008 the measured traffic was again close to the long-term
trend estimate.
This example clearly illustrates that understanding the potential variations in our
trend estimate is almost as important as obtaining the estimate in the first place. It
also illustrates how instructive historical data can be in assessing appropriate models
and prediction accuracy.
Often, in traffic studies, managers are keen to know the doubling time, the time
it takes traffic to double. This can be easily calculated by estimating the value of t
such that $L(t) = 2L(0)$, or $e^{\beta t} = 2$. Again, taking logs we get the doubling time

$$t = \frac{1}{\beta} \ln 2. \qquad (5.16)$$

The Australian data shown in Fig. 5.4 has a doubling time of 477 days.
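As a small worked illustration of Eq. (5.16), the sketch below converts between doubling time and growth rate. The helper names are hypothetical; the only input taken from the text is the 477-day doubling time, which corresponds to traffic growing by a factor of roughly 1.7 per year.

```python
import math

def doubling_time(beta):
    """Doubling time from Eq. (5.16), in the same time units as 1/beta."""
    return math.log(2.0) / beta

def growth_rate_from_doubling(t_double):
    """Invert Eq. (5.16) to recover the growth rate beta."""
    return math.log(2.0) / t_double

beta = growth_rate_from_doubling(477.0)     # growth rate per day
annual_factor = math.exp(365.0 * beta)      # traffic multiplier over one year
```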
The trend by itself can inform us of growth rate but modeling the cyclic variations
in traffic is also useful. We do this by extending the concept of moving average to
the seasonal moving average, but before doing so we broadly remove the long-term
trend from the data (by dividing our measurements $x(t)$ by $\hat{L}(t)$).

5.4.1.2 Extracting the Cyclical Component
The goal of a Seasonal Moving Average (SMA) is to extract the cyclic component
of our traffic. We know, a priori, the period (typically 7 days) and so the design of
a filter to extract this component is simple. It resembles the MA used previously in
that it is an average, but in this case it is an average of measurements separated in
time by the period. More precisely we form the SMA of the traffic with the estimated
trend removed, e.g.,

$$\hat{S}(t) = \frac{1}{N} \sum_{n=0}^{N-1} \frac{x(t + nT_S)}{\hat{L}(t + nT_S)}, \qquad (5.17)$$

where $T_S$ is the period, and $NT_S$ is the length of the filter. In effect the SMA estimates the traffic volume for each time of day and week as if they were separate
time series. It can be combined with a short MA filter to provide some additional
smoothing of the results if needed.
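The seasonal moving average of Eq. (5.17) can be sketched by reshaping the detrended series into one row per cycle and averaging down the columns. This is an illustrative sketch assuming NumPy; the function name, the `start` offset (useful for skipping discounted edge values of the trend), and the synthetic data are assumptions.

```python
import numpy as np

def seasonal_moving_average(x, trend, period, n_cycles, start=0):
    """Eq. (5.17) evaluated for t = start, ..., start + period - 1."""
    ratio = np.asarray(x, dtype=float) / np.asarray(trend, dtype=float)
    window = ratio[start:start + n_cycles * period]
    if window.size < n_cycles * period:
        raise ValueError("not enough data for the requested number of cycles")
    # Row n holds cycle n, so the column mean averages samples that are
    # separated in time by whole periods.
    return window.reshape(n_cycles, period).mean(axis=0)

# Synthetic hourly data: exponential trend multiplied by a weekly cycle.
period = 168
t = np.arange(period * 6)
trend = 100.0 * np.exp(0.0005 * t)
cycle = 1.0 + 0.3 * np.sin(2 * np.pi * t / period)
x = trend * cycle
s_hat = seasonal_moving_average(x, trend, period, n_cycles=4)
```

Sliding `start` forward by one period at a time shows how the estimated cycle changes over the record, which is the variability the SMA is meant to expose.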
The advantage of using an SMA as opposed to a straightforward seasonal average
is that the cyclical component of network traffic can change over time. Using the
SMA allows us to see such variability, while still providing a reasonably stable
model for extrapolation. There is a natural trade-off between the length of the SMA,
and the amount of change we allow over time (longer filters naturally smooth out
transient changes). Typically, the length of filter desired depends on the planning
horizon under which we are operating. We extrapolate the SMA in various ways,


but the simplest is to repeat the last cycle measured in our data into the future, as
if the cyclical component remained constant into the future. Hence, when operating
with a short planning horizon (say a week), we can allow noticeable week-to-week
variations, and still obtain reasonable predictions, and so a filter length of three to
four cycles is often sufficient. Where our planning horizon is longer (say a year) we
must naturally assume that the week-to-week variations in the cyclical behavior are
smaller in order to extrapolate, and so we use a much longer SMA, preferably at
least of the order of the length of the planning horizon.
5.4.1.3 Estimating the Magnitude of Random Variations
Once we understand the periodic and trend components of the traffic, the next thing
to capture is the random variation around the mean. Most metrics of variation used in
capacity planning do not account for the time-varying component, and so are limited
to busy-hour analysis. In comparison, we now have an estimate of $\hat{m}(t) = \hat{L}(t)\hat{S}(t)$
and so can use (5.6) to estimate the stochastic or random component of our traffic
by $z(t) = (x(t) - \hat{m}(t))/\sqrt{\hat{m}(t)}$. We can now measure the variability of the random
component of the traffic using the variance of $z(t)$, which forms an estimate $\hat{a}$ for the
traffic’s peakedness. The estimator for $\hat{a}$ including the correction for bias is given
in [57]. Note that it is also important to separate the impulsive, anomaly terms from
the more typical variations. There are many anomaly detection techniques available
(see [66] for a review of a large group of such algorithms). These algorithms can
be used to select anomalous data points that can then be excluded from the above
analysis.
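A rough sketch of the peakedness estimate follows, assuming NumPy. It uses the plain sample variance of $z(t)$ rather than the bias-corrected estimator of [57], does no anomaly removal, and generates synthetic data under the assumption that the traffic has the form $x(t) = m(t) + \sqrt{a\, m(t)}\, W(t)$ with unit-variance noise $W(t)$; that form and all names below are assumptions for illustration.

```python
import numpy as np

def peakedness(x, m_hat):
    """Sample variance of z(t) = (x(t) - m(t)) / sqrt(m(t)); no bias correction from [57]."""
    x = np.asarray(x, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    z = (x - m_hat) / np.sqrt(m_hat)
    z = z[np.isfinite(z)]            # drop points where the mean estimate is missing
    return z.var(ddof=1)

# Synthetic traffic with a known peakedness (assumed model, see lead-in).
rng = np.random.default_rng(0)
m = 200.0 + 50.0 * np.sin(2 * np.pi * np.arange(5000) / 168)
a_true = 4.0
x = m + np.sqrt(a_true * m) * rng.standard_normal(m.size)
a_hat = peakedness(x, m)
```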
5.4.1.4 From Traffic Matrix to Link Loads
Once we have predictions of a TM, we often need to use these to compute the link
loads that would result. The standard approach is to write the TM in vectorized
form x, where the vector x consists of the columns of the TM (at a particular time)
stacked one on top of another. The link loads y can then be estimated through the
equation
$$y = Ax, \qquad (5.18)$$
where A is the routing matrix. The equation above can also be extended to project
observations or predictions of a TM over time into equivalent link loads.
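To make Eq. (5.18) concrete, the sketch below builds a routing matrix for a hypothetical three-node line network (0 -- 1 -- 2) with directed links and computes the link loads from a column-stacked traffic matrix. The topology, the demand values, and the helper name are all assumptions chosen for illustration.

```python
import numpy as np

N = 3                                        # nodes 0, 1, 2 arranged in a line
links = [(0, 1), (1, 2), (2, 1), (1, 0)]     # directed links

def line_routing_matrix(n_nodes, links):
    """A[l, k] = 1 if origin-destination pair k is routed over link l."""
    A = np.zeros((len(links), n_nodes * n_nodes))
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i == j:
                continue
            step = 1 if j > i else -1
            for u in range(i, j, step):
                # Element (i, j) sits at index i + j * n_nodes when columns are stacked.
                A[links.index((u, u + step)), i + j * n_nodes] = 1.0
    return A

tm = np.array([[0.0, 2.0, 3.0],
               [1.0, 0.0, 4.0],
               [5.0, 6.0, 0.0]])     # tm[i, j]: traffic from node i to node j
x = tm.flatten(order="F")            # columns of the TM stacked on top of one another
A = line_routing_matrix(N, links)
y = A @ x                            # link loads, in the order of `links`
```

For a time series of traffic matrices, stacking the vectors $x$ as columns of a matrix $X$ and computing `A @ X` gives the corresponding link-load series in one step.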
Although there are multiple time-series approaches that can be used to predict
future behavior (e.g., Holt-Winters [11]), our approach has the advantage that it
naturally incorporates multiplexing. As a result, Eq. 5.18 can be extended to other
aspects of the traffic model. For instance, the variances of independent flows are
additive (the variance of the multiplexed traffic is the sum of the variances of the
components), and so the variance of link traffic follows the same relationship, i.e.,
$$v_y = A v_x, \qquad (5.19)$$


where $v_y$ and $v_x$ are the variances of the link loads and TM, respectively. We can
use $v_y$ to deduce peakedness parameters for the link traffic using (5.7).
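The variance projection of Eq. (5.19) follows the same pattern as the mean. The sketch below assumes NumPy, a hypothetical 0/1 routing matrix with independent flows, and, as an assumption about the form of the peakedness relation, that a link's variance equals its peakedness times its mean load.

```python
import numpy as np

# Hypothetical 0/1 routing matrix: 2 links (rows), 3 independent flows (columns).
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
mean_x = np.array([100.0, 50.0, 80.0])     # mean flow volumes
var_x = np.array([400.0, 100.0, 320.0])    # flow variances

mean_y = A @ mean_x      # Eq. (5.18) applied to the means
var_y = A @ var_x        # Eq. (5.19): variances of independent flows add

# Assumed relation: variance = peakedness * mean, so a = variance / mean.
a_y = var_y / mean_y
```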
So far, we have assumed that the network (at least the location of links, and the
routing) is static. In reality, part of network planning involves changing the network,
and so the matrix A is really a potential variable. When we consider network planning, A appears implicitly as one of our optimization variables. Likewise, A may
change in response to link or router failures.
The reason traffic matrices are so important is that they are, in principle, invariant
under changes to A. Hence predictions of link loads under the changes in A can be
easily made. For example, imagine a traffic engineering problem where we wish
to balance the load on a network’s internal links more effectively by changing
the routing in the network. In
doing so, the link loads are not invariant (the whole point of traffic engineering is to
change these). However, the ingress/egress TM is invariant, and projecting this onto
the links (via the routing matrix) will predict the link loads under proposed routing
changes.
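The use of an invariant traffic matrix to evaluate a proposed routing change can be sketched as follows. The demands, capacities, and both routing matrices are hypothetical; the point is only that the same vector $x$ is projected through each candidate $A$.

```python
import numpy as np

x = np.array([6.0, 4.0, 2.0])          # ingress/egress demands (hypothetical)
capacity = np.array([10.0, 10.0])      # link capacities, same units as x

A_current = np.array([[1.0, 1.0, 0.0],     # flows 0 and 1 share link 0
                      [0.0, 0.0, 1.0]])    # flow 2 uses link 1
A_proposed = np.array([[1.0, 0.0, 0.0],    # flow 1 is moved to link 1
                       [0.0, 1.0, 1.0]])

# Predicted utilization of each link under each routing.
util_current = (A_current @ x) / capacity
util_proposed = (A_proposed @ x) / capacity
```

Here the proposed routing lowers the maximum utilization from 100% to 60% without any change to the traffic matrix itself.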
In reality, invariance is an approximation. Real TMs are not invariant under all
network changes. For instance, if network capacities are chosen to be too small,
congestion will result, and the Transmission Control Protocol (TCP) will act
to alleviate this congestion by reducing the actual traffic carried on the network,
thereby changing the traffic matrix. In general, different sets of measurements will
have different degrees of invariance. For instance, an origin/destination TM is invariant to changes in egress points (due to routing changes), whereas an ingress/egress
TM is not. It is clearly better to use the right data set for each planning problem, but
the desired data is not always available.
The lack of true invariance is one of the key reasons for the cyclic approach
to network planning. We seek to correct any problems caused by variations in our
inputs in response to our new network design.

5.4.2 Prediction for Green-Fields Planning
The above section assumes that we have considerable historical data to which we apply time-series techniques to extrapolate trends, and hence predict the future traffic
demands on our network. This has two major limitations:
1. IP traffic is constrained by the pipe through which it passes. TCP congestion
control ensures that such traffic does not overflow by limiting the source transmission rate. In most networks our measurements only provide the carried load,
not the offered load. If the network capacities change, the traffic may increase in
response. This is a concern if our current network is loaded to near its capacity,
and in this case we must discount our measurements, or at least treat them with
caution.
2. When we design a new network there is nothing in place for us to measure.


We will start by considering available strategies for the latter case. We can draw
inspiration from the spatial models previously presented. The fact that the simple
gravity model describes, to some extent, the spatial structure of Internet traffic matrices presents us with a simple approach to estimate an initial traffic matrix.
The first step is to estimate the total expected traffic for the network, based on
demographics and market projections. Let us take a simple example: in Australia the
ABS measures Internet usage. Across a wide customer base the average usage per
customer was roughly 3 GB/month (since 2006). The total traffic for our network
is the usage per customer multiplied by the projected number of customers. We
can derive traffic estimates per marketing region in the same fashion. Note that the
figure used above is for the broad Australian market and is unlikely to be correct
elsewhere (typical Australian ISPs have a tiered pricing structure). Where more
detailed figures exist in particular markets these should be used.
The second step is to estimate the “busy-hour” traffic. As we have seen previously the traffic is not uniformly distributed over time. In the absence of better data,
we might look at existing public measurements (such as presented in Figs. 5.2 and
5.3, or as appears in [44]) where the peak to mean ratio is of the order of 3 to 2.
Increasing our traffic estimates by this factor gives us an estimate of the peak traffic
loads on the network.
The third step is to estimate a traffic matrix. The best approach, in the absence of
other information, to derive the traffic matrix is to apply the gravity model (5.12). In
the simple case, the gravity model would be applied directly using the local regional
traffic estimates. However, where additional information about the expected application profile exists, we might use this to refine the results using the “independent
flow model” of [21]. Additional structural information about the network might allow use of the “generalized gravity model” of [68]. Each of these approaches allows
us to use additional information, but in the absence of such information the simple
gravity model gives us our initial estimate of the network traffic matrix.
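The three steps above can be strung together in a short sketch. Everything numeric here is an assumption except the 3 GB/month usage figure and the 3-to-2 peak-to-mean ratio quoted in the text: the customer counts per region are hypothetical, a decimal gigabyte and a 30-day month are assumed, and each region's ingress traffic is assumed equal to its egress traffic.

```python
import numpy as np

GB = 1e9                              # bytes per gigabyte (decimal, assumed)
SECONDS_PER_MONTH = 30 * 24 * 3600    # 30-day month (assumed)

def mean_rate_bps(customers, gb_per_customer_month=3.0):
    """Step 1: average traffic rate in bits per second from monthly usage."""
    return customers * gb_per_customer_month * GB * 8 / SECONDS_PER_MONTH

# Hypothetical projected customer counts for three marketing regions.
customers = np.array([200_000, 100_000, 50_000])
mean_bps = mean_rate_bps(customers)

# Step 2: scale by a peak-to-mean ratio of 3 to 2 to estimate busy-hour traffic.
peak_bps = 1.5 * mean_bps

# Step 3: simple gravity model (5.12), assuming ingress equals egress per region.
peak_tm = np.outer(peak_bps, peak_bps) / peak_bps.sum()
```

With these assumptions the largest region averages a little under 2 Gbit/s; where regional ingress and egress volumes are known to differ, separate vectors would replace the single `peak_bps` in the last step.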
What about the case where we have historical network traffic measurements, but
suspect that the network is congested so that the carried load is significantly below
the offered load? In this case, our first step is to determine what parts of the traffic
matrix are affected. If a large percentage of the traffic matrix is affected, then the
only approach we have available is to go back through the historical record until
we reach a point (hopefully) where the traffic is not capacity constrained. This has
limitations: for one thing, we may not find a sufficient set of data where capacity
constraints have left the measurements uncorrupted. Even where we do obtain sufficient data, the missing (suspect) measurements increase the window over which we
must make predictions, and therefore the potential errors in these predictions.
However, if only a small part of the traffic matrix is affected we may exploit
techniques developed for traffic matrix inference to fill in the suspect values with
more accurate estimates. These methods originated due to the difficulties in collecting flow-level data to measure traffic matrices directly. Routers (particularly older
routers) may not support an adequate mechanism for such measurements (or suffer
a performance hit when the measurements are used), and installation of stand-alone
measurement devices can be costly. On the other hand, the Simple Network Management Protocol (SNMP) is almost ubiquitously available, and has little overhead.


Unfortunately, it provides only link-load measurements, not traffic matrices. However, the two are simply related by (5.18). Inferring x from y is a so-called “network
tomography” problem. For a typical network the number of link measurements is
$O(N)$ (for a network of $N$ nodes), whereas the number of traffic matrix elements is
$O(N^2)$, leading to a massively underconstrained linear inverse problem. Some type
of side information is needed to solve such problems, usually in the form of a model
that roughly describes a typical traffic matrix. We then estimate the parameters of
this crude model (which we shall call m), and perform a regularization with respect
to the model and the measurements by solving the minimization problem
$$\operatorname*{argmin}_x \; \|y - Ax\|_2^2 + \lambda^2 d(x, m), \qquad (5.20)$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm, $\lambda > 0$ is a regularization parameter, and $d(x, m)$
is a distance between the model $m$ and our estimated traffic matrix $x$. Examples of suitable distance metrics are standard or weighted Euclidean distance and
the Kullback–Leibler divergence. Approaches of this type, generally called strategies for regularization of ill-posed problems, are more generally described in [29],
but have been used in various forms in many works on traffic matrix inference.
The method works because the measurements leave the problem underconstrained,
thereby allowing many possible traffic matrices that fit the measurements, but the
model allows us to choose one of these as best. Furthermore, through $\lambda$ the method
allows us to trade off our belief about the accuracy of the model against the expected
errors in the measurements.
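For the special case where $d(x, m)$ is the squared Euclidean distance, problem (5.20) reduces to an ordinary least-squares problem on a stacked system, which the sketch below solves with NumPy. This is only one possible choice of distance, it does not enforce non-negativity of the estimate, and the routing matrix, link loads, and prior model are hypothetical.

```python
import numpy as np

def regularized_tm(A, y, m, lam):
    """Minimize ||y - A x||^2 + lam^2 ||x - m||^2 (Euclidean choice of d)."""
    n = A.shape[1]
    # Stack the measurement equations on top of the scaled prior equations.
    stacked_A = np.vstack([A, lam * np.eye(n)])
    stacked_y = np.concatenate([y, lam * m])
    x_hat, *_ = np.linalg.lstsq(stacked_A, stacked_y, rcond=None)
    return x_hat

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])        # 2 link measurements, 3 unknown flows
x_true = np.array([6.0, 4.0, 2.0])
y = A @ x_true                         # noise-free link loads
m = np.array([5.0, 5.0, 3.0])          # crude prior model of the flows
x_hat = regularized_tm(A, y, m, lam=0.01)
```

With a small regularization parameter the estimate fits the link loads almost exactly and, among all such fits, stays closest to the prior model; a larger value moves the estimate toward the model at the expense of the measurements.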
We can utilize TM structure to interpolate missing values by solving a similar
optimization problem
$$\operatorname*{argmin}_x \; \|\mathcal{A}(x) - M\|_2^2 + \lambda^2 d(x, m_g), \qquad (5.21)$$

where $\mathcal{A}(x) = M$ expresses the available measurements as a function of the traffic
matrix (whether these be link measurements or direct measurements of a subset of
the TM elements we do not care), and $m_g$ is the gravity model. This regularizes
our model with respect to the measurements that are considered valid. Note that the
gravity model in this approach will be skewed by missing elements, so this approach
is only suitable for interpolation of a few elements of the traffic matrix. If larger
numbers of elements are missing, we can use more complicated techniques such as
those proposed in [52] to interpolate the missing data.

5.5 Optimal Network Plans
Once we have obtained predictions of the traffic on our network we can commence
the actual process of making decisions about where links and routers will be placed,
their capacities, and the routing policies that will be used. In this section we discuss
how we may optimize these quantities against a set of goals and constraints.


The first problem we consider concerns capacity planning. If this component of
our network planning worked as well as desired, we could stop there. However,
errors in predictions, coupled with the long planning horizon for making changes
to a network mean that we need also to consider a short-term way of correcting
such problems. The solution is typically called traffic engineering or simply load
balancing, and is considered in Section 5.5.2.

5.5.1 Network Capacity Planning
There are many good optimization packages available today. Commercial tools such
as CPLEX are designed specifically for solving optimization problems, while more
general purpose tools such as Matlab often include optimization toolkits that can be
used for such problems. Even Excel includes some quite sophisticated optimization
tools, and so we shall not consider optimization algorithms in detail here. Instead we
will formulate the problem, and provide insight into the practical issues. There are
three main components to any optimization problem: the variables, the objective,
and the constraints.
The variables here are obviously the locations of links, and their capacities.
The objective function – the function which we aim to minimize – varies depending on business objectives. For instance, it is common to minimize the cost of
a network (either its capital or ongoing cost), or packet delays (or some other network performance metric). The many possible objectives in network design result
in different problem formulations, but we concentrate here on the most common
objective of cost minimization.
The cost of a network is a complex function of the number and type of routers
used, and the capacities of the links. It is common, however, to break up the problem hierarchically into inter-PoP, and intra-PoP design, and we consider the two
separately here.
The constraints in the problem fall into several categories:
1. Capacity constraints require that we have “sufficient” link capacity. These are the
key constraints for this problem so we consider these in more detail below.
2. Other technological constraints, such as limited port numbers per router.
3. Constraints arising as a result of the difficulties in multiobjective optimization.
For example, we may wish to have a network with good performance and low
cost. However, multiobjective optimization is difficult, so instead we minimize
cost subject to a constraint on network performance.
4. Reliability constraints require that the network functions even under network
failures. This issue is so important that other chapters of this book have been
devoted to this issue, but we shall consider some aspects of this problem here
as well.


5.5.1.1 Capacity Constraints and Safe-Operating Points
Unsurprisingly, the primary constraints in capacity planning are the capacity constraints. We must have a network with sufficient capacity to carry the offered traffic.
The key issue is our definition of “sufficient.” There are several factors that go into
this decision:
1. Traffic is not constant over the day, so we must design our network to carry loads
at all times of day. Often this is encapsulated in “busy hour” traffic measurements,
but busy hours may vary across a large network, and between customers, and so
it is better to design for the complete cycle.
2. Traffic has observable fluctuations around its average behavior. Capacity planning can explicitly allow for these variations.
3. Traffic also has unobservable fluctuations on time scales shorter than our measurement
interval. Capacity planning must attempt to allow for these variations.
4. There will be measurement and prediction errors in any set of inputs.
Ideally, we would use queueing models to derive an exact relationship between measured traffic loads and their variations, and so determine the required capacities. However,
despite many recent advances in data traffic modeling, we are yet to agree on sufficiently precise and general queueing models to determine sufficient capacity from
numerical formulae. There is no “Erlang-B” formula for data networks. As a result,
most network operators use some kind of engineering rule of thumb, which comes
down to an “over-engineering factor” to allow for the above sources of variability.
We adopt the same approach here, but the term “over-engineering factor” is misleading. The factor allows for known variations in the traffic. The network is not
over-engineered; it only appears so if capacity is directly compared to the available
but flawed measurements. In fact, if we follow a well-founded process, the network
can be quite precisely engineered.6
We therefore prefer to use the term Safe Operating Point (SOP). A SOP is defined statistically with respect to the available traffic measurements on a network.
For instance, with 5-min SNMP traffic measurements, we might define our SOP
by requiring that the load on the links (as measured by 5-min averages) should not
exceed 80% of link capacity more than five times per month. The predicted traffic
model could then be used to derive how much capacity is needed to achieve this
bound.
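To make the SOP definition concrete, the following minimal sketch checks a set of 5-min load samples against the example rule above (no more than five samples above 80% of link capacity), and finds the smallest capacity that would satisfy the rule. The function names, default parameters, and the use of Mbps are illustrative assumptions, not part of any measurement tool.

```python
def violations(loads_mbps, capacity_mbps, threshold=0.8):
    """Count samples whose load exceeds threshold * capacity."""
    limit = threshold * capacity_mbps
    return sum(1 for load in loads_mbps if load > limit)


def meets_sop(loads_mbps, capacity_mbps, threshold=0.8, max_violations=5):
    """True if the link satisfies the (hypothetical) SOP rule for these samples."""
    return violations(loads_mbps, capacity_mbps, threshold) <= max_violations


def min_capacity_for_sop(loads_mbps, threshold=0.8, max_violations=5):
    """Smallest capacity for which at most max_violations samples exceed the limit.

    The (max_violations + 1)-th largest sample must sit at or below
    threshold * capacity (up to floating-point rounding).
    """
    ordered = sorted(loads_mbps, reverse=True)
    if max_violations >= len(ordered):
        return 0.0
    return ordered[max_violations] / threshold


# Hypothetical samples in Mbps; a real month holds roughly 8,640 of them.
samples = [100, 200, 300, 400, 500, 600, 700, 900]
print(meets_sop(samples, capacity_mbps=1000))
```

In practice the samples would come from predicted rather than historical traffic, and the threshold and violation count would be tuned experimentally, as discussed next.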
Traffic variance depends on the application profile and the scale of aggregation.
Moreover, the desired trade-off between cost and performance is a business choice
for network operators. So there is no single SOP that will satisfy all operators. Given
the lack of precision in current queueing models and measurements, the SOP needs
to be determined by each network operator experimentally, preferably starting from
conservative estimates. Natural variations in network conditions often allow enough
6
It is a common complaint that backbone networks are underutilized. This complaint typically
ignores the issues described above. In reality, many of these networks may be quite precisely
engineered, but crude average utilization numbers are used to defer required capacity increases.

162

M. Roughan

scope to see the impact of variable levels of traffic, and from these determine more
accurate SOP specifications, but to do this we need to couple traffic and performance
measurements (a topic we consider later).
A secondary set of capacity constraints arises because there is a finite set of
available link types, and capacity must be bought in multiples of these links. For
instance, many high-speed networks use either SONET/SDH links (typically giving
155 Mbps times powers of 4) and/or Ethernet link capacities (powers of 10 from
10 Mbps to 10 Gbps). We will denote the set of available link capacities (including
zero) by $\mathcal{C}$.
Finally, most high-speed link technologies are duplex, and so we need to allocate
capacity in each direction, but we typically do so symmetrically (i.e., a link has the
same capacity from $i \to j$ as from $j \to i$ even when the traffic loads in each
direction are different).
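As a small illustration of these two constraints, the sketch below picks the smallest available capacity that covers the larger of the two directional loads. The capacity list uses the nominal figures quoted above (155 Mbps times powers of 4, and powers of 10 from 10 Mbps to 10 Gbps); the function names and the 80% utilization limit are assumptions made for the example, and parallel bundles of links are ignored.

```python
# Nominal capacities in Mbps, following the rule of thumb in the text.
SONET_MBPS = [155 * 4**k for k in range(4)]     # 155, 620, 2480, 9920
ETHERNET_MBPS = [10 * 10**k for k in range(4)]  # 10, 100, 1000, 10000
AVAILABLE = sorted({0, *SONET_MBPS, *ETHERNET_MBPS})


def smallest_sufficient(required_mbps, available=AVAILABLE):
    """Return the smallest available capacity that is >= required_mbps."""
    for cap in available:
        if cap >= required_mbps:
            return cap
    raise ValueError("no single link type is large enough")


def symmetric_capacity(load_ij, load_ji, max_utilization=0.8):
    """Capacity installed in both directions, sized for the busier direction."""
    required = max(load_ij, load_ji) / max_utilization
    return smallest_sufficient(required)


# 300 Mbps one way and 100 Mbps the other need 375 Mbps at 80% utilization.
print(symmetric_capacity(300, 100))
```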

5.5.1.2 Intra-PoP Design
We divide the network design or capacity planning problem into two components
and first consider the design of the network inside a PoP. Typically this involves
designing a tree-like network to aggregate traffic up to regional hubs, which then
transit the traffic onto a backbone.7 The exact design of a PoP is considered in more
detail in Chapter 4, but note that in each of the cases considered there we end up
with very similar optimization problems at this level.
There are two prime considerations in such planning. Firstly, it is typical that
the majority of traffic is nonlocal, i.e., that it will transit to or from the backbone.
Local traffic between routers within the PoP in the Internet is often less than 1%
of the total. There are exceptions to this rule, but these must be dealt with on an
individual basis. Secondly, limitations on the number of ports on most high-speed
routers mean that we need at least one layer of aggregation routers to bring traffic
onto the backbone: for instance, see Fig. 5.5. For clarity, we show a very simple
design (see Chapter 4 for more examples). In our example, Backbone Routers (BRs)

Fig. 5.5 A typical PoP design. Aggregation Routers (AR) are used to increase the port density in the PoP and bring traffic up to the Backbone Routers (BR). Diagram labels, top to bottom: to backbone, BR (two), AR (three), customers.

7
In small PoPs, a single router (or redundant pair) may be sufficient for all needs. Little planning
is needed in this case beyond selecting the model of router, and so we do not include this simple
case in the following discussions.

and the corresponding links to Aggregation Routers (ARs) are assigned in pairs in
order to provide redundancy, but otherwise the topology is a simple tree.
There are many variations on this design, for instance, additional BRs may be
needed, or multiple layers. However, in our simple model, the design is determined
primarily by the limitations on port density. The routers lie within a single PoP, so
links are short and their cost has no distance dependence (and they are relatively
cheap compared to wide-area links). The number of ARs that can be accommodated depends on the number of ports that can be supported by the BRs, so we shall
assume that ARs have a single high-capacity uplink to each BR to allow for a maximum expansion factor in a one-level tree. As a result, the job of planning a PoP is
primarily one of deciding how many ARs are needed.
As noted earlier we do not need a TM for this task. The routing in such a network
is predetermined, and so current port allocations and the uplink load history are
sufficiently invariant for this planning task. We use these to form predictions of
future uplink requirements and the loads on each router. When predictions show
that a router is reaching capacity (either in terms of uplink capacity, traffic volume,
or port usage) we can install additional routers based on our predictions over the
planning horizon for router installation.
There is an additional improvement we can make in this type of problem. It is rare
for customers to use the entire capacity of their link to our network, and so the uplink
capacity between AR and BR in our network need not be the sum of the customers’
link capacities. We can take advantage of this fact through simple measurement-based planning, but with the additional detail that we may allocate customers with
different traffic patterns to routers in such a way as to leverage different peak hours
and traffic asymmetries (between input and output traffic), so as to further reduce
capacity requirements.
The problem resembles the bin packing problem. Given a fixed link capacity $C$
for the uplinks between ARs and BRs, and $K$ customers with peak traffic demands
$\{T_i\}_{i=1}^{K}$, the bin packing problem would be as follows: determine the smallest
integer $B$, such that we can find a $B$-partition $\{S_k\}_{k=1}^{B}$ of the customers8 such that

$$\sum_{i \in S_k} T_i \le C \quad \text{for all } k = 1, \ldots, B. \tag{5.22}$$

The number of subsets B gives the number of required ARs, and although the problem is NP-hard, there are reasonable approximation algorithms for its solution [18],
some of which are online, i.e., they can be implemented without reorganization of
existing allocations.
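One well-known heuristic of this kind is first-fit, sketched below under the simple model of (5.22): each customer is placed on the first AR whose uplink still has room. First-fit is online, since existing allocations are never moved; sorting customers by decreasing demand first (first-fit decreasing) usually packs more tightly but is an offline variant. The function names and the sample demands are illustrative, and this is not necessarily the algorithm of [18].

```python
def first_fit(demands, capacity):
    """Online first-fit: returns a list of bins, each a list of customer indices."""
    bins = []   # bins[k] = customers assigned to AR k
    loads = []  # loads[k] = total peak demand on AR k's uplink
    for i, t in enumerate(demands):
        if t > capacity:
            raise ValueError(f"customer {i} exceeds the uplink capacity")
        for k in range(len(bins)):
            if loads[k] + t <= capacity:
                bins[k].append(i)
                loads[k] += t
                break
        else:
            # No existing AR has room: open a new one.
            bins.append([i])
            loads.append(t)
    return bins


def first_fit_decreasing(demands, capacity):
    """Offline variant: pack customers in order of decreasing peak demand."""
    order = sorted(range(len(demands)), key=lambda i: demands[i], reverse=True)
    packed = first_fit([demands[i] for i in order], capacity)
    # Map positions in the sorted list back to original customer indices.
    return [[order[j] for j in b] for b in packed]


peak_demands = [4, 8, 1, 4, 2, 1]   # hypothetical peak demands T_i
print(len(first_fit(peak_demands, capacity=10)), "ARs needed")
```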
The real problem is more complicated. There are constraints on the number
of ports that can be supported by ARs dependent on the model of ARs being

8
A $B$-partition of our customers is a group of $B$ non-empty subsets $S_k \subseteq \{1, 2, \ldots, K\}$
that are disjoint, i.e., $S_i \cap S_j = \emptyset$ for all $i \ne j$, and which include all customers, i.e.,
$\bigcup_{k=1}^{B} S_k = \{1, 2, \ldots, K\}$.

deployed, constraints on router capacity, and in addition, we can take advantage
of the temporal, and directional characteristics of traffic. Customer demands take
the form $[I_i(t), O_i(t)]$, where $I_i(t)$ and $O_i(t)$ are incoming and outgoing traffic demands for customer $i$ at time $t$. So the appropriate condition for our problem is to
find the minimal number $B$ of ARs such that

$$\sum_{i \in S_k} I_i(t) \le C \quad \text{and} \quad \sum_{i \in S_k} O_i(t) \le C \quad \text{for all } k, t. \tag{5.23}$$

This is the so-called vector bin packing problem, which has been used to model
resource constrained processor scheduling problems, and good approximations have
been known for some time [15, 28].
The major advantage of this type of approach is that customers with different
peak traffic periods can be combined onto one AR so that their joint traffic is more
evenly distributed over each 24-hour period. Likewise, careful distribution of customers whose primary traffic flows into our network (for instance, hosting centers)
together with customers whose traffic flows out of the network (e.g., broadband access companies) can lead to more symmetric traffic on the uplinks, and hence better
overall utilization. In practice, multiplexing gains may improve the situation, so that
less capacity is needed when multiple customers’ traffic is combined, but this effect
only plays a dominant role when large numbers (say hundreds) of small customers
are being combined.
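The benefit of combining complementary customers can be seen in a small first-fit sketch for the vector condition (5.23). Each customer is described by hypothetical incoming and outgoing demand profiles over a few time slots, and a customer fits on an AR only if neither direction exceeds the uplink capacity in any slot. The data layout and names are assumptions made for illustration; the approximation algorithms of [15, 28] are more sophisticated.

```python
def fits(load_in, load_out, cust_in, cust_out, capacity):
    """Check both directions in every time slot against the uplink capacity."""
    ok_in = all(a + b <= capacity for a, b in zip(load_in, cust_in))
    ok_out = all(a + b <= capacity for a, b in zip(load_out, cust_out))
    return ok_in and ok_out


def vector_first_fit(customers, capacity):
    """customers: list of (incoming, outgoing) demand profiles, one value per slot."""
    routers = []  # each entry tracks members and per-slot loads of one AR
    for i, (inc, out) in enumerate(customers):
        if max(inc) > capacity or max(out) > capacity:
            raise ValueError(f"customer {i} exceeds the uplink capacity")
        for r in routers:
            if fits(r["in"], r["out"], inc, out, capacity):
                r["members"].append(i)
                r["in"] = [a + b for a, b in zip(r["in"], inc)]
                r["out"] = [a + b for a, b in zip(r["out"], out)]
                break
        else:
            routers.append({"members": [i], "in": list(inc), "out": list(out)})
    return routers


# Two time slots (day, night). Every customer has a peak of 8 in one direction,
# so scalar bin packing on peaks with capacity 10 would need four ARs.
customers = [
    ([8, 2], [1, 1]),  # hosting center, daytime peak, mostly incoming
    ([2, 8], [1, 1]),  # hosting center, night-time peak
    ([1, 1], [8, 2]),  # access provider, daytime peak, mostly outgoing
    ([1, 1], [2, 8]),  # access provider, night-time peak
]
for r in vector_first_fit(customers, capacity=10):
    print(r["members"], r["in"], r["out"])
```

In this toy case two ARs suffice, because customers with different peak hours share an uplink.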

5.5.1.3 Inter-PoP Backbone Planning
The inter-PoP backbone design problem is somewhat more complicated. We start
by assuming that we know the locations at which we wish to have PoPs. The question of how to optimize these locations does come up, but it is common that these
locations are predetermined by other aspects of business planning. In inter-PoP planning, distance-based costs are important. The cost of a link is usually considered to
be proportional to its length, though this is approximate. The real cost of a link has a
fixed component (in the equipment used to terminate a line) in addition to distance-dependent terms derived from the cost to install a physical line, e.g., costs of cables,
excavation, and rights of way. Even where leased lines are used (so there are minimal installation costs) the original capital costs of the lines are usually passed on
through some type of distance-sensitive pricing.
In addition, higher speed links generally cost more. The exact model for such
costs can vary, but a large component of the bandwidth-dependent costs is in
the end equipment (router interface cards, WDM mux/demux equipment, etc.). In
actuality, real costs are often very complicated: vendors may have discounts for bulk
purchases, whereas cutting-edge technology may come at a premium cost. However, link costs are often approximated as linear with respect to bandwidth because
we could, in principle, obtain a link with capacity 4c by combining four links of
capacity c.

In the simple case then, cost per link has the form

$$f(d_e, c_e) = \alpha + \beta d_e + \gamma c_e, \tag{5.24}$$

where $\alpha$ is the fixed cost of link installation, $\beta$ is the link cost per unit distance,
and $\gamma$ is the cost per unit bandwidth. As the distance of a link is typically a fixed
property of the link, we often rewrite the above cost in the form

$$f_e(c_e) = \alpha_e + \gamma c_e, \tag{5.25}$$

where $\alpha_e = \alpha + \beta d_e$, so that the cost function now depends on the link index $e$.
We further simplify the problem by assuming that BRs are capable of dealing
with all traffic demands so that only two (allowing for redundancy) are needed in
each PoP, thus removing the costs of the router from the problem.
Finally, we simplify our approach by assuming that routes are chosen to follow
the shortest possible geographic path in our network. There are reasons (which we
shall discuss in the following section) why this might not be the case, however, a
priori, it makes sense to use the shortest geographic path. There are costs that arise
from distance. Most obviously, if packets traverse longer paths, they will experience
longer delays, and this is rarely desirable. In addition, packets that traverse longer
paths use more resources. For instance, a packet that traverses two hops rather than
one uses up capacity on two links rather than one.
As noted earlier, we need to specify the problem constraints, the basic set of
which are intended to ensure that there is sufficient capacity in the network. When
congestion is avoided, queueing delays will be minimal, and hence delays across
the network will be dominated by propagation delays (the speed of light cannot be
increased). So ensuring sufficient capacity implicitly serves the purpose of reducing
networking delays. As noted, we adopt the approach of specifying an SOP, which
we do in the form of a factor $\rho \in (0, 1)$, which specifies the traffic limit with respect
to capacity. That is, we shall require that the link capacity $c_e$ be sufficient that traffic
takes up only $\rho$ of the capacity, leaving $1 - \rho$ of the capacity to allow for unexpected
variations in the traffic.
The possible variables are now the link locations and their capacities. So, given
the (vectorized) traffic matrix $x$, our job is to determine link locations and capacities
$c_e$, which implicitly define the network routes (and hence the routing matrix $A$),
such that we solve

$$
\begin{aligned}
\text{minimize} \quad & \sum_{e \in E} \alpha_e I(c_e > 0) + \gamma c_e \\
\text{such that} \quad & Ax \le \rho c, \\
& c_e \in \mathcal{C},
\end{aligned}
\tag{5.26}
$$

where $Ax = y$, the link loads, $c$ is the vector of link capacities, $E$ is the set of
possible links, $I(c_e > 0)$ is an indicator function (which is 1 where we build a link,
and 0 otherwise), and $\mathcal{C}$ is the set of available link capacities (which includes 0).
Implicit in the above formulation is the routing matrix A, which results from the
particular choice of links in the network design, so A is in fact a function of the
network design. Its construction imposes constraints requiring that all traffic on the
network can be routed. The problem can be rewritten in a more explicit form using
flow-based constraints, but the above formulation is convenient for explaining the
differences and similarities between the range of problems we consider here.
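The formulation in (5.26) can be exercised on a toy example by evaluating candidate designs directly. The sketch below compares two hypothetical designs for three PoPs: a full triangle, and a design that omits one link and routes its demand over the other two. Here `gamma` stands for the cost per unit bandwidth and `rho` for the SOP factor; all numbers, names, and the capacity set are assumptions made for illustration. A real solver would search over designs rather than enumerate them by hand.

```python
def link_loads(A, x):
    """y = A x: the load on each link for routing matrix A and demand vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def design_cost(c, alpha, gamma):
    """Sum of alpha_e + gamma * c_e over the links that are actually built."""
    return sum(alpha[e] + gamma * c[e] for e in range(len(c)) if c[e] > 0)


def is_feasible(A, x, c, rho, available):
    """Check A x <= rho * c and that every capacity is an available link type."""
    y = link_loads(A, x)
    return (all(ce in available for ce in c)
            and all(y[e] <= rho * c[e] for e in range(len(c))))


available = {0, 155, 620, 1000, 2480}   # hypothetical capacity set (Mbps)
alpha = [10.0, 10.0, 15.0]              # fixed cost per link e0, e1, e2
gamma, rho = 0.01, 0.8
x = [300, 200, 100]                     # demands between PoP pairs (1,2), (2,3), (1,3)

# Design A: all three links, every demand on its direct link.
A_full = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
c_full = [620, 620, 155]
# Design B: link e2 omitted, demand (1,3) routed over e0 and e1.
A_two = [[1, 0, 1], [0, 1, 1], [0, 0, 0]]
c_two = [620, 620, 0]

for name, A, c in [("full", A_full, c_full), ("two-link", A_two, c_two)]:
    print(name, is_feasible(A, x, c, rho, available), design_cost(c, alpha, gamma))
```

Both designs are feasible here, and the two-link design is cheaper because it avoids one fixed cost; it is also less robust to failures, which is the subject of the reliability constraints discussed later.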
There may be additional constraints in the above-mentioned problem resulting
from router limitations, or due to network performance requirements. For instance,
if we have a maximum throughput on each router, we introduce a set of constraints
of the form $Bx \le r$, where $r$ are router capacities, and $B$ is similar to a routing
matrix in that it maps end-to-end demands to the routers along the chosen path.
Port constraints on a router might be expressed by taking constraints of the form
$\sum_j I(c_{i,j} > 0) \le p_i$, where $p_i$ is the port limit on router $i$. Port constraints are
complicated by the many choices of line cards available for high-speed routers, and
so have sometimes been ignored, but they are a key limitation in many networks.
The issue is sometimes avoided by separation of inter- and intra-PoP design, so that
a high port density on BRs is not needed.
The other complication is that we should aim to optimize the network for 24 × 7
operations. We can do so simply by including one set of capacity constraints for each
time of day and week, i.e., $A x_t \le \rho c$, where $x_t$ is the traffic matrix at time $t$ and $\rho$ is the SOP factor. The resulting constraints are in exactly the
same form as in (5.26) but their number increases. However, it is common that many
of these constraints are redundant, and so can be removed from the optimization
(without effect) by a pre-filtering phase.
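One simple pre-filter, sketched below, relies on the fact that routing matrices have nonnegative entries: if the traffic vector at one time is no larger than the vector at another time in every element, then the loads it produces are no larger on every link, whatever the design, and its constraints can be dropped. The function name and the sample vectors are illustrative; more aggressive filters are possible.

```python
def prefilter(traffic_vectors):
    """Keep only traffic vectors that are not dominated elementwise by another."""
    kept = []
    for x in traffic_vectors:
        # Skip x if some kept vector is at least as large in every element.
        if any(all(a <= b for a, b in zip(x, k)) for k in kept):
            continue
        # Otherwise x may dominate earlier vectors, which then become redundant.
        kept = [k for k in kept if not all(a <= b for a, b in zip(k, x))]
        kept.append(x)
    return kept


# Hypothetical traffic vectors for five measurement intervals.
vectors = [[1, 2], [2, 3], [3, 1], [2, 3], [0, 0]]
print(prefilter(vectors))
```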
The full optimization problem is a linear integer program, and there are many
tools available for solution of such programs. However, it is not uncommon to relax
the integer constraints to allow any $c_e \ge 0$. In this case, there is no point in having
excess capacity, and so we can replace the link capacity constraint by $Ax = \rho c$, where $\rho$ is the SOP factor. We
then obtain the actual design by rounding up the capacities. This approach reduces
the numerical complexity of the problem, but results in a potentially suboptimal
design. Note though, that integer programming problems are often NP-hard, and
consequently solved using heuristics, which likewise can lead to suboptimal designs. Relaxation to a linear program is but one of a suite of techniques that can be
used to solve problems in this context, often in combination with other methods.
Moreover, it is common in the mathematical community to focus on finding provably optimal designs, but this is not a real issue. In practical network design we
know that the input data contains errors, and our cost models are only approximate.
Hence, the mathematically optimal solution may not have the lowest cost of all realizable networks. The mathematical program only needs to provide us with a very
good network design.
The components of real networks suffer outages on a regular basis: planned maintenance and accidental fiber cuts are simple examples (for more details see Chapters
3 and 4). The final component of network planning that we discuss here is reliability
planning: analyzing the reliability of a network. There are many algorithms aimed
at maintaining network connectivity, ranging from simple designs such as rings or
meshes, to formal optimization problems including connectivity constraints. Commonly, networks are designed to survive all single link or node outages, though more
careful planning would concern all Shared Risk Groups (SRG), i.e., groups of links
and/or nodes who share fates under common failures. For instance, IP links that use
wavelengths on the same fiber will all fail simultaneously if the fiber is cut.
However, when a link (or SRG) fails, maintaining connectivity is not the only
concern. Rerouted traffic creates new demands on links. If this demand exceeds
capacity, then the resulting congestion will negatively impact network performance.
Ideally, we would design our network to accommodate such failures, i.e., we would
modify our earlier optimization problem (5.26) as follows:

$$
\begin{aligned}
\text{minimize} \quad & \sum_{e \in E} \alpha_e I(c_e > 0) + \gamma c_e \\
\text{such that} \quad & Ax \le \rho c, \\
\text{and} \quad & A_i x \le \sigma c, \quad \forall i \in \mathcal{F},
\end{aligned}
\tag{5.27}
$$

where $\mathcal{F}$ is the set of all failure scenarios considered likely enough to include,
$A_i$ is the routing matrix under failure scenario $i$, and $\rho$ and $\sigma$ are the SOP factors
for normal and failure loads, respectively. Naively implemented with
$\sigma = \rho$, this approach has the limitation that the capacity constraints under failures
can come to dominate the design of the network so that most links will be heavily
underutilized under normal conditions. Hence, we allow the SOPs with respect
to normal loads and failure loads to be different, $\rho < \sigma < 1$, so that the mismatch is
somewhat balanced, i.e., under normal conditions links are not completely underutilized, but there is likely to be enough capacity under common failures. For example,
we might require that under normal loads, peak utilizations remain at 60%, while
under failures, we allow loads of 85%.
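The effect of using two different SOPs can be illustrated with a small calculation. Assuming the link loads under normal routing and under each failure scenario are already known, the sketch below computes the capacity each link needs so that normal loads stay below a fraction `rho` and failure loads below a fraction `sigma` of capacity. The symbols, function name, and loads are assumptions made for the example; the defaults follow the 60% and 85% figures above.

```python
def required_capacities(normal_loads, failure_loads, rho=0.6, sigma=0.85):
    """Per-link capacity satisfying y_e <= rho * c_e and y_e^i <= sigma * c_e for all i."""
    required = [load / rho for load in normal_loads]
    for scenario in failure_loads:
        required = [max(r, load / sigma) for r, load in zip(required, scenario)]
    return required


normal = [300, 200]               # hypothetical normal loads on two links (Mbps)
failures = [[450, 0], [0, 500]]   # loads on the same links under two failure scenarios

balanced = required_capacities(normal, failures)          # rho = 0.6, sigma = 0.85
naive = required_capacities(normal, failures, sigma=0.6)  # same SOP for failures
print([round(c) for c in balanced], [round(c) for c in naive])
```

With a single SOP the failure constraints demand considerably more capacity on both links, which is the underutilization effect described above.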
Additionally, the number of possible failure scenarios can be quite large, and
as each introduces constraints, it may not be practical to consider all failures. We
may need to focus on the likely failures, or those that are considered to be most
potentially damaging. However, it is noteworthy that only constraints that involve
rerouting need be considered. In most failures, a large number of links will be unaffected, and hence the constraints corresponding to those links will be redundant,
and may be easily removed from the problem.
The above formulation presumes that we design our network from scratch, but
this is the exception. We typically have to grow our network incrementally. This
introduces challenges – for instance, it is easy to envisage a series of incremental
steps that are each optimal in themselves, but which result in a highly suboptimal
network over time. So it is sometimes better to design an optimal network from
scratch, particularly when the network is growing very quickly. In the meantime we
can include the existing network through a set of constraints in the form $c_e \ge l_e + c'_e$,
where $l_e$ is the legacy link capacity on link $e$, and $c'_e$ is the additional link capacity.
The real situation is complicated by some additional issues: (i) typical IP router load
balancing is not well suited for multiple parallel links of different capacities so we
must choose between increasing capacity through additional links (with capacity
equal to the legacy links) or paying to replace the old links with a single higher
capacity link; and (ii) the costs for putting additional capacity between two routers
may be substantially different from the costs for creating an entirely new link. Some
work [40] has considered the problem of evolvability of networks, but without all
of the additional complexities of IP network management, so determining long-term
solutions for optimal network evolution is still an open problem.

5.5.2 Traffic Engineering
In practice, it takes substantial time to build or change a network, despite modern
innovations in reconfigurable networks. Typical changes to a link involve physically
changing interface cards, wiring, and router configurations. Today these changes are
often made manually. They also need to be performed carefully, through a process
where the change is documented, carefully considered, acted upon, and then tested.
The time to perform these steps can vary wildly between companies, but can easily
be 6 months once budget cycles are taken into account.
In the meantime we might find that our traffic predictions are in error. The best
predictions in the world cannot cope with the convulsive changes that seem to occur
on a regular basis in the Internet. For instance, the introduction of peer-to-peer networking both increased traffic volumes dramatically in a very short time frame and
changed the structure of this traffic (peer-to-peer traffic is more symmetric than that of the
previously dominant client–server model). YouTube again reset providers’ expectations for traffic. The result will be a suboptimal network, in some cases leading to
congestion.
As noted, we cannot simply redesign the network, but we can often alleviate congestion by better balancing loads. This process, called traffic engineering (or just
load balancing) allows us to adapt the network on shorter time scales than capacity
planning. It is quite possible to manually intervene in a network’s traffic engineering
on a daily basis. Even finer time scales are possible in principle if traffic engineering is automated, but this is uncommon at present because there is doubt about the
desirability of frequent changes in routing. Each change to routing protocols can
require a reconvergence, and can lead to dropped packets. More importantly, if such
automation is not very carefully controlled it can become unstable, leading to oscillations and very poor performance.
The Traffic Engineering (TE) problem is very similar to the network design problem. The goal or optimization objective is often closely related to that in design. The
constraints are usually similar. The major difference is in the planning horizon (typically days to weeks), and as a result the variables over which we have control.
The restriction imposed by the planning horizon for TE is that we cannot change
the network hardware: the routers and links between them are fixed. However, we
can change the way packets are routed through the network, and we can use this to
rebalance the traffic across the existing network links.
There are two methods of TE that are most commonly talked about. The most
often mentioned uses MultiProtocol Label Switching (MPLS) [54], by which we
can arbitrarily tunnel traffic across almost any set of paths in our network. Finding a general routing minimizing max-utilization is an instance of the classical
multi-commodity flow problem, which can be formulated as a linear program
[6, Chapter 17], and is hence solvable using commonly available tools. We shall
not spend much time on MPLS TE, because there is sufficient literature already (for
instance, see [19, 36]). We shall instead concentrate on a simpler, less well known,
and yet almost as powerful method for TE.
Remember that we earlier argued that shortest-geographic paths made sense for
network routing. In fact, shortest-path routing does not need to be based on geographic distances. Most modern Interior Gateway Protocols allow administratively
defined distances (for instance, Open Shortest Path First (OSPF) [42] and Intermediate System-Intermediate System (IS-IS) [14]). By tweaking these distances we can
improve network performance. By making a link distance smaller, you can make a
link more “attractive”, and so route more traffic on this link. Making the distance
longer can remove traffic. Configurable link weights can be used, for example, to
direct traffic away from expensive (e.g., satellite) links.
However, we can formulate the TE problem more systematically. Let us consider
a shortest-path protocol with administratively configured link weights (the link distances) $w_e$ on each link $e$. We assume that the network is given (i.e., we know its
link locations and capacities), and that the variables that we can control are the
link weights. Our objective is to minimize the congestion on our network. Several
metrics can be used to describe congestion. Network-wide metrics such as that proposed in [25, 26] can have advantages, but we use the common metric of maximum
utilization here for its simplicity.
In many cases, there are additional “human” constraints on the weights we can
use in the above optimization. For instance, we may wish that the resulting weights
do not change “too much” from our existing weights. Each change requires reconfiguration of a router, and so reducing the number of changes with respect to the
existing routing may be important. Likewise, the existing weights are often chosen
not just for the sake of distance, but also to make the network conceptually simpler.
For instance, we might choose smaller weights inside a “region” and large weights
between regions, where the regions have some administrative (rather than purely geographical) significance. In this case, we may wish to preserve the general features
of the routing, while still fine-tuning the routes. We can express these constraints in
various ways, but we do so below by setting minimum and maximum values for the
weights. Then the optimization problem can be written: choose the weights $w$, such
that we

$$
\begin{aligned}
\text{minimize} \quad & \max_{e \in E} \; y_e / c_e \\
\text{such that} \quad & Ax = y, \\
\text{and} \quad & w_e^{\min} \le w_e \le w_e^{\max}, \quad \forall e \in E,
\end{aligned}
\tag{5.28}
$$

where $A$ is the routing matrix generated by shortest-path routing given by link
weights $w_e$, and the link utilizations are given by $y_e / c_e$ (the link load divided
by its capacity). The $w_e^{\min}$ and $w_e^{\max}$ constrain the weights for each link into a
range determined by existing network policies (perhaps within some bound of the
existing weights). Additional constraints might specify the maximum number of
weights we are allowed to change, or require that link weights be symmetric, i.e.,
$w_{(i,j)} = w_{(j,i)}$.
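Evaluating the objective of (5.28) for a given set of weights is straightforward, and is the inner step of most search heuristics for this problem. The sketch below routes each demand on a single shortest path computed with Dijkstra's algorithm and reports the maximum link utilization. It is a simplification: OSPF and IS-IS deployments commonly split traffic over equal-cost paths, which is ignored here, and the topology, demands, and function names are hypothetical.

```python
import heapq


def shortest_path(links, weights, src, dst):
    """Return the link indices on a shortest path from src to dst.

    links is a list of directed (u, v) pairs; weights[e] is the weight of links[e].
    """
    out = {}
    for e, (u, v) in enumerate(links):
        out.setdefault(u, []).append((v, e))
    dist = {src: 0}
    prev = {}  # node -> index of the link used to reach it
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == dst:
            break
        for v, e in out.get(u, []):
            nd = d + weights[e]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = e
                heapq.heappush(heap, (nd, v))
    if dst != src and dst not in prev:
        raise ValueError(f"no path from {src} to {dst}")
    path, node = [], dst
    while node != src:
        e = prev[node]
        path.append(e)
        node = links[e][0]
    return path[::-1]


def max_utilization(links, capacities, weights, demands):
    """Route every demand on its shortest path and return max_e y_e / c_e."""
    loads = [0.0] * len(links)
    for (s, t), volume in demands.items():
        for e in shortest_path(links, weights, s, t):
            loads[e] += volume
    return max(load / cap for load, cap in zip(loads, capacities))


links = [("D", "A"), ("D", "B"), ("A", "C"), ("B", "C")]
capacities = [100, 100, 100, 100]
demands = {("A", "C"): 70, ("D", "C"): 50}

print(max_utilization(links, capacities, [1, 3, 1, 1], demands))  # D reaches C via A
print(max_utilization(links, capacities, [2, 1, 1, 1], demands))  # D reaches C via B
```

With the first set of weights both demands share the link from A to C, which is overloaded; raising the weight on one link and lowering another moves the second demand onto the path through B without any change to the hardware.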

The problem is in general NP-hard, so it is nontrivial to find a solution. Over the
years, many heuristic methods [12,20,25,26,37,41,53] have been developed for the
solution of this problem.
The exciting feature of this approach is that it is very simple. It uses standard IP
routing protocols, with no enhancements other than the clever choice of weights.
One might believe that the catch was that it cannot achieve the same performance as
full MPLS TE. However, the performance of the above shortest-path optimization
has been shown on real networks to suffer only by a few percent [59,60], and importantly, it has been shown to be more robust to errors in the input traffic matrices than
MPLS optimization [60]. This type of robustness is critical to real implementations.
Moreover, the approach can be used to generate a set of weights that work well
over the whole day (despite variations in the TM over the day) [60], or that can
help alleviate congestion in the event of a link failure [44], a problem that we shall
consider in more detail in the following section.

5.6 Robust Planning
A common concern in network planning is the consequence of mistakes. Traffic matrices used in our optimizations may contain errors due to measurement
artifacts, sampling, inference, or predictions. Furthermore, there may be inconsistencies between our planned network design, and the actual implementation through
misconfiguration or last minute changes in constraints. There may be additional inconsistencies introduced through the failure of invariance in TMs used as inputs, for
example, caused by congestion alleviation in the new network.
Robust planning is the process of acknowledging these flaws, and still designing
good networks. The key to robustness is the cyclic approach described in Section
5.1: measure → predict → plan → and then measure again. However, with some
thought, this process can be made tighter. We have already seen one example of
this through TE, where a short-term alteration in routing is used to counter errors in
predicted traffic. In this section we shall also consider some useful additions to our
kitbag of robust planning tools.

5.6.1 Verification Measurements
One of the most common sources of network problems is misconfiguration. Extreme
cases of misconfigurations that cause actual outages are relatively obvious (though
still time-consuming to fix). However, misconfigurations can also result in more
subtle problems. For instance, a misconfigured link weight can mean that traffic
takes unexpected paths, leading to delays or even congestion.
One of the key steps to network planning is to ensure that the network we planned
is the one we observe. Various approaches have been used for router configuration
validation: these are considered in more detail in Chapter 9. In addition, we
recommend that direct measurements of the network routing, link loads, and performance be made at all times. Routing can be measured through mechanisms such
as those discussed in Section 5.2 and in more detail in Chapter 11. When performed
from edge node to edge node, we can use such measurements to confirm that traffic
is taking the routes we intended it to take in our design.
By themselves, routing measurements only confirm the direction of traffic flows.
Our second requirement is to measure link traffic to ensure that it remains within
the bounds we set in our network design. Unexpected traffic loads can often be dealt
with by TE, but only once we realize that there is a problem.
Finally, we must always measure performance across our network. In principle,
the above measurements are sufficient, i.e., we might anticipate that a link is congested only if traffic exceeds the capacity. However, in reality, the typical SNMP
measurements used to measure traffic on links are 5-min averages. Congestion can
occur on smaller time scales, leading to brief, but nonnegligible packet losses that
may not be observable from traffic measurements alone. We aim to reduce these
through choice of SOP, but note that this choice is empirical in itself, and an accurate choice relies on feedback from performance measurements. Moreover, other
components of a network have been known to cause performance problems even
on a lightly loaded network. For instance, such measurements allowed us to discover and understand delays in routing convergence times [32, 61], and that during
these periods bursts of packet loss would occur, from which improvements to Interior Gateway Protocols have been made [27]. The importance of the problem would
never have been understood without performance measurements. Such measurements are discussed in more detail in Chapter 10.

5.6.2 Reliability Analysis
IP networks and the underlying SONET/WDM strata on which they run are often
managed by different divisions of a company, or by completely different companies. In our planning stages, we would typically hope for joint design between these
components, but the reality is that the underlying physical/optical networks are often multiuse, with IP as one of several customers (either externally or internally)
that use the same infrastructure. It is often hard to prescribe exactly which circuits
will carry a logical IP link. Therefore, it is hard in some cases to determine, prior to
implementation, exactly which SRGs exist.
We may insist, in some cases, that links are carried over separate fibers, or even
purchase leased lines from separate companies, but even in these cases great care
should be taken. For instance, it was only during the Baltimore train tunnel fire
(2001) [4] that it was discovered that several providers ran fiber through the same tunnel.
Our earlier network plan can only accommodate planned network failure scenarios. In robust planning, we must somehow accommodate the SRGs that have
arisen in the implementation of our planned network. The first step, obviously, is to
determine the SRGs. The required data mapping IP links to physical infrastructure
is often stored in multiple databases, but with care it is possible to combine them
to obtain a list of SRGs. Once we have a complete list of failure scenarios we could
go through the planning cycle again, but as noted, the time horizon for this process
would leave our network vulnerable for some time.
The more immediate step, therefore, is to perform a network reliability analysis. This is a simple process of simulating each failure scenario, and assessing whether the network
has sufficient capacity, i.e., whether $A_i x \le \gamma c$. If this condition is already satisfied,
then no action needs to be taken. However, where the condition is violated, we must
take one of two actions. The most obvious approach to deal with a specific vulnerability is to expedite an increase in capacity. It is often possible to reduce the planning
horizon for network changes at an increased cost. Where small changes are needed,
this may be viable, but it is clearly not satisfactory to try to build the whole network
in this way.
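The reliability analysis described above can be sketched as a loop over failure scenarios. The code below is a minimal illustration, not the procedure used in any particular tool: each scenario is assumed to come with its own routing matrix (rows are links, columns are demands), `gamma` is an assumed name for the utilization bound applied to the capacities, and the topology and traffic values are invented.

```python
def link_loads(routing_matrix, traffic):
    """Return the link loads y = A x for a routing matrix given as a list of rows."""
    return [sum(a * x for a, x in zip(row, traffic)) for row in routing_matrix]


def failed_checks(scenarios, traffic, capacities, gamma):
    """Return {scenario name: [indices of overloaded links]} for violated scenarios."""
    violations = {}
    for name, routing_matrix in scenarios.items():
        loads = link_loads(routing_matrix, traffic)
        overloaded = [e for e, (y, c) in enumerate(zip(loads, capacities))
                      if y > gamma * c]
        if overloaded:
            violations[name] = overloaded
    return violations


# Toy example: 3 links, 2 origin-destination demands (all values invented).
x = [40.0, 30.0]               # predicted traffic per demand
c = [100.0, 100.0, 100.0]      # link capacities
scenarios = {
    "no failure":   [[1, 0], [0, 1], [0, 0]],
    # Link 0 fails: demand 0 is rerouted over links 1 and 2.
    "link 0 fails": [[0, 0], [1, 1], [1, 0]],
}
print(failed_checks(scenarios, x, c, gamma=0.6))
```

Scenarios that appear in the result are the ones that call for one of the two actions discussed here; scenarios that do not appear need no action.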
The second alternative is to once again use traffic engineering. MPLS provides
mechanisms to create failover paths; however, it does not tell us where to route
these to ensure that congestion does not occur. Some additional optimization and
control is needed. However, we cannot do this after the failure, or recovery will
take an unacceptable amount of time. Likewise, it is impractical in today’s networks
to change link weights in response to failures. However, previous studies have shown
that shortest-path link weight optimization can be used to provide a set of weights
that will alleviate congestive effects under failures [44], and such techniques have
(anecdotally) been used in large networks with success.

5.6.3 Robust Optimization
The fundamental issue we deal with is “Given that I have errors in my data, how
should I perform optimization?” Not all the news is bad. For instance, once we
acknowledge that our data is not perfect, we realize that finding the mathematically
optimal solution for our problem is not needed. Instead, heuristic solutions that find
a near optimal solution will be just as effective. This chapter is not principally concerned with optimization, and so we will not spend a great deal of time on specific
algorithms, but note that once we decide that heuristic solutions will be sufficient,
several meta-heuristics such as genetic algorithms and simulated annealing become
attractive. They are generally easy to program, and very flexible, and so allow us to
use more complex constraints and optimization objective functions than we might
otherwise have chosen. For instance, it becomes easy to incorporate the true link
costs, and technological constraints on available capacities.
The other key aspect to optimization in network planning directly concerns robustness. We know there are errors in our measurements and predictions. We can
save much time and effort in planning if we accommodate some notion of these errors in our optimization. A number of techniques for such optimization have been
proposed, including oblivious routing [8] and Valiant network design [69, 70]. These papers
present methods to design a network and/or its routing so that it will work well for
any arbitrary traffic matrix. However, this is perhaps going too far. In most cases we
do have some information about possible traffic whose use is bound to improve our
network design.
A simple approach is to generate a series of possible traffic matrices by
adding random noise to our predicted matrix, i.e., by taking $x_i = x + e_i$, for
$i = 1, 2, \ldots, M$. Where sufficient historical data exist, the noise terms $e_i$ should
be generated in such a way as to model the prediction errors. We can then optimize
against the set of TMs, i.e.,

$$\text{minimize} \quad \sum_{e \in E} \alpha_e I(c_e > 0) + c_e \qquad (5.29)$$

such that $A x_i \le \gamma c, \ \forall i = 1, 2, \ldots, M$.
Once again this can increase the number of constraints dramatically, particularly in
combination with reliability constraints, unless we realize that again many of these
constraints will be redundant, and can be pruned by preprocessing.
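One simple form of such preprocessing can be sketched as follows, under the assumption that the routing matrix is fixed: requiring every sampled traffic matrix to fit on every link is equivalent to requiring only the per-link worst-case load to fit, so the M sets of link constraints collapse to one. The Gaussian noise model, the function names, and the example values below are illustrative assumptions, not part of the method described in the text.

```python
import random


def noisy_traffic_matrices(predicted, num_samples, rel_std, seed=0):
    """Generate x_i = x + e_i with zero-mean Gaussian noise, clipped at zero.

    The Gaussian model is an assumption; in practice the noise should be
    chosen to mimic the observed prediction errors.
    """
    rng = random.Random(seed)
    return [[max(0.0, x + rng.gauss(0.0, rel_std * x)) for x in predicted]
            for _ in range(num_samples)]


def worst_case_loads(routing_matrix, traffic_matrices):
    """Per-link maximum of A x_i over all sampled (non-negative) traffic matrices."""
    worst = [0.0] * len(routing_matrix)
    for tm in traffic_matrices:
        for e, row in enumerate(routing_matrix):
            load = sum(a * x for a, x in zip(row, tm))
            worst[e] = max(worst[e], load)
    return worst


# Toy example: 2 links, 3 demands (all values invented).
A = [[1, 0, 1], [0, 1, 1]]
x = [10.0, 20.0, 5.0]
tms = noisy_traffic_matrices(x, num_samples=50, rel_std=0.1)
worst = worst_case_loads(A, tms)
print(worst)  # one capacity constraint per link instead of 50
```

The capacities then only need to satisfy the bound against `worst`, element by element, rather than against each sampled matrix separately.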
The above approach is somewhat naive. The size of the set of TMs to use is
not obvious. Also we lack guidance about the choice we should make for $\gamma$. In
principle, we already accommodate variations explicitly in the above optimization
and so we might expect $\gamma = 1$. However, as before we need $\gamma < 1$ to accommodate
inter-measurement time interval variations in traffic, though the choice should be
different from that in the earlier problems.
Moreover, there may be better robust optimization strategies that can be applied in the future. For instance, robust optimization has been applied to the
traffic engineering problem in [65], where the authors introduce the idea of COPE
(Common-case Optimization with a Penalty Envelope) where the goal is to find the
optimal routing for a predicted TM, and to ensure that the routing will not be “too
bad” if there are errors in the prediction.

5.6.4 Sensitivity Analysis
Even where we believe that our optimization approach is robust, we must test this
hypothesis. We can do so by performing a sensitivity analysis. The standard approach in such an analysis is to vary the inputs and examine the impact on the
outputs. We can vary each possible input to detect robustness to errors in this input, though the most obvious to test is sensitivity to variations in the underlying
traffic matrix. We can test such sensitivity by considering the link loads under a
set of TMs generated, as before, by adding prediction errors, i.e., $x_i = x + e_i$,
for $i = 1, 2, \ldots, M$, and then simply calculating the link loads $y_i = A x_i$. There
is an obvious relationship to robust optimization, in that we should not be testing
against the same set of matrices against which we optimized. Moreover, in sensitivity analysis it is common to vary the size of the errors. However, simple linear
algebra allows us to reduce the problem to a fixed load component $y = Ax$ and a
variable component $w_i = A e_i$, which scales linearly with the size of the errors, and
which can be used to see the impact of errors in the TM directly.
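The decomposition into a fixed and a variable load component can be sketched in a few lines. The routing matrix, traffic vector, and error vector below are invented for illustration, and the scale factor `s` is an assumed name for the size of the errors.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def scaled_link_loads(A, x, e, scales):
    """Link loads for traffic x + s*e, computed as y + s*w for each scale s."""
    y = mat_vec(A, x)  # fixed load component
    w = mat_vec(A, e)  # variable component for an error of unit size
    return {s: [yi + s * wi for yi, wi in zip(y, w)] for s in scales}


# Toy example: 2 links, 3 demands (all values invented).
A = [[1, 0, 1], [0, 1, 1]]
x = [10.0, 20.0, 5.0]
e = [1.0, -2.0, 0.5]
loads = scaled_link_loads(A, x, e, scales=[0.0, 1.0, 2.0])
for s, y_s in loads.items():
    print(s, y_s)
```

Because the two matrix products are computed once, varying the size of the errors costs only a scaled vector addition per setting.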

5.7 Summary
“Reliability, reliability, reliability” is the mantra of good network operators.
Attaining reliability costs money, but few companies can afford to waste millions of dollars on an inefficient network. This chapter is aimed at demonstrating
how we can use robust network planning to attain efficient but reliable networks,
despite the imprecision of measurements, uncertainties of predictions, and general
vagaries of the Internet.
Reliability should mean more than connectivity. Network performance measured
in packet delay or loss rates is becoming an important metric for customers deciding
between operators. Network design for reliability has to account for possible congestion caused by link failures. In this chapter we consider methods for designing
networks where performance is treated as part of reliability.
The methodology proposed here is built around a cyclic approach to network design exemplified in Fig. 5.1. The process of measure → analyze/predict → control
→ validate should not end, but rather, validation measurements are fed back into
the process so that we can start again. In this way, we attain some measure of robustness to the potential errors in the process. However, the planning horizon for
network design is still quite long (typically several months) and so a combination
of techniques such as traffic engineering is used at different time scales to ensure
robustness to failures in predicted behavior. It is the combination of this range of
techniques that provides a truly robust network design methodology.
Acknowledgment This work was informed by the period M. Roughan was employed at AT&T
research, and the author owes his thanks to researchers there for many valuable discussions on
these topics. M. Roughan would also like to thank the Australian Research Council from whom he
receives support, in particular through grant DP0665427.

References
1. Python routing toolkit (‘pyrt’). Retrieved from http://ipmon.sprintlabs.com/pyrt/.
2. Ripe NCC: routing information service. Retrieved from http://www.ripe.net/projects/ris/.
3. University of Oregon Route Views Archive Project. Retrieved from www.routeviews.org.
4. CSX train derailment. Nanog mailing list. Retrieved July 18, 2001 from http://www.merit.edu/
mail.archives/nanog/2001-07/msg00351.html.
5. Abilene/Internet2. Retrieved from
http://www.internet2.edu/observatory/archive/datacollections.html#netflow.
6. Ahuja, R. K., Magnanti, T. L., & Orlin, J. B. (1993). Network flows: Theory, algorithms, and
applications. Upper Saddle River, NJ: Prentice Hall.
7. Alderson, D., Chang, H., Roughan, M., Uhlig, S., & Willinger, W. (2006). The many facets of
Internet topology and traffic. Networks and Heterogeneous Media, 1(4), 569–600.
8. Applegate, D., & Cohen, E. (2003). Making intra-domain routing robust to changing and uncertain traffic demands: Understanding fundamental tradeoffs. In ACM SIGCOMM (pp. 313–324).
Karlsruhe, Germany.
9. Box, G. E. P., & Draper, N. R. (2007). Response surfaces, mixtures and ridge analysis (2nd
ed.). New York: Wiley.
10. Brockwell, P., & Davis, R. (1987). Time series: Theory and methods. New York: Springer.
11. Brutlag, J. D. (2000). Aberrant behavior detection and control in time series for network monitoring. In Proceedings of the 14th Systems Administration Conference (LISA 2000), New
Orleans, LA, USA, USENIX.
12. Buriol, L. S., Resende, M. G. C., Ribeiro, C. C., & Thorup, M. (2002). A memetic algorithm
for OSPF routing. In Proceedings of the 6th INFORMS Telecom (pp. 187–188).
13. Cahn, R. S. (1998). Wide area network design. Los Altos, CA: Morgan Kaufmann.
14. Callon, R. (1990). Use of OSI IS-IS for routing in TCP/IP and dual environments. Network
Working Group, Request for Comments: 1195.
15. Chekuri, C., & Khanna, S. (2004). On multidimensional packing problems. SIAM Journal on
Computing, 33(4), 837–851.
16. Duffield, N., & Lund, C. (2003). Predicting resource usage and estimation accuracy in an
IP flow measurement collection infrastructure. In ACM SIGCOMM Internet Measurement
Conference, Miami Beach, Florida, October 2003.
17. Duffield, N., Lund, C., & Thorup, M. (2004). Flow sampling under hard resource constraints.
SIGMETRICS Performance Evaluation Review, 32(1), 85–96.
18. Coffman, E. G., Jr., Garey, M. R., & Johnson, D. S. (1997). Approximation algorithms for bin
packing: A survey. In D. Hochbaum (Ed.), Approximation algorithms for NP-hard problems.
Boston: PWS Publishing.
19. Elwalid, A., Jin, C., Low, S. H., & Widjaja, I. (2001). MATE: MPLS adaptive traffic engineering. In INFOCOM (pp. 1300–1309).
20. Ericsson, M., Resende, M., & Pardalos, P. (2002). A genetic algorithm for the weight setting
problem in OSPF routing. Journal of Combinatorial Optimization, 6(3), 299–333.
21. Erramilli, V., Crovella, M., & Taft, N. (2006). An independent-connection model for traffic
matrices. In ACM SIGCOMM Internet Measurement Conference (IMC06), New York, NY,
USA, ACM (pp. 251–256).
22. Feldmann, A., Greenberg, A., Lund, C., Reingold, N., & Rexford, J. (2000). Netscope: Traffic
engineering for IP networks. IEEE Network Magazine, 14(2), 11–19.
23. Feldmann, A., Greenberg, A., Lund, C., Reingold, N., Rexford, J., & True, F. (2001). Deriving traffic demands for operational IP networks: Methodology and experience. IEEE/ACM
Transactions on Networking, 9, 265–279.
24. Feldmann, A., & Rexford, J. (2001). IP network configuration for intradomain traffic engineering. IEEE Network Magazine, 15(5), 46–57.
25. Fortz, B., & Thorup, M. (2000). Internet traffic engineering by optimizing OSPF weights.
In Proceedings of the 19th IEEE Conference on Computer Communications (INFOCOM)
(pp. 519–528).
26. Fortz, B., & Thorup, M. (2002). Optimizing OSPF/IS-IS weights in a changing world. IEEE
Journal on Selected Areas in Communications, 20(4), 756–767.
27. Francois, P., Filsfils, C., Evans, J., & Bonaventure, O. (2005). Achieving sub-second IGP
convergence in large IP networks. SIGCOMM Computer Communication Review, 35(3),
35–44.
28. Garey, M., Graham, R., Johnson, D., & Yao, A. (1976). Resource constrained scheduling as
generalized bin packing. Journal of Combinatorial Theory A, 21, 257–298.
29. Hansen, P. C. (1997). Rank-deficient and discrete ill-posed problems: Numerical aspects of
linear inversion. Philadelphia, PA: SIAM.
30. Iannaccone, G., Chuah, C.-N., Mortier, R., Bhattacharyya, S., & Diot, C. (2002). Analysis
of link failures over an IP backbone. In ACM SIGCOMM Internet Measurement Workshop,
Marseilles, France, November 2002.
31. Kowalski, J., & Warfield, B. (1995). Modeling traffic demand between nodes in a telecommunications network. In ATNAC’95.
32. Labovitz, C., Ahuja, A., Bose, A., & Jahanian, F. (2000). Delayed Internet routing convergence.
In Proceedings of ACM SIGCOMM.
33. Lakhina, A., Crovella, M., & Diot, C. (2004). Characterization of network-wide anomalies in
traffic flows. In ACM SIGCOMM Internet Measurement Conference, Taormina, Sicily, Italy.
34. Lakhina, A., Crovella, M., & Diot, C. (2004). Diagnosing network-wide traffic anomalies. In
ACM SIGCOMM.
35. Lakhina, A., Papagiannaki, K., Crovella, M., Diot, C., Kolaczyk, E. D., & Taft, N. (2004).
Structural analysis of network traffic flows. In ACM SIGMETRICS/Performance.
36. Lakshman, U., & Lobo, L. (2006). MPLS traffic engineering. Cisco Press. Available from
http://www.ciscopress.com/articles/article.asp?p=426640, 2006.
37. Lin, F., & Wang, J. (1993). Minimax open shortest path first routing algorithms in networks
supporting the SMDS services. In Proceedings of the IEEE International Conference on Communications (ICC), 2, 666–670.
38. Maltz, D., Xie, G., Zhan, J., Zhang, H., Hjalmtysson, G., & Greenberg, A. (2004). Routing design in operational networks: A look from the inside. In ACM SIGCOMM, Portland, OR, USA.
39. Mauro, D. R., & Schmidt, K. J. (2001). Essential SNMP. Sebastopol, CA: O’Reilly.
40. Maxemchuk, N. F., Ouveysi, I., & Zukerman, M. (2000). A quantitative measure for comparison between topologies of modern telecommunications networks. In IEEE Globecom.
41. Mitra, D., & Ramakrishnan, K. G. (1999). A case study of multiservice, multipriority traffic engineering design for data networks. In Proceedings of the IEEE GLOBECOM (pp.
1077–1083).
42. Moy, J. T. (1998). OSPF version 2. Network Working Group, Request for comments: 2328,
April 1998.
43. Norros, I. (1994). A storage model with self-similar input. Queueing Systems, 16, 387–396.
44. Nucci, A., & Papagiannaki, K. (2009). Design, measurement and management of large-scale
IP networks. New York: Cambridge University Press.
45. Odlyzko, A. M. (2003). Internet traffic growth: Sources and implications. In B. B. Dingel,
W. Weiershausen, A. K. Dutta, & K.-I. Sato (Eds.), Optical transmission systems and equipment for WDM networking II (Vol. 5247, pp. 1–15). Proceedings of SPIE.
46. Oetiker, T. MRTG: The multi-router traffic grapher. Available from http://oss.oetiker.ch/mrtg//.
47. Oetiker, T. RRDtool. Available from http://oss.oetiker.ch/rrdtool/.
48. Paxson, V. (2004). Strategies for sound Internet measurement. In ACM Sigcomm Internet
Measurement Conference (IMC), Taormina, Sicily, Italy.
49. Potts, R. B., & Oliver, R. M. (1972). Flows in transportation networks. New York: Academic
Press.
50. Pöyhönen, P. (1963). A tentative model for the volume of trade between countries.
Weltwirtschaftliches Archiv, 90, 93–100.
51. Qiu, L., Yang, Y. R., Zhang, Y., & Shenker, S. (2003). On selfish routing in internet-like
environments. In ACM SIGCOMM (pp. 151–162).
52. Qiu, L., Zhang, Y., Roughan, M., & Willinger, W. (2009). Spatio-temporal compressive sensing and Internet traffic matrices. In ACM SIGCOMM (pp. 267–278). Barcelona, Spain.
53. Ramakrishnan, K., & Rodrigues, M. (2001). Optimal routing in shortest-path data networks.
Lucent Bell Labs Technical Journal, 6(1), 117–138.
54. Rosen, E. C., Viswanathan, A., & Callon, R. (2001). Multiprotocol label switching architecture.
Network Working Group, Request for Comments: 3031, 2001.
55. Roughan, M. (2005). Simplifying the synthesis of Internet traffic matrices. ACM SIGCOMM
Computer Communications Review, 35(5), 93–96.
56. Roughan, M., & Gottlieb, J. (2002). Large-scale measurement and modeling of backbone
Internet traffic. In SPIE ITCOM, Boston, MA.
57. Roughan, M., Greenberg, A., Kalmanek, C., Rumsewicz, M., Yates, J., & Zhang, Y. (2003).
Experience in measuring Internet backbone traffic variability: Models, metrics, measurements
and meaning. In Proceedings of the International Teletraffic Congress (ITC-18) (pp. 221–230).
58. Roughan, M., Sen, S., Spatscheck, O., & Duffield, N. (2004). Class-of-service mapping for
QoS: A statistical signature-based approach to IP traffic classification. In ACM SIGCOMM
Internet Measurement Workshop (pp. 135–148). Taormina, Sicily, Italy.
59. Roughan, M., Thorup, M., & Zhang, Y. (2003). Performance of estimated traffic matrices in
traffic engineering. In ACM SIGMETRICS (pp. 326–327). San Diego, CA.
60. Roughan, M., Thorup, M., & Zhang, Y. (2003). Traffic engineering with estimated traffic
matrices. In ACM SIGCOMM Internet Measurement Conference (IMC) (pp. 248–258). Miami
Beach, FL.
61. Shaikh, A., & Greenberg, A. (2001). Experience in black-box OSPF measurement. In Proceedings of the ACM SIGCOMM Internet Measurement Workshop (pp. 113–125).
62. Shaikh, A., & Greenberg, A. (2004). OSPF monitoring: Architecture, design and deployment
experience. In Proceedings of the USENIX Symposium on Networked System Design and
Implementation (NSDI).
63. Tinbergen, J. (1962). Shaping the world economy: Suggestions for an international economic
policy. The Twentieth Century Fund.
64. Uhlig, S., Quoitin, B., Balon, S., & Lepropre, J. (2006). Providing public intradomain traffic
matrices to the research community. ACM SIGCOMM Computer Communication Review,
36(1), 83–86.
65. Wang, H., Xie, H., Qiu, L., Yang, Y. R., Zhang, Y., & Greenberg, A. (2006). COPE: Traffic
engineering in dynamic networks. In ACM SIGCOMM (pp. 99–110).
66. Zhang, Y., Ge, Z., Roughan, M., & Greenberg, A. (2005). Network anomography. In Proceedings of the Internet Measurement Conference (IMC ’05), Berkeley, CA.
67. Zhang, Y., Roughan, M., Duffield, N., & Greenberg, A. (2003). Fast accurate computation
of large-scale IP traffic matrices from link loads. In ACM SIGMETRICS (pp. 206–217). San
Diego, CA.
68. Zhang, Y., Roughan, M., Lund, C., & Donoho, D. (2003). An information-theoretic approach
to traffic matrix estimation. In ACM SIGCOMM (pp. 301–312). Karlsruhe, Germany.
69. Zhang-Shen, R., & McKeown, N. (2004). Designing a predictable Internet backbone. In
HotNets III, San Diego, CA, November 2004.
70. Zhang-Shen, R., & McKeown, N. (2005). Designing a predictable Internet backbone with
Valiant load-balancing. In Thirteenth International Workshop on Quality of Service (IWQoS),
Passau, Germany, June 2005.

Part III

Interdomain Reliability and Overlay
Networks

Chapter 6

Interdomain Routing and Reliability
Feng Wang and Lixin Gao

6.1 Introduction
Routing, as the “control plane” of the Internet, plays a crucial role in the performance
of the data plane. That is, routing aims to ensure that there are forwarding
paths for delivering packets to their intended destinations. Routing protocols are the
languages that individual routers speak in order to cooperatively achieve the goal in
a distributed manner. The Internet routing architecture is structured in a hierarchical
fashion. At the bottom level, an Autonomous System (AS) consists of a network
of routers under a single administrative entity. Routing within an AS is achieved
via an Interior Gateway Protocol (IGP) such as OSPF or IS-IS. At the top level, an
interdomain routing protocol glues thousands of ASes together and plays a crucial
role in the delivery of traffic across the global Internet. In this chapter, we provide
an overview of the interdomain routing architecture and its reliability in maintaining
global reachability.
The Border Gateway Protocol (BGP) is the current de facto standard for interdomain
routing. As a path vector routing protocol, BGP requires each router to advertise
only its best route for a destination to its neighbors. Each route includes attributes
such as AS path (the sequence of ASes to traverse to reach the destination), and
local preference (indicating the preference order in selecting the best route). Rather
than simply selecting the route with the shortest AS path, routers can apply complex
routing policies (such as setting a higher local preference value for a route through a
particular AS) to influence the best route selection, and to decide whether to propagate the selected route to their neighbors. Although BGP is a simple path vector
protocol, configuring BGP routing policies is quite complex. Each AS typically
configures its routing policy according to its own goals, such as load-balancing
traffic among its links, without coordinating with other networks. However, arbitrary policy configurations might lead to route divergence or persistent oscillation
of the routing protocol. That is, although BGP allows flexibility in routing policy
configuration, BGP itself does not guarantee routing convergence. Arbitrary policy
configurations, whether the result of unintentional mistakes or of intentionally malicious configuration, can lead to persistent route oscillation [9, 11].

F. Wang
School of Engineering and Computational Sciences, Liberty University
e-mail: fwang@liberty.edu
L. Gao
Department of Electrical and Computer Engineering, University of Massachusetts, Amherst,
Amherst, MA 01002, USA
e-mail: lgao@ecs.umass.edu

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_6,
© Springer-Verlag London Limited 2010

Besides being a policy-based routing protocol, BGP has many features that aim
to scale to a large network such as the global Internet. One feature is that BGP sends
incremental updates upon routing changes rather than sending complete routing information. BGP speaking routers send new routes only when there are changes.
Related to the incremental update feature, BGP uses a timer, referred to as the
Minimum Route Advertisement Interval (MRAI) timer, to determine the minimum
amount of time that must elapse between routing updates in order to limit the
number of updates for each prefix. Therefore, BGP does not react to changes in
topology or routing policy configuration immediately. Rather, it controls the frequency with which route changes can be made in order to avoid overloading router
CPUs and to reduce route flapping. While MRAI timers can be effective in reducing
routing update frequency, the slow reaction to changes can delay route convergence.
More importantly, during the delayed route convergence process, routes among
neighboring routers might be inconsistent. This can lead to transient routing loops
or transient routing outages (referred to as transient routing failures) caused by the
delay in discovering alternate routes.
The goal of this chapter is to provide an overview of BGP, to give practical guidelines for configuring BGP routing policy, and to offer a framework for understanding
how undesirable routing states such as persistent routing oscillation and transient
routing failures or loops can arise. We also present a methodology for measuring the
extent to which these undesirable routing states can affect the quality of end-to-end
packet delivery. We will further describe proposed solutions for reliable interdomain
routing. Toward this end, we outline this chapter as follows.
We begin with an introduction to BGP in Section 6.2. We first describe
interdomain routing architecture, and then illustrate the details of how BGP enables
ASes to exchange global reachability information and various BGP route attributes.
We further present routing policy configurations that enable each individual AS to
meet its goal of traffic engineering or commercial agreement.
In Section 6.3, we introduce multihoming technology. Multihoming allows an
AS to have multiple connections to upstream providers in order to survive a single
point of failure. We present various multihoming approaches, such as multihoming
to multiple upstream providers or to a single upstream provider, to show the redundancy
and load-balancing benefits associated with being multihomed.
In Section 6.4, we highlight the limitations of BGP. For example, the protocol
design does not guarantee that routing will converge to a stable route. We further
show how incentive compatible routing policies can prevent routing oscillation, and
how transient routing failures or loops can occur even under incentive compatible
routing configuration or redundant underlying infrastructure.
Having understood the potential transient routing failures and routing loops, we
describe, in Section 6.5, a measurement methodology and measurement results that quantify the
impact of transient routing failures and routing loops on end-to-end path performance. This illustrates how severely routing outages can affect the
quality of packet delivery.
In Section 6.6, we present a detailed overview of the existing solutions to achieve
reliable interdomain routing. We show that both protocol extensions and routing
policies can enhance the reliability of interdomain routing. Finally, we conclude the
chapter by pointing out possible future research directions in Section 6.7.

6.2 Interdomain Routing
This section introduces the interdomain routing architecture, the interdomain
routing protocol, BGP, and BGP routing policy configuration.

6.2.1 Interdomain Routing Architecture
The Internet consists of a large collection of hosts interconnected by networks of
links and routers. The Internet is divided into thousands of ASes. Examples range
from college campuses and corporate networks to global Internet Service Providers
(ISPs). An AS has its own routers and routing policies, and connects to other
ASes to exchange traffic with remote hosts. A router typically has very detailed
knowledge of the topology within its AS, and limited reachability information about
other ASes. Figure 6.1 shows an example of the Internet topology, where there are
large transit ISPs such as MCI or AT&T, and stub ASes, such as the University
of Massachusetts’ network, which does not provide transit service to other ASes.
Note that the topologies of the transit ISPs and stub ASes shown in this example are
much simpler than those in reality. Typically, a large transit ISP consists of hundreds
or thousands of routers.

Fig. 6.1 An example topology of interconnection among Internet service providers and stub
networks (node labels: Google.com, Sprint, AS 15169, AS 1249, Servers, Umass.edu, MCI, AT&T)

ASes interconnect at public Internet exchange points (IXPs) such as MAE-EAST
or MAE-WEST, or over dedicated point-to-point links. Public exchange points typically
consist of a shared medium such as a Gigabit Ethernet, or an ATM switch, that interconnects routers from several different ASes. Physical connectivity at the IXP does
not necessarily imply that every pair of ASes exchanges traffic with each other. AS
pairs negotiate contractual agreements that control the exchange of traffic. These
relationships include provider-to-customer, peer-to-peer, and backup, and are discussed in more detail in Section 6.4.1.
Each AS has responsibility for carrying traffic to and from a set of customer
IP addresses. The scalability of the Internet routing infrastructure depends on the
aggregation of IP addresses in contiguous blocks, called prefixes, each consisting of
a 32-bit IP address and a mask length (e.g., 1.2.3.0/24). An IP address is generally
shown as four octets of numbers from 0 to 255 represented in decimal form. The
mask length is used to indicate the number of significant bits in the IP address. That
is, a prefix aggregates all IP addresses that match the IP address in the significant
bits. For example, prefix 1.2.3.0/24 represents all addresses between 1.2.3.0 and
1.2.3.255.
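The prefix notation can be checked with a short sketch using Python's standard `ipaddress` module. The helper `covers` is a hypothetical name introduced here to make the comparison of significant bits explicit; it is restricted to 32-bit IPv4 addresses.

```python
import ipaddress

prefix = ipaddress.ip_network("1.2.3.0/24")

# First and last address covered by the prefix, and how many there are.
print(prefix.network_address, prefix.broadcast_address, prefix.num_addresses)


def covers(prefix, address):
    """True if `address` matches `prefix` in its most significant bits (IPv4 only)."""
    shift = 32 - prefix.prefixlen  # number of bits that are not significant
    return (int(ipaddress.ip_address(address)) >> shift) == \
           (int(prefix.network_address) >> shift)


print(covers(prefix, "1.2.3.77"))  # inside the /24
print(covers(prefix, "1.2.4.1"))   # differs in the third octet
```

The same membership test is available directly as `ipaddress.ip_address("1.2.3.77") in prefix`; the explicit bit shift is shown only to mirror the description of the mask length.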
An AS employs an intradomain routing protocol (IGP) such as OSPF or IS-IS to determine how to reach routers and networks within itself, and employs an
interdomain routing protocol, i.e., Border Gateway Protocol (BGP) in the current Internet, to advertise the reachability of networks (represented as prefixes) to
neighboring ASes.

6.2.2 IGP
Each AS uses an intradomain routing protocol or IGP for routing within the AS.
There are two classes of IGP: (1) distance-vector and (2) link-state routing protocols.
In distance-vector routing, every routing message propagated by a router to its
neighbors contains the length of the shortest path to a destination. In link-state routing, every router learns the entire network topology along with the link costs. Then
it computes the shortest path (or the minimum cost path) to each destination. When
a network link changes state, a notification, called link state advertisement (LSA),
is flooded throughout the network. All routers note the change and recompute their
routes accordingly.
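The link-state computation can be sketched with Dijkstra's shortest-path algorithm run over a small link-state database. This is a simplified illustration: the three-router topology and its costs are invented, and the removal of a link stands in for the effect of a flooded LSA.

```python
import heapq


def shortest_path_costs(link_costs, source):
    """Dijkstra's algorithm; link_costs maps router -> {neighbor: cost}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, cost in link_costs.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


# Invented link-state database for three routers.
lsdb = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 2},
    "C": {"A": 5, "B": 2},
}
before = shortest_path_costs(lsdb, "A")

# An LSA reports that the B-C link is down: update the database and recompute.
del lsdb["B"]["C"]
del lsdb["C"]["B"]
after = shortest_path_costs(lsdb, "A")

print(before, after)
```

Router A first reaches C via B and, after the link change, falls back to its direct but more expensive link to C.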

6.2.3 BGP
The interdomain routing protocol, BGP, is the glue that pieces together the various
diverse networks or ASes that comprise the global Internet today. It is used among
ASes to exchange network reachability information. Each AS has one or more border routers that connect to routers in neighboring ASes, and possibly a number of
internal BGP speaking routers.
BGP is a path-vector routing protocol that enables routers to exchange the
paths used for reaching a destination. By including the path in the route update information, one can avoid loops by eliminating any path that traverses the same node
twice. Using a path vector protocol, routers running BGP distribute reachability
information about destinations (network prefixes) by sending route updates – containing route announcements or withdrawals – to their neighbors in an incremental
manner. BGP constructs paths by successively propagating advertisements between
pairs of routers that are configured as BGP peers. Each advertisement concerns
a particular prefix and includes the list of ASes along the path (the AS path) to
the network containing the prefix. By representing the path to be traversed by the
ASes, BGP hides the details of the topology and routing information inside each
AS. Before accepting an advertisement, the receiving router checks for the presence
of its own AS number in the AS path to discard routes with loops. Upon receiving
an advertisement, a BGP speaking router must decide whether or not to use this
path and, if the path is chosen, whether or not to propagate the advertisement to
neighboring ASes (after adding its own AS number to the AS path). BGP requires
that a router simply advertise its best route for each destination to its neighbors.
A BGP speaking router withdraws an advertisement when the prefix is no longer
reachable with this route, which may lead to a sequence of withdrawals by upstream
ASes that are using this path.
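The loop check and the prepending step described above can be sketched in a few lines of Python. This is only an illustration of the path-vector idea, not router code; the function names accept and propagate are our own, and route selection and policy are deliberately left out.

```python
from typing import Tuple

ASPath = Tuple[int, ...]


def accept(my_as: int, as_path: ASPath) -> bool:
    """Implicit import check: drop a route whose AS path already contains my_as."""
    return my_as not in as_path


def propagate(my_as: int, as_path: ASPath) -> ASPath:
    """Prepend my_as to the AS path before advertising the route to a neighbor AS."""
    return (my_as,) + as_path


# AS 0 originates the prefix; AS 1 and then AS 2 pass the route on.
sent_by_as0 = propagate(0, ())
sent_by_as1 = propagate(1, sent_by_as0)
sent_by_as2 = propagate(2, sent_by_as1)

print(sent_by_as2)             # AS path as seen by a neighbor of AS 2
print(accept(3, sent_by_as2))  # AS 3 is not on the path, so it may use the route
print(accept(1, sent_by_as2))  # AS 1 is already on the path, so it discards the route
```

Because only AS numbers appear in the path, nothing about the routers or links inside an AS is revealed to its neighbors.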
When there is an event affecting a router’s best route to a destination, that router
will compute a new best route and advertise the routing change to its neighbors.
If the router no longer has any route to the destination, it will send a withdrawal
message to neighbors for that destination. When an event causes a set of routers
to lose their current routing information, the routing change will be propagated to
other routers. To limit the number of updates that a router has to process within a
short time period, a rate-limiting timer, called the Minimum Route Advertisement
Interval (MRAI) timer, determines the minimum amount of time that must elapse
between routing updates to a neighbor [26]. This has the potential to reduce the
number of routing updates, as a single routing change might trigger multiple transient routes during the path exploration or route convergence process before the final
stable route is determined. If new routes are selected multiple times while waiting
for the expiration of the MRAI timer, the latest selected route shall be advertised
at the end of the MRAI interval. To avoid prolonged loss of connectivity, RFC 4271 [26] specifies that the MRAI timer is applied only to BGP announcements, not to explicit
withdrawals. However, some router implementations might apply the MRAI timer
to both announcements and withdrawals.
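The following sketch models the MRAI behavior described above for a single neighbor: an announcement that arrives while the timer is running is held, only the latest held route is sent when the timer expires, and withdrawals bypass the timer. The class MraiSender, its method names, and the explicit tick call are assumptions made for illustration; a real implementation uses per-peer timers inside the BGP process, and the 30-second value is simply an example.

```python
class MraiSender:
    """Per-neighbor sender: announcements are rate limited, withdrawals are not."""

    def __init__(self, mrai_seconds):
        self.mrai = mrai_seconds
        self.last_sent = None  # time of the last announcement actually sent
        self.pending = None    # latest route selected while the timer is running
        self.sent = []         # log of (time, message) pairs sent to the neighbor

    def announce(self, now, route):
        if self.last_sent is None or now - self.last_sent >= self.mrai:
            self._send(now, route)
        else:
            self.pending = route  # overwrite: only the latest route survives

    def withdraw(self, now):
        self.pending = None
        self.sent.append((now, "WITHDRAW"))  # not delayed by the MRAI timer

    def tick(self, now):
        """Advance the clock; flush the pending route once the timer has expired."""
        if self.pending is not None and now - self.last_sent >= self.mrai:
            route, self.pending = self.pending, None
            self._send(now, route)

    def _send(self, now, route):
        self.sent.append((now, route))
        self.last_sent = now


sender = MraiSender(30)
sender.announce(0, (1, 0))      # sent immediately
sender.announce(5, (1, 2, 0))   # held: timer still running
sender.announce(12, (1, 3, 0))  # replaces the held route
sender.tick(20)                 # timer has not expired yet
sender.tick(30)                 # timer expires: the latest route is sent
sender.withdraw(31)             # withdrawals are sent right away
print(sender.sent)
```

The transient route (1 2 0) never reaches the neighbor, which is how the timer can reduce the number of updates during path exploration.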
BGP sessions can be established between router pairs in the same AS (referred to as iBGP
sessions) or in different ASes (referred to as eBGP sessions). Figure 6.2 illustrates
examples of iBGP and eBGP sessions. Each BGP speaking router originates updates for one or
more prefixes, and can send the updates to the immediate neighbors via an iBGP or eBGP
session. iBGP sessions are established between routers in the same AS in order for the
routers to exchange routes learned from other ASes. In the simplest case, each router has an
iBGP session with every other router (i.e., fully meshed iBGP configuration). In the
fully meshed iBGP configuration, a route received from an iBGP router cannot be sent
to another iBGP speaking router, since a route via an iBGP peer should be directly
received from the iBGP peer.

Fig. 6.2 Internal BGP (iBGP) versus external BGP (eBGP)

In practice, an AS with hundreds or thousands of routers may need to improve scalability
by using route reflectors to avoid a fully meshed iBGP configuration.
This optimization is intended to reduce iBGP traffic without affecting the routing decision.
Each route reflector and its clients (i.e., iBGP neighbors that are not
route reflectors themselves) form a cluster. Figure 6.3 shows an example of a route
reflector configuration, where cluster 1 contains route reflector RR1 and its three clients.
Typically, route reflectors and their clients are located in the same facility, e.g., in
the same Point of Presence (PoP). Route reflectors themselves are fully meshed. For
example, in Fig. 6.3, the three route reflectors RR1, RR2, and RR3 are fully meshed.
A route reflector selects the best route among the routes learned via clients in the
cluster, and sends the best route to all other clients in the cluster except the one from
which the best route is learned, as well as to all other route reflectors. Similarly, it
also reflects routes learned from other route reflectors to all of its own clients.
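To see why route reflection helps with scaling, we can count iBGP sessions. The sketch below assumes that every client peers only with the route reflector of its own cluster and that the route reflectors are fully meshed, as described above. Cluster 1 has three clients as in the text; the client counts of the other two clusters are hypothetical.

```python
def full_mesh_sessions(n_routers):
    """Number of iBGP sessions when every router peers with every other router."""
    return n_routers * (n_routers - 1) // 2


def route_reflector_sessions(clients_per_cluster):
    """Sessions with one route reflector per cluster and fully meshed reflectors."""
    n_reflectors = len(clients_per_cluster)
    return full_mesh_sessions(n_reflectors) + sum(clients_per_cluster)


clusters = [3, 3, 4]  # clients in clusters 1, 2, 3 (the last two are assumed)
n_routers = len(clusters) + sum(clusters)

print(full_mesh_sessions(n_routers))       # sessions in a full mesh of all routers
print(route_reflector_sessions(clusters))  # sessions with route reflectors
```

For these 13 routers the full mesh needs 78 sessions, while the route reflector configuration needs 13.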

6.2.4 Routing Policy and Route Selection Process
The simplest routing policy is the shortest AS path routing, where each AS selects
a route with the shortest AS path. BGP, however, allows much more flexible routing
policies than the shortest AS path routing. An AS can favor a path with a longer AS
path length by assigning a higher local preference value. BGP also allows an AS to
send a hint to a neighbor on the preference that should be given to a route by using
the community attribute. BGP also enables an AS to control how traffic enters its
network by assigning a different multi-exit discriminator (MED) value to the
advertisements it sends on each link to a neighboring AS. Otherwise, the neighboring
AS would select the link based on the link cost within its own intradomain routing
protocol. An AS can also discourage traffic from entering its network by performing
AS prepending, which inflates the length of the AS path by listing an AS number
multiple times.

Fig. 6.3 An example of route reflector configuration for scaling iBGP

Fig. 6.4 Import policies, route selection, and export policies

Processing an incoming BGP update involves three steps as shown in Fig. 6.4:
1. Import policies that decide which routes to consider
2. Path selection that decides which route to use
3. Export policies that decide whether (and what) to advertise to a neighboring AS
An AS can apply both implicit and explicit import policies. Every eBGP peering
session has an implicit import policy that discards a routing update when the receiving
BGP speaker’s AS already appears in the AS path; this is essential to avoid
introducing a cycle in the AS path. The explicit import policy includes denying or
permitting an update, and assigning a local-preference value. For example, an explicit
import policy could assign local preference to be 100 if a particular AS appears
in the AS path or deny any update that includes AS 2 in the path.

Table 6.1 Steps in the BGP path selection process
1. Highest local preference
2. Shortest AS path
3. Lowest origin type
4. Smallest MED
5. Smallest IGP path cost to egress router
6. Smallest next-hop router id

After applying the import policies for a route update from an eBGP session,
each BGP speaking router then follows a route selection process that picks the
best route for each prefix, which is shown in Table 6.1. The BGP speaking router
picks the route with the highest local preference, breaking ties by selecting the route
with the shortest AS path. Note that local preference overrides the AS-path length.
Among the remaining routes, the BGP speaking router picks the one with the smallest MED, breaking ties by selecting the route with the smallest cost to the BGP
speaking router that passes the route via an iBGP session. Note that, since the tiebreaking process draws on intradomain cost information, two BGP speaking routers
in the same AS may select different best routes for the same prefix. If a tie still
exists, the BGP speaking router picks the route with the smallest next hop router ID.
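The steps of Table 6.1 amount to a lexicographic comparison, which can be sketched as a sort key. The Route class and its field names are illustrative assumptions, and the sketch follows the simplified table literally; deployed routers apply further rules, for example comparing MED only among routes learned from the same neighboring AS.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Route:
    local_pref: int
    as_path: Tuple[int, ...]
    origin: int       # lower is preferred: 0 = IGP, 1 = EGP, 2 = incomplete
    med: int
    igp_cost: int     # IGP path cost to the egress router
    next_hop_id: str  # router ID of the next hop, in dotted-quad notation


def selection_key(route: Route):
    """Sort key whose minimum is the best route under the six steps of Table 6.1."""
    router_id = tuple(int(part) for part in route.next_hop_id.split("."))
    return (-route.local_pref, len(route.as_path), route.origin,
            route.med, route.igp_cost, router_id)


def best_route(routes: Iterable[Route]) -> Route:
    return min(routes, key=selection_key)


via_as1 = Route(100, (1, 4, 0), 0, 0, 8, "1.1.1.1")
via_as2 = Route(90, (2, 0), 0, 0, 6, "2.2.2.1")

# Local preference overrides AS-path length: the longer path via AS 1 wins.
print(best_route([via_as1, via_as2]))
```

Negating the local preference lets a single min call handle the one attribute for which higher is better.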
Each BGP speaking router sends only its best route (one best route for each
prefix) via BGP sessions, including eBGP and iBGP sessions. The BGP speaking
router applies implicit and explicit export policies on each eBGP session to a neighboring BGP speaker. Each BGP speaking router applies an implicit policy that sets
MED to default values, sets the next hop to the interface that connects the BGP session,
and prepends the AS number of the BGP speaking router to the AS path. Explicit
export policies include permitting or denying the route, assigning MED, assigning
community set, and prepending the AS number one or more times to the AS path.
For example, an AS could prepend its AS number several times to the AS path for
a prefix.
Although the BGP route selection process aims to select routes based mostly
on BGP attributes, it is not totally independent from IGP. In fact, IGP cost can
influence route selection when the best path is based on the comparison of the IGP
cost to the egress routers. We refer to this tie-break BGP route selection as hot-potato
routing, since with all other BGP attributes being equal, each AS selects the
route with the shortest path to exit its network. For example, in Fig. 6.5, AS 3 learns
BGP routes to a destination originated by AS 0 at egress routers C1 and C2 from
AS 1 and AS 2, respectively. The value on each link within AS 3 represents the
corresponding IGP cost. Suppose that the two learned routes to the destination have
identical local preferences. We see that the AS path lengths of the two routes are
equal. Router C3 learns two routes, from C1 and C2, respectively, and selects the
one learned from C1 as the best route because the IGP cost of path (C3 C1) is smaller
than that of path (C3 C2). Similarly, router C4 will select the route learned from C2
as the best route because path (C4 C2) has a smaller IGP cost than path (C4 C1). However,
hot-potato routing means that changing an IGP weight can cause BGP speaking routers
to select a different best route and, therefore, shift egress routers. For instance, by
changing the IGP link cost between router C1 and C3 from 8 to 10, router C3 will
change its egress router from C1 to C2.

Fig. 6.5 An example illustrating hot-potato routing at AS 3. The value around a link
represents an IGP weight

Fig. 6.6 Local preference configuration

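The egress shift at router C3 can be reproduced with a small sketch. The cost of 8 between C3 and C1 is given in the text; the cost of 9 between C3 and C2 is an assumption, chosen to be consistent with the statement that raising the C3-C1 cost to 10 moves the egress to C2. The function name pick_egress is ours.

```python
def pick_egress(igp_cost_to_egress):
    """Hot-potato tie-break: choose the egress router with the smallest IGP cost."""
    return min(igp_cost_to_egress,
               key=lambda egress: (igp_cost_to_egress[egress], egress))


# IGP costs from router C3 to the two egress routers (the C2 cost is assumed).
costs_from_c3 = {"C1": 8, "C2": 9}
before = pick_egress(costs_from_c3)

costs_from_c3["C1"] = 10  # IGP weight change on the link between C3 and C1
after = pick_egress(costs_from_c3)

print(before, after)
```

No BGP message is involved in this change: an intradomain weight adjustment alone is enough to move interdomain traffic to a different exit point.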
BGP routing policy configuration is typically specified in a router configuration
file. A BGP routing policy can be assigned based on the destination prefix or the next
hop AS. For example, in Fig. 6.6, AS 0 advertises a prefix “10.1.1.0/24” to the Internet. AS 3 connects to AS 1 and AS 2, and will get routing updates about the
destination “10.1.1.0/24” from the two ASes. AS 3 decides what path its outbound
traffic to the destination is going to take. Suppose that AS 3 prefers to use the connection via AS 1 to reach the destination. As shown in the following configuration
based on Cisco IOS commands, Router RTA at AS 3 sets an explicit import policy
that assigns a local preference value 100 to the route from AS 1:
router bgp 3
 neighbor 1.1.1.1 remote-as 1
 neighbor 1.1.1.1 route-map AS1-IN in
 neighbor 4.4.4.2 remote-as 3
!
access-list 1 permit 0.0.0.0 255.255.255.255
!
route-map AS1-IN permit
 match ip address 1
 set local-preference 100
We describe the commands in the above configuration as follows. The first
command starts a BGP process with an AS number of 3 at router RTA. The second
command sets up an eBGP session with the neighboring router in AS 1. The route-map command
associated with the neighbor statement applies route map AS1-IN to inbound updates
from AS 1. Like the first neighbor command, the fourth command sets
up a BGP session, in this case an iBGP session with router RTB. The access-list command
creates access list 1, which permits all advertisements. The route-map command creates a route
map named AS1-IN that uses access list 1 to identify routes to be assigned a local
preference of 100.

6.2.5 Convergence Process of BGP
In this section, we illustrate how BGP routing processes converge to stable routes.
Figure 6.7 shows an example of a routing policy configuration of a simple topology.
In this chapter, we simplify the representation of the network using graph theoretical
notations of nodes and edges, where a node represents either an AS or a BGP speaking
router, and an edge represents the link between two nodes. In this example, we
use a node to represent an AS. Furthermore, throughout this chapter, we focus on
one destination prefix, d, which is always originated from AS 0. The figure indicates
the export policy by showing all AS paths that an AS can receive from the adjacent
router on the associated interface (referred to as permissible AS paths). The figure
also indicates the import policy by ordering the paths in the descending order of
local preference.

Fig. 6.7 An example of policy configuration that converges. The paths around a node
represent its permissible AS paths and the paths are ordered in the descending order
of preference
[Figure: nodes 0, 1, 2, and 3. Permissible paths in descending order of preference:
node 1: (1 0), (1 2 0); node 2: (2 3 0), (2 0); node 3: (3 1 0), (3 0)]

The BGP routing process converges as follows.
1. Destination prefix d is announced to ASes 1, 2, 3 via direct links.
2. ASes 1, 2, and 3 each choose their direct path as their best route, since it is the
only route they have received, and announce these direct paths to their neighbors.
3. AS 1 now has two paths, (1 0) and (1 2 0), since these are the only permissible paths.
AS 2 now has two paths, (2 0) and (2 3 0). AS 3 now has two paths, (3 0) and
(3 1 0). According to the local preference of each AS, AS 1 ends up choosing
(1 0) as its best route, AS 3 chooses (3 1 0) as its best route, and AS 2 chooses
(2 3 0) as its best route.
4. AS 3 announces its best path (3 1 0), and therefore, implicitly withdraws its route
announcement of (3 0) from AS 2. Now, with (2 0) as its only path, AS 2 chooses
(2 0) as its best path.
5. AS 2 announces its best path to both AS 1 and AS 3. However, such an announcement does not change the route that AS 1 or AS 3 chooses.
Therefore, all ASes choose a stable route where no routers need to send new update
messages, and hence the BGP process converges. Note that during the convergence
process, each AS selects and/or announces its best route in an asynchronous manner
that is determined by the expiration of MRAI timers. We simplify the process by
assuming that route announcements are performed in lock step. Nevertheless, it
can be proved that in this example, no matter what the exact steps of the convergence
process are, the stable route reached by each AS is the same.
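The lock-step convergence just described can be replayed with a short simulation. This is a sketch under the same simplifying assumption as the text (all ASes update simultaneously in rounds); the permissible paths and their order are taken from Fig. 6.7, while the function names step and converge are our own.

```python
# Permissible paths of each AS, most preferred first (as in Fig. 6.7).
PERMITTED = {
    1: [(1, 0), (1, 2, 0)],
    2: [(2, 3, 0), (2, 0)],
    3: [(3, 1, 0), (3, 0)],
}


def step(chosen):
    """One lock-step round: every AS re-selects using last round's announcements."""
    new = {}
    for asn, ranked in PERMITTED.items():
        # A path is on offer if it is a neighbor's current path with asn prepended.
        offered = {(asn,) + path for path in chosen.values() if path is not None}
        new[asn] = next((path for path in ranked if path in offered), None)
    return new


def converge(max_rounds=20):
    chosen = {0: (0,), 1: None, 2: None, 3: None}  # only AS 0 has a route at first
    history = []
    for _ in range(max_rounds):
        nxt = {0: (0,), **step(chosen)}
        history.append(nxt)
        if nxt == chosen:  # nobody changed, so no new updates are sent
            return nxt, history
        chosen = nxt
    return None, history


final, history = converge()
for round_no, state in enumerate(history, start=1):
    print(round_no, state)
```

The rounds reproduce steps 2 to 5 above, ending with AS 1 on (1 0), AS 2 on (2 0), and AS 3 on (3 1 0).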

6.3 Multihoming Technology
In this section, we provide an overview of the current multihoming technology,
which is widely used to provide redundant connectivity. Multihoming refers to the
technology where an AS connects to the Internet through multiple connections via
one or more upstream providers. It is intended to enhance the reliability of Internet connectivity. When one of the connections fails or is under maintenance, the AS
can still connect to the Internet via other connections. Multihoming configuration
can be achieved using BGP configuration, static routes, Network Address Translation (NAT), or a combination of the above. In this section, we focus on describing
multihoming with BGP configuration.
The redundancy provided by multihoming can bring additional complexity to the
network configuration. First of all, it is imperative to designate primary and backup
connections so that, when the primary connection fails, traffic automatically falls back to the backup connection. Second, it is desirable to distribute
traffic across multiple connections. Traffic can be classified into inbound and outbound traffic. Outbound traffic is the traffic originating within the multihomed AS
or its customers destined to other ASes; inbound traffic is the traffic destined to the
AS or its customers coming from other ASes.
A multihomed AS can be multihomed to a single provider, or to multiple
providers. We describe how multihoming to a single provider and to multiple
providers can be configured in Sections 6.3.1 and 6.3.2, respectively.

6.3.1 Multihoming to a Single Provider
The simplest way for an AS to connect to the Internet is by setting up a single
connection with a provider. However, the AS then has only one connection over which to
send and receive data. This single-homed configuration is not resilient to a single point
of failure, such as a link or router failure or maintenance. To address this issue, the AS
can set up multiple connections to the provider. Four types of connections can be
established between an AS and its provider. We describe each type of connection
as follows:
• Multiple Connections Between a Single Customer Router and Single Provider
Access Router (SSA) An AS has a single border router connected to its provider’s
access router with multiple links. As illustrated in Fig. 6.8a, AS 0 has a single
border router BoR1, which connects to AS 1’s access router, AR1, via two links.
If one of the links fails, the other link can be used.
• Multiple Connections Between a Single Customer Router and Multiple Provider
Access Routers (SMA) An AS has a single router connected to its provider’s
multiple access routers. For example, in Fig. 6.8b, BoR1 connects to AS 1 at
both AR1 and AR2. This configuration can maintain connectivity with a single
point of failure of links or the access routers, but cannot do so with a failure of
the customer router.
• Multiple Connections Between Multiple Customer Routers and Multiple Provider
Access Routers (MMA) An AS has multiple routers connected to its provider’s
multiple access routers. Note that those multiple access routers at the provider
are connected to the same backbone router. For example, in Fig. 6.8c, AS 0 has
two routers: BoR1 and BoR2. Each border router connects to an access router
(AR) in AS 1. This configuration can maintain connectivity with a single point
of failure of access routers or border routers. However, the two access routers
connect to the same backbone router, BaR1. A failure at BaR1 can cause both
the connections to become unavailable.
• Multiple Connections Between Multiple Customer Routers and Multiple Provider
Backbone Routers (MMB) An AS has multiple connections between its multiple
border routers and multiple backbone routers at its provider. This configuration
can achieve higher reliability than that of MMA. For example, in Fig. 6.8d, AS
0 has two border routers, BoR1 and BoR2, which are connected to geographically
separate backbone routers at AS 1. AS 0’s BoR1 connects to AS 1’s access
router AR1, and they are at the same geographical location, while the border
router BoR2 is connected to another backbone router BaR1. A private physical
connection connects the customer AS’s border router BoR2 and the backbone
router BaR1. This method can maintain connectivity even under a failure of the
backbone router.

Fig. 6.8 Four types of multihoming connections: (a) SSA, (b) SMA, (c) MMA, (d) MMB

Next, we describe how an AS can control traffic over the primary and backup
links. First, we discuss the control of outbound traffic. A multihomed AS can assign
different local preference values to the routes learned from its provider to control its
outgoing traffic. For example, in Fig. 6.8b, BoR1 will receive two identical routes
for each destination prefix. AS 0 can assign higher local preference values to prefer
the routes received through one particular connection over other routes for the same
destination received through the other connection. Multihomed configurations of
SSA, MMA or MMB can apply the same method to control outbound traffic over
the primary link. In addition, an AS multihomed to a single provider with SSA can
use another method to control outbound traffic: setting the next hop to a virtual
address. For example, in Fig. 6.8a, AR1 can be assigned a virtual address on a loopback
interface, say 20.10.10.1. BoR1 will set up a connection with the loopback address. As a result, all
routes that BoR1 receives from AR1 will have the same next hop, 20.10.10.1. Since
next hop 20.10.10.1 can be reached via two connections, outbound traffic can be
distributed over the two links.
Second, we discuss how an AS multihomed to a single provider can control its
inbound traffic. In this case, the multihomed AS can tweak the BGP attribute values,
such as AS path length or MED, to influence route selection at the provider’s routers.
For example, an AS can prepend its AS number on the AS path of the route update
announced via the backup link, or send the route update via the backup link with
a higher MED value than that via the primary link. As a result, the primary link is
used in normal situations since it has a shorter AS path or lower MED value. When
the primary link is down, the backup link will be used.

6.3.2 Multihoming to Multiple Providers
The availability of the Internet connectivity provided by upstream providers is very
important for an AS. Multihoming to more than one provider can ensure that the
AS maintains the global Internet connectivity even if the connection to one of its
providers fails [1]. For example, in Fig. 6.9, AS 0 is multihomed to two upstream
providers: AS 1 and AS 2. AS 0 may use one of its providers as its primary provider,
and the other as a backup provider. When connectivity through the primary provider
fails, AS 0 still has its connectivity to the Internet through the backup provider.
A multihomed AS can be configured to direct its outbound traffic through the
primary provider, so that its outbound traffic uses the connection through the backup
provider only when the connection through the primary provider fails. To achieve this
goal, a multihomed AS can use the same approach described for an AS multihomed
to a single provider. That is, the AS may assign a higher local preference to the
routes through the primary provider than to those through the backup provider.

Fig. 6.9 An example of an AS multihomed to two upstream providers

A multihomed AS might control which provider its inbound traffic can use. There
are several approaches to control the route used for inbound traffic. The simplest
approach is to advertise its prefixes only to the primary provider so that inbound
traffic can use the primary provider. For example, in Fig. 6.9, AS 0 can advertise
its prefix only to its primary provider, say, AS 1. However, such selective advertisement
cannot provide the redundancy afforded by multihoming. In the above example,
if the link between AS 1 and AS 0 fails, AS 0 becomes unreachable until AS 0
notices the failure and advertises its prefixes to the backup provider, AS 2. In this
case, the time it takes to fail over to the backup provider depends on how fast the
multihomed AS detects the failure and decides to announce its prefixes to the
backup provider, and how fast the announcement propagates to the global Internet.
Alternatively, an AS can control the route taken by the inbound traffic by splitting
its prefix into several more specific prefixes and advertising each more specific prefix to the
provider designated as primary for it. For example, in Fig. 6.10, AS 0 has a prefix, “12.0.0.0/19”. AS
0 splits the prefix into two more specific prefixes: “12.0.0.0/20” and “12.0.16.0/20”.
AS 0 can announce “12.0.0.0/20” to AS 1, and “12.0.16.0/20” to AS 2. At the
same time, AS 0 can advertise its prefix, “12.0.0.0/19”, to both providers. As a result, inbound traffic to “12.0.0.0/20” arrives via AS 1, while inbound traffic to
“12.0.16.0/20” arrives via AS 2. This approach can balance the traffic load between
the two providers by designating each one as the primary provider for a specific
prefix. At the same time, the approach can tolerate the failure of links to providers.
For example, if the link between AS 0 and AS 1 fails, destinations within prefix
“12.0.0.0/20” can still be reached via AS 2 since prefix “12.0.0.0/19” is announced
via AS 2. Despite the advantage of load balancing and fault tolerance, this approach
has the drawback of potentially increasing the number of prefixes announced to the
global Internet.
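The prefix-splitting example can be checked with Python's standard ipaddress module. The sketch below models only the longest-prefix-match effect: the functions announcements and inbound_provider are illustrative names, and the rest of the Internet is reduced to a single table of (prefix, provider) pairs.

```python
import ipaddress

aggregate = ipaddress.ip_network("12.0.0.0/19")
low, high = aggregate.subnets(prefixlen_diff=1)  # the two more specific /20 prefixes


def announcements(link_to_as1_up=True, link_to_as2_up=True):
    """Prefixes that AS 0 announces, paired with the provider that receives them."""
    table = []
    if link_to_as1_up:
        table += [(low, "AS 1"), (aggregate, "AS 1")]
    if link_to_as2_up:
        table += [(high, "AS 2"), (aggregate, "AS 2")]
    return table


def inbound_provider(destination, table):
    """Provider through which traffic arrives, using longest-prefix match."""
    address = ipaddress.ip_address(destination)
    matches = [(net.prefixlen, via) for net, via in table if address in net]
    return max(matches)[1] if matches else None


print(low, high)
print(inbound_provider("12.0.1.1", announcements()))   # more specific /20 via AS 1
print(inbound_provider("12.0.17.5", announcements()))  # more specific /20 via AS 2
# After the link to AS 1 fails, only the /19 announced via AS 2 still covers the address.
print(inbound_provider("12.0.1.1", announcements(link_to_as1_up=False)))
```

In normal operation the /20 prefixes steer traffic, and the covering /19 takes over only when a more specific prefix disappears.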

Fig. 6.10 An example of splitting prefixes

Another approach to control the route of inbound traffic is AS prepending. An
AS can prepend its AS number one or several times when announcing a route to the backup
provider. This can “discourage” other ASes from selecting the route via the backup provider.
Note that this approach cannot ensure that all inbound traffic will go through the
primary provider. It is possible for an AS to use the longer backup path rather than
the shorter primary path if the backup path has a higher local preference. In fact,
most ASes prefer routes learned from customers over routes learned from providers.
Consider the example network in Fig. 6.9: AS 2 learns paths to reach prefixes in AS 0
from both its direct connection and its upstream connections, but AS 2 will prefer the
direct connection, although AS 0 intends it to be a backup path.
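A small sketch shows why prepending is only a hint. Routes are compared on local preference first and on AS-path length second, so a prepended path still wins wherever it carries a higher local preference. The AS numbers 5, 6, and 7, the local preference values, and the function name choose are all hypothetical.

```python
def choose(routes):
    """Pick a route by highest local preference, then by shortest AS path."""
    return max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))["label"]


# A remote AS that treats both routes alike: prepending works as intended.
remote_view = [
    {"label": "via primary (AS 1)", "local_pref": 100, "as_path": (5, 1, 0)},
    {"label": "via backup (AS 2)", "local_pref": 100, "as_path": (6, 2, 0, 0, 0)},
]

# The backup provider AS 2 itself: the customer route has the higher local preference.
as2_view = [
    {"label": "direct customer link", "local_pref": 200, "as_path": (0, 0, 0)},
    {"label": "via upstream provider", "local_pref": 100, "as_path": (7, 1, 0)},
]

print(choose(remote_view))
print(choose(as2_view))
```

In the second case the prepended path has the same length as the alternative, yet AS 2 never reaches the length comparison because local preference already decides.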
In summary, multihoming techniques aim to provide redundant connectivity.
Nevertheless, the extent to which these multihoming techniques can ensure continuous
connectivity hinges on how long it takes for the routing protocol, BGP, to fail over
to backup routes. In Section 6.4.2, we will discuss how BGP can recover from a
failure and how long it takes BGP to discover alternate routes.

6.4 Challenges in Interdomain Routing
Failures and changes in topology or routing policy are fairly common in the Internet
due to various causes such as maintenance, router crash, fiber cuts, and misconfiguration [4, 17, 18]. Ideally, when such changes occur, routing protocols should be
able to quickly react to those failures to find alternate paths. However, BGP is a
policy-based routing protocol, and is not guaranteed to converge to a stable state,
in which all routers agree on a stable set of routes. Persistent route oscillation can
significantly degrade the end-to-end performance of the Internet. Furthermore, even
if BGP converges, it has been known to be slow to react and recover from network
changes. During routing convergence, there are three potential routing states from
the perspective of any given router: path exploration during which an alternate route
instead of the final stable route is used, transient failures during which there is
no route to a destination but a route will be eventually discovered, and transient
forwarding loops in which routes to a destination form a forwarding loop and the
forwarding loop will eventually disappear. Path exploration does not lead to packet
drops, while transient failures or transient loops do. In this chapter, we describe how
persistent route oscillation, routing failures, and routing loops can occur.

6.4.1 Persistent Route Oscillation
The BGP routing protocol provides great flexibility in the routing policies that can be
set by each AS. However, arbitrary setting of routing policies can lead to persistent route
oscillation. For example, Fig. 6.11 shows the “bad gadget” example used in [9]. In
this example and all of the following examples, we focus on a single destination
prefix that originates from AS 0, without loss of generality. In this example, ASes 1,
2, and 3 receive only the direct path to AS 0 and the indirect path via their clockwise
neighbor, and prefer to route via their clockwise neighbor over the direct path to AS
0. For example, AS 2 receives only paths (2 3 0) and (2 0) and prefers route (2 3
0) over route (2 0). This routing policy configuration will lead to persistent route
oscillation. In fact, it can be proved [9] that no matter what route an AS chooses
initially, it will keep changing its route and never reach a stable route.

Fig. 6.11 An example of BGP routing policy that leads to persistent route oscillation.
The AS paths around a node represent a set of permissible paths, which are ordered in
the descending order of local preference
[Figure: nodes 0, 1, 2, and 3. Permissible paths in descending order of preference:
node 1: (1 2 0), (1 0); node 2: (2 3 0), (2 0); node 3: (3 1 0), (3 0)]

For example, the following sequence of route changes shows how a persistent route
oscillation can occur.
1. Initially, ASes 1, 2, and 3 choose paths (1 2 0), (2 0), and (3 0), respectively.
2. After AS 2 receives path (3 0) from AS 3, it changes from its current path (2 0)
to the higher preference path (2 3 0), which in turn forces AS 1 to change its path
from (1 2 0) to (1 0) because path (1 2 0) is no longer available.
3. When AS 3 notices that AS 1 uses path (1 0), it changes its path (3 0) to (3 1 0).
This in turn forces AS 2 to change its path to (2 0).
4. After AS 2 sends path (2 0) to AS 1, AS 1 changes its path (1 0) to (1 2 0), which
in turn forces AS 3 to change its path (3 1 0) to (3 0), and the oscillation begins
again.
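One way to check that no stable routing exists for this example is to enumerate every combination of permissible paths and test whether each AS is already using its best available path. The sketch below does this for the policies of Fig. 6.11 and, for contrast, for those of Fig. 6.7. The function names are our own, and the check assumes that the direct path to AS 0 is always available.

```python
from itertools import product


def best_response(ranked, chosen):
    """Most preferred permissible path that is consistent with the neighbors' choices."""
    for path in ranked:
        tail = path[1:]
        # The direct path is always usable; an indirect path requires that the
        # next-hop AS currently uses exactly the remainder of the path.
        if tail == (0,) or chosen.get(tail[0]) == tail:
            return path
    return None


def stable_states(permitted):
    """All path assignments in which no AS would change its route."""
    ases = sorted(permitted)
    stable = []
    for combo in product(*(permitted[a] for a in ases)):
        chosen = dict(zip(ases, combo))
        if all(best_response(permitted[a], chosen) == chosen[a] for a in ases):
            stable.append(chosen)
    return stable


BAD_GADGET = {1: [(1, 2, 0), (1, 0)], 2: [(2, 3, 0), (2, 0)], 3: [(3, 1, 0), (3, 0)]}
FIG_6_7 = {1: [(1, 0), (1, 2, 0)], 2: [(2, 3, 0), (2, 0)], 3: [(3, 1, 0), (3, 0)]}

print(stable_states(BAD_GADGET))  # empty list: no assignment is stable
print(stable_states(FIG_6_7))     # one stable assignment
```

The two policy sets differ only in the preference order at AS 1, which is enough to turn a convergent configuration into one without any stable state.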
In practice, however, routing policies are typically set according to commercial
contractual agreements between ASes. Typically, there are two types of AS relationships: provider-to-customer and peer-to-peer. In the first case, a customer pays
the provider to be connected to the Internet. In the second case, two ASes agree to
exchange traffic on behalf of their respective customers free of charge. Note that
contractual agreement between peering ASes typically requires that traffic via both
directions of the peering link has to be within a ratio negotiated between peering
ASes. In addition to these two common types of relationship, an AS may have a
backup relationship with a neighboring AS. Having a backup relationship with a
neighbor is important when an AS has limited connectivity to the rest of the Internet.
For example, two ASes could establish a bilateral backup agreement for providing
the connection to the Internet in the case that one AS’ link to its primary provider
fails. Typically, provider-to-customer relationships among ASes are hierarchical.
The hierarchical structure arises because an AS typically selects a provider with a
network of larger size and scope than its own. An AS serving a metropolitan area
is likely to have a regional provider, and a regional AS is likely to have a national
provider. It is very unlikely that a nationwide AS would be a customer
of a metropolitan-area AS.
It is common for an AS to adopt an import routing policy, referred to as prefer
customer routing policy, where routes received from an AS’ customers are always
preferred over those received from its peers or providers. Such a partial order on
the set of routes is compatible with economic incentives. Each AS has economic
incentives to prefer routes via a customer link to those via peer or provider links,
since it does not have to pay for the traffic via customer links. On the other hand,
the AS has to pay for traffic via provider links, and traffic sent to its peer has to be
“balanced out” with traffic from its peer. It is also common for an AS to adopt an
export routing policy, referred to as no-valley routing policy, where an AS does not
announce a route from a provider or peer to another provider or peer. In Fig. 6.12
and the following examples, an arrowed line between two nodes represents a
provider-to-customer relationship, with the arrow ending at the customer.
A dashed line represents a peer-to-peer relationship. We visualize a sequence of
customer-to-provider links as an uphill path; for example, path (1 3 5) is an uphill
path. We define a sequence of provider-to-customer links as a downhill path;
for example, path (5 4 1) is a downhill path. A peer-to-peer link is defined as a
horizontal path. The no-valley routing policy ensures that no path contains a valley
where a downhill path is followed by either a peer-to-peer link or an uphill path, or
a peer-to-peer link is followed by an uphill path or a peer-to-peer link. That is, an
AS path may take one of the following forms: (1) an uphill path followed by one
or no peer-to-peer link, (2) a downhill path, (3) a peer-to-peer link followed by a
downhill path, (4) an uphill path followed by a downhill path, or (5) an uphill path
followed by a peering link, followed by a downhill path. For example, in Fig. 6.12,
paths (3 5 4) and (1 3 5 6 4 2) are no-valley paths while AS paths (3 1 4) and (3 1 2 6)
are not no-valley paths.
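The five permitted forms reduce to one pattern: zero or more uphill links, then at most one peer-to-peer link, then zero or more downhill links. The sketch below checks a path against that pattern. The relationships for links 1-3, 3-5, 5-4, and 4-1 follow from the uphill and downhill examples in the text; the remaining relationships (5 and 6 as peers, 1 and 2 as peers, AS 6 as a provider of ASes 4 and 2, AS 4 as a provider of AS 2) are assumptions about Fig. 6.12.

```python
# (provider, customer) pairs; see the assumptions stated above.
PROVIDER_OF = {(3, 1), (4, 1), (5, 3), (5, 4), (6, 4), (4, 2), (6, 2)}
PEERS = {frozenset((5, 6)), frozenset((1, 2))}


def hop_type(a, b):
    """Classify the link traversed from AS a to AS b."""
    if (b, a) in PROVIDER_OF:
        return "up"    # customer-to-provider
    if (a, b) in PROVIDER_OF:
        return "down"  # provider-to-customer
    if frozenset((a, b)) in PEERS:
        return "peer"
    raise ValueError(f"no known link between AS {a} and AS {b}")


def is_no_valley(path):
    """True if the path is uphill*, then at most one peer link, then downhill*."""
    phase = 0  # 0: still climbing, 1: peer link taken, 2: descending
    for a, b in zip(path, path[1:]):
        kind = hop_type(a, b)
        if kind == "up":
            if phase != 0:
                return False  # climbing again after a peer or downhill link
        elif kind == "peer":
            if phase != 0:
                return False  # peer link after a peer or downhill link
            phase = 1
        else:
            phase = 2
    return True


for path in [(3, 5, 4), (1, 3, 5, 6, 4, 2), (3, 1, 4), (3, 1, 2, 6)]:
    print(path, is_no_valley(path))
```

Under these assumptions the first two paths are accepted and the last two are rejected, in agreement with the example above.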
ASes adopt these rules since there is no economic incentive for an AS to transit
traffic between its providers and peers. Note that we name it no-valley routing policy
since such an export policy ensures that no route traverses a provider-to-customer
link and then a customer-to-provider link, or a provider-to-customer link and then a
peer-to-peer link, or a peer-to-peer link and then another peer-to-peer link, or a
peer-to-peer link and then a customer-to-provider link, all of which are valley paths if there
is a hierarchical structure in provider-to-customer relationships.

Fig. 6.12 Paths (3 5 4) and (1 3 5 6 4 2) are no-valley paths while AS paths (3 1 4) and (3 1 2 6)
are not no-valley paths
[Figure: ASes 1 to 6; arrowed lines denote provider-to-customer links and dashed lines
denote peer-to-peer links]

It has been proved that under the hierarchical provider-to-customer relationships,
these common routing policies can indeed ensure route convergence [8]. Furthermore, these policies ensure route convergence under router or link failures and
changes in routing policy. Each AS has an economic incentive to follow the prefer-customer
routing policy. In addition, the policy is practical to implement, since each AS can
configure its routers without knowing the policies applied in other ASes and
without coordinating with them.
In addition to local preference settings, it has been observed that certain iBGP
configurations may result in persistent route oscillation [2, 10]. Figure 6.13 shows an
example of a route reflector and policy configuration that can lead to persistent route
oscillation. AS 1 contains two route reflectors, A and B. A has two clients, C1
and C2, while B has one client, C3. The IGP cost of the link between two nodes is
indicated beside the link, and the MED value of the routes is indicated in parentheses. It can be proved that no matter what the initial route is for each router, it is not
possible for the routers to reach a stable route. As an example, we show below a
possible sequence of route changes that leads to persistent oscillation.
1. Route reflector A selects path p2 and route reflector B selects path p3.
2. Route reflector A receives p3 and selects p1 because p3 has a lower MED than
p2 and p1 has a lower IGP metric than p3.
3. Route reflector B receives p1 and selects p1 as the best path (due to a lower IGP
cost) and withdraws p3.
4. Route reflector A selects p2 over p1 (due to a lower IGP cost) and withdraws p1.
5. Route reflector B selects p3 over p2 (due to a lower MED). Now both A and B
have returned to their initial routes, and the cycle repeats.
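
The reason route reflector A keeps changing its mind is that the MED rule makes its choice depend on which other routes are visible. The sketch below illustrates this with only the MED and IGP-cost steps of the decision process, assuming all earlier steps tie. The MED and IGP values, the neighbor AS labels "X" and "Y", and the names Route and select_best are hypothetical: they are chosen to match the relations stated in the steps above (p3 has a lower MED than p2; at A, p2 has a lower IGP cost than p1, and p1 a lower one than p3) and are not the values of Fig. 6.13.

```python
from collections import namedtuple

Route = namedtuple("Route", "name neighbor_as med igp_cost")


def select_best(routes, always_compare_med=False):
    """Pick a route using only the MED step and then the IGP-cost step."""
    routes = list(routes)
    survivors = []
    for r in routes:
        # MED is compared only among routes from the same neighboring AS,
        # unless always_compare_med is set.
        rivals = [o for o in routes
                  if always_compare_med or o.neighbor_as == r.neighbor_as]
        if all(r.med <= o.med for o in rivals):
            survivors.append(r)
    return min(survivors, key=lambda r: r.igp_cost)


# Hypothetical values as seen from route reflector A.
p1 = Route("p1", "X", med=0, igp_cost=5)
p2 = Route("p2", "Y", med=1, igp_cost=4)
p3 = Route("p3", "Y", med=0, igp_cost=12)

without_p3 = select_best([p1, p2]).name       # step 4: A sees p1 and p2 only
with_p3 = select_best([p1, p2, p3]).name      # step 2: A also sees p3
```

Here `without_p3` is "p2" and `with_p3` is "p1": the presence of p3 eliminates p2 by MED, after which p1 wins on IGP cost, even though p3 itself is never selected. With these hypothetical values, setting `always_compare_med=True` makes A choose p1 in both cases, which is the kind of stabilizing effect that comparing MED across ASes is meant to achieve.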

Fig. 6.13 An example route reflector configuration that leads to persistent oscillation


One of the reasons that this route reflector configuration can lead to persistent
route oscillation is that MED is compared only among routes learned from the same
neighboring AS. It is possible to enforce a rule that MED is always compared, even when the routes come
from different ASes. Other guidelines have also been proposed to prevent
persistent oscillation in route reflector configurations. These guidelines include
exploiting the hierarchical structure of the route reflector configuration [10], similar to
that proposed in [8]. That is, if a route reflector configuration ensures that a route
reflector prefers a route from its client over one from another route reflector (e.g.,
through IGP cost settings), then it can ensure route convergence.

6.4.2 Transient Routing Failures
Even when BGP eventually converges to a set of stable routes, network failures,
maintenance events, and router configuration changes can cause BGP to reconverge. Ideally, when such an event occurs, routing protocols should react
quickly and find alternate paths. However, BGP is known to be slow
in reacting to and recovering from network events. Previous measurement studies have
shown that BGP may take tens of minutes to reach a consistent view of the network
topology after a failure [17–19].
During the convergence period, a router might contain routing information that
lags behind the state of the network. For example, it is possible for a router to eventually discover an alternate path when one of the links in its original path fails.
However, during the discovery process, the router might lose all of its paths before
an alternate path is discovered. Such a transient loss of reachability is referred to as
a transient routing failure.
Figure 6.14 shows an example of a policy configuration and link failure scenario
that can lead to a transient routing failure. In this example, AS 1 and AS 2 are
providers of AS 3, AS 0 is a customer of AS 1, and AS 1 is a peer of AS 2. Note
that the import and export policies are realistic in the sense that they follow the prefer-customer and no-valley routing policies. When the link between AS 3 and AS 0 fails,
AS 3 temporarily loses its connection to the destination AS 0. AS 3 has to send a
withdrawal message to cause its neighbor AS 1 to select a new best path. Before
AS 3 receives the new path from AS 1, it will experience transient loss of reachability to AS 0. In addition, the timing of sending withdrawal and announcement
messages is determined by the expiration of MRAI timers, which can take several
seconds to tens of seconds. During this period, all packets destined to AS 0 at AS 3
will be dropped.

Fig. 6.14 An example illustrating routing failure at AS 3. The text around a node represents a set of permissible paths and their ordering in local preference (higher preference first). AS 1: (1 3 0), (1 0), (1 2 3 0); AS 2: (2 3 0), (2 1 0), (2 1 3 0); AS 3: (3 0), (3 1 0), (3 2 1 0). Links are either provider-to-customer or peer-to-peer

Fig. 6.15 Transient routing failures take place in a typical eBGP system. The AS paths around a node represent a set of permissible paths, which are ordered in the descending order of local preference. AS 1: (1 0), (1 3 6 7 8 5 0); AS 2: (2 6 3 1 0), (2 6 7 8 5 0); AS 3: (3 1 0), (3 6 7 8 5 0); AS 4: (4 6 3 1 0), (4 6 7 8 5 0); AS 5: (5 0), (5 8 7 6 3 1 0); AS 6: (6 3 1 0), (6 7 8 5 0); AS 7: (7 6 3 1 0), (7 8 5 0); AS 8: (8 5 0), (8 7 6 3 1 0)

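
The scenario of Fig. 6.14 can be checked with a few lines of code. The sketch below is a simplification under stated assumptions: it ignores export policy and timers, and treats a neighbor's current best path as usable unless it already traverses the AS in question (BGP loop avoidance). The function name fallback_routes and the two snapshots are illustrative.

```python
def fallback_routes(as_id, neighbor_best_paths):
    """Return the routes as_id could use, keyed by the neighbor offering them.

    neighbor_best_paths maps each neighbor to its current best AS path
    (a tuple), or to None if the neighbor has no route.
    """
    routes = {}
    for neighbor, path in neighbor_best_paths.items():
        if path is None:
            continue          # the neighbor has nothing to offer
        if as_id in path:
            continue          # the path already goes through as_id: a loop
        routes[neighbor] = (as_id,) + path
    return routes


# Before the failure, AS 1 and AS 2 both use their most preferred paths via AS 3.
before = {1: (1, 3, 0), 2: (2, 3, 0)}
# After they have processed AS 3's withdrawal and switched to other paths.
after = {1: (1, 0), 2: (2, 1, 0)}

at_failure = fallback_routes(3, before)
after_reconvergence = fallback_routes(3, after)
```

At the moment of the failure `at_failure` is empty, so AS 3 has nothing to fall back on; only once its providers have switched does `after_reconvergence` contain (3 1 0) and (3 2 1 0), which are AS 3's remaining permissible paths in Fig. 6.14.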
In a typical eBGP system where the prefer-customer and no-valley routing policies are
followed, it is quite likely that ASes experience transient failures. In fact, when
an event causes an AS to change from a customer route to a provider route and all
of its providers use it to reach the destination, the AS will definitely experience a
transient failure. This is because the AS has to withdraw the customer route first
before its providers can discover an alternate path and send the path to it (see
[30] for a proof). Figure 6.15 shows an example to illustrate this point. Suppose
that before the link between AS 1 and AS 0 fails, AS 1, AS 3, and AS 6 all have only
one path, via their customers, to reach the destination. When the link failure occurs,
these ASes will experience transient failures before they can learn the route via their
providers. AS 2 may experience the failure (depending on whether the withdrawal
from AS 6 is suppressed by the MRAI timer), but AS 7 does not experience any
transient routing failure.
In the previous section, we have shown that multihoming can provide
redundant underlying connections. Here, we use several examples to discuss
whether BGP can fully exploit the redundancy to quickly recover from failures.
In fact, BGP fails to take advantage of this redundancy to provide a high degree of
path diversity. The reason lies in the iBGP configuration. A typical hierarchical
iBGP system consists of a core of fully meshed core routers, i.e., route reflectors,
and edge routers, which are the clients of the relevant route reflectors. Transient
routing failures can occur within a hierarchical iBGP system. Figure 6.16 shows an
example that illustrates how routing failures can occur due to iBGP configuration.
A multihomed AS, AS 0, has two providers: AS 1 and AS 2. AS 1 can reach a destination originated at AS 0 via one of two access routers, AR1 or AR2. According
to the prefer-customer routing policy, the path via AR1 is assigned a higher local
preference value than the path via AR2. As a result, all routers inside AS 1 except
the access router AR2 will use the path via AR1 to reach the destination. Once the
link between AR1 and AS 0 fails, all routers except AR2 might experience transient
routing failures before failing over to the path via AR2.

Fig. 6.16 An AS with a hierarchical iBGP configuration can experience transient failures

Fig. 6.17 An AS with multiple connections to a destination prefix can experience transient failures. AR1 receives 12.1.1.0/24 with AS path (0), and AR2 receives 12.1.1.0/24 with AS path (0 0 0)

Our second example, shown in Fig. 6.17, illustrates the reliability issue
for an AS with multiple connections to a single provider. In this example, AS 0
has two connections to AS 1. Suppose that AS 0 considers the connection via AS
1’s AR1 as the primary link, and the other connection via AR2 as the backup link.
Suppose that AS 0 uses AS path prepending to implement this configuration: AS 0’s
BoR2 advertises its prefix with AS path (0 0 0). As a result, all routers inside AS
1 except router AR2 have only a single route to reach the destination. If the link
between AS 0’s BoR1 and AS 1’s AR1 fails, all routers within AS 1 except AR2
will experience transient failures.


Our third example, shown in Fig. 6.18, illustrates the reliability issue for an
AS with multiple geographical connections to a single provider. In this example, we
assume that AS 0 considers the connection via AS 1’s AR2 as the primary link,
and the connection via AR1 as the backup link. Just like the previous example,
suppose that AS 0 uses AS path prepending to implement this configuration. As a
result, all routers inside AS 1 except router AR1 have only a single route to reach
the destination. If the link between AS 0’s BoR2 and AS 1’s AR2 fails, all routers
within AS 1 except AR1 will experience transient failures.
Our last example shows that load balancing can avoid transient routing failures. In Fig. 6.19, AS 0 distributes its inbound traffic among the two connections
by relying on the hot-potato routing policy. That is, the backbone routers within AS 1
select the best route according to IGP costs to the egress routers, AR1 and AR2.
As a result, all backbone routers have two different routes to reach the destination.
This configuration can avoid single points of failure with respect to both backbone
router failures and failures of the links between AS 1 and AS 0.

Fig. 6.18 An AS with geographical connections to a destination prefix can experience transient
failures

Fig. 6.19 Load balancing configuration can avoid transient failures

Fig. 6.20 A transient failure experienced by router RT1 when the link between AS 0 and AS 1 is
added or recovered

So far we have focused on scenarios in which a route is lost. In fact, when gaining
a route, it is still possible to experience transient routing failures. For example,
Fig. 6.20 shows a scenario where a router can experience a transient routing failure
due to iBGP configuration. In this example, AS 1 and AS 2 are providers of AS 0,
and AS 1 and AS 2 have a peer-to-peer relationship. When the link between AS 1
and AS 0 is added or recovered from a failure, AS 1 prefers the direct path to the destination AS 0. Before the link is recovered, all routers within AS 1 select the path via
AS 2 as their best path. After the recovery event, all routers within AS 1 use the
path through the recovered link. During the route convergence process, router RT3
first selects the direct path to AS 0 and then sends the new route to router RT2 and
router RT1. Once router RT2 receives the direct route from router RT3, it selects
the route and withdraws its route through AS 2 from router RT1, since it cannot
announce its currently selected route via router RT3 to router RT1 (due to the fact
that a router in a fully meshed iBGP configuration cannot reflect a route learned from one iBGP peer to
another). If router RT1 receives the withdrawal message from router RT2 before receiving the announcement message from router RT3, it will experience a transient
routing failure.

6.4.3 Transient Routing Loops
During the route convergence process, it is possible to have not only transient routing failures, but also transient routing loops. A topology or routing policy change
can lead the routers to recompute their best routes and update forwarding tables.
During this process, the routers can be in an inconsistent forwarding state, causing
transient routing loops. Measurement studies have shown that transient loops
can last for more than several seconds [13, 29, 31]. Figure 6.21 shows a scenario
where a transient routing loop can occur. In this example, when the link between
AS 1 and AS 0 fails, AS 2 and AS 3 receive a withdrawal message from AS 1.
These two ASes will each select the path via the other to reach the destination because the local preference value of a path via a peer is higher than that of a path via
a provider. As a result, there is a routing loop. After AS 2 and AS 3 exchange their
new routes, AS 2 will remove the path from AS 3 and select the path from AS 4 as
the best path. Finally, all ASes will use the path via AS 4.

Fig. 6.21 An example of transient routing loop between AS 2 and AS 3. The list of AS paths
shown beside each node is the set of permissible paths for the node, and the permissible paths are
ordered in the descending order of local preference

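
A forwarding loop of this kind can be detected by following next hops until the destination is reached or an AS is visited twice. The sketch below is illustrative only. The next-hop tables are assumptions derived from the description above: during convergence AS 2 and AS 3 point at each other, and afterwards both point at AS 4. That AS 4 forwards directly to AS 0 is also an assumption, since the figure is not reproduced here. The function name trace is hypothetical.

```python
def trace(next_hop, source, destination):
    """Follow next hops from source; return (outcome, list of ASes visited)."""
    visited = [source]
    seen = {source}
    current = source
    while current != destination:
        nxt = next_hop.get(current)
        if nxt is None:
            return "unreachable", visited   # no forwarding entry: packet dropped
        visited.append(nxt)
        if nxt in seen:
            return "loop", visited          # the packet returned to an earlier AS
        seen.add(nxt)
        current = nxt
    return "delivered", visited


# Transient state: AS 2 and AS 3 each forward via the other.
transient = {2: 3, 3: 2, 4: 0}
# Final state: both forward via AS 4 (assumed to reach AS 0 directly).
converged = {2: 4, 3: 4, 4: 0}

during = trace(transient, 2, 0)
afterwards = trace(converged, 2, 0)
```

In the transient state a packet entering at AS 2 bounces between AS 2 and AS 3 until its TTL expires; in the converged state it is delivered via AS 4.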

6.5 Impact of Transient Routing Failures and Loops
on End-to-End Performance
In this section, we aim to understand the impact that transient routing failures and
loops have on end-to-end path performance. We describe an extensive measurement
study that involves both controlled routing updates of a prefix and active probes
from a diverse set of end hosts to the prefix.

6.5.1 Controlled Experiments
The infrastructure for the controlled experiments is shown in Fig. 6.22. The infrastructure includes a BGP Beacon prefix from the Beacon routing experiment
infrastructure [21]. The BGP Beacon is multihomed to two tier-1 providers, which
we refer to as ISP 1 and ISP 2. We control routing events by injecting
well-designed routing updates from the BGP Beacon at scheduled times to emulate
link failures and recoveries. To understand the impact of routing events on data
plane performance, we select geographically and topologically diverse probing locations from the PlanetLab experiment testbed [25] to conduct active probing while
routing changes are in effect.


Fig. 6.22 Measurement infrastructure

Fig. 6.23 Time schedule (GMT) for injecting routing events from BGP beacon

Every 2 hours, the BGP Beacon sends a route withdrawal or announcement to
one or both providers according to the time schedule shown in Fig. 6.23. Each circle denotes a state, indicating the providers offering transit service to the Beacon.
Each arrow represents a routing event and state transition, marked by the time that
the routing event (either a route announcement or a route withdrawal) occurs. For
example, at midnight Beacon withdraws the route through ISP 1, and at 2:00 a.m.,
Beacon announces the route through ISP 1. There are 12 routing events every day.
Only eight routing events keep the Beacon connected to the Internet; the other four
serve the purpose of resetting the Beacon connectivity. These eight beacon events
are classified into two categories: failover beacon event and recovery beacon event.
In a failover beacon event, the Beacon changes from the state of using both providers
to the state of using only a single provider. In a recovery beacon event, the Beacon
changes from the state of using a single provider for connectivity to the state of using both providers. These two classes of routing changes emulate the control plane
changes that a multihomed site may experience in terms of losing and restoring a
link to one or more of its providers. For example, between midnight and 2:00 a.m.,
the BGP Beacon is in a state in which it is connected only to ISP 2; at 2:00 a.m., it announces the Beacon prefix to ISP 1, restoring connectivity to both ISPs. This event
emulates a link recovery event. At 4:00 a.m., the Beacon sends a withdrawal to ISP 1
so that the Beacon is again connected only to ISP 2. This event emulates a
failover event.
A set of geographically diverse sites in the PlanetLab infrastructure probe a host
within the Beacon prefix by using three probing methods: UDP packet probing,
ping, and traceroute. Probing is performed every hour during injected routing events
and when there are no routing events, so as to calibrate the results. Every hour,
every probing source sends a UDP packet stream marked by sequence numbers to
the BGP Beacon host at 50 ms intervals. The probe starts 10 min before each hour
and ends 10 min after that hour (i.e., the probing duration is 20 min for each hour).
Upon the arrival of each UDP packet, the Beacon host records the timestamp and
sequence number of the UDP packet. In addition, ping and traceroute are sent from
the probe hosts toward the Beacon host, for measuring round-trip time (RTT) and
IP-level path information during the same 20 min time period. Both ping and traceroute are run as soon as the previous ping or traceroute probe completes. Thus, their
probing frequency is limited by the round-trip delay and the probe response time
from routers.

6.5.2 Overall Packet Loss
In this section, we present data plane performance during failover and recovery
beacon events. Packet loss and loss burst length are used to measure the impact of
routing events on end-to-end path performance. We refer to a series of consecutively
lost packets during a routing event as a loss burst. Since several loss bursts
can be observed during a routing event, we define the loss burst length of the event as
the maximum number of consecutively lost packets in any single burst, which represents the worst-case scenario during
the event.
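
Because each probe packet carries a sequence number, loss bursts can be derived from the gaps in the sequence numbers recorded at the Beacon host. The following is a minimal sketch of that computation, not the authors' tooling; the function names and the assumption that sequence numbers are consecutive integers within a known range are illustrative.

```python
def loss_bursts(received_seqs, first_seq, last_seq):
    """Return the lengths of all runs of consecutively missing sequence numbers."""
    received = set(received_seqs)
    bursts = []
    run = 0
    for seq in range(first_seq, last_seq + 1):
        if seq in received:
            if run:
                bursts.append(run)   # a received packet ends the current burst
            run = 0
        else:
            run += 1
    if run:
        bursts.append(run)           # burst still open at the end of the window
    return bursts


def loss_burst_length(received_seqs, first_seq, last_seq):
    """Worst case for the event: the longest run of consecutive losses."""
    return max(loss_bursts(received_seqs, first_seq, last_seq), default=0)


# Packets 3-4 and 7-9 are missing from a window covering sequence numbers 0..10.
example_bursts = loss_bursts([0, 1, 2, 5, 6, 10], 0, 10)
example_length = loss_burst_length([0, 1, 2, 5, 6, 10], 0, 10)
```

In this example `example_bursts` is `[2, 3]` and `example_length` is 3. Since probes are sent every 50 ms, a burst of n packets corresponds to roughly n × 0.05 s without connectivity.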
Figure 6.24a shows the number of loss bursts over all probing hosts during
failover beacon events for the entire duration of measurement. The x-axis represents
the start time of a loss burst, which is measured (in seconds) relative to the injection
of withdrawal messages. We observe that the majority of loss bursts occur right after
time 0, i.e., the time when a withdrawal message is advertised. Figure 6.24b shows
the number of loss bursts during recovery beacon events across all probe hosts undergoing path changes. We observe that loss bursts occur right after time 0, and can
last for 10 s.
Figure 6.25a shows the distributions of loss burst length before, during, and after
a path change for failover beacon events. The x-axis is shown in log scale. We
find that a loss burst during a path change can contain as many as 480
consecutive packets. Compared with the loss burst length during a path change, the
loss burst lengths before and after a path change are quite short. Figure 6.25b
shows the loss burst length during recovery beacon events. We observe that the loss
burst length during a routing change does not differ significantly
from that before or after the routing change. In addition, the loss burst length can be as
long as 140 packets for recovery beacon events. Such loss is most likely caused by
routing failures.

Fig. 6.24 Number of loss bursts starting at each second, for (a) failover and (b) recovery events; x-axis: starting time (seconds), y-axis: number of loss bursts [31] (Copyright 2006 Association for
Computing Machinery, Inc. Reprinted by permission)

Fig. 6.25 The cumulative distribution of loss burst length, for (a) failover and (b) recovery events, with separate curves for loss bursts during, before, and after a path change; x-axis: loss burst length, y-axis: CDF [31] (Copyright 2006 Association for
Computing Machinery, Inc. Reprinted by permission)

6.5.3 Packet Loss Due to Transient Routing Failures or Loops
From the measurement results, we see that many packet loss
bursts occur during both types of events. Packet loss can be attributed to network congestion or routing failures. In order to identify routing failures, ICMP response messages, as measured by
traceroutes and pings, are used. After deriving the loss bursts, unreachable responses
from traceroutes and pings are correlated with the loss bursts. Since hosts in
PlanetLab are NTP time synchronized, the loss bursts are correlated with ICMP
messages using the time window [-1 s, 1 s]. When a router does not have a route
entry for an incoming packet, it will send an ICMP network unreachable error message back to the source to indicate that the destination is unreachable, if it is allowed
to do so. Based on the ICMP response message, we can determine when and which
router does not have a route entry to the Beacon host. Loss bursts that have corresponding unreachable ICMP messages are attributed to routing failures. In addition,
if a packet is trapped in a forwarding loop, its TTL value will decrease until the value
reaches 0 at some router. The router will send a “TTL exceeded” message back to
the source. Thus, from traceroute data, we can observe forwarding loops.
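
The correlation step can be sketched as follows. This is an illustration under stated assumptions rather than the methodology's actual code: observations are assumed to be already reduced to (timestamp, kind) pairs, where kind is "unreachable" for an ICMP unreachable response and "loop" for a traceroute that revealed a forwarding loop; the 1 s window is applied on both sides of the burst interval; and a loop observation takes precedence when both kinds match. The names classify_burst and WINDOW_S are hypothetical.

```python
WINDOW_S = 1.0  # correlation window of [-1 s, 1 s] around the loss burst


def classify_burst(burst_start, burst_end, observations, window=WINDOW_S):
    """Attribute a loss burst to a cause using nearby ICMP-based observations.

    observations: iterable of (timestamp_seconds, kind) pairs, with kind
    being "unreachable" or "loop".
    """
    kinds = {
        kind
        for timestamp, kind in observations
        if burst_start - window <= timestamp <= burst_end + window
    }
    if "loop" in kinds:
        return "routing loop"
    if "unreachable" in kinds:
        return "routing failure"
    return "unknown"   # congestion, filtered ICMP, or too few probes


observations = [(100.4, "unreachable"), (250.0, "loop")]
first = classify_burst(101.0, 103.0, observations)
second = classify_burst(249.5, 252.0, observations)
third = classify_burst(400.0, 401.0, observations)
```

Here `first` is "routing failure", `second` is "routing loop", and `third` is "unknown".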
Table 6.2 shows the number of failover beacon events, the number of loss bursts,
and the number of lost packets that can be verified as caused by routing failures or
loops. We verify that 23% of the loss bursts, corresponding to 76% of lost packets,
are caused by routing failures or loops. We are unable to verify the remaining 77%
of loss bursts, which correspond to only 24% of packet loss. These loss bursts may
be caused by either congestion or routing failures for which traceroute or ping is not
sufficient (due to either insufficient probe frequency or lack of ICMP messages) for
the verification.
Similar to our analysis of failover events, we correlate ICMP unreachable
messages with loss bursts occurring during recovery events. Table 6.3 shows that
26% of packet loss is verified to be caused by routing failures or loops.
Since routers in the Internet may filter out ICMP packets, it is possible that
some lost packets do not have corresponding ICMP messages even if the loss
bursts were caused by routing failures or routing loops. As a result, the number of
loss bursts caused by routing failures or routing loops
might be larger than what can be identified by our methodology.

Table 6.2 Overall packet loss caused by routing failures or loops during failover events

Causes             Failover beacon events   Loss bursts    Lost packets
Routing failures   451 (38%)                607 (16%)      37,751 (42%)
Routing loops      208 (18%)                239 (7%)       30,592 (34%)
Unknown            539 (44%)                2,875 (77%)    21,948 (24%)

Table 6.3 Packet loss caused by routing changes during recovery events

Causes             Recovery beacon events   Loss bursts    Lost packets
Routing failures   17 (5%)                  39 (2%)        480 (11%)
Routing loops      24 (7%)                  37 (2%)        640 (15%)
Unknown            290 (88%)                1,714 (96%)    3,266 (74%)

We measure the duration of a loss burst as the time interval between the latest
received packet before the loss and the earliest one after the loss. Figure 6.26a
shows the duration of loss bursts that can and cannot be verified as caused by routing
failures or routing loops during failover events. We observe that the loss
bursts that are verified as caused by routing failures or routing loops last longer than
the unverified loss bursts. Figure 6.26b further shows that loss bursts caused by
routing loops last longer than those caused by routing failures.
Figure 6.27a shows the cumulative distribution of the duration of loss bursts that
are verified and unverified as caused by routing failures or routing loops during recovery events. We observe that verified loss bursts are on average longer than
unverified ones. In addition, during recovery events, more than 98% of routing failures
or routing loops last less than 5 seconds, while during failover events, about 80% of
routing failures or routing loops last less than 5 seconds, as shown in Fig. 6.26. This
means that loss bursts caused by routing failures during recovery events are much
shorter than those during failover events. We also observe that unverified loss bursts
last less than 4 seconds. Figure 6.27b shows the duration of verified loss bursts that
are caused by routing failures and loops during recovery events. We observe that
57% of the verified packet loss is due to forwarding loops, which is slightly higher than that for
failover events (47%). This implies that forwarding loops are also quite common
during recovery events.

Fig. 6.26 Duration for verified vs. unverified loss bursts during failover events: (a) loss bursts verified vs. unverified, (b) routing loops vs. routing failures; x-axis: duration (seconds), y-axis: CDF [31] (Copyright
2006 Association for Computing Machinery, Inc. Reprinted by permission.)

Fig. 6.27 Duration of verified loss bursts during recovery events: (a) loss bursts verified vs. unverified, (b) routing loops vs. routing failures; x-axis: duration (seconds), y-axis: CDF [31] (Copyright 2006 Association
for Computing Machinery, Inc. Reprinted by permission.)

6.6 Research Approaches
We have seen from the measurement study in the previous section that routing
failures and routing loops contribute significantly to degraded end-to-end path performance. Several approaches have been proposed to address the problem of routing
failures and routing loops. These approaches can be broadly classified into three categories: convergence-based solutions, path protection-based solutions, and
multipath-based solutions.
• Convergence-Based Solutions These approaches focus on reducing BGP convergence delay. In particular, they aim to reduce convergence delay by eliminating invalid routes quickly. Reducing convergence delay may indirectly shrink the
periods of routing failures or routing loops since it takes less time to converge to
a stable route.
• Path Protection-Based Solutions These approaches focus on preestablishing
recovery paths before potential network events. These preestablished paths supplement the best path selected by BGP. When there is a routing outage, the
recovery path is used to route traffic. The recovery path could be a preestablished
protection tunnel, or an alternate AS path.
• Multipath-Based Solutions The goal of these approaches is to exploit path diversity to provide fault tolerance. To increase path diversity, multipath routes are
discovered. For example, multiple routing trees can be created on the same underlying topology. When one of the routes fails, other routes can be probed and
then used if valid to route traffic.

6.6.1 Convergence Based Solutions
BGP is a path vector protocol. Each BGP-speaking router has to rely on its
neighbors’ announcements to select its best route. Since a BGP-speaking router
does not have topology information, an AS may explore many AS
paths before eventually reaching the final stable path. Figure 6.28 shows an example
of the path exploration process during BGP convergence. Suppose the link between
AS 1 and AS 0 fails. This failure makes the destination unreachable at each
AS. We refer to this type of event as a fail-down event. The following potential
sequence of route changes shows how path exploration can occur.

Fig. 6.28 An example of path exploration during BGP convergence. The list of AS paths shown beside each node is the set of permissible paths for the node, and the permissible paths are ordered in the descending order of local preference. AS 1: (1 0); AS 2: (2 1 0), (2 3 1 0), (2 4 3 1 0); AS 3: (3 1 0), (3 2 1 0), (3 4 2 1 0); AS 4: (4 3 1 0), (4 2 1 0), (4 2 3 1 0)

1. AS 1 sends a withdrawal message to both AS 2 and AS 3.
2. When AS 2 receives the withdrawal, it removes path (2 1 0) from its routing table,
selects path (2 3 1 0) as its new best path, and advertises the new path to all
neighbors.
3. After AS 3 receives the withdrawal from AS 1, it will use path (3 2 1 0), and
advertise it to its neighbors.
4. When AS 2 and AS 3 learn the new paths (2 3 1 0) and (3 2 1 0) from each other,
they will remove their best paths, and use path (2 4 3 1 0) and path (3 4 2 1 0),
respectively.
5. Since both AS 2 and AS 3 use the paths from AS 4, they will send AS 4 withdrawal messages to withdraw their previously advertised paths. As a result, AS 4
loses all its paths, and sends a withdrawal message to both AS 2 and AS 3.
6. After AS 2 and AS 3 receive the withdrawals from AS 4, their routing tables do
not have any route to the destination.
This example shows that each node has to try several AS paths that traverse the failed link or node before it finally chooses the best valid path or determines
that there is no valid path. For instance, AS 2 might explore the sequence of AS paths
(2 1 0) → (2 3 1 0) → (2 4 3 1 0) before it removes all paths from its routing table.
Previous measurement studies have shown that BGP may take tens of minutes to
reach a consistent view of the network topology after a failure [17–19]. Note that
although this example shows a fail-down scenario, it can be extended to a
fail-over scenario in which an AS has to explore many invalid paths before settling
on a stable valid path.
Several solutions have been proposed to rapidly identify and remove invalid
routes and thereby suppress the exploration of obsolete paths [5, 7, 23, 24]. Consistency Assertions (CA) [24] tries to achieve this goal by examining path consistency based
solely on the AS path information carried in BGP announcements. Suppose that
an AS has learned two paths to a destination from neighbors N1 and
N2, respectively. N1 advertises path (N1 A B C 0) and N2 advertises
(N2 B X Y 0). CA assumes that each AS can use only one path. Thus, by comparing
these two paths, it can detect that the two paths attributed to AS B, (B C 0) and
(B X Y 0), are not consistent. We use the example shown in Fig. 6.28 to show how
an AS can take advantage of consistency checking to accelerate route convergence.
A router can use a withdrawal received directly from a neighbor to check path consistency. When the link between AS 1 and AS 0 fails, AS 1 sends withdrawals to
AS 2 and AS 3. Once AS 2 and AS 3 notice that their neighbor AS 1 has withdrawn its
path to the destination, they check whether AS 1 appears in any existing path. Since
the two paths (2 3 1 0) and (2 4 3 1 0) contain path (1 0), neither can be selected,
and AS 2 removes them from its routing table. Similarly, AS 3 removes paths (3 2
1 0) and (3 4 2 1 0). Eventually, AS 2 and AS 3 will withdraw their paths to the
destination. As a result, CA eliminates the exploration of these invalid paths.
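
Both checks described above operate purely on AS paths, so they can be sketched in a few lines. This is an illustrative simplification, not the CA specification: paths are tuples ending at the destination AS, and the function names are hypothetical.

```python
def suffix_from(path, asn):
    """Return the part of the path that starts at asn."""
    return path[path.index(asn):]


def inconsistent_ases(path_a, path_b):
    """ASes that appear in both paths but with different routes to the destination."""
    return [
        asn
        for asn in path_a
        if asn in path_b and suffix_from(path_a, asn) != suffix_from(path_b, asn)
    ]


def prune_on_withdrawal(paths, withdrawing_neighbor):
    """Drop every known path that traverses the neighbor that just withdrew its route."""
    return [p for p in paths if withdrawing_neighbor not in p]


# The two announcements from the text: AS B cannot be using both suffixes.
conflicts = inconsistent_ases(("N1", "A", "B", "C", 0), ("N2", "B", "X", "Y", 0))

# AS 2 in Fig. 6.28 after the direct withdrawal from AS 1.
as2_paths = [(2, 1, 0), (2, 3, 1, 0), (2, 4, 3, 1, 0)]
as2_remaining = prune_on_withdrawal(as2_paths, 1)

# AS 4 after a withdrawal from AS 2 only.
as4_paths = [(4, 3, 1, 0), (4, 2, 1, 0), (4, 2, 3, 1, 0)]
as4_remaining = prune_on_withdrawal(as4_paths, 2)
```

Here `conflicts` is `["B"]` and `as2_remaining` is empty, so AS 2 skips path exploration entirely. In contrast, `as4_remaining` still contains (4 3 1 0): a withdrawal from AS 2 alone gives AS 4 no basis for discarding a path that does not traverse AS 2.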
However, AS path consistency checking might not have sufficient information about
invalid paths. It is hard to accurately detect invalid routes based solely on AS
path information. For example, in Fig. 6.28, after AS 2 and AS 3 receive the withdrawals sent by AS 1 due to the failure of link (1 0), AS 2 and AS 3 send withdrawals to
AS 4 since all of their paths go through AS 1. Now suppose that AS 2’s withdrawal
reaches AS 4 before AS 3’s does. In this case, AS 4 cannot consider path (4 3 1 0)
as an invalid path since the path does not contain the withdrawn path (2 1 0). AS 4
cannot determine whether the withdrawal of path (2 1 0) is due to the failure of link (2 1)
or link (1 0).
Ghost Flushing [5] reduces convergence delay by aggressively sending explicit
withdrawals so that invalid paths are removed quickly.
Whenever an AS’s current best path is replaced by a less preferred route, Ghost
Flushing has the AS immediately generate and send explicit withdrawal messages
to all its neighbors before sending the new path. The withdrawal messages
flush out the path previously advertised by the AS. For example, in Fig. 6.28, after
AS 2 receives the withdrawal sent by AS 1 because of the failure of link (1 0), AS 2 switches
to the less preferred path (2 3 1 0). Before sending path (2 3 1 0) to its neighbors, AS 2
sends extra withdrawal messages to its neighbors AS 3 and AS 4. Because BGP
withdrawal messages are not subject to the MRAI timer, invalid paths can potentially
be deleted quickly from the routing tables of the AS’s neighbors. For example, the withdrawal sent
by AS 2 helps AS 3 remove the invalid path (3 2 1 0). As this example shows,
Ghost Flushing does not really prevent path exploration, but instead
attempts to speed up the process.
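
A minimal sketch of the Ghost Flushing rule follows. The function name, the message tuples, and the numeric preference rank (lower means more preferred) are all assumptions introduced for this example; they do not come from [5]. The sketch separates messages that can go out at once from announcements that must wait for the MRAI timer.

```python
def on_best_path_change(old_rank, new_rank, new_path, neighbors):
    """Return (immediate, delayed) messages after a best-path change.

    `immediate` holds withdrawals, which are not rate-limited by MRAI.
    `delayed` holds announcements, which are subject to the MRAI timer.
    A lower rank means a more preferred path (assumed convention).
    """
    immediate, delayed = [], []
    for nbr in neighbors:
        # Flush the previously advertised path when no path is left or
        # when the replacement is less preferred than the old best path.
        if new_path is None or new_rank > old_rank:
            immediate.append(("WITHDRAW", nbr))
        if new_path is not None:
            delayed.append(("ANNOUNCE", nbr, new_path))
    return immediate, delayed


# AS 2 falls back from (2 1 0) to the less preferred path (2 3 1 0).
now, later = on_best_path_change(0, 1, (2, 3, 1, 0), neighbors=[3, 4])
print(now)    # [('WITHDRAW', 3), ('WITHDRAW', 4)]
print(later)  # [('ANNOUNCE', 3, (2, 3, 1, 0)), ('ANNOUNCE', 4, (2, 3, 1, 0))]
```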
To identify invalid routes even more quickly, additional information can be incorporated
into BGP route updates. BGP-RCN and EPIC [7, 23] propose to use
location information about failures, or root cause information, to identify invalid
routes. When a link failure occurs, a node adjacent to the link detects the
change. This node, referred to as the root cause node (RCN), attaches its name to
the routing update it sends out. The RCN is propagated to other ASes along each
impacted path. Thus, an AS can use the RCN to remove all the invalid paths at
once. For example, Fig. 6.28 illustrates the basic idea of BGP-RCN. When the link
between AS 1 and AS 0 fails, AS 1 sends a root cause notification with its
withdrawal. When AS 2 receives the withdrawal, it uses the root cause notification to
find invalid paths that contain AS 1. Thus, path (2 3 1 0) is considered invalid
and is removed. Similarly, at AS 3, path (3 2 1 0) is detected as an invalid
route. AS 2 and AS 3 send withdrawals to AS 4, and piggyback the root cause in the
withdrawals. After receiving the withdrawal messages with the root cause, AS 4 removes
all its routes because all paths contain the root cause node AS 1.

Table 6.4 Properties of convergence-based solutions. M is the MRAI timer value, n is the number
of ASes in the network, D is the diameter of the network, |E| is the number of AS-level links, and h is
the processing delay for a BGP update message to traverse an AS hop

Protocols       Convergence delay   Messages      Modification to   Modification to BGP   eBGP   iBGP
                (fail-down)         (fail-down)   BGP's messages    route selection
Standard BGP    M × n               |E| × n       N/A               N/A                   N/A    N/A
CA              M n                 |E|           No                Yes                   Yes    No
Ghost Flushing  h × n               2|E|nh/M      No                Yes                   Yes    Yes
BGP-RCN         hD                  |E|  n + 1    Yes               Yes                   Yes    No
EPIC            hD                  |E|  1        Yes               Yes                   Yes    Yes

EPIC [7] further extends the idea of root cause notification so that it can be applied
to a router rather than an AS. In general, a failure can occur at a router or on a link
between a pair of routers. A failure on a link between two ASes does not necessarily
mean that all links between the two ASes fail. The root cause notification in BGP-RCN
can only indicate failures of an AS or of the links between a pair of ASes. EPIC
additionally allows routing updates to carry failure information about a router or a
link between a pair of routers.
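
Root-cause-based pruning, as described above for BGP-RCN, reduces to a membership test once the root cause node is known. The sketch below is a simplification under stated assumptions: paths are tuples of AS numbers, the function name is hypothetical, and any extra bookkeeping that the real protocols attach to a notification is ignored.

```python
def prune_by_root_cause(stored_paths, root_cause_node):
    """Drop, in a single pass, every path that traverses the root cause node."""
    return [p for p in stored_paths if root_cause_node not in p]


# Paths known to AS 2 in the example; AS 1 reports itself as the root cause.
as2_paths = [(2, 1, 0), (2, 3, 1, 0), (2, 4, 3, 1, 0)]
print(prune_by_root_cause(as2_paths, 1))  # []
```

Unlike a sub-path match, this test also removes paths whose invalid portion has not yet been explicitly withdrawn by a neighbor, which is why every affected path can be discarded at once.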
We summarize important properties of the four approaches in Table 6.4. We consider
the upper bound of the convergence time and of the number of messages during a
fail-down event. We also compare the approaches in terms of the modifications
they require to standard BGP. For example, we consider whether an approach needs to
modify BGP’s message format or BGP route selection, and whether it
can be applied to eBGP or iBGP.

6.6.2 Path Protection-Based Solutions
The convergence-based approaches focus on rapidly removing invalid routes to
accelerate the BGP convergence process. They are efficient in reducing convergence
delay. However, simply applying those methods might not necessarily lead to reliable
routing. In fact, accelerating the process of identifying invalid routes might
sometimes exacerbate routing outages. Figure 6.29 shows such an example. We first
consider the case of running standard BGP. When the link between AS 1 and
AS 0 fails, AS 1 sends a withdrawal to AS 2 and AS 3 immediately, and AS 2 sends
a withdrawal to AS 3 right after. Upon receiving the withdrawal, AS 3 quickly
switches to path (3 4 0). At the same time, when AS 2 receives the withdrawal message,
it selects path (2 3 1 0). Even though this path is invalid, AS 2 still reroutes
traffic to a valid next-hop AS, which has a valid path. Therefore, in this case, AS 2
can reroute traffic to the destination before it receives the valid path (3 4 0).

[Figure 6.29: AS-level topology with AS 0 to AS 4. Permissible paths listed beside each node:
AS 1: (1 0), (1 3 4 0), (1 2 3 4 0); AS 2: (2 1 0), (2 3 1 0), (2 3 4 0);
AS 3: (3 1 0), (3 2 1 0), (3 4 0); AS 4: (4 0), (4 3 1 0), (4 3 2 1 0)]

Fig. 6.29 An example showing transient routing failures at AS 2 when RCN is used. The list of
AS paths shown beside each node is the set of permissible paths for the node, and the permissible
paths are listed in descending order of local preference

In contrast, if the root cause information is sent with the withdrawal by AS 1,
AS 2 will remove path (2 3 1 0) and temporarily lose its reachability to AS 0 until
it receives the new path from AS 3. This temporary loss of reachability
could last longer than under standard BGP. With root cause information, the duration for which AS 2
loses its reachability depends on the delay in obtaining the alternate path from AS 3, that is,
on the time it takes to receive the announcement of path (3 4 0) from
AS 3, which is subject to the MRAI timer. Without root cause information,
the duration for which AS 2 loses its reachability depends on the propagation delay of the
withdrawal from AS 1 to AS 2, which is not subject to the MRAI timer [26].
The path protection-based solutions are designed specifically for improving
the reliability of interdomain routing. The main idea is that local protection paths
are identified before failures occur. When the primary path fails, the local protection paths are
used temporarily. Many approaches have been proposed for link-state intradomain
routing protocols to protect against intradomain link failures [6, 14, 16, 27, 33]. However,
BGP-speaking routers do not have knowledge of the global network
topology. They have routing information from their neighbors only. Therefore, there are
two challenges in implementing path protection in BGP: first, one needs to find local
preplanned protection paths; second, one needs to decide how and when to use
the protection paths. Next, we present several path protection-based approaches. We
first focus on how they address the first challenge. We then discuss how they address
the second challenge.
Bonaventure et al. [3] have proposed a fast reroute technique, referred to as
R-Plink, to protect direct interdomain links. The basic idea is that each router
precomputes a recovery path for each of its BGP peering links, which is used to reroute
traffic when the protected BGP link fails. In order to discover an appropriate recovery
path, each edge router inside an AS advertises its currently active eBGP
sessions by using a new type of iBGP update message. Having learned the other routers’
routing information, an edge router chooses, from among all recovery routes, a path to
protect its currently active eBGP session. Figure 6.30 shows an example to illustrate this
approach. In this example, AS 2 advertises the same destination to AS 1’s two
routers A and C. Suppose that the routing policies on AS 1 are configured to select
the path via router A as the best path. However, router A cannot learn any route
via router C through BGP because of the local-preference settings on this router.

Fig. 6.30 A precomputed protection path is used to protect the interdomain link between AS 1
and AS 2

To automatically discover the alternate path, routers A and C advertise their active
eBGP sessions. Thus, router A will know an alternate path via routers C and E, and
choose the path to protect its current path to the destination. Once the link (A D)
fails, router A can forward the packets affected by the failure through the alternate
path via the (C E) link.
In contrast to R-Plink, R-BGP aims to solve the transient routing failure problem
for any interdomain link failure, not just for the failure of a directly attached
interdomain link [15]. R-BGP precomputes an alternate path for each AS to protect
interdomain links. In particular, an AS first checks all paths it knows, and then selects
the one most disjoint from its current best path, which is defined as the failover
path. Finally, the AS advertises the failover path only to the next-hop AS along its
best path. Note that in standard BGP, an AS should not advertise its best path to
the neighbor currently used to reach that destination, since this path would generate
a loop. Advertising a failover path guarantees that, whenever a link goes down, the
AS immediately upstream of the failed link knows a failover path and can avoid
unnecessary packet drops. One limitation of this approach is that it is guaranteed to avoid
routing failures only under hierarchical provider-customer relationships and the
common routing policy, i.e., the no-valley and prefer-customer routing policy. Further,
it does not address the routing failures caused by iBGP configuration.
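
The failover-path selection step can be illustrated with a short sketch. The text does not spell out how "most disjoint" is measured, so the sketch assumes one plausible reading: the candidate that shares the fewest AS-level links with the best path. The function names are hypothetical and this is not the algorithm specified in [15].

```python
def links(path):
    """Return the set of directed AS-level links along a path."""
    return set(zip(path, path[1:]))


def pick_failover(best, known_paths):
    """Pick the known path that shares the fewest links with `best`."""
    candidates = [p for p in known_paths if p != best]
    if not candidates:
        return None
    return min(candidates, key=lambda p: len(links(p) & links(best)))


# Permissible paths of AS 3 in Fig. 6.29; (3 1 0) is its best path.
as3_paths = [(3, 1, 0), (3, 2, 1, 0), (3, 4, 0)]
print(pick_failover((3, 1, 0), as3_paths))  # (3, 4, 0)
```

Here path (3 2 1 0) still depends on link (1 0), so only (3 4 0) protects AS 3 against a failure of that link.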
The Backup Route Aware Routing Protocol (BRAP) aims to achieve fast recovery from transient
failures by considering both eBGP routing policy and iBGP configurations [28].
To achieve this, BRAP requires that a router be able to advertise an alternate
path if its best path is not allowed to be advertised because of loop prevention
or routing policies. The general idea of BRAP is as follows: a router should advertise
the following policy-compliant paths in addition to the best path: (1) a failover
path to the next-hop router along the best path; and (2) a loop-free alternate path,
defined as a temporary backup path, to its upstream neighbors. BRAP extends BGP
to distribute the alternate routes along eBGP and iBGP sessions.

Table 6.5 Comparing path protection-based solutions. |E| is the number of AS-level links, |E_r|
is the number of router-level links

Protocols   Messages     Modification to   Modification to       eBGP   iBGP
            (failover)   BGP's messages    other parts of BGP
R-Plink     N/A          Yes               Yes                   Yes    Yes
R-BGP       |E|          Yes               Yes                   Yes    No
BRAP        |E_r|        Yes               Yes                   Yes    Yes

Now, we describe how to use a protection path. When a router needs to use a
protection path, the router needs to inform the other routers along the path of the
change. Otherwise, redirecting traffic to the protection path could cause forwarding
loops. For example, in Fig. 6.30, when router A sends traffic along the alternate path
via routers B and C, their routing tables still consider router A as the next hop.
Protection tunnels on the data plane have been proposed to avoid such forwarding loops [3].
Protection tunnels can be implemented by using encapsulation schemes such as
MPLS over IP. With MPLS over IP, only the ingress border router consults its BGP
routing table to forward a packet, and it encapsulates the packet with the destination
set to the IP address of the egress border router. All the other routers inside the
AS rely on their IGP routing tables or their label forwarding tables to forward
the packet. R-BGP utilizes “virtual” connections to avoid forwarding loops. There
are two “virtual” connections between each pair of BGP-speaking routers, one for
the primary path traffic and the other for the failover traffic. A virtual connection
can be implemented by using virtual interfaces when the two routers are physically
connected, or by using MPLS or IP tunnels if they are not. Similarly, BRAP uses a protection
path through MPLS or IP tunnels.
We summarize the features of the three path protection-based solutions in
Table 6.5. We consider the upper bound of the number of messages during a
failover event, modification to BGP, and whether those approaches can be applied
to eBGP or iBGP.

6.6.3 Multiple Path-Based Solution
A straightforward way to improve route reliability is to discover multiple
paths. There are two proposals for multiple-path interdomain routing. The first one
is MIRO [32], which allows routers to inform their neighbors of multiple routes instead
of only the best one. Thus, MIRO gives ASes more control over the
flow of traffic in their networks and enables quick reaction to path failures.
The second one is Path Splicing [22], which aims to take advantage of the alternate
paths in the BGP routing table to discover multiple paths. Instead of using only the best
path in the BGP routing table, a packet can select any path in the BGP routing table
by indicating which one to use in its header. Clearly, probing has to be deployed
before multiple paths can be discovered, since arbitrary selection of alternate paths
can lead to routing loops.

6.7 Conclusion and Future Directions
Interdomain routing is the glue that binds thousands of networks in the Internet together.
Its reliability plays a decisive role in end-to-end path performance. In
this chapter, we have presented the challenges in designing and implementing a reliable
interdomain routing protocol. Specifically, through measurement studies, we
have presented a clear overview of the impact of transient routing failures and transient
routing loops on end-to-end path performance. Finally, we have critically reviewed
the existing proposals in this field, highlighting the pros and cons of those approaches.
While certain efforts have been made to enhance interdomain routing reliability,
this issue remains open. We believe that the development of new routing infrastructure,
for example, multipath routing, is one promising direction for future research.
Reliability enhancement through multiple path advertisement is not a new idea.
Many efforts have been made to extend BGP to allow the advertisement of
multiple paths [12, 20]. However, designing scalable interdomain routing through
multiple path advertisement is challenging. One of the challenges is to understand
whether the degree of path diversity provided by multiple path advertisement is sufficient to
overcome network failures. At the same time, this challenge highlights the need for
designing new path diversity metrics. Path diversity metrics such as the number of
node-disjoint and link-disjoint paths can be used to compute the inter-AS path diversity.
However, new path diversity metrics need to be devised to take into account
performance, reliability, and stability.
Acknowledgments The authors would like to thank the editors, Chuck Kalmanek and Richard
Yang, for their comments and encouragement. This work is partially supported by NSF grants
CNS-0626617 and CNS-0626618.

References
1. Akella, A., Maggs, B., Seshan, S., Shaikh, A., & Sitaraman, R. (2003). A measurement-based
analysis of multihoming. In Proceedings of ACM SIGCOMM, August 2003.
2. Basu, A., Ong, L., Shepherd, B., Rasala, A., & Wilfong, G. (2002). Route oscillations in I-BGP
with route reflection. In Proceedings of the ACM SIGCOMM.
3. Bonaventure, O., Filsfils, C., & Francois, P. (2007). Achieving sub-50 milliseconds recovery
upon BGP peering link failures. IEEE/ACM Transactions on Networking (TON), 15(5), 1123–
1135.
4. Boutremans, C., Iannaccone, G., Bhattacharyya, S. C., Chuah, C., & Diot, C. (2002). Characterization of failures in an IP backbone. In Proceedings of ACM SIGCOMM Internet Measurement
Workshop, November, 2002.
5. Bremler-Barr, A., Afek, Y., & Schwarz, S. (2003). Improved BGP convergence via ghost flushing. In Proceedings of IEEE INFOCOM 2003, vol. 2, San Francisco, CA, Mar. 30-Apr. 3, 2003,
pp. 927–937.
6. Bryant, S., Shand, M., & Previdi, S. (2009). IP fast reroute using not-via addresses. draft-ietf-rtgwg-ipfrr-notvia-addresses-04.
7. Chandrashekar, J., Duan, Z., Zhang, Z. L., & Krasky, J. (2005). Limiting path exploration in
BGP. In Proceedings of IEEE INFOCOM 2005, Miami, Florida, March 13–17 2005, Volume:
4, 2337–2348.
8. Gao, L., & Rexford, J. (2001). A stable internet routing without global coordination.
IEEE/ACM Transactions on Networking, 9(6), 681–692.
9. Griffin, T. G., & Wilfong, G. (1999). An analysis of BGP convergence properties. In Proceedings of ACM SIGCOMM, pp. 277–288, Boston, MA, September 1999.
10. Griffin, T. G., & Wilfong, G. (2002). On the correctness of IBGP configuration. In Proceedings
of ACM SIGCOMM, pp. 17–29, Pittsburgh, PA, August 2002.
11. Griffin, T. G., Shepherd, B. F., & Wilfong, G. (2002). The stable paths problem and interdomain
routing. IEEE/ACM Transactions on Networking (TON), 10(2) pp. 232–243.
12. Halpern, J. M., Bhatia, M., & Jakma, P. (2006). Advertising Equal Cost Multipath routes in
BGP. Draft-bhatia-ecmp-routes-in-bgp-02.txt
13. Hengartner, U., Moon, S., Mortier, R., & Diot, C. (2002). Detection and analysis of routing
loops in packet traces. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurement, Marseille, France, pp. 107–112.
14. Iselt, A., Kirstädter, A., Pardigon, A., & Schwabe, T. (2004). Resilient routing using ECMP and
MPLS. In Proceedings of HPSR 2004, Phoenix, Arizona, USA, April 2004, pp. 345–349.
15. Kushman, N., Kandula, S., Katabi, D., & Maggs, B. (2007). R-BGP: staying connected in a connected world. In 4th USENIX Symposium on Networked Systems Design & Implementation,
Cambridge, MA, April 2007, pp. 341–354.
16. Kvalbein, A., Hansen, A. F., Cicic, T., Gjessing, S., & Lysne, O. (2006). Fast IP network
recovery using multiple routing configurations. In Proceedings IEEE INFOCOM, pp. 23–26,
Barcelona, Spain, Mar. 2006.
17. Labovitz, C., Malan, G. R., & Jahanian, F. (1998). Internet routing instability. IEEE/ACM
Transactions on Networking 6(5): 515–528 (1998).
18. Labovitz, C., Ahuja, A., Bose, A., et al. (2001). Delayed internet routing convergence.
IEEE/ACM Transactions on Networking, Publication Date: June 2001, 9(3), pp. 293–306.
19. Labovitz, C., Ahuja, A., Wattenhofer, R., et al. (2001). The impact of internet policy and topology on delayed routing convergence. In Proceedings of IEEE INFOCOM’01, Anchorage, AK,
USA, April 2001, pp. 537–546.
20. Mohapatra, P., Fernando, R., Filsfils, C., & Raszuk, R. (2008). Fast connectivity restoration
using BGP add-path. Draft-pmohapat-idr-fast-conn-restore-00.
21. Morley Mao, Z., Bush, R., Griffin, T., & Roughan, M. (2003). BGP Beacons. In Proceedings
of IMC, October 27–29, 2003, Miami Beach, Florida, USA, pp. 1–14.
22. Motiwala, M., Feamster, N., & Vempala, S. (2008). Path splicing. SIGCOMM 2008. Seattle,
WA: August.
23. Pei, D., Azuma, M., Massey, D., & Zhang, L. (2005). BGP-RCN: improving BGP convergence
through root cause notification. Computer Networks, 48(2), 175–194.
24. Pei, D., Zhao, X., Wang, L., Massey, D., Mankin, A., Wu, S. F., & Zhang, L. (2002). Improving BGP convergence through consistency assertions. In Proceedings of the IEEE INFOCOM
2002, vol. 2, New York, NY, June 23–27, 2002, pp. 902–911.
25. PlanetLab, http://www.planet-lab.org
26. Rekhter, Y., Li, T., Hares, S. (2006). A border gateway protocol 4 (BGP-4). RFC 4271.
27. Stamatelakis, D., & Grover, W. D. (2000). IP layer restoration and network planning based
on virtual protection cycles. IEEE Journal on Selected Areas in Communications, 18(10), Oct
2000, pp. 1938–1949.
28. Wang, F., & Gao, L. (2008). A backup route aware routing protocol – fast recovery from transient routing failures. Proceedings of IEEE INFOCOM Mini-Conference, April 2008. Arizona:
Phoenix.

29. Wang, F., Gao, L., Spatscheck, O., & Wang, J. (2008). STRID: Scalable trigger-based route incidence diagnosis. Proceedings of IEEE ICCCN 2008, St. Thomas, U.S. Virgin Islands, August
3–7, 2008, pp. 1–6.
30. Wang, F., Gao, L., Wang, J., & Qiu, J. (2009). On understanding of transient interdomain
routing failures. IEEE/ACM Transactions on Networking, 17(3), June 2009, pp. 740–751.
31. Wang, F., Mao, Z. M., Gao, L., Wang, J., & Bush, R. (2006). A measurement study on
the impact of routing events on end-to-end internet path performance. Proceedings of ACM
SIGCOMM 2006, September 11–15. Pisa, Italy, pp. 375–386.
32. Xu, W., & Rexford, J. (2006). MIRO: multi-path interdomain routing. In Proceedings of ACM SIGCOMM 2006, pp. 171–182, Pisa, Italy.
33. Zhong, Z., Nelakuditi, S., Yu, Y., Lee, S., Wang, J., & Chuah, C.-N. (2005). Failure inferencing based fast rerouting for handling transient link and node failures. In Proceedings of IEEE
Global Internet, Miami, Fl, USA, Mar. 2005, pp. 2859–2863.

Chapter 7

Overlay Networking and Resiliency
Bobby Bhattacharjee and Michael Rabinovich

7.1 Introduction
An “overlay” is a coordinated collection of processes that use the Internet for communication.
The overlay uses the connectivity provided by the network to form
whatever overlay topologies and information flows fit its applications, irrespective
of the topology of the underlying network infrastructure. In a broad sense, every
distributed system and application forms an overlay. Certainly, routing protocols
form overlays, as does the interconnection of NNTP servers that forms the Usenet.
We use the term “overlay networks” in a narrower sense: an application uses an
overlay only if processes on end-hosts are used for routing and relaying messages.
The overlay network is layered atop the physical network, which enables additional
flexibility. In particular, the overlay topology can be tailored to application requirements
(e.g., overlay topologies can be set up to provide low-latency lookup on flat
name spaces), overlay routing may choose application-specific policies (e.g., overlay
routing meshes can find paths in contradiction of policies exported by BGP),
and overlay networks can emulate functionality not supported by the underlying
network (e.g., overlays can implement application-layer multicast over a unicast
network).
The flexibility enabled by overlay networks can be both a blessing and a curse.
On the one hand, it gives application developers the control they need to implement
sophisticated measures to improve the resilience of their application. On the other
hand, overlay networks are built over end-hosts, which are inherently less stable,
reliable, and secure than lower-layer network components comprising the Internet
fabric. This presents significant challenges in overlay network design.

B. Bhattacharjee
Department of Computer Science, University of Maryland, College Park, MD 20742, USA
e-mail: bobby@cs.umd.edu
M. Rabinovich
Electrical Engineering and Computer Science, Case Western Reserve University,
10900 Euclid Avenue, Cleveland, Ohio 44106–7071, USA
e-mail: misha@eecs.case.edu

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_7,
© Springer-Verlag London Limited 2010



In this chapter, we concentrate on the former aspect of overlay networks and
present a survey of overlay applications with a focus on how they are used to
increase network resilience. We begin with a high-level overview of some issues
that can hamper network operation and of how overlay networks can help address
these issues. In particular, we consider how overlay networks can make a distributed
application more resilient to flash crowds and overload, to component failures and
churn, to network failures and congestion, and to denial of service attacks.

7.1.1 Resilience to Flash Crowds and Overload
The emergence of the Web has led to a new phenomenon where Internet resources
are exposed to potentially unlimited demand. It is difficult (and indeed inefficient)
for content providers to provision sufficient capacity for the worst-case load (which
is often hard to predict). Inability to predict worst-case load leaves content providers
susceptible to flash crowds: rapid surges of demand that exceed the provisioned
capacity.
Approaches to address flash crowds differ by resource type. It is useful to distinguish the following types of Internet resources:
• Large files, exemplified by software packages and media files, with file sizes on
the order of megabytes for audio tracks, going up to tens or even hundreds of
megabytes for software packages and gigabytes for full-length movies.
• Web objects, consisting of typical text and pictures on Web pages, with sizes
ranging from one to hundreds of kilobytes.
• Streaming media, where the download (often at bounded bit rates) continues over
the duration of content consumption.
• Internet applications, where a significant part of the service demand to process a
client request is due to the computation at the server rather than delivering content
from the server to the client.
IP multicast is a mechanism at the IP level that could potentially address the flash
crowd problem for the first three of these resource types. At a high level, IP multicast
creates a tree with the content source as the root and the content consumers
as the leaves. The source sends only one copy of a packet, and routers inside the
network forward and duplicate packets as necessary to implement forwarding to all
receivers. IP multicast decouples the resource requirements at the source from the
number of simultaneous receivers of identical data. However, IP multicast cannot
help when different content needs to be sent to different clients, or when the same
content needs to be sent at different times, or when one needs to scale up an Internet
application. Furthermore, although IP multicast is widely implemented, access to
the IP multicast service is enabled only within the confines of individual ISPs and for selected
applications.


Overlay networks can help overcome these limitations. Content delivery
networks are an overlay-based approach widely used for streaming, large file,
and Web content delivery. A content delivery network (CDN) is a third-party infrastructure that content providers employ to deliver their data. In a sense, it emulates
multicast at the application level, with content providers’ sites acting as roots of the
multicast trees and servers within the CDN infrastructure as internal multicast tree
nodes. What distinguishes a CDN from IP multicast is that, as with any overlay,
its deployment does not rely on additional IP services beyond the universal IP
unicast service, and that CDN nodes have long-term storage capability, allowing the
distribution trees to encompass clients consuming content at different times.
A CDN derives economy of scale from the fact that its infrastructure is shared
among multiple content providers who subscribe to the CDN’s service. Indeed,
because flash crowds are unlikely to occur at the same time for multiple content
providers, a CDN needs much less overprovisioning of its infrastructure than an individual content provider: a CDN can reuse the same capacity slack to satisfy peak
demands for different content at different times.
Another overlay approach, called peer-to-peer (P2P) delivery, provides resilience
to flash crowds by utilizing client bandwidth in delivering content. By integrating
clients into the delivery infrastructure, P2P approaches promise the ability to organically scale with the demand surge: the more clients want to obtain certain content,
the more resources are added to the delivery infrastructure. The P2P paradigm has
been explored in various contexts, but most widely used are P2P approaches to
large-file downloads and streaming content.
Peer-to-peer or peer-assisted delivery of streaming content is particularly compelling because streaming taxes the capacity of the network and at the same time
imposes stringent timing requirements. Consider, for example, a vision for a future Internet TV service (IPTV), where viewers can seamlessly switch between
tens of thousands of live broadcast channels from around the world, millions of
video-on-demand titles, and tens of millions of videos uploaded by individual users
using capabilities similar to those provided by today’s YouTube-type applications.
Consider a global carrier providing this service in high-definition to 500 million subscribers, with 200 million simultaneous viewers at peak demand watching different
streams – either distinct titles or the same titles shifted in time. Assume conservatively that a high-definition stream requires a streaming rate of 6 Mbps (it is
currently close to 10 Mbps but is projected to decrease with improvements in coding). The aggregate throughput needed to deliver these streams to all the viewers is 1.2
Petabits per second. Even if a video server could deliver 10 Gbps of content, the
carrier would need to deploy 120,000 video servers to satisfy this demand through
naive unicast. Given these demands on the network and server capacities, overlay
networks – in particular peer-to-peer networks – are important technologies to enable IPTV on a massive scale.
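
The back-of-the-envelope figures in this example can be checked with a few lines of Python. The variable names are ours; the inputs are the numbers given in the text, with decimal prefixes (1 Mbps = 10^6 bits per second) assumed throughout.

```python
viewers = 200_000_000            # simultaneous viewers at peak demand
stream_bps = 6_000_000           # 6 Mbps per high-definition stream
server_bps = 10_000_000_000      # 10 Gbps delivered by one video server

aggregate_bps = viewers * stream_bps
# Ceiling division: a partially loaded server still counts as one server.
servers_needed = -(-aggregate_bps // server_bps)

print(aggregate_bps / 1e15)  # 1.2 (petabits per second)
print(servers_needed)        # 120000
```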


7.1.2 Resilience to Component Failures and Churn
A distributed application needs to be able to operate when some of its components
fail. For example, we discussed how P2P networks promise resiliency to flash
crowds. However, because they integrate users’ computers into the content delivery infrastructure, they are especially prone to component failures (e.g., when a user
kills a process or terminates a program) and to peer churn (as users join and leave
the P2P networks). The flexibility afforded by overlay networks can be exploited
to incorporate a range of redundancy mechanisms. These mechanisms allow system designers to utilize many failure-prone components (often user processes on
end-hosts) to craft highly resilient applications.
Existing P2P networks have proven this resiliency by functioning successfully
despite constant peer churn. Besides traditional file-sharing P2P networks, other
examples of churn-resistant overlay network designs include a peer-to-peer Web
caching system [36] and a churn-resistant distributed hash table [52].

7.1.3 Resilience to Network Failures and Congestion
Overlay networks can mitigate the effects of network outages and hotspots. Two
end-hosts communicating over an IP network have little control over path selection
or quality. The end-to-end path is a product of the IGP routing metrics used within
the involved domains, and the BGP policies (set by administrators of these domains)
across the domains. These metrics and policies are often entirely nonresponsive to
transient congestion; in some cases, two nodes may fail to find a path (due to BGP
policies) even when a path exists.
Overlay networks allow end-users finer-grained control over routing and thus
can be agile in reacting to the underlying network conditions. Consider a hypothetical voice-over-IP communication between hosts at the University of Maryland
(in College Park, Maryland) and Case Western Reserve University (in Cleveland,
Ohio). The default path may traverse an Internet2 router in Pennsylvania. However,
if this router is congested, an overlay-based routing system that is sensitive to path
latency could try to route around the congestion. For instance, the routing overlay
could tunnel the packets through overlay nodes at the University of Virginia and the
University of Illinois, which might bypass the temporary congestion on the default
path.
Systems such as RON [4], Detour [55, 56], and Peerwise [38] create such routing overlays that route around adverse conditions in the underlying IP network.
These systems build meshes for overlay routing and make autonomous routing decisions. RON builds a fully connected mesh and continually monitors all edges.
When the direct path between two nodes fails or has shown degraded performance,
communication is rerouted through the other overlay nodes. Not all systems build a
fully connected mesh: Nakao et al. [44] use topology information and geography-based distance prediction to build a mesh that is representative of the underlying
physical network. Peerwise creates overlay links only between nodes that can provide shortcuts to each other. Experiments with all of these systems show that it is
indeed possible to reduce end-to-end latency and improve connectivity using routing
overlays.
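The rerouting decision described above can be sketched as a search over one-hop detours using measured latencies. This is only an illustration under simplifying assumptions, not the algorithm used by RON, Detour, or Peerwise; the function name `best_path`, the site labels, and the latency values are all hypothetical.

```python
import math

def best_path(latency, src, dst):
    """Return (path, cost): the direct path or the best one-hop detour.

    latency[a][b] is a measured latency from a to b in milliseconds;
    a missing entry is treated as a failed link. All names are illustrative.
    """
    best = ([src, dst], latency[src].get(dst, math.inf))
    for relay in latency:
        if relay in (src, dst):
            continue
        cost = latency[src].get(relay, math.inf) + latency[relay].get(dst, math.inf)
        if cost < best[1]:
            best = ([src, relay, dst], cost)
    return best

# Hypothetical measurements in which the direct UMD-CWRU path is congested.
latency = {
    "UMD":  {"CWRU": 180.0, "UVA": 10.0, "UIUC": 30.0},
    "UVA":  {"UMD": 10.0, "CWRU": 25.0, "UIUC": 28.0},
    "UIUC": {"UMD": 30.0, "CWRU": 15.0, "UVA": 28.0},
    "CWRU": {"UMD": 180.0, "UVA": 25.0, "UIUC": 15.0},
}
path, cost = best_path(latency, "UMD", "CWRU")
```

With these made-up numbers the overlay prefers the relay through UVA over the congested direct path. A deployed system would also have to weigh probing overhead and path stability, which this sketch ignores.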

7.1.4 Resilience to DoS Attacks
Overlay networks can be used to protect content providers from Distributed Denial-of-Service (DDoS) attacks. During a DDoS attack, an attacker directs a set of
compromised machines to flood the victim’s incoming links. DDoS attacks are
effective because (1) the content provider often cannot distinguish an attacking connection from a legitimate client connection, (2) the number of attacking hosts can
be large enough that it is difficult for the victim’s network provider to set up static
address filters, and (3) the attackers may spoof their source IP addresses. Over the
last decade, DDoS attacks have interrupted service to many major Internet destinations, and in some cases, have been the root cause for the termination of service [31].
Networking researchers have developed many elegant approaches to mitigating the
effect of and tracing the root of DDoS attacks; unfortunately, almost all of them
require changes to the core Internet protocols.
Overlay services can be used to provide resiliency without changing protocols or
infrastructure. SOS [28] and Mayday [3] are overlay services that “hide” the address
of the content-providing server. Instead the server is “protected” by an overlay, and
access to the server may require strong authentication or captchas (that can distinguish attackers from legitimate clients). The protective overlay is large enough that
it is not feasible or profitable to attack the entire overlay. The content provider’s
ISP blocks all access to the server except by a small set of (periodically changing)
trusted nodes who relay legitimate requests to the server.

7.1.5 Chapter Organization
We have discussed various ways in which overlay networks can improve resiliency
of networked applications. In the rest of this chapter we discuss some of these applications in more detail. We begin by introducing a foundational concept used in many
overlay applications – a distributed hash table – in Section 7.2. We then discuss representative overlay applications including streaming media systems in Section 7.3
and Web content delivery networks in Section 7.4. Section 7.5 describes an overlay approach to improving the resiliency of Web services against DDoS attacks.
We discuss swarming protocols for bulk transfer in Section 7.6, and conclude in
Section 7.7.

226

B. Bhattacharjee and M. Rabinovich

7.2 A Common Building Block: DHTs
Distributed applications often maintain large sets of identifiers or keys, such as
names of files, IDs of game players, or addresses of chat rooms. For scalability,
resilience, and load-balance, the task of maintaining these keys is divided amongst
the nodes participating in the system. This approach scales since each node only
deals with a limited subset of keys, it is resilient since a single key can be replicated
onto more than one node, and finally it balances load since lookups and storage
overhead are distributed (relatively) evenly over all the participants.
A node responsible for a key may perform various application-specific actions
related to this key: store the corresponding data, act as a control server for a named
group, and so forth. A fundamental capability such a system must support is to allow
each participating node to identify the node(s) responsible for a given key. Once
a seeking node locates the node(s) that store a key, it may initiate corresponding
actions.
Distributed Hash Tables (DHTs) are a technique for efficiently distributing keys
among nodes. DHTs provide this capability while limiting the knowledge each node
must maintain about the other nodes in the system: instead of directly determining
a responsible node (as would be the case with regular hashing), a node can only
determine some nodes that are “closer” (by some metric) to the responsible node.
The node then sends its request to one of the closer nodes, which in turn would
forward the request toward a responsible node until the request reaches its target.
Good DHTs ensure that requests must traverse only a small number of overlay hops
en route to a responsible node. In a system with $n$ nodes, many DHT protocols limit
this hop count to $O(\log n)$ while storing only $O(\log n)$ routing state at each node
for forwarding requests. Newer designs reduce some of the overheads to constants
[23, 41, 50].
DHTs are a common building block for many types of distributed services,
including distributed file systems [18], publish–subscribe systems [14, 58], cooperative Web caching [25], and name service [6]. They have even been proposed as a
foundation for general Internet infrastructures [58]. DHTs can be built using a structured network, in which the DHT protocol chooses which nodes in the network are
linked (and uses the structure inherent in these connections to reduce lookup time)
or an unstructured network, in which the node interconnection is either random or
an external agent specifies which nodes may be connected (as can be the case if
links are constrained as in a wireless network or have specific semantics such as
trust). We next describe prototypical DHT systems that are designed for cooperative
environments.

7.2.1 Chord: Lookup in Structured Networks
Chord [59] was one of the first DHTs that routed requests in $O(\log n)$ overlay hops
while requiring each node to store only $O(\log n)$ routing state. The routing state at
each node contains pointers to some other nodes and is called a node’s finger table.
Nodes responsible for a key store a data item associated with this key; the DHT can
be used to look up data items by key.
Chord assigns an identifier (uniformly at random) to each node from a large ID
space ($2^N$ IDs; $N$ is usually set to 64 or 128). Each item to be stored in the DHT
is also assigned an ID from the same space. Chord orders IDs onto a ring modulo
$2^N$. An item is mapped to the node with the smallest ID larger than the item’s ID
modulo $2^N$. Using this definition, we say that each item is mapped onto the node
“closest” to the item in the ID space.
A node with ID $x$ stores a “finger table”, which consists of references to the nodes
closest to the IDs $x + 2^i$, $i \in \{0, \ldots, N-1\}$. The successor of $i$, denoted $s(i)$, is the node
whose ID is immediately greater than $i$’s ID modulo $2^N$. Likewise, the predecessor
of $i$, $p(i)$, is the node whose ID is immediately less than $i$’s (Fig. 7.1). Each Chord
node is responsible for the half-open interval consisting of its predecessor’s ID (noninclusive) and its own ID (inclusive).
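The mapping of IDs to nodes and the construction of a finger table can be sketched in a few lines. The example assumes a small 6-bit ID space and a hypothetical set of node IDs chosen for illustration; `successor` follows the half-open interval definition above, so an ID that equals a node's ID maps to that node.

```python
from bisect import bisect_left

N = 6                      # assumed: 2**N = 64 identifiers in the ID space
RING = 2 ** N
# Hypothetical node IDs, kept sorted so that bisect can be used.
NODES = sorted([2, 4, 8, 15, 21, 22, 28, 31, 34, 43, 46, 52, 55, 62])

def successor(ident):
    """Node responsible for ident: first node ID >= ident, wrapping around the ring."""
    i = bisect_left(NODES, ident % RING)
    return NODES[i % len(NODES)]

def finger_table(x):
    """Nodes closest to the IDs x + 2**i for i = 0 .. N-1."""
    return [successor(x + 2 ** i) for i in range(N)]

fingers_of_2 = finger_table(2)   # targets 3, 4, 6, 10, 18, 34
```

Each entry doubles the distance covered on the ring, which is why a table of only $N$ entries suffices to make fast progress toward any key.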
When a new node joins, it finds its “place” on the ring by routing to its own ID
(say $x$), and can populate its own routing table by successively querying for nodes
with the appropriate IDs ($x + 1, x + 2, x + 4, \ldots$). In the worst case, this incurs
$O(\log^2 n)$ overhead.
A node returns the data (if any) upon receiving a lookup for a key in the range
of IDs it stores. For other lookups, it “routes” (forwards) the query to the node in
its finger table with the highest ID (modulo $2^N$) smaller than the key. This process
iterates until the item is found or it is determined that there is no item corresponding
to the lookup. Figure 7.2 shows two examples of lookups in Chord. In the first case,
the data corresponding to key value 3 is looked up (starting from node 52); in the
second, 42 is looked up starting from node 46. The figure shows the nodes visited
by the queries in each case, and also the interval (part of the Chord space) each node
is responsible for.

Fig. 7.1 Finger table state for Node 2

Fig. 7.2 Two lookups on the Chord ring

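The forwarding rule can be simulated on a small ring. The sketch below assumes a 6-bit ID space and a hypothetical set of node IDs, and it applies the rule exactly as stated in the text (forward to the finger with the highest ID that still precedes the key); the hop sequence it produces depends on these assumptions and need not match the hops drawn in Fig. 7.2.

```python
from bisect import bisect_left

N = 6                      # assumed: 2**N = 64 identifiers
RING = 2 ** N
NODES = sorted([2, 4, 8, 15, 21, 22, 28, 31, 34, 43, 46, 52, 55, 62])  # hypothetical IDs

def dist(a, b):
    """Clockwise distance from a to b on the ring."""
    return (b - a) % RING

def successor(ident):
    """First node ID >= ident, wrapping around the ring."""
    i = bisect_left(NODES, ident % RING)
    return NODES[i % len(NODES)]

def fingers(node):
    return [successor(node + 2 ** i) for i in range(N)]

def lookup(start, key):
    """Return the nodes visited until the node responsible for key is reached."""
    path, node = [start], start
    # Global knowledge is used only to decide when to stop; a real node
    # would compare the key against its own (predecessor, self] interval.
    owner = successor(key)
    while node != owner:
        succ = successor(node + 1)
        if 0 < dist(node, key) <= dist(node, succ):
            nxt = succ                      # the key lies between us and our successor
        else:
            preceding = [f for f in fingers(node)
                         if 0 < dist(node, f) < dist(node, key)]
            nxt = max(preceding, key=lambda f: dist(node, f))
        path.append(nxt)
        node = nxt
    return path

first = lookup(52, 3)
second = lookup(46, 42)
```

Every hop at least halves the remaining clockwise distance to the key or lands on the responsible node, which is the intuition behind the logarithmic hop count.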
In practice, Chord nodes inherit most of their routing table from their neighbors
(and avoid the $O(\log^2 n)$ work to populate tables). Nodes periodically search the
ring for “better” finger table entries. As nodes leave and rejoin, the Chord ring is
kept consistent using a stabilize protocol, which ensures eventual consistency of
successor pointers.
More details about Chord, including the details of the stabilization protocol, can
be found in [60].

7.2.2 LMS: Lookup on Given Topologies
As we saw in the previous section, Chord imposes the overlay topology on its nodes
that is stipulated by node IDs, and lookup queries traverse routes in this topology.
Such networks are often referred to as structured. In contrast, some overlay networks allow participating nodes to form arbitrary topologies, irrespective of their
node IDs. These networks are called unstructured. The simplest form of lookup on
an unstructured topology is to flood the query. Flooding searches, while adequate
for small networks, quickly become infeasible as networks grow larger.
LMS (Local Minima Search [43]) is a protocol designed for unstructured networks that scales better than flooding. In LMS, the owner of each object places
replicas of the object on several nodes. Like in a DHT, LMS places replicas onto
nodes which have IDs “close” to the object. Unlike in a DHT, however, in an
unstructured topology there is no deterministic mechanism to route to the node
that is closest to an item. Instead, LMS introduces the notion of a local minimum: a node u is a local minimum for an object if and only if the ID of u is the
closest to the item’s ID in u’s neighborhood (those nodes within h hops of u in the
network, where h is a parameter of the protocol, typically 1 or 2).
In general, for any object there are many local minima in a graph, and replicas are placed onto a subset of these. During a search, random walks are used to
locate minima for a given object, and a search succeeds when a local minimum
holding a replica is located. While DHTs typically provide a worst-case bound of
$O(\log n)$ steps for lookups in a network of size $n$, LMS provides a worst-case bound
of $O(T(G) + \log n)$, where $T(G)$ is the mixing time of $G$ (the time by which a
random walk on the topology $G$ approaches its stationary distribution). $T(G)$ is
$O(\log n)$ or polylogarithmic in $n$ for a wide range of randomly-grown topologies.
This “$O(T(G) + \log n)$” is typically in the 6–15 range in networks of size up to
100,000. Let $d_h$ be the minimum size of the $h$-hop neighborhood of any node in $G$.
LMS achieves its performance by storing $O(\sqrt{n/d_h})$ replicas, and with a message
complexity (in its lookups) of $O(\sqrt{n/d_h} \cdot (T(G) + \log n))$. This is notably worse
than DHTs, but is a considerable improvement over other (essentially linear-time)
lookup techniques in networks that cannot support a structured protocol, and a vast
improvement over flooding-based searches [43].
The use of local minima in LMS provides a high assurance that object replicas are distributed randomly throughout the network. This means that even if the
lookup part of the LMS protocol is not used (such as for searches on object attributes
that consequently cannot use the virtualized object identifier), flooding searches will
succeed with high probability even with relatively small bounded propagation distances. Finally, LMS also provides a high degree of fault-tolerance.
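The local-minimum definition can be made concrete with a short sketch. It assumes a circular distance between IDs and an undirected adjacency-list graph; the random-walk phase is omitted so that the result is deterministic, and the greedy `descend` step is only our assumption about how a walk might settle on a minimum, not a description of the LMS protocol itself.

```python
from collections import deque

def neighborhood(graph, u, h):
    """All nodes within h hops of u (including u), found by breadth-first search."""
    seen = {u}
    frontier = deque([(u, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == h:
            continue
        for v in graph[node]:
            if v not in seen:
                seen.add(v)
                frontier.append((v, d + 1))
    return seen

def id_distance(a, b, ring=2 ** 16):
    """Circular distance between two IDs (the metric is an assumption of this sketch)."""
    d = abs(a - b) % ring
    return min(d, ring - d)

def is_local_minimum(graph, ids, u, obj_id, h=1):
    """True if no node within h hops of u has an ID closer to obj_id than u's ID."""
    mine = id_distance(ids[u], obj_id)
    return all(mine <= id_distance(ids[v], obj_id) for v in neighborhood(graph, u, h))

def descend(graph, ids, start, obj_id, h=1):
    """Move to the closest node in the current neighborhood until a local minimum is reached."""
    u = start
    while not is_local_minimum(graph, ids, u, obj_id, h):
        u = min(neighborhood(graph, u, h), key=lambda v: id_distance(ids[v], obj_id))
    return u

# Hypothetical five-node path a-b-c-d-e with made-up IDs.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
ids = {"a": 100, "b": 40, "c": 70, "d": 10, "e": 90}
```

In this toy graph both b and d are local minima for object ID 0, which illustrates why a search may need several walks before it reaches a minimum that actually holds a replica.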

7.2.3 Case Study: OpenDHT
Since many distributed applications can benefit from a lookup facility, a logical step
is to develop a DHT substrate. OpenDHT is an example of such a substrate [53].
An application using a DHT may need to execute application-specific actions
at each node along DHT routing paths or at the node responsible for a given key.
However, to satisfy a range of applications, OpenDHT takes a minimalist approach:
it only allows applications to associate a data item with a given key and store it in the
substrate (at a node or nodes that OpenDHT selects to be responsible for this key)
as well as retrieve it from the substrate. The DHT routing is done “under the covers”
within the substrate and is not exposed to the application.
In other words, OpenDHT is an external storage platform for third-party applications. While OpenDHT in itself is a peer-to-peer overlay network, application
end-hosts do not participate in it directly. Instead, it runs on PlanetLab [16] nodes;
applications that use OpenDHT may or may not use PlanetLab.

OpenDHT provides two simple primitives to applications: put(key, data) which
is used to store a data item and an associated key, and get(key) which retrieves previously stored data given its key (see footnote 1). Multiple puts with the same key append their
data items to the already existing ones, so a subsequent get would retrieve all these
data. OpenDHT, therefore, implements an application-agnostic shared storage facility. Due to its open nature, OpenDHT includes special mechanisms to prevent
resource hoarding by any given user. It also limits the size of data items to 1 KB and
times out deposited data items that are not explicitly renewed by the application.
Renewal is done by issuing an identical “put” before the original data item expires.
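The semantics described above (append on put, a 1 KB item limit, expiry unless renewed) can be mimicked by a small in-memory stand-in. This is a toy model for illustration only and is not the OpenDHT API; the class name, the `ttl_seconds` parameter, and the explicit `now` argument (used to keep the example deterministic) are our assumptions.

```python
import time

MAX_ITEM_BYTES = 1024  # the 1 KB item limit mentioned in the text

class ToySharedStore:
    """In-memory stand-in for a put/get substrate; not the OpenDHT API."""

    def __init__(self):
        self._items = {}  # key -> {data: expiry time}

    def put(self, key, data, ttl_seconds, now=None):
        if len(data) > MAX_ITEM_BYTES:
            raise ValueError("data item exceeds 1 KB")
        now = time.time() if now is None else now
        # A new value is appended; an identical put only refreshes the expiry.
        self._items.setdefault(key, {})[data] = now + ttl_seconds

    def get(self, key, now=None):
        now = time.time() if now is None else now
        live = {d: t for d, t in self._items.get(key, {}).items() if t > now}
        self._items[key] = live  # drop items that have timed out
        return list(live)

store = ToySharedStore()
store.put("room", b"alice", ttl_seconds=60, now=0)
store.put("room", b"bob", ttl_seconds=60, now=10)
both = store.get("room", now=20)
store.put("room", b"alice", ttl_seconds=60, now=50)   # renewal by an identical put
after_expiry = store.get("room", now=80)              # bob expired at time 70
```

A real substrate would additionally replicate items across nodes and enforce per-client storage fairness, both of which this sketch leaves out.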
The shared storage provided by OpenDHT allows end-hosts in a distributed application to conveniently share state, without any administrative overhead. This
capability turned out to be powerful enough to support a growing number of applications. In fact, OpenDHT primitives can be used to implement an application
that employs its own DHT routing among the application’s end-hosts [53].
While a great deal of engineering ingenuity ensures that OpenDHT nodes’ resources are shared fairly among competing applications, OpenDHT’s resiliency and
scalability come from its overlay network architecture. Besides demonstrating these
benefits of overlays, OpenDHT has shown the generality of the DHT concept by
using it as a foundation of a substrate that has proved useful for a number of diverse
applications.

7.2.4 Securing DHTs
Chord and LMS are only two of many different contemporary lookup protocols.
These two protocols assume that nodes are cooperative and altruistic. While these
protocols are highly resilient to random component failures, it is more difficult to
protect them against malicious attacks. This is especially a concern since DHTs
may be built using public, non-centrally administered nodes, some of which may
be corrupt or compromised. There are several ways in which adversarial nodes may
attempt to subvert a DHT. Malicious nodes may return incorrect results, may attempt
to route requests to other incorrect nodes, provide incorrect routing updates, prevent
new nodes from joining the system, and refuse to store or return items. There are
several DHT designs that provide resilience to these types of attacks. We describe
one in detail next.

7.2.5 Case Study: NeighborhoodWatch
The NeighborhoodWatch DHT [11] provides security against malicious users
that attempt to subvert a DHT instance by misrouting or dropping queries,
refusing to store items, preventing new nodes from joining, and similar attacks.
NeighborhoodWatch employs the same circular ID space as Chord [59], and also
maps its nodes into neighborhoods as in [20]. However, in NeighborhoodWatch,
each node has its own neighborhood that consists of itself, its $k$ successors, and
$k$ predecessors, where $k$ is a system parameter. NeighborhoodWatch’s security
guarantees hold if and only if for every sequence of $k + 1$ consecutive DHT nodes,
at least one is alive and honest.

Footnote 1 (to the put and get primitives in Section 7.2.3): The actual API includes additional primitives and parameters, which are beyond the scope of our
discussion.

NeighborhoodWatch employs an on-line trusted authority, the Neighborhood
Certification Authority (NCA) to attest to the constituents of neighborhoods. The
NCA has a globally known public key. The NCA may be replicated, and the state
shared between NCA replicas is limited to the NCA private key, a list of malicious
nodes, and a list of complaints of non-responsive nodes.
The NCA creates, signs, and distributes neighborhood certificates, or nCerts, to
each node. Nodes need a current and valid nCert in order to participate in the system.
Upon joining, nodes receive an initial nCert from the NCA. nCerts are not revoked;
instead nodes must renew their nCerts on a regular basis by contacting the NCA.
nCerts list the current membership of a neighborhood, accounting for any recent
changes in membership that may have occurred. Using signed nCerts, any node can
identify the set of nodes that are responsible for storing an item with a given ID.
NeighborhoodWatch employs several mechanisms that detect and prove misbehavior (described in detail in [11]). The NCA removes malicious nodes from the DHT
by refusing to sign a fresh nCert for that node.
Nodes maintain and update their finger tables as in Chord. The join procedure
is shown in Fig. 7.3. For each of node n’s successors, predecessors, and finger table
entries, n stores a full nCert (instead of only the node ID and IP address as in Chord).
When queried as part of a lookup operation, nodes return nCerts rather than information about a single node. Routing is iterative: if a node on the path fails (or does not
answer), the querier can contact another node in the most recently obtained nCert.

Fig. 7.3 The join process in the NeighborhoodWatch DHT [11]. Here k = 3. The panels show six steps:
(1) Node n requests to join by contacting an NCA replica.
(2) NCA returns an nCert to n, who uses it to find owner(n.id).
(3) Node n returns the nCert of owner(n.id) to NCA.
(4) NCA requests neighborhood certificates from k predecessors and k successors of n.
(5) Nodes return current certificates and the NCA verifies their consistency.
(6) NCA issues fresh certificates to all affected nodes.

Recall that NeighborhoodWatch assumes that every sequence of $k + 1$ consecutive nodes in the DHT contains at least one node that is alive and honest. The insight
is that if nodes cannot choose where they are placed in the DHT, malicious nodes
would have to corrupt a large fraction of the nodes in the DHT in order to obtain
a long sequence of consecutive, corrupt nodes. By making routing depend on long
sequences of nodes (neighborhoods), nodes are guaranteed to know of at least one
other honest node that is “near” a given point in the DHT. In order to protect against
a given fraction f of malicious nodes, the system operator chooses a value of k
such that this assumption holds with high probability.
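One way to reason about the choice of k is a simple union bound. Assume, for illustration only, that each of the n positions on the ring is malicious independently with probability f (node failures are ignored). A given window of k + 1 consecutive nodes is then entirely malicious with probability f^(k+1), and the probability that any of the n windows is entirely malicious is at most n * f^(k+1). The function below, whose name and parameters are ours, finds the smallest k that pushes this bound below a target epsilon; it is not the analysis given in [11].

```python
def min_neighborhood_k(n, f, epsilon):
    """Smallest k with n * f**(k + 1) <= epsilon (union bound over n windows)."""
    if not 0 < f < 1:
        raise ValueError("f must be strictly between 0 and 1")
    k = 0
    while n * f ** (k + 1) > epsilon:
        k += 1
    return k

# Hypothetical setting: 1000 nodes, a quarter of them malicious,
# and a one-in-a-million target for the assumption being violated.
k_needed = min_neighborhood_k(n=1000, f=0.25, epsilon=1e-6)
```

Because the bound decays geometrically in k, the required neighborhood size grows only logarithmically with the number of nodes and with the inverse of the target probability.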
Items published to the DHT are self-certifying. In addition, when a node stores
an item, it returns a signed receipt to the publisher. This receipt is then stored back
in the DHT. This prevents nodes from lying about whether they are storing a given
item: if a querier suspects that a node is refusing to return an item, it can look for a
receipt. If it finds a receipt, it can petition the NCA to remove the misbehaving node
from the DHT.

7.2.6 Summary and Further Reading
In this section, we have described the basic functionality provided by DHTs, and
provided case studies that demonstrate different flavors of DHTs and lookup protocols. We have described how DHTs attain their lookup performance, and also
described how DHT protocols can be subverted by attackers. Finally, we have
presented a DHT design that is more resilient to noncooperative and malicious behavior. Our review is not comprehensive; there are many other interesting DHT
designs. We point the interested reader to [12, 20, 23, 41, 50, 51, 54, 66].

7.3 Resilient Overlay-Based Streaming Media
Overlay-based streaming media systems can be decomposed into three broad categories depending on their data delivery mechanism (Fig. 7.4).
Participants in a single-tree system arrange themselves into a tree. By definition,
this implies that there is a single, loop-free, path between any two tree nodes. The
capacity of each tree link must be at least the streaming rate. Content is forwarded
(i.e., pushed) along the established tree paths. The source periodically issues a content packet to its children in the tree. Upon receiving a new content packet, each
node immediately forwards a copy to its children. The uplink bandwidth of leaf
nodes remains unused (except by recovery protocols) in a single-tree system.

Fig. 7.4 Decomposition of Streaming Media Protocols. The base categories are Single-Tree, Multi-Tree, and Mesh; the hybrids are Single-Tree Mesh Hybrid and Multi-Tree Mesh Hybrid.

Examples of single-tree systems include ESM [15], Overcast [26], ZIGZAG [61],
and NICE [8].
In a multi-tree system, each participating node joins k different trees and the
content is partitioned into k stripes. Each stripe is then disseminated in one of the
trees, just as in a single-tree system. In a multi-tree protocol, each member node
can be an interior node in some tree(s) and a leaf node in other trees. Further, each
stripe requires only $1/k$ of the full stream bandwidth, enabling multi-trees to utilize
forwarding bandwidths that are a fraction of the stream rate. These two properties
enable multi-tree systems to utilize available bandwidth better than a single-tree.
SplitStream [13], CoopNet [45], and Chunkyspread [62] are examples of multi-tree
systems.
In mesh-based or swarming overlays, the group members construct a random
graph. Often, a node’s degree in the mesh is proportional to the node’s forwarding
bandwidth, with a minimum node degree (typically five [69]) sufficient to ensure
that the mesh remains connected in the presence of churn.
The source periodically makes a new content block available, and each node
advertises its available blocks to all its neighbors. A missing block can then be
requested from any neighbor that advertises the block. Examples of mesh-based
systems are CoolStreaming [69], Chainsaw [46], PRIME [39], and PULSE [47].
As Fig. 7.4 shows, the base dataplanes can be combined to form hybrid dataplanes. Hybrid dataplanes combine tree- and mesh-based systems by employing a
tree backbone and an auxiliary mesh structure. Typically, blocks are “pushed” along
the tree edges (as in a regular tree protocol) and missing blocks are “pulled” from
mesh neighbors (as in a regular mesh protocol).
Prototypical examples of single-tree-mesh systems are mTreeBone [65] and
Pulsar [37]. Bullet [29] is also a single-tree mesh but instead of relying on the primary tree backbone to deliver the majority of blocks, random subsets of blocks are
pushed along a given tree edge and nodes recover the missing blocks via swarming.
PRM [9] is a probabilistic single-tree mesh system. Chunkyspread [62], GridMedia [68], and Coolstreaming+ [33, 34] are multi-tree-mesh systems. CPM [22] is a
server-based system that combines server multicast and peer-uploads.

7.3.1 Recovery Protocols
Tree-based delivery is fragile, since a single failure disconnects the data delivery until the tree is repaired. Existing protocols have added extra edges to a tree
(thus approximating a mesh) for reducing latency [40] and for better failure recovery [9, 67]. These protocols are primarily tree-based, but augment tree delivery (or
recovery) using the extra links. Multi-tree protocols are more resilient, since a single failure
often affects only one (of k) trees. Mesh delivery is robust by design; single node
or even multiple failures are not of high consequence since the data is simply pulled
along surviving mesh paths.
We next describe in detail different delivery protocols with a focus on their
recovery behavior.

7.3.2 Case Study: Recovery in Trees Using Probabilistic Resilient
Multicast (PRM)
PRM [10] introduces three new mechanisms – randomized forwarding, triggered
NAKs and ephemeral guaranteed forwarding – to tree delivery. We discuss randomized forwarding in detail.
In randomized forwarding, each overlay node, with a small probability,
proactively sends a few extra transmissions along randomly chosen overlay edges.
Such a construction interconnects the data delivery tree with some cross edges
and is responsible for fast data recovery in PRM under high failure rates of overlay nodes. We explain the details of proactive randomized forwarding [10] using
the example shown in Fig. 7.5. In the original data delivery tree (Panel 0), each
overlay node forwards data to its children along its tree edges. However, due to network losses on overlay links (e.g., $\langle A, D \rangle$ and $\langle B, F \rangle$) or failure of overlay nodes
(e.g., C, L, and Q), a subset of existing overlay nodes do not receive the packet
(e.g., D, F, G, H, J, K, and M). We remedy this as follows. When any overlay node
receives the first copy of a data packet, it forwards the data along all other tree edges
(Panel 1). It also chooses a small number ($r$) of other overlay nodes and forwards
data to each of them with a small probability, $\beta$. For example, node E chooses to forward data to two other nodes, F and M, using cross edges. Note that as a consequence
of these additional edges some nodes may receive multiple copies of the same packet
(e.g., node T in Panel 1 receives the data along the tree edge $\langle B, T \rangle$ and cross edge
$\langle P, T \rangle$). Therefore, each overlay node needs to detect and suppress such duplicate
packets. Each overlay node maintains a small duplicate suppression cache, which
temporarily stores the set of data packets received over a small time window. Data
packets that miss the latency deadline are dropped. Hence the size of the cache is
limited by the latency deadline desired by the application. In practice, the duplicate
suppression cache can be implemented using the playback buffer already maintained
by streaming media applications. It is easy to see that each node on average sends
or receives up to $1 + \beta r$ copies of the same packet. The overhead of this scheme is
$\beta r$, where we choose $\beta$ to be a small value (e.g., 0.01) and $r$ to be between 1 and 3.
In PRM, nodes discover other random nodes by employing periodic random walks.

Fig. 7.5 The basic idea behind PRM (Panels 0 and 1). The circles represent the overlay nodes. The crosses indicate
link and node failures. The arrows indicate the direction of data flow. The curved edges indicate
the chosen cross overlay links for randomized forwarding of data. [10]

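The per-node behavior described above (forward on tree edges, add a few probabilistic cross edges, suppress duplicates) can be sketched as follows. The class and its fields are hypothetical, the duplicate cache is an unbounded set rather than a time-windowed buffer, and the list of known peers stands in for whatever the random walks would discover; it is a sketch of the idea, not the PRM implementation from [10].

```python
import random

class PrmNode:
    """One overlay node doing tree forwarding plus randomized forwarding."""

    def __init__(self, name, tree_neighbors, known_nodes, r=3, beta=0.01, rng=None):
        self.name = name
        self.tree_neighbors = list(tree_neighbors)  # parent and children in the tree
        self.known_nodes = list(known_nodes)        # peers discovered, e.g., by random walks
        self.r = r                                  # number of random peers considered
        self.beta = beta                            # probability of using each cross edge
        self.rng = rng or random.Random()
        self.seen = set()                           # duplicate suppression cache (unbounded here)

    def receive(self, seq, sender=None):
        """Return the nodes to which packet number seq is forwarded."""
        if seq in self.seen:
            return []                               # duplicate copy: suppress it
        self.seen.add(seq)
        targets = [n for n in self.tree_neighbors if n != sender]
        candidates = [n for n in self.known_nodes if n not in targets and n != sender]
        for peer in self.rng.sample(candidates, min(self.r, len(candidates))):
            if self.rng.random() < self.beta:
                targets.append(peer)                # cross edge chosen for this packet
        return targets

# Hypothetical node E with tree neighbors B (parent), K and L (children).
node = PrmNode("E", ["B", "K", "L"], ["F", "M", "T"], r=2, beta=1.0, rng=random.Random(0))
out = node.receive(1, sender="B")
```

Setting `beta=1.0` in the usage lines forces both cross edges to be used so that the behavior is easy to inspect; with the small values suggested in the text, most packets travel on tree edges only.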
It is instructive to understand why such a simple, low-overhead randomized forwarding technique is able to increase packet delivery ratios with high probability,
especially when many overlay nodes fail. Consider the example shown in Fig. 7.6,
where a large fraction of the nodes have failed in the shaded region. In particular, the
root of the subtree, node A, has also failed. So if no forwarding is performed along
cross edges, the entire shaded subtree is partitioned from the data delivery tree. No
overlay node in this entire subtree would receive data packets until the partition is
repaired. However, using randomized forwarding along cross edges a number of
nodes from the unshaded region will have random edges into the shaded region
as shown ($\langle M, X \rangle$, $\langle N, Y \rangle$, and $\langle P, Z \rangle$). The overlay nodes that receive data along
such randomly chosen cross edges will subsequently forward data along regular tree
edges and any chosen random edges. Since the cross edges are chosen uniformly at
random, a large subtree will have a higher probability of cross edges being incident
on it. Thus as the size of a partition increases, so does its chance of repair using
cross edges.
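The claim that larger partitions are more likely to be repaired can be checked with a back-of-the-envelope model. Assume, purely for illustration, that m nodes outside the partition hold the packet, that each considers r peers drawn uniformly from all n overlay nodes and forwards to each with probability $\beta$, and that the partitioned subtree contains s surviving nodes. Each of the m * r trials then reaches the subtree with probability $\beta s / n$, independently. This model and the function name are our assumptions, not an analysis taken from [10].

```python
def repair_probability(s, n, m, r, beta):
    """P(at least one cross-edge packet reaches a subtree with s surviving nodes)."""
    p_hit = beta * s / n                 # one trial lands inside the subtree
    return 1.0 - (1.0 - p_hit) ** (m * r)

# Hypothetical overlay: 10,000 nodes, 5,000 of which hold the packet.
small = repair_probability(s=10, n=10_000, m=5_000, r=3, beta=0.01)
large = repair_probability(s=1_000, n=10_000, m=5_000, r=3, beta=0.01)
```

Under these assumptions the probability grows monotonically with s, matching the qualitative argument above: a large disconnected subtree is almost certain to receive at least one randomly forwarded copy.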
Triggered NAKs are the reactive components of PRM. An overlay node can detect missing data using gaps in received sequence numbers. This information is used
to trigger NAK-based retransmissions. PRM further includes an Ephemeral Guaranteed Forwarding technique, which is useful for providing uninterrupted data service
when the overlay construction protocol is detecting and repairing a partition in the
data delivery tree. Here, when the tree is being repaired, the root of an affected subtree receives a stream of data from a “random” peer. More details about PRM are
available in [10].

Fig. 7.6 PRM provides successful delivery with high probability because large subtrees affected by a node failure get randomized recovery packets with high probability. The shaded region marks an overlay subtree with a large number of node failures. [10]

7.3.3 Case Study: Multi-Tree Delivery Using Splitstream
In Splitstream, the media is divided into k stripes, using a coding technique such
as multiple description coding (MDC). All of the stripes in aggregate provide perfect
quality, but each stripe can be used independent of the others and each received
stripe progressively improves the stream quality. Splitstream forms k trees, such
that, ideally, each node is an interior node in only one tree. The source multicasts
stripes onto different trees, and each node receives all stripes and forwards only one
stripe.
When a node departs, at most one tree is affected since every node is a leaf in
all but one tree. Therefore, node departures do not affect delivery quite as much as
in a single-tree system. Further, the forwarding bandwidth of every node is now used,
since each node is an interior node in at least one stripe tree. Finally, since each
stripe is approximately $1/k$ of the bandwidth of the original stream, each node can
serve more children, which results in a shorter tree (higher average outdegree) and
lower latency.
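The bandwidth argument can be illustrated with simple arithmetic. The helper names and the example rates below are hypothetical, and the model assumes equal-rate stripes that each contribute equally to stream quality.

```python
def max_children(uplink_kbps, stream_kbps, k):
    """Children a node can feed when it forwards stripes of rate stream/k."""
    stripe_kbps = stream_kbps / k
    return int(uplink_kbps // stripe_kbps)

def quality_after_losing(stripes_lost, k):
    """Fraction of the stripes still received after some stripe trees are disrupted."""
    return (k - stripes_lost) / k

# Hypothetical node with a 600 kbps uplink and a 400 kbps stream.
single_tree_fanout = max_children(600, 400, k=1)   # whole stream per child
striped_fanout = max_children(600, 400, k=4)       # 100 kbps stripe per child
```

With these example rates the node can feed one child in a single tree but six children of one stripe when the stream is split four ways, and a descendant of a departed interior node still receives three quarters of the stripes while the affected tree is repaired.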
Splitstream is built atop Scribe, which itself is an overlay multicast protocol built
using the Pastry DHT. Due to bandwidth constraints on individual nodes, it is not
always feasible to form the ideal interior-disjoint trees such that each node is an
interior node in only one tree. In particular, a stripe tree may run out of forwarding
bandwidth (because all of its leaf nodes are interior nodes in some other tree). To
solve this problem, Splitstream maintains a “Spare Capacity Group (SCG),” which
contains nodes with extra capacity that can forward onto more than one stripe. In
bandwidth-scarce deployments, nodes may have to use the SCG to locate a parent. In
extreme cases, it may be impossible to form a proper Splitstream forest; however,
this condition is rare and analysed in detail in [13].

7.3.4 Case Study: Recovery Using a Mesh
in CoolStreaming/DONet
In CoolStreaming, a random mesh connects the members of the data overlay, and
random blocks are “pulled” from different mesh neighbors. Each node maintains
an mCache, which is a partial list of other active nodes in the overlay. A new
node initially contacts the source; the source selects a random “deputy” from its
mCache, and the deputy supplies the new node with currently active nodes. Each
node periodically percolates a message (announcing itself) onto the overlay using a
gossip protocol.
The media stream is divided into fixed sized segments; each segment has a
sequence number and each node maintains a bitmap, called the buffer map, to
represent the availability of segments. In CoolStreaming, the default buffer map
contains 120 bits. Each node maintains neighbors (called partners) proportional to
its forwarding bandwidth, while still maintaining a minimum number of partners
(typically 5).
Nodes periodically (usually every second) exchange their buffer maps with their
partners, and use a scheduling heuristic to exchange blocks. The scheduling algorithm must select a block to request, and an eligible node to request the block from.
The block requested is the scarcest block (the one supplied by the fewest nodes). The
node from which this block is requested is the eligible node (which has advertised
the scarce block) with the most bandwidth. The origin node serves only as a supplier
and publishes a new content block every second.
Partners can be updated from the node’s mCache as needed, and the mCache is
updated using the periodic gossip. Individual node failures have very little effect on
the delivery since a node can simply select a different partner to receive a block.
However, the trade-off is control overhead (bitmap exchange) and latency (which is
now proportional to the product of buffer map size and overlay diameter).
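The scheduling rule described above (request the scarcest missing block from the eligible partner with the most bandwidth) can be sketched as follows. This is a minimal illustration, not the DONet implementation: buffer maps are modeled as Python sets rather than 120-bit bitmaps, bandwidth values are in arbitrary units, and the name choose_request is an assumption.

```python
def choose_request(my_map, partner_maps, partner_bandwidth):
    """Return (segment, partner) for the next request, or None.

    my_map: set of segment numbers this node already holds.
    partner_maps: {partner: set of segments that partner advertises}.
    partner_bandwidth: {partner: estimated bandwidth, arbitrary units}.
    """
    suppliers = {}
    for partner, segments in partner_maps.items():
        for seg in segments - my_map:
            suppliers.setdefault(seg, []).append(partner)
    if not suppliers:
        return None
    # Scarcest segment first; ties are broken by the lower sequence number.
    seg = min(suppliers, key=lambda s: (len(suppliers[s]), s))
    # Among the partners advertising it, pick the one with the most bandwidth.
    best = max(suppliers[seg], key=lambda p: partner_bandwidth[p])
    return seg, best


maps = {"a": {1, 2, 3, 4}, "b": {3, 5}, "c": {3, 4}}
bandwidth = {"a": 2, "b": 5, "c": 9}
print(choose_request({1, 2}, maps, bandwidth))
```

Here segment 5 is advertised by a single partner, so it is requested first even though other partners have more bandwidth.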

7.4 Web Content Delivery Networks
Resource provisioning is a fundamental challenge for Internet content providers.
Provision too much and the infrastructure will simply depreciate without generating a
return on investment; provision too little and the web site may lose business and
potentially steer users to competitors.
A content delivery network (CDN) offers a service to content providers that helps
address this challenge. A typical CDN provider deploys a number of CDN servers
around the globe and uses them as a shared resource to deliver content from multiple
content providers that subscribe to the CDN’s service. The CDN servers are also
known as edge servers because they are often located at the edges of the networks
in which they are deployed. Content delivery networks represent a type of overlay
network because they route content between the origin sites and the clients through
edge servers.
A CDN improves resiliency and performance of subscribing web sites in several
ways.
• As already mentioned in Section 7.1.1, a CDN can reuse capacity slack to absorb
demand peaks for different content providers at different times. By sharing a
large slack across a diverse pool of content providers, CDNs improve resiliency
of the subscribing web sites to flash crowds.
• A CDN promises a degree of protection against denial of service attacks because
of the enormous capacity the attacker would need to saturate to cause any noticeable
performance degradation.
• A CDN improves the performance of content delivery under normal load because
it can process client requests from a nearby edge server.
CDNs are used to deliver a variety of content, including static web objects, software packages, multimedia files, and streaming content – both video-on-demand
and live. For video-on-demand, edge servers deliver streams to viewers from their
cached files; typically, these files are pre-loaded to the edge server caches from
origin sites as they become available. However, if a requested file is not cached,
the edge server will typically obtain the stream from the origin and forward it to
the viewer, while also storing the content locally for future requests. In the case
of live streaming (“Webcasts”), content flows form a distribution tree, with viewers as leaves, edge servers as intermediate nodes, and the origin as the root. Often,
however, CDN servers form deeper trees. In either case, Webcast delivery through
a CDN can benefit from various tree-based approaches to streaming media systems
such as those discussed in Section 7.3. In the rest of this section, we will limit our
discussion to how CDNs deliver static files, including static web objects, software
packages, multimedia files, etc.

7.4.1 CDN Basics
A CDN must interpose its infrastructure transparently between the content provider
and the user. Furthermore, unlike P2P networks where users run specialized peer
software, a CDN must serve clients using standard web browsers. Thus, a fundamental building block in a CDN is a mechanism to transparently reroute user requests
from the content provider’s site (known as the “origin site” in CDN parlance)
to the CDN platform. The two main techniques that have been used for this purpose
are DNS outsourcing and URL rewriting. Both techniques rely on the domain name
system (DNS), which maps human-readable names, such as www.firm-x.com, to
numeric Internet protocol (IP) addresses. A browser’s HTTP request is preceded by
a DNS query to resolve the host name from the URL. The DNS queries are sent by
browsers’ local DNS servers (LDNS) and processed by the web sites’ authoritative
DNS servers (ADNS).
In URL rewriting, a content provider rewrites its web pages so that embedded links use host names belonging to the CDN domain. For example, if a page
www.firm-x.com contains an image picture.jpg that should be delivered by the
CDN, the image URL would be rewritten to a form such as http://images.firm-x.com.cdn-foo.net/real.firm-x.com/picture.jpg. In this case, the DNS query for
images.firm-x.com.cdn-foo.net would arrive at the CDN’s DNS server in the normal way,
without redirection from firm-x.com’s ADNS. Note that URL rewriting only works
for embedded and hyperlinked content. The container pages (i.e., the entry points
to the web sites) would have to be delivered from the origin site directly.
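As an illustration of the rewriting step, the sketch below turns an embedded URL into the CDN form used in the example. The original image URL (http://images.firm-x.com/picture.jpg) and the function name rewrite_for_cdn are assumptions made for this sketch; real CDNs define their own URL formats.

```python
from urllib.parse import urlsplit, urlunsplit

CDN_SUFFIX = "cdn-foo.net"        # CDN domain from the example in the text
ORIGIN_HOST = "real.firm-x.com"   # origin host name from the example


def rewrite_for_cdn(url, cdn_suffix=CDN_SUFFIX, origin_host=ORIGIN_HOST):
    """Move the host name under the CDN domain and prefix the path with the origin."""
    parts = urlsplit(url)
    cdn_host = f"{parts.netloc}.{cdn_suffix}"
    cdn_path = f"/{origin_host}{parts.path}"
    return urlunsplit((parts.scheme, cdn_host, cdn_path, parts.query, parts.fragment))


print(rewrite_for_cdn("http://images.firm-x.com/picture.jpg"))
# prints http://images.firm-x.com.cdn-foo.net/real.firm-x.com/picture.jpg
```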

[Figure: diagram with components labeled Client, Local DNS, Auth DNS, Firm-x.com (192.15.183.17), CDN_DNS (135.207.25.01), and CDN servers (135.207.24.x addresses); numbered steps 1 to 6; the query “Images.firm-x.com?” and the referral “Ask 135.207.25.01”]
Fig. 7.7 A high-level view of a CDN architecture

DNS outsourcing refers to techniques that exploit mechanisms in the DNS protocol that allow a query to be redirected from one DNS server to another. Besides
responses containing IP addresses, the DNS protocol allows two response types that
can be used for redirection. An NS-type response specifies a different DNS server
that should be contacted to resolve the query. A CNAME-type response specifies a
canonical name, a different host name that should be used instead of the name contained in the original query. Either response type can be used to implement DNS
outsourcing.
Figure 7.7 depicts a high-level architecture of a CDN utilizing DNS outsourcing.
Consider a content provider – firm-x.com in the example – that subscribes to CDN
services to deliver its content from the images.firm-x.com subdomain. (Content
from other subdomains, such as www.firm-x.com, might be delivered independently,
perhaps by the provider’s origin server itself.)
When a client wants to access a URL with this hostname, it first needs to resolve
this hostname into the IP address of the server. To this end, it sends a DNS query
to its LDNS (step 1), which ultimately sends it to the ADNS server for firm-x.com
(step 2). ADNS now engages the CDN by redirecting LDNS’s query to the DNS
server operated by the CDN provider (CDN DNS in the figure). ADNS does it by
returning, in the exchange of step 2, an NS record specifying CDN DNS. LDNS
now sends the query for images.firm-x.com to CDN DNS, which can now choose
an appropriate edge server and return its IP address to LDNS (step 3). The LDNS
server forwards the response to the client (step 4), which now downloads the file
from the specified server (step 5). When the request arrives at the edge server, the
server may or may not have the requested file in its local cache. If it does not, it
obtains the file from the origin server (step 6) and sends it to the client; the edge
server can also cache this file for future use, depending on the cache-controlling
headers that came with the file from the origin server.
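The resolution steps above can be traced with a toy simulation. This is only a sketch under simplifying assumptions: the name servers are plain dictionaries and functions rather than real DNS software, the NS referral carries an address directly (real NS records carry server names), the edge server addresses are taken loosely from the figure labels, the LDNS address is made up, and the CDN picks an edge server with a trivial hash instead of a real selection policy.

```python
# Hypothetical toy data; only 135.207.25.01 (CDN DNS) is named in the text.
ADNS_FIRM_X = {"images.firm-x.com": ("NS", "135.207.25.01")}  # referral to CDN DNS
EDGE_SERVERS = ["135.207.24.10", "135.207.24.11", "135.207.24.12"]


def cdn_dns_answer(name, ldns_addr):
    """CDN DNS chooses an edge server; here by a simple hash of the LDNS address."""
    index = sum(ldns_addr.encode()) % len(EDGE_SERVERS)
    return ("A", EDGE_SERVERS[index])


def ldns_resolve(name, ldns_addr="10.0.0.53"):
    """Resolve `name` on behalf of a client (step 1) and return (address, trace)."""
    trace = []
    rtype, value = ADNS_FIRM_X[name]                    # step 2: ask the ADNS
    trace.append(("ADNS", rtype, value))
    if rtype == "NS":                                   # redirected to CDN DNS
        rtype, value = cdn_dns_answer(name, ldns_addr)  # step 3: ask CDN DNS
        trace.append(("CDN DNS", rtype, value))
    return value, trace                                 # step 4: answer to client


address, trace = ldns_resolve("images.firm-x.com")
for server, rtype, value in trace:
    print(f"{server:8} -> {rtype:2} {value}")
print("client downloads from", address)                 # step 5
```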
With either DNS outsourcing or URL rewriting, when a DNS query arrives at
CDN’s DNS server, the latter has the discretion to select the edge server whose IP
it would return in the DNS response. This provides the CDN with an opportunity to
spread the content delivery load among its edge servers (by resolving different DNS
queries to different edge servers) and to localize content delivery (by resolving a
given DNS query to an edge server that is close to the requesting client, according to
some metric). There are a number of sometimes conflicting factors that can affect
edge server selection. The mechanisms and policies for server selection are a large
part of what distinguishes different CDNs from one another.
The much-simplified architecture described above is fully workable except for
one detail: how does the edge server receiving a request know which origin server
to contact for the requested file? CDNs use two basic approaches to this issue. In
the example of Fig. 7.7, assuming the client uses HTTP 1.1, the client will include
an HTTP Host header “Host:images.firm-x.com” with its request to the edge server.
This gives the edge server the necessary information.
Another approach, which does not rely on the host header, involves embedding provider identity into the path portion of the URL. This technique is used in
particular with URL rewriting. For example, with the above URL http://images.firm-x.com.cdn-foo.net/real.firm-x.com/picture.jpg, the client’s request to the edge server
will be for file “real.firm-x.com/picture.jpg”, providing the edge server with the information about the origin server.
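Both ways of identifying the origin can be combined in one small routine. The sketch below is hypothetical: the function name origin_from_request and the rule for telling the two cases apart (whether the Host header falls under the CDN's own domain) are assumptions, and a real CDN would additionally map the provider's host name to an origin address through its customer configuration.

```python
def origin_from_request(path, host_header=None, cdn_suffix="cdn-foo.net"):
    """Return (origin_host, origin_path) for a request arriving at an edge server."""
    if host_header and not host_header.endswith("." + cdn_suffix):
        # DNS outsourcing: the Host header names the provider's own domain.
        return host_header, path
    # URL rewriting: the first path component names the origin host.
    origin_host, _, rest = path.lstrip("/").partition("/")
    return origin_host, "/" + rest


print(origin_from_request("/picture.jpg", "images.firm-x.com"))
print(origin_from_request("/real.firm-x.com/picture.jpg",
                          "images.firm-x.com.cdn-foo.net"))
```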

7.4.2 Bag of DNS Tricks
Looking at Fig. 7.7, an immediate concern with this architecture is the CDN DNS
server. First, it is a centralized component that can become the bottleneck in the
system. Second, it undermines localized data delivery to some degree because all
DNS queries must travel to this centralized component no matter where they come
from. These issues are exacerbated by the fact that, in order to retain fine-grained
control over edge server selection, CDN DNS must limit the amount of time its
responses can be cached and reused by clients. It does so by assigning a low time-to-live (TTL) value to its responses, a standard DNS protocol feature for controlling
response caching. This increases the volume of DNS queries that CDN DNS must
handle.
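A back-of-the-envelope bound shows how strongly the TTL drives query volume. The sketch assumes that every LDNS always has at least one client wanting the name, so it re-resolves once per TTL; the LDNS population and TTL values are made-up numbers for illustration.

```python
def max_query_rate(num_ldns, ttl_seconds):
    """Upper bound on queries per second for one host name."""
    return num_ldns / ttl_seconds


for ttl in (3600, 300, 20):
    rate = max_query_rate(100_000, ttl)
    print(f"TTL {ttl:>4} s -> at most {rate:,.0f} queries/s")
```

Under these assumptions, lowering the TTL from an hour to 20 s multiplies the worst-case load by a factor of 180.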
Moderate-sized CDNs sometimes disregard these concerns because DNS queries
usually take little processing, with a single server capable of handling several thousand queries per second. With additional consideration that DNS server load is
easily distributed in a server cluster, the centralized DNS resolution can handle
large amounts of load before becoming the bottleneck in practice. Furthermore,
the overhead of nonlocalized DNS processing only becomes noticeable in practice
for delivering small files. For large file downloads, such as software packages or
multimedia files, a few hundred milliseconds of initial delay will be negligible compared to several minutes of the download itself.
Large CDNs, however, deal with extraordinary total loads and provide content
delivery services for all file sizes. Thus, they implement their DNS service as a
distributed system in its own right.
One approach to implement a distributed DNS service again utilizes DNS redirection mechanisms. For example, the Akamai CDN [1] implements a two-level
DNS system. The top-level DNS server is a centralized component and is registered as the authoritative DNS server for the accelerated content. Thus, initial DNS
queries arrive at this server. The top-level DNS server responds to queries with
an NS-type response, redirecting the requester to a nearby low-level DNS server.
Moreover, these redirecting responses are given a long TTL, in effect pinning the
requester to the selected low-level DNS server. The actual name resolution occurs
at the low-level DNS servers. Because most DNS interactions occur between clients
and low-level CDN DNS servers, the DNS load is distributed and the interactions
are localized in the network.
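The effect of combining a long-TTL referral with short-TTL answers can be checked with a small cache simulation. The sketch below models one busy LDNS that re-resolves as soon as a cached record expires; the TTL values are illustrative assumptions, not Akamai's actual settings.

```python
def count_queries(duration_s, ns_ttl_s, a_ttl_s):
    """Count queries one busy LDNS sends to each DNS level over duration_s seconds."""
    top_level = low_level = 0
    ns_expiry = a_expiry = 0          # both caches start empty
    for now in range(duration_s):
        if now >= a_expiry:           # edge-server address is no longer cached
            if now >= ns_expiry:      # referral expired too: ask the top level
                top_level += 1
                ns_expiry = now + ns_ttl_s
            low_level += 1            # actual name resolution at the low level
            a_expiry = now + a_ttl_s
    return top_level, low_level


print(count_queries(3600, ns_ttl_s=3600, a_ttl_s=20))   # long-lived referral
print(count_queries(3600, ns_ttl_s=20, a_ttl_s=20))     # referral expires with answer
```

With the long-lived referral, the top-level server sees a single query per hour from this LDNS, while the nearby low-level server absorbs the rest.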
Another approach uses a flat DNS system and utilizes IP anycast to spread the
load among its servers. A CDN using this approach deploys a number of CDN DNS
servers in different Internet locations but assigns them the same IP address. Then, it
relies on the underlying IP routing infrastructure to deliver clients’ DNS queries destined to this IP address to the closest CDN DNS server. In this way, DNS processing
load is both distributed and localized among the flat collection of DNS servers. The
Limelight CDN [35] utilizes this technique.
Besides improving DNS service scalability, Limelight further leverages the above technique
to sidestep the decision about which of the data centers would be the closest to
the client. In particular, Limelight deploys a DNS server in every data center; then
each given request will be delivered by the anycast mechanism to its closest data
center. The DNS server receiving a request then simply picks one of the edge servers
co-located in the same data center for the subsequent download. This approach,
however, is not without drawbacks. One limitation is that it relies exclusively on the
proximity notion reflected in Internet routing; there are other considerations, such
as network congestion and costs. Another limitation is due to the originator problem
discussed in the next subsection.

7.4.3 Issues
The basic idea behind CDNs might seem simple, but many technical challenges
lurk. An obvious challenge is server selection, which is an open-ended issue. There
are a number of factors that may affect the selection.
A basic factor is proximity: one of the key promises of CDN technology is that
they can deliver content from a nearby network location. But what does “nearby”
mean? To start with, there are a number of proximity metrics one could use, which
differ in how closely they correlate with end-to-end performance and how hard they
are to obtain. Geographical distance, autonomous system hops, and router hops
could be used as relatively static proximity metrics. Static metrics may incorporate domain knowledge, such as maps of private peering points among network
providers, since private peering points can be more reliable than public network access points. Then, one could consider dynamic path characteristics, such as packet
loss, network packet travel delay (one-way or round-trip), and available path bandwidth. Obtaining these dynamic metrics and keeping them fresh is much more
challenging. Further, a CDN may account for economic factors, such as the preference of utilizing certain network carriers even at the expense of a slight performance
degradation.
Once the proximity metrics are figured out, the next question is how to combine them with server load metrics, since in the end we need to pick a certain
edge server for a given request. Server loads are inherently dynamic. They raise
a number of questions of their own, with their own research literature. How long
a history of past data to consider, and which load characteristics to measure? One
can consider a variety of characteristics, including CPU usage, network utilization,
memory, and disk IO. How frequently to collect load measurements, and how frequently to recompute load metrics? How to avoid a “herd effect” [19], where a CDN
sends too much demand to an underloaded server, only to overload it in the next
cycle?
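One simple way to combine a proximity metric with load, and to dampen the herd effect, is sketched below. It is only one possible heuristic, not a policy attributed to any particular CDN: the selector prefers the closest server but counts its own not-yet-reported assignments (the hypothetical pending table) against each server's capacity between load reports.

```python
def pick_server(servers, distance, load, capacity, pending):
    """Pick the closest server whose reported load plus locally assigned
    (not yet reported) requests stays under its capacity."""
    eligible = [s for s in servers if load[s] + pending[s] < capacity[s]]
    if not eligible:                 # everything looks full: fall back to all
        eligible = servers
    choice = min(eligible, key=lambda s: distance[s])
    pending[choice] += 1             # remember the assignment until the next report
    return choice


servers = ["a", "b"]
distance = {"a": 10, "b": 30}        # e.g., milliseconds of round-trip time
load = {"a": 8, "b": 2}              # last reported load
capacity = {"a": 10, "b": 10}
pending = {"a": 0, "b": 0}
print([pick_server(servers, distance, load, capacity, pending) for _ in range(5)])
```

Without the pending counts, all five requests would go to the closer server "a" until its next load report arrives.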
The next set of questions is architectural in nature. As we discussed earlier,
the prevalent mechanism in CDNs for routing requests to a selected edge server
is based on DNS. DNS-based routing raises so-called originator and hidden load
problems [49].
The originator problem is due to the fact that CDN proximity-based server selection can only be done relative to the originator of the DNS query, which is the
client’s DNS server, and not the actual host that will be downloading the content.
Thus, the quality of any proximity-based server selection is limited by how close
the actual client is to the LDNS it is using. While there has been some work on
determining the distance between clients and their LDNSs [42, 57], the end-to-end
effect of this issue on user-perceived performance is not yet fully known.
One way to sidestep the originator problem is to utilize IP anycast for the HTTP
interaction [2]. Similar to anycast-based DNS interactions considered previously,
different edge servers in this case would advertise the same IP address. This address
would be returned to the clients by CDN DNS, and packets from a given browser
machine would be delivered to the closest edge server naturally thanks to IP routing. Anycast was previously considered unsuitable for HTTP downloads for two
reasons. First, unlike DNS that uses the UDP transport protocol by default, HTTP
runs on top of TCP. TCP is a stateful connection-oriented protocol, and if a routing path changes in the middle of the ongoing download, the browser
may attempt to continue the download from a different edge server, leading to a
broken TCP connection. Second, IP anycast selects among end-points for packet delivery without consideration for the routing path quality or end-point load. However,
recent insights into the anycast behavior [7] and network traffic engineering [63]
alleviate these concerns, especially when a CDN is deployed within one autonomous
system. ICDS – a CDN service by AT&T [5] – is currently pursuing a variant of this
approach.
The hidden load problem arises because of the drastically different numbers of clients
behind different LDNS servers. A large ISP likely has thousands of clients sharing
the same LDNS. Then, a single DNS query from this LDNS can result in a large
amount of demand for the selected edge server. At the same time, a single query
from the LDNS of a small academic department will impose much smaller load.
Because a CDN distributes load at the granularity of DNS queries, potentially drastic
and unknown imbalances of load resulting from single queries complicate proper
load balancing.
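A toy calculation makes the imbalance concrete. The sketch assumes a CDN that balances DNS answers perfectly, one answer per LDNS in round-robin order; the LDNS names and client counts are invented for illustration.

```python
def load_per_edge(ldns_clients, edge_servers):
    """Assign LDNS servers to edge servers round-robin (one DNS answer each)
    and return the number of clients that end up behind each edge server."""
    load = {e: 0 for e in edge_servers}
    for i, clients in enumerate(ldns_clients.values()):
        load[edge_servers[i % len(edge_servers)]] += clients
    return load


ldns_clients = {"big-isp": 50_000, "dept-a": 40, "dept-b": 25, "dept-c": 60}
print(load_per_edge(ldns_clients, ["edge1", "edge2"]))
```

Each edge server receives exactly two DNS answers, yet the client populations behind them differ by a factor of about 500.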
Another architectural issue relates to the large number of edge servers a CDN
maintains. When new popular content appears and generates a large number of
requests, these requests will initially miss in the edge server caches and will be forwarded to the origin server. These missed requests may overload the origin server
in the beginning of a flash crowd, until edge servers populate their caches [27].
CDNs often pre-load new content to the edge servers when the content is known
to be popular. However, unpredictable flash crowds present a danger. Consequently,
CDNs sometimes deploy peer-to-peer cooperation among their edge servers, with
edge servers forwarding missed requests to each other rather than directly to the origin server. This gives rise to more complex overlay network topologies than the
one-hop overlay routing in the basic CDN architecture described here. In fact, the
underlying mechanisms can be even more complex: the complex overlay topologies add overhead due to application-level processing at each hop. Thus, one could
try to use simple one-hop topology under normal load and add more complex request routing dynamically once the danger of a “miss storm” is detected. This in
turn opens a range of interesting algorithmic questions involved in deciding when
to start forming a complex topology and how to form it.
This overview is necessarily brief. Its goal is only to convey the fact that content delivery networks represent an important aspect of Internet infrastructure and a
rich environment for research and innovation. We refer the reader to more targeted
literature, such as [24, 49, 64].

7.5 Attack-Resilient Services
We have seen that overlay systems provide resilience by design: the lack of centralized entities naturally provides a measure of resilience against component failures.
Overlay systems can also form the building block for systems that are resilient to
malicious attacks. SOS [28] and a subsequent derivative, Mayday [3], are two
overlay systems that provide denial-of-service protection for Internet services. We
discuss SOS next.

7.5.1 Case Study: Secure Overlay Service (SOS)
Secure Overlay Services (SOS) is an overlay network designed to protect a server
(the target) from distributed denial of service attacks.
SOS enables a “confirmed” user to communicate with the protected service. Conceptually, the service is protected by a “ring” of SOS overlay nodes, which are able
to confirm incoming requests as valid. Once a request is validated, it is forwarded on
to the service. Users, by themselves, are not able to directly communicate with the
service (initially); in fact, the protected server’s address may be hidden or changing.
SOS forms a distributed firewall around the target server. The server advertises
the SOS overlay nodes (called Secure Overlay Access Points [SOAPs]) as its initial
point of contact. Users initiate contact to the server by connecting to one of the SOS
overlay nodes. Malicious users may attack overlay nodes, but by assumption are not
able to bring down the entire overlay.
The server’s ISP filters all packets to the server’s address, except for a chosen few
(who are allowed to traverse this firewall). These privileged nodes are called “secret
servlets”. Secret servlets designate a few SOAP nodes (called Beacons) as the rendezvous point between themselves and incoming connections. Regular SOAP nodes
use an overlay routing protocol (such as Chord) to route authenticated requests to
the Beacons.
Beacons know of and forward requests to the secret servlets. Only secret servlets
are allowed through the ISP firewall around the target, and the servlets finally forward the authenticated request to the protected server.
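The forwarding chain can be summarized in a few lines of code. The sketch is a deliberately simplified model: node names are invented, a single beacon stands in for the overlay routing step (for example, Chord), and authentication is reduced to a boolean flag.

```python
SECRET_SERVLETS = {"servlet-1"}                  # only these pass the ISP filter
BEACON_TO_SERVLET = {"beacon-1": "servlet-1"}    # beacons know the servlets


def isp_filter_allows(sender):
    """The ISP drops all packets to the target except those from secret servlets."""
    return sender in SECRET_SERVLETS


def deliver(request, entry_point, authenticated):
    """Return the list of hops a request takes, or None if it is dropped."""
    if entry_point == "direct":
        # A user sending straight to the target is stopped by the ISP filter.
        return ["target"] if isp_filter_allows(request["sender"]) else None
    if not authenticated:
        return None                              # the SOAP rejects unconfirmed users
    beacon = "beacon-1"                          # stands in for overlay routing
    servlet = BEACON_TO_SERVLET[beacon]
    if not isp_filter_allows(servlet):
        return None
    return [entry_point, beacon, servlet, "target"]


print(deliver({"sender": "user-7"}, "soap-3", authenticated=True))
print(deliver({"sender": "user-7"}, "direct", authenticated=True))
```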

7.6 File-Sharing Peer-to-Peer Networks
Consider the task of distributing a large file (e.g., on the order of hundreds of MB)
to a large number of users. We already discussed one overlay approach – CDNs –
targeting such an application. However, the CDN approach requires the source of
the file to subscribe to CDN services (and pay the resultant service fees). Furthermore, this approach requires a CDN company to be vigilant in provisioning enough
resources to keep up with the potential scale of downloads involved.
Peer-to-peer networks provide an appealing alternative, which organizes users
themselves into an overlay distribution platform. This approach is appealing to content providers because it does not require a CDN subscription. It also scales naturally
with the popularity of a download: the more users are downloading a file, the more
resources take part in the overlay distribution network, adding capacity to the delivery platform. Some peer-to-peer networks also provide administrative resiliency,
as they have no special centralized administrative component. In fact, the utilization of the client upload bandwidth and CPU capacity in content delivery can also
make P2P techniques interesting as an adjunct (rather than an alternative) to a CDN
service.

In this section, we will concentrate on unique challenges that arise when the P2P
system downloads a large (e.g., on the order of 100s of MB) file. In particular, we
will consider the following two challenges:
• Block Distribution. Imagine a flash crowd downloading a 100 MB software
package. A naive approach (pursued by early P2P networks) would let each peer
download the entire file and then make itself available as a source of this file for
other peers. This approach, however, would not be able to sustain a flash crowd.
Indeed, each peer would take a long time – tens of minutes over a typical residential broadband connection – to download this file and in the meantime the initial
file source would have no help in coping with the demand. The solution is to
chop the file into blocks and distribute different blocks to different peers, so that
they can start using each other faster for block distribution. But this creates an interesting challenge. Obviously, the system needs to make a diverse set of blocks
available as quickly as possible, so that each peer has a better chance of finding
another peer from which to obtain missing blocks. But achieving this diversity
is difficult when no peer possesses global knowledge about block distribution at
any point in time.
• Free Riders. A particularly widespread phenomenon is that of selfish peers:
peers that attempt to make use of the peer content delivery without contributing their own resources. These peers are called “free riders”. More generally, a
peer may try to bypass fairness mechanisms in the P2P network and obtain more
than its share of resources, thus getting better service at the expense of other
users.
We will consider these two challenges in the context of the mesh model of content distribution. Using the terminology of BitTorrent – a popular P2P network – the
key components of a mesh P2P network are seeds, trackers, and peers (or leechers).
Originally the file exists at the source server (or servers) called seeds. There is a
special tracker node that keeps track of at least some subset of the peers who are in
the process of downloading the file. A new peer joins the download (a swarm) by
contacting the tracker, obtaining a random subset of existing peers, and establishing P2P connections (i.e., overlay network links) with them. The download makes
collective progress by peers exchanging missing blocks along the overlay edges.
Having completed the download, a peer may stay in the swarm as a seed, uploading
without downloading anything more in return.

7.6.1 Block Distribution Problem
BitTorrent attempts to achieve a uniform distribution of blocks (or “pieces”: a set
of blocks in BitTorrent) among the peers through localized decisions. Neighboring
peers exchange lists of blocks that they already have. A peer determines which of
the blocks it is missing are the rarest in its local neighborhood and requests these
blocks first. Because the neighborhoods in the BitTorrent protocol evolve over time,
the rarest-first block distribution leads to more uniform distribution of blocks in the
network and to better chance of a peer finding a useful block without contacting the
source.
Recently, an ingenious alternative to the BitTorrent protocol has been proposed,
which removes the issue of choosing the blocks completely [21]. This new approach,
called Avalanche, follows the same mesh model with seeds, trackers, and peers,
as BitTorrent. However, Avalanche makes virtually every block useful to any peer
through network coding as follows.
Peers no longer choose a single, original block to download from their neighbors
at a time. Instead, every time a peer uploads a block to a neighbor, it simply computes a linear combination of all the blocks it currently has from a given file using
random coefficients, and uploads the result along with auxiliary information, derived
from the coefficients it used and those previously received with its own downloaded
blocks. Once a peer collects enough encoded blocks (usually the same number as the
number of blocks in the file), it can reconstruct the original file by solving a system
of linear equations. A system implementing these ideas has been publicly available
as Microsoft Secure Content Downloader since 2007, although the original author
of BitTorrent raised questions about the importance of the removal of the block
distribution problem in practice and the possible performance overhead involved
[17]. These concerns have been reflected in recent empirical studies demonstrating
that BitTorrent’s rarest-first piece selection strategy effectively provides block
uniformity [30].
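The encode-and-solve idea can be demonstrated with a small example. The sketch makes a simplifying assumption that differs from Avalanche: coefficients are drawn from GF(2), so a linear combination is just the XOR of a random subset of blocks (Avalanche uses a larger field, which makes useless combinations much rarer). Blocks are modeled as integers, and each packet carries a bitmask recording which original blocks it combines.

```python
import random


def encode(packets, rng):
    """XOR a random non-empty combination of held packets (needs >= 1 packet).

    Each packet is (mask, payload); bit i of mask set means block i is included.
    """
    mask = payload = 0
    while mask == 0:                        # retry if the combination is empty
        mask = payload = 0
        for m, p in packets:
            if rng.random() < 0.5:
                mask ^= m
                payload ^= p
    return mask, payload


def decode(packets, n):
    """Gaussian elimination over GF(2); returns the n blocks, or None if the
    packets received so far do not yet determine the file."""
    pivots = {}                             # pivot bit -> (mask, payload)
    for mask, payload in packets:
        for bit, (pm, pp) in pivots.items():
            if (mask >> bit) & 1:           # clear known pivot bits
                mask ^= pm
                payload ^= pp
        if mask == 0:
            continue                        # linearly dependent: nothing new
        bit = mask.bit_length() - 1
        for b, (pm, pp) in list(pivots.items()):
            if (pm >> bit) & 1:             # keep the rows fully reduced
                pivots[b] = (pm ^ mask, pp ^ payload)
        pivots[bit] = (mask, payload)
    if len(pivots) < n:
        return None
    return [pivots[i][1] for i in range(n)]


rng = random.Random(7)
original = [0x11, 0x22, 0x33, 0x44]
seed_packets = [(1 << i, block) for i, block in enumerate(original)]
received = []
while decode(received, len(original)) is None:
    received.append(encode(seed_packets, rng))
print(len(received), "coded packets were enough to rebuild", len(original), "blocks")
```

A peer that has not finished downloading can call encode on its own received packets in the same way, which is why almost any upload is useful to its neighbors.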

7.6.2 Free Riders Problem: Upload Incentives
To improve its resiliency to free riding, BitTorrent utilizes an incentives mechanism.
The goal of this mechanism is to ensure that peers who contribute more to content
upload receive better download service. Just like its approach to block distribution
problem, BitTorrent implements its incentives mechanism largely through localized
decisions by each peer using a round-based unchoking algorithm to decide how
much to send to its neighbors.
When a peer learns a set of other peers from the file’s tracker (usually around
30–50), the peer starts by establishing connections to these peers, some of which
will agree to send blocks to the peer. At the end of every unchoking round (10 s in
most BitTorrent clients), the peer decides which of the peers it should upload blocks
to in the next round. To this end, the peer considers the throughput of its download
from the peers in the previous round and selects a small number (four in Azureus,
a popular BitTorrent client implementation) of peers to which it will upload blocks
in the next round. Selecting a peer for uploading is called “unchoking” a peer. In
addition to unchoking the top four peers who have given the most in the previous round, a peer also
unchokes another peer at random in each round. This helps the peer to bootstrap
new peers, to discover potentially higher-performing peers, and to ensure that every
peer, even with poor connectivity, makes some progress; without this “optimistic
unchoking,” these impoverished peers would end up choked by everybody. Except
for optimistic unchoking, a peer only uploads to other peers if they have blocks that
it does not. If two peers have blocks that the other lacks, the peers are said to be
interested in one another.
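One unchoking round, as described above, might look as follows. This is a minimal sketch rather than the code of any BitTorrent client: it ignores the "interested" state, and the function name select_unchoked and the rate values are assumptions.

```python
import random


def select_unchoked(download_rate, slots=4, rng=random):
    """Choose peers to upload to in the next round.

    download_rate: {peer: throughput received from that peer in the last round}.
    Returns (regular, optimistic): the top contributors plus one random other peer.
    """
    ranked = sorted(download_rate, key=download_rate.get, reverse=True)
    regular = [p for p in ranked[:slots] if download_rate[p] > 0]
    others = [p for p in ranked if p not in regular]
    optimistic = rng.choice(others) if others else None
    return regular, optimistic


rates = {"a": 50, "b": 10, "c": 70, "d": 0, "e": 30, "f": 20}
regular, optimistic = select_unchoked(rates, rng=random.Random(0))
print("regular unchokes:", regular)
print("optimistic unchoke:", optimistic)
```

Peer "d", which contributed nothing, can only ever be served through the optimistic slot.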
This protocol works because a free rider will end up being choked by most of its
neighbors, only relying on random unchokes to make any progress. However, recent
work [48] has found that the BitTorrent protocol penalizes high-capacity peers: as
the upload performance of a peer increases, its download performance grows but
less than proportionally to the upload contribution. In other words, the protocol is
not entirely tit-for-tat in a usual sense of the word.
Consequently, a new BitTorrent client called BitTyrant has been implemented
that improves the download performance of high-capacity peers [48]. BitTyrant
achieves this goal by exploiting the following observation. Regular BitTorrent peers
allocate their upload capacity equally among their unchoked neighbors. Because of
this, a strategic peer does not need to upload to regular peers at its maximum capacity: it only needs to upload faster than most of its peers’ other neighbors, so that its
peers would keep it unchoked.
Thus, the key idea behind the BitTyrant client is to keep an estimate of the individual upload rates to its neighbors that is sufficient to stay in the neighbors’
unchoked set most of the time, and to upload to each neighbor at just that rate. Then,
BitTyrant uses the spared upload capacity to unchoke more peers and hence to increase its download performance. Furthermore, the BitTyrant client selects only the
peers with the highest return-on-investment: those peers whose data capacity can
be obtained “cheaply.” The authors of BitTyrant observed significant reduction in
file download times by their modified client. However, if all clients adopted selfish
BitTyrant behavior with cut-off of expensive peers as mentioned above, the overall performance for all clients would decrease, especially for low-capacity clients.
Thus, while discouraging free riding, BitTorrent still relies on altruistic contribution
of high-capacity peers to achieve its performance.
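The return-on-investment idea described above can be sketched as follows. This is a simplified illustration and not the BitTyrant implementation: the `Neighbor` fields, the function name `choose_unchoked`, and the greedy budget loop are assumptions made for the example, and the rate estimates are taken as given.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    name: str
    expected_download: float  # rate we expect to receive if unchoked (KB/s)
    required_upload: float    # estimated rate needed to stay unchoked (KB/s, > 0)


def choose_unchoked(neighbors, upload_capacity):
    """Greedily unchoke the neighbors with the best download/upload ratio."""
    ranked = sorted(
        neighbors,
        key=lambda n: n.expected_download / n.required_upload,
        reverse=True,
    )
    chosen, remaining = [], upload_capacity
    for n in ranked:
        # Spend only the rate needed to stay unchoked, not an equal split.
        if n.required_upload <= remaining:
            chosen.append(n.name)
            remaining -= n.required_upload
    return chosen


neighbors = [
    Neighbor("a", expected_download=100.0, required_upload=20.0),  # ratio 5.0
    Neighbor("b", expected_download=60.0, required_upload=30.0),   # ratio 2.0
    Neighbor("c", expected_download=90.0, required_upload=60.0),   # ratio 1.5
]
print(choose_unchoked(neighbors, upload_capacity=50.0))
```

With a budget of 50 KB/s, the sketch keeps the two "cheap" neighbors and drops the expensive one, which is the cut-off behavior that hurts overall performance if every client adopts it.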
Although BitTorrent’s unchoking algorithm of giving to the top-four contributors has been broadly described as being tit-for-tat, recent work has shown that it
is more accurately represented as an auction [32]. Each unchoking round can be
viewed as an auction, where the “bids” are other peers’ uploads in previous rounds,
and the “good” being auctioned is the peer’s upload bandwidth. Viewed this way,
BitTyrant’s strategy of “coming in the last (winning) place” is easily seen as the
clear winning strategy. Reframing BitTorrent as an auction also suggests a defense against
strategic clients like BitTyrant: change the way peers “clear” their auction.
A new client has been introduced that replaces BitTorrent’s top-four strategy
with a proportional share response. Proportional share is a simple strategy: if a peer
has given some fraction, say 10%, of all of the blocks you received in the previous
round, then allot to that peer the same fraction, 10%, of your upload bandwidth. Note
that this does not necessarily result in peers providing the same number of blocks
in return, but rather the same fraction of bandwidth. This results in what turns out to
be a very robust form of fairness: the more a peer gives, the more that peer gets.
Even highly provisioned peers therefore have incentive to contribute as much of
their bandwidth as possible. The authors of this PropShare client have demonstrated
that proportional share is resilient to a wide array of strategic manipulations. Further, PropShare outperforms BitTorrent and BitTyrant, and as more users adopt the
PropShare client, the overall performance of the system improves. This work demonstrates the importance of an accurate model of incentives in a complex system such
as BitTorrent.
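The proportional share rule is simple enough to state in a few lines. The following is a minimal sketch of the allocation step only, assuming per-peer block counts from the previous round are available; the function name and the handling of an empty round are assumptions for illustration and are not taken from the PropShare client.

```python
def proportional_share(blocks_received, upload_bandwidth):
    """Split upload_bandwidth among peers in proportion to what each gave.

    blocks_received maps a peer id to the number of blocks received from
    that peer in the previous round.
    """
    total = sum(blocks_received.values())
    if total == 0:
        # Nobody contributed last round; this sketch allocates nothing.
        return {peer: 0.0 for peer in blocks_received}
    return {
        peer: upload_bandwidth * given / total
        for peer, given in blocks_received.items()
    }


allocation = proportional_share({"p1": 1, "p2": 3, "p3": 6}, upload_bandwidth=100.0)
print(allocation)
```

Here p1 supplied 10% of the blocks received and is allotted 10% of the upload bandwidth. Unlike the top-four rule, there is no threshold to aim just above: every additional block contributed raises the contributor's share.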
A strategic peer can also achieve higher download performance by manipulating the
list of blocks it announces to its neighbors [32]. Suppose node p in a BitTorrent
swarm possesses some rare blocks. Since p has rare blocks, it will be interesting to many of its neighbors, who will all want to upload blocks to p in exchange
for these rare blocks. However, once p announces these rare blocks, p’s neighbors
will download them from p and exchange them amongst themselves. Node
p can sustain interest amongst its neighbors longer by under-reporting its block
map, in particular, by strategically revealing the rare blocks one by one. This strategy ensures that p remains interesting for longer, since p’s neighbors, who all get the
same rare block from p, cannot benefit by exchanging amongst themselves.
This observation suggests a general under-reporting strategy. A node can remain
interesting to its neighbors longest by announcing only the blocks necessary to
maintain interest but no more. Similar to an all-BitTyrant strategy, when all peers
strategically under-report their blocks in this manner [32], the overall performance
of the system degrades.
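The under-reporting idea can be illustrated with sets of block indices. This is a toy sketch under simplifying assumptions (the peer knows exactly which blocks each neighbor holds, and the function names are hypothetical); it is not the strategy implementation evaluated in [32].

```python
def is_interesting(announced, neighbor_has):
    """A peer is interesting to a neighbor if it announces a block the neighbor lacks."""
    return bool(announced - neighbor_has)


def blocks_to_announce(have, already_announced, neighbors_have):
    """Reveal just enough additional blocks to keep every neighbor interested.

    have: set of blocks the peer really holds.
    already_announced: set of blocks announced so far.
    neighbors_have: list of sets, one per neighbor.
    """
    announced = set(already_announced)
    for neighbor in neighbors_have:
        if not is_interesting(announced, neighbor):
            # Reveal one hidden block that this neighbor still lacks, if any.
            hidden = sorted(have - announced - neighbor)
            if hidden:
                announced.add(hidden[0])
    return announced


have = {1, 2, 3, 4}
announced = blocks_to_announce(have, {1}, [{1}, {1, 2}, set()])
print(sorted(announced))
```

In this example block 4 stays hidden because every neighbor is already interested after blocks 2 and 3 are revealed.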
In general, BitTorrent’s incentive mechanisms have come under intense scrutiny.
Through rich empirical studies and analyses that incorporate various economic principles, BitTorrent continues to grow more robust to cheating clients. Whether a
system as complex as BitTorrent can be made fully robust to such users remains
open.

7.7 Conclusion
This chapter has considered ways in which overlay-based techniques improve application resiliency. We have described how applications can utilize overlay networks to
better cope with challenges such as flash crowds, the need to scale to often unpredictable loads, network failures and congestion, and denial of service attacks. We
have considered a representative sample of these applications, focusing on their use
of overlay network concepts. This sample included distributed hash tables, network
storage, large file distribution by peer-to-peer networks, streaming content delivery,
content delivery networks, and web services. It is simply not feasible to comprehensively cover overlay applications and research within one chapter. Instead, we hope
that this chapter conveys sufficient information to give the reader a sampling of the
various application domains where overlays are useful, and a sense for the flexibility
that overlay networks provide to an application designer.


Acknowledgments The authors thank Katrina LaCurts, Dave Levin, and Adam Bender for their
comments on this chapter. The authors are grateful to the editors, Chuck Kalmanek and Richard
Yang, for their comments and encouragement.

References
1. Akamai Technologies. Retrieved from http://www.akamai.com/html/technology/index.html
2. Alzoubi, H. A., Lee, S., Rabinovich, M., Spatscheck, O., & Van der Merwe, J. (2008).
Anycast cdns revisited. In Proceedings of WWW ’08 (pp. 277–286). New York, NY: ACM.
DOI http://doi.acm.org/10.1145/1367497.1367536
3. Andersen, D. G. (2003). Mayday: Distributed filtering for Internet services. In USITS.
4. Andersen, D. G., Balakrishnan, H., Kaashoek, M. F., & Morris, R. (2001). Resilient overlay
networks. In Proceedings of 18th ACM SOSP, Banff, Canada.
5. ATT ICDS. Retrieved from http://www.business.att.com/service_fam_overview.jsp?serv_fam=eb_intelligent_content_distribution
6. Balakrishnan, H., Lakshminarayanan, K., Ratnasamy, S., Shenker, S., Stoica, I., & Walfish, M.
(2004). A layered naming architecture for the Internet. In Proceedings of the ACM SIGCOMM,
Portland, OR.
7. Ballani, H., Francis, P., & Ratnasamy, S. (2006). A measurement-based deployment proposal
for IP anycast. In Proceedings of the ACM IMC, Rio de Janeiro, Brazil.
8. Banerjee, S., Bhattacharjee, B., & Kommareddy, C. (2002). Scalable application layer multicast. In Proceedings of ACM SIGCOMM, Pittsburgh, PA.
9. Banerjee, S., Lee, S., Bhattacharjee, B., & Srinivasan, A. (2003). Resilient multicast using
overlays. In Proceedings of the Sigmetrics 2003, Karlsruhe, Germany.
10. Banerjee, S., Lee, S., Bhattacharjee, B., & Srinivasan, A. (2006). Resilient multicast using overlays. IEEE/ACM Transactions on Networking, 14(2), 237–248.
11. Bender, A., Sherwood, R., Monner, D., Goergen, N., Spring, N., & Bhattacharjee, B. (2009).
Fighting spam with the NeighborhoodWatch DHT. In INFOCOM.
12. Castro, M., Druschel, P., Ganesh, A. J., Rowstron, A. I. T., & Wallach, D. S. (2002). Secure
routing for structured peer-to-peer overlay networks. In OSDI.
13. Castro, M., Druschel, P., Kermarrec, A., Nandi, A., Rowstron, A., & Singh, A. (2003). SplitStream: High-bandwidth multicast in a cooperative environment. In Proceedings of the 19th
ACM Symposium on Operating Systems Principles (SOSP 2003), Bolton Landing, NY.
14. Castro, M., Druschel, P., Kermarrec, A. M., & Rowstron, A. (2002). Scribe: A large-scale and
decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in
Communication, 20(8), 1489–1499. DOI 10.1109/JSAC.2002.803069
15. Chu, Y., Ganjam, A., Ng, T., Rao, S., Sripanidkulchai, K., Zhan, J., & Zhang, H. (2004). Early
experience with an Internet broadcast system based on overlay multicast. In Proceedings of
USENIX Annual Technical Conference, Boston, MA.
16. Chun, B., Culler, D., Roscoe, T., Bavier, A., Peterson, L., Wawrzoniak, M., & Bowman, M.
(2003). Planetlab: An overlay testbed for broad-coverage services. SIGCOMM Computer Communication Review, 33(3), 3–12.
17. Cohen, B. Avalanche. Retrieved from http://bramcohen.livejournal.com/20140.html
18. Dabek, F., Kaashoek, M. F., Karger, D. R., Morris, R., & Stoica, I. (2001). Wide-area cooperative storage with cfs. In SOSP (pp. 202–215).
19. Dahlin, M. (2000). Interpreting stale load information. IEEE Transactions on Parallel and
Distributed Systems, 11(10), 1033–1047.
20. Fiat, A., Saia, J., & Young, M. (2005). Making chord robust to Byzantine attacks. In ESA.
21. Gkantsidis, C., & Rodriguez, P. (2005). Network coding for large scale content distribution. In
INFOCOM (pp. 2235–2245).
22. Gopalakrishnan, V., Bhattacharjee, B., Ramakrishnan, K. K., Jana, R., & Srivastava, D. (2009).
Cpm: Adaptive video-on-demand with cooperative peer assists and multicast. In Proceedings
of INFOCOM, Rio De Janeiro, Brazil.
23. Gupta, I., Birman, K. P., Linga, P., Demers, A. J., & van Renesse, R. (2003). Kelips: Building
an efficient and stable p2p dht through increased memory and background overhead. In IPTPS
(pp. 160–169).
24. Hofmann, M., & Beaumont, L. R. (2005). Content networking: Architecture, protocols, and
practice. San Francisco, CA: Morgan Kaufmann.
25. Iyer, S., Rowstron, A. I. T., & Druschel, P. (2002). Squirrel: A decentralized peer-to-peer web
cache. In PODC (pp. 213–222).
26. Jannotti, J., Gifford, D., Johnson, K. L., Kaashoek, M. F., & O’Toole, J. W., Jr. (2000). Overcast:
reliable multicasting with an overlay network. In Proceedings of the Fourth Symposium on
Operating System Design and Implementation (OSDI), San Diego, CA.
27. Jung, J., Krishnamurthy, B., & Rabinovich, M. (2002). Flash crowds and denial of service
attacks: Characterization and implications for cdns and web sites. In WWW (pp. 293–304).
28. Keromytis, A. D., Misra, V., & Rubenstein, D. (2002). SOS: Secure overlay services. In
SIGCOMM.
29. Kostic, D., Rodriguez, A., Albrecht, J., & Vahdat, A. (2003). Bullet: High bandwidth data
dissemination using an overlay mesh. In Proceedings of SOSP (pp. 282-297), Lake George, NY.
30. Legout, A., Urvoy-Keller, G., & Michiardi, P. (2006). Rarest first and choke algorithms are
enough. In IMC.
31. Lemos, R. Blue security folds under spammer’s wrath. Retrieved from http://www.securityfocus.com/news/11392
32. Levin, D., LaCurts, K., Spring, N., & Bhattacharjee, B. (2008). Bittorrent is an auction: Analyzing and improving bittorrent’s incentives. In SIGCOMM (pp. 243–254).
33. Li, B., Xie, S., Qu, Y., Keung, G., Lin, C., Liu, J., & Zhang, X. (2008). Inside the new coolstreaming: Principles, measurements and performance implications. In Proceedings of the
INFOCOM 2008, Phoenix, AZ (pp. 1031–1039).
34. Li, B., Yik, K., Xie, S., Liu, J., Stoica, I., Zhang, H., & Zhang, X. (2007). Empirical study of the
coolstreaming system. IEEE Journal on Selected Areas in Communications
(Special Issue on Advances in Peer-to-Peer Streaming Systems), 25(9), 1627–1639.
35. Limelight Networks. Retrieved from http://www.limelightnetworks.com/network.htm
36. Linga, P., Gupta, I., & Birman, K. (2003). A churn-resistant peer-to-peer web caching system.
In 2003 ACM Workshop on Survivable and Self-Regenerative Systems (pp. 1–10).
37. Locher, T., Meier, R., Schmid, S., & Wattenhofer, R. (2007). Push-to-pull peer-to-peer live
streaming. In Proceedings of the International Symposium of Distributed Computing, Lemesos,
Cyprus.
38. Lumezanu, C., Baden, R., Levin, D., Spring, N., & Bhattacharjee, B. (2009). Symbiotic relationships in internet routing overlays. In Proceedings of NSDI, Boston, MA.
39. Magharei, N., & Rejaie, R. (2007). PRIME: Peer-to-peer receiver-drIven MEsh-based streaming. In Proceedings of the INFOCOM 2007, Anchorage, Alaska (pp. 1424–1432).
40. Magharei, N., Rejaie, R., & Guo, Y. (2007). Mesh or multiple-tree: A comparative study of live
p2p streaming approaches. In Proceedings of the INFOCOM 2007, Anchorage, Alaska.
41. Malkhi, D., Naor, M., & Ratajczak, D. (2002). Viceroy: A scalable and dynamic emulation of
the butterfly. In PODC (pp. 183–192).
42. Mao, Z. M., Cranor, C. D., Douglis, F., Rabinovich, M., Spatscheck, O., & Wang, J. (2002).
A precise and efficient evaluation of the proximity between web clients and their local dns
servers. In USENIX Annual Technical Conference (pp. 229–242).
43. Morselli, R., Bhattacharjee, B., Marsh, M. A., & Srinivasan, A. (2007). Efficient Lookup on
Unstructured Topologies. IEEE Journal on Selected Areas in Communications, 25(1), 62–72.
44. Nakao, A., Peterson, L., & Bavier, A. (2006). Scalable routing overlay networks. SIGOPS
Operating Systems Review, 40(1), 49–61.
45. Padmanabhan, V., Wang, H., Chou, P., & Sripanidkulchai, K. (2002). Distributing streaming
media content using cooperative networking. In NOSSDAV, Miami Beach, FL, USA.
46. Pai, V., Kumar, K., Tamilmani, K., Sambamurthy, V., & Mohr, A. (2005). Chainsaw:
Eliminating trees from overlay multicast. In IPTPS 2005, Ithaca, NY, USA.
47. Pianese, F., Perino, D., Keller, J., & Biersack, E. (2007). PULSE: An adaptive, incentive-based,
unstructured p2p live streaming system. IEEE Trans. on Multimedia 9(8), 1645–1660.
48. Piatek, M., Isdal, T., Anderson, T. E., Krishnamurthy, A., & Venkataramani, A. (2007). Do
incentives build robustness in bittorrent? (awarded best student paper). In NSDI.
49. Rabinovich, M., & Spatscheck, O. (2001). Web caching and replication. Boston, MA:
Addison-Wesley Longman Publishing Co., Inc.
50. Ramasubramanian, V., & Sirer, E. G. (2004). Beehive: O(1) lookup performance for power-law
query distributions in peer-to-peer overlays. In NSDI (pp. 99–112).
51. Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content-addressable network. In SIGCOMM.
52. Rhea, S., Geels, D., Roscoe, T., & Kubiatowicz, J. (2004). Handling churn in a dht. In USENIX
Annual Technical Conference.
53. Rhea, S. C., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., Stoica, I., &
Yu, H. (2005). Opendht: A public dht service and its uses. In SIGCOMM (pp. 73–84).
54. Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, distributed object location and routing
for large-scale peer-to-peer systems. In IFIP/ACM Middleware 2001, Heidelberg, Germany.
55. Savage, S., Anderson, T., Aggarwal, A., Becker, D., Cardwell, N., Collins, A., Hoffman, E.,
Snell, J., Vahdat, A., Voelker, G., & Zahorjan, J. (1999). Detour: A case for informed internet
routing and transport. IEEE Micro, 19(1), 50–59.
56. Savage, S., Collins, A., Hoffman, E., Snell, J., & Anderson, T. (1999). The end-to-end effects
of Internet path selection. In SIGCOMM.
57. Shaikh, A., Tewari, R., & Agrawal, M. (2001). On the effectiveness of DNS-based server selection. In Proceedings of IEEE Infocom, Anchorage, Alaska.
58. Stoica, I., Adkins, D., Zhuang, S., Shenker, S., & Surana, S. (2002). Internet indirection infrastructure. In SIGCOMM (pp. 73–86).
59. Stoica, I., Morris, R., Karger, D. R., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A
scalable peer-to-peer lookup service for internet applications. In SIGCOMM (pp. 149–160).
60. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D. R., Kaashoek, M. F., Dabek, F., &
Balakrishnan, H. (2003). Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, 11(1), 17–32.
61. Tran, D., Hua, K., & Do, T. (2003). ZIGZAG: An efficient peer-to-peer scheme for media
streaming. In Proceedings of the INFOCOM 2003, San Francisco, CA.
62. Venkataraman, V., Francis, P., & Calandrino, J. (2006). Chunkyspread: Multi-tree unstructured
peer-to-peer multicast. In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS ’06), Santa Barbara, CA.
63. Verkaik, P., Pei, D., Scholl, T., Shaikh, A., Snoeren, A., & Van der Merwe, J. (2007). Wresting
control from BGP: Scalable fine-grained route control. In 2007 USENIX Annual Technical
Conference.
64. Verma, D. C. (2001). Content distribution networks: An engineering approach. New York:
Wiley.
65. Wang, F., Xiong, Y., & Liu, J. (2007). mTreebone: A hybrid tree/mesh overlay for application-layer live video multicast. In Proceedings of the ICDCS 2007, Toronto, Canada.
66. Wang, P., Hopper, N., Osipkov, I., & Kim, Y. (2006). Myrmic: Secure and robust DHT routing.
Technical Report, University of Minnesota.
67. Yang, M., & Fei, Z. (2004). A proactive approach to reconstructing overlay multicast trees. In
Proceedings of the IEEE Infocom 2004, Hong Kong.
68. Zhang, M., Luo, J., Zhao, L., & Yang, S. (2005). A peer-to-peer network for live media streaming – Using a push-pull approach. In Proceedings of the ACM Multimedia, Singapore.
69. Zhang, X., Liu, J., Li, B., & Yum, T. (2005). Donet: A data-driven overlay network for efficient
live media streaming. In Proceedings of the INFOCOM 2005. Miami, FL.

Part IV

Configuration Management

Chapter 8

Network Configuration Management
Brian D. Freeman

8.1 Introduction
This chapter will discuss network configuration management by presenting a
high-level view of the software systems that are involved in managing a large
network of routers in support of carrier class services. It is meant to be an overview,
highlighting the major areas that a network operator should assess while designing or buying a configuration management system, and not the final source of all
information needed to build such a system.
When a service and its network are small, network configuration management is
typically done manually by a knowledgeable technician with some form of workflow to get the data needed to perform their configuration tasks from the sales group.
Inventory tracking may be handled by simply inserting comments into the interface
description fields on the router and perhaps by maintaining some spreadsheets on a
file server. The technician might or might not use an element management system
(EMS) to do the configuration changes. If the network is new, for example, supporting the needs of a small company or the network needs of an “Internet startup,” most
of the configuration tasks represent a “new order.” Configuration requests occur at
low volume and the technician probably has a great deal of flexibility in how he or
she goes about meeting the needs of the new network service.
As the number of users of the service grows, the expectations placed on the
network operator to meet a certain level of reliability and performance grow accordingly. In time, because of growth in the sheer volume of orders, the single
knowledgeable worker becomes a department, and “change orders” that modify the
configuration associated with an existing customer of the network start becoming a
larger and larger share of the effort. At this point, the network may contain multiple types of routers purchased from different vendors, each of which has different
features and resource limits. Changes made to a router configuration to support
one customer can now affect another customer. For example, if one customer’s
configuration change causes a router resource such as table size to be exceeded,
multiple customers might be affected. In addition, other departments or areas within
the business now need data on the installed inventory to drive customer reporting,
usage-based billing or ticketing, etc. Finally, as the volume grows, there is a need
for automation or “flow through provisioning” to both reduce cost/time and protect
against mistakes. The simple, manual approaches no longer work: an end-to-end
view is needed for network configuration management so that all the pieces required
to support the business can be integrated.

B.D. Freeman
AT&T Labs, Middletown, NJ, USA
e-mail: bdfreeman@att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_8,
© Springer-Verlag London Limited 2010

This chapter provides an overview of the elements of a robust network configuration management system. There are many goals for such a system, but the primary
goal of any network configuration management system is to protect the network
while providing the ordered service for the customers. Since changing the network
configuration can cause outages if not done correctly, a key requirement of a network configuration management system is to ensure that the configuration changes
do not destabilize the network. The system must provide the ordered service for
the customer without affecting other customers, other ports associated with the customer being provisioned or the network at large.
The network configuration management system is also typically the primary
source of data – the source of truth – used by many business systems and processes
that surround the network. The functions that depend on configuration data range from the
mundane, such as trouble ticketing and spare part tracking, to more sophisticated capabilities such as traffic reporting, for which the association of ports to customers must
be obtained so that traffic reports can be properly displayed on the customer service
portal.
Finally, the network configuration management system is the enforcer of the engineering rules that specify the maximum safe resources to be consumed on the
routers for various features. As such, in addition to protecting the network, the
system also impacts profitability, since inventory is either used efficiently or inefficiently. This depends on how good the configuration management system is at
implementing the engineering rules as well as how good it is at processing service
cancellation or disconnect requests in a timely fashion. If the configuration management system does not properly return a port that is no longer in service to the
inventory available for new requests, expensive router hardware can be stranded
indefinitely.
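The role of engineering rules can be sketched as a capacity check performed before any assignment, paired with a release step on disconnect. This is a minimal illustration only: the class name, the resource names, and the limits are hypothetical and do not come from any particular vendor or system.

```python
class RouterResources:
    """Tracks usage of router resources against engineering limits."""

    def __init__(self, limits):
        self.limits = dict(limits)               # e.g. {"vrf_tables": 2}
        self.used = {name: 0 for name in limits}

    def can_assign(self, demand):
        """True if adding the demand keeps every resource within its limit."""
        return all(
            self.used[name] + amount <= self.limits[name]
            for name, amount in demand.items()
        )

    def assign(self, demand):
        if not self.can_assign(demand):
            raise ValueError("engineering rule would be violated")
        for name, amount in demand.items():
            self.used[name] += amount

    def release(self, demand):
        """Return resources to the available pool on disconnect."""
        for name, amount in demand.items():
            self.used[name] = max(0, self.used[name] - amount)


router = RouterResources({"vrf_tables": 2, "qos_connections": 3})
order = {"vrf_tables": 1, "qos_connections": 2}
router.assign(order)
second_order_fits = router.can_assign(order)   # QoS limit would be exceeded
router.release(order)                          # customer disconnects
fits_after_release = router.can_assign(order)
print(second_order_fits, fits_after_release)
```

If the release step is skipped when a customer disconnects, the second order keeps being rejected even though the hardware is idle, which is the stranded-inventory problem described above.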
In summary, the primary goal of a network configuration management system is
to manage router configurations to support customer service, subject to three key
secondary goals:
• Protect the network
• Be the source of truth about the network
• Enforce the business and engineering rules

To explore this topic further, Section 8.2 first reviews some key concepts that help structure the types of data items the system must deal with. Section 8.3
describes the subcomponents of the system and the unique requirements of each
subcomponent. This section also discusses the two approaches that are commonly
used for router configuration, the policy-based and template-based approaches, since
this is a key aspect of the problem to be solved. Section 8.3 also touches on the
differences between provider-edge (PE) and customer-edge (CE) router configuration tasks and the differences between consumer and enterprise IP router services in
their typical approaches to configuration management. We present a brief overview
of provisioning audits, which are discussed in more detail in Chapter 9. Provisioning
audits are important to ensure that the network configuration management system
remains a reliable source of truth for the other systems and business processes that
need data about the network. Finally, one of the key challenges in a large network is
handling changes, ranging from an isolated change to a setting on an individual customer’s interface, to more complex changes such as bulk changes to a large number
of routers and interfaces. To illustrate these issues, Section 8.4 discusses the data
model and process issues associated with moving a working connection from one
configuration to the next. This section also touches on some typical network maintenance activities that affect the system differently from customer provisioning. Section 8.5 shows a complete step-by-step example of provisioning a port
order.

8.2 Key Concepts
There are two important types of data that a network configuration management
system must handle: physical inventory data and logical inventory data. In addition
to these data types, the system has to be designed to appropriately handle and resolve
data discords between the state of the network (“What it is”) and the view of the
network that is contained in the network inventory database (“What it should be”).
This section introduces these concepts.

8.2.1 Physical Inventory
The physical inventory database, as the name implies, describes the network hardware that is deployed in the field. The basic unit is usually a chassis with a set of
components, including common elements like route processor cards or power supplies, and line cards with transport interfaces that support one or more customer
“ports.” These ports are what carry the customer-facing and backbone-facing traffic.
Line cards that support multiple customer ports are often referred to as channelized
interfaces (e.g. channelized T3 cards or channelized OC48 cards). The physical inventory database tracks whether each subchannel on these line cards is
assigned to a customer, recording a state of “assigned” or “unassigned” for each channel.
The data model for physical inventory often reflects the physical world in which
cards are contained in a chassis and a chassis is contained in a cabinet. Each customer port is associated with a subchannel on a physical interface.
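The containment hierarchy described above maps naturally onto nested record types. The following is a minimal sketch with hypothetical class and field names; a production schema would carry many more attributes (serial numbers, firmware versions, and so on). The example card has 28 subchannels, as a channelized T3 carries 28 T1 channels.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Subchannel:
    index: int
    customer: Optional[str] = None  # None means "unassigned"

    @property
    def state(self) -> str:
        return "assigned" if self.customer else "unassigned"


@dataclass
class LineCard:
    slot: int
    model: str
    subchannels: List[Subchannel] = field(default_factory=list)


@dataclass
class Chassis:
    hostname: str
    cards: Dict[int, LineCard] = field(default_factory=dict)  # keyed by slot


@dataclass
class Cabinet:
    name: str
    chassis: List[Chassis] = field(default_factory=list)


def assign_first_free(card: LineCard, customer: str) -> Subchannel:
    """Assign the first unassigned subchannel on the card to the customer."""
    for sub in card.subchannels:
        if sub.state == "unassigned":
            sub.customer = customer
            return sub
    raise LookupError("no unassigned subchannel on this card")


card = LineCard(slot=3, model="channelized-T3",
                subchannels=[Subchannel(i) for i in range(28)])
router = Chassis(hostname="router-1", cards={card.slot: card})
cabinet = Cabinet(name="cabinet-A", chassis=[router])
port = assign_first_free(card, "customer-42")
print(port.index, port.state)
```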


8.2.2 Logical Inventory
The logical inventory database includes the inventory data that are not physical.
This is a broad and less rigid category of information, since it includes multiple
database entities with ephemeral ties to the physical inventory. An IP address is a
good example of a database entity with an ephemeral tie. IP addresses exist on an
interface, but we can move addresses to ports on another router; hence, an address
is not permanently tied to a single piece of physical equipment. Many logical components are inventoried as database entities and assigned as needed by the carrier.
IP addresses, VLAN tags, BGP community strings [1], and Autonomous System
Numbers (ASNs) [2] are all examples of logical data that need to be tracked and
managed. Generally, logical inventory assigned to a customer is associated with a
particular piece of physical inventory. However, the association can change over
time. A good example of a change in the association between physical and logical
inventory occurs when a customer’s connection is upgraded from a T1 to a T3. The
physical inventory will change drastically but the logical inventory in terms of the IP
address, BGP routing, and QoS settings may not change. It is also useful to understand that some logical inventory, such as an IP address, is associated with a single piece of equipment,
while other logical inventory, such as MPLS Route Distinguishers and Route Targets, is “network wide” and is associated with
multiple pieces of equipment.
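The T1-to-T3 upgrade can be sketched as re-pointing a logical service record at a different physical port while its logical attributes stay the same. The record types, interface names, and the address (taken from the 192.0.2.0/24 documentation range) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhysicalPort:
    router: str
    interface: str
    speed: str


@dataclass
class LogicalService:
    customer: str
    ip_address: str
    routing: str
    qos_profile: str
    port: Optional[PhysicalPort] = None  # the association may change over time


def move_service(service: LogicalService, new_port: PhysicalPort) -> Optional[PhysicalPort]:
    """Re-home the logical service; return the old port so it can be freed."""
    old_port = service.port
    service.port = new_port
    return old_port


t1 = PhysicalPort("router-1", "Serial1/0:1", "T1")
t3 = PhysicalPort("router-1", "Serial2/0", "T3")
svc = LogicalService("customer-42", "192.0.2.1", "BGP", "gold", port=t1)
freed = move_service(svc, t3)
print(svc.port.speed, svc.ip_address, freed.speed)
```

Returning the old port makes it explicit that the caller is responsible for putting the freed T1 back into available inventory.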

8.2.3 Discords: What It Is Versus What It Should Be?
Data discords are a fact of life in production systems. Through a variety of means,
the data in the network and the data in the inventory system get out of sync. In
plain language, a situation arises in which the inventory view of the world, “what
it should be,” does not match the truth, that is, the network view of the world, “what
it is.”
Both physical and logical inventory can contain discords. Generally, the physical inventory discords occur because of card replacements and initial installation
errors that occur without a corresponding update of the database. For example, a
discord would occur if a 4-port Ethernet card was replaced with an 8-port Ethernet card, but the database was not updated. Autodiscovery of hardware components
can greatly assist in reducing the data discords in the physical inventory. Many production systems back up the router configuration daily and use vendor-provided commands
to collect detailed firmware and hardware data from the equipment. For example, the
“show diag” command on Cisco routers dumps this kind of detailed information, and the output can
be saved to a file. Very accurate physical inventory information can be obtained by
parsing the output of such commands or by issuing
various SNMP MIB queries. Automatic discovery of physical inventory can reduce the physical discords to zero. Many spare part
tracking processes are dependent on the ability to automatically discover changes
in serial numbers on components so that failure rates on cards can be tracked and
replacement parts restocked as needed. Maintaining control on “What it is” is part
of the physical inventory audit process.
Logical inventory discords also happen frequently but are harder to resolve. As
an example, if a customer port that is running in the network has static routing and
the inventory database indicates that it should be BGP routing, which is correct?
Another example of logical inventory discord is the mismatch between the service
that the customer currently has and the ordered service. In general, it is easier to
detect logical inventory discords than to resolve them. Given their impact on the
external support processes and billing, detecting, reporting, and correcting these
situations is important.
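Detecting discords amounts to comparing two views keyed by the same identifiers. The sketch below assumes both views have already been normalized into dictionaries keyed by port name; the attribute names and values are hypothetical. It only reports mismatches, since deciding which side is correct is the harder problem noted above.

```python
def find_discords(should_be, what_is):
    """Compare inventory records ("what it should be") with discovered
    records ("what it is"), both keyed by port name.

    Returns a list of (port, attribute, inventory_value, network_value).
    """
    discords = []
    for port in sorted(set(should_be) | set(what_is)):
        expected = should_be.get(port, {})
        actual = what_is.get(port, {})
        for attr in sorted(set(expected) | set(actual)):
            if expected.get(attr) != actual.get(attr):
                discords.append((port, attr, expected.get(attr), actual.get(attr)))
    return discords


inventory = {
    "port-1": {"routing": "bgp", "card": "4-port-ethernet"},
    "port-2": {"routing": "static", "card": "4-port-ethernet"},
}
network = {
    "port-1": {"routing": "static", "card": "8-port-ethernet"},
    "port-2": {"routing": "static", "card": "4-port-ethernet"},
}
discords = find_discords(inventory, network)
for discord in discords:
    print(discord)
```

The example reproduces both cases from the text: a physical discord (a card replaced without a database update) and a logical one (static versus BGP routing).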
Another key concept that the industry uses is that “the network is the database.”
This concept results from a desire by network operators to use the network configuration as ground truth to drive processes. Most equipment has some mechanism
for querying for configuration data. However, practical matters require externally
accessible views of those data. Fault management, for example, cannot query the
network in real time on every SNMP trap that gets generated (this can be thousands
per second); so a copy of the configuration data has to reside in a database and consequently a process/program to audit and synchronize that data with the network has
to be part of the overall network configuration management system.
With these key concepts in mind, we will discuss the elements of a network
configuration management system.

8.3 Elements of a Network Configuration Management System
Figure 8.1 provides a high-level view of the elements that make up a Network
Configuration Management System. The external interfaces are to technicians and
Operations Support Systems/Business Support Systems (OSS/BSS) at the top and
the Network Elements at the bottom. Each of the major elements inside the system
will be addressed in subsequent sections.

8.3.1 Inventory Database
A database of the physical and logical inventory is the core of the system. This
database will consist of both the real assets purchased and deployed by the corporation (the physical inventory discussed in Section 8.2.1) and the logical assets that
need to be tracked (e.g., WAN and LAN IP address assignments, number of QoS
connections per router, maximum number of assigned Virtual Routing and Forwarding (VRF) tables [3] on
the router, etc.).

Fig. 8.1 High-level view of network configuration management system. Labels in the figure: Technicians, OSS/BSS (Ordering), and OSS/BSS (Maintenance/Inventory) at the top; GUI, API, and Reports and Feeds; Design & Assign; Physical Inventory Management; Logical Inventory Management; Router Audit Mediation Layer; Router Configuration Mediation Layer; Inventory Database; Network Elements at the bottom.

The database entities have parent/child relationships that form a tree as you place
items in the schema. For example, a complex is a site with a set of cabinets. A cabinet within a site may have multiple chassis or routers. A router has multiple cards,
each in a slot on the chassis. A card can have multiple ports. When viewed graphically, this parent/child relationship is a tree with the single item complex at the top
and the ports at the “leaves” of the tree. A robust inventory database will have a
schema with multiple “regions” of data with linkages between them as needed. One
major ISP has an inventory database with over 1,000 tables to handle the inventory
and the various applications that deal with the inventory.
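The parent/child structure of the physical region can be sketched as a generic tree whose leaves are ports. This is an in-memory illustration with hypothetical names; a real inventory would store these relationships as foreign keys across database tables.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    kind: str   # "complex", "cabinet", "router", "card", or "port"
    name: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child and return it so calls can be chained downward."""
        self.children.append(child)
        return child


def leaf_paths(node: Node, prefix: str = "") -> Iterator[str]:
    """Yield a slash-separated path for every leaf below (and including) node."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    if not node.children:
        yield path
    for child in node.children:
        yield from leaf_paths(child, path)


site = Node("complex", "site-1")
cabinet = site.add(Node("cabinet", "cab-1"))
router = cabinet.add(Node("router", "rtr-1"))
card = router.add(Node("card", "slot-2"))
card.add(Node("port", "port-0"))
card.add(Node("port", "port-1"))
paths = list(leaf_paths(site))
print(paths)
```

Each path corresponds to the complex/cabinet/router/slot/port chain that locates a port in the physical tree.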
The two main regions are the physical equipment tree of data (e.g., complex/cabinet/router/slot/port) and logical inventory tree of data (e.g., customer,
premise, service, and connection). The service database entity (one node up from
the connection entity in the tree) typically contains the linkage to other logical
assignments like Serial IP address, VRF labels, Route Distinguishers [3], Route
Targets [3], etc. The reason the data are separated into these regions is to permit
the movement of logical assets to different ports (i.e., connections) and to support
changes in the physical assets associated with a customer as a result of changes in
technology or network-grooming activities. Changes in technology, such as a new
router with lower port costs, and network grooming, moving connections from one
router/circuit to another to improve efficiency, are examples of carrier changes that
may also affect the data model. These carrier decisions are sometimes even more
complex than the customer-initiated changes to deal with correctly in the inventory
database.
Without separation of the regions, the ongoing life-cycle management of the service is difficult. For example, at points in time, we need to have multiple assets
available for testing and move the “active” connection to the new assets only after
satisfactory testing has completed. This means that we maintain multiple “services”
for the same physical port, both the old service and the future service.
The inventory database stores the “What it should be” for the corporation and the
current and future state of the equipment and connections for a customer.

8

Network Configuration Management

261

Many subsystems of a configuration management system are dependent on the
inventory database. One of the major dependencies is the audit subsystem. The audit
subsystem must store information for the physical “What it is” form of the network
in a schema. Typically, since audit or discovery starts with the physical assets, the
physical inventory model at the router/component level is reused for the “What it
is” model. It is interesting to note that cabinet and location of equipment data are
typically not discoverable, so those are usually inferred through naming conventions
like the encoding of the router hostname. For example, a router might have a hostname like “n54ny01zz1” where the “n54” indicates a particular office in New York
City and “ny” is for New York State. The “01” indicates that it is the first router in
the office and the “zz1” would indicate the type or model of the router. The encoding
is not an industry standard, but most carriers use something similar.
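
As a sketch of how location data might be inferred from such a hostname, the following Python snippet decodes the example above. The fixed field widths and the field names (`office`, `state`, `sequence`, `model`) are assumptions made for this illustration; each carrier defines its own encoding.

```python
import re
from typing import Dict, Optional

# Assumed layout, taken from the single example "n54ny01zz1".
HOSTNAME_RE = re.compile(
    r"^(?P<office>[a-z]\d{2})"   # office code, e.g. "n54"
    r"(?P<state>[a-z]{2})"       # state, e.g. "ny"
    r"(?P<sequence>\d{2})"       # router number within the office, e.g. "01"
    r"(?P<model>[a-z]{2}\d)$"    # router type/model code, e.g. "zz1"
)


def parse_hostname(hostname: str) -> Optional[Dict[str, str]]:
    """Return the decoded fields, or None if the name does not fit the layout."""
    match = HOSTNAME_RE.match(hostname.lower())
    return match.groupdict() if match else None


location = parse_hostname("n54ny01zz1")
```

Returning `None` for a name that does not match lets the audit flag the router for manual review instead of guessing its location.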
The logical “What it is” model is also based on the rich “What it should be”
model. Note again that logical discovery does not capture non-network data items such as the customer’s street address or other business information.
A prudent network operation puts processes in place to encode pertinent information
in the interface description line so that linkages to business support systems can be
maintained and audited.
For example, large carriers tend to automatically encode a customer name and
pointers to location records to make it easier to manage events pertinent to the interface in customer care and ticketing systems. The example below shows an active
port in maintenance (MNX) for a customer, ACME TECHNICAL MARKETING, located
in ANYTOWN, NJ, on circuit DHEC.123456..ATI. Various database keys are also
encoded.
interface Serial4/0.11/8:0
description MNX | ACME TECHNICAL MARKETING | ANYTOWN | NJ | DHEC.123456..ATI | 19547 | 3933470 | 4151940 | USA | MIS | |
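
An audit tool could recover these fields from the description line with a simple split. The sketch below assumes a vertical bar (|) as the field separator and names only the first five fields, which the text identifies; the remaining values are kept as an unnamed list because their meaning is not given here. The function and field names are hypothetical.

```python
from typing import Dict, List, Union

# Only these leading fields are identified in the text; the rest are database keys.
NAMED_FIELDS = ["status", "customer", "city", "state", "circuit"]


def parse_description(description: str) -> Dict[str, Union[str, List[str]]]:
    """Split a bar-delimited interface description into named fields."""
    parts = [part.strip() for part in description.split("|")]
    record: Dict[str, Union[str, List[str]]] = dict(zip(NAMED_FIELDS, parts))
    # Keep the trailing non-empty values without guessing what they mean.
    record["other_keys"] = [p for p in parts[len(NAMED_FIELDS):] if p]
    return record


record = parse_description(
    "MNX | ACME TECHNICAL MARKETING | ANYTOWN | NJ | DHEC.123456..ATI | "
    "19547 | 3933470 | 4151940 | USA | MIS | |"
)
```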
The two main inputs to the inventory database are the physical and logical inventory on the router and the customer order data. The physical and logical router
data are typically inserted through the GUI during network setup by the capacity
management organization as assets are installed, tested, and made ready for service.
Another practice in use is to install the equipment and then use the autodiscovery
tools to “learn” the equipment’s physical inventory. Logical assets are entered into
the system as appropriate since they are not necessarily tied to the equipment in all
cases.
The customer order data are created usually through an API from the OSS/BSS
during the ordering phase of a customer’s request for service and updated as the order progresses through the business processes to move from an order to an installed
and tested connection.
A note of caution: the amount of customer order data that are replicated into the
network configuration management system should be minimized. A good design
incorporates just enough to make it easier for people to deal with problems encountered in provisioning and with activities, such as custom features, that the upstream OSS/BSS may not have the
capability to manage. The more customer order data stored
in the network configuration management system, the more the management of
that data alone becomes a problem. Customer contact data are an example of data
that should not be in the network configuration management system, since they are
volatile and in fact may pertain to broader applications than the network service.

8.3.2 Router Configuration Subsystem
The second subsystem we will discuss is the router configuration
subsystem. This subsystem takes the information from the inventory database and creates configuration changes for the installed router. The inventory database typically
provides data needed to drive configuration details like the types and versions of
commands to use for configuration (these can vary by make and model), the IP addresses/hostnames and passwords for access to the routers, and the customer order
data for the specific configuration. The generation of the specific router configuration commands is the more difficult aspect. There are numerous approaches to the
creation of the configuration changes, but the two main ones large carriers use today
are policy management and templates.
8.3.2.1 Policy Management Approach
The policy management approach attempts to break down the router configuration
into a set of conditions and actions (i.e., policies) and generates the combined configuration on the router by evaluating the conditions and actions in a set of policies.
For example, QoS settings fit nicely into the policy management approach, since
the router typically has a configuration statement to define the condition and action
for applying QoS. The configuration statement can be shared by multiple ports and
any interface can be assigned to that policy. A QoS policy that assigns 20%
of the bandwidth to a high-priority class (e.g., voice traffic) and the remainder to a
best-effort class could be reused by many ports on a router. One condition/action
definition (i.e., a policy) reused multiple times is easy to implement and maintain.
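
The reuse of one condition/action definition across many ports can be illustrated with a small Python sketch. The `Policy` class, the port attributes, and the action string are hypothetical; the action is a plain label, not vendor CLI.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Policy:
    name: str
    condition: Callable[[Dict[str, str]], bool]  # does the policy apply to this port?
    action: str                                   # what to attach when it does


# One shared definition: 20% of bandwidth to the high-priority class, rest best effort.
VOICE_QOS = Policy(
    name="qos-voice-20",
    condition=lambda port: port.get("qos") == "voice",
    action="apply qos-voice-20 (20% high priority, remainder best effort)",
)


def evaluate(policies: List[Policy], ports: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Return, per port, the actions of every policy whose condition matches."""
    return {
        port["name"]: [p.action for p in policies if p.condition(port)]
        for port in ports
    }


ports = [
    {"name": "port-1", "qos": "voice"},
    {"name": "port-2", "qos": "voice"},
    {"name": "port-3", "qos": "none"},
]
result = evaluate([VOICE_QOS], ports)
```

Both voice ports receive the same action from the single definition, so a later change to the policy is made in one place.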
Some configurations are more difficult to implement in a policy management system since they do not adhere nicely to a condition/action policy format. An example
of this is IP addressing (or address management), which typically uses fairly complex rules to determine which address to assign to an interface.
Large policy management systems do exist, but the linkage between different
policies can be subject to scaling issues when dealing with the application of a
large number of network and customer policies as in a VPN with a large number
(e.g., thousands) of end points. Configuration auditing (described later) in particular
becomes difficult to manage in a policy management system because the policy view
of the data sometimes is not readily apparent to the knowledgeable network engineer
when looking at the more detailed CLI commands in the backup configuration file
used for audits. Finally, testing of policy-based systems is complicated, since it is
not always clear what the resulting policy-based configuration will be in the CLI.

The number of test cases increases to make sure the policy engine generates all the
configuration change options that the network certification process has confirmed as
working correctly.

8.3.2.2 Template Management Approach
Template management uses a simpler approach. The details of tested sets
of configurations are documented in a template and the data to drive a particular
template is pulled from the inventory database. The benefit of a template approach is
that only the configurations that are known to be valid are put into the network. This
approach is a more reliable method of ensuring that the network is always configured
to operate in a configuration supported by the testing and certification program.
Policy-management systems have a more difficult time ensuring that they are always
configuring the router into a condition that matches the certified configurations.
The challenge is building the template from the set of features ordered by the customer. Generally, the template languages have a nesting structure so that the range
of templates can be kept under control. As the set of templates grows, there is some
complication in applying the correct template, but the resulting router configuration
tends to be cleaner and more optimized (since each template is a test case) than the
policy-based configurations.
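
A minimal sketch of the template idea, using Python's standard `string.Template`, is shown below. The template text and the inventory field names are invented for illustration and are not real vendor CLI; the point is only that a tested template is filled with values pulled from the inventory database.

```python
from string import Template

# Illustrative template text only; the keywords are not real vendor commands.
ACCESS_TEMPLATE = Template(
    "interface $interface\n"
    " description $description\n"
    " ip-address $wan_ip\n"
    " speed-kbps $speed_kbps\n"
)

# Hypothetical row pulled from the inventory database.
inventory_row = {
    "interface": "Serial4/0",
    "description": "ACME TECHNICAL MARKETING",
    "wan_ip": "192.0.2.1/30",
    "speed_kbps": 1536,
}

# substitute() raises KeyError if the inventory row lacks a field the template needs.
config = ACCESS_TEMPLATE.substitute(inventory_row)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a missing inventory value stops generation instead of producing a partially filled configuration.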
Both approaches have merit and a growing set of functions can be handled more
readily with policies; so the likely system for a large carrier is a mixture of these
techniques with templates for the basic configurations like basic IP conditions and
routing and policies for the more advanced functions like QoS configuration on CPE
routers. Large ISPs will have hybrid approaches to provide the best fit tool for each
problem.
An important aspect of the router configuration subsystem is the interaction between the users of current inventory (processes like ticketing and fault management)
and the need to deal with future changes. Growing from a 512 kb/s link to a full T1
or growing from a single T1 to Multilink PPP (MLPPP) [4] are examples with very
different degrees of complexity but both have the need to track both the current connection data and the future connection data. The router configuration system has to
be able to handle modifying the current configuration to move an active connection
to the new connection configuration. To handle failure conditions properly, this subsystem has to deal with roll forward and roll back of the configuration. Sometimes,
the template approach is cleaner, since the “before” configuration can be captured
directly from the router and re-applied even if the original data for it are not readily
available.
There are some key differences in managing provider-edge and customer-edge
configurations that influence the choice of template-based or policy-based configuration management that we discuss here.
Provider-edge (PE) routers tend to have a large number of interfaces (100 or
more) with many interfaces of the same basic type. Generally, the configurations
are relatively simple since the router’s primary role is stability, reliability, and fast
packet forwarding. Since large carrier router configurations tend to be less variable,
we tend to see template-based configuration management systems on the PE. However, since MPLS VPNs have the added complexity of multiple router configurations
being involved to correctly implement the VPN, usage of policy-based configuration
management is growing.
Customer-edge (CE) routers tend to have a much smaller number of interfaces
(fewer than 10) with a wide variation in configurations depending on the business/industry of the customer. For example, the CE router may need advanced
traffic-shaping rules to ensure that performance-sensitive traffic on the customer’s
internal network has priority over access to the Internet proxy/firewall. Other customers
might need to do video streaming for training and thus need QoS settings that give video
priority over other data traffic. Some customers may even be running Internet applications that require prioritization of the HTTP/FTP traffic to/from their router to provide
service to their own customers. The CE router is closer to the customer and thus bears
the burden of handling more customer-specific applications like firewalls, packet
shaping, and complicated internal routing policies. Policy-based router configuration management systems are commonly used on CE routers because they are a better
fit for the disparate customer needs of the edge environment.
Finally, for the network carrier it is important to understand the different challenges that a mass-market consumer broadband internet access service places on
the configuration management system. Mass-market configuration tends to have a
very small set of routing configuration options. The most obvious variable in the
configurations is the access speed. While you might think setting up QoS and ACLs
would tend to increase the configuration options, it really only adds complexity and
not much variation, since the configurations tend to be similar across large sets of
connections. Although the number of different configurations is small, the rate of
change is high. Not only are initial provisioning rates much higher than in the enterprise space, but the volume of change orders is large as well. An enterprise Internet
access service might typically need to process several thousand orders a week with
a similar magnitude of change orders. A mass-market service might need to process thousands of orders per day and tens of thousands of change orders per day.
Mass-market router configuration systems tend toward template-based approaches
because of the simplicity of the configuration, the smaller range of features, and the
performance advantages of the template approach for large-scale processing.
8.3.2.3 Mediation Layer
Most service providers have multiple vendor platforms in their network, but even
a single-vendor network will have multiple models and versions of the router operating system. The router configuration subsystem that writes data to the routers
usually has a mediation layer to deal with the router-specific commands. The mediation layer also exists when reading data for the audit layer to turn the vendor-specific
commands/output into a common syntax for use by the audit application. The
mediation layer will also handle nuances of the security model for accessing the
routers that may vary based on vendor and region of the globe.
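
One way to picture the mediation layer is as a lookup from (vendor, operating system version) to a function that renders a vendor-neutral request into device-specific commands. The sketch below is an assumption-laden illustration: the vendor names, version labels, and the second dialect's command syntax are placeholders, not real products or real CLI.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical dialects; the command strings are placeholders for illustration.
def _dialect_a(interface: str, enabled: bool) -> List[str]:
    return [f"interface {interface}", "no shutdown" if enabled else "shutdown"]


def _dialect_b(interface: str, enabled: bool) -> List[str]:
    state = "enable" if enabled else "disable"
    return [f"set interface {interface} {state}"]


DIALECTS: Dict[Tuple[str, str], Callable[[str, bool], List[str]]] = {
    ("vendor-a", "os-1"): _dialect_a,
    ("vendor-b", "os-7"): _dialect_b,
}


def render_admin_state(vendor: str, os_version: str,
                       interface: str, enabled: bool) -> List[str]:
    """Translate one vendor-neutral request into router-specific commands."""
    try:
        dialect = DIALECTS[(vendor, os_version)]
    except KeyError:
        raise ValueError(f"no mediation rules for {vendor} {os_version}") from None
    return dialect(interface, enabled)


commands_a = render_admin_state("vendor-a", "os-1", "Serial4/0", False)
commands_b = render_admin_state("vendor-b", "os-7", "Serial4/0", False)
```

The callers above the mediation layer issue the same request for both routers; only the lookup table knows the differences.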

8.3.3 GUI/API
The GUI/API subsystem deals with the typical functions of retrieval, display, and
data input for the system. The technology of this subsystem is typical of large-scale
systems. This subsystem uses HTTP Web server technology with an HTML-based
GUI and a SOAP/XML-based API. A critical aspect for a large carrier is that the API
becomes the predominant flow into the system. At scale, the API is used to handle
the large volume order flow from the business support systems (BSS), both to electronically transfer the data and trigger the various automated functions in the router
configuration management system. The GUI is used infrequently for customer provisioning and is used primarily for correcting any fallout that might have occurred.
Having a robust set of APIs is critical to business success. Obviously, the APIs
must also keep pace, as new features are added to the router so that the automated
processes can trigger them. The GUI comes into play for manual interaction and
maintenance activities and various other tasks that are not economical to automate
through APIs. The other important aspect of the GUI is the implementation of a
robust authentication/authorization layer, since some user groups should not have
access to the router configuration change functions to prevent unintended changes
that could cause a service outage.
One aspect of the GUI that is also worth mentioning is read access to the “What
it is” state of the router. Typically, there are sets of read-only CLI commands that the
customer care organization depends on for responding to customer-reported problems. Most router platforms have a limited set of connections, so it is problematic
to give a large customer care team direct access to the router CLI. The solution
large carriers typically use is to put a web-based GUI in place with a limited set
of functions that can be selected by the customer care agents. The GUI then acts
as a proxy through the router configuration subsystem to execute these commands
on the router. These commands include the various “show” commands as well as
options to run limited repair functions like “clear counters” and/or “shutdown”/“no
shutdown” on the interface. Exposing these functions through the GUI reduces the
impact on the router and provides a mechanism for the throttling and audit rules to
be applied to prevent a negative impact. The edit checks that occur before commands
are executed on the router also help prevent unintended effects.

8.3.4 Design and Assign
This subsystem applies the engineering rules to select a port for a customer’s service and can accept or reject a request for service based on available inventory.
The subsystem has an API that takes the service request parameters and other
customer network information and generates an assignment to a particular port on a
router. That assignment is typically called a Tie Down and the data set is Tie Down
Information (TDI). The API can be called either through the GUI or directly by the
BSS. Assignment is nontrivial, since the function must satisfy all the engineering
rules that help protect the network (such as finding a port on a card with
sufficient resources) while also satisfying the business rules, which seek to limit
transport costs and latency by picking a router close to the customer. For example, the engineering rules may limit the number of QoS-configured ports on certain
card types, while under the business rules it would be a poor assignment to pick
a router in California for a customer in New York.
The assignment function calculates both an optimal assignment and the current
assignment. The optimal assignment is the first-choice router location that minimizes backhaul cost (e.g., ideally a customer in Ohio will be homed to a router in
Ohio). However, it could be that the Ohio router complex does not have a router with
sufficient capacity (bandwidth, QoS ports, etc.). The design and assign function must implement the appropriate business rules for this case.
For example, the business rule could be to “home” the customer on an alternate
router in a different location. Alternatively, the business rule could be to reject the
order. Typically, the “reject the order” business rule applies in mass-market situations. Business rules for enterprise markets usually accept higher backhaul
costs rather than reject the order. In the enterprise market, the business rule might
select a router in an alternate location like Indiana if no routers in Ohio had sufficient
resources.
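
The optimal-versus-current logic described above might look like the following sketch. It is deliberately reduced to a single capacity measure (free ports) and two market types; the class names, the `alternates` preference list, and the market labels are assumptions for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    name: str
    location: str
    free_ports: int


@dataclass
class Assignment:
    optimal_location: str
    current_router: Optional[str]   # None means the order was rejected


def assign(customer_location: str, routers: List[Router],
           alternates: List[str], market: str) -> Assignment:
    """Try the optimal location first; fall back only for enterprise orders."""
    def pick(location: str) -> Optional[str]:
        for router in routers:
            if router.location == location and router.free_ports > 0:
                router.free_ports -= 1      # reserve the port
                return router.name
        return None

    chosen = pick(customer_location)
    if chosen is None and market == "enterprise":
        for location in alternates:         # accept longer backhaul
            chosen = pick(location)
            if chosen is not None:
                break
    return Assignment(customer_location, chosen)


routers = [Router("oh-1", "Ohio", 0), Router("in-1", "Indiana", 2)]
enterprise = assign("Ohio", routers, ["Indiana"], "enterprise")
mass_market = assign("Ohio", routers, ["Indiana"], "mass-market")
```

Recording the optimal location alongside the router actually chosen is what later allows the connection to be moved back when capacity appears in the optimal complex.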
The business would like the flexibility to be able to move the port from the
Indiana router to an Ohio router in the future without impacting the customer.
Consequently, the “assign” function will allocate a Serial IP address from a logical inventory pool associated with Ohio’s router complex, assign it to the interface
on the router in Indiana, and “exception route” that address to Indiana. This assignment permits the CE/PE connection to be re-homed from Indiana to Ohio without
affecting the customer’s router configuration, since their WAN IP address would not
change. The exception route for Indiana can then be removed to get to a more
optimal network routing configuration as well as reduced backhaul. The tracking of the optimal and current assignment data adds complexity to
this subsystem, the inventory database, and the router configuration system (for the
exception routes), but it is a good example of the types of business decisions that
can ripple back into the router configuration management system requirements.
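
The address handling in this scenario can be sketched with Python's standard `ipaddress` module. The pool (a documentation prefix), the /30 block size, and the record layout are assumptions for illustration only.

```python
import ipaddress

# Hypothetical pool for the Ohio complex (documentation address space).
ohio_pool = ipaddress.ip_network("192.0.2.0/24")
free_blocks = ohio_pool.subnets(new_prefix=30)    # generator of /30 serial links

serial_block = next(free_blocks)                  # first free /30 in the pool
pe_address, ce_address = list(serial_block.hosts())

assignment = {
    "optimal_complex": "Ohio",
    "current_complex": "Indiana",
    "serial_block": serial_block,
    # The exception route exists only while current differs from optimal.
    "exception_route": (serial_block, "Indiana"),
}


def rehome_to_optimal(record: dict) -> dict:
    """After re-homing, the address is unchanged and the exception route is dropped."""
    updated = dict(record)
    updated["current_complex"] = updated["optimal_complex"]
    updated["exception_route"] = None
    return updated


after = rehome_to_optimal(assignment)
```

Because the block was drawn from the optimal complex's pool from the start, the re-home changes only the routing records, not the customer's WAN address.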

8.3.5 Physical Inventory Management
Physical inventory management deals with the entering and tracking of data about
the router equipment. It deals not only with equipment configuration details like
what cards are installed in the routers but also where those routers are located
for maintenance dispatch. The physical database also contains the parameters for
the engineering rules that vary by equipment make and model. These parameters
come either from the router vendor documentation or from certification testing.
The parameters and the associated rules can range from simple rules like maximum
bandwidth per line card to complex rules like the maximum number of VPN routes
with QoS on all line cards on the router with version 3 of the line card firmware. As
new routers or cards are added to the network, this subsystem tracks all the associated data for these assets including tracking whether a router or port is “in service”
and available for assignment. As ports are assigned to customers, the physical inventory removes those ports from the assets that are available for assignment. The
physical inventory also deals with the tracking of serial numbers of cards so that as
cards are replaced or upgraded, the new parameters can be used for the engineering
rules. For example, a card with 256 MB of memory could be upgraded to 512 MB
and thus be able to support more QoS connections. The physical inventory subsystem keeps track of these engineering parameters (sometimes called reference data)
about vendor equipment for use by other subsystems. Here are a few of the typical
parameters tracked:
Maximum logical ports
Maximum aggregate bandwidth
Maximum card assignment
Maximum PVCs
Logical channel limits
IDB limit
VRF limit
BGP limit
COS limit
Routes limit
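
As a rough sketch of how such reference data could drive an engineering-rule check, the snippet below models three limits per card type. The field names and every number are made up for illustration; real limits come from vendor documentation or certification testing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardLimits:
    """Reference data for one card type; the numbers used below are invented."""
    max_logical_ports: int
    max_aggregate_bandwidth_kbps: int
    max_qos_ports: int


@dataclass
class CardUsage:
    logical_ports: int = 0
    bandwidth_kbps: int = 0
    qos_ports: int = 0


def can_assign(limits: CardLimits, usage: CardUsage,
               bandwidth_kbps: int, needs_qos: bool) -> bool:
    """True if one more connection fits within every limit for this card."""
    return (
        usage.logical_ports + 1 <= limits.max_logical_ports
        and usage.bandwidth_kbps + bandwidth_kbps <= limits.max_aggregate_bandwidth_kbps
        and usage.qos_ports + int(needs_qos) <= limits.max_qos_ports
    )


# Hypothetical: the memory upgrade doubles the number of QoS ports supported.
LIMITS_256MB = CardLimits(100, 155_000, 20)
LIMITS_512MB = CardLimits(100, 155_000, 40)
usage = CardUsage(logical_ports=30, bandwidth_kbps=60_000, qos_ports=20)

fits_before_upgrade = can_assign(LIMITS_256MB, usage, 1_536, needs_qos=True)
fits_after_upgrade = can_assign(LIMITS_512MB, usage, 1_536, needs_qos=True)
```

Keeping the limits separate from the usage counters mirrors the text: when a card is upgraded, only the reference data attached to it changes.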

8.3.6 Logical Inventory Management
Logical inventory management deals with the entering and tracking of data about
the logical assets (IP addresses, ACLs, Route Distinguishers, Route Targets, etc.).
This can be a large subsystem depending on the different features available, but the
hardest item in the category is the IP address management. IP address management
deals with the assignment of efficient blocks to the various intended uses. Typically,
the engineering rules require different blocks of addresses to be used for infrastructure connections, WAN IP address blocks, and customer LAN address blocks.
This requires not only higher-level IP address block management functions so that
access control lists can be managed efficiently but also functions to deal with external systems like the ARIN registry. Service providers typically update the ARIN
Whois database through an API so that LAN IP blocks assigned to enterprise customers appear as being assigned to those customers. This aids the service provider
in obtaining additional IP address blocks from the registrar if needed. The tracking
of per router elements like ACL numbers is simpler but has its own nuances and
complexity, since the goal is to reuse ACL numbers where it is possible to reduce
the load on the router. Typically, memory is consumed for every ACL on a router.
The ACLs for different ports for the same customer tend to be identical, so memory utilization (and processing time on the ACL) can be reduced by consolidating the
separate ACLs into a single ACL that is shared among the customer’s ports. Numerous other items have to be tracked in logical inventory and assigned during the
assignment function depending on the feature or service being provided, and the logical inventory management system grows in complexity as more logical features are
added to the service.
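
The ACL sharing described above amounts to giving ports with identical rule lists the same ACL number. A minimal sketch, assuming rules are held as ordered lists of strings and that numbering starts at an arbitrary value (both assumptions of this example, as are the rule strings themselves):

```python
from typing import Dict, List, Tuple


def share_acls(port_rules: Dict[str, List[str]],
               first_number: int = 100) -> Tuple[Dict[str, int], Dict[Tuple[str, ...], int]]:
    """Give ports with identical rule lists the same ACL number."""
    number_for_rules: Dict[Tuple[str, ...], int] = {}
    port_to_acl: Dict[str, int] = {}
    for port, rules in port_rules.items():
        key = tuple(rules)                  # rule order matters, so keep it
        if key not in number_for_rules:
            number_for_rules[key] = first_number + len(number_for_rules)
        port_to_acl[port] = number_for_rules[key]
    return port_to_acl, number_for_rules


# Illustrative rule text; not real router syntax.
port_rules = {
    "port-a": ["permit 198.51.100.0/24", "deny any"],
    "port-b": ["permit 198.51.100.0/24", "deny any"],
    "port-c": ["permit 203.0.113.0/24", "deny any"],
}
port_to_acl, acls = share_acls(port_rules)
```

Three ports end up using two ACLs, which is the memory saving the text refers to.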

8.3.7 Reports and Feeds
The reports and feeds subsystem is responsible for distributing inventory data to
users and systems required to run the business. The main users of this subsystem
are the fault/service assurance system and the ticketing system. The fault/service
assurance system needs data about the in-service assets so that alarms can be processed correctly. Its source of truth is usually the “What it is” data from the inventory
database. The ticketing system is more concerned with the data about the customer,
since it is notified of an event by the fault/service assurance system and
needs to determine which customer or customers are affected by a given port/card/router problem. Fault and ticketing systems tend to get feeds of the inventory
data, since their query volume can be quite high and the load can best be managed
with a local cache of the data rather than directly querying the inventory database.
Generally, the inventory data does not change rapidly; so a local cache is sufficient
and alarms/tickets do not need these data until after test and turn up of the interface.
Other users need various reports and feeds from the inventory database, and
generally these are pulled either as a report from the GUI or APIs. A GUI-based
reporting application can easily be deployed on the inventory database for items like
port utilization reports for capacity management. APIs can be created as needed for
generating bulk files or responding to simple queries.

8.3.8 Router Audit
The router audit subsystem is responsible for doing both the discovery of the “What
it is” state of the router and comparing the “What it is” with the “What it should be”
in the inventory database. The audit function described in this section is designed to
detect differences with the inventory data. There are other mechanisms that can be
applied to look at the larger set of configuration rules. Some of these are covered in
Chapter 9.
Discovery is typically done with an engine that parses the router configurations
into database attributes. As described before, the parsed router configuration data
are stored in the inventory database but in a separate set of tables from the physical
and logical inventory. The schemas of the audit tables are similar to the physical
and logical inventory tables, but they lack some attributes that do not exist in the
router configuration; the major attributes are the same so that they can be compared
with the “what it should be” tables. After storage, the compare or audit function
does an item-by-item comparison, tracking any discords. The audit is CPU- and
disk-intensive and typically is only done across the entire network data set on a
daily basis. The discovery/audit process is also used to pick up changes like card
replacements. It is typical for this audit function to take 4–6 hours to complete across
a large network even when high-end servers are employed. The good news is that
the process can typically be run using the backup copies of the router configuration
files so that there is no impact on the network and limited impact on the users of the
system. Incremental audits can also be done on a port or card basis on demand as
part of the router configuration process.
It is worth noting that the tracking of discords requires a historical view: when a
discord was first detected and when was the last time it was detected. New discords
could correlate with an alarm or customer-reported problem. Old discords might be
indicative of a data integrity error from a manual correction that was implemented to
repair a customer problem but not appropriately reflected back into the inventory
database.
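
The item-by-item comparison and the historical view of discords can be sketched as follows. The dictionaries standing in for the “What it should be” and “What it is” tables, and the attribute names, are assumptions for this illustration; the comparison is one-directional (it checks each expected attribute against the router).

```python
from datetime import datetime
from typing import Dict, Set, Tuple

Key = Tuple[str, str]   # (port, attribute)


def audit(should_be: Dict[str, dict], what_it_is: Dict[str, dict],
          history: Dict[Key, Dict[str, datetime]], now: datetime) -> Set[Key]:
    """Compare item by item, update first/last seen times, return current discords."""
    current: Set[Key] = set()
    for port, expected in should_be.items():
        actual = what_it_is.get(port, {})
        for attribute, value in expected.items():
            if actual.get(attribute) != value:
                current.add((port, attribute))
    for key in current:
        entry = history.setdefault(key, {"first_seen": now})
        entry["last_seen"] = now
    return current


should_be = {"Serial4/0": {"wan_ip": "192.0.2.1", "acl": 100}}
what_it_is = {"Serial4/0": {"wan_ip": "192.0.2.1", "acl": 101}}
history: Dict[Key, Dict[str, datetime]] = {}

day1, day2 = datetime(2010, 1, 1), datetime(2010, 1, 2)
audit(should_be, what_it_is, history, day1)
discords = audit(should_be, what_it_is, history, day2)
```

A discord whose `first_seen` is recent can be correlated with new alarms, while one with an old `first_seen` points toward an unrecorded manual correction.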
While perhaps less visible to the overall router configuration management process than other aspects of the configuration workflow, audit is a key step. Real-time
validations must be implemented for a change order so that if there is a discord, the
process will stop the change order from being applied to prevent a problem. It is
important to subsequently find and fix these discords so that future change orders
are not affected.

8.4 Dealing with Change
An important aspect of a configuration management system is to deal with changes
to an existing service. For example, the initial configuration of an interface can be
done in various phases and with little concern for timing until the interface is moved
from the shutdown state to the active state. However, an active interface has a different set of rules. Generally, the timing associated with configuration changes is more
critical and the set of checks on the data and the configuration are more involved.
First, a robust network configuration management system will validate the current configuration of the interface (“What it is”) against the “What it should be”
data and, if there is a mismatch, stop the change. The reason is straightforward:
unless the “What it is” and “What it should be” data sets are in agreement, we run the risk of changing to a configuration that will not work for
the customer because of a previous data inconsistency. For instance, if there have
been problems with a previous re-home and the ACLs are not the same between
the old configuration and the new configuration, it could prevent the customer from
accessing their network services.
Second, for the intended change, the configuration management system should
validate the data set against the interface data, the global configuration of the router,
and to the extent possible the larger network for the customer to ensure that the
change is consistent with other “What it is” data. This usually consists of a set of
rules applied by the configuration management subsystem to ensure that a successful
change will be applied. A good example is again a re-home. If the old port is still
advertising its WAN IP address, you cannot bring up the same WAN IP address on
a different router or instabilities can be introduced (duplicate IP address detection is
an important validation rule).
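
The two validation steps described in this section could be combined in a pre-change check like the sketch below. The function signature and the `advertised` map (WAN IP address to the port currently advertising it) are hypothetical simplifications.

```python
from typing import Dict, List


def validate_change(port: str, should_be: dict, what_it_is: dict,
                    new_wan_ip: str, advertised: Dict[str, str]) -> List[str]:
    """Return reasons to stop the change; an empty list means it may proceed."""
    problems: List[str] = []
    # Step 1: the router must already match the inventory for this port.
    if should_be != what_it_is:
        problems.append(f"{port}: inventory and router configuration disagree")
    # Step 2: duplicate IP address detection across the network.
    owner = advertised.get(new_wan_ip)
    if owner is not None and owner != port:
        problems.append(f"{new_wan_ip} is still advertised by {owner}")
    return problems


clean = validate_change(
    "new-port", {"acl": 100}, {"acl": 100}, "192.0.2.1", advertised={}
)
blocked = validate_change(
    "new-port", {"acl": 100}, {"acl": 101}, "192.0.2.1",
    advertised={"192.0.2.1": "old-port"},
)
```

Returning all the reasons, rather than stopping at the first, gives the operator the full picture when a change order falls out.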

8.4.1 Test and Turn Up
Bringing up a new connection involves testing that the connection works correctly
as ordered and then turning up the port for full service. Turning up a large connection like a 10 Gb Ethernet connection is something done carefully because if
mis-configured it could either drive large amounts of traffic into a customer’s network before they are prepared for it or remove traffic from a customer’s network by
mistake. For most changes against a running configuration, the process of applying
the change has to be coordinated with a maintenance window¹ since service could
be impacted. Some changes may also require changes on the customer’s side of the
connection; so proper scheduling with the customer’s staff is required. For changes
that involve the physical connection (speed changes and re-homes), typically two
ports are in assignment at the same time and operations would like to test all or
parts of the new port before swinging the customer’s connection over. This “testing
phase” creates database complexity, since the new port has to be reserved for the
customer but it is not the “in-service” port from an alarming/ticketing standpoint.
Both the old and new have to be tracked until the port is fully migrated to the new
configuration. This requires the concept of “Pending” port assignments/connections
and database transactions to move a port from “Pending” to “Active,” from “Active”
to “Disconnected,” and finally the old record is deleted from the database.
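
The forward transitions named above can be enforced with a small state table. This sketch covers only the forward path listed in the text (it does not model roll back), and the in-memory `records` dictionary is a stand-in for the database.

```python
from typing import Dict

# Forward transitions only; "Deleted" means the record is removed.
ALLOWED = {
    "Pending": {"Active"},
    "Active": {"Disconnected"},
    "Disconnected": {"Deleted"},
}


def transition(records: Dict[str, str], port: str, new_state: str) -> None:
    """Move a port record to new_state, rejecting transitions not in ALLOWED."""
    current = records[port]
    if new_state not in ALLOWED.get(current, set()):
        raise ValueError(f"{port}: cannot go from {current} to {new_state}")
    if new_state == "Deleted":
        del records[port]
    else:
        records[port] = new_state


# Both ports are tracked for the same customer during the testing phase.
records = {"old-port": "Active", "new-port": "Pending"}
transition(records, "new-port", "Active")
transition(records, "old-port", "Disconnected")
transition(records, "old-port", "Deleted")
```

Feeds to alarming and ticketing systems would then select only records in the “Active” state, so the port under test stays reserved without appearing in service.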
The router configuration system has to maintain the ability to generate router
configurations for each of the interim steps in moving an active connection from one
port to another. There are configurations to bring up the new interface on temporary
information (e.g., temporary serial IP addresses and/or RD/RT/VRF information
for testing), steps to “shutdown” the old interface, steps to “no shutdown” the new
interface, and steps to reverse the entire process to roll back to the old interface.
All these need to be able to be driven through the API for relatively straightforward
changes with automated PE side re-homes that do not affect the customer premise
router and via the GUI for those more complicated changes that require coordination
with the customer. Dealing with change is where the entire system is stressed
the most: it must not only ensure that the network is protected but also
respond fast enough to meet the human- or machine-driven
process requirements.
¹ The Maintenance Window is a time period when there is expected to be low traffic and is used
by an operator for planned activities that could impact service. Usually it is in the late night/early
morning of the time zone of the router like 3–6 a.m.

8

Network Configuration Management


Another aspect of change worth mentioning is changes to active interfaces that are infrastructure connections (e.g., two or more backbone links that
connect network routers). A routine task is to change the OSPF metric on one link
to “cost it out”² of use so that maintenance on the connection can be done. A problem arises if the link is left in the “costed out” state: failure of the
remaining primary link then causes isolation, since one link is hard failed and the other
link is out of service by being “costed out.” A robust configuration management system also has maintenance functions that permit the operations staff to cost out a link,
record that the link is “costed out,” and generate an alarm condition if the link
stays “costed out” for longer than a set period of time.
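The record-and-alarm function can be sketched as follows. The class name `CostOutLog`, the link identifier, and the 8-hour threshold used in the usage note are assumptions for illustration; the chapter does not prescribe a threshold.

```python
from datetime import datetime, timedelta


class CostOutLog:
    """Records which links are costed out and flags those left that way."""

    def __init__(self, max_age):
        self.max_age = max_age   # how long a link may stay costed out
        self.costed_out = {}     # link id -> (original OSPF metric, timestamp)

    def cost_out(self, link, original_metric, when):
        self.costed_out[link] = (original_metric, when)

    def restore(self, link):
        """Return the metric to put back and stop tracking the link."""
        metric, _ = self.costed_out.pop(link)
        return metric

    def overdue(self, now):
        """Links that should raise an alarm because they exceeded max_age."""
        return sorted(link for link, (_, since) in self.costed_out.items()
                      if now - since > self.max_age)
```

For example, with `CostOutLog(timedelta(hours=8))`, a link costed out at 03:00 and never restored would appear in `overdue()` from shortly after 11:00 onward.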
Finally, a type of change that is of growing importance in large networks is the
ability to apply changes in bulk. The complexity of modern routers leads to situations where a latent bug or security vulnerability is found in a router that can only
be repaired by changing the configuration on a large number of ports in the network.
This requires special update processes to handle the updates in a bulk fashion. Typically, this is a customized application on the router configuration subsystem that is
targeted at dealing with the bulk processing. This gets complicated
not only because the system must track that all the changes are applied (routers sometimes
refuse administrative requests under heavy load) but also because it must throttle the updates to
specific routers so as not to overload them.
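Both concerns, tracking and throttling, can be seen in a short sketch. The function `bulk_apply`, its parameters, and the `push` callback are hypothetical; a real system would replace `push` with its own router download mechanism and tune the batch size and pause to the platform.

```python
import time
from collections import defaultdict


def bulk_apply(changes, push, per_router_batch=5, pause_s=1.0,
               max_attempts=3, sleep=time.sleep):
    """Apply (router, port, configlet) changes with throttling and retries.

    push(router, port, configlet) must return True on success and False
    if the router refused the request. Returns (applied, failed) lists of
    (router, port) pairs so that every change is accounted for.
    """
    by_router = defaultdict(list)
    for router, port, configlet in changes:
        by_router[router].append((port, configlet))

    applied, failed = [], []
    for router, items in by_router.items():
        for index, (port, configlet) in enumerate(items):
            if index and index % per_router_batch == 0:
                sleep(pause_s)  # throttle: pause between batches on one router
            for _ in range(max_attempts):
                if push(router, port, configlet):
                    applied.append((router, port))
                    break
                sleep(pause_s)  # back off before retrying a refused request
            else:
                failed.append((router, port))  # never accepted: needs follow-up
    return applied, failed
```

The `failed` list is the important output: it is what lets operations staff confirm that a security fix has actually reached every affected port.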

8.5 Example of Service Provisioning
This section will tie all the pieces together in an example of service provisioning for
a simple Internet access service.
Once all the order data are collected and optionally entered into an automated
order management system, the provisioning steps can occur, including downloading the configurations to the router. The individual configurations are called configlets, since
they are usually incremental changes to an interface or pieces of the global configuration, and not an entire router configuration. The steps are outlined below.
1. Create customer
2. Create premise/site
3. Create service instance
4. Create connection and reserve inventory
5. Download initial configuration
6. Download loopback test configlet
7. Download shutdown configlet
8. Download final configlet with “no shutdown”
9. Run daily audit

2
When OSPF costs on a set of links are adjusted to shift traffic off of one link and onto another
link, the process is informally called “costing out” the link.


B.D. Freeman

1. Create customer
This task is simply to group all the customer data into one high-level account
by creating (or using a previously created) customer entity in the database.
Sometimes the customer corresponds to an enterprise, but often, because of mergers and acquisitions or even departmental billing arrangements, the “customer” at this level
does not uniquely identify a corporation. There can even be complicated arrangements with wholesalers that must be reflected in various customer attributes.
2. Create premise/site
This task creates a database entity corresponding to the physical site at which
the access circuit terminates. Street address, city,
state/province, country, etc. are typical parameters. Corporations can have multiple services at an address, so we track the address partly to make
it easier to work with the customer and partly because these data affect the
selection of the optimum router to reduce backhaul costs.
3. Create service instance
This task collects the parameters about the intended service on this connection. It
will define the speed, any service options like quality of service, and all the other
logical connection parameters. These data directly affect the set of engineering
rules that will be applied to actually find an available port on an optimum router.
4. Create connection and reserve inventory
This task combines the above data into an assignment. The selection of a router
complex is done first using the parameters of address to look for a complex with
a short backhaul. This is called “Homing.” After a preliminary complex is assigned, the routers in the complex are checked for available port capacity and if
there is port capacity, the engineering rules for this connection on that router are
tested. For example, a router may have available ports, but there may be insufficient resources for additional QoS or MPLS VPN routes on the cards. The system
will recursively examine all routers in the complex to look for an available port
that matches the engineering rules. If no router is found, the system will examine
a next best optimum complex and repeat the search. This assignment function can
take a substantial amount of system resources to complete and is not guaranteed
to find a solution due to resource or other business rule constraints.
Once a complex, router, and port have been selected, the logical inventory will
be tied to the physical inventory and this Tie Down Information (TDI) will be
returned to the ordering system so that it can order the layer 1 connection from
the router to the customer premise. It is important to note that at this point the
Inventory database must set a state of the port so that no other customer can use
that router port. If the customer’s order is cancelled, the business process must
ensure that the port assignment is deleted as well to avoid stranded inventory.
At this point, the inventory database would show the port as “PENDING,”
since the inventory has been assigned but it is not in service. All the logical data
needed to configure the interface are in the database and any provider inventory
items have been assigned (serial IP addresses, ACL numbers, etc.).
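The homing and assignment search in step 4 can be sketched as a nested loop over complexes and routers. Everything named here is an assumption for illustration: `backhaul_km` stands in for whatever backhaul cost the provider uses, and `free_qos_slots` is a placeholder for the card-level engineering rules mentioned in the text.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    free_ports: list       # port ids not yet assigned
    free_qos_slots: int    # placeholder for card resource limits


@dataclass
class Complex:
    name: str
    backhaul_km: float     # hypothetical backhaul cost from the customer site
    routers: list


def assign_port(complexes, needs_qos):
    """Return a PENDING assignment, or None if no complex can serve the order."""
    # Homing: try complexes in order of increasing backhaul.
    for cx in sorted(complexes, key=lambda c: c.backhaul_km):
        for router in cx.routers:
            if not router.free_ports:
                continue  # no port capacity on this router
            if needs_qos and router.free_qos_slots < 1:
                continue  # ports exist, but the engineering rules fail
            port = router.free_ports.pop(0)  # reserve so no one else can use it
            if needs_qos:
                router.free_qos_slots -= 1
            return {"complex": cx.name, "router": router.name,
                    "port": port, "state": "PENDING"}
    return None  # not guaranteed to find a solution
```

Removing the port from `free_ports` at the moment of selection mirrors the requirement that the inventory database mark the port so that no other customer can use it.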


5. Download initial configuration
After the inventory has been assigned, an initial configuration of the port is downloaded to the router to define the basic interface. This configlet typically only
includes the serial IP address and default routing and defines the interface in a
shutdown state. This is also the first real-time audit step. This audit will confirm
that the assigned port is not used by some other connection. While rare, data discords of this type do occur. This download need not occur in real time, since it
will typically be some amount of time before the Layer 1 connection is ready.
6. Download the loopback test configlet
This step depends on the Layer 1 connection being installed, so it can occur
days, weeks, or months after step 5. Once Layer 1 is installed, this
step typically occurs 24 h before the scheduled turn-up date for a customer.
This configlet contains all the routing and configuration data for the connection. Downloading a configlet to do loopback testing on the network side of the
connection provides a final check of the provider’s part of the work. Just before the configlet is downloaded, a series of real-time audits are again conducted,
since the initial configlet audits could have been months ago. These audits check
both the static order data against the running router and attributes on other ports
on the router. For example, there is a verification that any new ACL number is
not already in use on another port for another customer. This check catches
ports that were manually configured in error. There is a verification that
any new VRF does not already exist on the router, which would indicate that another
order has been processed in parallel. There are numerous other validations as
well. This real-time audit is more detailed than the audit done for the initial configlet, since it contains all the routing, QoS, and VPN data. If all validations are
successful, the configuration is downloaded and activated for testing with Layer
1 in loopback.
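The two example validations in this step, ACL number reuse and a pre-existing VRF, can be sketched as a function that compares the order data with a snapshot of the running router. The dictionary layout and the function name `pre_download_audit` are assumptions made for the example.

```python
def pre_download_audit(order, running):
    """Return a list of problems; an empty list means the download may proceed.

    order:   {"port", "customer", "acl", "vrf"} from the inventory database.
    running: {"ports": {port: {"customer", "acl"}}, "vrfs": set of VRF names},
             as read from the router just before the download.
    """
    problems = []

    in_use = running["ports"].get(order["port"])
    if in_use and in_use["customer"] != order["customer"]:
        problems.append(f"port {order['port']} is used by another connection")

    for port, cfg in running["ports"].items():
        if (port != order["port"] and cfg.get("acl") == order["acl"]
                and cfg["customer"] != order["customer"]):
            problems.append(f"ACL {order['acl']} is already in use on {port}")

    if order["vrf"] in running["vrfs"]:
        problems.append(f"VRF {order['vrf']} already exists on the router")

    return problems
```

Returning every problem rather than stopping at the first one gives the operator a complete picture before deciding whether to fix the order or the router.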
7. Download shutdown configlet
After successful pretesting, the router port is left in a shutdown state. It can remain in this configuration for some period of time, but because routing instances
may have been defined even though the port is shut down, operators typically do
not leave a shutdown interface in the router configuration for more than 48 hours
or so. A shutdown interface is still discoverable from an SNMP network management perspective, so a large number of admin-down interfaces simply adds
load to the fault management system without adding value. If the port is not successfully turned up, the configuration will be rolled back to the initial configuration.
While the Layer 1 circuit is being ordered/installed, there will likely be many
daily audits that run. These audits will find the port in the router in shutdown
state. The discord analysis will compare the “What it is” configuration and state
with the “What it should be” configuration and state and report any problems.
For our example, there is no problem but the audit might find that the port is in
a “no shutdown” state in the network indicating that perhaps a test and turn up
occurred but was not completed in the inventory database. The daily audit would
also find if the router card had been replaced for some reason and update tracking
data like serial numbers, etc.


8. Download final configlet with “no shutdown”
At activation, the system will download the final router configuration with
“no shutdown” of the interface. Final testing may occur with the customer.
The testing for single-link static routed interfaces is usually automated but for
advanced configurations with multiple links or BGP routing, manual testing procedures are typical. It is at this point that the inventory database will update its
status on the port to active and mark the port “In service” for downstream systems
like the Fault Management and Ticketing systems.
9. Run daily audit
The daily audit will find the new state of the port to be active, so the “What it
should be” state of “ACTIVE” matches the “What it is” state in the network.
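The discord analysis performed by the daily audit in steps 7 and 9 amounts to comparing the “What it should be” records with the “What it is” observations. The sketch below is a simplified illustration; the record fields and the mapping from inventory state to expected administrative state are assumptions.

```python
# Expected administrative state on the router for each inventory state.
EXPECTED_ADMIN = {"PENDING": "shutdown", "ACTIVE": "no shutdown"}


def daily_audit(should_be, what_is):
    """Compare inventory records with the observed network state.

    should_be: {port: {"state", "card_serial"}} from the inventory database.
    what_is:   {port: {"admin", "card_serial"}} collected from the routers.
    Returns a list of (port, kind) discords and refreshes tracking data.
    """
    discords = []
    for port in sorted(should_be):
        record = should_be[port]
        observed = what_is.get(port)
        if observed is None:
            discords.append((port, "missing"))
            continue
        if observed["admin"] != EXPECTED_ADMIN[record["state"]]:
            # e.g. a turn-up happened but was never completed in inventory
            discords.append((port, "admin_state"))
        if observed["card_serial"] != record["card_serial"]:
            discords.append((port, "card_serial"))
            record["card_serial"] = observed["card_serial"]  # update tracking
    return discords
```

A port that is “PENDING” in inventory and shut down in the network, or “ACTIVE” and not shut down, produces no discord, which matches the no-problem cases in the walkthrough.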

8.6 Conclusion
Hopefully, we have provided a useful overview of a robust router configuration management system and helped to tie the key functions and subsystems back to the
business needs that drive complexity. From inventory management to provisioning
the customer’s service to handling changes to dealing with bulk security updates,
a large carrier cannot provide reliable service without a robust router configuration
management system.
Here is a summary of some “best practice” principles that will be helpful when
designing a Network Configuration Management system.

• Recognize data discords as a fact of life. Separate “What it is” and “What it should be” data in the inventory database
• Configuration management is the source of truth for the business about the current network using the “What it is” data
• Protect the network through real-time validation and auditing of the running network
• Design for change so that logical data are not permanently tied to physical data
• Separate the schema for physical inventory and logical inventory
• Use templates to make configuration, discord detection, and testing easier
• Track port history, and not just the current state
• Design for multiple configurations of a port to handle the current port configuration and the pending port configuration
• Design the system to support testing a port before it is turned up and rollback to an earlier configuration when tests fail
• Limit the amount of business data in the network-facing system so that you do not create a problem of maintaining consistency






References
1. Chandra, R., Traina, R., & Li, T. IETF Request for Comments 1997, BGP Communities
Attribute, August 1996.
2. Hawkinson, J., & Bates, T. IETF Request for Comments 1930, Guidelines for creation, selection,
and registration, March 1996.
3. Rosen, E., & Rekhter, Y. IETF Request for Comments 4364, BGP/MPLS Virtual Private
Networks, April 2006.
4. Sklower, K., Lloyd, B., McGregor, G., Carr, D., & Coradetti, T. IETF Request for Comments
1990, The PPP Multilink Protocol, August 1996.

Chapter 9

Network Configuration Validation¹
Sanjai Narain, Rajesh Talpade, and Gary Levin

9.1 Introduction
To set up network infrastructure satisfying end-to-end requirements, it is not only
necessary to run appropriate protocols on components but also to correctly configure
these components. Configuration is the “glue” for logically integrating components at and across multiple protocol layers. Each component has configuration
parameters, each of which can be set to a definite value. However, today, the large
conceptual gap between end-to-end requirements and configurations is manually
bridged. This causes large numbers of configuration errors whose adverse effects on
the security, reliability, and deployment cost of network infrastructure are well
documented. For example:
• “Setting it [security] up is so complicated that it’s hardly ever done right. While we await a catastrophe, simpler setup is the most important step toward better security.” – Turing Award winner Butler Lampson [42].
• “. . . human error is blamed for 50 to 80 percent of network outages.” – Juniper Networks [40].
• “The biggest threat to increasingly complex systems may be systems themselves.” – John Schwartz [61].
• “Things break and complex things break in complex ways.” – Steve Bellovin [61].
• “We don’t need hackers to break systems because they’re falling apart by themselves.” – Peter Neumann [61].

S. Narain, R. Talpade, and G. Levin
Telcordia Technologies, Inc., 1 Telcordia Drive, Piscataway, NJ 08854, USA
e-mail: narain@research.telcordia.com; rrt@research.telcordia.com;
glevin@research.telcordia.com
1
This material is based upon work supported by Telcordia Technologies, and Air Force Research
Laboratories under contract FA8750-07-C-0030. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect
the views of Telcordia Technologies or of Air Force Research Laboratories. Approved for Public
Release; distribution unlimited: 88ABW-2009-3797, 27 August 09.

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_9,
© Springer-Verlag London Limited 2010



Thus, it is critical to develop validation tools that check whether a given configuration is consistent with the requirements it is intended to implement. Besides
checking consistency, configuration validation has another interesting application,
namely, network testing. The usual invasive approach to testing has several limitations. It is not scalable. It consumes resources of the network and network
administrators and has the potential to unleash malware into the network. Some
properties such as absence of single points of failure are impractical to test as they
require failing components in operational networks. A noninvasive alternative that
overcomes these limitations is analyzing configurations of network components.
This approach is analogous to testing software by analyzing its source code rather
than by running it. This approach has been evaluated for a real enterprise.
Configuration validation is inherently hard. Requirements can be on connectivity,
security, performance, and reliability and span multiple components and protocols.
A real infrastructure can have hundreds of components. A component’s configuration file can have a couple of thousand configuration commands, each setting the
value of one or more configuration parameters. In general, the correctness of a component’s configuration cannot be checked in isolation. One needs to evaluate global
relationships into which components have been logically integrated. Configuration
repair is even harder, since changing configurations to make one requirement true
may falsify another. The configuration change needs to be holistic in that all requirements must concurrently hold.
This chapter motivates the need for configuration validation in the context of
a realistic collaboration network, proposes an abstract design of a configuration
validation system, surveys current technologies for realizing this design, outlines
experience with deploying such a system in a real enterprise, and outlines future
research directions.
Section 9.2 discusses the challenges of configuring a realistic, decentralized
collaboration network, the vulnerabilities caused by configuration errors, and the
benefits of using a validation system. Requirements on this network are complex to
begin with. Their manual implementation can cause a large number of configuration errors. This number is compounded by the lack of a centralized configuration
authority.
Section 9.3 proposes a design of a system that can not only validate the above
network but also evolve to validate even more complex ones. This design consists
of four subsystems. The first is a Configuration Acquisition System for extracting
configuration information from components in a vendor-neutral format. The second
is a Requirement Library capturing best practices and design patterns that simplify
the conceptualization of end-to-end requirements. The third is a Specification Language whose syntax simplifies the specification of requirements. The fourth is an
Evaluation System for efficiently evaluating requirements, for suggesting configuration repair when requirements are false, and for creating visualizations of logical
relationships.
Section 9.4 discusses the Telcordia® IP Assure product [38] and the choices
it has made to realize this design. It uses a parser generator for configuration acquisition. Its Requirement Library consists of requirements on integrity of logical
structures, connectivity, security, performance, reliability, and government policy.
Its specification language is one of visual templates. Its evaluation system uses algorithms from graph theory and constraint solving. It computes visualizations of
several types of logical topologies.
Section 9.5 discusses logic-based techniques for realizing the above validation
system design. Their use is particularly important for configuration repair. They simplify configuration acquisition and specification. They allow firewall subsumption,
equivalence, and rule redundancy analysis. These techniques are the languages Prolog, Datalog, and arithmetic quantifier-free forms [51, 53, 67], the Kodkod [41, 69]
constraint solver for first-order logic of finite domains, the ZChaff [27, 46, 73]
minimum-cost SAT solver for Boolean logic, and Ordered Binary Decision Diagrams (OBDDs) [12].
Section 9.6 outlines related techniques for realizing the above validation system
design. These are type inference for configuration acquisition [47], symbolic reachability analysis [72], its implementation [3] with symbolic model checking [48], and
finally, validation techniques for Border Gateway Protocol (BGP), the Internet-wide
routing protocol, and one of the most complex.
Section 9.7 contains a summary and outlines future research directions.

9.2 Configuration Validation for a Collaboration Network
This section discusses the challenges of configuring a realistic, multi-enterprise collaboration network, the types of its vulnerabilities caused by configuration errors,
the reasons why these arise, and the benefits that can be derived from using a configuration validation system. Multiple communities of interest (COIs) are set up
as logically partitioned virtual private networks (VPNs) overlaid on a common IP
network backbone. The “nodes” of this VPN are gateway routers at each enterprise that participate in the COI. An enterprise can participate in more than one
COI, in which case it would have one gateway router for each COI. For each COI,
agreement is reached between participating network administrators on the top-level
connectivity, security, performance, and reliability requirements governing the COI.
Configuration of routers, firewalls, and other network components to implement
these requirements is up to administrators. There is no centralized configuration
authority. The administrators at different enterprises in a COI negotiate with each
other to ensure configuration consistency. Such decentralized networks exist in industry, academia, and government and are clear candidates for the application of
configuration validation tools.
Typical COI requirements are now described. The connectivity requirement is
that every COI site must be reachable from every other COI site. The security requirement is twofold. First, all communication between sites must be encrypted.
Second, no packets from one COI can leak into another COI. This requirement is
especially important since collaborating enterprises have limited mutual trust. A site
can be a part of more than one COI but the information that site is willing to share
with partners on one COI is distinct from that with partners in another COI. The
performance requirement specifies the bandwidth, delay, jitter, and packet loss for
various types of applications. The reliability requirement specifies that connectivity
be maintained in the face of link or node failure.
Since these requirements are complex, large numbers of configuration errors can
be made. This number is compounded by the lack of a centralized configuration
authority. The complexity has the further consequence that less experienced administrators, especially in an emergency, tend to statically route traffic directly over
the IP backbone rather than correctly set up dynamic routing. When the emergency passes, however, the static routes are not removed for fear of breaking the routing. Over
time, this causes the COIs to become brittle in that routes cannot be automatically
recomputed in the face of link or node failure.
While administrators are well aware of configuration errors and their adverse effects on the global network, they lack the tools to identify them, much less remove
them. The decentralized nature of the network prevents them from obtaining a picture of the global architecture. A validation system that could identify configuration
errors, make recommendations for repairing them, and help administrators understand the global
relationships would be of immense value.
Figure 9.1 shows the architecture of a typical COI with four collaborating sites
A, B, C, D. Each site contains a host, an internal router, and a gateway router. The
first two items are shown only for sites A and C. Each gateway router is physically
connected to the physical IP backbone network (WAN). Overlaid on this backbone
is a network of IPSec [41] tunnels interconnecting the gateway routers. An IPSec
tunnel is used to encrypt packets flowing between its endpoints. Overlaid on the
IPSec network is a network of GRE [22] tunnels. A GRE tunnel provides the appearance of two routers being directly connected even though there may be many
physical hops between them. The two overlay networks are “glued” together in such
a way that all packets through GRE tunnels are encrypted. A routing protocol, e.g.,
BGP [33, 36], is run over the GRE network to discover routes on this overlay. If
a link or node in this network fails, BGP discovers an alternate route if possible.
A packet originating at host HA destined to host HC is first directed by its internal
router IA to the gateway router RA. RA encrypts the packet, then finds a path to HC
on the GRE network. When the packet arrives at RC, it is decrypted, decapsulated,
and forwarded to IC. IC then forwards it to HC. All routers also run the internal
routing protocol called OSPF [42]. OSPF discovers routes to destinations that are
internal to a site. The OSPF process at the gateway router redistributes or injects internal routes into the BGP process. The BGP process then informs its peers at other
gateway routers about these routes. Eventually, all gateway routers come to know
about how to route packets to any reachable internal destination at any site.

Fig. 9.1 Community of interest architecture (diagram labels: gateway routers RA, RB, RC, RD; internal routers IA, IC; hosts HA, HC; WAN; Physical Link; IPSec Tunnel; GRE Tunnel)
In summary, connectivity, security, and reliability requirements are satisfied by
the use of GRE, IPSec, and BGP together with OSPF, respectively. The security requirement
that data from one COI not leak into another is satisfied implicitly: GRE reachability to a different COI is disallowed, static routes to destinations in different COIs are
not set up, gateway routers at the same enterprise but belonging to different COIs
are not directly connected, and BGP sessions across different COIs are not set up.
The performance requirement is satisfied by ensuring that GRE tunnels are
mapped to physical links of the proper bandwidth, delay, jitter, and packet loss properties, although this is not always under the control of COI administrators. Avoiding one
cause of packet loss is, however, under their control: the blocking of Maximum
Transmission Unit (MTU) mismatch messages. If a router receives a packet whose
size is larger than the router’s configured MTU, and the packet’s Do Not Fragment
bit is set, the router will drop the packet. The router will also warn the sender in
an ICMP message that it has dropped the packet. Then, the sender can reduce the
size of packets it sends. However, since ICMP is the same protocol used to carry
ping messages, firewalls at many sites block ICMP. The result is that the sender will
continue to send packets without reducing their size and they will all be dropped
by the router [68]. Packets increase in size beyond an expected MTU because GRE
and IPSec encapsulations add new headers to packets. To avoid such packet loss,
the MTU at all routers is set to some fixed value accounting for the encapsulation.
Alternatively, ICMP packets carrying MTU mismatch messages are not blocked.
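The arithmetic behind the fixed MTU value can be made concrete with a short sketch. Basic GRE over IPv4 adds 24 bytes (a 20-byte outer IP header plus a 4-byte GRE header). The IPSec overhead depends on the mode and algorithms chosen, so it is left as a parameter; the 56-byte figure in the usage note is only an assumed value for illustration.

```python
GRE_OVERHEAD = 24  # 20-byte outer IPv4 header + 4-byte basic GRE header


def max_inner_packet(link_mtu, ipsec_overhead, gre_overhead=GRE_OVERHEAD):
    """Largest original packet that still fits the link after encapsulation."""
    return link_mtu - gre_overhead - ipsec_overhead


def is_dropped(packet_size, df_set, link_mtu, ipsec_overhead):
    """True if the encapsulated packet exceeds the MTU and may not be fragmented."""
    encapsulated = packet_size + GRE_OVERHEAD + ipsec_overhead
    return df_set and encapsulated > link_mtu
```

With a 1500-byte link MTU and an assumed 56 bytes of IPSec overhead, `max_inner_packet(1500, 56)` gives 1420, so a 1500-byte packet with the Do Not Fragment bit set would be dropped while a 1420-byte packet would pass.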
This design is captured by the following requirements:
Connectivity Requirements
1. Each site has a gateway router connected to the WAN.
2. There is a full-mesh of GRE tunnels between gateway routers.
3. Each gateway router is connected to an internal router at the same site.
Security Requirements
1. There is a full-mesh network of IPSec tunnels between all gateway routers.
2. Packets through every GRE tunnel are encrypted with an IPSec tunnel.
3. No gateway router in a COI has a static route to a destination in a different COI.
4. No cross-COI physical, GRE, BGP connectivity, or reachability is permitted.


Reliability Requirements
1. BGP is run on the GRE tunnel network to discover routes to destinations in different sites.
2. OSPF is run within a site to discover routes to internal destinations.
3. OSPF and BGP route redistribution is set up.
Performance Requirements
1. MTU settings on all interfaces are set to be less than the expected packet size
after taking into account GRE and IPSec encapsulation.
2. Alternatively, access-control lists at each gateway router permit ICMP packets
carrying MTU messages.
Configuration parameters that must be correctly set to implement the above requirements include:
1. IP addresses and mask of physical and GRE interfaces
2. IP address of the local and remote BGP session end points and the autonomous
system (AS) number of the remote end point
3. Names of GRE interface and IP address of associated local and remote physical
tunnel end points
4. IP addresses of local and remote IPSec tunnel end points, encryption and hash
algorithms to apply to protected packets, and the profile of packets to be protected
5. Destination, destination mask, and next hop of static routes
6. Interfaces on which OSPF is enabled and the OSPF areas to which they belong
7. Source and destination address ranges, protocols, and port ranges of packets for
access-control lists
8. Maximum transmission units for router interfaces
As can be imagined, a large number of errors can be made in manual computation of
configuration parameter values implementing these requirements. GRE tunnels may
be configured in only one direction or not at all. IPSec tunnels may be configured in only one
direction or not at all. GRE and IPSec tunnels may not be “glued” together. GRE
tunnels or sequences of tunnels may link routers in distinct COIs. A COI gateway
router may contain static routes to a different COI, so packets could be routed to
that COI via the WAN. BGP sessions may be set up between routers in different
COIs, so these routers may come to know about destinations behind each other.
BGP sessions may only be configured in one direction or not at all. BGP sessions
may not be supported by GRE tunnels, so these sessions will not be established.
There may be single points of failure in the GRE and BGP networks. Finally, MTU
settings on routers in a COI may be different leading to the possibility of packet
loss. Such errors can be visualized by mapping various logical topologies. Two of
these are shown below.
In Fig. 9.2, nodes represent routers and edges represent GRE edges between
routers. These edges have to be set up in both directions for a GRE tunnel to be established. This graph shows three problems. First, the edge labeled “Asymmetric” has
no counterpart in the reverse direction. Second, the dotted line indicates a missing
tunnel. Third, the hatched router indicates a single point of GRE failure. All GRE
packets to destinations to the right of this router pass through this router.

Fig. 9.2 GRE tunnel topology (annotations: Single point of Failure, Missing, Asymmetric)

Fig. 9.3 BGP neighbor topology (regions: COI 1, COI 2)
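The three kinds of GRE problem can be detected mechanically from the configured tunnel endpoints. The sketch below is one simple way to do so, assuming tunnels are given as directed (source, destination) pairs; it uses a brute-force connectivity check per router rather than an optimized articulation-point algorithm, and the function name is an assumption.

```python
from itertools import combinations


def gre_problems(routers, tunnels):
    """Find asymmetric edges, missing tunnels, and single points of failure.

    routers: iterable of router names in one COI.
    tunnels: set of directed (src, dst) pairs, one per configured tunnel end.
    """
    routers = sorted(routers)
    asymmetric = sorted((a, b) for (a, b) in tunnels if (b, a) not in tunnels)
    # A tunnel is established only when both directions are configured.
    established = {frozenset((a, b)) for (a, b) in tunnels if (b, a) in tunnels}
    # Full mesh requirement: every pair needs a tunnel; report pairs with none.
    missing = sorted((a, b) for a, b in combinations(routers, 2)
                     if (a, b) not in tunnels and (b, a) not in tunnels)

    def connected(nodes):
        if not nodes:
            return True
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(m for m in nodes
                         if frozenset((node, m)) in established)
        return seen == nodes

    # A router is a single point of failure if removing it disconnects the rest.
    spof = [r for r in routers if not connected(set(routers) - {r})]
    return asymmetric, missing, spof
```

Because the check needs only configuration data, it identifies single points of failure without failing any component in the operational network.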
In Fig. 9.3, nodes represent routers and links represent BGP sessions between
nodes. This graph shows two problems. First, there is no full-mesh of BGP sessions
within COI 1. Second, there is a BGP session between routers in two distinct COIs.
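Both BGP problems can be checked in the same style. The sketch assumes each router is labeled with its COI and each BGP session is an unordered pair of routers; the function name and data layout are illustrative assumptions.

```python
from itertools import combinations


def bgp_problems(coi_of, sessions):
    """Find missing intra-COI sessions and forbidden cross-COI sessions.

    coi_of:   {router: COI name}.
    sessions: set of frozenset({a, b}) pairs, one per BGP neighbor relation.
    """
    # Full mesh within each COI: every same-COI pair must have a session.
    missing = sorted((a, b) for a, b in combinations(sorted(coi_of), 2)
                     if coi_of[a] == coi_of[b]
                     and frozenset((a, b)) not in sessions)
    # No session may join routers that belong to different COIs.
    cross = sorted(tuple(sorted(s)) for s in sessions
                   if len({coi_of[r] for r in s}) > 1)
    return missing, cross
```

A non-empty `cross` list corresponds to the security violation in Fig. 9.3, where routers in distinct COIs could learn about destinations behind each other.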

9.3 Creating a Configuration Validation System
This section outlines the design of a system that can not only validate the network of
the previous section but also evolve to validate even more complex ones. As shown
in Fig. 9.4, this consists of a Configuration Acquisition system to acquire configuration information in a vendor-neutral format, a Requirement Library containing
fundamental requirements simplifying the task of conceptualizing administrator intent, an easy-to-use Specification Language in which to specify requirements, and
an Evaluation System to efficiently evaluate specifications in this language. These
subsystems are now described.

Fig. 9.4 Validation system architecture (components: Configuration Files, Configuration Acquisition System, Configuration Database, Administrator, Requirement Library, Specification Language, End-to-End Requirements in Specification Language, Evaluation System; outputs: Root-Cause of Non-Compliance, Visualizations, Suggestions for Repair)

9.3.1 Configuration Acquisition System
Each component has associated with it a configuration file containing commands
that define that component’s configuration. These commands are entered by the
network administrator. The most reliable method of acquiring a device’s configuration information is to acquire this file, manually or automatically. Other
less-reliable methods are accessing the devices’ SNMP agent and querying configuration databases. SNMP agents often do not store all of the configuration
information one might be interested in. The correctness and completeness of a configuration database varies from enterprise to enterprise.
If configuration information is acquired from files, then these files have to be
parsed. Configuration languages have a simple syntax and semantics, since they are
intended to be used by network administrators who may not be expert programmers.
Different vendors offer syntactically different configuration languages. However,
the abstract configuration information stored in these files is the same, barring nonstandard features that vendors sometimes implement. This information is associated
with standardized protocols. Examples of it from the previous section are IP addresses, OSPF area identifiers, BGP neighbors, and IPSec cryptographic algorithms.
This information needs to be extracted from files and stored in a vendor-neutral database. Then, algorithms for evaluating requirements can be written just once
against this database, rather than once for each vendor’s configuration
language. However, configuration languages are vast, each with a very large set
of features. Their syntax can change from one product release to another. Some
vendors do not supply APIs to extract the abstract information. It should be possible to extract configuration information without having to understand all features
of a configuration language. Extraction algorithms should be resilient to inevitable
changes in configuration language syntax.

9 Network Configuration Validation

9.3.2 Requirement Library
The Requirement Library is analogous to libraries implementing fundamental algorithms in software development. The Library should capture design patterns and
best practices for accomplishing fundamental goals in connectivity, security, reliability, and performance. Examples of these for security can be found in [18] and for
routing in [33]. These patterns can be expressed as requirements. The administrator
should be easily able to conceptualize end-to-end requirements as compositions of
Library requirements.

9.3.3 Specification Language
The specification language should provide an easy-to-use syntax for expressing
end-to-end requirements. Specifications should be as close as possible in their forms
to their natural language counterparts. The syntax can be text-based or visual. Since
requirements are logical concepts, the syntax should allow specification of objects,
attributes, and constraints between these and compositions of constraints via operators such as negation, conjunction, disjunction, and quantification. For example, all
of these constructs appear in the Section 9.2 requirement “No gateway router in a
COI has a static route to a destination in a different COI.”

9.3.4 Evaluation System
The Requirement Evaluation system should contain efficient algorithms to evaluate a requirement against configuration. These algorithms should output not just a
yes/no answer but also explanations or counterexamples to guide configuration repair. Configuration repair is harder than evaluation. A set of requirements can be
independently evaluated but if some are false, they cannot be independently made
true. Changing the configuration to make one requirement true may falsify another.
To provide further insight into reasons for truth or falsehood of requirements, this
system should compute visualizations of logical relationships that are set up via
configuration, analogous to visualizations of quantitative data [70].

9.4 IP Assure Validation System
This section describes the Telcordia® IP Assure product and discusses the choices
made in it to implement the above abstract design of a validation system. This product aims to improve the security, availability, QoS, and regulatory compliance of IP
networks. It uses a parser generator for configuration acquisition. Its Requirement
Library consists of well over 100 requirements on integrity of logical structures,
connectivity, security, performance, reliability, and government policy. Its specification language is one of visual templates. Its evaluation system uses algorithms from
graph theory and constraint solving. It also computes visualizations of several types
of logical topologies. If a requirement is false, IP Assure does compute a root-cause,
although its computation is hand-crafted for each requirement. IP Assure does not
compute a repair that concurrently satisfies all requirements.

9.4.1 Configuration Acquisition System
Section 9.3 raised three challenges in the design of a configuration acquisition system. The first was the design of a vendor-neutral database schema for storing
configuration information. The second was extracting information from configuration files without having to know the entire configuration language for a given
vendor. The third was making the extraction algorithms robust to inevitable changes
in the configuration language. This section describes IP Assure’s configuration acquisition system and sketches how well it meets these challenges.
IP Assure has defined a schema loosely modeled after DMTF [20] schemas. It
uses the ANTLR [5] system to define a grammar for configuration files. The parser
generated by ANTLR reads the configuration file and if successful returns an abstract syntax tree exposing the structure of the file. This tree is then analyzed by
algorithms implemented in Java to create and populate tables in its schema. Often,
information in a table is assembled from information scattered in different parts of
the file.
The system is illustrated in the context of a configuration file containing the following commands in Cisco’s IOS configuration language:
hostname router1
!
interface Ethernet0
 ip address 1.1.1.1 255.255.255.0
 crypto map mapx
!
crypto map mapx 6 ipsec-isakmp
 set peer 3.3.3.3
 set transform-set transx
 match address aclx
!
crypto ipsec transform-set transx esp-3des hmac
!
ip access-list extended aclx
 permit gre host 3.3.3.3 host 4.4.4.4

A configuration file is a sequence of command blocks consisting of a main command
followed by zero or more indented subcommands. The first command specifies the
name router1 of the router. It has no subcommands. Any line beginning with
! is a comment line. The second command specifies an interface Ethernet0. It
has two subcommands. The first specifies the IP address and mask of this interface.
The second specifies the name mapx of an IPSec tunnel originating from this interface. The parameters of the IPSec tunnel are specified in the next command block.
The main command specifies the name of the tunnel, mapx. The subcommands
specify the address of the remote endpoint of the IPSec tunnel, the set transx of
cryptographic algorithms to be used, and the profile aclx of the traffic that will be
secured by this tunnel. The next command block defines the set transx as consisting of the encryption algorithm esp-3des and the hash algorithm hmac. The
last command block defines the traffic profile aclx as any packet with protocol,
source address and destination address equal to gre, 3.3.3.3 and 4.4.4.4,
respectively.
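The block structure just described can be sketched in a few lines of Python. This is only an illustration, not part of IP Assure: the function name parse_blocks and the SAMPLE text are hypothetical, and the sketch assumes that subcommands are indented by at least one space and that lines starting with ! are comments.

```python
def parse_blocks(text):
    """Split IOS-style text into (main_tokens, [subcommand_tokens, ...]) blocks.

    Assumptions of this sketch: a subcommand is any non-comment line that
    starts with whitespace, and a line starting with '!' is a comment.
    """
    blocks = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("!"):
            continue  # blank line or comment
        tokens = stripped.split()
        if line[0].isspace() and blocks:
            blocks[-1][1].append(tokens)  # indented: subcommand of the current block
        else:
            blocks.append((tokens, []))   # unindented: starts a new command block
    return blocks


SAMPLE = """\
hostname router1
!
interface Ethernet0
 ip address 1.1.1.1 255.255.255.0
 crypto map mapx
!
crypto map mapx 6 ipsec-isakmp
 set peer 3.3.3.3
 set transform-set transx
 match address aclx
!
crypto ipsec transform-set transx esp-3des hmac
!
ip access-list extended aclx
 permit gre host 3.3.3.3 host 4.4.4.4
"""

if __name__ == "__main__":
    for main, subs in parse_blocks(SAMPLE):
        print(" ".join(main), "->", len(subs), "subcommand(s)")
```

Each block keeps its main command and its subcommands as token lists, which is the only structure the rest of the discussion relies on.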
Part of an ANTLR grammar for recognizing the above file is:
commands: command NL (rest=commands | EOF)
    -> ^(COMMAND command $rest?);
command: ('interface') => interface_cmd
    | ('crypto') => crypto_cmd
    | ('ip') => ip_cmd
    | unparsed_cmd;
interface_cmd: 'interface' ID (LEADINGWS interface_subcmd)*
    -> ^('interface' ID interface_subcmd*);
interface_subcmd:
    'ip' 'address' a1=ADDR a2=ADDR -> ^('address' $a1 $a2)
    | 'crypto' 'map' ID -> ^(CRYPTO_MAP ID)
    | unparsed_subcmd;

The first grammar rule states that commands is a sequence of one or more command blocks. The ^ symbol is a directive to construct the abstract syntax tree whose
root is the symbol COMMAND, whose first child is the command block just read, and
second child is the tree representing the sequence of subsequent command blocks.
The next rule states that a command block begins with the keywords interface,
crypto, or ip. The symbol => means no backtracking. The last line in this
rule states that if a command block does not begin with any of these identifiers, it is
skipped. Skipping is done via the unparsed_cmd symbol. Grammar rules defining it skip all tokens till the beginning of the next command block. The last two
rules define the structure of an interface command block. ANTLR produces a
parser that processes the above file and outputs an abstract syntax tree. This tree
is then analyzed to create the tables below. Note that the ipsec table assembles
information from the interface, crypto map, crypto ipsec, and ip
access-list command blocks.

ipAddress Table
Host      Interface   Address   Mask
router1   Ethernet0   1.1.1.1   255.255.255.0

ipsec Table
Host      SrcAddr   DstAddr   EncryptAlg   HashAlg   Filter
router1   1.1.1.1   3.3.3.3   esp-3des     hmac      aclx

acl Table
Host      Filter   Protocol   SrcAddr   DstAddr   Perm
router1   aclx     gre        3.3.3.3   4.4.4.4   permit

IP Assure’s vendor-neutral schema captures much of the configuration information
for protocols it covers. Its skipping idea allows one to parse a file without recognizing the structure of all possible commands and command blocks. However, the idea
is quite hard to get right in the ANTLR framework. One is trying to avoid writing a
grammar for the skipped part of the language, yet the only method one can use is to
write rules defining unparsed_cmd.
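The skipping idea can be illustrated outside ANTLR with a small hand-written dispatcher in Python. This is a sketch under stated assumptions, not IP Assure's parser: the names classify, build_tree, and BLOCKS are hypothetical, command blocks are assumed to be already split into token lists, and only the interface block is recognized.

```python
def classify(block):
    """Return a small tree for a recognized command block, or None to skip it."""
    main, subs = block
    if len(main) >= 2 and main[0] == "interface":
        children = []
        for sub in subs:
            if len(sub) >= 4 and sub[:2] == ["ip", "address"]:
                children.append(("address", sub[2], sub[3]))
            elif len(sub) >= 3 and sub[:2] == ["crypto", "map"]:
                children.append(("crypto_map", sub[2]))
            # any other subcommand is skipped (the role of unparsed_subcmd)
        return ("interface", main[1], children)
    return None  # unknown main command: skipped (the role of unparsed_cmd)


def build_tree(blocks):
    """Keep only the blocks that classify() recognizes."""
    return [node for node in map(classify, blocks) if node is not None]


BLOCKS = [
    (["hostname", "router1"], []),
    (["interface", "Ethernet0"], [
        ["ip", "address", "1.1.1.1", "255.255.255.0"],
        ["no", "shutdown"],  # hypothetical subcommand unknown to this sketch
        ["crypto", "map", "mapx"],
    ]),
    # hypothetical block unknown to this sketch
    (["router", "ospf", "1"], [["network", "1.1.1.0", "0.0.0.255", "area", "0"]]),
]

if __name__ == "__main__":
    print(build_tree(BLOCKS))
```

Because blocks are delimited before they are classified, skipping needs no rules of its own here; in a grammar-based parser the skipped text must still be described by rules, which is the difficulty noted above.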

9.4.2 Requirement Library
9.4.2.1 Requirements on Integrity of Logical Structures
A very useful class of requirements is on the integrity of logical structures associated with different protocols. Before a group of components executing a protocol
can accomplish an intended joint goal, various logical structures spanning these
components must be set up. These structures are set up by making component configurations satisfy definite constraints. For example, before packets flowing between
two interfaces can be secured via IPSec, the IPSec tunnel logical structure must be
set up. This is done by setting IPSec configuration parameters at the two interfaces
and ensuring that their values satisfy definite constraints. For example, the two interfaces must use the same hash and encryption algorithms, and the remote tunnel
endpoint at each interface must equal the IP address of its counterpart.
A Hot Standby Router Protocol (HSRP) [44] router cluster is another example
of a logical structure. It allows two or more routers to behave as a single router by
offering a single virtual IP address to the outside world, on a given subnet. This
address is mapped to the real address of an interface on the primary router. If this
router fails, another router takes over the virtual address. Before the cluster can function
correctly, however, the same virtual address and HSRP group identifier must be
configured on all participating interfaces, and the virtual address and all physical addresses must belong
to the same subnet.
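The HSRP constraints above can be written as a short check. The following Python sketch is illustrative only: the function name hsrp_cluster_ok and the dictionary keys are hypothetical and do not reflect IP Assure's schema.

```python
import ipaddress


def hsrp_cluster_ok(members):
    """Check the HSRP integrity constraints for a list of member interfaces.

    Each member is a dict with hypothetical keys 'address', 'mask',
    'group' and 'virtual' (the configured virtual address).
    """
    if len(members) < 2:
        return False  # a cluster needs two or more routers
    groups = {m["group"] for m in members}
    virtuals = {m["virtual"] for m in members}
    if len(groups) != 1 or len(virtuals) != 1:
        return False  # group identifier or virtual address differs
    subnets = {
        ipaddress.ip_interface(f"{m['address']}/{m['mask']}").network
        for m in members
    }
    if len(subnets) != 1:
        return False  # physical addresses are on different subnets
    subnet = next(iter(subnets))
    return ipaddress.ip_address(next(iter(virtuals))) in subnet


if __name__ == "__main__":
    cluster = [
        {"address": "10.0.0.2", "mask": "255.255.255.0", "group": 1, "virtual": "10.0.0.1"},
        {"address": "10.0.0.3", "mask": "255.255.255.0", "group": 1, "virtual": "10.0.0.1"},
    ]
    print(hsrp_cluster_ok(cluster))
```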
Much more complex logical structures are set up for BGP. Different routers in
an autonomous system (AS) connect to different neighboring ASes, giving each
router only a partial view of BGP routes. To allow all routers in an AS to construct
a complete view of routes, routers exchange information between themselves via
iBGP (internal BGP) sessions. The simplest logical structure for accomplishing
this exchange is a full-mesh of iBGP sessions, one for each pair of routers. But
a full-mesh is impractical for a large AS, since the number of sessions grows
quadratically with the number of routers. Linear growth is accomplished with a
hub-and-spoke structure. All routers exchange routes with a hub called a route
reflector. If these structures are incorrectly set up, protocol oscillations, forwarding
loops, traffic blackholes, and violation of business contracts can arise [6,31,74]. See
Section 9.6.4 for more discussion of BGP validation.
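The full-mesh requirement and the session counts behind the scaling argument can be sketched as follows. The function names are hypothetical, sessions are treated as unordered router pairs, and the hub-and-spoke count assumes a single route reflector.

```python
from itertools import combinations


def missing_full_mesh_sessions(routers, sessions):
    """Return the router pairs that lack an iBGP session."""
    have = {frozenset(s) for s in sessions}
    return [pair for pair in combinations(sorted(routers), 2)
            if frozenset(pair) not in have]


def full_mesh_size(n):
    """Sessions needed for a full mesh of n routers: one per pair."""
    return n * (n - 1) // 2


def hub_and_spoke_size(n):
    """Sessions needed when every other router peers with one route reflector."""
    return n - 1


if __name__ == "__main__":
    print(missing_full_mesh_sessions(["r1", "r2", "r3"], [("r1", "r2"), ("r3", "r1")]))
    print(full_mesh_size(100), hub_and_spoke_size(100))
```

For 100 routers the full mesh needs 4950 sessions, while a single route reflector needs 99.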
IP Assure evaluates requirements on integrity of logical structures associated
with all common protocols. These structures include IP subnets, GRE tunnels, IPSec
tunnels, MPLS [60] tunnels, BGP full-mesh or hub-and-spoke structures, OSPF subnets and areas, and HSRP router clusters.

9.4.2.2 Connectivity Requirements
Connectivity (also called reachability) is a fundamental requirement of a network.
It means the existence of a path between two nodes in the network. The most obvious network is an IP network whose nodes represent subnets and routers and whose links
represent direct connections between these. But as noted in Section 9.2, connectivity
requirements are also meaningful for many other types of networks such as GRE,
IPSec, and BGP. IP Assure evaluates connectivity for IP, VLANs, GRE, IPSec, BGP,
and MPLS networks.
IP Assure also evaluates reachability in the presence of access-control policies,
or lists, configured on routers or firewalls. An access-control list is a collection of
rules specifying the IP packets that are permitted or denied based on their source
and destination address, protocol, and source and destination ports. These rules
are order-dependent. Given a packet, the rules are scanned from the top down, and
the permit or deny action associated with the first matching rule is taken. Even if a
path exists, a given packet may fail to reach a destination because an access-control
list denies that packet.
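The first-match semantics can be made concrete with a minimal Python sketch. The rule format is hypothetical: each rule is an (action, match) pair in which match maps packet fields to required values and an omitted field matches anything. The sketch also assumes that a packet matching no rule is denied.

```python
def acl_permits(rules, packet, default="deny"):
    """Scan rules top-down and apply the action of the first matching rule."""
    for action, match in rules:
        if all(packet.get(field) == value for field, value in match.items()):
            return action == "permit"
    return default == "permit"  # assumption: implicit deny when nothing matches


RULES = [
    ("deny", {"protocol": "tcp", "dst_port": 23}),  # block telnet
    ("permit", {"protocol": "tcp"}),                # allow all other TCP
]

if __name__ == "__main__":
    print(acl_permits(RULES, {"protocol": "tcp", "dst_port": 23}))
    print(acl_permits(RULES, {"protocol": "tcp", "dst_port": 80}))
```

Reversing the two rules changes the outcome for the telnet packet, which is why the order of rules matters.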

9.4.2.3 Reliability Requirements
Reliability in a network means the ability to maintain connectivity in the presence
of failures of nodes or links. A single point of failure for connectivity between two
nodes in a network is said to exist if a single failure causes connectivity between the
two nodes to be lost. Reliability is achieved by provisioning backup resources and
setting up a reliability protocol. This protocol monitors for failures and when one
occurs, finds backup resources and attempts to restore connectivity using those.
Configuration errors may prevent backup resources from being provisioned. For
example, in Section 9.2, some GRE tunnels were only configured in one direction,
not in the other, so they were unavailable for rerouting. Even if backup
resources have been provisioned, configuration errors in the routing protocol can
prevent these resources from being found. For example, in Section 9.2, BGP was
simply not configured to run over some GRE tunnels, so it would not find these
links to reroute over.
The architecture of the fault-tolerance protocol itself can introduce a single point
of failure. For example, a nonzero OSPF area may be connected to OSPF area zero
by a single area-border-router. If that router fails, then OSPF will fail to discover
alternate routes to another area [36] even if these exist. Similarly, unless BGP route
reflectors are replicated, they can become single points of failure [7].
Furthermore, redundant resources at one layer must be mapped to redundant
resources at lower layers. For example, if all GRE tunnels originate at the same
physical interface on a router, then if that interface fails, all tunnels would simultaneously fail. Ideally, all GRE tunnels originating at a router must originate at distinct
interfaces on that router.
Single points of failure can also arise out of the dependence between security and
reliability.
As shown in Fig. 9.5, routers R1 and R2 together constitute an HSRP cluster with
R1 as the primary router. This cluster forms the gateway between an enterprise’s
internal network on the right and the WAN on the left. For security, an IPSec tunnel
is configured from R1 to the gateway router C of a collaborating site. However, this
tunnel is not replicated on R2. Consequently, if R1 fails, then R2 would take over
the cluster’s virtual address; however, IPSec connectivity to C would be lost.
Reliability requirements that IP Assure evaluates include absence of single points
of failure in IP networks, with and without access-control policies; absence of
single OSPF area-border-routers; and replication of IPSec tunnels in an HSRP
cluster.

[Diagram labels: C; WAN; R1; R2; HSRP Cluster; IPSec Tunnel 1; IPSec Tunnel 2 (marked X); Internal network]

Fig. 9.5 HSRP cluster

9.4.2.4 Security Requirements
Typical network security requirements are about data confidentiality, data integrity,
authentication, and access-control. IPSec is commonly used to satisfy the first three
requirements and access-control lists are used to satisfy the last one. Access-control
lists were discussed in Section 9.4.2.2. Components dedicated just to processing access-control lists are called firewalls. IP Assure evaluates requirements for
both these technologies. For IPSec, it evaluates the tunnel integrity requirements
in Section 9.4.2.1. For access-control lists, IP Assure evaluates two fundamental
requirements. First, an access-control list subsumes another in that any packet permitted by the second is also permitted by the first. A related requirement is that
one list is equivalent to another in that any packet permitted by one is permitted by
the other. Two lists are equivalent if each subsumes the other. An enterprise may
have multiple egress firewalls. Access-control lists on these may have been set up
by different administrators over different periods of time. It is useful to check that
the policies governing packets that leave the enterprise are equivalent. The second
requirement that IP Assure evaluates on access-control lists is that a firewall has no
redundant rules. A rule is redundant if deleting it will not change the set of packets a firewall permits. Deleting redundant rules makes lists compact and easier to
understand and maintain.

9.4.2.5 Performance Requirements
The DiffServ [19] protocol allows one to specify policies for partitioning packets into different classes, and then for according them differentiated performance treatment. For
example, a packet with a higher DiffServ class is given transmission priority over
one with a lower. Typically, voice packets are given highest priority because of the
high sensitivity of voice quality to end-to-end delays. Performance requirements
that IP Assure evaluates are that all DiffServ policies on all routers are identical,
and that any policy that is defined is actually used by being associated with an
interface.
IP Assure also evaluates the requirement that ICMP packets are not blocked. This
is a sufficient condition for avoiding packet loss due to mismatched MTU sizes and
setting of Do Not Fragment bits discussed in Section 9.2.

9.4.2.6 Government Regulatory Requirements
Government regulatory requirements represent “best practices” that have evolved
over a period of time. Compliance with these is deemed essential for connectivity,
reliability, security, and performance of an organization’s network. Compliance
with certain regulations such as the Federal Information Security Management Act
(FISMA) [26] is mandatory for government organizations. Two examples of
FISMA requirements are (a) alternate communications services do not share a single
point of failure with primary communication services, (b) all access between nodes
internal to an enterprise and those external to it is mediated by a proxy server. IP
Assure allows specification of a large number of FISMA requirements.

9.4.3 Specification Language
IP Assure’s specification language is that of graphical templates. It offers a menu of
more than 100 requirements in different categories. A user can select one or more
of these to be evaluated. For each requirement, one can specify its parameters. For
example, for a reachability requirement, one can specify the source and destination.
For an access-control list equivalence requirement, one can specify the two lists.
One cannot apply disjunction or quantification operators to requirements. The only
way to define new requirements is to program in Java and SQL.
Figure 9.6 shows a few requirement classes that can be evaluated. These are QoS
(DiffServ), HSRP, OSPF, BGP, and MPLS.

Fig. 9.6 IP Assure requirement specification screen

9.4.4 Evaluation System
Structural integrity requirements are evaluated with algorithms specialized to
each requirement. In IP Assure, these algorithms are implemented with SQL
and Java. The relevant tuples from the configuration database are extracted with
SQL and analyzed by Java programs. For example, to evaluate whether an IPSec
tunnel between two addresses local1 and local2 is set up, one checks that
there are tuples ipsec(h1, local1, remote1, ea1, ha1, filter1)
and ipsec(h2, local2, remote2, ea2, ha2, filter2) in the configuration database, and that local1 = remote2, remote1 = local2,
ea1 = ea2, ha1 = ha2 and filter1 is a mirror image of filter2.
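The tunnel check just described can be sketched in Python. The code is illustrative rather than IP Assure's implementation: the function names, the peer row for router3, and the representation of filters as a dictionary from (host, filter name) to ordered (protocol, source, destination, permission) rules are all assumptions.

```python
def is_mirror(rules_a, rules_b):
    """True if rules_b equals rules_a with source and destination swapped."""
    return rules_b == [(proto, dst, src, perm)
                       for (proto, src, dst, perm) in rules_a]


def ipsec_tunnel_ok(ipsec_rows, acl_rows, local1, local2):
    """Check that matching ipsec tuples exist at both tunnel endpoints."""
    for h1, l1, r1, ea1, ha1, f1 in ipsec_rows:
        if (l1, r1) != (local1, local2):
            continue
        for h2, l2, r2, ea2, ha2, f2 in ipsec_rows:
            if (l2, r2) != (local2, local1):
                continue  # need local1 = remote2 and remote1 = local2
            if (h1, f1) not in acl_rows or (h2, f2) not in acl_rows:
                continue  # a referenced filter is not defined
            if (ea1 == ea2 and ha1 == ha2
                    and is_mirror(acl_rows[(h1, f1)], acl_rows[(h2, f2)])):
                return True
    return False


# Hypothetical data: router1 as in Section 9.4.1, plus an invented peer router3.
IPSEC = [
    ("router1", "1.1.1.1", "3.3.3.3", "esp-3des", "hmac", "aclx"),
    ("router3", "3.3.3.3", "1.1.1.1", "esp-3des", "hmac", "acly"),
]
ACL = {
    ("router1", "aclx"): [("gre", "1.1.1.1", "3.3.3.3", "permit")],
    ("router3", "acly"): [("gre", "3.3.3.3", "1.1.1.1", "permit")],
}

if __name__ == "__main__":
    print(ipsec_tunnel_ok(IPSEC, ACL, "1.1.1.1", "3.3.3.3"))
```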
Reachability and reliability requirements for a network are evaluated by extracting the relevant graph information from the configuration database with
SQL queries, then applying graph algorithms [63]. For example, given the tuple
ipAddress(host, interface, address, mask), one creates two
nodes, the router host and the subnet whose address is the bitwise-and of
address and mask, and then creates directed edges linking these in both directions. This step is repeated for all such tuples to compute an IP network graph.
To evaluate whether a node or a link is a single point of failure for connectivity between two nodes, one removes it
from the graph and checks whether the two nodes are still reachable from each other. If not, then the deleted
node or link is a single point of failure. To check reachability in the presence of
access-control lists, all edges at which these lists block a given packet are deleted,
and then reachability analysis is repeated for the remaining graph.
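The graph construction and the single-point-of-failure test can be sketched as follows. The function names and the sample rows are hypothetical, the sketch uses a plain breadth-first search rather than the algorithms of [63], and it tests node failures only.

```python
from collections import deque
import ipaddress


def subnet_of(address, mask):
    """Subnet node name: bitwise-and of address and mask, tagged with the mask."""
    net = int(ipaddress.IPv4Address(address)) & int(ipaddress.IPv4Address(mask))
    return f"{ipaddress.IPv4Address(net)}/{mask}"


def build_graph(ip_address_rows):
    """One router node and one subnet node per row, linked in both directions."""
    graph = {}
    for host, _interface, address, mask in ip_address_rows:
        subnet = subnet_of(address, mask)
        graph.setdefault(host, set()).add(subnet)
        graph.setdefault(subnet, set()).add(host)
    return graph


def reachable(graph, src, dst, removed=frozenset()):
    """Breadth-first search from src to dst that avoids removed nodes."""
    if src in removed or dst in removed:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt not in removed:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, src, dst):
    """Nodes whose removal alone disconnects src from dst."""
    return sorted(n for n in graph
                  if n not in (src, dst) and not reachable(graph, src, dst, {n}))


ROWS = [  # hypothetical ipAddress rows
    ("r1", "e0", "10.0.1.1", "255.255.255.0"), ("r1", "e1", "10.0.2.1", "255.255.255.0"),
    ("r2", "e0", "10.0.2.2", "255.255.255.0"), ("r2", "e1", "10.0.3.2", "255.255.255.0"),
    ("r3", "e0", "10.0.2.3", "255.255.255.0"), ("r3", "e1", "10.0.3.3", "255.255.255.0"),
]

if __name__ == "__main__":
    g = build_graph(ROWS)
    print(single_points_of_failure(g, "10.0.1.0/255.255.255.0", "10.0.3.0/255.255.255.0"))
```

In this sample, routers r2 and r3 back each other up, so only r1 and the shared subnet 10.0.2.0 are single points of failure between the first and third subnets.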
Firewall requirements cannot be evaluated by enumerating all possible packets
and checking for subsumption, equivalence, or redundancy. The total number of
combinations of all source and destination addresses, ports, and protocols is astronomical: the total number of IPv4 source and
destination address, source and destination port, and protocol combinations is 2^104 (32 + 32 + 16 + 16 + 8 = 104 bits).
Instead, symbolic techniques are used. Each policy is represented as a constraint
on the following fields of a packet: source and destination address, protocol, and
source and destination ports. The constraint is true precisely for those packets that
are permitted by the firewall, taking rule ordering into account. Let P1 and P2 be
two policies and C1 and C2 be, respectively, the constraints representing them. Each
constraint can be constructed in time linear in the number of rules. Then, P1 is subsumed by P2 if there is no solution to the constraint C1 ∧ ¬C2. To check that
a rule in P1 is redundant, delete it from P1 and check that the resulting policy is
equivalent to P1.
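The definitions of subsumption, equivalence, and redundancy can be illustrated on a toy domain, where enumeration is still feasible. The Python sketch below is not the symbolic method: it searches exhaustively for a solution of C1 ∧ ¬C2 over a small range of addresses. Its names are hypothetical, rules are (source low, source high, destination low, destination high, action) ranges, and packets matching no rule are assumed to be denied. A real evaluator would hand the same question to a constraint solver.

```python
from itertools import product


def permits(policy, src, dst):
    """First-match evaluation of a list of range rules."""
    for s_lo, s_hi, d_lo, d_hi, action in policy:
        if s_lo <= src <= s_hi and d_lo <= dst <= d_hi:
            return action == "permit"
    return False  # assumption: implicit deny


def counterexample(p1, p2, domain):
    """Return a packet permitted by p1 but not by p2, or None if there is none."""
    for src, dst in product(domain, repeat=2):
        if permits(p1, src, dst) and not permits(p2, src, dst):
            return (src, dst)
    return None


def subsumed_by(p1, p2, domain):
    return counterexample(p1, p2, domain) is None


def equivalent(p1, p2, domain):
    return subsumed_by(p1, p2, domain) and subsumed_by(p2, p1, domain)


def redundant_rules(policy, domain):
    """Indices of rules whose deletion leaves the policy equivalent."""
    return [i for i in range(len(policy))
            if equivalent(policy[:i] + policy[i + 1:], policy, domain)]


DOMAIN = range(32)  # toy address space, small enough to enumerate
POLICY = [
    (1, 2, 3, 4, "deny"),
    (5, 6, 7, 8, "permit"),
    (10, 15, 15, 20, "permit"),
]

if __name__ == "__main__":
    print(redundant_rules(POLICY, DOMAIN))
    print(counterexample(POLICY, [(5, 6, 7, 8, "permit")], DOMAIN))
```

Under the implicit-deny assumption the deny rule in POLICY is redundant, since no later rule permits the packets it blocks.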
For example, let a firewall contain the following rules that, for simplicity, only
check whether the source and destination addresses are in definite ranges:
1, 2, 3, 4, deny
5, 6, 7, 8, permit
10, 15, 15, 20, permit

The first rule states that any packet with source address between 1 and 2 and destination address between 3 and 4 is denied. Similarly, for the second and third rules.
These are represented by the following constraint C1 on the variables src and dst.
¬(1 =< [...]

[...] >=, and bitwise logic operators. This QFF is then efficiently solved by Kodkod. If ConfigAssure is unable to
find a solution, it outputs a proof of unsolvability, inherited from Kodkod. This proof
is interpreted as a root-cause and guides configuration repair. Arithmetic quantifierfree forms constitute a good intermediate language between Boolean logic and
first-order logic. Not only is it easy to express requirements in it, but it can also
be efficiently compiled into Boolean logic. ConfigAssure was designed to avoid,
where possible, the generation of very large intermediate constraints in Kodkod’s
transformation of first-order logic into Boolean.
If the fields that are responsible for making a requirement false are known, then
one way to repair these is as follows: replace these fields with variables and use
ConfigAssure to find new values of these variables that make the requirement true.
Two approaches can be used to narrow down these fields. The first exploits the
proof of unsolvability of the falsified requirement to compute a type of root-cause.
The second exploits properties of Datalog proofs and ZChaff to compute that set of
fields whose cost of change is minimal. The second approach has been developed in
the MulVAL [35,55,56] system. More generally, MulVAL is a system for enterprise
security analysis using attack graphs.
Ordered Binary Decision Diagrams are an alternative to SAT solvers for evaluating firewall policy subsumption and rule redundancy with a method conceptually
similar to that in Section 9.4.4.
The use of these techniques for building different parts of a validation system is
now illustrated with concrete examples based on the case study in Section 9.2.

9.5.1 Configuration Acquisition by Querying
When the structure of a configuration file is simple, as it is for Cisco’s IOS, then
it is not necessary to write a grammar with ANTLR or PADS/ML [47]. Instead,
the structure can be put into a command database and then queried to construct the
configuration database. The query needs to refer only to that part of the command
database necessary to construct a given table. All other parts are ignored. This idea
provides substantial resilience to insertion of new command blocks, insertion of
new subcommands in a known command block, and insertion of new keywords in
subcommands.
This idea is illustrated using Prolog, although any database engine could be used.
Each command block is transformed into an ios_cmd tuple or Prolog fact, with the
structure
ios_cmd(FileName, MainCommand, ListOfSubCommands)

where MainCommand and each item in ListOfSubCommands is of the form
[NestingLevel | ListOfTokens]. [A|B] means the list with head A and tail
B. For example, the IOS file of Section 9.4.1, named f here, is transformed into the
following Prolog tuples:
ios_cmd(f, [0, hostname, router1], []).
ios_cmd(f,
    [0, interface, 'Ethernet0'],
    [ [1, ip, address, '1.1.1.1', '255.255.255.0'],
      [1, crypto, map, mapx] ]).
ios_cmd(f,
    [0, crypto, map, mapx, 6, 'ipsec-isakmp'],
    [ [1, set, peer, '3.3.3.3'],
      [1, set, 'transform-set', transx],
      [1, match, address, aclx] ]).
ios_cmd(f,
    [0, crypto, ipsec, 'transform-set', transx, 'esp-3des', hmac],
    []).
ios_cmd(f,
    [0, ip, 'access-list', extended, aclx],
    [ [1, permit, gre, host, '3.3.3.3', host, '4.4.4.4'] ]).

Note the close correspondence between the structure of command blocks in the IOS
file and associated ios_cmd tuples. One can now write Prolog rules to construct the
configuration database. For instance, to construct rows for the ipAddress table,
one can use:
ipAddress(H, I, A, M) :-
    ios_cmd(File, [0, hostname, H|_], _),
    ios_cmd(File, [0, interface, I|_], Args),
    member(SubCmd, Args),
    subsequence([ip, address, A, M], SubCmd).

The syntactic convention followed in Prolog is that identifiers beginning with capital
letters are variables, otherwise they are constants. The :- symbol is a shorthand for
if. All variables are universally quantified. The rule states that ipAddress of an
interface I on host H is A with mask M if there is a File containing a hostname
command declaring host H, an interface command declaring interface I, and
a subcommand of that command declaring its address and mask to be A and M,
respectively.

Note that this definition is unaffected by subcommands of the interface command that are not of interest for computing ipAddress, or that are defined in
a subsequent IOS release. It only tries to find a subcommand containing the sequence [ip, address, A, M]. It does not require that the subcommand be in
a definite position in the block, or that the sequence ip, address, A, M appear in a definite position in the subcommand. Now, where H, I, A, M are variables, the
query ipAddress(H, I, A, M) will succeed with the solution H = router1, I =
'Ethernet0', A = '1.1.1.1' and M = '255.255.255.0'. Here H is
a host, I is an interface on this host, and A and M are its address and mask, respectively.
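The same query can be sketched in Python for readers less familiar with Prolog. The names IOS_CMD, find_after, and ip_address are hypothetical, and the description subcommand is an invented addition that shows that unknown subcommands do not disturb the query.

```python
IOS_CMD = [
    ("f", [0, "hostname", "router1"], []),
    ("f", [0, "interface", "Ethernet0"], [
        [1, "description", "uplink"],  # hypothetical subcommand, ignored by the query
        [1, "ip", "address", "1.1.1.1", "255.255.255.0"],
        [1, "crypto", "map", "mapx"],
    ]),
]


def find_after(keywords, tokens, count):
    """Return the `count` tokens that follow `keywords` anywhere in `tokens`."""
    n = len(keywords)
    for i in range(len(tokens) - n - count + 1):
        if tokens[i:i + n] == keywords:
            return tokens[i + n:i + n + count]
    return None


def ip_address(db):
    """Build (host, interface, address, mask) rows from the command database."""
    rows = []
    for file, main, _ in db:
        if len(main) < 3 or main[:2] != [0, "hostname"]:
            continue
        host = main[2]
        for file2, main2, subs in db:
            if file2 != file or len(main2) < 3 or main2[:2] != [0, "interface"]:
                continue
            for sub in subs:
                found = find_after(["ip", "address"], sub, 2)
                if found is not None:
                    rows.append((host, main2[2], found[0], found[1]))
    return rows


if __name__ == "__main__":
    print(ip_address(IOS_CMD))
```

As in the Prolog rule, the query looks only at the hostname and interface blocks and at the tokens that follow ip address, so other blocks and subcommands are ignored.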
ipsec is more complex but querying simplifies the assembly of information
from different parts of a configuration file. For each interface, one finds the name
of a crypto map Map applied to that interface, and then finds the corresponding crypto map command, from which one can extract the peer address Peer,
the filter Filter, and transform-set Transform. These values are used to select the crypto ipsec command from which the Encrypt and Hash values
are extracted. Thus, the tuple ipsec(H, Address, Peer, Encrypt,
Hash, Filter) is constructed.
ipsec(H, Address, Peer, Encrypt, Hash, Filter) :-
    ios_cmd(File, [0, interface, I |_], Args),
    member([_, crypto, map, Map |_], Args),
    ios_cmd(File, [0, hostname, H |_], _),
    ipAddress(H, I, Address, _),
    ios_cmd(File, [0, crypto, map, Map |_], CArgs),
    member([_, set, peer, Peer |_], CArgs),
    member([_, match, address, Filter |_], CArgs),
    member([_, set, 'transform-set', Transform |_], CArgs),
    ios_cmd(File, [0, crypto, ipsec,
        'transform-set', Transform, Encrypt, Hash], _).

The ipAddress and ipsec tuples are constructed in all possible ways via Prolog
backtracking. Together, these form the configuration database for these protocols.
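The assembly of an ipsec row from several command blocks can also be sketched in Python. The helper names blocks and sub_value are hypothetical, and the sketch takes the first matching subcommand rather than enumerating all solutions by backtracking.

```python
IOS_CMD = [
    ("f", [0, "hostname", "router1"], []),
    ("f", [0, "interface", "Ethernet0"], [
        [1, "ip", "address", "1.1.1.1", "255.255.255.0"],
        [1, "crypto", "map", "mapx"]]),
    ("f", [0, "crypto", "map", "mapx", 6, "ipsec-isakmp"], [
        [1, "set", "peer", "3.3.3.3"],
        [1, "set", "transform-set", "transx"],
        [1, "match", "address", "aclx"]]),
    ("f", [0, "crypto", "ipsec", "transform-set", "transx", "esp-3des", "hmac"], []),
    ("f", [0, "ip", "access-list", "extended", "aclx"], [
        [1, "permit", "gre", "host", "3.3.3.3", "host", "4.4.4.4"]]),
]


def blocks(db, file, prefix):
    """Yield (main, subs) for blocks of `file` whose main command starts with prefix."""
    for f, main, subs in db:
        if f == file and main[:len(prefix)] == prefix:
            yield main, subs


def sub_value(subs, keywords):
    """Token following `keywords` in the first subcommand that starts with them."""
    n = len(keywords)
    for sub in subs:
        if sub[1:1 + n] == keywords and len(sub) > 1 + n:
            return sub[1 + n]
    return None


def ipsec(db):
    """Build (host, address, peer, encrypt, hash, filter) rows."""
    rows = []
    for file in sorted({f for f, _, _ in db}):
        for host_main, _ in blocks(db, file, [0, "hostname"]):
            for _if_main, if_subs in blocks(db, file, [0, "interface"]):
                map_name = sub_value(if_subs, ["crypto", "map"])
                address = sub_value(if_subs, ["ip", "address"])
                if len(host_main) < 3 or map_name is None or address is None:
                    continue
                for _, c_subs in blocks(db, file, [0, "crypto", "map", map_name]):
                    peer = sub_value(c_subs, ["set", "peer"])
                    filt = sub_value(c_subs, ["match", "address"])
                    transform = sub_value(c_subs, ["set", "transform-set"])
                    prefix = [0, "crypto", "ipsec", "transform-set", transform]
                    for t_main, _ in blocks(db, file, prefix):
                        if len(t_main) == 7 and None not in (peer, filt):
                            rows.append((host_main[2], address, peer,
                                         t_main[5], t_main[6], filt))
    return rows


if __name__ == "__main__":
    print(ipsec(IOS_CMD))
```

The result for the sample file is the single row shown in the ipsec table of Section 9.4.1.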

9.5.2 Specification Language
This section shows how Prolog can be used to specify the types of requirements in
the case study of Section 9.2. Prolog has already been used to validate VPN and BGP
requirements [50, 58].
As shown in Fig. 9.9, routers RA and RB are in the same COI but RX is in a
different COI. RA’s configuration violates two security requirements and one connectivity requirement. First, RA has a GRE tunnel into RX. Second, RA has a default
static route with which it can forward packets destined for RX to the WAN. Third,
RA does not have a GRE tunnel into RB. All these violations need to be detected
and the configurations repaired.

[Diagram labels: RA (COI1, eth_0 address = 100); RB (COI1, eth_0 address = 200); RX (COI2, eth_0 address = 300); WAN; tunnel_0]

Fig. 9.9 Network violating security and connectivity requirements

A configuration database for the above network is represented by the following
Prolog tuples:
static_route(ra, 0, 32, 400).
gre(ra, tunnel_0, 100, 300).
ipAddress(ra, eth_0, 100, 0).
ipAddress(rb, eth_0, 200, 0).
ipAddress(rx, eth_0, 300, 0).
coi([ra-coi1, rb-coi1, rx-coi2]).

The first tuple states that router ra has a default static route with a next hop of address 400. Normally, a mask is a sequence of 32 bits containing a sequence of ones
followed by a sequence of zeros. In the static_route and ipAddress tuples, a mask is represented
implicitly as the number of zeros at the end of the sequence. This simplifies the
computations we need. The route is called “default” because its mask of 32 zeros means that any address matches
it. The second tuple states that router ra has a GRE tunnel originating from GRE interface tunnel_0 with local physical address 100 and remote physical address
300. The third tuple states that router ra has a physical interface eth_0 with address 100 and mask 0. Similarly, for the fourth and fifth tuples. The last tuple
lists the community of interest of each router. Requirements are defined with Prolog
clauses, e.g.:
good :- gre_connectivity(ra, rb).
gre_connectivity(RX, RY) :-
    gre_tunnel(RX, RY),
    route_available(RX, RY).
gre_tunnel(RX, RY) :-
    gre(RX, _, _, RemoteAddr),
    ipAddress(RY, _, RemoteAddr, _).
route_available(RX, RY) :-
    static_route(RX, Dest, Mask, _),
    ipAddress(RY, _, RemotePhysical, 0),
    contained(Dest, Mask, RemotePhysical, 0).
contained(Dest, Mask, Addr, M) :-
    Mask >= M,
    N is ((2^32-1) << Mask) /\ Dest,
    N is ((2^32-1) << Mask) /\ Addr.
bad :- gre_tunnel(ra, rx).
bad :- route_available(ra, rx).

The first clause states that good is true provided there is GRE connectivity between
routers ra and rb since they are in the same COI. The second clause states that
there is GRE connectivity between any two routers RX and RY provided RX has a
GRE tunnel configured to RY and a route available to RY. The third clause states
that a GRE tunnel to RY is configured on RX provided there is a GRE tuple on
RX whose remote address is that of an interface on RY. The fourth clause states
that a route to RY is available on RX provided an address RemotePhysical on
RY is contained within the address range of a static route on RX. The fifth clause
checks this containment: the two "N is" goals succeed together only if Dest and Addr agree on all bits above the mask. << is the left-shift operator and /\ is the bitwise-and
operator, not to be confused with the conjunction operator. The sixth clause states
that bad is true provided there is a GRE tunnel between ra and rx since ra and rx
are not in the same COI. The last clause states that bad is also true provided a route
on ra is available for packets with a destination on rx.
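The clauses above can be mirrored in a few lines of Python to make the evaluation concrete. This is a minimal sketch, not the chapter's implementation: the tuple layout follows the Prolog database, while the Python names (STATIC_ROUTES, GRE, IP_ADDRESS, and the function names) are assumptions introduced for illustration.

```python
# Configuration database, copied from the Prolog tuples.
STATIC_ROUTES = [("ra", 0, 32, 400)]        # (router, dest, mask, next_hop)
GRE = [("ra", "tunnel_0", 100, 300)]        # (router, interface, local, remote)
IP_ADDRESS = [("ra", "eth_0", 100, 0),      # (router, interface, address, mask)
              ("rb", "eth_0", 200, 0),
              ("rx", "eth_0", 300, 0)]

ALL_ONES = 2**32 - 1


def contained(dest, mask, addr, m):
    """True if addr/m lies inside dest/mask; a mask counts trailing zero bits."""
    if mask < m:
        return False
    prefix = ALL_ONES << mask
    return (prefix & dest) == (prefix & addr)


def gre_tunnel(rx, ry):
    """rx has a GRE tuple whose remote address is an interface address on ry."""
    return any(r == rx and remote == addr
               for (r, _, _, remote) in GRE
               for (owner, _, addr, _) in IP_ADDRESS if owner == ry)


def route_available(rx, ry):
    """Some static route on rx covers an interface address (mask 0) on ry."""
    return any(r == rx and contained(dest, mask, addr, 0)
               for (r, dest, mask, _) in STATIC_ROUTES
               for (owner, _, addr, m) in IP_ADDRESS if owner == ry and m == 0)


def gre_connectivity(rx, ry):
    return gre_tunnel(rx, ry) and route_available(rx, ry)


good = gre_connectivity("ra", "rb")
bad = gre_tunnel("ra", "rx") or route_available("ra", "rx")
print(good, bad)
```

Python integers, like Prolog's, are unbounded, so shifting the 32-bit pattern of ones left by 32 leaves no low-order bits set and every address compares equal under the default route.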
We now show how to capture requirements containing quantifiers. To capture the
requirement all_good that between every pair of routers in a COI there is GRE
connectivity, we can write:
all_good :- not(same_coi_no_gre).
same_coi_no_gre :- same_coi(X, Y), not(gre_connectivity(X, Y)).
same_coi(X, Y) :- coi(L), member(X-C, L), member(Y-C, L).

The first rule states all_good is true provided same_coi_no_gre is false. The
second rule states that same_coi_no_gre is true provided there exist X and Y that
are in the same COI but for which gre_connectivity(X, Y) is false. The last
rule states that X and Y are in the same COI provided there is some COI C such that
X-C and Y-C are in the COI association list L.
Similarly, we can capture the requirement no_bad that no router contains a route
to a router in a different COI.
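The double negation used for all_good corresponds to a universally quantified check. The following Python sketch illustrates this under stated assumptions: the set `connected` of router pairs with GRE connectivity is a hypothetical input standing in for the gre_connectivity predicate, and, unlike the Prolog rule as written, the sketch skips the pair formed by a router with itself.

```python
# COI association list, as in the coi/1 tuple.
COI = [("ra", "coi1"), ("rb", "coi1"), ("rx", "coi2")]


def same_coi(x, y):
    """True if x and y are associated with a common COI."""
    return any(cx == cy
               for (a, cx) in COI if a == x
               for (b, cy) in COI if b == y)


def same_coi_no_gre(connected):
    """True if some pair of distinct same-COI routers lacks GRE connectivity."""
    routers = [r for (r, _) in COI]
    return any(same_coi(x, y) and (x, y) not in connected
               for x in routers for y in routers if x != y)


def all_good(connected):
    # "for all pairs" expressed as "there is no counterexample"
    return not same_coi_no_gre(connected)


print(all_good({("ra", "rb"), ("rb", "ra")}))  # both directions connected
print(all_good({("ra", "rb")}))                # rb -> ra is missing
```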
As previously mentioned, the MulVAL system has proposed the use of Datalog for specification and analysis of attack graphs. Datalog is a restriction of
Prolog in which arguments to relations are just variables or atomic terms, i.e., no
complex terms and data structures. This restriction means, in particular, that predicates such as all_good and all_pairs_gre cannot be specified and neither can
subnet_id since it needs bitwise operations. However, the first five Prolog tuples
above and the first three rules can be specified. In return, this restriction permits
MulVAL to perform fine-grained analysis of root-causes of configuration errors and
to compute strategies for their repair. This is discussed in Section 9.5.4.

9.5.3 Evaluation for Repair
If a configuration database and requirements are expressed in Prolog, then Prolog's query
capability can be used to evaluate whether requirements are true. For example,
the query route_available(ra, rb) is evaluated to be true by clauses for
route_available, static_route, and contained. The query bad succeeds for two reasons. First, the static route on ra is a default route. It forwards
packets to any destination, including to destinations in a different COI. Second,
a GRE tunnel to router rx is configured on ra even though rx is in a different COI. On the other hand, the query good fails. This is because the predicate
gre_tunnel(ra, rb) fails. The only GRE tunnel configured on ra is to rx,
not to rb.
If requirement evaluation against a configuration database is the only goal, then
a Prolog-based validation system is practical on a realistic scale. However, if a requirement is false for a configuration database and the goal is to change some fields
in some tuples so that the requirement becomes true, then Prolog is not adequate.
The Prolog query (good,not(bad)), representing the conjunction of good
and not(bad), will simply fail. Prolog will not return new values of these fields
that make the query true.
In order to efficiently compute new values of these fields, a constraint solver
with the capability to compute a proof of unsolvability is needed. Such a capability
is provided by the ConfigAssure system. ConfigAssure allows one to replace some
fields in some tuples in a configuration database with configuration variables. These
variables are unrelated to Prolog variables. ConfigAssure also allows one to specify
a requirement R as an equivalent QFF RC on these configuration variables. Solving
RC would compute new values of these fields, in effect repairing the fields.
For example, suppose we suspect that the query (good,not(bad)) fails because addresses and the static route mask are incorrect. We can replace all these
with configuration variables to obtain the following database:
static_route(ra, dest(0), mask(0), 400).
gre(ra, tunnel_0, gre_a_local(0), gre_a_remote(0)).
ipAddress(ra, eth_0, ra_addr(0), 0).
ipAddress(rb, eth_0, rb_addr(0), 0).
ipAddress(rx, eth_0, rx_addr(0), 0).
coi([ra-coi1, rb-coi1, rx-coi2]).

Here, dest(0), mask(0), gre_a_local(0), gre_a_remote(0),
ra_addr(0), rb_addr(0), and rx_addr(0) are all configuration variables.
In order that this database satisfy (good ∧ not(bad)), these configuration
variables must satisfy the following constraint RC:
¬ gre_a_remote(0)=rx_addr(0)
∧ ¬ contained(dest(0), mask(0), rx_addr(0), 0)
∧ gre_a_remote(0)=rb_addr(0)
∧ contained(dest(0), mask(0), rb_addr(0), 0)
∧ ¬ ra_addr(0)=rb_addr(0) ∧ ¬ rb_addr(0)=rx_addr(0) ∧ ¬ rx_addr(0)=ra_addr(0)

The constraint on the first two lines is equivalent to not(bad). It states that ra
should neither have a GRE tunnel nor a static route to rx. The constraint on the
next two lines is equivalent to good. It states that ra should have both a GRE tunnel and a static route to rb. The constraint on the last line states that all interface
addresses are unique. Solving this constraint would indeed find new values of configuration variables and hence repair the fields. However, the solver may change fields,
such as ra_addr(0), unrelated to the failure of (good,not(bad)). To change
only fields related to the failure, one can exploit the proof of unsolvability that ConfigAssure automatically computes when it fails to solve a requirement. This proof is
typically a small, unsolvable part of the requirement, and can be taken to be a
root-cause of unsolvability.
The idea is to generate a new constraint InitVal that is a conjunction of equations of the form x = c where x is a configuration variable that replaced a field
and c is the initial value of that field. Now try to solve RC ∧ InitVal. Since R is
false for the database without variables, ConfigAssure will find RC ∧ InitVal to be
unsolvable and return a proof of unsolvability. If, in this proof, there is an equation
x = c that is also in InitVal, then relax the value of x by deleting x = c from
InitVal to create InitVal’. Reattempt a solution to RC ∧ InitVal’ to find a
new value of x. More than one such equation can be deleted in a single step. For
example, the definition of InitVal for the above configuration variables is:
dest(0)=0
∧ mask(0)=32
∧ gre_a_local(0)=100
∧ gre_a_remote(0)=300
∧ ra_addr(0)=100
∧ rb_addr(0)=200
∧ rx_addr(0)=300

Submitting RC ∧ InitVal to ConfigAssure generates a proof of unsolvability showing that
ra should have a tunnel to rb but instead has one to rx:
gre_a_remote(0)=rb_addr(0) ∧ gre_a_remote(0)=300 ∧ rb_addr(0)=200

Deleting the second equation of this proof, gre_a_remote(0)=300, from InitVal to obtain InitVal’ and solving
RC ∧ InitVal’, we obtain another proof of unsolvability showing that ra has a static route
to rx:
rx_addr(0)=300 ∧ dest(0)=0 ∧ mask(0)=32 ∧ ¬ contained(dest(0), mask(0), rx_addr(0), 0)

Deleting the second and third equations of this proof, dest(0)=0 and mask(0)=32, from InitVal’ as well and solving, we obtain a solution that fixes
both the GRE tunnel and the static route on ra:
dest(0)=200
mask(0)=0
gre_a_remote(0)=200
gre_a_local(0)=100
ra_addr(0)=100
rb_addr(0)=200
rx_addr(0)=300

Values of just the first three variables needed to be recomputed. The others keep their initial values. Note that ra_addr(0) never appeared in a proof of unsolvability
even though it did in RC. Thus, its value does not need to be recomputed.
This is not obvious from RC. Note also that repair is holistic in that it satisfies both
good and not(bad).
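The relax-and-resolve steps above can be replayed with a small Python sketch. It rests on several assumptions: a brute-force search over a tiny, hypothetical set of candidate values stands in for ConfigAssure's solver, no proof of unsolvability is produced, and the variables to relax are simply those named in the walk-through. The Python names are illustrative only.

```python
from itertools import product

ALL_ONES = 2**32 - 1


def contained(dest, mask, addr, m):
    prefix = ALL_ONES << mask
    return mask >= m and (prefix & dest) == (prefix & addr)


def rc(v):
    """The constraint RC over an assignment v of the configuration variables."""
    not_bad = (v["gre_a_remote"] != v["rx_addr"]
               and not contained(v["dest"], v["mask"], v["rx_addr"], 0))
    good = (v["gre_a_remote"] == v["rb_addr"]
            and contained(v["dest"], v["mask"], v["rb_addr"], 0))
    unique = len({v["ra_addr"], v["rb_addr"], v["rx_addr"]}) == 3
    return not_bad and good and unique


# InitVal: the initial value of every field that became a variable.
INIT_VAL = {"dest": 0, "mask": 32, "gre_a_local": 100, "gre_a_remote": 300,
            "ra_addr": 100, "rb_addr": 200, "rx_addr": 300}

# Assumed candidate values for relaxed variables (hypothetical, kept tiny).
ADDRESSES = [0, 100, 200, 300]
MASKS = [0, 32]


def solve(relaxed):
    """Search values for the relaxed variables; all others stay at InitVal."""
    names = sorted(relaxed)
    domains = [MASKS if n == "mask" else ADDRESSES for n in names]
    for values in product(*domains):
        candidate = {**INIT_VAL, **dict(zip(names, values))}
        if rc(candidate):
            return candidate
    return None


print(solve(set()))                             # RC with all of InitVal
print(solve({"gre_a_remote"}))                  # tunnel relaxed, route still fixed
print(solve({"gre_a_remote", "dest", "mask"}))  # tunnel and route relaxed
```

Only the third call finds an assignment, and it leaves gre_a_local and the three interface addresses at their initial values, which is the point of relaxing only the equations implicated in a proof of unsolvability.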
The remaining task is generation of the constraint RC. It is accomplished by
treating specification as a method of computing an equivalent quantifier-free
formula, i.e., defining the predicate eval(Req, RC) where Req is the name of a
requirement and RC is a QFF equivalent to Req. The original Prolog specification
of Req in Section 9.5.2 is no longer needed. It is replaced by a metalevel version as
follows:
eval(bad, or(C1, C2)) :-
    eval(gre_tunnel(ra, rx), C1),
    eval(route_available(ra, rx), C2).
eval(gre_tunnel(RX, RY), RemoteAddr=Addr) :-
    gre(RX, _, _, RemoteAddr),
    ipAddress(RY, _, Addr, _).
eval(route_available(RX, RY), C) :-
    static_route(RX, Dest, Mask, _),
    ipAddress(RY, _, RemotePhysical, _),
    C = contained(Dest, Mask, RemotePhysical, 0).
eval(addr_unique, C) :-
    andEach([not(ra_addr(0)=rb_addr(0)),
             not(rb_addr(0)=rx_addr(0)),
             not(rx_addr(0)=ra_addr(0))], C).
eval(topReq, C) :-
    eval(good, G),
    eval(bad, B),
    eval(addr_unique, AU),
    % the requirement is good and not(bad), so the QFF for bad is negated
    andEach([G, not(B), AU], C).

These rules capture the semantics of the Prolog rules. The first states that
a QFF equivalent to bad is the disjunction of C1 and C2 where C1 is the
QFF equivalent to gre_tunnel(ra, rx) and C2 is the QFF equivalent to
route_available(ra, rx). The second rule states that the QFF equivalent to gre_tunnel(RX, RY) is RemoteAddr=Addr where RemoteAddr
is the remote physical address of a GRE tunnel on RX and Addr is the address of an interface on RY. The third rule states that the QFF equivalent to
route_available(RX, RY) is C provided C is the constraint that RX contains a static route for an address on RY. The fourth rule computes the QFF for all
interface addresses being unique. The last rule computes the QFF for the top-level
constraint topReq.
Now, the Prolog query eval(topReq, RC) computes RC as above. As has
been shown in [51], QFFs are much more expressive than Boolean logic, so it is not
hard to write requirements using the eval predicate.
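The metalevel idea, that evaluating a requirement returns a formula rather than a truth value, can be sketched in Python as follows. This is an illustration under assumptions, not ConfigAssure's code: configuration variables are written as strings such as "dest(0)", QFFs are nested tuples, each lookup takes the first matching tuple instead of backtracking as Prolog would, eval_good is assumed by analogy because no rule for good is listed, and the top-level formula negates the QFF for bad as in RC.

```python
# Database in which some fields have been replaced by configuration variables.
STATIC_ROUTE = [("ra", "dest(0)", "mask(0)", 400)]
GRE = [("ra", "tunnel_0", "gre_a_local(0)", "gre_a_remote(0)")]
IP_ADDRESS = [("ra", "eth_0", "ra_addr(0)", 0),
              ("rb", "eth_0", "rb_addr(0)", 0),
              ("rx", "eth_0", "rx_addr(0)", 0)]


def address_of(router):
    return next(addr for (r, _, addr, _) in IP_ADDRESS if r == router)


def eval_gre_tunnel(rx, ry):
    remote = next(rem for (r, _, _, rem) in GRE if r == rx)
    return ("eq", remote, address_of(ry))


def eval_route_available(rx, ry):
    dest, mask = next((d, m) for (r, d, m, _) in STATIC_ROUTE if r == rx)
    return ("contained", dest, mask, address_of(ry), 0)


def eval_bad():
    return ("or", eval_gre_tunnel("ra", "rx"), eval_route_available("ra", "rx"))


def eval_good():  # assumed by analogy; the rule for good is not listed
    return ("and", eval_gre_tunnel("ra", "rb"), eval_route_available("ra", "rb"))


def eval_addr_unique():
    a, b, x = "ra_addr(0)", "rb_addr(0)", "rx_addr(0)"
    return ("and", ("not", ("eq", a, b)), ("not", ("eq", b, x)),
            ("not", ("eq", x, a)))


def eval_top_req():
    return ("and", eval_good(), ("not", eval_bad()), eval_addr_unique())


def holds(qff, env):
    """Evaluate a QFF once every configuration variable has a value in env."""
    op, *args = qff
    if op == "and":
        return all(holds(a, env) for a in args)
    if op == "or":
        return any(holds(a, env) for a in args)
    if op == "not":
        return not holds(args[0], env)
    values = [env.get(a, a) if isinstance(a, str) else a for a in args]
    if op == "eq":
        return values[0] == values[1]
    if op == "contained":
        dest, mask, addr, m = values
        prefix = (2**32 - 1) << mask
        return mask >= m and (prefix & dest) == (prefix & addr)
    raise ValueError(f"unknown operator {op!r}")


initial = {"dest(0)": 0, "mask(0)": 32, "gre_a_local(0)": 100,
           "gre_a_remote(0)": 300, "ra_addr(0)": 100,
           "rb_addr(0)": 200, "rx_addr(0)": 300}
repaired = {**initial, "dest(0)": 200, "mask(0)": 0, "gre_a_remote(0)": 200}

rc = eval_top_req()
print(holds(rc, initial), holds(rc, repaired))
```

Because the formula is plain data, the same object can be handed to a solver, inspected, or checked against a candidate assignment as done here.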

9.5.4 Repair with MulVAL
The MulVAL system proposes an alternative, precise method of computing the fields
that cause the success of an undesirable requirement provided that requirement is
expressed in Datalog. A requirement, such as bad, is said to be undesirable if it
enables adversary success. This method is based on the observation that any tuple
in a proof of an undesirable requirement is responsible for the truth of that requirement. These tuples contain all the fields that need to be replaced by configuration
variables. For example, one proof of bad with the original Prolog specification in
Section 9.5.2 is:
bad
gre_tunnel(ra, rx)
gre(ra, tunnel_0, 100, 300) ∧ ipAddress(rx, eth_0, 300, 0)

Here, each condition is implied by its successor by the use of a rule in the Prolog
specification. The second proof of bad is:
bad
route_available(ra, rx)
static_route(ra, 0, 32, 400) ∧ ipAddress(rx, eth_0, 300, 0) ∧ contained(0, 32, 300, 0)

The tuples that contribute to the proof of bad are:
gre(ra, tunnel_0, 100, 300) -- from the first proof
ipAddress(rx, eth_0, 300, 0) -- from both proofs
static_route(ra, 0, 32, 400) -- from the second proof

The following tuples do not contribute to the proof of bad:
ipAddress(ra, eth_0, 100, 0).
ipAddress(rb, eth_0, 200, 0).

The three tuples in the proofs of bad contain all the fields that need to be replaced by
configuration variables. Note that the addresses of the interfaces at ra and rb do not need
to be replaced.
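Collecting the responsible tuples amounts to gathering the leaves of each proof that are also database tuples. The Python sketch below shows this for the two proofs above; representing a proof as a nested (goal, children) pair and a tuple as a string is an assumption made for illustration and is not MulVAL's data model.

```python
# The configuration database, with each tuple written as a string.
DATABASE = {
    "static_route(ra, 0, 32, 400)",
    "gre(ra, tunnel_0, 100, 300)",
    "ipAddress(ra, eth_0, 100, 0)",
    "ipAddress(rb, eth_0, 200, 0)",
    "ipAddress(rx, eth_0, 300, 0)",
}

# Each proof is (goal, [subproofs]); a leaf has an empty list of subproofs.
PROOFS_OF_BAD = [
    ("bad", [("gre_tunnel(ra, rx)", [
        ("gre(ra, tunnel_0, 100, 300)", []),
        ("ipAddress(rx, eth_0, 300, 0)", []),
    ])]),
    ("bad", [("route_available(ra, rx)", [
        ("static_route(ra, 0, 32, 400)", []),
        ("ipAddress(rx, eth_0, 300, 0)", []),
        ("contained(0, 32, 300, 0)", []),   # computed, not a database tuple
    ])]),
]


def leaves(proof):
    """Return the set of leaf goals of a proof tree."""
    goal, children = proof
    if not children:
        return {goal}
    return set().union(*(leaves(child) for child in children))


all_leaves = set().union(*(leaves(p) for p in PROOFS_OF_BAD))
responsible = all_leaves & DATABASE     # tuples whose fields may need changing
untouched = DATABASE - responsible      # tuples that play no part in bad

print(sorted(responsible))
print(sorted(untouched))
```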

The MulVAL system does not actually compute new values of fields. It only
computes the set of tuples that should be disabled to disable all proofs of the undesirable property. A tuple can be disabled by changing its fields to different values
or deleting it. MulVAL does, however, compute this set in an optimal way. It first derives a
Boolean formula representing all the ways in which tuples should be disabled, then
solves this with a minimum-cost SAT solver. A solution represents a set of tuples to
disable. For example, the Boolean formula for the above two proofs is:
(¬ gre(ra, tunnel_0, 100, 300) ∨ ¬ ipAddress(rx, eth_0, 300, 0)) ∧
(¬ ipAddress(rx, eth_0, 300, 0) ∨ ¬ static_route(ra, 0, 32, 400))

The first disjunction states that to disable the first proof, either the gre tuple or the
ipAddress tuple must be disabled. The second disjunction states that to disable the
second proof, either the ipAddress or the static_route tuple must be disabled. Costs are associated with disabling each tuple. The minimum-cost SAT solver
computes the set of tuples whose cost of disabling is a minimum. For example, the
cost of disabling the ipAddress tuple may be high because many requirements
depend on this tuple. The cost of disabling the static_route and gre tuples
may be a lot lower. It is not, in general, simple to assign a cost to disabling a tuple. Furthermore, this approach only computes how to disable an undesirable requirement.
It does not guarantee that disabling these tuples will not also disable desirable requirements, unless these latter requirements are also expressed in Boolean logic and the
combined constraint is solved.
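For the two-proof formula above, the minimum-cost choice can be found by exhaustive search, which makes the role of the costs easy to see. In the Python sketch below, the enumeration merely stands in for a minimum-cost SAT solver, and the cost figures are hypothetical values chosen for illustration.

```python
from itertools import combinations

GRE = "gre(ra, tunnel_0, 100, 300)"
IP_RX = "ipAddress(rx, eth_0, 300, 0)"
ROUTE = "static_route(ra, 0, 32, 400)"

# One set per proof: disabling any member of the set disables that proof.
PROOFS = [{GRE, IP_RX}, {IP_RX, ROUTE}]

# Hypothetical costs: the interface address is assumed to be widely used.
COST = {GRE: 1, ROUTE: 1, IP_RX: 5}


def cheapest_disabling(proofs, cost):
    """Return (total_cost, tuples) for a cheapest set that hits every proof."""
    tuples = sorted(set().union(*proofs))
    best = None
    for size in range(len(tuples) + 1):
        for chosen in combinations(tuples, size):
            disabled = set(chosen)
            if all(disabled & proof for proof in proofs):
                total = sum(cost[t] for t in disabled)
                if best is None or total < best[0]:
                    best = (total, disabled)
    return best


print(cheapest_disabling(PROOFS, COST))
```

With these costs the search prefers disabling the gre and static_route tuples together over the single ipAddress tuple; with equal costs it would pick the ipAddress tuple, since that one tuple appears in both proofs.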

9.5.5 Evaluating Firewall Requirements with Binary Decision Diagrams
Hamed et al. [34] evaluate firewall subsumption and rule redundancy using Ordered
Binary Decision Diagrams [12]. Their algorithm is conceptually the same as in Section 9.4.4. It first transforms firewall policies into Boolean constraints upon source
and destination addresses, source and destination ports, and the protocol. These constraints are true only for those packets that are permitted by the firewall. These fields
are represented as sequences of Boolean variables, e.g., an address field as a sequence of 32 variables and a port field as a sequence of 16 variables. The algorithm then
checks whether combinations of constraints for evaluating subsumption and redundancy have a solution. Since constraints are represented as Ordered Binary Decision
Diagrams, this check is straightforward. By contrast, ConfigAssure represents the
above fields as integer variables and represents a policy as an arithmetic quantifier-free form constraint. It lets Kodkod transform this into a Boolean constraint and uses
a SAT solver to check satisfiability.

9.6 Related Work
9.6.1 Configuration Acquisition by Type Inference
Another approach to parsing configuration files is to use the PADS/ML system
[47]. Based on the functional language ML, PADS/ML describes the accepted language as if it were a type definition. PADS/ML supports the generation of a parser,
a printer, a data structure representation, and a generic interface to this representation.
The generated code is in the OCaml [43] language and additional tools, written in
OCaml, then manipulate the internal data structure. This internal data structure
is traversed to populate the relational database in the same way that the ANTLR
abstract syntax tree is traversed.
Adaptive parsers are reported in [17]. These can modify the language they
recognize when given examples of legal input. The inference system recognizes
commands that are only handled in the abstract, much as the ANTLR grammar of
IP Assure skips over some commands. Repeated instances of commands are used
to generate new PADS/ML types, which are then further refined to provide access
to fields in the commands. This means that as the IOS language evolves, the parser
can evolve to provide an ever richer internal representation.

9.6.2 Symbolic Reachability Analysis
Instead of performing reachability analysis for each packet, a system for reachability
analysis for sets of packets is described in Xie et al. [72]. This makes it possible to
evaluate a requirement such as “a change in static routes at one or more routers does
not change the set of packets that can flow between two nodes.” It is not feasible to
evaluate such a requirement by enumerating all packets and checking reachability.
In this system, the reachability upper bound is defined to be the union of all packets permitted by each possible forwarding path from the source to the destination.
This bound models a security policy that denies some packets (i.e., those outside the
upper bound) under all conceivable operational conditions. The reachability lower
bound is defined to be the common set of packets allowed by every feasible forwarding path from the source to the destination. This bound models a resilience policy
that assures the delivery of some packets despite network faults, as long as a backup
forwarding path exists. Algorithms are created for estimating the reachability upper and lower bounds from a network’s packet filter configurations. Moreover, the
work shows that it is possible to jointly reason about how packet filters, routing, and
packet transformations affect reachability.
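The two bounds are a union and an intersection over per-path packet sets. The Python sketch below illustrates the definitions only: packets are represented by short labels rather than header fields, each path is a list of the packet sets its filters permit, and the same list of paths is used for both bounds, whereas the text distinguishes possible from feasible paths. All names are assumptions introduced for the example.

```python
from functools import reduce


def path_permits(filters):
    """Packets that pass every filter along one forwarding path."""
    return reduce(frozenset.intersection, filters)


def reachability_bounds(paths):
    """Return (upper, lower) bounds for a non-empty list of paths."""
    per_path = [path_permits(filters) for filters in paths]
    upper = frozenset().union(*per_path)                  # allowed on some path
    lower = reduce(frozenset.intersection, per_path)      # allowed on every path
    return upper, lower


# Two forwarding paths from a source to a destination (hypothetical filters).
primary = [frozenset({"http", "ssh", "telnet"}), frozenset({"http", "ssh"})]
backup = [frozenset({"http", "dns"})]

upper, lower = reachability_bounds([primary, backup])
print(sorted(upper), sorted(lower))
```

Here telnet is outside the upper bound, so it is denied under all conditions, while http is in the lower bound, so it is delivered as long as either path is up.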
An interesting implementation of reachability analysis for sets of packets is found
in the ConfigChecker [3] system. It represents the network’s packet forwarding behavior as a giant state machine in which a state defines what packets are at what
routers. However, the state-transition relation is not represented explicitly but rather
symbolically as a constraint that must be satisfied by two states for the network
to transition between them. This constraint itself is represented as an Ordered Binary Decision Diagram and input to a symbolic model checker [48]. Reachability
requirements such as that above are expressed in Computation Tree Logic [48]
and the symbolic model checker is used to evaluate them. The transition relation also
takes into account features such as IPSec tunnels, multicast, and network address
translation.

9.6.3 Alloy Specification Language
Alloy [2, 39] is a first-order relational logic system. It lets one specify object types
and their attributes. It also lets one specify first-order logic constraints on these attributes. These are more expressive than Prolog constraints. Alloy solves constraints
by compiling these into Kodkod and using Kodkod’s constraint solver. The use of
Alloy for network configuration management was explored in [49]. Alloy’s specification language is very appropriate for specifying requirements. All the requirements in Section 9.2 can be compactly expressed in Alloy. However, its constraint
solver is inappropriate for evaluating requirements. This is because the compilation
of first-order logic into Boolean logic leads to very large intermediate constraints.
Kodkod addresses this problem by its partial-model optimization that exploits
knowledge about parts of the solution. If the value of a variable is already known, it
does not appear in the constraint that is submitted to the SAT solver. ConfigAssure
follows a related approach but at a higher layer. The intuition is that given a requirement, many parts of it can be efficiently solved with non-SAT methods. Solving
these parts and simplifying can yield a requirement that truly requires the power of
a SAT solver. This plan is carried out by transforming a requirement into an equivalent quantifier-free form by defining the eval predicate for that requirement. QFFs
have the property that not only is it easy to write eval rules, but also that QFFs are
efficiently compiled and solved by Kodkod. Evaluation of parts of requirements and
simplification are accomplished in the definition of eval.

9.6.4 BGP Validation
The Internet is, by definition, a “network of networks,” and the responsibility for
gluing together the tens of thousands of independently administered networks falls
to the Border Gateway Protocol (BGP) [59, 64]. A network, or Autonomous System (AS), uses BGP to
tell neighboring networks about each block of IP addresses it can reach; in turn,
neighboring ASes propagate this information to their neighbors, allowing the entire
Internet to learn how to direct packets toward their ultimate destinations. On the
surface, BGP is a relatively simple path-vector routing protocol, where each router
selects a single best route among those learned from its neighbors, adds its own AS
number to the front of the path, and propagates the updated routing information to
its neighbors for their consideration; packets flow in the reverse direction, with each
router directing traffic along the chosen path in a hop-by-hop fashion.
Yet, BGP is a highly configurable protocol, giving network operators significant
control over how each router selects a “best” route and whether that route is disseminated to its neighbors. The configuration of BGP across the many routers in an
AS collectively expresses a routing policy that is based on potentially complex business objectives [15]. For example, a large Internet Service Provider (ISP) uses BGP
policies to direct traffic on revenue-generating paths through its own downstream
customers, rather than using paths through its upstream providers. A small AS
like a university campus or corporate network typically does not propagate a BGP
route learned from one upstream provider to another, to avoid carrying data traffic
between the two larger networks. In addition, network operators may configure BGP
to filter unexpected routes that arise from configuration mistakes and malicious attacks in other ASes [14,52]. BGP configuration also affects the scalability of the AS,
where network operators choose not to propagate routes for their customers’ small
address blocks to reduce the size of BGP routing tables in the rest of the Internet.
Finally, network operators tune their BGP configuration to direct traffic away from
congested paths to balance load and improve user-perceived performance [25].
The routing policy is configured as a “route map” that consists of a sequence of
clauses that match on some attributes in the BGP route and take a specific action,
such as discarding the route or modifying its attributes with the goal of influencing the route-selection process. BGP defines many different attributes, and the
route-selection process compares the routes one attribute at a time to ultimately
identify one “best” route. This somewhat indirect mechanism for selecting and
propagating routes, coupled with the large number of route attributes and route-selection steps, makes configuring BGP routing policy immensely complicated and
error-prone. Network operators often use tools for automatically configuring their
BGP-speaking routers [11, 21, 29]. These tools typically consist of a template that
specifies the sequence of vendor-specific commands to send to the router, with parameters unique to each BGP session populated from a database; for example, these
parameters might indicate a customer’s name, AS number, address block(s), and the
appropriate route-maps to use. When automated tools are not used, the network
operators typically have configuration-checking tools to ensure that the sessions
are configured correctly, and that different sessions are configured in a consistent
manner [16, 24].
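A route map, as described above, is an ordered list of match and action clauses. The Python sketch below illustrates that structure only: the route fields, the AS numbers (taken from the range reserved for documentation), and the rule that a route matching no clause is discarded are assumptions for the example, not the syntax or semantics of any particular vendor.

```python
def apply_route_map(route_map, route):
    """Apply the first matching clause; None means the route is discarded."""
    for matches, action in route_map:
        if matches(route):
            return action(route)
    return None  # assumption: a route that matches no clause is discarded


CUSTOMER_AS = 64500   # hypothetical customer AS
UNWANTED_AS = 64511   # hypothetical AS whose routes are filtered

ROUTE_MAP = [
    # Clause 1: discard any route whose AS path contains the unwanted AS.
    (lambda r: UNWANTED_AS in r["as_path"], lambda r: None),
    # Clause 2: prefer routes learned directly from the customer.
    (lambda r: r["as_path"][0] == CUSTOMER_AS,
     lambda r: {**r, "local_pref": 200}),
    # Clause 3: accept everything else unchanged.
    (lambda r: True, lambda r: r),
]

customer = {"prefix": "192.0.2.0/24", "as_path": [64500, 64501],
            "local_pref": 100}
leaked = {"prefix": "198.51.100.0/24", "as_path": [64502, 64511],
          "local_pref": 100}

print(apply_route_map(ROUTE_MAP, customer)["local_pref"])
print(apply_route_map(ROUTE_MAP, leaked))
```

Clause order matters: moving the catch-all clause to the front would accept the leaked route, which is the kind of mistake that configuration-checking tools aim to catch.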
Configuring the BGP sessions with neighboring ASes, while important, is not the
only challenge in BGP configuration. In practice, an AS consists of multiple routers
in different locations; in fact, a large ISP may easily have hundreds if not thousands of routers connected by numerous links into a backbone topology. Different
routers connect to different neighbor ASes, giving each router only a partial view
of the candidate BGP routes. As such, large ISPs typically run BGP inside their
networks to allow the routers to construct a more complete view of the available
routes. These internal BGP (iBGP) sessions must be configured correctly to ensure
that each router has all the information it needs to select routes that satisfy the AS’s
policy. The simplest solution is to have a “full-mesh” configuration, with an iBGP
session between each pair of routers. However, this approach does not scale, forcing
large ISPs to introduce hierarchy by configuring route reflectors or confederations
that limit the number of iBGP sessions and constrain the dissemination of routes.
Each route reflector, for instance, selects a single “best route” that it disseminates to
its clients; as such, the route-reflector clients do not learn all the candidate routes
they would have learned in a full-mesh configuration.
When the “topology” formed by these iBGP sessions violates certain properties,
routing anomalies like protocol oscillations, forwarding loops, traffic blackholes,
and violations of business contracts can arise [6, 31, 74]. Fortunately, static analysis of the iBGP topology, spread over the configuration of the routers inside the
AS, can detect when these problems might arise [24]. Such tools check, for instance, that the top-level route reflectors are fully connected by a “full-mesh” of
iBGP sessions. This prevents “signaling partitions” that could prevent some routers
from learning any route for a destination. Static analysis can also check that route
reflectors are “close” to their clients in the underlying network topology, to ensure
that the route reflectors make the same routing decisions that their clients would
have made with full information about the alternate routes. Finally, these tools can
validate an ISP’s own local rules for ensuring reliability in the face of router failures. For instance, static analysis can verify that each router is configured with at
least two route-reflector parents. Collectively, these kinds of checks on the static
configuration of the network can prevent a wide variety of routing anomalies.
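Two of the checks mentioned above, a full mesh among the top-level route reflectors and at least two route-reflector parents per client, are simple set computations once the iBGP sessions have been extracted from the configurations. The Python sketch below assumes that extraction has already happened; the router names and the `sessions` and `parents` structures are hypothetical.

```python
from itertools import combinations


def top_level_full_mesh(top_reflectors, sessions):
    """True if every pair of top-level route reflectors shares an iBGP session."""
    return all(frozenset(pair) in sessions
               for pair in combinations(top_reflectors, 2))


def lacking_redundancy(parents, minimum=2):
    """Clients configured with fewer than `minimum` route-reflector parents."""
    return sorted(router for router, rrs in parents.items()
                  if len(rrs) < minimum)


TOP = ["rr1", "rr2", "rr3"]

# Each iBGP session is an unordered pair of routers.
sessions = {frozenset(p) for p in [
    ("rr1", "rr2"), ("rr1", "rr3"), ("rr2", "rr3"),
    ("rr1", "c1"), ("rr2", "c1"), ("rr3", "c2"),
]}

# Route-reflector parents of each client router.
parents = {"c1": {"rr1", "rr2"}, "c2": {"rr3"}}

print(top_level_full_mesh(TOP, sessions))
print(lacking_redundancy(parents))
```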
For the most part, configuration validation tools operate on the vendor-specific
configuration commands applied to individual routers. Configuration languages vary
from one vendor to another; for example, Cisco and Juniper routers have very different syntax and commands, even for relatively similar configuration tasks. Even
within a single company, different router products and different generations of the
router operating system have different commands and options. This makes configuration validation an immensely challenging task, where the configuration-checking
tools must support a wide range of languages and commands. To address these
challenges, research and standards activities have led to new BGP configuration
languages that are independent of the vendor-specific command syntax [1, 71], particularly in the area of BGP routing policy. In addition to abstracting vendor-specific
details, these frameworks provide some support for configuring entire networks
rather than individual routers. For example, the Routing Policy Specification Language (RPSL) [1] is object-oriented, where objects contain AS-wide policy and
administrative information that can be published in Internet Routing Registries (IRRs) [37].
Routing policy can be expressed in terms of user-friendly keywords for defining actions and groups of address blocks or AS numbers. Configuration-generation tools
can read these specifications to generate vendor-specific commands to apply to the
individual routers [37]. However, while RPSL is used for publishing information in
the IRRs, many ISPs still use their own configuration tools (or manual processes)
for configuring their underlying routers.
In summary, the configuration of BGP takes place at many levels – within a single
router (to specify a single end point of a BGP session with the appropriate route-maps
and addresses), between pairs of routers (to ensure consistent configuration of
the two ends of a BGP session), across different sessions to the same neighboring
AS (to ensure consistent application of the routing policy at each connection point),
and across an entire AS (to ensure that the iBGP topology is configured correctly).
In recent years, tools have emerged for static analysis of router-configuration data to
identify potential configuration mistakes, and for automated generation of the configuration commands that are sent to the routers. Still, many interesting challenges
remain in raising the level of abstraction for configuring BGP, to move from the
low-level focus on configuring individual routers and BGP sessions toward configuring an entire network, and from the specific details of the BGP route attributes
and route-selection process to a high-level specification of an AS’s routing policy.
As the Internet continues to grow, and the business relationships between ASes become increasingly complex, these issues will only become more important in the
years ahead.

9.6.5 Other Validation Systems
Netsys was an early software product for configuration validation. It was first acquired by Cisco Systems and then by WANDL Corporation. It contained about
100 requirements that were evaluated against router configurations. OPNET offers
validation products NetDoctor and NetMapper. These are not standalone but rather
modules that need to be plugged into the base IT Sentinel system [54]. For more
description of these, see [23]. None of these products offer configuration repair, reasoning about firewalls, or symbolic reachability analysis. The Smart Firewalls work
[13] was an early attempt at Telcordia to develop a network configuration validation system. A survey of system, not network, configuration is found in [4]. Formal
methods for jointly reasoning about IPSec and firewall policies are described in [32].
A high-level configuration language is described in [45].

9.7 Summary and Directions for Future Research
To set up network infrastructure satisfying end-to-end requirements, it is not only
necessary to run appropriate protocols on components but also to correctly configure
these components. Configuration is the “glue” for logically integrating components
at and across multiple protocol layers. Each component has a finite number of configuration parameters, each of which can be set to a definite value. However, today,
the large conceptual gap between end-to-end requirements and configurations is
bridged manually. This causes large numbers of configuration errors whose adverse
effects on the security, reliability, and deployment cost of network infrastructure
are well documented. See also [57, 62].

Thus, it is critical to develop validation tools that check whether a given configuration is consistent with the requirements it is intended to implement. Besides
checking consistency, configuration validation has another interesting application,
namely network testing. The usual invasive approach to testing has several limitations. It is not scalable. It consumes resources of the network and network
administrators and has the potential to unleash malware into the network. Some
properties such as absence of single points of failure are impractical to test as they
require failing components in operational networks. A noninvasive alternative that
overcomes these limitations is analyzing configurations of network components.
This approach is analogous to testing software by analyzing its source code rather
than by running it. This approach has been evaluated for a real enterprise.
Configuration validation is inherently hard. Whether a component is correctly
configured cannot be evaluated in isolation. Rather, the global relationships into
which the component has been logically integrated with other components have
to be evaluated. Configuration repair is even harder since changing configurations
to make one requirement true may falsify another. The configuration change should
be holistic in that it should ensure that all requirements concurrently hold.
This chapter described the challenges of configuring a typical collaboration
network and the benefits of using a validation system. It then presented an abstract design of a configuration validation system. It consists of four subsystems:
configuration acquisition system, requirement library, specification language, and
evaluation system. The chapter then surveyed technologies for realizing this design. Configuration acquisition systems have been built using three approaches:
parser generator, type inference, and database query. Classes of requirements in
the Requirement Library are logical structure integrity, connectivity, security, reliability, performance, and government regulation. Specification languages include
visual templates, Prolog, Datalog, arithmetic quantifier-free forms, and Computation Tree Logic. Evaluation systems have used graph algorithms, the Kodkod
constraint solver for first-order logic constraints, the ZChaff SAT solver for Boolean
constraints, Binary Decision Diagrams, and symbolic model checkers. Visualization
of not just the IP topology but also of various other logical topologies provides useful insights into network architecture. Logic-based languages are very useful for
creating a validation system, particularly for solving the hard problems of configuration repair and symbolic reasoning about requirements.
Future research needs to focus on all four components of a validation system.
Robust configuration acquisition systems are critical to automated validation. The
accumulated experience of building large networks is vast but largely unformalized. Formalizing this experience in a Requirement Library would not only raise the level of
abstraction at which network requirements are written but also improve their precision. New classes of requirements, one on VLAN optimization and another on
configuration complexity, are reported in [28, 65] and in [9], respectively. Specification languages that are easy to use by network administrators are also critical for
broad adoption of validation systems. Logic-based languages are a good candidate
despite the perception that these are too complex for administrators. These are closest in form to the natural language requirements in network design documents. The
configuration languages administrators use are already declarative in that they do
not contain side-effects and the ordering of commands is unimportant. Introducing
logical operators, data structures, and quantifiers into these is a natural step toward
making these much more expressive. See [71] for a recent example of using the
Haskell functional language for specifying BGP policies. High-level descriptions
of component configurations could then again be composed by logical operators
to describe network-wide requirements. In the nearer term, even making an implementation of the Requirement Library available as APIs in system administration
languages like Perl or Python should vastly improve configuration debugging. Much
greater understanding is needed of useful ways to visualize logical structures and
relationships in networks. One might derive inspiration from works such as [70]. Finally, a good framework for repairing configurations was described in Section 9.5.3,
but it needs to be further explored. For example, one needs to understand how the
convergence of the repair procedure is affected by choice of configuration variable
to relax, and how ideas of MulVAL can be generalized and combined with those of
ConfigAssure. Building enough trust among network administrators that they will allow automated repair of their component configurations remains an open problem.
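As a minimal sketch of what a Requirement Library API in Python might look like, the following checks one reliability requirement, absence of single points of failure, with a simple graph algorithm. The function names, the adjacency-map representation, and the four-router topology are hypothetical and are not taken from any system surveyed in this chapter.

```python
from collections import deque


def is_connected(adj, removed=None):
    """Breadth-first search: True if all nodes other than `removed` can reach each other."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v != removed and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nodes)


def single_points_of_failure(adj):
    """Return the nodes whose removal disconnects the remaining topology."""
    return sorted(n for n in adj if not is_connected(adj, removed=n))


# Hypothetical topology: router r4 reaches the rest of the network only through r2.
topology = {
    "r1": {"r2", "r3"},
    "r2": {"r1", "r3", "r4"},
    "r3": {"r1", "r2"},
    "r4": {"r2"},
}
spofs = single_points_of_failure(topology)
```

The check operates on a topology derived from configurations rather than on the live network, which is the noninvasive property argued for above; a real library would build the adjacency map from the configuration acquisition system.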
Acknowledgments We are very grateful to Jennifer Rexford, Andreas Voellmy, Richard Yang,
Chuck Kalmanek, Simon Ou, Geoffrey Xie, Yitzhak Mandelbaum, Ehab Al-Shaer, Sanjay Rao,
Adel El-Atawy, and Paul Anderson for their contributions and comments.

References
1. Alaettinoglu, C., Villamizar, C., Gerich, E., Kessens, D., Meyer, D., Bates, T., et al. (1999).
Routing Policy Specification Language. RFC 2622.
2. Alloy. http://alloy.mit.edu/
3. Al-Shaer, E., Marrero, W., El-Atawy, A., & ElBadawy, K. (2008). Towards global
verification and analysis of network access control configuration. Technical Report, TR-08008, DePaul University, from http://www.mnlab.cs.depaul.edu/projects/ConfigChecker/TR08-008/paper.pdf
4. Anderson, P. (2006). System configuration. In R. Farrow (Ed.), Short Topics in System Administration.
USENIX Association.
5. ANTLR v3. http://www.antlr.org/
6. Basu, A., Ong, C.H., Rasala, A., Shepherd, F.B., & Wilfong, G. (2002). Route oscillations in
I-BGP with route reflection. ACM SIGCOMM.
7. Bates, T., Chandra, R., & Chen, E. (2000). BGP route reflection – an alternative to full mesh
IBGP. RFC 2796. http://www.faqs.org/rfcs/rfc2796
8. Bellovin, S., & Bush, R. (2009). Configuration management and security. IEEE Journal on
Selected Areas in Communications [special issue on Network Infrastructure Configuration],
27(Suppl. 3).
9. Benson, T., Akella, A., & Maltz, D. (2009). Unraveling the complexity of network management. USENIX Symposium on Network Systems Design and Implementation.
10. Berkowitz, H. (2000). Techniques in OSPF-Based Network. http://tools.ietf.org/html/draft-ietfospf-deploy-00
11. Bohm, H., Feldmann, A., Maennel, O., Reiser, C., & Volk, R. (2005). Network-wide interdomain routing policies: Design and realization. Unpublished report, http://www.net.t-labs.
tu-berlin.de/papers/BFMRV-NIRP-05.pdf.
12. Bryant, R. (1986). Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, C-35(Suppl. 8), 677–691.
13. Burns, J., Cheng, A., Gurung, P., Martin, D., Rajagopalan, S., Rao, P., et al. (2001). Automatic
management of network security policy. Proceedings of DARPA Information Survivability Conference and Exposition (DISCEX II’01), volume 2, Anaheim, CA.
14. Butler, K., Farley, T., McDaniel, P., & Rexford, J. (2008). A survey of BGP security issues and
solutions. Unpublished manuscript.
15. Caesar, M., & Rexford, J. (2005). BGP routing policies in ISP networks. IEEE Network
Magazine [Special issue on Interdomain Routing], 19, 5–11.
16. Caldwell, D., Gilbert, A., Gottlieb, J., Greenberg, A., Hjalmtysson, G., & Rexford, J. (2003).
The cutting EDGE of IP router configuration. ACM SIGCOMM HotNets Workshop.
17. Caldwell, D., Lee, S., & Mandelbaum, Y. (2008). Adaptive parsing of router configuration
languages. Proceedings of the Internet Management Workshop.
18. Cheswick, W., Bellovin, S., & Rubin, A. (2003). Firewalls and Internet security: Repelling the
Wily Hacker. Reading, MA: Addison-Wesley.
19. Cisco Systems. (2005). DiffServ – The Scalable End-to-End QoS Model.
20. Distributed Management Task Force, from http://www.dmtf.org/home
21. Enck, W., Moyer, T., McDaniel, P., Sen, S., Sebos, P., Spoerel, S., et al. (2009). Configuration
management at massive scale: System design and experience. IEEE Journal on Selected Areas
in Communications. 27(Suppl. 3), 323–335.
22. Farinacci, D., Li, T., Hanks, S., Meyer, D., & Traina, P. (2000). Generic routing and encapsulation. RFC 2784.
23. Feamster, N. (2006). Proactive techniques for correct and predictable Internet routing. Doctoral dissertation, Massachusetts Institute of Technology, Boston, MA.
24. Feamster, N., & Balakrishnan, H. (2005). Detecting BGP configuration faults with static analysis. Symposium on Networked Systems Design and Implementation.
25. Feamster, N., & Rexford, J. (2007). Network-wide prediction of BGP routes. IEEE/ACM
Transactions on Networking, 15(2), 253–266.
26. Federal Information Security Management Act. (2002). National Institute of Standards and
Technology.
27. Fu, Z., & Malik, S. (2006). Solving the minimum-cost satisfiability problem using branch and
bound search. Proceedings of IEEE/ACM International Conference on Computer-Aided Design
ICCAD.
28. Garimella, P., Sung, Y.W., Zhang, N., & Rao, S. (2007). Characterizing VLAN usage in an
Operational Network. ACM SIGCOMM Workshop on Internet Network Management.
29. Gottlieb, J., Greenberg, A., Rexford, J., & Wang, J. (2003). Automated provisioning of BGP
customers. IEEE Network Magazine.
30. Graphviz. http://www.graphviz.org/
31. Griffin, T.G., & Wilfong, G. (2002). On the correctness of IBGP configuration. Proceedings of
ACM SIGCOMM.
32. Guttman, J. (1997). Filtering postures: local enforcement for global policies. Proceedings of
the 1997 IEEE Symposium on Security and Privacy.
33. Halabi, B. (1997). Internet routing architectures. Indianapolis, IN: New Riders Publishing.
34. Hamed, H., Al-Shaer, E., & Marrero, W. (2005). Modeling and verification of IPSec and VPN
security policies. Proceedings of IEEE International Conference on Network Protocols.
35. Homer, J., & Ou, X. (2009). SAT-solving approaches to context-aware enterprise network security management. IEEE JSAC [Special Issue on Network Infrastructure Configuration].
36. Huitema, C. (1999). Routing in the Internet. Upper Saddle River, NJ: Prentice Hall.
37. Internet Routing Registry Toolset Project, from https://www.isc.org/software/IRRtoolset
38. IP Assure. Telcordia Technologies, Inc., from http://www.telcordia.com/products/ip-assure/
39. Jackson, D. (2006). Software abstractions: Logic, language, and analysis. Cambridge, MA:
MIT Press.
40. Juniper Networks. (2008). What is behind network downtime? Proactive steps to reduce human error and improve availability of networks, from http://www.juniper.net/ solutions/literature/white papers/200249.pdf
41. Kodkod, from http://web.mit.edu/emina/www/kodkod.html
42. Lampson, B. (2000). Computer security in real world. Annual computer security
applications conference, from http://research.microsoft.com/en-us/um/people/blampson/64securityinrealworld/acrobat.pdf
43. Leroy, X., Doligez, D., Garrigue, J., Rémy, D., & Vouillon, J. (2007). The objective caml
system, release 3.10, documentation and user’s manual.
44. Li, T., Cole, B., Morton, P., & Li, D. (1998). Cisco Hot Standby Router Protocol. RFC 2281.
45. Lobo, J., & Pappas, V. (2008). C2: The case for network configuration checking language.
Proceedings of IEEE Workshop on Policies for Distributed Systems and Networks.
46. Mahajan, Y., Fu, Z., & Malik, S. (2004). Zchaff2004, An Efficient SAT Solver. Proceedings of
7th International Conference on Theory and Applications of Satisfiability Testing.
47. Mandelbaum, Y., Fisher, K., Walker, D., Fernandez, M., & Gleyzer, A. (2007). PADS/ML:
A functional data description language. ACM Symposium on Principles of Programming Language.
48. McMillan, K. (1992). Symbolic model checking. Doctoral dissertation, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.
49. Narain, S. (2005). Network configuration management via model-finding. Proceedings of
USENIX Large Installation System Administration (LISA) Conference.
50. Narain, S., Kaul, V., & Parmeswaran, K. (2003). Building autonomic systems via configuration.
Proceedings of AMS Autonomic Computing Workshop.
51. Narain, S., Levin, G., Kaul, V., & Malik, S. (2008). Declarative infrastructure configuration
synthesis and debugging. In E. Al-Shaer, C. Kalmanek, F. Wu (Eds), Journal of Network Systems and Management [Special issue on Security Configuration]
52. Nordstrom, O. & Dovrolis, C. (2004). Beware of BGP attacks. ACM SIGCOMM Computer
Communications Review, 34(Suppl. 2), 1–8.
53. O’Keefe, R. (1990). The craft of prolog. Reading, MA: Addison Wesley.
54. OPNET IT Sentinel, from http://www.opnet.com/solutions/network planning operations/
it sentinel.html
55. Ou, X., Boyer, W., & McQueen, M. (2006). A scalable approach to attack graph generation.
13th ACM Conference on Computer and Communications Security (CCS).
56. Ou, X., Govindavajhala, S., & Appel, A. (2005). MulVAL: A logic-based network security
analyzer. 14th USENIX Security Symposium, Baltimore, MD.
57. Pappas, V., Wessels, D., Massey, D., Terzis, A., Lu, S., & Zhang, L. (2009). Impact of configuration errors on DNS robustness. IEEE Journal on Selected Areas in Communication, 27(Suppl.
1), 275–290.
58. Qie, X., & Narain, S. (2003). Using service grammar to diagnose configuration errors in
BGP-4. Proceedings of USENIX Systems Administrators Conference.
59. Rekhter, Y., Li, T., & Hares, S. (2006). A Border Gateway Protocol 4 (BGP-4), RFC 4271.
60. Rosen, E., Viswanathan, A., & Callon, R. (2001). Multiprotocol Label Switching Architecture.
RFC 3031.
61. Schwartz, J. (2007). Who Needs Hackers? New York Times http://www.nytimes.com/
2007/09/12/technology/techspecial/12threat.html
62. Securing Cyberspace for the 44th Presidency. (2008). CSIS Commission On Cybersecurity.
63. Sedgewick, R. (2003). Algorithms in Java. Reading, MA: Addison Wesley.
64. Stewart, J. (1999). BGP4: Inter-Domain Routing in the Internet. Reading, MA: Addison-Wesley.
65. Sung, E.Y., Rao, S., Xie, G., & Maltz, D. (2008). Towards systematic design of enterprise
networks. ACM CoNEXT Conference.
66. SWI-Prolog Semantic Web Library, from http://www.swi-prolog.org/pldoc/package/
semweb.html
67. SWI-Prolog, from http://www.swi-prolog.org/
68. TCP Problems with Path MTU discovery. RFC 2923.
69. Torlak, E., & Jackson, D. (2007). Kodkod: A Relational Model Finder. Tools and Algorithms
for Construction and Analysis of Systems (TACAS ‘07).
70. Tufte, E. (2001). The visual display of quantitative information. Cheshire, CT: Graphics Press.
71. Voellmy, A., & Hudak, P. Nettle: A domain-specific language for routing configuration, from
http://www.haskell.org/YaleHaskellGroupWiki/Nettle
72. Xie, G., Zhan, J., Maltz, D., Zhang, H., Greenberg, A., Hjalmtysson, G., et al. (2005). On static
reachability analysis of IP networks. IEEE INFOCOM.
73. ZChaff, from http://www.princeton.edu/chaff/
74. Zhang-Shen, R., Wang, Y., & Rexford, J. (2008). Atomic routing theory: Making an AS route
like a single node. Princeton University Computer Science technical report TR-827-08.

Part V

Network Measurement

Chapter 10

Measurements of Data Plane Reliability
and Performance
Nick Duffield and Al Morton

10.1 Introduction
10.1.1 Service Without Measurement: A Brief History
Measurement was not a priority in the original design of the Internet, principally
because it was not needed in order to provide Best Effort service, and because the
institutions using the Internet were also the providers of this network. A technical strength of the Internet has been that endpoints have not needed visibility into
the details of the underlying network that connects them in order to transmit traffic between one another. Rather, the functionality required for data to reach one
host from another is separated into layers that interact through standardized interfaces. The transport layer provides a host with the appearance of a conduit through
which traffic is transferred to another host; lower layers deal with routing the traffic
through the network, and the actual transmission of the data over physical links. The
Best Effort service model offers no hard performance guarantees to which conformance needs to be measured. Basic robustness of connectivity – the detection of
link failures and rerouting traffic around them – was a task of the network layer, and
so need not concern the endpoints.
The situation described above has changed over the intervening years. The complexity of networks, traffic, and the protocols that mediate them, together with the separation of
network users from network providers and customer needs for service
guarantees beyond Best Effort, now requires detailed traffic measurements to manage and engineer traffic, to verify that performance meets required goals, and
to diagnose performance degradations when they occur. In the absence of detailed

N. Duffield ()
AT&T Labs, 180 Park Avenue, Florham Park, NJ 07901, USA
e-mail: duffield@research.att.com
Al Morton
AT&T Labs, 200 S Laurel Ave, Middletown, NJ 07748, USA
e-mail: acmorton@att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5 10,
c Springer-Verlag London Limited 2010


network monitoring capabilities integrated with the network, many researchers,
developers, and vendors jumped into the void to provide solutions. As measurement
methodologies become increasingly mature, the challenge for service providers becomes how to deploy and manage measurement infrastructure scalably. Indeed,
to meet this need, sophisticated measurement capabilities are increasingly being
found on network routers. Furthermore, all parties concerned with the provenance
and interpretation of measurements – vendors of measurement systems, software
and services, service providers and enterprises, network users and customers – need
a consistent way to specify how measurements are to be conducted, collected,
transmitted, and interpreted. Many of these aspects for both passive and active
measurement are now codified by standards bodies.
We continue this introduction by briefly setting out the type of passive and active measurements that are the subject of this chapter, then previewing the broader
challenges that face service providers in realizing them in their networks.

10.1.2 Passive and Active Measurement Methods
This chapter is concerned with two forms of dataplane measurement: passive and
active measurements. These two types of measurement have generally focused on
different aspects of network behavior, support different applications, and are accomplished by different technical means.
• Passive measurement comprises recording information concerning traffic as it
passes observation points in the network. We consider three categories of passive
measurement:
– Link utilization statistics as provided by router interface counters; these are
retrieved from a managed device by a network management station using the
SNMP protocol.
– Flow-level measurements comprising summaries of flows of packets with
common network and transport header properties. These are commonly compiled by routers, then exported to a collector for storage and analysis. These
statistics enable detailed breakdown of traffic volumes according to network
and transport header fields, e.g., IP addresses and TCP/UDP ports.
– Inspection of packet payloads in order to provide application-level flow measurements, or to support other payload-dependent applications such as network security and troubleshooting.
• In active measurement, probe traffic is inserted into the network, and the probe
traffic, or the response of the network to it, is subsequently measured. Comparing the probe and response traffic provides a measure of network performance,
as experienced by the probes. Active probing has been conducted by standalone
tools such as ping and traceroute [53] that utilize or coerce IP protocols for measurement functionality. These and other methods are used for active
measurement between hosts in special purpose measurement infrastructures, or
between network routers, or from these to other endpoints such as application or
other servers.
Although the correspondence between methods and applications – passive measurement for traffic analysis and active measurement for performance – has been
the norm, it is not firm: passive measurement is used to observe probe packets, and
there are purely passive approaches to performance measurement.
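The comparison of probe and response traffic described above can be sketched as follows. This is an illustrative example only: the function name, the use of sequence numbers to match probes to responses, and the millisecond timestamps are assumptions made for the sketch, not the method of any particular tool.

```python
from statistics import mean


def summarize_probes(sent, received):
    """Compare probe and response logs.

    sent:     {sequence_number: send_time_ms}
    received: {sequence_number: receive_time_ms}
    """
    # A probe with no matching response is counted as lost.
    delays = [received[seq] - t for seq, t in sent.items() if seq in received]
    lost = len(sent) - len(delays)
    return {
        "sent": len(sent),
        "lost": lost,
        "loss_ratio": lost / len(sent) if sent else 0.0,
        "mean_delay_ms": mean(delays) if delays else None,
        "max_delay_ms": max(delays) if delays else None,
    }


# Hypothetical logs: four probes sent one second apart, probe 3 never answered.
sent = {1: 0, 2: 1000, 3: 2000, 4: 3000}
received = {1: 20, 2: 1030, 4: 3040}
summary = summarize_probes(sent, received)
```

The resulting figures describe performance as experienced by the probes; whether they represent the performance seen by user traffic depends on how the probes are sized, scheduled, and routed.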

10.1.3 Challenges for Measurement Infrastructure
and Applications
We now describe challenges facing design and deployment of active and passive
measurement infrastructure by service providers and enterprises. As we discuss
passive and active measurement methodologies in the following sections, we shall
discuss their strengths and weaknesses in meeting these challenges. As one would
expect, weaknesses in some of the more mature methods that we discuss have often
provided the motivation for subsequent methods.
• Speed: Increasingly fast line rates challenge the ability of routers to perform
complex per-packet processing, including updating flow statistics and packet
content inspection.
• Scale: The product of network speed and the large number of devices producing measurements gives rise to an immense amount of measurement data (e.g.,
flow statistics). In addition to consuming resources at the observation points,
these data require transmission, storage, and processing in the measurement infrastructure and back-end systems.
• Granularity: Service providers and their customers increasingly require a detailed picture of network usage and performance. This is needed both to support individualized routine reporting and to support detailed retrospective studies of
network behavior. These requirements reduce the utility of aggregate usage measurements, such as link-level counters, and simple performance measurement
tools, such as ping and traceroute.
• Scope: For passive measurement: not all routers support granular measurement
functionality, e.g., reporting flow statistics; or, the functionality may not be enabled due to resource constraints at the observation point or in the measurement
collection infrastructure. When measurements are performed, information about
protocol layers below IP (such as MPLS), or optical layer attributes (such as the
physical link of an IP composite link) may be incompletely reported or even absent. Information above the network layer may be hidden as a result of endpoint
encryption. For active measurement: not all network paths or links may be directly measured because of cost or other limitations in the deployment of active
measurement hosts.
• Timeliness: Measurement applications increasingly require short temporal granularity of measurements, either because it is desirable to measure events of short
duration, such as traffic microbursts and sub-second timescale routing events, or
because the reporting latency must be short, e.g., in real-time anomaly detection
for security applications. The concomitant increase in measurement reporting or
polling frequency increases load on measurement devices and increases the number of measurement data points.
• Accuracy: In passive measurement, reduction of data volumes through sampling, in order to meet the challenges of speed and scale, introduces statistical
uncertainty into measurements. In active measurement, bandwidth and scale
constraints place a limit on active probing frequency and hence measurement
accuracy is inherently dependent on the duration of the measurement period.
• Management: There are several challenges for the management and administration of measurement infrastructure.
– Reliability: Measurement infrastructure components are subject to failure
or outage, resulting in loss or corruption of measurements. The effects of
component failure can be mitigated (i) at the infrastructure level (providing
redundant capacity with fast detection of failure resulting in failover to backup
subsystems), (ii) by employing reporting paradigms (e.g., sequence numbers)
that facilitate automated checking, flagging, or workarounds for missing data,
and (iii) reporting measurement uncertainty due to missing data or sampling
to the consumer of the measurements.
– Correlation: Measurement applications may require correlation of measurements generated by different measurement subsystems, for example, passive
and active traffic measurements, logs from application servers, and authentication, authorization, and accounting subsystems. A common case is when
measurements are to be attributed to an entity such as an end host, but the
mapping between measurement identifier (such as source IP address) and entity is dynamic (e.g., dynamic DHCP mappings). Correlation of multiple data
sets presents challenges for data management, e.g., due to data size, diverse
provenance, physical locations, and access policies. The measurement infrastructure must facilitate correlation by measures including the synchronization
of timestamps set by different measurement subsystems.
– Consistency: The methodologies, reporting, and interpretation of measurements must be consistent across different equipment and network management software vendors, service providers, and their customers.
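To make the Accuracy challenge above concrete, the following sketch estimates a packet count from sampled data and the uncertainty of that estimate. It assumes that each packet is sampled independently with probability 1/N, where N is the sampling period; the function names and numbers are illustrative, and other sampling schemes would need a different error model.

```python
import math


def estimate_total(sampled_count, sampling_period):
    """Scale a sampled packet count up to an estimate of the true packet count."""
    return sampled_count * sampling_period


def relative_std_error(sampled_count, sampling_period):
    """Approximate relative standard error under independent 1-in-N sampling.

    With sampling probability p = 1/N, the estimate n/p has relative standard
    error sqrt((1 - p) / (N_true * p)), approximated here by sqrt((1 - p) / n).
    """
    p = 1.0 / sampling_period
    return math.sqrt((1.0 - p) / sampled_count)


# 400 sampled packets at 1-in-1000 sampling: roughly 5% relative error.
short_window = relative_std_error(400, 1000)
# 100 times as many samples (e.g., a longer measurement period): roughly 0.5%.
long_window = relative_std_error(40000, 1000)
```

The error falls only with the square root of the number of samples, which is why accuracy is tied both to the sampling rate and to the duration over which measurements are accumulated.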
In this chapter, Sections 10.2–10.6 cover passive measurement, including link-level aggregates, flow measurement, sampling, packet selection, and deep packet
inspection (DPI). Sections 10.7–10.10 cover active measurements, including standardization of performance metrics, service level agreements, and deployment
issues for measurement infrastructures. We conclude with an outlook on future challenges in Section 10.11. We shall make use of and refer to other chapters in this
book that deal with specific applications of measurements, principally Chapter 5 on
Network Planning and Chapter 13 on Network Security.

10.2 Passive Traffic Measurement
As previewed in Section 10.1.2, we consider three broad types of passive measurement: link statistics, flow measurements, and DPI. We cover methods that
are currently employed in provider networks and also describe some newer approaches that have been proposed or may be deployed in the medium term. We now
motivate and outline in more detail the material on passive measurement.
Section 10.3 describes SNMP measurements, or, more precisely, interface packet
counters maintained in a router’s Management Information Base (MIB) that are
retrieved using the Simple Network Management Protocol (SNMP). The remote
monitoring capabilities supported by the RMON MIB are also discussed.
SNMP measurements provide an undifferentiated view of traffic on a link. By
contrast, measurement applications often need to classify traffic according to the
values occurring in protocol header fields that occur at different levels of the protocol stack. They must determine the aggregate traffic volumes attributable to each
such value, for example, to each combination of the network layer IP addresses and
transport layer TCP/UDP ports. This information, and that relating to encapsulating
protocols such as MPLS, has come to be known as “packet header” information.
This is contrasted with “packet payload” or “packet content” information, which includes higher layer application and protocol information. This information may be
spread across multiple network level packets.
The major development in passive traffic measurement over roughly the last
20 years that serves these needs has been traffic flow measurement. Traffic flows
are sets of packets with common network/transport header values observed locally
in time. Routers commonly compile summary statistics of flows (total packets,
bytes, timing information) and report them, together with the common header values and some associated router state – but without any payload information – in a
flow record that is exported to a collector. Cisco’s NetFlow is the prime example.
Flow records provide a relatively detailed representation of network traffic that supports many applications. Several of these are covered in detail in other chapters of
this book: generation of traffic matrices and their use in network planning is described in Chapter 5; analysis of traffic patterns and anomalies for network security
is described in Chapter 13. Related applications are the routine reporting of traffic
matrices and trending of traffic volumes and application mix for customers and for
service providers' network and business development organizations (see, e.g., [5]).
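The compilation of flow records from packets can be sketched as below. This is a simplified illustration, not NetFlow's actual algorithm or record format: the five-field key, the idle timeout used to decide when packets are no longer "local in time", and all names are assumptions made for the example.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "ts src dst sport dport proto length")


def build_flow_records(packets, idle_timeout=30.0):
    """Group packets by header key; an idle gap longer than idle_timeout ends a flow."""
    active = {}    # key -> record currently being accumulated
    finished = []  # records closed by the idle timeout
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        rec = active.get(key)
        if rec is not None and p.ts - rec["last"] > idle_timeout:
            finished.append(rec)
            rec = None
        if rec is None:
            rec = {"key": key, "packets": 0, "bytes": 0, "first": p.ts, "last": p.ts}
            active[key] = rec
        # Only summary statistics are kept; no payload is stored.
        rec["packets"] += 1
        rec["bytes"] += p.length
        rec["last"] = p.ts
    return finished + list(active.values())


packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 1234, 80, 6, 100),
    Packet(1.0, "10.0.0.1", "10.0.0.2", 1234, 80, 6, 1500),
    Packet(2.0, "10.0.0.3", "10.0.0.2", 5555, 53, 17, 60),
    Packet(100.0, "10.0.0.1", "10.0.0.2", 1234, 80, 6, 40),
]
records = build_flow_records(packets)
```

In this example the last packet shares a key with the first two but arrives after a long silence, so it is reported in a separate record; four packets are thus summarized by three flow records.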
Section 10.4 describes traffic flow measurement: the operational formation of flow statistics; protocols for the standardization of flow measurement;
flow measurement collection infrastructure; the use of sampling, of both packets and
flow records themselves, to meet the challenges of speed and scale, and its
impact on measurement accuracy; and some recent proposals for traffic flow measurement and aggregation. It concludes with some applications of flow measurements.
Uniform packet sampling is one member of a more general class of packet selection primitives, that also includes filtering and more general sampling operations.
In Section 10.5, we describe standardization of packet selection operations, their
realization in routers, and applications of combined selection primitives for network
management. We describe in detail the hash-based selection primitive, which allows for consistent selection of the same packet at different observation points, and
discuss new measurement applications that this enables.
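The idea behind hash-based selection can be sketched in a few lines. The sketch assumes that each observation point hashes the same invariant bytes of a packet (fields that do not change hop by hop) with the same function and threshold; CRC32 is used here only as a readily available stand-in for a hash function suited to this purpose, and the packet encoding is hypothetical.

```python
import zlib


def select_packet(invariant_bytes, sampling_fraction):
    """Select a packet if the hash of its invariant content falls below a threshold."""
    h = zlib.crc32(invariant_bytes) & 0xFFFFFFFF
    return h < int(sampling_fraction * 2**32)


# Hypothetical invariant content: addresses, an identifier, and leading payload bytes.
packets = [f"10.0.0.1|10.0.0.2|{ident}|payload".encode() for ident in range(1000)]

# Two observation points see the same packets, possibly in a different order,
# and apply the same function independently.
selected_at_a = {p for p in packets if select_packet(p, 0.1)}
selected_at_b = {p for p in reversed(packets) if select_packet(p, 0.1)}
```

Because the decision depends only on packet content, both points select exactly the same packets without exchanging any state, which is what distinguishes this primitive from independent random sampling at each point.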
Packet header-based flow measurements provide little visibility into properties of
the packet payload, and network- and transport-level packet headers provide
only a partial indication of traffic properties for the purposes of application characterization, security monitoring and attack mitigation, and software and protocol
debugging. Section 10.6 reviews technologies for DPI of packet payload beyond the
network- and transport-level headers, and shows how it serves these applications.

10.3 SNMP, MIBs, and RMON
In this section, we discuss traffic statistics that are maintained within routers and
the methods and protocols for their recovery. A comprehensive treatment of these
protocols and their realization can be found in [25].

10.3.1 Router Measurement Databases: MIBs
A MIB is a type of hierarchical database maintained by devices such as routers.
MIBs have been defined by equipment vendors and standardized by the IETF.
Currently, over 10,000 MIBs are defined. The MIB most relevant for traffic measurement purposes is MIB-II [60] that maintains counters for the total bytes and numbers
of unicast and multicast packets received on an interface, along with discarded and
errored packets. The Interface-MIB [59] further provides counts of multicast packets per multicast address. Protocol-specific MIBs, e.g., for MPLS [76], also provide
counts of inbound and outbound packets per interface that use those protocols.

10.3.2 Retrieval of Measurements: SNMP
SNMP [77] is the Internet protocol used to manage MIBs. An SNMP agent in the
managed device is used to access the MIB and communicate object values to or from
a network management station. SNMP has a small number of basic command types.
Read commands are used to retrieve objects from the MIB. Write commands are
used to write object values to the MIB. Notify commands are used to set conditions
under which the managed device will autonomously generate a report. The most
recent version of SNMP, SNMPv3, offers security functionality, including encryption and authentication, that was weaker or absent in earlier versions. For traffic
measurement applications, the MIB interface-level packet and byte counters are retrieved by periodic SNMP polling from the management station; a polling interval
of 5 min is common. The total packets and bytes transmitted between successive
polls are then obtained by subtraction.
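The subtraction step deserves a small illustration, because interface counters have finite width and wrap around to zero. The sketch below assumes 32-bit counters and at most one wrap between polls; the function names are illustrative, and the readings would in practice come from an SNMP library rather than being hard-coded.

```python
def counter_delta(previous, current, counter_bits=32):
    """Difference between successive counter readings, allowing for one wraparound."""
    modulus = 1 << counter_bits
    return (current - previous) % modulus


def average_rate_bps(prev_octets, curr_octets, interval_seconds, counter_bits=32):
    """Average bit rate over the polling interval from two octet-counter readings."""
    return 8 * counter_delta(prev_octets, curr_octets, counter_bits) / interval_seconds


# Ordinary case: 3,750,000 octets in a 5-min (300 s) interval is 100 kbit/s.
rate = average_rate_bps(1000, 3751000, 300)

# Wrapped case: the counter passed 2**32 - 1 and restarted from zero.
wrapped = counter_delta(4294967000, 704)
```

If the counter can wrap more than once within a polling interval, which becomes possible on fast links, the modular difference silently underestimates the traffic; shorter polling intervals or wider counters are then needed.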

10.3.3 Remote Monitoring: RMON
The RMON MIB [81] supports a more detailed capability for remote monitoring than MIB-II, enabling the aggregation and notification over relatively complex
events, e.g., involving multiple packets. The original focus of RMON was on remote
monitoring of LANs; resource limitations make RMON generally unsuitable for
monitoring high rate packet streams in the WAN context, e.g., to supply greater
detail than presented by SNMP/MIB-II measurements. Indeed, the limitations of
RMON motivate the alternate flow and packet measurement paradigm in which
samples or aggregates of packet header information are exported from the router
to a collector which supports reporting, analysis, and alarming functionality, rather
than the router performing these functions itself. We explore this paradigm in more
detail in the following sections.

10.3.4 Properties and Applications of SNMP/MIB
We now review how SNMP/MIB measurements align with the general measurement challenges described in Section 10.1.3. Scope: The major strength of SNMP
measurements is their ubiquitous availability from router MIBs. Scale: From the
data management point of view, SNMP statistics have the advantage of being relatively compact, routinely comprising a fixed-length set of data collected per interface at
each polling instant, commonly every 5 min. Granularity: The main limitation of
SNMP measurements is that they maintain packet and byte counters per interface
only. Timeliness: The externally chosen and relatively infrequent polling times for
SNMP measurements limit their utility for real-time or event-driven measurement
applications.
Historically, SNMP measurements have been a powerful tool in the management
of networks with undifferentiated service classes. SNMP statistics have been used to
trend link utilization, and network administrators have used these trends to plan and
prioritize link deployment and upgrades, on the basis of heuristics that relate link
utilization to acceptable levels of performance. Active performance measurements
using the ping and traceroute tools can also inform these decisions.
Although SNMP measurements do not directly report any constituent details
within link aggregates, network topology and routing in practice constrain the set of
possible edge-to-edge traffic flows that can give rise to the collection of measured
traffic rates over all network links. This leads to the formulation of an inverse problem to recover the edge-to-edge traffic matrices from the link aggregates. A number
of approaches have been proposed and some are sufficiently accurate to be of operational use; for further detail see Chapter 5. Knowledge of the traffic matrices
provides powerful new information beyond simple trending, because it allows the
prediction of link utilization under different scenarios for routing, topology, and
spatially heterogeneous changes in demand.
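
In its commonly used linear form, the inverse problem can be written as below. The symbols are introduced here only for illustration and are not the chapter's notation: $y$ is the vector of measured link loads, $x$ is the vector of unknown edge-to-edge demands, and $A$ is the routing matrix, assumed here to describe single-path routing.

```latex
y = A x, \qquad
A_{\ell k} =
\begin{cases}
1 & \text{if demand } k \text{ is routed over link } \ell,\\
0 & \text{otherwise.}
\end{cases}
```

A network typically has far fewer links than edge-to-edge pairs, so the system is underdetermined; this is why additional modeling assumptions are needed to recover $x$ from $y$.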

326

N. Duffield and Al Morton

10.4 Traffic Flow Measurement
This section describes traffic flow measurement. It covers the operational formation
of flow statistics, protocols for the standardization of flow measurement, the flow
measurement collection infrastructure, the sampling of both packets and flow
records in order to meet the challenges of speed and scale, and the impact of sampling
on measurement accuracy. It then reviews some recent proposals for traffic flow measurement and
aggregation, and concludes with some applications of flow measurements.

10.4.1 Flows and Flow Records
10.4.1.1 Flow and Flow Keys
A flow of traffic is a set of packets with a common property, known as the flow key,
observed within a period of time. A set of interleaved flows is depicted in Fig. 10.1.
Many routers construct and export summary statistics on flows of packets that pass
through them. A flow record can be thought of as summarizing a set of packets arising in the network through some higher-level transaction, e.g., a remote terminal
session, or a web-page download. In practice, the set of packets that are included in
a flow depends on the algorithm used by the router to assign packets to flows. The
flow key is usually specified by fields from the packet header, such as the IP source
and destination address and TCP/UDP port numbers, and may also include information from the packet’s treatment at the observation point, such as router interface(s)
traversed. Flows in which the key is specified by individual values of these fields are
often called raw flows, as opposed to aggregate flows in which the key is specified
by a range of these quantities. As we discuss further in Section 10.4.3.2, routers
commonly create flow records from a sampled substream of packets.
10.4.1.2 Operational Construction of Flow Records
Flow statistics are created as follows. A router maintains a cache comprising entries
for each active flow, i.e., those flows currently under measurement. Each entry includes the key and summary statistics for the flow such as total packets and bytes,

Fig. 10.1 Flows of observed packets, key indicated by shading


and times of observation of the first and last packets. When the router observes a
packet, it performs a cache lookup on the key to determine if the corresponding
flow is active. If not, it instantiates a new entry for that key. The flow statistics are
then updated accordingly. A router terminates the recording of a flow according to
criteria described below; then the flow’s statistics are exported in a flow record, and
the associated cache memory released for use by new flows. Flow termination criteria include: (i) inactive flow or interpacket timeout: the time since the last packet
observed for the flow exceeds some threshold; (ii) protocol-level information, e.g., a
TCP FIN packet that terminates a TCP connection; (iii) memory management: termination to release memory for new flows; and (iv) active flow timeout: to prevent
data staleness, flows are terminated after a given elapsed time since the arrival of
the first packet of the flow.
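
The cache operation and the four termination criteria can be sketched as follows. This is a toy model under stated assumptions, not a description of any router's implementation: the class and field names, the default timeout values, the choice to evict the least recently updated flow under memory pressure, and the decision to check timeouts only when a packet arrives are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    key: tuple
    first_seen: float
    last_seen: float
    packet_count: int = 0
    byte_count: int = 0


class FlowCache:
    """Toy flow cache; parameter names and defaults are illustrative."""

    def __init__(self, inactive_timeout=15.0, active_timeout=1800.0, max_entries=4):
        self.inactive_timeout = inactive_timeout
        self.active_timeout = active_timeout
        self.max_entries = max_entries
        self.entries = {}    # flow key -> FlowEntry
        self.exported = []   # (FlowEntry, reason) pairs, standing in for export

    def _terminate(self, key, reason):
        self.exported.append((self.entries.pop(key), reason))

    def expire(self, now):
        """Apply criteria (i) inactive timeout and (iv) active timeout."""
        for key, entry in list(self.entries.items()):
            if now - entry.last_seen > self.inactive_timeout:
                self._terminate(key, "inactive")
            elif now - entry.first_seen > self.active_timeout:
                self._terminate(key, "active")

    def observe(self, key, size, now, tcp_fin=False):
        self.expire(now)
        if key not in self.entries:
            if len(self.entries) >= self.max_entries:
                # (iii) memory management: evict the least recently updated flow.
                oldest = min(self.entries, key=lambda k: self.entries[k].last_seen)
                self._terminate(oldest, "memory")
            self.entries[key] = FlowEntry(key, first_seen=now, last_seen=now)
        entry = self.entries[key]
        entry.packet_count += 1
        entry.byte_count += size
        entry.last_seen = now
        if tcp_fin:
            # (ii) protocol-level information ends the flow.
            self._terminate(key, "fin")


cache = FlowCache(inactive_timeout=15.0, active_timeout=60.0)
flow_key = ("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
cache.observe(flow_key, 1500, now=0.0)
cache.observe(flow_key, 500, now=1.0)
cache.observe(flow_key, 40, now=2.0, tcp_fin=True)
record, reason = cache.exported[0]
print(reason, record.packet_count, record.byte_count)  # fin 3 2040
```

Because a packet arriving after a flow has been terminated creates a fresh entry, one long transaction can produce several flow records; a collector that needs per-transaction totals has to merge records bearing the same key.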
The summary information in the flow record may include, as well as the flow
key, and summary statistics of packet timing and size, other information relating to
the packet treatment in the router, such as interfaces traversed, next hop router, and
routing state information. Additionally, lower layer protocol information from the
packet header may be included. For example, Cisco’s NetFlow has a partial ability
to report the MPLS label stack: it can report up to three labels from the MPLS label
stack, with position in stack configurable. NetFlow can in some cases report the
loopback address of certain tunnel endpoints.
10.4.1.3 Commercial and Standardized Flow Reporting
The idea of modeling traffic as packets grouped by a common property seems first to
have appeared in [54], and the idea was taken up in support of internet accounting
in [62], and systematized as a general measurement methodology in [22]. Early
standardization efforts within the Real Time Flow Measurement working group of
the Internet Engineering Task Force (IETF) have now been supplanted by the work
of the IP Flow Information eXport working group (IPFIX) [49]. In practice flow
measurement has become largely identified with Cisco’s NetFlow [18] due to (i)
the large installed base; (ii) its emulation in other vendors’ products, and (iii) its
effective standardization by the use of NetFlow version 9 [23] as the starting point
for the IPFIX protocol. NetFlow v9 offers administrators the ability to define
and configure flow keys, aggregation schemes, and the information reported in flow
records.
An alternative reporting paradigm is provided by sFlow [71], in which header-level information from a subset of sampled packets is exported directly without
aggregating information from packets bearing the same key. sFlow reports include a
position count of the sampled packet within the original traffic stream; this facilitates
estimating traffic rates.


10.4.2 Flow Measurement Infrastructure
10.4.2.1 Generation and Export of Flow Records
Cisco originated NetFlow as a by-product of IP route caching [17], but it has
subsequently evolved as a measurement and reporting subsystem in its own right.
Other router vendors now support the compilation of flow statistics, e.g., Juniper’s
JFlow [55], with the flow information being exported using the NetFlow version
9 format or according to the IPFIX standard. Note that implementation differences
may lead to different information being reported across different routers. Standalone
monitoring devices as discussed in Section 10.6.2 may also compile and export flow
records.
Cisco Flexible NetFlow [14] provides the ability to instantiate and separately
configure multiple flow compilers that operate concurrently. This allows a single
router to serve different measurement applications that may have different requirements: traffic can be selected by first filtering on header fields; parameters such as
sampling granularity, spatial and temporal aggregation granularity, reporting detail
and frequency, and collector destination can be specified for each instantiation. We
discuss packet selection operations more generally in Section 10.5.

10.4.2.2 Collection and Mediation of Flow Records
Flow records are exported from the observation point, either directly to a collector,
or through a mediation device. NetFlow collection systems are available commercially [15] or as freeware [10], either in a basic form that receives and writes flow
records to storage, or as part of a larger traffic analysis system to support network
management functions [5, 69], or focused on specific applications such as security [68]. Although export of flow records may take place directly to the ultimate
collector, there are two architectural reasons that favor inserting mediation devices
in the export path: scalability and reliability. The primary reason is scalability. Even
with the compression of information that summarizes a set of packets in a fixed
length flow record, the volumes of flow records produced by large-scale network infrastructure are enormous. As a rough example, a network comprising 100 10 Gb/s
links that are 50% loaded in each direction, and in which each flow traverses ten
routers, each of which compiles flow statistics after packet sampling at a rate of 1 in
several hundred (see Section 10.4.3.2), would produce 1 Gb/s of flow records, i.e.,
roughly 10 TeraBytes per day.
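
The last step of this estimate is a unit conversion that can be checked directly. The sketch below takes the 1 Gb/s export rate as given and does not attempt to rederive it from the link count, load, and sampling rate, since that would require flow-size and record-size figures that the text does not state. The function name `bytes_per_day` is hypothetical.

```python
def bytes_per_day(rate_bits_per_s):
    """Convert a sustained bit rate into bytes accumulated over 24 hours."""
    seconds_per_day = 24 * 60 * 60
    return rate_bits_per_s * seconds_per_day / 8


export_rate = 1e9  # 1 Gb/s of flow records, the figure quoted in the text
terabytes = bytes_per_day(export_rate) / 1e12  # decimal terabytes
print(round(terabytes, 1))  # 10.8
```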
A secondary reason for using mediation boxes has been transmission reliability. Until recently, NetFlow has exclusively used UDP for export, in part to avoid
the need to buffer flow records at the exporter, as would be required by a reliable
transport protocol. But the use of UDP exposes flow records to potential loss in
transit, particularly over long WAN paths. Due to skew in flow length distributions
(see Section 10.4.3.3) uncontrolled loss of the records of long flows could severely
reduce measurement accuracy.


Fig. 10.2 Flow measurement collection infrastructure: hardware elements, their resources, and
sampling and aggregation operations that act on the measurements

Mediation devices can address these issues and provide additional benefits:
• Data Reduction  By aggregating and sampling flow records, then exporting the
reduced data to a central collector.
• Reliable Staging  The mediator can receive flow records over a LAN with controlled loss characteristics, then export flow records (or samples or aggregates)
to the ultimate collector using a reliable transport protocol such as TCP. NetFlow
v9 and the IPFIX protocol both support SCTP [78] for export, which gives administrators flexibility to select a desired trade-off between reliability and buffer
resource usage at the exporter.
• Distributed Query  The mediation devices may also support queries on the
flow records that traverse them, and thus together constitute a distributed query
system.
• Selective Export  Multiple streams of flow records selected according to specified criteria may be exported to collectors serving different applications.
An example of such an architecture is illustrated in Fig. 10.2; see also [39]. In each
of a number of geographically distributed router centers, a mediation device receives
flow records from its colocated routers; aggregates and samples are then exported to
the ultimate collector. Protocols for flow mediators are currently under standardization
in the IPFIX working group of the IETF [49].
10.4.2.3 Collection and Warehousing of Flow Records
The final component of the collection infrastructure is the repository that serves to
receive and store the flow records, and serve as a database for reporting and query
functions. Concerning the attributes of a data store:
• Capacity  Must be extensive; even with packet and flow sampling, a large service provider network may generate many GB of flow records per day.


• Database Management System  Must be well matched to the challenges of large
datasets, including rapid ingestion and indexing, managing large tables, a high-level query language to support complex queries, transaction logging, and data
recovery. The Daytona DBMS is an example of such a system in current use;
see [44].
• Data Sources  Interpretation of flow data typically requires joining with other
datasets, which should also be present in the management system, including but
not limited to, topology and configuration data, control plane measurements (see
Chapter 11 for a description of routing state monitoring), MIB variables acquired
by SNMP polling, network element logs from authentication, authorization, and
accounting servers, and logs from DHCP and other network servers.
• Data Quality  Data may be corrupt or missing due to failures in the collection
and reporting systems. The complexity and volume of measured data necessitate
automated mechanisms to detect, mark, and mitigate unclean data; see, e.g., [30].
• Data Security and Customer Privacy  Flow measurements and other data listed
should be considered as sensitive customer information. Service provider policies
must specify practices to maintain the integrity of the data, including controlled
and auditable access restricted to individuals needing to work with the data,
encryption, anonymization, and data retention policies.

10.4.3 Sampling in Flow Measurement and Collection
10.4.3.1 Sampling as a Data Reduction Method
In the previous sections, we have touched on the fact that the speed of communications links provides a challenge for the formation of flow records at the router,
and both speed and the scale of networks – the large number of interfaces that can
produce flow records – provide a challenge for the collection and storage of flow
records. Figure 10.2 illustrates the relevant resources at the router, mediator, and
collector. To meet these challenges, data reduction must be performed. The reduction method must be well matched to the uses to which the reduced data is put.
Three reduction methods are usually considered:
• Aggregation  Summarizing measurements that share common properties. In the
context of traffic flow measurement, header-level information on packets with the
same key is aggregated into flows. Subsequent aggregation of flow records into
predefined aggregates (e.g., aggregate traffic to each routing prefix) is a powerful
tool for routine reporting.
• Filtering  Selection of a subset of measurements that match a specified criterion. Filtering is useful for drill down (e.g., to a traffic subset of interest).
• Sampling  Selection of data points according to some nondeterministic
criterion.


A limitation of aggregation and filtering as general data reduction methods is the
manner in which they lose visibility into the data: traffic not matching a filter is discarded; detail within an aggregate is lost (while flow records aggregate packets over
time, they need not aggregate spatially, i.e., over packet header values). Of the three
methods, only sampling retains the spatial granularity of the original data, and thus
retains the ability to support arbitrary aggregations of the data, including those formulated after the measurements were made. This is important to support exploratory,
forensic, and troubleshooting functions, where the traffic aggregates of interest are
typically not known in advance. The downside of sampling is the statistical uncertainty in the resulting measurements; we address this further in Section 10.4.3.4.
We now discuss sampling operations used during the construction and recovery of flow measurements. As illustrated in Fig. 10.2, packet sampling (see
Section 10.4.3.2) is used in routers in order to reduce the rate of the stream of
packet header information from which flow records are aggregated. The complete
flow records are then subjected to further sampling (see Section 10.4.3.3) and aggregation within the collection infrastructure, at the mediator to reduce data volumes,
or in the collector, for example, dynamically sampling from a flow record database
in order to reduce query execution times, or permanently in order to select a representative set of flow records (or their aggregates) for archiving. We discuss the
ramifications of sampling for measurement accuracy in Section 10.4.3.4, and some
more recent developments in stateful sampling and aggregation that straddle the
packet and flow levels in Section 10.4.3.5. Finally, we look ahead to Section 10.5,
which sets random packet sampling in the broader context of packet selection operations and their applications, including filtering, both in the sense understood above,
and also consistent packet selection as exemplified by hash-based sampling.

10.4.3.2 Random Packet Sampled Flows
The main resource constraint for forming flow records is at the router flow cache
in which the keys of active flows are maintained. To look up packet keys at the full
line rate of the router interfaces would require the cache to operate in fast, expensive memory (SRAM). Moreover, routers carry increasingly large numbers of flows
concurrently, necessitating a large cache. By sampling the packet stream in advance
of the construction of flow records, the cache lookup rate is reduced, enabling the
cache to be implemented in slower, less expensive, memory (DRAM).
A number of different sampling methods are available. Cisco’s Sampled NetFlow
systematically samples every Nth packet, where N is a configurable parameter. The Random Sampled NetFlow feature [21] employs stratified sampling based on
arrival count: one packet is selected at random out of every window of N consecutive arrivals. Although these two methods have the same average sampling rate,
there are higher-order differences in the way multiple packets are sampled; for
example, consecutive packets are never selected in Sampled NetFlow, while they
can be in Random Sampled NetFlow. However, the effect of such differences on
flow statistics is expected to be small except possibly for flows that represent a


noticeable proportion (greater than 1/N) of the load, since the position of a given
flow’s packets in the packet arrival order at an interface is then effectively randomized by the remaining traffic. In distinction, Juniper’s J-flow [55] offers the ability
to sample runs of consecutive packets.
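
The difference between the two 1-in-N schemes can be sketched as follows. This is a simplified model that works on packet positions only; the function names are hypothetical and the code is not derived from any vendor implementation.

```python
import random


def systematic_sample(num_packets, n):
    """Positions selected by deterministic 1-in-n sampling (every n-th packet)."""
    return list(range(n - 1, num_packets, n))


def stratified_random_sample(num_packets, n, rng):
    """Pick one position uniformly at random from each full window of n arrivals."""
    selected = []
    for start in range(0, num_packets - n + 1, n):
        selected.append(start + rng.randrange(n))
    return selected


def has_consecutive(positions):
    """True if two selected packets were adjacent in the arrival order."""
    return any(b - a == 1 for a, b in zip(positions, positions[1:]))


rng = random.Random(1)  # fixed seed so that the sketch is repeatable
sys_idx = systematic_sample(10_000, 10)
rnd_idx = stratified_random_sample(10_000, 10, rng)
print(len(sys_idx), len(rnd_idx))  # 1000 1000: same average sampling rate
print(has_consecutive(sys_idx))    # False: gaps are always exactly n
print(has_consecutive(rnd_idx))    # may be True: last of one window, first of the next
```

In the stratified scheme two adjacent packets can both be selected only when they straddle a window boundary, which is the higher-order difference referred to above.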
Sampling and other packet selection methods have been standardized in the
PSAMP working group of the IETF [24,32,33,82]. We review these in greater detail
in Section 10.5. PSAMP is positioned as a protocol to select packets for reporting at
an observation point, with IPFIX as the export protocol. For example, selected packets could be reported on as single packet flow records, using zero active timeout for
immediate reporting.
If sampling 1 out of N packets on average, then from a flow with far fewer than
N packets, if any packets are sampled, typically only one packet will be sampled. In
this case one might just as well sample packets without constructing flow records;
this would save resources at the router since there would be no need to cache the single packet flows until expiration of the interpacket timeout. Indeed, there are many
short flows: web traffic is a large component of Internet traffic, in which the average flow length is quite short, around 16 packets in one study [42]. However, there
are several reasons to expect that longer flows will continue to account for much
traffic. First, several prevalent applications and application classes predominantly
generate long-lived flows, for example, multimedia downloads and streaming, and
VoIP. Secondly, tunneling protocols such as IPSEC [56] may aggregate flows between multiple endpoints into a packet stream in which the endpoint identities are
not visible in the network core; from the measurement standpoint, the stream will
thus appear as a single longer flow. For these reasons, unless packet sampling periods become comparable with or larger than the number of packets in these flows,
flow statistics will still afford useful compression of information.

10.4.3.3 Flow Record Sampling
Sampling flow records presents a challenge because of the highly skewed distribution of flow sizes found in network traffic. Experimental studies have shown that
the distribution of flow lengths is heavy tailed; in particular, a large proportion of
the total bytes and packets in the traffic stream occur in a small proportion of the
flows; see, e.g., [42]. This makes the requirements for flow record sampling fundamentally different from those for packet sampling. Because packets have a bounded size whereas flows do not,
uniform sampling, and uncontrolled sampling due to transmission loss, are far more problematic for flow records than for packets, since omission of a single flow record
can have a huge impact on measured traffic volumes. This motivates sampling dependent on the size of the flow reported on. A simple approach would be to discard flow
records whose byte size falls below a threshold. This gives a conservative, and hence
biased, measure of the total bytes, and is susceptible to subversion: an application or
user that splits its traffic up into small flows could evade measurement altogether.
This would be a weakness for accounting and security applications.


Smart Sampling can be used to avoid the problems associated with uniform sampling of flow records. Smart Sampling is designed with the specific aim of achieving
the optimal trade-off between the number of flow records actually sampled, and the
accuracy of estimates of underlying traffic volumes derived from those samples.
In the simplest form of Smart Sampling, called Threshold Sampling [36], each
flow record is sampled independently with a probability that depends on the reported
flow bytes: all records that report flow bytes greater than a certain threshold z are
selected; those below threshold are selected with a probability proportional to the
flow bytes. Thus, the probability to sample a flow record representing x bytes is
$$p_z(x) = \min\{1, x/z\}$$
The desired optimality property described above holds in the following sense. Suppose $X$ bytes are distributed over some number $m$ of flows of sizes $x_1, \ldots, x_m$, so
that $X = \sum_{i=1}^{m} x_i$. We consider unbiased estimates $\hat{X}$ of $X$, i.e., $\hat{X}$ is a random
quantity whose average value is $X$. Suppose $\hat{X}$ is an unbiased estimate of $X$ obtained from a random selection of a subset of $n < m$ of the original flows, having
sizes $x_1, \ldots, x_n$, where selection is independent according to some size-dependent
probability $p(x)$. A standard procedure to obtain unbiased estimates is to divide the
measured value by the probability that it was sampled [47]. Thus in our case each
sampled flow size is normalized by its sampling rate, so that
$$\hat{X} = \sum_{i=1}^{n} x_i / p(x_i)$$
is an unbiased estimate of $X$. We express the optimal trade-off as trying to minimize
a total “cost” that is a linear combination
$$C_z = z^2\, \mathrm{E}[n] + \mathrm{Var}[\hat{X}]$$
of the average number of samples and the estimation variance, where $z$ is a parameter
that expresses the relative importance we attach to making the number of samples
small versus making the variance small. For example, when $z$ is large, making $\mathrm{E}[n]$
small has a larger effect on reducing $C_z$. It is proved in [36] that the cost $C_z$ is
minimized for any set of flow sizes $x_1, \ldots, x_m$ by using the sampling probabilities
$p(x) = p_z(x)$. With the probabilities $p_z$, each selected flow $x_i$ gives rise to an
estimate $x_i / p_z(x_i) = \max\{x_i, z\}$.
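
Threshold Sampling as described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [36]; the function names are hypothetical, and the flow sizes and threshold in the example are made up.

```python
import random


def sampling_probability(x, z):
    """p_z(x) = min(1, x/z) for a flow record reporting x bytes."""
    return min(1.0, x / z)


def threshold_sample(flow_bytes, z, rng):
    """Return (x, estimate) for each selected record; estimate = max(x, z)."""
    selected = []
    for x in flow_bytes:
        if rng.random() < sampling_probability(x, z):
            selected.append((x, max(x, z)))
    return selected


def expected_estimate(x, z):
    """Mean contribution of one flow: p_z(x) * max(x, z), which equals x."""
    return sampling_probability(x, z) * max(x, z)


flows = [40, 1_500, 20_000, 5_000_000]  # hypothetical flow sizes in bytes
z = 10_000                              # hypothetical threshold in bytes
sample = threshold_sample(flows, z, random.Random(0))
print(sample)                                # flows of z bytes or more always appear
print(sum(est for _, est in sample), sum(flows))  # estimated vs. true total bytes
```

A record below the threshold is either dropped or kept with its size reported as $z$, so every flow contributes its true size on average, whereas simply discarding small records would always undercount.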
Although optimal as stated, Threshold Sampling does not control the exact number of samples taken. For example, if the number of flows doubles during a burst,
then on average, the number of samples also doubles (assuming the same flow size
distribution). However, exact control may be required in some applications, e.g.,
when storage for samples has a fixed size constraint, or for sampling a specified
number of representative records for archiving. A variant of Smart Sampling, called
Priority Sampling [37], is able to achieve a fixed sample size $k < m$, as follows.
Each flow of size $x_i$ is assigned a random priority $w_i = x_i / a_i$, where $a_i$ is a uniformly distributed random number in $(0, 1]$. Then the $k$ flows of highest priority are
selected for sampling, and each of them contributes an estimate $\max\{x_i, z'\}$, where $z'$
is now a data-dependent threshold set to the $(k+1)$st largest priority. It is shown
in [37] that this estimate is unbiased.
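
The selection rule can be sketched as follows. This is an illustrative reading of the description above rather than the code of [37]; the function name is hypothetical, and the behavior when there are no more than $k$ flows (keep everything and report exact sizes) is an added assumption.

```python
import random


def priority_sample(flow_bytes, k, rng):
    """Select k flows by priority; return (estimates, threshold).

    estimates is a list of (x, max(x, z')) pairs, where z' is the
    (k+1)-st largest priority. Assumption: with at most k flows, every
    flow is kept and reported exactly, with threshold 0.
    """
    # Priority w = x / a with a uniform on (0, 1]; 1 - random() avoids a == 0.
    prioritized = [(x / (1.0 - rng.random()), x) for x in flow_bytes]
    prioritized.sort(reverse=True)
    if len(prioritized) <= k:
        return [(x, x) for _, x in prioritized], 0.0
    threshold = prioritized[k][0]
    return [(x, max(x, threshold)) for _, x in prioritized[:k]], threshold


rng = random.Random(7)
flows = [40, 1_500, 20_000, 300_000, 5_000_000]  # hypothetical sizes in bytes
estimates, z_prime = priority_sample(flows, 3, rng)
print(len(estimates))  # 3: the sample size is fixed in advance

# Averaging the estimated total over many independent runs should come
# close to the true total if the estimator is unbiased.
runs = 20_000
average = sum(sum(e for _, e in priority_sample(flows, 3, rng)[0])
              for _ in range(runs)) / runs
print(sum(flows), round(average))
```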


Priority Sampling is well suited for back-end database applications serving
queries that require estimation of total bytes in an arbitrary selection of flows (e.g.,
all those in a specific matrix element) over a specified time period. A random priority is generated once for each flow, and the records are stored in descending order of
priority. Then an estimate based on k flows proceeds by reading the k + 1 flow records
of highest priority that match the selection criterion, forming an unbiased estimate
as above. Because the flow records already are in priority-sorted order, selection is
very fast (see [4]).

10.4.3.4 Estimation and the Statistical Impact of Sampling
Whether sampling packets or flow records, the measured numbers of packets, bytes,
or flows must be normalized in order to give an unbiased estimate of the actual traffic
from which they were derived; we saw how this was done for threshold sampling in
Section 10.4.3.3. For 1 in N packet sampling, byte estimates from selected packets
are multiplied by N . The use of sampling for measuring traffic raises the question of
how accurate estimates of traffic volumes will be. The statistical nature of estimates
might be thought to preclude their use for some purposes. However, for many sampling schemes, including those described above, the frequency of estimation errors
of a given size can be computed or approximated. This can help answer questions
such as “if no packets matching a given key were sampled, then how likely is it that
there were X or more bytes in packets with this key that were missed?”
A rough indication of estimation error is the relative standard deviation (RSD),
i.e., the standard deviation of the estimator $\hat{X}$ divided by the true value $X$. The RSD
for estimating an aggregate of $X$ bytes of traffic using independent 1 in $N$ packet
sampling is bounded above by $\sqrt{N x_{\max} / X}$, where $x_{\max}$ is the maximum packet size.
For flow sampling with threshold $z$, the RSD is bounded above by $\sqrt{z / X}$. Observe that
the RSD decreases as the aggregate size increases. In cases where multiple stages
of sampling and aggregation are employed – for example, packet sampled NetFlow
followed by Threshold Sampling of flow records – the sampling variance is additive.
In the example, the RSD becomes
$$\sqrt{(z + N x_{\max}) / X}$$
As an example, consider 1 in $N = 1{,}000$ sampling of packets of maximum size
$x_{\max} = 1{,}500$ bytes with a flow sampling threshold of $z = 50$ MB. In this case
$z \gg N x_{\max} = 1.5$ MB, and so Smart Sampling contributes most of the estimation error.
With these sampling parameters, estimating the 10 min average rate of a 1 Gb/s
traffic stream on a backbone would incur a typical relative error of 3%. In
fact, rigorous confidence intervals for the true bytes in terms of the estimated values
can be derived (see [26, 79]), including for some cases of multistage sampling.
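
The 3% figure can be reproduced from the bound. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 Gb/s = 10^9 bits per second), which the text does not state explicitly; the function name `rsd_bound` is hypothetical.

```python
import math


def rsd_bound(total_bytes, n=1, x_max=0.0, z=0.0):
    """Upper bound sqrt((z + n * x_max) / X) on the relative standard deviation.

    Leave z at 0 for packet sampling alone, or x_max at 0 for flow
    sampling alone.
    """
    return math.sqrt((z + n * x_max) / total_bytes)


# Parameters from the example in the text (decimal units assumed).
n, x_max, z = 1_000, 1_500, 50e6
total = 1e9 / 8 * 600  # bytes carried in 10 min at 1 Gb/s

print(rsd_bound(total, n=n, x_max=x_max))       # packet sampling only
print(rsd_bound(total, z=z))                    # Threshold Sampling only
print(rsd_bound(total, n=n, x_max=x_max, z=z))  # both stages, about 0.026
```

The combined bound is only slightly larger than the Threshold Sampling term alone, which matches the observation that flow sampling contributes most of the error for these parameters.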
Using an analysis of the sampling errors, the impact of flow sampling on usage-based charging, and ways to avoid or ameliorate estimation error, are described in
[35]. The key idea is that a combination of (i) systematic undercounting of customer


traffic by a small amount, and (ii) using sufficiently long billing periods, can reduce
the likelihood of over-billing customers to an arbitrarily small probability.

10.4.3.5 Stateful Packet Sampling and Aggregation
The dichotomy between packet sampling on a router and flow sampling in the measurement infrastructure, while architecturally simple, does not necessarily result in
the best trade-off between resource usage and measurement accuracy. We briefly review some recent research that proposes to maintain various degrees of router state
in order to select and maintain flow records for subsets of packets.
• Sample and Hold [41]  All packets arriving at the router whose keys are not
currently in the flow cache are subjected to sampling; packets that are selected in
this manner have a corresponding flow cache entry created, and all subsequent
packets with the same key are selected (subject to timeout). Thus, long flows are
preferentially sampled over short flows, since the flow cache tends to be populated only by the longer flows. This achieves similar aims to Smart Sampling
but in a purely packet-based solution. While the cache can be made smaller than
would be required to measure all flows, a cache lookup is still required for each
packet.
• Adaptive Sampling Methods  Both NetFlow and Sample and Hold can be made
adaptive by adjusting their underlying sampling rate and flow termination criteria
in response to resource usage, e.g., to control cache occupancy and flow record
export rate. Now recall from Section 10.4.3.3 that construction of unbiased estimators requires normalization of sampled byte and packet counts by dividing by
the sampling rate. Adjustment of the sampling rate requires matching renormalization in estimators in order to maintain unbiasedness. Partial flow records may
be resampled (and further renormalized) and may be discarded in some cases (see
[40]). In one variant of this approach the router maintains and exports a strictly
bounded number of flow records, providing unbiased estimates of the original
traffic bytes.
• Stepping Methods  Stepping is an extension of the adaptive method in which,
when downward adjustments of the sampling rate occur, estimates of the total
bytes in packets of a given key that arrived since the previous such adjustment –
the steps – are sampled and exported from the flow cache. Such exports can take
place from the flow cache into DRAM, where the steps can be aggregated. The
payoff is higher estimation accuracy, because once exported, the steps are not
subject to loss (see [27]).
• Run-Based Estimation  In its simplest form, run-based estimation involves
caching in SRAM only the key of the last observed packet. If the current packet
matches the key, the run event is registered in a cache in DRAM. Using a time-series model, the statistics of the original traffic are estimated from those of the
runs. A generalization of the approach can additionally utilize longer runs [45].
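
Of the methods listed above, Sample and Hold is the simplest to sketch. The code below is a minimal illustration of the selection rule only, not the algorithm of [41] in full: timeouts, cache-size limits, and any correction for packets missed before an entry was created are omitted, and the function name and traffic mix are hypothetical.

```python
import random


def sample_and_hold(packets, p, rng):
    """packets: iterable of (key, size). Returns {key: [packets, bytes]}.

    A packet whose key is not yet cached is sampled with probability p;
    once an entry exists, every later packet with that key is counted.
    """
    cache = {}
    for key, size in packets:
        if key not in cache:
            if rng.random() >= p:
                continue  # not sampled, and no state is created
            cache[key] = [0, 0]
        cache[key][0] += 1
        cache[key][1] += size
    return cache


# Hypothetical traffic mix: one long flow interleaved with many 2-packet flows.
packets = []
for i in range(500):
    packets.append(("long", 1500))
    packets.append((f"short-{i}", 60))
    packets.append((f"short-{i}", 60))

held = sample_and_hold(packets, p=0.02, rng=random.Random(5))
print(sorted(held))        # keys that obtained a cache entry
print(held.get("long"))    # counts cover packets from the first sampled one onward
```

With a small sampling probability, a short flow rarely has any packet selected, whereas a long flow offers many chances to be picked up and is then counted in full from that point on.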


10.5 Packet Selection Methods for Traffic Flow Measurement
10.5.1 Packet Selection Primitives and Standards
In Section 10.4.3.2 random packet sampling was presented as a necessity for reducing packet rates prior to the formation of flow statistics; moreover, random sampling
has significant advantages over filtering and aggregation as a continuously operating
general data reduction method. In this section we shift the emphasis somewhat and
consider a set of packet selection primitives, and their ability to serve a variety of
specific measurement applications. Following [33] we classify selection primitives
as follows:
• Filtering  Selection of packets based deterministically on their content. There
are two important subcases:
– Property Match Filtering  Selection of a packet if a field or fields match a
predefined value.
– Hash-Based Selection  A hash of the packet is calculated and the packet is
selected if it falls in a certain range.
• Sampling  Selection of packets nondeterministically.

Some primitives of this type are provided by Cisco Flexible NetFlow [14], which
allows combinations of certain random sampling and property match filters. The
framework above was standardized in the Packet Sampling (PSAMP) working group
of the IETF [33]. A collection of sampling primitives is described in [82], including
but not limited to the fixed rate sampling from Section 10.4.3.2. Property match
filtering can be based on packet header fields (such as IP address and port) and the
packet treatment by the router, including interfaces traversed, and the routing state
in operation during the packet’s transit of the router. Hash-based selection, including
specific hash functions, is also standardized in [82]. We describe the operation and
applications of hash-based selection in Section 10.5.2.
From both the implementation and standards viewpoints, packet selection is
positioned as a front-end process that passes selected packets to a process that compiles and exports flow statistics. Thus, a PSAMP packet selector passes packets to
an IPFIX flow reporting process. A flow record can report on single selected packets
by setting the inactive flow timeout to zero. A key development in support of network management is the ability of routers and other measurement devices to support
simultaneous operation of multiple independent measurements, each of which is
composed of combinations of packet selection primitives. This type of capability is
already present in Cisco Flexible NetFlow [14] and standardized in PSAMP/IPFIX.
Each packet selection process can, in principle, be associated with its own independently configurable flow reporting process. The ability to dynamically configure
or reconfigure packet selection provides a powerful tool for a variety of applications, from low-rate sampling of all traffic to supply routine reporting for Network
Operation Center (NOC) wallboard displays, to targeted high-rate sampling that
drills down on an anomaly in real time (see Fig. 10.3).
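
The combination of primitives into several concurrent, independently configured selectors can be sketched as follows. This is only an illustration under stated assumptions: the packet representation (a dict with a `dst_port` field), the selector names, and the sampling rates are hypothetical and are not taken from PSAMP, IPFIX, or Flexible NetFlow.

```python
import random

def property_match(field, value):
    """Filtering primitive: select a packet if a header field equals a value."""
    return lambda pkt: pkt.get(field) == value

def random_sample(probability, rng):
    """Sampling primitive: select each packet independently with a fixed probability."""
    return lambda pkt: rng.random() < probability

def chain(*selectors):
    """Composite selector: a packet is selected only if every stage selects it."""
    return lambda pkt: all(sel(pkt) for sel in selectors)

rng = random.Random(7)  # seeded only to make the sketch repeatable

# Two hypothetical concurrent measurements, each with its own selection process.
selectors = {
    "noc_wallboard": random_sample(0.001, rng),            # low-rate, all traffic
    "drill_down": chain(property_match("dst_port", 53),    # targeted filter ...
                        random_sample(0.5, rng)),          # ... then high-rate sampling
}

def select(pkt):
    """Return the names of all measurements that select this packet."""
    return [name for name, sel in selectors.items() if sel(pkt)]
```

Each named selector would feed its own flow reporting process; reconfiguring a measurement amounts to replacing one entry in `selectors`.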

10

Measurements of Data Plane Reliability and Performance

337

Fig. 10.3 Concurrent combinations of sampling and filtering packet selection primitives

10.5.2 Consistent Packet Sampling and Hash-Based Selection
The aim of consistent packet sampling (also called Trajectory Sampling) is to sample a subset of packets at some or all routers that they traverse. The motivation is
new measurement applications that are enabled or enhanced; see below. Consistent
packet sampling can be implemented through hash-based selection. Routers calculate a hash of packet content that is invariant along the packet path, and the packet is
selected for reporting if the hash value falls in a specified range. When all routers
use the same hash function and range, the sampling decisions for each packet are
identical at all points along its path. Thus, each packet signals implicitly to the router
whether it should be sampled. Information on the sampled packet can be reported
in flow records, potentially one per sampled packet. In order to aid association of
different reports on the same packet by the collector, the report can include not only
packet header fields, but also a packet label or digest, taking the form of a hash (distinct from that used for selection) whose input includes part of the packet payload.
An ideal hash function would provide the appearance of uniform random sampling over the possible hash input values. This is important both for accurate traffic
estimation purposes, and for integrity: network attackers should not be able to predict packet sampling outcomes. Use of a cryptographic hash function with a private
parameter provides the strongest conformance to the ideal. In practice, implementation constraints on computational resources may require weaker hash functions to
be used. Hash-based packet selection was proposed in [38], with further work
on its application to passive performance monitoring in [34, 83]. Security ramifications of different hash function choices are discussed in [43]. Hash-based sampling
has been standardized as part of the PSAMP standard in the IETF [82].
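
A minimal sketch of consistent hash-based selection is given below. It is not one of the hash functions standardized in PSAMP: the choice of keyed BLAKE2b, the set of invariant fields, the 16-byte payload prefix, and all function names are assumptions made for illustration. The same keyed hash with a different personalization string stands in for the separate label hash.

```python
import hashlib
import struct

def invariant_bytes(pkt):
    """Concatenate content that is unchanged hop by hop (TTL and checksum are excluded)."""
    return b"|".join([
        pkt["src"].encode(),
        pkt["dst"].encode(),
        struct.pack("!H", pkt["ip_id"]),
        pkt["payload"][:16],          # part of the payload, as an assumed input
    ])

def keyed_hash(data, key, person):
    """64-bit keyed hash; `key` plays the role of the private parameter."""
    digest = hashlib.blake2b(data, key=key, digest_size=8, person=person).digest()
    return int.from_bytes(digest, "big")

def is_selected(pkt, key, fraction):
    """Select the packet if its hash falls in the range [0, fraction * 2**64)."""
    threshold = int(fraction * 2**64)
    return keyed_hash(invariant_bytes(pkt), key, b"select") < threshold

def label(pkt, key):
    """Packet label for the collector, computed with a hash distinct from the selection hash."""
    return keyed_hash(invariant_bytes(pkt), key, b"label")
```

Because every router that shares `key` and `fraction` evaluates the same function on the same invariant bytes, the decision is identical at each hop even though fields such as the TTL change in transit.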

Applications of consistent sampling include:
• Route Troubleshooting: Direct measurements of packet paths can be used to
detect routing loops and measure transient behavior of traffic paths under routing changes. This detailed view is not provided by monitoring routing protocols
alone. Independent packet sampling at different locations does not provide such
a fine timescale view in general, since a given packet is typically not sampled at
multiple locations.
• Passive Performance Measurement: Correlating packet samples at two or more
points on a path enables direct measurement of the performance experienced by
traffic on the path, such as loss (as indicated by packets present at one point on
the path that are missing downstream) and latency (if reports on sampled packets
include measurement timestamps from synchronized clocks). This is an attractive
application for service providers since it can alert them to performance degradation at the
level of individual customers, reflecting the same packet transit performance that
customers themselves experience.
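
As a sketch of the passive performance measurement application, a collector could correlate reports from two measurement points by packet label. The data layout (a dict from label to timestamp in seconds) and the function name are hypothetical, and the timestamps are assumed to come from synchronized clocks.

```python
def path_performance(upstream, downstream):
    """Estimate loss ratio and mean latency between two measurement points.

    upstream, downstream: dicts mapping packet label -> observation timestamp (seconds).
    Packets seen upstream but not downstream are counted as lost.
    """
    matched = upstream.keys() & downstream.keys()
    lost = upstream.keys() - downstream.keys()
    delays = [downstream[lbl] - upstream[lbl] for lbl in matched]
    loss_ratio = len(lost) / len(upstream) if upstream else 0.0
    mean_delay = sum(delays) / len(delays) if delays else None
    return loss_ratio, mean_delay
```

For example, with four sampled packets upstream and three of them reported downstream, the function returns a loss ratio of 0.25 together with the mean of the three one-way delays.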

10.6 Deep Packet Inspection
Sections 10.4 and 10.5 are concerned with the measurement and characterization
of traffic at the granularity of a flow key that depends on the packet only through
header fields. However, there are important network management tasks that depend
on knowledge of packet payloads, and hence for which traffic flow monitoring is
insufficient. The term DPI denotes measurement and possible treatment of packets
based on their payload. We describe some broad design and policy issues associated
with the deployment of DPI in Section 10.6.1; specific technologies for DPI devices
are described in Section 10.6.2, and three applications of DPI for network management in Section 10.6.3: application-specific bandwidth management, network
security monitoring, and troubleshooting.

10.6.1 Design and Policy Issues for DPI Deployment
DPI functions are not uniformly featured in routers, and hence some uses will require additional infrastructure deployment. DPI is extremely resource intensive due
to the need to access and process packet payload at line rate. This makes DPI expensive compared with flow measurement, which hinders its widespread deployment.
A limited deployment may be restricted to important functional sites, or to a representative subset of different site types, e.g., at a backbone link, at an aggregation router,
or in front of a datacenter.
Like all traffic measurements, DPI must maintain privacy and confidentiality of
customer information throughout the measurement collection and analysis process.
Although flow measurements already encode patterns of communications through
source and destination IP addresses, DPI of packet payload may also encompass
the content of the communications. Service provider policies must specify practices
to maintain the privacy of the data, including controlled and auditable access restricted to individuals needing to work with the data, encryption, anonymization,
and data retention policies. See also the discussion specific to DPI for security monitoring in Section 13.4. Furthermore, any use of DPI data must be conducted in
accordance with legal regulations in force. Similar issues exist for providers of host-based services as opposed to communications services, where servers intrinsically
have access to user-specific data that may be presented by the customer in the course
of using those services, e.g., email, search, or e-commerce transactions.

10.6.2 Technologies for DPI
DPI functionality is realized in dedicated general-purpose traffic monitors [28], and
within vendor equipment targeted at specific applications such as security monitoring [68] and application-specific bandwidth management [19]. As the value of
DPI-based applications for service providers grows, DPI functionality has also appeared in some routers and switches [16]. General-purpose computing platforms
have been used for DPI, e.g., using Snort [74], an open-source intrusion detection
system. Some DPI devices operate in line where they perform network management
functions directly, such as security-based filtering or application bandwidth management. Others act purely as monitors and require a copy of the packet stream to
be presented at an interface. There are several ways by which this can be accomplished: (i) by copying the physical signal that carries the packets, e.g., with an
optical splitter; (ii) by attaching the monitor to a shared medium carrying the traffic,
or (iii) by having a router or switch copy packets to an interface on the monitor.
The architectural challenges for all DPI platforms are: (i) the high incoming
packet rate; (ii) the large number of distinct signatures against which each packet
is to be matched – Snort has several hundred – and (iii) signatures that match over
multiple packets, and hence require flow-level state to be maintained in the measurement device. These factors have tended to favor the use of dedicated DPI devices
ahead of router-based integration in the past. They also drive architectural design
for DPI devices in which aggregation and analysis are pushed down as close to the
data stream as possible.
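
Challenge (iii), signatures that match over multiple packets, can be illustrated with a small sketch that keeps a few bytes of per-flow state. This is not how Snort or any particular DPI device is implemented; the class name, the flow key, and the byte-string signatures are assumptions, and the sketch presumes in-order delivery of non-empty signatures.

```python
class FlowSignatureMatcher:
    """Match byte signatures that may straddle packet boundaries within a flow."""

    def __init__(self, signatures):
        self.signatures = [bytes(s) for s in signatures]
        # Enough trailing bytes to complete any signature begun in an earlier packet.
        self.carry_len = max(len(s) for s in self.signatures) - 1
        self.tails = {}  # flow key -> trailing payload bytes (the flow-level state)

    def inspect(self, flow_key, payload):
        """Return the signatures whose match ends inside this packet's payload."""
        tail = self.tails.get(flow_key, b"")
        window = tail + payload
        hits = []
        for sig in self.signatures:
            # Skip matches that lie entirely in the carried-over tail,
            # since they were already reported for an earlier packet.
            start = max(0, len(tail) - len(sig) + 1)
            if window.find(sig, start) != -1:
                hits.append(sig)
        self.tails[flow_key] = window[-self.carry_len:] if self.carry_len else b""
        return hits
```

The per-flow state here is bounded by the longest signature, which hints at why memory for flow-level state is a design constraint at high packet rates.
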
Coupled with general-purpose computational platforms, tcpdump [52] is public
domain software that captures packets at an interface of the host on which it executes. Tcpdump has been widely used both as a diagnostic tool and to capture
packet header traces in order to conduct reproducible exploratory studies. However,
the enormous byte rates of network data in comparison with storage and transmission resources generally preclude collecting packet header traces longer than a few
minutes or perhaps hours. A number of anonymized packet header traces have been
made available by researchers; see, e.g., [9]. Software for removal of confidential
information from packet traces, including anonymization, is available (see [63]).

10.6.3 Applications of DPI
In this section, we motivate the importance of DPI by describing network management applications that require detail from packet payload: application characterization and management, network security, and network debugging.

10.6.3.1 Application Demand Characterization and Bandwidth Management
Applications place diverse service requirements on the network. For example, real-time applications such as VoIP require relatively small bandwidth but have stringent
latency requirements. Video downloads require high throughput but are elastic in
terms of latency. Service providers can differentiate resources among the different
service classes according to the size of the demands in each class. Hence a crucial
task for network planning is to characterize and track changes in the traffic mix
across application classes.
In the past, application and application class could be inferred reasonably well
from TCP/UDP port numbers on the basis of IANA well-known port assignments
[50]. However, purely port-based identification is becoming less reliable due to factors
including (i) lack of adherence to port conventions by application designers; (ii) piggybacking of applications on well-known ports, such as HTTP port 80, in order to
facilitate firewall traversal; and (iii) separation of control and data channels with
dynamic allocation of the data port during control-level handshaking (see Chapter 5
for further details). On the other hand, knowledge of application operation can be
used to develop packet content-level signatures. In some cases, this would involve
matching strings of an application-level protocol across one or more network packets. For applications that use separate data and control channels, this could entail
(a) matching a signature of the control channel in the manner just described, with
further inspection; then (b) identifying the data channel port communicated in the
control channel; and (c) using the identified data channel port to classify further packet
or flow level measurements taken (see [80]).
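
Steps (a) to (c) can be sketched using FTP passive mode as an assumed example of a protocol with separate control and data channels; the text above does not name a specific protocol. In FTP, the server's `227` reply carries the data endpoint as six decimal numbers, the last two encoding the port. The class and method names are hypothetical.

```python
import re

# (a) Signature of the control channel: an FTP "227 Entering Passive Mode" reply.
PASV_REPLY = re.compile(rb"^227 [^(]*\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)")

class ControlDataClassifier:
    def __init__(self):
        self.data_endpoints = {}  # (ip, port) -> application label

    def on_control_payload(self, payload):
        """(b) Extract the data channel endpoint announced on the control channel."""
        m = PASV_REPLY.match(payload)
        if m is None:
            return None
        h1, h2, h3, h4, p1, p2 = (int(g) for g in m.groups())
        endpoint = (f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2)
        self.data_endpoints[endpoint] = "ftp-data"
        return endpoint

    def classify(self, dst_ip, dst_port):
        """(c) Classify later packets or flows by the learned data endpoint."""
        return self.data_endpoints.get((dst_ip, dst_port), "unknown")
```

A production classifier would also need to expire learned endpoints and handle replies split across packets; both are omitted here.
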
Application-based classification can be used purely passively. Knowledge of the
mix and relative growth between different application classes is necessary for network planning. It can also be used actively to apply differentiated resource allocation
policies to different application classes, concerning traffic shaping, dropping of out-of-profile packets, or restoration priority after failures. As an example, access to
a customer access channel can be prioritized so that the performance of delay-sensitive VoIP traffic is not impaired by other traffic. A number of vendors supply
equipment with such capabilities (see e.g. [19, 75]).

10.6.3.2 Network Security
While some network attacks can be identified based on header-level information,
this is not true in general. As an example of the former, the well-known Slammer worm
[64] was evident due to (i) its rapid growth, which led to sharp increases in traffic
volume; (ii) the association of the increase with particular values of packet header
fields; and (iii) contextual information that the exploited application predominantly
exchanges traffic across LANs or intranets rather than across the WAN. This combination of factors made it relatively easy to identify the worm and block its spread by
instantiating header-level packet filters, without significantly impacting legitimate
traffic.
However, these conditions do not hold in general. Many network attacks exploit
vulnerabilities in common applications such as email, chat, p2p, and web-browsing
mediated by network communications that, in contrast with the Slammer example
[64], (i) are relatively stealthy, not exhibiting large changes in network traffic volume at least during the acquisition phase, (ii) are not distinguished from legitimate
traffic by specific header field values, and hence (iii) blend into the background of
legitimate traffic at the flow level. Examples include installation of malware such
as keystroke loggers, or the acquisition and subsequent control of zombie hosts
in botnets.
To detect and mitigate these and other attacks, packet inspection is a powerful
tool to enable matching against known signatures of malware, including viruses,
worms, trojans, and botnets. Indeed, a sizable proportion of the attack detection signatures commonly used in the public domain Snort packet inspection system [74]
match only on the packet payload rather than the header.
Similarly to Section 10.6.3.1, a network security tool may operate purely passively in order to gain information about unwanted traffic, or may be coupled to
filtering functions that block specific flows of traffic (see Chapter 13 for further
details).

10.6.3.3 Debugging for Software, Protocols, and Customer Support
Both networking hardware and software that implement services can contain subtle
dependencies and display unexpected behavior that, despite pre-deployment testing, only becomes evident in the live network. DPI permits network operators to
monitor, evaluate, and correct such problems. To troubleshoot specific network
or service layer issues, DPI devices could be deployed at a concentration point
where specific protocol exchanges or application-layer transactions can be monitored for correctness. Operators might also use portable DPI devices, which would
allow them to deploy devices in specific locations to investigate suspected hardware or
software bugs. Similarly, DPI enables technicians to assist customers in debugging
customer equipment, and software installations and configurations. This can enable
technicians to rapidly determine the nature of problems related to network transmissions, rather than rely on potentially incomplete knowledge derived from customer
dialogs.

10.7 Active Performance Measurement
This section is concerned with the challenges and design aspects of providing active performance measurement infrastructures for service providers. The four metric
areas of common interest are:
• Connectivity: Can a given host be reached from some set of hosts?
• Loss: What proportion of a set of packets are lost on a path (or paths) between
two hosts? Loss may be considered in an average sense (all packets over some
period of loss) or granular in time (burst loss properties) or space (broken down,
e.g., by customer or application).
• Delay: The network latency over a path (or paths) between two hosts, viewed at
the same granularity as for loss measurements.
• Throughput: Bytes or packets successfully transmitted between two hosts,
potentially broken down by application or protocol (e.g., TCP vs. UDP).
Historically, active measurement tools such as ping and traceroute have
long been used to baseline roundtrip loss and delay and map IP paths, either as standalone tools, or integrated into performance measurement systems. Bulk throughput
has been estimated using the treno tool [58], which creates a probe stream that
conforms to the dynamics of TCP. There is a large body of more recent research
work proposing improved measurement methods and analysis (see, e.g., [29]). However, the remainder of this chapter focuses more on the design and
deployment issues for the components of an active measurement and reporting infrastructure of the type increasingly deployed by service providers and enterprise
customers. Specifically:
• Performance Metric Standardization: This is required in order for all parties
involved in the measurement, dissemination, and interpretation of results to
agree on the methods of acquiring performance measurements, and their meaning. Such parties include network service providers, their customers, third-party
measurement service providers, and measurement system vendors. Performance
metric standardization is described in Section 10.8.
• Service Level Agreements: Service providers must offer specific performance
targets to their customers, based upon agreed metrics. Section 10.9 describes
processes for establishing SLAs between service providers and customers.
• Deployment of Active Measurement Infrastructures: Deployment issues for
large-scale active measurement infrastructures are discussed in Section 10.10,
together with some examples of different deployment modes.

10.8 Standardization of IP Performance Metrics
In this section, we give an overview of standardization activities on IP performance
metrics. There are not one, but two standards bodies that provide the authoritative
view of IP network performance and of packet performance metrics in general.
They are the IETF (primarily the IP Performance Metrics (IPPM) working group),
and the International Telecommunication Union Telecommunication Standardization Sector
Study Group 12 (ITU-T SG 12, specifically the Packet Network Performance
Question 17). Although there are some differences in the approaches and the metric
specifications between these two bodies, they are relatively minor.
The critical advantage of using standardized metrics is the same as for any good
standard: the metrics can be implemented from unambiguous specifications, which
ensure that two measurement devices will work the same way. They will assign
timestamps at the same defined instants when a packet appears at the measurement
point (such as first bit in, or the last bit out). They will use a waiting time to distinguish between packets with long delays and packets that do not arrive (because
one cannot wait forever to report results, and for many applications a packet with
extremely long delay is as good as lost). They will perform statistical summary calculations the same way, and when presented with identical network conditions to
measure, they will produce the same results.
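
The role of the waiting time can be sketched as follows: a test packet that arrives after the waiting time is treated exactly like one that never arrives. The function name, the data layout (sequence number to timestamp, in seconds), and the choice of summary statistics are assumptions of this sketch rather than part of any standard.

```python
def summarize(send_times, recv_times, waiting_time):
    """Classify test packets as delivered or lost and summarize the delivered ones.

    send_times: dict seq -> send timestamp
    recv_times: dict seq -> receive timestamp (seq absent if the packet was never seen)
    waiting_time: delay beyond which a packet is declared lost
    Returns (loss_ratio, min_delay, mean_delay).
    """
    if not send_times:
        return 0.0, None, None
    delays = []
    lost = 0
    for seq, sent in send_times.items():
        received = recv_times.get(seq)
        if received is None or received - sent > waiting_time:
            lost += 1                      # missing, or too late to count as delivered
        else:
            delays.append(received - sent)
    loss_ratio = lost / len(send_times)
    if not delays:
        return loss_ratio, None, None
    return loss_ratio, min(delays), sum(delays) / len(delays)
```

Two devices that apply the same waiting time and the same summary calculations to the same observations return the same triple, which is the property that standardized metrics are meant to guarantee.
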
The ITU-T has defined its IP performance metrics in one primary Recommendation, Y.1540. The general approach is to define basic sections bounded by
measurement points, which are
• Hosts at the source and destination(s)
• Network Sections (composed of routers and links, and usually defined by administrative boundaries)
• Exchange Links (between the other entities)

The next step is to define packet transfer reference events at the various section
boundaries. There are two main types of reference events:
• Entry event to a host, exchange link, or network section
• Exit event from a host, exchange link, or network section

Then, the fundamental outcomes of successful packet transfer and lost packet are
defined, followed by performance parameters that can be calculated on a flow of
packets (referred to using the convention “population of interest”). ITU-T’s metrics
are useful in either active or passive measurement, and do not specify sampling
methods.
The IETF began work on network performance metrics in the mid-1990s, by first
developing a comprehensive framework for active measurement [70]. The framework RFC established many important conventions and notions, including:
• The expanded use of the metric definition template developed in earlier IETF
work on Benchmarking network devices [6].
• The general concept of “packets of Type-P” to reflect the possibility that packets
of different types would experience different treatment, and hence different performance,
as they traverse the path. A complete specification of Type-P and the source
and destination addresses are usually equivalent to the ITU-T’s “population of
interest”.
• The notion of “wiretime”, which recognizes that physical devices are needed to
observe packets at the IP-layer, and these devices may contribute to the observed
performance as a source of error. Other important time-related considerations are
detailed, too.
• The hierarchy of singletons (“atomic” results), samples (sets of singletons), and
statistics (calculations on samples).
A series of RFCs followed over the next decade, one for each fundamental metric
that was identified. The IETF wisely put the various metric RFCs (e.g., RFC 2679 [2]
and RFC 2680 [3]) on the Standards Track, so that the implementations could be
compared with the specifications and used to improve their quality (and narrow down some of the flexibility) over time. RFC 2330 [70] and RFC 3432 [72] specify
Poisson and Periodic sampling, respectively. Throughput-related definitions are in
RFC 5136 [12].
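
The difference between the two sampling disciplines can be sketched by generating send times for a test stream. The function names and parameters are hypothetical, and the random start offset of the periodic stream is an assumption of this sketch; the RFCs should be consulted for the normative definitions.

```python
import random

def poisson_schedule(rate, duration, rng):
    """Send times whose gaps are independent and exponential with mean 1/rate."""
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def periodic_schedule(interval, duration, rng):
    """Send times at a fixed interval, starting at a random offset in the first interval."""
    times = []
    t = rng.uniform(0.0, interval)
    while t < duration:
        times.append(t)
        t += interval
    return times
```

A periodic stream is simpler to generate and resembles constant-rate applications, whereas exponentially distributed gaps avoid synchronizing the probes with periodic behavior in the network.
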
One area in which IETF was extremely flexible was its specification for delay
variation, in RFC 3393 [31]. This specification applies to almost any form of delay variation imaginable, and was endowed with this flexibility after considerable
discussion and comparisons between the ITU-T preferred form and other methods
(some of which were adopted in other IETF RFCs). This flexibility was achieved
using the “selection function” concept, which allows the metric designer to compare any pair of packets (as long as each is unambiguously defined from a stream
of packets). Thus, this version of the delay variation specification encouraged practitioners to gain experience with different metric formulations on IP networks, and
facilitated comparison between different forms by establishing a common framework for their definition. A common selection function uses adjacent packets in the
stream, and this is called “Inter-Packet Delay Variation”.
In contrast, the ITU-T Recommendations of the early 1990s (for ATM networks)
used essentially the same form of delay variation metric as in Y.1540 and as used
today in Recommendations for the latest networking technologies. It is called the
“2-point Packet Delay Variation” metric. This metric defines delay variation as the
difference between a packet’s one-way delay and the delay for a single reference
packet. The recommended reference is the packet with the minimum delay in the test
sample, removing propagation from the delay distribution and emphasizing only the
variation. This definition differs significantly from the inter-packet delay variation
definition. Fortunately, an IETF effort has thoroughly investigated the two
main forms of delay variation metrics, and its results provide guidance on the
appropriate form of metric for various tasks [66]. The comparison approach was to
define the key tasks (such as de-jitter buffer size and queuing time estimation) and
challenging measurement circumstances for delay variation measurements (such as
path instability and packet loss), and to examine relevant literature. In summary, the
ITU-T definition of “2-point Packet Delay Variation” was the best match to all tasks
and most circumstances, its only weakness being a requirement for more stable
timing.
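
The two forms of delay variation can be compared on a small sample of one-way delays. The function names are hypothetical, and the sketch assumes a loss-free sample in arrival order with the minimum-delay packet as the reference.

```python
def inter_packet_delay_variation(delays):
    """Selection function on adjacent packets: delay of each packet minus its predecessor."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]

def two_point_pdv(delays):
    """Delay of each packet minus the minimum delay in the sample (the reference packet)."""
    reference = min(delays)
    return [d - reference for d in delays]

delays_ms = [10.0, 12.0, 11.0, 15.0]
print(inter_packet_delay_variation(delays_ms))  # [2.0, -1.0, 4.0]
print(two_point_pdv(delays_ms))                 # [0.0, 2.0, 1.0, 5.0]
```

The inter-packet form produces signed values that depend on packet ordering, while the 2-point form is non-negative and its largest value is the spread of the delay distribution above the minimum.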

10.9 Performance Metrics in Service-Level Agreements
In this section, we discuss Service-Level Agreements (SLAs) and how the key
metrics defined above contribute to a successful relationship between customers
and their service providers.

10.9.1 Definition of a Service-Level Agreement (SLA)
For our purposes, we define a Service-Level Agreement as:
A binding contract between Customer and Service Provider that identifies all important aspects of the service being delivered, constrains those aspects to a satisfactory performance
level which can be objectively verified, and describes the method and format of the verification report.

This definition makes the SLA-supporting role and design of active measurement
systems quite clear. The measurement system must assess the service on each of the
agreed aspects (metrics) according to the agreed reporting schedule and determine
whether the performance thresholds have been met. The details of the SLA may
even specify the points where the active measurement system will be connected to
the network, the sending characteristics of the synthetic packets dedicated for verification testing, and the confidence interval beyond which the results conclusively
indicate that the threshold was met/not met.
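
One way to read the role of the confidence interval is sketched below: the verdict is conclusive only when the whole interval lies on one side of the threshold. The metric (mean delay), the normal-approximation interval, the default z value, and the function name are all assumptions of this sketch, not terms of any actual SLA.

```python
import math
import statistics

def sla_verdict(delays_ms, threshold_ms, z=1.96):
    """Compare a mean-delay estimate with an SLA threshold.

    Requires at least two measurements. Returns "met", "not met", or "inconclusive".
    """
    mean = statistics.fmean(delays_ms)
    # Half-width of an approximate confidence interval for the mean.
    half_width = z * statistics.stdev(delays_ms) / math.sqrt(len(delays_ms))
    if mean + half_width < threshold_ms:
        return "met"
    if mean - half_width > threshold_ms:
        return "not met"
    return "inconclusive"
```

An "inconclusive" outcome suggests that more test packets or a longer reporting interval would be needed before the report can state that the threshold was met or not met.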

10.9.2 Process to Develop the Elements of an SLA
This section describes a process to develop the critical performance aspects of an
SLA. Typically, a network operator establishes a standard set of SLAs for a network
service by conducting this process internally, using a surrogate for the customer.
The specific details of the SLA may differ for different services, e.g., an enterprise
Internet access service might have a different SLA from a premium VPN service. An
SLA might specify performance metrics such as data delivery (the complement of packet
loss), site-to-site latency by region or location, delay variation or jitter, availability,
etc. as well as a number of nonperformance metrics such as provisioning intervals.
There are also cases in which a network operator may develop a customized SLA
for a particular customer (e.g., because the size of their network or other special
circumstances demand it). The process that a service provider and the customer
would go through to develop a customized SLA illustrates the issues that need to be
addressed when developing an SLA. We present an example of such a process here.
In principle, the SLA represents a common language between the customer and
service provider. The process involves collection of requirements and a meeting of
peers to compare the view from each side of the network boundaries. One set of
steps to create agreeable requirements is given below.
1. The customer identifies the locations where connectivity to the communications
service is required (Customer–Service Interfaces), and the service provider compares the location list with available services.
2. The customer and service provider agree on the performance metrics that will
be the basis for the SLA. For example, a managed IP network provides a very
basic service – packet transfer from source to destination. The SLA is based
on packet transfer performance metrics, such as delay, delay variation, and loss
ratio. If higher-layer functions are also provided (e.g., domain name to address
resolution), then additional metrics can be included.
3. The customer must determine exactly how they plan to use a communications
network to conduct business, and express the needs of their applications in terms
of the packet performance metrics. The performance requirements may be derived from analysis of the component protocols of each customer application,
from tests with simulated packet transfer impairments, or from prior experience.
Sometimes, the service provider will consult on the application modeling.
4. In parallel, the service provider collects (or estimates) the levels of packet transfer performance that can be delivered between geographically dispersed service
interfaces. Active measurements often serve this aspect of the process, by revealing the network performance possible under current conditions.
5. When the customer and service provider meet again, the requested and feasible
performance levels for all of the performance metrics are compared. Where the
requested performance levels cannot be met, revised network designs or a plan
to achieve interim and long-term objectives in combination with deployment of
new infrastructure may be developed, or the customer may relax specific requirements, or a combination of the two.
6. Once the performance levels of the SLA are agreed upon, it remains to decide
on the formal reporting intervals and how the customer might access the ongoing
measurement results. This aspect is important because formal reporting intervals
are often quite long, on the order of a month.
7. If the customer needs up-to-date performance status to aid in their troubleshooting process, then monthly reports might be augmented with the ability to view
a customized report of recent measurements. The active measurement system
would communicate measured results on a frequent basis to support this monitoring function, as well as longer-term SLA reports.
There are several process complexities worth mentioning. First, the customer
may be able to easily determine the performance requirements for a single application flow, but the service provider’s measurements will likely be based on a test
flow, which experiences the same treatment as the rest of the flows. The test packet
flow may not have the same sending characteristics as customer flows, and will certainly represent only a small fraction of the aggregate traffic. Thus, the active test
flow performance will represent the customer flow performance only on a long-term
basis. Second, active measurements of throughput may have a negative effect on live
traffic while they are in progress. As a result, the throughput metric may be specified
through other means, such as the information rate of the access link on each service
interface, and not formally verified through active measurement.

10.10 Deployment of Active Measurement Infrastructures
In this section, we describe several ways in which active measurement systems can
be realized. One of the key design distinctions is the measurement device topology.
We describe and contrast several of the topologies that have seen deployment, as
this will be an important consideration for any system the reader might devise. We
categorize the topologies according to where the devices conducting measurements
are physically located.

10.10.1 Geographic Deployment at Customer–Service Interfaces
In this topology, measurement devices (or measurement processes in multipurpose
devices) are located as close as possible to the service interfaces. Figure 10.4a
depicts this topology for a point-to-point service, with a Measurement Point (MP)
at each end of the path. The Cisco Systems IP SLA™ product embeds an active
measurement system at routers and switches that often resides in close proximity
to the Customer–Service Interfaces. The measurement results can be collected by
accessing specific MIB modules using SNMP. The utility of IP SLA™ capabilities
was recognized for multi-vendor scenarios, and the Two-way Active Measurement
Protocol (TWAMP) [46] standardizes a fundamental test control and operation
capability.

Fig. 10.4 Deployment scenarios for active measurement infrastructure. MP = measurement point.
(a) MP at ends of path in point-to-point service. (b) MP at network edge; no coverage of access
links. (c) MP at central location with connectivity to remote locations

The primary advantage of this topology is that the measurement path covers the
entire service in a single measurement, so the active test packets will experience conditions very similar to customer traffic. However, the measurement device/process
must be located at a remote (customer) site to provide such coverage, so its cost
is not shared across multiple services and it must be managed (and have its results collected) remotely. The scale of the measurement system is also an issue. The number of paths in a full mesh
of two-way active measurements grows quadratically with the number of nodes, $N$,
according to $N(N-1)/2$.
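To make the scaling concern concrete, the following minimal sketch tabulates the number of MP pairs in a full mesh. The function name is illustrative and not taken from any measurement product.

```python
def full_mesh_pairs(n: int) -> int:
    """Number of unordered measurement-point pairs in a full mesh of n nodes."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


# A tenfold increase in nodes gives roughly a hundredfold increase in pairs.
for n in (10, 100, 1000):
    print(n, full_mesh_pairs(n))
```

Even at a few hundred nodes the pair count runs into the tens of thousands, which is one reason the topologies in the following sections share measurement points across paths.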

10.10.2 Geographic Deployment at Network Edges
In Fig. 10.4b, the MPs move to intermediate nodes along the point-to-point path,
at the edge of the network providing service. In this scenario, the
access links may not be covered by the measurements or the SLAs. We also show a
third MP within the network cloud, which can be used to divide the path into segments. This topology makes it possible to share the measurement devices and the
measurements they produce with overlapping paths that support different services,
different customers, or parts of other point-to-point paths for the same customer. Of
course, a process is needed to combine the results of segment measurements to estimate the edge-to-edge performance, and this problem has been addressed in
[51, 65, 67]. The key points to note are the following:
• The interesting cases are those where impairments are time-varying; thus we expect
to estimate features of time distributions, and not specific values (singletons)
at particular times.
• Some performance metric statistics lend themselves to combination, such as
means and ratios, so these should be selected for measurement and SLAs. For
example, measurements of the minimum delay of path segments can usually be
taken as additive when estimating the complete path performance. Average one-way
delay is also additive, but somewhat more prone to estimation errors when
the segment distributions are bimodal or have wide variance (a long tail).
• There must be a reasonable case made that (for each metric used) performance
on one path segment will be independent of the other, because correlation causes
the estimation methods to fail. An obvious correlation example is any metric
that evaluates packet spacing differences: the measurement is dependent on the
original spacing, and that spacing will change when there is any delay variation
present on the path segments.
We note that it is also possible to obtain complete path coverage using this topology, with assistance from low-cost test reflector devices/processes located at the
service interfaces, such as those described in RFC 5357 [46]; see [13] for more
details.
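The combination of segment statistics discussed above can be sketched as follows. This is only an illustration under the stated independence assumption, not the procedure specified in [51, 65, 67]; the function names are hypothetical. Minimum and mean delays are summed, and loss ratios are combined by multiplying per-segment delivery probabilities.

```python
from math import prod


def compose_min_delay(segment_min_delays_ms):
    """Minimum delays of consecutive segments are taken as additive."""
    return sum(segment_min_delays_ms)


def compose_mean_delay(segment_mean_delays_ms):
    """Mean one-way delays are additive across consecutive segments."""
    return sum(segment_mean_delays_ms)


def compose_loss_ratio(segment_loss_ratios):
    """Assumes independent loss per segment: a packet survives the path
    only if it survives every segment."""
    return 1.0 - prod(1.0 - p for p in segment_loss_ratios)


print(compose_min_delay([2.0, 5.5, 1.5]))
print(compose_loss_ratio([0.01, 0.02]))
```

If segment losses are correlated, the product form no longer holds, which is the failure mode the third bullet warns about.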

10.10.3 Centralized Deployment with Remote Connectivity
As an alternative to remote deployment of measurement devices/processes, Fig. 10.4c
shows all MPs moved to a central location with connectivity to strategic locations
in the network (such as the network edges in key cities). This topology offers the
advantage of easy access to the measurement devices at the central location, thus affording rapid reconfiguration and upgrade. However, reliable remote access links are
needed between this single location and every network node that requires testing.
Also, even if the remote access links are transparent from a packet loss perspective,
they will still introduce delay that is not present on the customer’s path through the
network. The cost of the remote access links alone may make the remote device deployment of Fig. 10.4b more attractive. Thus, topologies like this have been deployed
for remote connectivity monitors when the devices implementing a network technology do not have sufficient native support for remote device deployment (e.g.,
Frame Relay networks).
A system exploiting this approach is described in [8] where tunneling is used
to steer measurement packets on round-trip paths from a central host, via the access links. In this sense, virtual measurements are conducted between different pairs
of hosts in the network core. A related approach for multicast VPN monitoring is
described in [7].

10.10.4 Collection for Infrastructure Measurements
When measurement devices are geographically dispersed, there must be a means to
collect the results of measurements and make them available for monitoring, reporting, and SLA compliance verification. This requires some form of protocol to fetch
either the per-packet measurements, or the processed and summarized results for
each intermediate measurement interval (e.g., 5–15 min). Once the measurement
results have been collected at a central point, they should be stored in a database
system and made available for on-going display, detailed analysis, and SLA verification/reporting.
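As an illustration of the summarization step, the sketch below reduces per-packet results to one record per measurement interval before collection. The record layout (send time in seconds, delay in milliseconds, `None` for a lost probe) and the 5-min interval are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import mean

INTERVAL_S = 300  # 5-minute measurement interval (assumed)


def summarize(records, interval_s=INTERVAL_S):
    """records: iterable of (send_time_s, delay_ms), delay_ms is None if lost.
    Returns {interval_start_s: summary dict}."""
    bins = defaultdict(list)
    for send_time, delay in records:
        start = int(send_time // interval_s) * interval_s
        bins[start].append(delay)

    summary = {}
    for start, delays in sorted(bins.items()):
        received = [d for d in delays if d is not None]
        summary[start] = {
            "sent": len(delays),
            "loss_ratio": 1 - len(received) / len(delays),
            "min_delay_ms": min(received) if received else None,
            "mean_delay_ms": mean(received) if received else None,
        }
    return summary


records = [(0, 10.0), (100, None), (299, 14.0), (300, 20.0)]
for start, row in summarize(records).items():
    print(start, row)
```

Shipping only these summaries keeps collection traffic small; fetching per-packet records instead preserves the detail needed for later reanalysis at the cost of volume.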

10.10.5 Other Types of Infrastructure Measurements
10.10.5.1 Independent Measurement Networks
Measurement service vendors, such as Keynote [57], station measurement devices
at ISP locations that represent, e.g., typical customer access points, and conduct a
variety of measurements between measurement devices or between them and service
hosts, including web and other server response times, access bandwidth, VoIP, and
other access performance. Comparative performance measures are published and
detailed results are made available through subscription.

10.10.5.2 Cross Provider and Network-Wide Measurements
End-to-end paths commonly traverse multiple service providers. Thus, it is natural to measure the inter-provider components of performance. The most prominent
example is the RIPE network [73], which has stationed measurement devices in a
number of participating ISPs, conducts performance measurements between them,
and disseminates selected views to the participants. Novel active measurement infrastructure is being deployed in advanced research and development networks (e.g.,
MeasurementLab/PlanetLab [61]), including work in developing architectures for
managing access to and data recovery from measurement infrastructures.

10.10.5.3 Performance Measurement and Route Selection
Router measurement capabilities may also be coupled to the operation of routing protocols themselves. Cisco Performance Routing [20] enables routers in a
multiply-homed domain to conduct performance measurements to external networks. The measurements are then compared in order to determine the best egress
to that network and adjust route parameters accordingly.
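The comparison step can be pictured with a small sketch. This is not Cisco's selection policy, which is not described here; it assumes a hypothetical policy that prefers the lowest loss ratio and breaks ties on delay, and the egress names are invented.

```python
def best_egress(measurements):
    """measurements: {egress_name: (loss_ratio, delay_ms)}.
    Tuples compare lexicographically, so loss is compared first, then delay."""
    return min(measurements, key=lambda egress: measurements[egress])


measured = {
    "egress-A": (0.01, 20.0),
    "egress-B": (0.0, 35.0),
    "egress-C": (0.0, 30.0),
}
print(best_egress(measured))
```

A production system would also need to damp route changes so that small measurement fluctuations do not cause the chosen egress to oscillate.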

10.11 Outlook
The challenges described in Section 10.1.3 will grow with network size and complexity. The fundamental challenge for passive measurement, that of large data
volumes caused by network scale and speed, is usually addressed by sampling.
Going forward, there are three related trade-offs for the measurement infrastructure.
Unless the capacity of the measurement infrastructure grows commensurately with
network speed and scale, sampling rates must decrease in order to
fit the measurements within the current infrastructure. But decreasing sampling rates
reduces the ability to provide an accurate fine-grained view of the traffic. Although loss
of detail and accuracy can be ameliorated by aggregation, that would go against the
increasing demand for detailed measurements differentiated by customer, application, and service class. On the other hand, growing the infrastructure and retaining
current sampling rates presents its own challenges, and not just in equipment and
administration costs. Distributed measurement architectures are an attractive way
to manage scale, enabling local analysis and aggregation rather than requiring recovery of data to a single central point. Then, the challenge becomes the design
of distributed analysis and efficient communication methods between components
of the measurement infrastructure. This is particularly challenging for network security
applications, which need a network-wide view in order to identify stealthy unwanted
traffic.
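The trade-off between sampling rate and accuracy can be quantified with a simple sketch. It assumes independent packet sampling with probability p and the inverse-probability (Horvitz-Thompson [47]) estimate of a packet count; the function names are illustrative. Under that assumption, the sampled count is binomial and the relative standard error of the estimate is sqrt((1 - p) / (p n)) for a true count of n packets.

```python
from math import sqrt


def estimate_total(sampled_count, p):
    """Inverse-probability estimate of the true packet count."""
    return sampled_count / p


def relative_std_error(true_total, p):
    """Relative standard error under independent sampling with probability p."""
    return sqrt((1 - p) / (p * true_total))


# The same 10,000-packet aggregate viewed at two sampling rates.
for p in (0.01, 0.001):
    print(p, round(relative_std_error(10_000, p), 3))
```

The error grows as either the sampling rate or the size of the aggregate shrinks, which is why fine-grained views (per customer, per application) suffer first when sampling rates are reduced.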
Active measurement presents analogous challenges in viewing network performance differentiated by, e.g., customer, application, traffic path, and network
element. Aggregate performance measurements are no longer sufficient. There are a
number of approaches to targeting measurements on particular paths: (i) the
prober may craft the probe packet so that network elements forward it on the
desired path, the approach taken in [7, 8]; or (ii) customer
traffic may be measured passively and directly, e.g., by comparing timestamps between different points on the path
to determine latency (see Section 10.5.2). Both these approaches require knowledge
of the mapping between the entity to be measured (e.g., customer, service
class) and the observable parts of the packets. A challenge is that this mapping may
be difficult to elucidate, or may depend on network state that can become unstable precisely at the time a performance problem needs to be diagnosed.
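Approach (ii) can be sketched as a join of two passive observation logs. The sketch assumes synchronized clocks at the two observation points and a hypothetical packet identifier (for instance, a hash of invariant packet content) that is computed identically at both points; neither detail is specified in the text above.

```python
def one_way_latencies(upstream, downstream):
    """upstream, downstream: {packet_id: timestamp_ms} at two points on a path.
    Returns {packet_id: latency_ms} for packets observed at both points."""
    return {
        pid: downstream[pid] - t_up
        for pid, t_up in upstream.items()
        if pid in downstream
    }


upstream = {"a": 1000, "b": 1010, "c": 1020}
downstream = {"a": 1012, "c": 1035}
print(one_way_latencies(upstream, downstream))
```

A packet seen only upstream (here "b") may have been lost, or may simply not have been selected downstream, so consistent packet selection at both points matters for interpreting the gaps.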
Tomographic methods have been proposed to infer performance on links from
performance on sets of measured paths that traverse them (see [1, 11]), typically
under simplifying independence assumptions concerning packet loss, latency, and
link failure. These approaches aim to supply indirectly those performance measurements
that are not available directly. It remains a challenge to bring the early promise of
these methods to fruition in production-level tools under general network conditions
(see, e.g., [48]). The relative utility of performance tomographic approaches will depend on the extent to which detailed network performance measurements can be
provided directly by router-based measurements in the future.
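A minimal instance of the tomographic idea is a two-receiver multicast tree: one shared link feeding two branch links. Assuming independent loss on each link, the probability that receiver 1 gets a probe is a·b1, that receiver 2 gets it is a·b2, and that both get it is a·b1·b2, where a is the pass rate of the shared link; hence a = g1·g2 / g12. The sketch below applies this to observed outcomes; the function name and data layout are illustrative.

```python
def shared_link_pass_rate(outcomes):
    """outcomes: list of (received_at_1, received_at_2) booleans, one per probe.
    Returns the estimated pass rate of the shared link."""
    n = len(outcomes)
    g1 = sum(r1 for r1, _ in outcomes) / n           # fraction seen at receiver 1
    g2 = sum(r2 for _, r2 in outcomes) / n           # fraction seen at receiver 2
    g12 = sum(r1 and r2 for r1, r2 in outcomes) / n  # fraction seen at both
    if g12 == 0:
        raise ValueError("no probe reached both receivers; estimate undefined")
    return g1 * g2 / g12


# 100 probes: 36 reach both, 36 only receiver 1, 4 only receiver 2, 24 neither.
outcomes = ([(True, True)] * 36 + [(True, False)] * 36 +
            [(False, True)] * 4 + [(False, False)] * 24)
print(shared_link_pass_rate(outcomes))
```

The estimate relies entirely on the independence assumption; correlated loss on the two branches would bias it, which is part of the difficulty in moving such methods to general network conditions.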
This outlook stands in contrast to the state described in the opening section,
where little measurement functionality was provided in the network infrastructure.
As the best ideas in measurement research and development mature into standard
equipment features, the challenge will be to manage the complexity and scale of the
infrastructure and the data itself.

References
1. Adams, A., Bu, T., Caceres, R., Duffield, N., Friedman, T., Horowitz, J., Lo Presti, F., Moon,
S. B., Paxson, V., & Towsley, D. (2000). The use of end-to-end multicast measurements for
characterizing internal network behavior. IEEE Communications Magazine, May 2000, 38(5),
152–159.
2. Almes, G., Kalidindi, S., & Zekauskas, M. (1999). A one-way delay metric for IPPM. RFC
2679, September 1999.

3. Almes, G., Kalidindi, S., & Zekauskas, M. (1999). A one-way packet loss metric for IPPM.
RFC 2680, September 1999.
4. Alon, N., Duffield, N., Lund, C., & Thorup, M. (2005). Estimating arbitrary subset sums
with few probes. In Proceedings of 24th ACM Symposium on Principles of Database Systems
(PODS) (pp. 317–325). Baltimore, MD, June 13–16, 2005.
5. AT&T Labs. Application traffic analyzer. http://www.research.att.com/viewProject.cfm?
prjID=125.
6. Bradner, S. (1991). Benchmarking terminology for network interconnection devices. RFC
1242, July 1991.
7. Breslau, L., Chase, C., Duffield, N., Fenner, B., Mao, Y., & Sen, S. (2006). Vmscope: a virtual multicast vpn performance monitor. In INM ’06: Proceedings of the 2006 SIGCOMM
Workshop on Internet Network Management (pp. 59–64). New York, NY, USA: ACM.
8. Burch, H., & Chase, C. (2005). Monitoring link delays with one measurement host. SIGMETRICS Performance Evaluation Review, 33(3):10–17.
9. CAIDA. The CAIDA anonymized 2009 internet traces dataset. http://www.caida.org/data/passive/passive_2009_dataset.xml.
10. CAIDA. cflowd: Traffic flow analysis tool. http://www.caida.org/tools/measurement/cflowd/.
11. Castro, R., Coates, M., Liang, G., Nowak, R., & Yu, B. (2004). Network tomography: recent
developments. Statistical Science, 19, 499–517.
12. Chimento, P., & Ishac, J. (2008). Defining network capacity. RFC 5136, February 2008.
13. Ciavattone, L., Morton, A., & Ramachandran, G. (2003). Standardized active measurements
on a tier 1 IP backbone. IEEE Communications Magazine, pp. 90–97, June 2003.
14. Cisco Systems. Cisco IOS Flexible NetFlow. http://www.cisco.com/web/go/fnf.
15. Cisco Systems. Cisco NetFlow Collector Engine. http://www.cisco.com/en/US/products/sw/
netmgtsw/ps1964/.
16. Cisco Systems. Delivering the next generation data center. http://www.cisco.com/en/US/
products/ps9402/.
17. Cisco Systems. IOS switching services configuration guide. http://www.cisco.com/en/US/docs/ios/12_1/switch/configuration/guide/xcdipsp.html.
18. Cisco Systems. NetFlow. http://www.cisco.com/warp/public/732/netflow/index.html.
19. Cisco Systems. Optimizing application traffic with cisco service control technology. http://
www.cisco.com/go/servicecontrol.
20. Cisco Systems. Performance Routing. http://www.cisco.com/web/go/pfr/.
21. Cisco Systems. Random Sampled NetFlow. http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/nfstatsa.html.
22. Claffy, K. C., Braun, H.-W., & Polyzos, G. C. (1995). Parameterizable methodology for
internet traffic flow profiling. IEEE Journal on Selected Areas in Communications, 13(8),
1481–1494, October 1995.
23. Claise, B. (2004). Cisco Systems NetFlow Services Export Version 9. RFC 3954, October
2004.
24. Claise, B., Johnson, A., & Quittek, J. (2009). Packet sampling (psamp) protocol specifications.
RFC 5476, March 2009.
25. Claise, B., & Wolter, R. (2007). Network management: accounting and performance strategies.
Cisco.
26. Cohen, E., Duffield, N., Lund, C., & Thorup, M. (2008). Confident estimation for multistage
measurement sampling and aggregation. In ACM SIGMETRICS. June 2–6, 2008, Maryland,
USA: Annapolis.
27. Cohen, E., Duffield, N. G., Kaplan, H., Lund, C., & Thorup, M. (2007). Algorithms and estimators for accurate summarization of internet traffic. In IMC ’07: Proceedings of the 7th ACM
SIGCOMM Conference on Internet Measurement (pp. 265–278). New York, NY, USA: ACM.
28. Cranor, C., Johnson, T., Spataschek, O., & Shkapenyuk, V. (2003). Gigascope: a stream
database for network applications. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD
International Conference on Management of Data (pp. 647–651). New York, NY, USA: ACM.
29. Crovella, M., & Krishnamurthy, B. (2006). Internet measurement: infrastructure, traffic and
applications. New York, NY: Wiley.

30. Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. New York, NY,
USA: Wiley.
31. Demichelis, C., & Chimento, P. (2002). Ip packet delay variation metric for ip performance
metrics (ippm). RFC 3393, November 2002.
32. Dietz, T., Claise, B., Aitken, P., Dressler, F., & Carle, G. (2009). Information model for packet
sampling export. RFC 5477, March 2009.
33. Duffield, N.G., Claise, B., Chiou, D., Greenberg, A., Grossglauser, M., & Rexford, J. (2009).
A framework for packet selection and reporting. RFC 5474, March 2009.
34. Duffield, N.G., Gerber, A., & Grossglauser, M. (2002). Trajectory engine: A backend for
trajectory sampling. In IEEE Network Operations and Management Symposium (NOMS) 2002.
Florence, Italy, 15–19 April 2002.
35. Duffield, N.G., Lund, C., & Thorup, M. (2001). Charging from sampled network usage. In
Proceedings of 1st ACM SIGCOMM Internet Measurement Workshop (IMW) (pp. 245–256).
San Francisco, CA, November 1–2, 2001.
36. Duffield, N.G., Lund, C., & Thorup, M. (2005). Learn more, sample less: control of volume
and variance in network measurements. IEEE Transactions on Information Theory, 51(5),
1756–1775.
37. Duffield, N.G., Lund, C., & Thorup, M. (2007). Priority sampling for estimation of
arbitrary subset sums. Journal of ACM, 54(6), Article 32, December 2007. Announced at
SIGMETRICS’04.
38. Duffield, N., & Grossglauser, M. (2001). Trajectory sampling for direct traffic observation.
IEEE/ACM Transactions on Networking, 9(3), 280–292, June 2001.
39. Duffield, N., & Lund, C. (2003). Predicting resource usage and estimation accuracy in
an IP flow measurement collection infrastructure. In Proceedings of Internet Measurement
Conference. Miami, FL, October 27–29, 2003.
40. Estan, C., Keys, K., Moore, D., & Varghese, G. (2004). Building a better netflow. In Proceedings of the ACM SIGCOMM 04. New York, NY, 12–16 June 2004.
41. Estan, C., & Varghese, G. (2002). New directions in traffic measurement and accounting.
In Proceedings of ACM SIGCOMM ’2002. Pittsburgh, PA, August 2002.
42. Feldmann, A., Rexford, J., & Cáceres, R. (1998). Efficient policies for carrying web traffic
over flow-switched networks. IEEE/ACM Transactions on Networking, 6(6), 673–685,
December 1998.
43. Goldberg, S., & Rexford, J. (2007). Security vulnerabilities and solutions for packet sampling.
In IEEE Sarnoff Symposium. Princeton, NJ, May 2007.
44. Greer, R. (1999). Daytona and the fourth-generation language cymbal. In SIGMOD ’99:
Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data
(pp. 525–526). New York, NY, USA: ACM.
45. Hao, F., Kodialam, M., & Lakshman, T.V. (2004). Accel-rate: a faster mechanism for memory efficient per-flow traffic estimation. In SIGMETRICS ’04/Performance ’04: Proceedings
of the Joint International Conference on Measurement and Modeling of Computer Systems
(pp. 155–166). New York, NY, USA: ACM.
46. Hedayat, K., Krzanowski, R., Morton, A., Yum, K., & Babiarz, J. (2008). A two-way active
measurement protocol (twamp). RFC 5357, October 2008.
47. Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement
from a finite universe. Journal of the American Statistical Association, 47(260), 663–685.
48. Huang, Y., Feamster, N., & Teixeira, R. (2008). Practical issues with using network tomography for fault diagnosis. SIGCOMM Computer Communication Review, 38(5), 53–58.
49. IETF. IP Flow Information Export (ipfix) charter. http://www.ietf.org/html.charters/ipfix-charter.html. Version of 16 December 2008.
50. Internet Assigned Numbers Authority. Port numbers. http://www.iana.org/assignments/port-numbers.
51. ITU-T Recommendation Y.1540. Network performance objectives for IP-based services,
February 2006.
52. Jacobson, V., Leres, C., & McCanne, S. tcpdump.

53. Jacobson, V. Traceroute. ftp://ftp.ee.lbl.gov/traceroute.tar.gz.
54. Jain, R., & Routhier, S. (1986). Packet trains – measurements and a new model for computer
network traffic. IEEE Journal on Selected Areas in Communications, 4(6), 986–995, September 1986.
55. Juniper Networks. Junose 8.2.x ip services configuration guide: Configuring j-flow
statistics. http://www.juniper.net/techpubs/software/erx/junose82/swconfig-ip-services/html/ip-jflow-stats-config.html.
56. Kent, S., & Atkinson, R. (1998). Security architecture for the Internet Protocol. RFC 2401,
November 1998.
57. Keynote Systems. http://www.keynote.com.
58. Mathis, M., & Mahdavi, J. (1996). Diagnosing internet congestion with a transport layer performance tool. In Proceedings of INET 96. Montreal, Quebec, 24–28 June 1996.
59. McCloghrie, K., & Kastenholz, F. The interfaces group mib. RFC 2863, June 2000.
60. McCloghrie, K., & Rose, M. (1991). Management Information Base for Network Management of TCP/IP-based internets: MIB-II. RFC 1213, available from http://www.
ietf.org/rfc, March 1991.
61. MeasurementLab. http://www.measurementlab.net/.
62. Mills, C., Hirsh, D., & Ruth, D. (1991). Internet accounting: background. RFC 1272, November
1991.
63. Minshall, G. tcpdpriv. http://ita.ee.lbl.gov/html/contrib/tcpdpriv.html.
64. Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., & Weaver, N. (2003). Inside the
slammer worm. IEEE Security and Privacy, 1(4), 33–39.
65. Morton, A. (2008). Framework for metric composition, June 2009. draft-ietf-ippm-framework-compagg-08 (work in progress).
66. Morton, A., & Claise, B. (2009). Packet delay variation applicability statement. RFC 5481,
March 2009.
67. Morton, A., & Stephan, E. (2008). Spatial composition of metrics, October 2009. draft-ietf-ippm-spatial-composition-10 (work in progress).
68. Narus, Inc. Narusinsight secure suite. http://www.narus.com/products/security.html.
69. Packetdesign. Traffic explorer. http://www.packetdesign.com/products/tex.htm.
70. Paxson, V., Almes, G., Mahdavi, J., & Mathis, M. (1998). Framework for ip performance
metrics. RFC 2330, May 1998.
71. Phaal, P., Panchen, S., & McKee, N. (2001). Inmon corporation’s sflow: A method for monitoring traffic in switched and routed networks. RFC 3176, September 2001. http://www.ietf.
org/rfc/rfc3176.txt.
72. Raisanen, V., Grotefeld, G., & Morton, A. (2002). Network performance measurement with
periodic streams. RFC 3432, November 2002.
73. RIPE. http://www.ripe.net.
74. Roesch, M. (1999). Snort – Lightweight Intrusion Detection for Networks. In Proceedings of
USENIX Lisa ’99, Seattle, WA, November 1999.
75. Sandvine. http://www.sandvine.com/.
76. Srinivasan, C., Viswanathan, A., & Nadeau, T. (2004). Multiprotocol label switching (MPLS)
label switching router (LSR) management information base (MIB). RFC 3813, June 2004.
77. Stallings, W. (1999). SNMP, SNMP v2, SNMP v3, and RMON 1 and 2 (Third Edition). Reading,
MA: Addison-Wesley.
78. Stewart, R., Ramalho, M., Xie, Q., Tuexen, M., & Conrad, P. (2004). Stream control transmission protocol (sctp) partial reliability extension. RFC3758, May 2004.
79. Thorup, M. (2006). Confidence intervals for priority sampling. In Proceedings of ACM SIGMETRICS/Performance 2006 (pp. 252–263) Saint-Malo, France, 26–30 June 2006.
80. van der Merwe, J., Cáceres, R., Chu, Y.-H., & Sreenan, C. (2000). mmdump: a tool for monitoring internet multimedia traffic. SIGCOMM Computer Communication Review, 30(5),
48–59.
81. Waldbusser, S. (2000). Remote network monitoring management information base. RFC 2819,
available from http://www.ietf.org/rfc, May 2000.

82. Zseby, T., Molina, M., Duffield, N.G., Niccolini, S., & Raspall, F. (2009). Sampling and filtering techniques for ip packet selection. RFC 5475, March 2009.
83. Zseby, T., Zander, S., & Carle, G. (2001). Evaluation of building blocks for passive one-way-delay measurements. In Proceedings of Passive and Active Measurement Workshop (PAM
2001). Amsterdam, The Netherlands, 23–24 April 2001.

Chapter 11

Measurements of Control Plane Reliability and Performance
Lee Breslau and Aman Shaikh

11.1 Introduction
The control plane determines how traffic flows through an IP network. It consists
of routers interconnected by links and routing protocols implemented as software
processes running on them. Routers (or more specifically routing protocols) communicate with one another to determine the path that packets take from a source to
a destination. As a result, the reliability and performance of the control plane are critical to the overall performance of applications and services running on the network.
This chapter focuses on how to measure and monitor the reliability and performance
of the control plane of a network.
The original Internet service model supported only unicast delivery. That is, a
packet injected into the network by a source host was intended to be delivered to a
single destination. Multicast, in which a packet is replicated inside the network and
delivered to multiple hosts, was subsequently introduced as a service. While certain
multicast routing protocols leverage unicast routing information, unicast and multicast have very distinct control planes. They are each governed by a different set of
routing protocols, and measurement and monitoring of these protocols consequently
take different forms. Therefore, we cover unicast and multicast control plane monitoring separately in Sections 11.2 and 11.3, respectively.
We start Section 11.2 with a brief overview of how unicast forwarding works, describing different routing protocols and how they work to determine paths between
a source and a destination. We then look at two key components of performance
monitoring: instrumentation of the network for data collection in Section 11.2.2,
and strategies and tools for data analysis in Section 11.2.3. More specifically, the
instrumentation section describes what data we need to collect for route monitoring
along with mechanisms for collecting the data needed. The analysis section focuses
on various techniques and tools that show how the data is used for monitoring the
performance of the control plane. While the focus of the section is on management
and operational aspects, we also describe some of the research enabled by this data
that has played a vital role in enhancing our understanding of the control plane behavior and performance in real life. We follow this up with a description of the
AT&T OSPF Monitor [1] in Section 11.2.4 as a case study of a route monitor in
real life. In Section 11.2.5, we describe control plane monitoring of MPLS, which
has been deployed in service provider networks in the last few years and is a key
enabler of Traffic Engineering (TE) and Fast Re-route (FRR) capabilities, as well as
new services such as VPN and VPLS.

L. Breslau and A. Shaikh
AT&T Labs – Research, Florham Park, NJ, USA
e-mail: breslau@research.att.com; ashaikh@research.att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications,
Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_11,
© Springer-Verlag London Limited 2010

Section 11.3 follows a similar approach in its treatment of multicast. We begin
with a motivation for and historical perspective of the development and deployment
of multicast. In Section 11.3.1, we provide a brief overview of the multicast routing protocols commonly in use today, PIM and MSDP. We then outline some of
the challenges specific to monitoring the multicast control plane in Section 11.3.2.
Section 11.3.3 provides detailed information about multicast monitoring. This includes an overview of early multicast monitoring efforts, a discussion of the information sources available for multicast monitoring, and a discussion of specific
approaches and tools used in multicast monitoring.
At the end of the chapter, in Section 11.4, we provide a brief summary and
avenues for future work.

11.2 Unicast
In this section, we focus on monitoring of unicast routing protocols. We begin by
providing a brief overview of how routers forward unicast packets and the routing
protocols used for determining the forwarding paths before delving into details of
how to monitor these protocols.

11.2.1 Unicast Routing Overview
Let us start with the description of how routing protocols enable the forwarding of
unicast packets in IP networks. With unicast, each packet contains the address of
the destination. When the packet arrives at a router, a table called the Forwarding
Information Base (FIB), also known as the forwarding table, is consulted. This table allows the router to determine the next-hop router for the packet, based on its
destination address. Packets are thus forwarded in a hop-by-hop fashion, requiring
look-ups in the forwarding table of each router hop along its way to the destination.
The forwarding table typically consists of a set of prefixes. Each prefix is represented by an IP address and a mask that specifies how many significant bits of a
destination address need to match the address of the prefix. For example, a prefix
represented as 10.0.0.0/16 would match a destination address whose first 16 bits
are the same as the first 16 bits of 10.0.0.0 (i.e., 10.0). Thus, the address 10.0.0.1
matches this prefix, as do 10.0.0.2 and 10.0.1.1. It is possible, and is often the case,
that more than one prefix in a FIB matches a given (destination) address. In such a
case, the prefix with the longest mask length is used for determining
the next-hop router. For example, if a FIB contains 10.0.0.0/16 and 10.0.0.0/24, and
the destination is 10.0.0.1, prefix 10.0.0.0/24 is used for forwarding the packet even
though both prefixes match the address. For this reason, IP forwarding is said to be based on
longest-prefix matching.
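The lookup just described can be sketched with Python's standard `ipaddress` module. The FIB contents mirror the example in the text; the next-hop names are invented, and the linear scan is for clarity only (routers typically use specialized structures such as tries or dedicated hardware for this lookup).

```python
import ipaddress

# Toy FIB: prefix -> next-hop router (next-hop names are hypothetical).
FIB = {
    ipaddress.ip_network("10.0.0.0/16"): "router-A",
    ipaddress.ip_network("10.0.0.0/24"): "router-B",
}


def lookup(fib, destination):
    """Return the next hop of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in fib if addr in net]
    if not matches:
        return None
    longest = max(matches, key=lambda net: net.prefixlen)
    return fib[longest]


print(lookup(FIB, "10.0.0.1"))   # matches both prefixes; the /24 is used
print(lookup(FIB, "10.0.1.1"))   # matches only the /16
```

Here 10.0.0.1 is forwarded using 10.0.0.0/24, while 10.0.1.1 falls outside the /24 and is forwarded using 10.0.0.0/16.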
Routers run one or more routing protocols to construct their FIBs. Every routing
protocol allows a router to learn the network topology (or some part of it) by exchanging messages with other routers. The topology information is then used by a
router to determine next hops for various prefixes, i.e., the FIB.

Learning Topology Information
Depending on how much topology information each router learns, the routing protocols can be divided into two main classes: distance-vector and link-state.
In a distance-vector routing protocol, at each step every router learns the distance of each adjacent router to every prefix. Every prefix is connected to one or
more routers in the network. The distance from a router to a prefix is the sum of
weights of individual links on the path, where the weight of every link is assigned
in the configuration file of the associated router. A router, upon learning distances
from its neighbors, chooses as its next-hop for a given prefix the neighbor that yields the smallest total distance,
and subsequently propagates its own distance to the prefix (which is equal to the neighbor’s distance plus the weight of its link to the neighbor) to all other neighbors.
When a router comes up, it only knows about its directly connected prefixes (e.g.,
prefixes associated with point-to-point or broadcast links). The router propagates
information about these prefixes to its neighbors, allowing them to determine their
routes to them. The information then spreads further, and ultimately all routers in
the network end up with next-hops for these prefixes. In a similar vein, the newly
booted router also learns about other prefixes from its neighbors, and builds its
entire FIB. The distance-vector protocols essentially implement a distributed version of the Bellman-Ford shortest-path algorithm [2]. RIP [3] is an example of a
distance-vector protocol. EIGRP, a Cisco-proprietary protocol, is another example.
It contains mechanisms (an algorithm called DUAL [4]) to prevent forwarding loops
that can form during network changes, when routers can become inconsistent in
their views of the topology. A subclass of distance-vector protocols, called path-vector protocols, includes the actual path to the destination along with the distance in the updates
sent to neighbors. The inclusion of the path helps in identifying and avoiding potential loops during convergence. BGP [5] is an example of a path-vector
protocol.
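The distance-vector computation can be sketched as a centralized simulation of the Bellman-Ford relaxation. This is a teaching sketch, not the message exchange of RIP or EIGRP: it assumes positive link weights, a static topology, and hypothetical router and prefix names, and it repeats the "neighbor's distance plus link weight" comparison until nothing changes.

```python
INF = float("inf")


def distance_vector(links, attached):
    """links: {(u, v): weight} for each directed link u -> v.
    attached: {prefix: set of routers directly connected to it}.
    Returns {router: {prefix: (distance, next_hop)}}."""
    routers = {r for link in links for r in link}
    for connected in attached.values():
        routers |= connected

    # Initially a router only knows its directly connected prefixes.
    table = {
        r: {p: ((0, None) if r in connected else (INF, None))
            for p, connected in attached.items()}
        for r in routers
    }

    changed = True
    while changed:
        changed = False
        for (u, v), weight in links.items():
            for prefix in attached:
                candidate = table[v][prefix][0] + weight  # via neighbor v
                if candidate < table[u][prefix][0]:
                    table[u][prefix] = (candidate, v)
                    changed = True
    return table


links = {("A", "B"): 1, ("B", "A"): 1, ("B", "C"): 1,
         ("C", "B"): 1, ("A", "C"): 5, ("C", "A"): 5}
routes = distance_vector(links, {"10.1.0.0/16": {"C"}})
print(routes["A"]["10.1.0.0/16"])
```

Router A first learns the direct route of weight 5 through C, then replaces it with the route of weight 2 through B once B's distance has propagated, mirroring how information spreads hop by hop.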
With link-state routing protocols, each router learns the entire network topology. The topology is conceptually a directed graph – each router corresponds to
a node in this graph, and each link between neighboring routers corresponds to a
unidirectional edge. Just as in distance-vector protocols, each link also has an administratively assigned weight associated with it. Using the weighted topology graph,
each router computes a shortest-path tree with itself as the root, and applies the results to compute next-hops for all possible destinations. Routing remains consistent
as long as all the routers have the same view of the topology. The view of the topology is built in a distributed fashion, with each router describing its local connectivity
(i.e., set of links incident on it along with their weights) in a message, and flooding
this message to all routers in the network. OSPF [6] and IS-IS [7] are examples of
link-state protocols.
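The per-router computation in a link-state protocol can be sketched with Dijkstra's algorithm, a common choice for building a shortest-path tree. The sketch assumes non-negative weights and a topology graph already assembled from flooded messages; router names are hypothetical, and it records the first hop on each shortest path, which is what the FIB needs.

```python
import heapq


def spf_next_hops(graph, root):
    """graph: {router: {neighbor: weight}} (directed, non-negative weights).
    Returns {destination: (distance, next_hop_from_root)}."""
    best = {}
    done = set()
    heap = [(0, root, None)]  # (distance, node, first hop on the path)
    while heap:
        dist, node, first_hop = heapq.heappop(heap)
        if node in done:
            continue  # a shorter path to this node was already settled
        done.add(node)
        best[node] = (dist, first_hop)
        for nbr, weight in graph.get(node, {}).items():
            if nbr not in done:
                hop = nbr if node == root else first_hop
                heapq.heappush(heap, (dist + weight, nbr, hop))
    return best


graph = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1},
}
print(spf_next_hops(graph, "A"))
```

Because every router runs the same computation on the same graph, their next-hop choices are mutually consistent, which is the property the text attributes to a shared view of the topology.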

Autonomous Systems (ASes) and Hierarchical Routing
The Internet is a network of networks that are, by design, administered by independent entities. Roughly speaking, such networks are known as Autonomous Systems (ASes). Each autonomous system consists
of a set of routers and links that are usually managed by a single administrative
authority. Every autonomous system can run one or more routing protocols of its
choice to route packets within the system. RIP, EIGRP, OSPF and IS-IS are typically
used for routing packets within an AS and are, therefore, known as intradomain
or Interior Gateway Protocols (IGPs). In addition, a routing protocol is needed to
forward packets between ASes. BGP is used for this purpose and is known as an
interdomain or an Exterior Gateway Protocol (EGP).
Next, we present an overview of BGP and OSPF, since they come up frequently in the
subsequent discussions. For details on other routing protocols, please refer to [8].

11.2.1.1 BGP Overview
As mentioned in Section 11.2.1, BGP is the de facto routing protocol used to exchange routing information between ASes. BGP is a path-vector protocol (a subclass
of distance-vector protocols). In path-vector protocols, a router receives routes from
its neighbors that describe their distance to prefixes, as well as the path used to reach
the prefix in question. Since BGP is used to route packets between ASes, the path is
described as a sequence of ASes traversed along the way to the prefix, the sequence
being known as an ASPath. Thus, every route update received at a router contains
the prefix and the ASPath indicating the path used by the neighbor to reach the prefix. The distance is not explicitly included; rather it implicitly equals the number of
ASes in the ASPath.
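The role of the ASPath can be illustrated with a short sketch. The helper names below are hypothetical; the sketch only assumes the standard path-vector behavior in which a router discards a route whose ASPath already contains its own AS number and prepends its AS number when announcing a route to a neighboring AS.

```python
def implied_distance(as_path):
    """The distance is implicit: the number of ASes in the ASPath."""
    return len(as_path)

def accept_update(local_as, as_path):
    """Reject a route whose ASPath already contains our own AS (a loop)."""
    return local_as not in as_path

def advertise(local_as, as_path):
    """Prepend our AS number before announcing the route to another AS."""
    return [local_as] + list(as_path)

# AS numbers taken from the private range, for illustration only.
received = [65010, 65145]
announced = advertise(65001, received)
```

Here AS 65001 accepts the route, and the path it announces onward has an implied distance of 3. If that announcement were ever to reach AS 65010 again, `accept_update` would reject it.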
Apart from ASPath, BGP routes also contain other attributes. These attributes
are used by a router to determine the most preferred route from all received routes
to a destination prefix. Figure 11.1 shows the steps of a decision process that a

11

Measurements of Control Plane Reliability and Performance

Fig. 11.1 The decision process used by BGP to select the best route to every prefix. Vendor-dependent steps are not included

1. Highest Local Preference
2. Shortest ASPath Length
3. Lowest Origin Type
4. Lowest MED
5. Prefer Closest Egress (based on IGP distance)
6. Arbitrary Tie Breaking

BGP-speaking router follows to select its most preferred route. The process is run
independently for each prefix, and starts with all the available routes for the prefix in
question. At every step, the relevant attributes of the routes are compared. Routes with
the most preferred values pass on to the next step, while the other routes are dropped
from further consideration. At the end of the decision process, a router ends up
with a single route for every prefix, and uses it to forward data traffic. Note that the
second step of the decision process compares the ASPath lengths of the routes that
survived the first step, keeping the ones with the shortest ASPaths and discarding
the rest. We will not go into the details of the other steps except to point out that, if faced
with more than one route in step 5, the router selects the route(s) that minimize the
IGP distance a packet has to travel to exit its AS. This process of preferring
the closest egress is known as hot-potato or closest-egress routing.
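The elimination procedure of Fig. 11.1 can be sketched as a sequence of filters. This is a simplified illustration rather than a complete BGP implementation: the `Route` fields and the numeric encoding of the origin type are assumptions, MED is compared across all candidates for brevity, and the final tie-break (lowest next-hop string) is an arbitrary choice made here so that the result is deterministic.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Route:
    prefix: str
    as_path: List[int]
    local_pref: int
    origin: int        # assumed encoding: lower value is preferred
    med: int
    igp_distance: int  # IGP distance to the egress router
    next_hop: str

def keep_best(routes, key):
    """Keep only the routes with the smallest value of `key`."""
    best = min(key(r) for r in routes)
    return [r for r in routes if key(r) == best]

def select_best(routes):
    """Apply the six steps of Fig. 11.1 to the routes for one prefix."""
    steps = [
        lambda r: -r.local_pref,    # 1. highest local preference
        lambda r: len(r.as_path),   # 2. shortest ASPath
        lambda r: r.origin,         # 3. lowest origin type
        lambda r: r.med,            # 4. lowest MED
        lambda r: r.igp_distance,   # 5. closest egress (hot potato)
        lambda r: r.next_hop,       # 6. arbitrary tie-break
    ]
    candidates = list(routes)       # must be non-empty
    for key in steps:
        candidates = keep_best(candidates, key)
        if len(candidates) == 1:
            break
    return candidates[0]

# Hypothetical candidate routes for a single prefix.
candidates = [
    Route("172.16.3.0/23", [65001, 65010, 65145], 90, 0, 0, 10, "10.0.1.3"),
    Route("172.16.3.0/23", [65001, 65126], 90, 0, 0, 20, "10.0.1.8"),
    Route("172.16.3.0/23", [65001, 65132], 90, 0, 0, 5, "10.0.2.1"),
    Route("172.16.3.0/23", [65001], 80, 0, 0, 1, "10.0.3.1"),
]
best = select_best(candidates)
```

The fourth route has the shortest ASPath and the closest egress, but it is eliminated in step 1 because of its lower local preference. Of the remaining routes, two survive step 2, and step 5 picks the one with the smaller IGP distance.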
A router forms BGP sessions with other routers to exchange route updates. The
two ends of a session can belong either to the same AS or to different ASes. When
the session is formed between routers in the same AS, it is known as an internal BGP (IBGP) session. In contrast, when the routers are in different ASes, the
session is known as an external BGP (EBGP) session. For example, in Fig. 11.2,
which shows multiple interconnected ASes and routers in them, solid lines depict
IBGP sessions, whereas dashed lines represent EBGP sessions. The EBGP sessions
set up between routers in neighboring ASes allow them to exchange routes to various
prefixes. The routes learned over EBGP sessions are then distributed using IBGP
sessions within an AS. For example, AS 2 in Fig. 11.2 learns routes from ASes 1, 3,
and 4 over EBGP sessions, which are then distributed among its routers over IBGP
sessions.
In order to disseminate all routes learned via EBGP to every router, routers inside
an AS like AS 1 need to form a full mesh of IBGP sessions. A router receiving a
route update over an EBGP session propagates it to all other routers in the mesh.
However, route updates received over IBGP sessions are not forwarded on to other
routers in the mesh (see [9] for full details). An IBGP full mesh does not scale for
ASes with a large number of routers. To improve scalability, large ASes use an IBGP
hierarchy such as route reflection [10]. Route reflection allows the re-announcement
of some routes learned over IBGP sessions. However, it trades the number of
candidate routes learned at each router for improved scalability. For example, AS 2
in Fig. 11.2 employs a route reflector hierarchy.
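A quick calculation shows why the full mesh does not scale. The sketch below compares the number of IBGP sessions in a full mesh with the number in a simple route-reflector design. The reflector layout (reflectors fully meshed among themselves, each remaining router a client of a fixed number of reflectors) is an assumption made for illustration, as are the function names.

```python
def full_mesh_sessions(n):
    """IBGP sessions needed to fully mesh n routers: n(n-1)/2."""
    return n * (n - 1) // 2

def route_reflector_sessions(n, reflectors, per_client=2):
    """Sessions in an assumed single-level route-reflector design.

    The reflectors are fully meshed among themselves, and each of
    the remaining n - reflectors routers peers with `per_client`
    reflectors.
    """
    if not 0 < per_client <= reflectors <= n:
        raise ValueError("need 0 < per_client <= reflectors <= n")
    clients = n - reflectors
    return full_mesh_sessions(reflectors) + clients * per_client

mesh = full_mesh_sessions(100)
reflected = route_reflector_sessions(100, reflectors=4, per_client=2)
```

For 100 routers, the full mesh requires 4950 sessions, whereas the assumed design with four reflectors requires 198.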

Fig. 11.2 Example topology with multiple ASes and BGP sessions (figure not reproduced; it shows AS 1, AS 2, AS 3, and AS 4, with a legend distinguishing IBGP sessions, EBGP sessions, BGP routers, and BGP route reflectors)

11.2.1.2 OSPF Overview
As noted in Section 11.2.1, OSPF is a link-state protocol that is widely used to
control routing within an Autonomous System (AS).¹ With link-state routing protocols, each router learns the entire network topology, represented as a
weighted graph, uses it to compute a shortest-path tree with itself as the root, and
applies the results to construct its forwarding table. This ensures that packets are forwarded along the shortest paths, in terms of link weights, to their destinations [11].
We will refer to the computation of the shortest-path tree as an SPF computation,
and the resultant tree as an SPF tree.
For scalability, an OSPF network may be divided into areas that form a two-level hierarchy, as shown in Fig. 11.3. Area 0, known as the backbone area, resides
at the top level of the hierarchy and provides connectivity to the non-backbone areas
(numbered 1, 2, etc.). OSPF assigns each link to one or more areas.² The routers that
have links to multiple areas are called border routers. For example, routers C, D,
and G are border routers in Fig. 11.3. Every router maintains a separate copy of the
topology graph for each area to which it is connected. The router performs the SPF
computation on each such topology graph and thereby learns how to reach nodes in
all adjacent areas.
A router does not learn the entire topology of remote areas. Instead, it learns the
total weight of the shortest paths from one or more border routers to each prefix in
¹ Even though an IGP like OSPF is used for routing within an AS, the boundaries of an IGP domain
and an AS do not have to coincide. An AS may consist of multiple IGP domains; conversely, a
single IGP domain may span multiple ASes.
² The original OSPF specification [6] required each link to be assigned to exactly one area, but a
recent extension [12] allows a single link to be assigned to multiple areas.


Fig. 11.3 An example OSPF topology, the view of the topology from router G, and the shortest-path tree calculated at G. Although we show the OSPF topology as an undirected graph here for
simplicity, the graph is directed in reality (figure not reproduced; its three panels are titled OSPF Network Topology, Topology View of Router G, and Shortest Path Tree at G, and its legend marks border routers and AS border routers)

remote areas. Thus, after computing the SPF tree for each area, the router learns
which border router to use as an intermediate node for reaching each remote node.
In addition, the reachability of external IP prefixes (associated with nodes outside
the OSPF domain) can be injected into OSPF (e.g., x and y in Fig. 11.3). Roughly,
reachability to an external prefix is determined as if the prefix were a node linked
to the router that injects the prefix into OSPF. The router that injects the prefix into
OSPF is called an AS Border Router (ASBR). For example, router A is an ASBR in
Fig. 11.3.
Routers running OSPF describe their local connectivity in Link State Advertisements (LSAs). These LSAs are flooded reliably to other routers in the network. The
routers use LSAs to build a consistent view of the topology as described earlier.
Flooding is made reliable by mandating that a router acknowledge the receipt of
every LSA it receives from every neighbor. The flooding is hop-by-hop and hence
does not itself depend on routing. The set of LSAs in a router’s memory is called
the link-state database and conceptually forms the topology graph for the router.
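The hop-by-hop nature of flooding can be seen in a small sketch. The model below is deliberately simplified and is not OSPF's actual procedure: it ignores acknowledgments, retransmissions, and LSA ages, and it assumes that a router installs a new LSA on first receipt, forwards it to every neighbor except the one it came from, and discards duplicates. The function name and the topology are hypothetical.

```python
from collections import deque

def flood(adjacency, origin):
    """Flood one LSA from `origin`; return {router: hops to first receipt}.

    `adjacency` maps each router to the list of its neighbors.
    Only neighbor relationships are used: no routing table is consulted.
    """
    received = {origin: 0}
    queue = deque([(origin, None)])  # (router, neighbor it heard the LSA from)
    while queue:
        router, sender = queue.popleft()
        for nbr in adjacency[router]:
            if nbr == sender:
                continue          # do not send the LSA back to its sender
            if nbr in received:
                continue          # duplicate copy: not installed or re-flooded
            received[nbr] = received[router] + 1
            queue.append((nbr, router))
    return received

# Hypothetical adjacencies between five routers.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
reach = flood(adjacency, "A")
```

Every router in the connected network ends up with a copy of the LSA, which is what allows all link-state databases to converge to the same contents.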
Two routers are neighbor routers if they have interfaces to a common network
(i.e., they have a direct path between them that does not go through any other router).
Neighbor routers form an adjacency so that they can exchange LSAs with each
other. OSPF allows a link between the neighbor routers to be used for forwarding
only if these routers have the same view of the topology, i.e., the same link-state
database for the area the link belongs to. This ensures that forwarding data packets
over the link does not create loops. Thus, two neighbor routers make sure that their
link-state databases are in sync by exchanging out-of-sync parts of their link-state
databases when they establish an adjacency.


11.2.2 Instrumentation for Route Monitoring
As mentioned, routers exchange information about the topology with other routers
in the network to build their forwarding tables. As a result, understanding control plane dynamics requires collecting these messages and analyzing them. In this
section, we focus on the collection aspect, leaving analysis for the next section. We
first focus on how to instrument a single router, before turning our attention to the
network-wide collection of messages.

11.2.2.1 Collecting Data from a Single Router
Even though the kind of information exchanged in routing messages varies from
protocol to protocol, the flow of messages through individual routers can be modeled
in the same manner, as depicted in Fig. 11.4. Every router receives messages from its neighbors from time to time. These messages are sent by neighbors in
response to events occurring in the network or the expiration of timers; again, the exact
reasons are protocol specific. As described in Section 11.2.1, each message describes
some aspect of the network topology or reachability to a prefix along with a set of
attributes. Upon receiving the message, the router runs its route selection procedure
taking the newly received message into account. The procedure can change the best
route to one or more prefixes in the FIB. A router also sends messages to neighbors
as network topology and/or reachability to prefixes change – the trigger and contents
of the messages depend on the protocol. Given this, understanding the routing dynamics
of a router requires instrumenting the router to collect (i) the messages arriving
over all its links, (ii) the changes induced in the FIB, and (iii) the messages sent
to all its neighbors.
Some protocols such as BGP allow routers to apply import policies to incoming
messages; applying these policies results in messages either being dropped or having their attributes modified. In such a scenario, it might be beneficial to collect incoming
messages before and after application of import policies. In a similar vein, BGP

Fig. 11.4 Message flow through a router (figure not reproduced; its labels are: incoming routing messages, route selection process, best route, FIB, and outgoing routing messages)


applies export policies to outgoing messages before they are sent to neighbors, in
which case messages can be collected before and after the application of export
policies.
Ideally, one would like the router to “copy” every incoming and outgoing
message, as well as every change to the FIB, to a management station. In reality, no
standardized way of achieving this exists, and as a result no current router implementation supports it. Despite this, one can obtain an approximation of the
required information in several ways. One way is to use splitters to
read messages directly off a link. Unfortunately, this option is often impractical and
expensive, and it does not scale beyond a few routers and links. For this reason, it
is rarely used in practice. Another option is to log into the router through its
CLI (Command Line Interface) or query SNMP MIBs [13] to extract the required
information. Routers (and the routing protocols running on them) often store a copy of
the most recently received and transmitted messages in memory and allow them to
be queried via the CLI or SNMP MIBs. Thus, a network management station can periodically pull the information out of a router. Unfortunately, it is almost impossible
to capture every incoming/outgoing message this way, since even the most frequent
polling supportable by routers falls far short of the highest frequency at which routing messages are exchanged. Even so, this option is sometimes used in practice since
it provides a fairly inexpensive and practical way of getting some information about
the routing state of a router. For example, the Peer Dragnet [14] tool uses information captured via the CLI to analyze inconsistent routes sent by EBGP peers of
an AS.
A third option for collecting routing messages is to establish a routing session with
a router, just as any other router would. This forces the router to send messages as it
would to any other router.³ Obviously, this approach does not give information about
incoming messages or changes to the FIB. Even for an outgoing message, the management station does not receive the message at the time the router sends it to other
neighbors. Despite this, the approach provides valuable information about route dynamics. For distance-vector protocols, the outgoing message is usually the route
selected by the router, and for link-state protocols, these messages describe updates
to the topology view of the router. As a result, this approach is used quite extensively
in practice. For example, RouteViews [15] and RIPE [16] use it to collect BGP updates from
several ASes and their routers, and the OSPF Monitor described in [1], and later
in Section 11.2.4, uses it to collect OSPF messages. One serious practical issue with this approach is the potential
injection of routing messages from the management station, which could disrupt
the functioning of the control plane. For protocols that allow import policies (e.g.,
BGP), one could apply a policy to drop any incoming messages from the management station, but for other protocols (e.g., OSPF, IS-IS) the only way to protect
against injection of messages is to rely on the correctness of the software running
on the management station.
³ A router running a distance-vector protocol sends its selected route for a given prefix to all its
neighbors, except the next-hop of the route when split horizon [8] is implemented. It is this selected
route that we are interested in, and will receive, at the management station.


11.2.2.2 Collecting Network-Wide Data
In Section 11.2.2.1, we discussed ways in which routing messages can be collected
from a single router. In this section, we expand our focus to the entire network. The
key question we focus on is: how many routers does one need to collect routing
messages from? The naive answer is: from all routers of the network. Indeed, if the
aim is to learn about each and every message flowing between routers and the exact
state of routers at every instant, then there is no choice but to collect messages from all routers. In reality, collecting messages from all routers is extremely
challenging due to scale. Thus, in practice the answer depends on the kind of
routing protocol and the requirements of the analysis. Let’s go into some details.
The kind of routing protocol – whether link-state or distance-vector – plays a major role in deciding how many routers one needs to collect data from. In a link-state
protocol, every router learns the entire view of the network topology, and so collecting messages from even a single router is enough to determine the overall state
of the network topology. As we will see later in Sections 11.2.3 and 11.2.4, even
this seemingly “limited” data enables a rich set of management applications. Some
examples are (and we will talk about these in more detail in subsequent sections):
(i) ability to track network topology and its integrity (against design rules) in real time, (ii) ability to determine events such as router/link up/downs and link weight
changes as they unfold, (iii) ability to determine how forwarding paths evolve in
response to network events, and (iv) ability to determine workload imposed by the
routing messages. We should emphasize that, for all these applications, the data
provides the “view” of the router from which it is collected at that
point in time. Other routers’ views can be somewhat different due to message propagation and processing delays. The exact nature of these delays, how they are affected
by other events in the network, and their implications for the analysis/application at
hand are poorly understood. Our belief is that these delays are small (on the order of milliseconds) in most cases, and thus can be safely ignored for all practical
purposes.
The story is different for distance-vector protocols, since every router gets only a partial view of the topology: the distances to prefixes reported by its neighbors. As a result,
one often needs views from multiple, if not all, routers. The exact set depends on
the network configuration and on the kind of analysis being performed. For example, if one wants to learn external routes coming into an AS, it suffices to monitor
BGP routes from the routers at the edge of the network. In fact, numerous studies on
BGP dynamics, inter-AS topology and relationships between ASes have been carried out based on BGP data collected from a fairly small set of ASes at RouteViews
and RIPE. Although the completeness and representativeness of these studies are debatable, there is no doubt that such studies have tremendously increased awareness
about BGP and its workings in the Internet. Furthermore, by combining routing data
collected from a subset of routers with other network data, one can often determine
routing state of other routers – at least in steady state once routing has converged
after a change. For example, a paper by Feamster and Rexford [17] describes a


methodology to determine BGP routes at every router inside an AS based on routes
learned at the edge of the network, and configuration of IBGP sessions.

11.2.3 Applications of Route Monitoring
In this section, we demonstrate the utility of the data collected by route monitors. We
first describe the basic functionality enabled by the data. We then describe how this
basic functionality can be used in various network management tasks. Finally, we
describe how the data has been used in advancing the understanding of the behavior
of routing protocols in real life.

11.2.3.1 Information Provided by Route Monitors
Routing State and Dynamics Route monitors capture routing messages, and so
they naturally provide information about the current state of routing and how it
evolves over time. This information is useful for a variety of network management
tasks such as troubleshooting and forensics, capacity planning, trending, and traffic
engineering to name a few. For link-state protocols, the routing messages provide
information about the topology (i.e., set of routers, links and link weights), whereas
for distance-vector protocols, the information consists of route tables (i.e., set of
destinations and the next-hop and distance from the router in question). Both pieces
of information are useful. Furthermore, calculating routing tables from topology is
straightforward: one just needs to emulate route calculation for every router in the
topology. Going in the other direction from routing tables to topology is easy if
information from all routers (running the distance-vector protocol) is available. In
practice though, information is often collected from a subset of routers, in which
case, deriving a complete topology view may not be possible.
End-to-End Paths Knowing what path traffic takes in the network (from one router
to another) is crucial for network management tasks such as fault localization and
troubleshooting. For example, a link failure can affect performance of all paths
traversing the link. If the only way of detecting such failures is through end-to-end active probing, then knowing paths would allow operators to quickly localize
the problem to the common link. Routing messages collected by route monitors allow one to determine these paths and how they evolve in response to routing events.
Note that active probes (e.g., traceroute) also allow one to determine end-to-end
paths in the network. However, tracking path changes in response to network events
using active probing suffers from major scalability problems. First of all, the number of router pairs in a large network can be in the range of hundreds of thousands
to millions. This makes probing every path at a fine time scale prohibitively expensive. A second problem arises due to the use of multiple equal cost paths (known as
ECMP) between router pairs. ECMP arises when more than one path with the smallest


weight exists between a pair of routers. Most intradomain protocols such as OSPF use
all such paths by spreading data traffic across them.⁴ Since service providers often
have redundant links in their networks, router pairs are more likely to have multiple
paths than not. ECMP unfortunately exacerbates the scalability problem for active
probing. Furthermore, engineering probes so that all ECMPs are covered is next to
impossible, since how routers will spread traffic across multiple paths is almost
impossible to determine a priori.
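Given the topology captured by a route monitor, the number of equal-cost shortest paths between a router pair can be computed directly rather than discovered by probing. The following is a minimal sketch under the assumption that all link weights are positive; the function names and the seven-router topology are hypothetical.

```python
import heapq

def count_ecmp_paths(graph, src, dst):
    """Return (shortest distance, number of equal-cost shortest paths).

    `graph` maps each router to {neighbor: weight}; all weights must
    be positive for the path counts to be correct.
    """
    dist = {src: 0}
    count = {src: 1}
    done = set()
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for nbr, weight in graph.get(node, {}).items():
            nd = d + weight
            if nbr not in dist or nd < dist[nbr]:
                dist[nbr] = nd
                count[nbr] = count[node]      # strictly better: restart count
                heapq.heappush(heap, (nd, nbr))
            elif nd == dist[nbr]:
                count[nbr] += count[node]     # another equal-cost way in
    return dist.get(dst), count.get(dst, 0)

def undirected(edges):
    """Build a graph from (u, v, weight) tuples, adding both directions."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

# Two "diamonds" in series: A-{B,C}-D-{E,F}-G, all weights 1.
net = undirected([
    ("A", "B", 1), ("A", "C", 1), ("B", "D", 1), ("C", "D", 1),
    ("D", "E", 1), ("D", "F", 1), ("E", "G", 1), ("F", "G", 1),
])
distance, paths = count_ecmp_paths(net, "A", "G")
```

Two redundant stages already give four equal-cost paths between A and G, which illustrates how quickly the number of paths a prober would need to cover can grow.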

11.2.3.2 Utility of Route Monitors in Network Management
The data provided by route monitors and the basic information gleaned from them
aid several network management tasks such as troubleshooting and forensics, network auditing, and capacity planning. Below we provide a detailed account of how
this is done for each of these three tasks.
Network Troubleshooting and Forensics Route monitors provide a view into
routing events as they unfold. This view can be in the form of topology, routing tables, or end-to-end paths as mentioned in the previous sections; which form proves
useful often depends on the specific troubleshooting task at hand. For example, if a
customer complains about loss of reachability to certain parts of the Internet, looking at BGP routes and their history can provide clues about causes of the problems.
Similarly, if performance issues are seen in some parts of the network, knowing
what routing events are happening and how they are affecting paths can provide an
explanation for the issues. Note that the route monitors’ utility stems not only from
the current view of routing they provide (after all, operators can always determine
the current view by logging into routers), but also from the historical data they provide,
which allows operators to piece together the sequence of events leading to a problem. Routers do not store historical state, and so cannot provide such information.
Going back to the customer complaining about lost reachability, it is
rarely enough to determine the current state of the route, especially if no route exists
to the prefix. To effectively pinpoint the problem, the operator might also need to
know the history of route announcements and withdrawals for the prefix, and that
data can only be provided by route monitors. Figure 11.5 shows a snapshot of a tool
that allows operators to view the sequence of BGP route updates captured by a monitor
deployed in a tier-1 ISP.
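The kind of query such a tool answers can be sketched in a few lines. The three records below are the updates for 192.168.0.0/24 listed in Fig. 11.5; the record layout and the function names are assumptions made for illustration, and the updates are assumed to be sorted by time and to fall on the same day, so that the HH:MM:SS strings compare correctly.

```python
# Updates for one prefix, as (time, event, prefix, ASPath) records.
UPDATES = [
    {"time": "18:32:50", "event": "WITHDRAW", "prefix": "192.168.0.0/24",
     "as_path": None},
    {"time": "18:32:58", "event": "ANNOUNCE", "prefix": "192.168.0.0/24",
     "as_path": [65001, 65223, 65145]},
    {"time": "18:33:47", "event": "ANNOUNCE", "prefix": "192.168.0.0/24",
     "as_path": [65001, 65023, 65145]},
]

def route_history(updates, prefix):
    """All recorded updates for `prefix`, in the order they were captured."""
    return [u for u in updates if u["prefix"] == prefix]

def route_at(updates, prefix, when):
    """ASPath of the route known at time `when`, or None if withdrawn.

    Replays the history up to `when`: an announcement replaces the
    current route, and a withdrawal removes it.
    """
    current = None
    for update in route_history(updates, prefix):
        if update["time"] > when:
            break
        current = update["as_path"] if update["event"] == "ANNOUNCE" else None
    return current

gap = route_at(UPDATES, "192.168.0.0/24", "18:32:55")
```

Replaying the history shows that the monitored router had no route to the prefix for the eight seconds between the withdrawal and the first re-announcement, something a snapshot of the router's current state taken later would not reveal.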
Network Auditing and Protocol Conformance Another use of route monitors is
for auditing the integrity of the networks and conformance of routing protocols to
their specifications. To audit the integrity of the network, one needs to devise rules
against which the actual routing behavior can be checked. For example, network
administrators often have conventions and rules about weights assigned to links.

⁴ The exact algorithm for spreading traffic across ECMPs is implemented in the forwarding engine
of routers.


BGP Route History for 0.0.0.0/0 and its Subnets

Count  Time (GMT)                Router    Event     Prefix            ASPath                         Local Pref  Origin  MED  Next-hop
1      Wed Apr 1 18:32:50 2009   10.0.0.1  WITHDRAW  192.168.0.0/24    ----                           --
2      Wed Apr 1 18:32:50 2009   10.0.0.1  ANNOUNCE  172.16.3.0/23     65001 65010 65145              90          IGP     0    10.0.1.3
3      Wed Apr 1 18:32:52 2009   10.0.0.1  ANNOUNCE  10.1.123.0/12     65001 65126                    80          IGP     25   10.0.1.8
4      Wed Apr 1 18:32:55 2009   10.0.0.1  ANNOUNCE  192.168.3.0/18    65001 65324 65002 65121 65084  80          IGP     0    10.0.2.1
5      Wed Apr 1 18:32:58 2009   10.0.0.1  ANNOUNCE  192.168.0.0/24    65001 65223 65145              65          IGP     100  10.0.1.1
6      Wed Apr 1 18:33:31 2009   10.0.0.1  ANNOUNCE  172.23.4.0/21     65001 65132                    90          IGP     10   10.0.2.1
7      Wed Apr 1 18:33:44 2009   10.0.0.1  ANNOUNCE  10.231.34.64/20   65001 65010 65192 65034        65          IGP     12   10.0.1.45
8      Wed Apr 1 18:33:47 2009   10.0.0.1  ANNOUNCE  192.168.0.0/24    65001 65023 65145              90          IGP     0    10.0.1.1
9      Wed Apr 1 18:34:08 2009   10.0.0.1  ANNOUNCE  172.22.73.0/25    65001 65420 65321 65005        70          IGP     0    10.0.2.12
10     Wed Apr 1 18:34:21 2009   10.0.0.1  ANNOUNCE  172.172.72.0/21   65001 65014 65105              110         IGP     10   10.0.1.109

Fig. 11.5 Screen-shot of a tool to view BGP route announcement/withdrawals

It then becomes necessary to monitor the network for potential deviations (whether
intentional or due to mistakes) from these rules. Since (intradomain) routing
messages provide current information about link weights, they are an ideal
source for checking whether the network’s actual state conforms to the design rules.
Checking that the network state matches the design rules is especially crucial
during maintenance windows, when a network undergoes significant change. As with network auditing, routing messages can also be used to verify that protocol
implementations conform to their specifications. At the very least, one can check
whether the message format is correct as per the specifications. Another check is
to compare the rate and sequence of messages against the expected behavior. The
“Refresh LSA bug” caught by the OSPF Monitor [1], where OSPF LSAs were being
refreshed much faster than the recommended value [6], is an instance of this.
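Both kinds of checks are simple to express once the routing messages are available. The sketch below is illustrative only. The design rule it audits (the two directions of a link must carry the same weight) is a hypothetical example of an operator convention, the tolerance is an arbitrary choice, and the input is assumed to contain only refreshes of unchanged LSAs. The 1800-second constant corresponds to the 30-minute LSRefreshTime of the OSPF specification.

```python
LS_REFRESH_TIME = 1800  # seconds; the 30-minute refresh interval in OSPF

def asymmetric_links(weights):
    """Links whose two directions carry different (or missing) weights.

    `weights` maps a directed link (u, v) to its weight. Symmetry is a
    hypothetical design rule used here only as an example.
    """
    pairs = {tuple(sorted(link)) for link in weights}
    return sorted((u, v) for u, v in pairs
                  if weights.get((u, v)) != weights.get((v, u)))

def fast_refreshers(arrivals, tolerance=0.5):
    """LSAs refreshed faster than tolerance * LS_REFRESH_TIME.

    `arrivals` maps an LSA identifier to the sorted timestamps (in
    seconds) at which successive refreshes were observed. Returns
    {lsa_id: smallest observed gap} for the offending LSAs.
    """
    limit = LS_REFRESH_TIME * tolerance
    flagged = {}
    for lsa_id, times in arrivals.items():
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if gaps and min(gaps) < limit:
            flagged[lsa_id] = min(gaps)
    return flagged

# Hypothetical observations.
weights = {("A", "B"): 1, ("B", "A"): 1, ("B", "C"): 5, ("C", "B"): 2}
arrivals = {"R1": [0, 1800, 3600], "R2": [0, 600, 1200]}
violations = asymmetric_links(weights)
suspects = fast_refreshers(arrivals)
```

With these inputs the audit flags the link between B and C, and the conformance check flags the LSA labeled R2, which is refreshed every 600 seconds.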
Capacity Planning Capacity planning, where network administrators determine
how to expand their network to accommodate growth, is another task where routing
data is extremely useful. In particular, the data allows planners to see how routing
traffic is growing over time, which can then be used to predict resources required in
the future. As such, the growth of two parameters is very important: the number of
routes in the routing table, and the rate at which routing messages are disseminated.
The former has significant bearing on the memory required on the routers, whereas
the latter affects the CPU (and sometimes bandwidth) requirements for routers. For
service providers, accurately knowing how long current CPU/memory configuration
on routers can last, and when upgrades will be needed is extremely important for operational and financial planning. The growth patterns revealed by routing data play
a key role in forming these estimates. These estimates also allow service providers
to devise optimization techniques to reduce resource consumption. For example,
consider the layer-3 MPLS VPN [18] service, which allows enterprise customers to interconnect their (geographically distributed) sites via secure, dedicated tunnels over
a provider network. Over the last few years, this service has seen widespread
deployment. This has led to tremendous growth in the number of BGP routes a
VPN service provider has to keep track of, resulting in heavy memory usage on its


routers. Recognizing this scalability problem, Kim et al. [19] have proposed a solution
that allows a service provider to trade off direct connectivity between sites (e.g.,
moving from any-to-any to a more restricted hub-and-spoke model, where traffic between two sites
has to go through one or more hub sites) against the number of routes that need to
be stored. The data collected by the route monitors was crucial in this work: first,
to realize that there is a problem, and next, to evaluate the efficacy of the scheme
in realistic settings. In particular, Kim et al. show a 90% reduction in memory usage while limiting the path stretch between sites to only a few hundred miles and the extra
bandwidth usage to less than 10%.
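The growth estimates mentioned above, which tell a provider when current router resources will be exhausted, can be approximated with a simple trend fit. The sketch below assumes linear growth and fits a least-squares line to monthly route counts; the figures are made up for illustration, and real growth need not be linear.

```python
def linear_fit(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def months_until(capacity, months, routes):
    """Months after the last sample until the fitted trend reaches `capacity`.

    Returns None if the route count is not growing.
    """
    slope, intercept = linear_fit(months, routes)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope - months[-1]

# Hypothetical monthly samples of the number of routes held by a router.
months = [0, 1, 2, 3]
routes = [300_000, 310_000, 320_000, 330_000]
remaining = months_until(500_000, months, routes)
```

With these made-up samples the table grows by 10,000 routes per month, so a router that can hold 500,000 routes would be full about 17 months after the last sample.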

11.2.3.3 Performance Assessment of Routing Protocols
Routing data is key to understanding how routing protocols behave and perform in
real life. We have already talked about one aspect of this behavior above, namely
conformance to the specifications. Here we would like to talk about other aspects
of performance, such as stability and convergence, which are key to quantifying the overall performance of the routing infrastructure. For example, numerous
studies detailing BGP’s behavior in the Internet have been enabled by the
data collected by RouteViews [15], RIPE [16], and other BGP monitors. We briefly
describe some studies to illustrate the point.
Route updates collected by BGP monitors have led to several studies analyzing the stability (or lack thereof) of BGP routing in the Internet.⁵ Govindan and
Reddy [20] were the first to study the stability of BGP routes back in 1997 – a
couple of years after the commercialization of the Internet started. Their study analyzed
BGP route updates collected from a large ISP and a popular Internet exchange point
(where several service providers interconnect to exchange routes and traffic).
The study found clear evidence of deteriorating stability of BGP routes, which it attributed to the rapid growth of the Internet (a doubling of the number of ASes and prefixes in about
2 years). Subsequently, Labovitz et al. [21] observed a higher than
expected number of BGP updates in the data collected at five US public Internet
exchange points. The truly surprising aspect of their study was the finding that about
99% of these updates did not indicate real topological changes, and had no reason
to be there. The authors found that some of these updates were due to bugs in the
BGP software of a router vendor at that time. The vendor’s fixes for these bugs led
to an order of magnitude reduction in the volume of BGP route updates [22].
Convergence, the time taken by a routing protocol to recalculate new paths after
a network change, is another critical performance metric. Labovitz et al. [23] were
the first to systematically study this metric for BGP in the Internet. They found
that BGP often took tens of seconds to converge – an order of magnitude more
than what was thought at that time. The problem, as they showed, stems from the

⁵ The term stability refers to the stability of BGP routes, which roughly corresponds to how
frequently they undergo changes.


inclusion of ASPath in BGP route announcements (i.e., the very thing that makes
BGP a path-vector protocol). The purpose of including the ASPath is to prevent
loops and the “count-to-infinity” problem6 that BGP’s distance-vector brethren (e.g.,
RIP) suffer from. However, this leads to “path exploration”, as shown by Labovitz
et al., where routers might cycle through multiple (often transient) routes with different ASPaths before settling on the final (stable) routes, thereby lengthening
convergence times. Several ways of mitigating this problem have been proposed
since then, essentially by including more information in BGP routes [24–28], but
none of them have seen deployment to date.
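The loop-prevention role of the ASPath can be illustrated with a minimal Python sketch. The function names and AS numbers are hypothetical, and real BGP implementations apply many additional checks. The sketch only shows that a router discards an announcement whose ASPath already contains its own AS number, and prepends its AS number before re-announcing a route.

```python
def accept_announcement(local_as, as_path):
    """Reject a route whose ASPath already contains the local AS (loop prevention)."""
    return local_as not in as_path


def readvertise(local_as, as_path):
    """Prepend the local AS to the ASPath before announcing the route to a neighbor."""
    return [local_as] + list(as_path)


# Hypothetical AS numbers, chosen from the range reserved for documentation.
path_from_neighbor = [64501, 64502, 64503]

ok_at_64500 = accept_announcement(64500, path_from_neighbor)  # not on the path
ok_at_64502 = accept_announcement(64502, path_from_neighbor)  # already on the path
new_path = readvertise(64500, path_from_neighbor)
print(ok_at_64500, ok_at_64502, new_path)
```

Because every ASPath a router accepts is loop-free, a router that loses its best route may still hold several longer, loop-free alternatives, which is what makes path exploration possible.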
Mao et al. [29] tied together the hitherto independently explored stability and convergence
aspects of BGP by showing how route flap damping (RFD) [30], used for
improving the stability of BGP, could interact with path exploration to adversely impact BGP convergence. RFD is a mechanism that limits the propagation of unstable
routes, thereby mitigating the adverse impact of persistent flapping of network elements
and misconfigurations. It improves the overall stability of BGP, and was a recommended practice [31] in the early 2000s. Unfortunately, as Mao et al. showed, RFD can
also suppress relatively stable routes by treating route announcements received during path exploration as evidence of instability of a route. Specifically, the study
showed that a route needs to be withdrawn only once and then re-announced for
RFD to suppress it for up to an hour in certain circumstances. This work, coupled
with a manifold increase in router CPU processing capability, resulted in a recommendation by RIPE [32] to disable RFD.
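The interaction between path exploration and damping can be made concrete with a small sketch of the penalty arithmetic. This is a simplified model, not the analysis in [29]: the half-life, the per-update penalty, and the two thresholds are illustrative assumptions, and every update is charged the same penalty. Under these assumptions, three updates arriving in quick succession during path exploration are enough to push a route over the suppress threshold.

```python
import math

HALF_LIFE_MIN = 15.0        # assumed half-life of the penalty, in minutes
PENALTY_PER_UPDATE = 1000   # assumed penalty charged for each observed update
SUPPRESS_THRESHOLD = 2000   # assumed: route is suppressed above this penalty
REUSE_THRESHOLD = 750       # assumed: route is reused once penalty decays to this


def decay(penalty, minutes):
    """Exponentially decay a penalty over the given number of minutes."""
    return penalty * 0.5 ** (minutes / HALF_LIFE_MIN)


def penalty_after(update_times_min):
    """Penalty immediately after the last update in a sorted list of update times."""
    penalty, last = 0.0, None
    for t in update_times_min:
        if last is not None:
            penalty = decay(penalty, t - last)
        penalty += PENALTY_PER_UPDATE
        last = t
    return penalty


def minutes_until_reuse(penalty):
    """Time for the penalty to decay to the reuse threshold (0 if already below)."""
    if penalty <= REUSE_THRESHOLD:
        return 0.0
    return HALF_LIFE_MIN * math.log2(penalty / REUSE_THRESHOLD)


# One real event whose path exploration yields three updates at (nearly) the same time.
p = penalty_after([0.0, 0.0, 0.0])
print(p > SUPPRESS_THRESHOLD, minutes_until_reuse(p))
```

With these assumed values, the route stays suppressed for two half-lives even though the underlying network event happened only once.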
Routing data is valuable not only for analyzing the performance of individual protocols, but also for understanding how they interact with one another, as
Teixeira et al. [33] did by focusing on how OSPF distance changes in a tier-1
ISP affected BGP routing. Their study showed that despite the apparent separation
between intradomain and interdomain routing protocols, OSPF distance changes do affect
BGP routes due to what is known as “hot-potato routing”.7 The extent of the impact depended on several factors, including the location and timing of a distance change.
Even more surprisingly, BGP route updates resulting from such changes could lag
by as much as a minute in some cases, resulting in large delays in convergence.
In closing, these and numerous other studies have not only enhanced our knowledge of how routing protocols behave in the Internet, but have also led to improvements in their performance (such as reduction in unwarranted BGP updates
or disabling of RFD as mentioned earlier).

6
With distance-vector protocols, two or more routers can get locked into a cyclical dependency
where each router in the cycle uses the previous router as a next-hop for reaching a destination.
The routers then increment their distance to the destination in a step-wise fashion until all of them
reach infinity, which is termed “counting to infinity”. For more details, refer to [8].
7
As explained in Section 11.2.1.1, hot-potato routing refers to BGP’s propensity to select the
shortest way out of its local AS to a prefix when presented with multiple equally good routes
(i.e., ways out of the AS). This allows an AS to hand off data packets as quickly as possible to its
neighboring AS much like a hot potato.

11.2.4 Case Study of a Route Monitor: The AT&T OSPF Monitor
Several route monitoring systems are available, both as academic/research endeavors
and as commercial products. RouteViews [15] and RIPE [16] collect BGP route
updates from several ISPs and backbones around the world. The data is used extensively for both troubleshooting and academic studies of the interdomain routing
system. The corresponding web sites also list several tools for analysis of the data.
On the intradomain side, a paper by Shaikh and Greenberg [1] describes an OSPF
monitor. The paper provides a detailed description of the architecture and design of
the system and follows it up with a performance evaluation and deployment experience. On the commercial side, Packet Design’s Route Explorer [34] and Packet
Storm’s Route Analyzer [35] are route monitoring products. The Route Explorer
provides monitoring capability for several routing protocols including OSPF, IS-IS,
EIGRP and BGP, whereas Route Analyzer provides similar functionality for OSPF.
Of the various route monitoring systems mentioned above, we focus on the OSPF
Monitor described by Shaikh and Greenberg [1] as a case study in this section since
the paper provides extensive details about system architecture, design, functionality,
and deployment. This is something not readily available for other route monitoring
systems, especially the architecture and design aspects, which are key to understanding how control plane monitoring is realized in practice. From here on, we will
refer to the OSPF Monitor described in [1] as the AT&T OSPF Monitor, and go into
details of the system in terms of data collection and analysis aspects next.
The AT&T OSPF Monitor separates data collection (specifically, collection of LSAs) from
data analysis. The main reasoning behind this is to keep data collection as passive
and simple as possible due to the collector’s proximity to the network. The component used for LSA collection is called an LSA Reflector (LSAR). The data analysis, on
the other hand, is divided into two components: LSA aGgregator (LSAG) and OSPFScan. The LSAG deals with LSA streams in real time, whereas OSPFScan provides
capabilities for off-line analysis of the LSA archives. This three-component architecture is illustrated in Fig. 11.6. We briefly describe these three components now.
The LSAR supports three modes for capturing LSAs: the host mode, the full
adjacency mode, and the partial adjacency mode. With the host mode, which only
works on broadcast media such as an Ethernet LAN, the LSAR subscribes to a multicast group to receive LSAs being disseminated. This is a completely passive way of
capturing LSAs, but it suffers from reliability issues and slow initialization of the link-state
database, and it only works on broadcast media. With the full adjacency mode, the
LSAR establishes an OSPF adjacency with a router to receive LSAs. This allows
LSAR to leverage OSPF’s reliable flooding mechanism, thereby overcoming both
the disadvantages of the host mode. However, the main drawback of this approach
is that instability of LSAR or its link to the router can trigger SPF calculations in
the entire network, potentially destabilizing the network. The reason for SPF calculation stems from the fact that with a full adjacency, the router includes a link to
the LSAR in its LSA sent to the network. The partial adjacency mode of collecting
LSAs provides a way to circumvent this problem while retaining all the benefits of
having an adjacency. In this mode, the LSAR establishes adjacency with a router,

Fig. 11.6 The architecture of the AT&T OSPF monitor described in [1]. Labels recoverable from the figure: LSAG (real-time monitoring), OSPFScan (off-line analysis), LSA archive, TCP connection, LSAR 1 and LSAR 2 (“reflect” LSAs), LSA cache (two instances), and an OSPF domain with Area 0, Area 1, and Area 2.

but allows it to proceed only to a stage where LSAs can be received over it from the
router, without the adjacency being included in the LSA sent by the router to the network. To
keep the LSAR-router adjacency in this intermediate state, the LSAR describes its
own Router-LSA8 to the router during the link-state database synchronization process but never actually sends it out to the router. As a result, the database is never
synchronized, and the adjacency stays in OSPF’s loading state [6] and is never fully
established. Keeping the adjacency in the loading state protects the network from
the instability of the LSAR or its link to the router.
Having described data collection by the LSAR, let us now turn our attention to
the LSAG, which processes LSAs in real time. The LSAG populates a model of
the OSPF network topology as it processes the LSAs. The model captures elements
such as OSPF areas, routers, subnets, interfaces, links, and the relationships between
them (e.g., an area object consists of a set of routers that belong to the area, a router
object in turn consists of a set of interfaces belonging to the router, etc.). Using
the model as a base, the LSAG identifies changes (such as router up/down, link
up/down, link cost changes, etc.) to the network topology and generates messages
about them. Even though there are only about five basic network events, about 30
different types of messages are generated by the LSAG because of how broadcast
media (such as Ethernet) are supported in OSPF, how a change in one area propagates to other areas, and how external information is redistributed into OSPF. In
addition to identifying changes to the network topology, the LSAG also identifies
elements that are unstable, and generates messages about such flapping elements.
The LSAG also generates messages for non-conforming behavior, such as when

8

A Router-LSA in OSPF is originated by every router to describe its outgoing links to adjacent
routers along with their associated weights.

refresh LSAs are observed too often. Apart from using the topology model to identify changes, the LSAG also uses it to produce snapshots of the topology periodically
and when network changes occur. One use of these snapshots is for performing an
audit of link weights as described in Section 11.2.3.2.
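The way a topology model can be used to derive change messages may be sketched as follows. This is a much reduced illustration, not the LSAG's actual data model: the `Topology` class, the `diff` function, and the message strings are hypothetical, and areas, subnets, and interfaces are omitted. The sketch compares two snapshots and reports router up/down, link up/down, and link cost changes.

```python
from dataclasses import dataclass, field


@dataclass
class Topology:
    """Snapshot of a topology: a set of routers and directed, weighted links."""
    routers: set = field(default_factory=set)
    links: dict = field(default_factory=dict)  # (from_router, to_router) -> cost


def diff(old, new):
    """Return human-readable messages describing the changes between two snapshots."""
    msgs = []
    for r in sorted(new.routers - old.routers):
        msgs.append(f"ROUTER UP {r}")
    for r in sorted(old.routers - new.routers):
        msgs.append(f"ROUTER DOWN {r}")
    for a, b in sorted(new.links.keys() - old.links.keys()):
        msgs.append(f"LINK UP {a}->{b} cost {new.links[(a, b)]}")
    for a, b in sorted(old.links.keys() - new.links.keys()):
        msgs.append(f"LINK DOWN {a}->{b}")
    for a, b in sorted(old.links.keys() & new.links.keys()):
        if old.links[(a, b)] != new.links[(a, b)]:
            msgs.append(
                f"LINK COST CHANGE {a}->{b} {old.links[(a, b)]}->{new.links[(a, b)]}"
            )
    return msgs


before = Topology({"A", "B"}, {("A", "B"): 10, ("B", "A"): 10})
after = Topology({"A", "B", "C"}, {("A", "B"): 20, ("B", "A"): 10, ("B", "C"): 5})
messages = diff(before, after)
for m in messages:
    print(m)
```

The same two snapshots could also be archived as they are, which corresponds to the periodic topology snapshots mentioned above.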
Finally, we turn our attention to OSPFScan, which supports off-line analysis of
LSA archives. One thing worth mentioning about the AT&T OSPF Monitor is that
the capabilities supported by OSPFScan for off-line analysis are mostly a superset of
the ones supported in real time by the LSAG, the underlying idea being that anything
that can be done in real time can also be performed off-line as a playback. In terms of
processing of LSAs, OSPFScan follows a three-step process: parse the LSA, test
the LSA against a user-specified query expression, and analyze the LSA according
to user interest if it satisfies the query. The parsing step converts each LSA record
into what is termed a canonical form, to which the query expression and subsequent
analysis are applied. The use of a canonical form makes it easy to adapt OSPFScan
to support LSA archive formats other than the native one used by the LSAR.
The query language resembles C-style syntax; an example query expression is
“areaid == ‘0.0.0.0’”. When a query is specified, OSPFScan matches every LSA
record against the query, carrying out subsequent analysis for the matching records,
while filtering out the non-matching ones. For example, the expression above would
result in the analysis of only those LSAs that were collected from area 0.0.0.0.
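The parse, test, and analyze steps can be sketched as a small pipeline. This is an illustration under stated assumptions, not OSPFScan's implementation: the `key=value;...` record format and the field names `lstype` and `advrt` are hypothetical (only `areaid` appears in the example query above), and the C-style query language is replaced here by an ordinary Python predicate.

```python
def parse(raw):
    """Convert a raw archive record (hypothetical 'key=value;...' text)
    into a canonical dictionary form."""
    return dict(item.split("=", 1) for item in raw.split(";") if item)


def scan(raw_records, query, analyze):
    """Parse each record, keep those that satisfy the query, and analyze the matches."""
    results = []
    for raw in raw_records:
        lsa = parse(raw)
        if query(lsa):
            results.append(analyze(lsa))
    return results


archive = [
    "areaid=0.0.0.0;lstype=router;advrt=10.0.0.1",
    "areaid=0.0.0.1;lstype=router;advrt=10.0.0.2",
    "areaid=0.0.0.0;lstype=network;advrt=10.0.0.3",
]

# Python stand-in for the query expression areaid == '0.0.0.0'.
matches = scan(
    archive,
    query=lambda lsa: lsa["areaid"] == "0.0.0.0",
    analyze=lambda lsa: lsa["advrt"],
)
print(matches)
```

Because the query and the analysis only see the canonical form, supporting another archive format would require changing `parse` alone, which mirrors the adaptability argument made above.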
In terms of analysis, OSPFScan provides the following capabilities:
1. Modeling Topology Changes. Recall that OSPF represents the network topology as a graph. OSPFScan therefore allows modeling of OSPF dynamics as a
sequence of changes to the underlying graph, where a change represents the addition or deletion of vertices or edges in this graph. Furthermore, OSPFScan allows a
user to analyze these changes by saving each change as a single topology change
record. Each such record contains information about the topological element
(vertex/edge) that changed along with the nature of the change. For example,
a router is treated as a vertex, and the record contains the OSPF router-id to identify it. We should point out that the topology change records and LSAG message
logs essentially describe the same thing, but the former are geared more toward computer processing, whereas the latter are aimed at humans.
2. Emulation of OSPF Routing. OSPFScan allows a user to reconstruct the routing
tables of a given set of routers at any point in time based on the LSA archives. For
a sequence of topology changes, OSPFScan also allows the user to determine
changes to these routing tables. Together, these allow calculation of end-to-end
paths through the OSPF domain at a given time, and show how a path changed in
response to network events over a period of time. The routing tables also facilitate
analysis of OSPF’s impact on BGP through hot-potato routing [33].
3. Classification of LSA Traffic. OSPFScan allows various ways of “slicing-and-dicing” LSA archives. For example, it allows isolating LSAs indicating
changes from the background refresh traffic. As another example, it also allows classification of LSAs (both change and refresh) into new and duplicate
instances. This capability was used in a case study that analyzed one month of LSA
traffic for an enterprise network [36].
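The routing emulation capability in item 2 rests on a shortest-path computation over the topology graph. The following sketch uses Dijkstra's algorithm on a small hypothetical three-router topology to build a routing table before and after a link cost change. It is an illustration only: the function name, router names, and costs are assumptions, and details such as areas and equal-cost multipath are ignored.

```python
import heapq


def shortest_paths(links, source):
    """Dijkstra's algorithm over directed, weighted links.

    links maps (u, v) -> cost. Returns {router: (distance, first_hop)},
    where first_hop is None for the source itself.
    """
    adjacency = {}
    for (u, v), cost in links.items():
        adjacency.setdefault(u, []).append((v, cost))
    table = {}
    heap = [(0, source, None)]
    while heap:
        dist, node, first_hop = heapq.heappop(heap)
        if node in table:
            continue  # a shorter path to this node was already settled
        table[node] = (dist, first_hop)
        for nbr, cost in adjacency.get(node, []):
            if nbr not in table:
                hop = nbr if node == source else first_hop
                heapq.heappush(heap, (dist + cost, nbr, hop))
    return table


def bidirectional(costs):
    """Expand {(u, v): cost} into links in both directions with the same cost."""
    links = {}
    for (u, v), cost in costs.items():
        links[(u, v)] = cost
        links[(v, u)] = cost
    return links


before = shortest_paths(bidirectional({("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 5}), "A")
after = shortest_paths(bidirectional({("A", "B"): 10, ("B", "C"): 1, ("A", "C"): 5}), "A")
print(before["C"], after["C"], after["B"])
```

Replaying such a computation for each topology change record is one way to observe how an end-to-end path moved in response to a sequence of network events.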

11.2.5 MPLS
Recall that MPLS has been deployed widely in service provider networks over the
last few years. It has played a key role in evolving the best-effort service model of
IP networks by enabling traffic engineering (TE), fast reroute (FRR), and class of
service (CoS) differentiation. In addition, MPLS has also allowed providers to offer
value-added services such as VPN and VPLS.
Unlike traditional unicast forwarding in IP networks, where routers match the destination IP address to the longest matching prefix, MPLS uses a label switching
paradigm. Each (IP) packet is encapsulated in an MPLS header, which contains,
among other things, the label that is used by a router to determine the outgoing
interface. The value of the label changes at every hop. Thus, while determining
the outgoing interface, the router also determines the label with which it replaces
the incoming label of the packet. This means that a router running MPLS has to
maintain an LFIB (Label Forwarding Information Base), which contains a mapping
from each incoming label to an (outgoing interface, outgoing label) pair. The sequence
of routers an MPLS packet follows is known as an LSP (Label Switched Path). The
first router along the LSP encapsulates a packet into an MPLS header, while the last
router removes the MPLS header and forwards the resulting packet based on the
underlying header.
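Label swapping along an LSP can be sketched with one dictionary per router. The router names, interface names, and label values below are hypothetical, and the sketch ignores label stacks and the other fields of the MPLS header. The ingress router R1 is assumed to have pushed label 17 and sent the packet to R2.

```python
POP = None  # marker: the egress router removes the MPLS header

# LFIB per router: incoming label -> (outgoing interface, outgoing label), or POP.
lfibs = {
    "R2": {17: ("if-to-R3", 42)},
    "R3": {42: ("if-to-R4", 99)},
    "R4": {99: POP},
}

# Which router sits at the far end of each outgoing interface (assumed wiring).
neighbor = {("R2", "if-to-R3"): "R3", ("R3", "if-to-R4"): "R4"}


def follow_lsp(router, label):
    """Return the (router, incoming label) pairs seen along the LSP."""
    hops = [(router, label)]
    while lfibs[router][label] is not POP:
        out_if, out_label = lfibs[router][label]
        router, label = neighbor[(router, out_if)], out_label
        hops.append((router, label))
    return hops


lsp = follow_lsp("R2", 17)
print(lsp)
```

Each lookup is an exact match on the incoming label, in contrast to the longest-prefix match used for unicast IP forwarding.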
The LFIB used for MPLS switching is populated by its control plane. This is done
by creating and distributing mappings between labels and FECs (Forwarding
Equivalence Classes). An FEC is defined as a set of packets that need to receive the
same forwarding treatment inside an MPLS network. A router running MPLS first
generates a unique label for each FEC it supports, and uses one of the control plane
protocols to distribute the label-FEC mappings to other routers. The dissemination
of this information allows each router to determine incoming and outgoing labels
and outgoing interface for each FEC, and thereby populate its LFIB.
MPLS currently uses three routing protocols for distributing label-FEC mappings: LDP (Label Distribution Protocol) [37], RSVP-TE (Resource reSerVation
Protocol with Traffic Engineering extensions) [38], and BGP [39, 40]. With LDP, a router exchanges label-FEC mappings with each of its neighbors using a persistent session. FECs, in the case of LDP,
are generally IP prefixes. The labels learned from the neighbors allow the router to
determine the mapping between incoming and outgoing labels. To determine the outgoing interface, LDP relies on the IGP (such as OSPF or IS-IS) running in the
underlying IP network. Thus, LSPs created by LDP follow the paths calculated by
the IGP from the source router to the destination prefix. RSVP-TE, on the other hand, is used for
“explicitly” created and routed LSPs between two end points; the path need not
follow the IGP path. The first router of the LSP initiates path setup by sending an
RSVP message. The message propagates along the (to be established) LSP to the
last router. Every intermediate router processes the message, creating an entry in
its LFIB for the LSP. RSVP also allows reservation of bandwidth along the LSP,
making it ideal for TE and CoS routing. Finally, BGP is used for distributing prefix
to label mappings (mostly) in the context of VPN services. With VPNs, different

customers of a VPN service provider can use overlapping IP address blocks, and
BGP-distributed label to prefix mapping allows a provider’s egress edge router to
determine which customer a given packet belongs to.
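For the LDP case described above, the way a router combines its own label bindings, the bindings learned from neighbors, and the IGP next hop to populate its LFIB can be sketched as follows. The function name, prefixes, labels, and interface names are hypothetical, and the sketch is a simplification that ignores label retention modes and session handling.

```python
def build_lfib(local_bindings, neighbor_bindings, igp_next_hop):
    """Populate an LFIB from LDP-style bindings.

    local_bindings:    FEC -> label this router advertised (its incoming label)
    neighbor_bindings: neighbor -> {FEC: label advertised by that neighbor}
    igp_next_hop:      FEC -> (next-hop neighbor, outgoing interface) from the IGP
    """
    lfib = {}
    for fec, in_label in local_bindings.items():
        if fec not in igp_next_hop:
            continue  # no IGP route for this FEC, so no entry can be built
        nbr, out_if = igp_next_hop[fec]
        out_label = neighbor_bindings.get(nbr, {}).get(fec)
        if out_label is not None:
            lfib[in_label] = (out_if, out_label)
    return lfib


local_bindings = {"10.1.0.0/16": 17, "10.2.0.0/16": 18}
neighbor_bindings = {
    "R3": {"10.1.0.0/16": 42, "10.2.0.0/16": 43},
    "R5": {"10.1.0.0/16": 77, "10.2.0.0/16": 78},
}
igp_next_hop = {
    "10.1.0.0/16": ("R3", "if-to-R3"),
    "10.2.0.0/16": ("R5", "if-to-R5"),
}

lfib = build_lfib(local_bindings, neighbor_bindings, igp_next_hop)
print(lfib)
```

Only the label advertised by the IGP next hop is used for each FEC, which is why LSPs built this way follow the IGP path.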
The flow of control messages through individual routers running LDP and RSVP-TE can be modeled in the same manner as traditional unicast routing protocols, as
shown in Fig. 11.4. Thus, to monitor these protocols, one needs to collect incoming
messages, outgoing messages, and changes occurring to the LFIB at every router. As
a result, the various techniques described in Section 11.2.2 for data collection apply to
these protocols as well. One caveat applies to RSVP, though, since it does not have a
notion of a protocol session. Given this, it is not possible to collect information about
RSVP messages through a session with an RSVP router. Collecting information
about RSVP dynamics thus requires some mechanism for routers to send messages
to a monitoring session when tunnels are set up and torn down – SNMP traps defined
in RFC 3812 [41] provide such a capability.
Once routing data is collected from LDP or RSVP routers, it can be used in
a fashion similar to that described in Section 11.2.3. For example, knowing the label binding
messages sent by LDP routers allows an operator to know if LSPs are established
correctly or not. As another example, knowing the size of an LFIB (i.e., the number
of LSPs traversing a router) and how it is evolving can be a key parameter in capacity
planning.

11.3 Multicast
Throughout its relatively brief but rapidly evolving history, the Internet has primarily
provided unicast service. A datagram is sent from a single sender to a single receiver,
where each endpoint is identified by an IP address. Many applications, however,
involve communication between more than two entities, and often the same data
needs to be delivered to multiple recipients. As examples, software updates may
be distributed from a single server to multiple recipients, and streaming content,
such as live video, may be transmitted to many receivers simultaneously. When the
network layer only supports one-to-one communication, it is the responsibility of
the end systems to replicate data and transmit multiple copies of the same packet.
This solution is inefficient both with respect to processing overhead at the sender
and bandwidth utilization within the network.
Multicast [42], on the other hand, presents an efficient mechanism for network
delivery of the same content to multiple destinations. In IP multicast, the sender
transmits a single copy of a packet into the network. The network layer replicates
the packet at appropriate routers in the network such that copies are delivered to
all interested receivers and at most one copy of the packet traverses any network
link. Multicast is built around the notion of a multicast group, which is a 32-bit identifier taken from the Class D portion of the IP address space (224.0.0.0 –
239.255.255.255). In multicast packets, the group address is contained in the destination IP address field in the header. Receivers make known their interest in

receiving packets sent to the group address via a group membership protocol such as
IGMP [43], and multicast routing protocols enable multicast packets to be delivered
to the interested receivers.
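As a small illustration of the group address range given above, the following sketch tests whether an IPv4 address lies in the Class D range, which corresponds to the prefix 224.0.0.0/4. The function name is hypothetical; the calls are from Python's standard `ipaddress` module.

```python
import ipaddress

# The Class D range 224.0.0.0 - 239.255.255.255 is the prefix 224.0.0.0/4.
CLASS_D = ipaddress.ip_network("224.0.0.0/4")


def is_multicast_group(addr):
    """Return True if the IPv4 address string is a multicast group address."""
    return ipaddress.ip_address(addr) in CLASS_D


for addr in ["224.0.0.1", "239.255.255.255", "240.0.0.0", "10.1.1.1"]:
    print(addr, is_multicast_group(addr))
```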
Multicast was first proposed in the 1980s and was deployed on an experimental
basis in the early 1990s. This early deployment, known as the MBone [44] (for Multicast Backbone), consisted of areas of the Internet in which multicast was deployed.
These areas were connected together using IP-in-IP tunnels enabling multicast packets to traverse unicast-only portions of the Internet. The predominant applications
used in the MBone, videoconferencing and video broadcast, primarily supported
small group collaboration and broadcast of technical meetings and conferences.
After rapid initial growth, the MBone peaked and then began to flounder. The
technology, while initially promising, did not find its way into service provider
networks. Several reasons have been given for this. These include the lack of a
clear business model (i.e., who would be charged for packets that are replicated
and delivered to many receivers), security concerns (i.e., the original any-to-any IP
multicast service model allowed any host in the network to transmit packets to a
multicast group), and concerns about manageability (i.e., lack of tools to monitor,
troubleshoot and debug this new technology).
More recently, deployment of network layer multicast service within IP networks
has been increasing. This deployment has occurred primarily in enterprise networks,
in which some of the earlier concerns with multicast (e.g., security, business model)
are more easily mitigated. Common multicast applications in enterprise networks include software distribution and dissemination of financial trading information. The
deployment of multicast within enterprise networks has also driven deployment in
service provider networks in order to support the needs of Virtual Private Network
(VPN) customers who use multicast in their networks. The Multicast VPN solution
defined for the Internet [45, 46] requires customer multicast traffic to be encapsulated in a second instance of IP multicast for transport across the service provider
backbone. Finally, the widespread deployment of IPTV, an application that benefits
greatly from multicast service, is creating further growth of IP multicast.
Forwarding multicast packets within a network makes use of a separate FIB from
the unicast FIB and depends on a new set of routing protocols to create and maintain
these FIB entries. As such, the set of tools used to monitor unicast routing cannot be
used. In this section, we review the basics of multicast routing, identify issues that
make monitoring and managing multicast more difficult than monitoring unicast
routing, and finally describe tools and strategies for monitoring this technology.

11.3.1 Multicast Routing Protocols
A multicast FIB entry is indexed by a multicast group and a source specification,
where the latter consists of an address and mask. Packets that match the group address and source specification will be routed according to the FIB entry. The FIB
entry itself contains an incoming interface over which packets matching the source

and group are expected to arrive, and a set of zero or more outgoing interfaces over
which copies of the packets should be transmitted. The union of FIB entries pertaining to the same group and source(s) across all routers forms a tree, denoting the set of
links over which a packet is forwarded to reach the set of interested receivers. It is
the job of multicast routing protocols to establish the appropriate FIB entries in the
routers and thereby form this multicast tree.
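The structure of a multicast FIB entry and its use in forwarding can be sketched as follows. The class and function names, addresses, and interface names are hypothetical. Two behaviors in the sketch are assumptions not stated in the text: when several entries match, the one with the longest source mask is preferred, and a packet arriving on an interface other than the expected incoming interface is not forwarded.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class MfibEntry:
    group: str          # multicast group address
    source_spec: str    # source address/mask, e.g. "0.0.0.0/0" matches all sources
    iif: str            # interface on which matching packets are expected to arrive
    oifs: set = field(default_factory=set)  # zero or more outgoing interfaces


def lookup(entries, src, group):
    """Return the matching entry with the longest source mask, or None."""
    best, best_len = None, -1
    for entry in entries:
        net = ipaddress.ip_network(entry.source_spec)
        if entry.group == group and ipaddress.ip_address(src) in net:
            if net.prefixlen > best_len:
                best, best_len = entry, net.prefixlen
    return best


def forward(entries, src, group, arrival_if):
    """Return the set of interfaces on which copies of the packet are sent."""
    entry = lookup(entries, src, group)
    if entry is None or arrival_if != entry.iif:
        return set()
    return set(entry.oifs)


entries = [
    MfibEntry("239.1.1.1", "0.0.0.0/0", "if-up-1", {"eth1", "eth2"}),
    MfibEntry("239.1.1.1", "192.0.2.10/32", "if-up-2", {"eth1"}),
]
print(forward(entries, "192.0.2.10", "239.1.1.1", "if-up-2"))
print(forward(entries, "198.51.100.7", "239.1.1.1", "if-up-1"))
```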
Over the last two decades, several multicast routing protocols have been proposed and in some cases implemented and deployed. These include DVMRP [47],
MOSPF [48], CBT [49], MSDP [50], and PIM [51, 52]. In this section, we give an
overview of PIM and MSDP as they are the most widely deployed multicast routing
protocols.

11.3.1.1 PIM
Protocol Independent Multicast, or PIM, is the dominant multicast routing protocol
deployed in IP networks. PIM does not exchange reachability information in the
sense that unicast routing protocols, such as OSPF and BGP, do. Rather, it leverages information in the unicast FIB in order to construct multicast trees, and it is
agnostic as to the source of the unicast routing information. There are multiple variants of PIM, including PIM Sparse Mode (PIM-SM), PIM Dense Mode (PIM-DM),
Source Specific PIM (PIM-SSM), and Bidirectional PIM (PIM-Bidir). In this section, we present a brief overview of the basic operation of PIM-SM and PIM-SSM,
as they are the most commonly deployed variants of PIM, in order to motivate the
challenges in multicast monitoring and their solutions.
Before turning to PIM we discuss one key aspect of multicast trees and the protocols that construct them. Multicast trees can be classified as shared trees or source
trees. A shared tree is one that is used to forward packets from multiple sources.
In this case, the multicast routing entry is denoted by a group and a set of sources
(e.g., using an address and a mask). For a shared tree, the set of sources usually
includes all sources, and the routing table entry is denoted by the (*, G) pair, where
G denotes the multicast group address and ‘*’ denotes a wildcard (indicating all
sources). A source tree, on the other hand, is used to forward packets from a single
source, and is denoted as (S, G), where G again refers to the multicast group and S
refers to a single source.
PIM-SM uses both shared and source trees, depending on both the variant and
how it is configured. In both cases, multicast trees are constructed by sending Join
messages from the leaves of the tree (the routers that are directly connected to hosts
that want to receive packets transmitted to the multicast group) toward the root of
the tree. In the case of a source tree, the root is a source that transmits data to the
multicast group and the Join message is referred to as an (S, G) Join. For a shared
tree, the root is a special node referred to as a Rendezvous Point, or RP, and the Join
message is referred to as a (*, G) Join. The RP for a group, which can be configured

statically at each router or determined by a dynamic protocol such as BSR [53], must
be agreed upon by all routers in a PIM domain.9
PIM Join messages are transmitted hop-by-hop toward the root of the tree. At
each router, the next hop is determined using the unicast FIB. Specifically, the Join
message is transmitted to the next hop on the best route (as determined by the unicast
routing table) toward the root (i.e., source or RP). As such, the Join message follows
the shortest path from the receiver to the root of the tree. At each hop, the router
keeps track of the neighbor router from which the Join message was received and
the neighbor router to which it was forwarded. The latter is denoted as the upstream
neighbor in the multicast FIB and the former is denoted as a downstream neighbor.
When subsequent multicast data packets are received from the upstream neighbor,
they will be forwarded to the downstream neighbor.
When a router receives a subsequent (*, G) or (S, G) Join message for a FIB entry that already exists, the router from which the Join message is received is added
to the list of downstream neighbors. However, the Join message need not be forwarded upstream as a Join message will have already been forwarded toward the
root of the tree. In this way, Join messages from multiple downstream neighbors
are merged, and when data packets are received, they will be replicated with a copy
forwarded to each downstream neighbor. PIM uses soft state, so that Join messages
are retransmitted hop-by-hop periodically, and state that is not refreshed is deleted
when an appropriate timer expires.
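The hop-by-hop propagation and merging of Join messages described above can be sketched for a single group. The topology (leaf routers R2 and R3 behind R1, with R1 adjacent to the RP), the `unicast_next_hop` table, and the function name are assumptions made for illustration. Soft-state refresh and timers are omitted, and hosts attached to leaf routers are not modeled.

```python
# Next hop toward the root as given by the unicast FIB (assumed topology).
unicast_next_hop = {("R2", "RP"): "R1", ("R3", "RP"): "R1", ("R1", "RP"): "RP"}

state = {}   # router -> {"upstream": neighbor or None, "downstream": set of neighbors}
sent = []    # (from_router, to_router) for every Join message actually transmitted


def receive_join(router, root, from_neighbor=None):
    """Process a Join at a router, forwarding it upstream only if state is new."""
    entry = state.get(router)
    is_new = entry is None
    if is_new:
        upstream = None if router == root else unicast_next_hop[(router, root)]
        entry = state[router] = {"upstream": upstream, "downstream": set()}
    if from_neighbor is not None:
        entry["downstream"].add(from_neighbor)
    if is_new and entry["upstream"] is not None:
        sent.append((router, entry["upstream"]))
        receive_join(entry["upstream"], root, from_neighbor=router)


receive_join("R2", "RP")   # R2 joins: the Join travels R2 -> R1 -> RP
receive_join("R3", "RP")   # R3 joins: R1 already has state, so it does not forward
print(sent)
print(state["R1"])
```

In this sketch R1 ends up with two downstream neighbors but sends only one Join toward the RP, which is the merging behavior described above.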
In PIM-SM, all communication begins on a shared tree. Last hop routers transmit
Join messages toward the RP, forming a shared tree with the RP at the root and last
hop routers as leaves. This process is depicted in Steps 1–3 in Fig. 11.7a, in which
router R2 transmits a Join message toward the RP. This message is then forwarded
by R1 to the RP. R3 subsequently transmits a Join message toward the RP, which is
received by R1 and not forwarded further. When a source wants to transmit packets
to the group, it encapsulates these packets in PIM Register messages transmitted
using unicast to the RP. The RP decapsulates these packets and transmits them on
the shared tree, so that they are delivered to all routers that joined the tree. The RP
then sends an (S, G) Join message toward the source, building a source tree from
the source to the RP. Steps 4–5 in Fig. 11.7a depict a Register message from a source
S to the RP followed by a subsequent Join from the RP to S. Once this source tree
is established, packets are sent using native multicast from the source to the RP and
from the RP to the leaf routers, as shown in Fig. 11.7b. When multiple sources have
data to send to the multicast group, each will send PIM Register messages to the RP,
which in turn will send PIM Join messages to the sources, thereby creating multiple
(S, G) trees.
While all communication in PIM-SM begins on shared trees, the protocol allows
for the use of source trees. Specifically, when a last hop router receives packets from
a source, it has the option to switch to a source tree for that source. It does this by

9

A PIM domain is defined as a contiguous set of routers all configured to operate within a common
boundary. All routers in the domain must map a group address to the same RP.

Fig. 11.7 Example PIM Operation: (a) Sequence of control messages for shared tree creation.
(b) Resulting flow of data packets. (c) Sequence of control messages for switchover to source tree.
(d) Resulting flow of data packets. Panel titles: (a) Shared Tree Creation, (b) Shared Tree Data Flow, (c) Source Tree Creation, (d) Source Tree Data Flow. Nodes shown: S, RP, R1, R2, and R3.

sending an (S, G) Join toward the source, joining the source tree (just as the RP did
in the description above). Once it has received packets on the source tree, it then
sends a Prune message for the source on the shared tree, indicating that it no longer
wants to receive packets from that source on the shared tree. The Join messages
needed to switch from the shared to source tree are shown in Fig. 11.7c, and the
resulting flow of data packets is shown in Fig. 11.7d. Source trees allow for more
efficient paths from the source to receiver(s) at the expense of higher protocol and
state overhead.
PIM-SSM (Source Specific Multicast) does away with the need for RPs, thereby
simplifying multicast tree construction and maintenance while using a subset of

the PIM-SM protocol mechanisms. PIM-SSM only uses source trees. The source of
traffic is known to hosts interested in joining the multicast group (e.g., via an out-of-band mechanism). These receivers signal their interest in the group via IGMP, and
their directly connected routers send (S, G) Join messages toward the source,
thereby building a source tree rooted at the sender.

11.3.1.2 MSDP
In PIM-SM, there is a single RP that acts as the root of a shared tree for a given
multicast group. (Note that a single router may act as an RP for many groups.)
This provides a mechanism for rendezvous and subsequent communication between
sources and receivers without either having any pre-existing knowledge of the other.
However, there are two situations in which multiple RPs for a group may be desirable. The first involves multicast communication between domains. Specifically,
two or more service providers may wish to enable multicast communication between
them. If there is only a single RP for a group, failure of the RP in one provider’s
network may impact service in the other’s network, even if all of the sources and
receivers are located in the latter’s network. Service providers may not be willing
to depend on a critical resource (e.g., the RP) located in another service provider’s
network for what may be purely intradomain communication. Further, even without
RP outages, performance may be suboptimal if purely intradomain communication
is required to follow interdomain paths. That is, a multicast tree between senders
and receivers in one ISP’s network may traverse another ISP’s network. Thus, each
provider may wish to have an RP located within its own domain.
The second situation in which multiple RPs may be useful involves communication within a single PIM domain. Specifically, redundant RPs provide a measure
of robustness, and this can be implemented using IP anycast [54]. Each RP is configured with the same IP address, and the RP mapping mechanism identifies this
anycast address as the RP address. Each router wishing to join a shared tree sends
a (*, G) Join message toward the RP address. By virtue of anycast routing, which
uses unicast routing to route the message to the “closest” RP, the router will join a
shared tree rooted at a nearby RP. As a result, multiple disjoint shared trees will
be formed within the domain. Similarly, when a source transmits a PIM Register message to an anycast RP address, this message will only reach the nearest
RP. As such, sources and receivers will only communicate with those subsets of
routers closest to the same RP, and the required multicast connectivity will not be
achieved.
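
The partitioning effect described above can be illustrated with a minimal sketch. It assumes a hypothetical six-router chain with anycast RPs at both ends, and it approximates "closest" by hop count. Real unicast routing would use IGP metrics, and the router names and helper functions here are illustrative only.

```python
from collections import deque

def hop_counts(graph, start):
    """Breadth-first search: hop count from start to every reachable router."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def nearest_rp(graph, rps):
    """Map each router to the anycast RP with the fewest hops (assumed metric)."""
    dists = {rp: hop_counts(graph, rp) for rp in rps}
    assignment = {}
    for router in graph:
        # Ties are broken by RP name, purely to keep the sketch deterministic.
        assignment[router] = min(
            (rp for rp in rps if router in dists[rp]),
            key=lambda rp: (dists[rp][router], rp),
        )
    return assignment

# Hypothetical chain A - B - C - D - E - F with anycast RPs at A and F.
topology = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D", "F"], "F": ["E"],
}
assignment = nearest_rp(topology, ["A", "F"])
for router in sorted(assignment):
    print(router, "->", assignment[router])
```

In this sketch a source attached to router B registers with the RP at A, while a receiver behind router E joins the shared tree rooted at F, so without an additional mechanism the two never meet.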
The problem of enabling multicast communication when multiple RPs exist for
the same group (whether within or between domains) is solved by the Multicast
Source Discovery Protocol (MSDP) [50]. MSDP enables multicast communication
between different PIM-SM domains (e.g., operated by different service providers) as
well as within a PIM-SM domain using multiple anycast RPs. MSDP-speaking RPs
form peering relationships with each other to inform each other of active sources.
Upon learning about an active source for a group for which there are interested
receivers, an RP joins the source tree of that source so that it can receive packets
from the source and transmit them within its own domain or on its own shared tree.
We give a brief overview of MSDP. Each RP forms an MSDP peering relationship with one or more other RPs using a TCP connection. These MSDP connections
form a virtual topology among the various RPs. RPs share information about sources
as follows. For each source from which it receives a PIM Register message, an RP
transmits an MSDP Source-Active (SA) message to its MSDP peers. This SA message, which identifies a source and the group to which it is sending, is flooded across
the MSDP virtual topology so that it is received by all other MSDP-speaking RP
routers.
Upon receipt of an SA message, an RP (in addition to flooding the message to its
other MSDP peers) determines whether there are interested receivers in its domain.
Specifically, if the RP has previously received a Join message for the shared tree
indicated by the group in the SA message, the RP will transmit a PIM Join message
to the source. In this way, the RP joins the source tree rooted at the source in question, receives multicast packets from it, and multicasts these packets on the shared
tree rooted at the RP. Thus, multicast communication is enabled when multiple RPs
exist for the same group, whether within or across domains.
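
The SA flooding and join decision described above can be sketched as follows. This is a simplified model under stated assumptions: the peering topology, addresses, and function name are hypothetical, and duplicate suppression uses a visited set, whereas actual MSDP implementations apply peer-RPF forwarding rules to SA messages.

```python
from collections import deque

def flood_source_active(peers, shared_tree_groups, origin_rp, source, group):
    """Flood an SA for (source, group) from origin_rp across MSDP peerings.

    peers maps each RP to its MSDP peers; shared_tree_groups maps each RP to
    the groups for which it holds (*, G) state. Returns the set of RPs that
    saw the SA and the set of RPs that would join the source tree for source.
    """
    reached = {origin_rp}
    joining = set()
    queue = deque([origin_rp])
    while queue:
        rp = queue.popleft()
        for peer in peers[rp]:
            if peer in reached:
                continue  # simplification: real MSDP uses peer-RPF checks
            reached.add(peer)
            # (*, G) state at the peer indicates interested receivers.
            if group in shared_tree_groups.get(peer, set()):
                joining.add(peer)
            queue.append(peer)
    return reached, joining

# Hypothetical peering chain RP1 - RP2 - RP3.
peers = {"RP1": ["RP2"], "RP2": ["RP1", "RP3"], "RP3": ["RP2"]}
shared_tree_groups = {"RP2": {"239.2.2.2"}, "RP3": {"239.1.1.1"}}

reached, joining = flood_source_active(
    peers, shared_tree_groups, "RP1", "10.0.0.5", "239.1.1.1"
)
print(sorted(reached), sorted(joining))
```

Every RP receives the SA, but only RP3, which has receivers for the group, would send a Join toward the source.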

11.3.2 Challenges in Monitoring Multicast
In the early days of multicast, one of the often-cited reasons for its slow deployment
was the difficulty of monitoring and managing the service; commercial routers implemented the protocols, but network operators had little way of knowing how the
service was working when they deployed it. While this was by no means the only
impediment to its deployment, it did present a significant challenge to network operators. To some degree, the problems cited early on with multicast management
remain true today. Before turning to specific tools and techniques used to monitor
and manage multicast in order to provide a stable and reliable network service, we
identify some of the generic challenges for managing the technology, while deferring some of the protocol-specific issues to Section 11.3.3.
While multicast is by no means a new technology, it is not yet mature. Because
it has only been deployed in a significant way in the last few years, the depth of
experience and knowledge that surrounds unicast service does not yet exist for multicast.
This manifests itself in two related ways. First, engineers and operators in many
cases are unfamiliar with the technology and face a steep learning curve in troubleshooting and monitoring multicast. Second, because deployment experience is limited,
the kinds of tools that have evolved in the unicast world and that have
been essential in route monitoring do not yet exist for multicast.
Putting aside the relative newness of the technology, there are aspects of multicast
that make it inherently more challenging to manage than unicast. Most obviously,
the nature of what constitutes a route followed by a packet has changed. In unicast
routing, the path taken by a packet from source to destination consists of a sequence
of routers (usually no more than 20 or 30). This path is easily identifiable (e.g., using
tools such as traceroute) and can be presented to a network operator in a way that
is easy to understand. In multicast routing, a packet no longer traverses an ordered
sequence of routers, but rather follows a tree of routers from a source to multiple
destinations. The tree can be very large, consisting of hundreds of routers. Identifying the tree becomes more challenging, and perhaps more significantly, presenting
it to a network operator in a useful manner is difficult.
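
As a small illustration of the difference between a path and a tree, the sketch below merges per-receiver router paths into a single tree and renders it as indented text. The paths and router names are hypothetical, and the functions are illustrative rather than part of any existing tool. Even this simple rendering becomes unwieldy once a tree spans hundreds of routers.

```python
def build_tree(paths):
    """Merge source-to-receiver router paths into a map of parent -> children."""
    children = {}
    for path in paths:
        for parent, child in zip(path, path[1:]):
            children.setdefault(parent, set()).add(child)
    return {parent: sorted(kids) for parent, kids in children.items()}

def render(children, root, depth=0):
    """Return the tree as a list of lines, indented two spaces per level."""
    lines = ["  " * depth + root]
    for child in children.get(root, []):
        lines.extend(render(children, child, depth + 1))
    return lines

# Hypothetical paths from source S to three last-hop routers.
paths = [
    ["S", "R1", "R2", "R4"],
    ["S", "R1", "R2", "R5"],
    ["S", "R1", "R3", "R6"],
]
tree = build_tree(paths)
print("\n".join(render(tree, "S")))
```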
In addition to being large, multicast trees are not static. That is, they are driven
by application behavior, and the set of senders and receivers may change during
the lifetime of an application. As such, branches may be added to and pruned from
multicast trees over time, and these changes can happen on short timescales. Thus,
understanding the state of multicast is made more difficult by the dynamic nature of
the multicast trees.
Finally, the multicast routing state used to forward a packet from a source to a
set of receivers can be data driven. That is, the state may not be instantiated until an
application starts sending traffic or expresses interest in receiving it. In contrast, with
unicast routing, the FIB entries used to route a packet from a source to a destination
are independent of the existence of application traffic. Thus, routing table entries can
be queried (either directly with SNMP or indirectly with a utility like traceroute) in
order to discover or verify a route. With multicast the analogous routing state may
not exist until applications are started. Using PIM-SM as an example, the shared tree
from the RP to receivers is formed as a result of receivers joining a multicast group.
Similarly, the state needed to route a packet from a source to the RP is not created
until the source sends a PIM Register message to the RP and the RP subsequently
sends an (S, G) Join to the source. Given this, answering a question such as “how
would packets be routed from the source to receivers,” as one might want to do in
advance of a streaming broadcast, is problematic.
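
The data-driven nature of this state can be shown with a toy model. The sketch below collapses the state held across routers into a single table and follows the sequence of events given in the text. The class, its method names, and the addresses are assumptions made for illustration and do not correspond to any router API.

```python
class ToyPimSmState:
    """Toy model of multicast state that exists only after protocol events."""

    def __init__(self):
        self.entries = set()  # keys are ("*", group) or (source, group)

    def receiver_joins(self, group):
        # A receiver joining the group creates (*, G) shared-tree state.
        self.entries.add(("*", group))

    def source_registers(self, source, group):
        # Per the text: the Register reaches the RP, which then sends an
        # (S, G) Join, creating source-specific state.
        self.entries.add((source, group))

    def lookup(self, source, group):
        """Return the matching entry, preferring (S, G) over (*, G), else None."""
        for key in ((source, group), ("*", group)):
            if key in self.entries:
                return key
        return None

state = ToyPimSmState()
before = state.lookup("10.0.0.5", "239.1.1.1")
state.receiver_joins("239.1.1.1")
after_join = state.lookup("10.0.0.5", "239.1.1.1")
state.source_registers("10.0.0.5", "239.1.1.1")
after_register = state.lookup("10.0.0.5", "239.1.1.1")
print(before, after_join, after_register)
```

A query issued before any application activity finds nothing to inspect, which is the difficulty an operator faces when trying to verify a route ahead of a scheduled broadcast.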
Given the inherent difficulties in monitoring and managing multicast routing,
there exists a need for new tools, methods and capabilities to assist in this process.
We now turn to the challenges of monitoring specific protocols and the ways in
which these challenges can be met.

11.3.3 Multicast Route Monitoring
Multicast routing involves complex protocols. In order to understand, troubleshoot
and debug the state of multicast in a network, operators need to be able to answer
several key questions. These include:
• What is the FIB entry for a particular source and group at a router?
• What is the multicast tree for an (S, G) or (*, G) pair?
• What route will a packet take from a source to one or more receivers? (As will be explained below, this question differs subtly from the preceding one.)
• Are multicast trees stable or dynamic?
• Are packets transmitted by source S to group G being received where they should be?
• Is multicast routing properly configured in the network?

Answering these and other questions about multicast requires a new set of management tools and capabilities. In this section, we describe how monitoring tools can
be used to answer these questions. Before doing so, we briefly review the network
management capabilities developed during earlier experiences with multicast.

11.3.3.1 Early MBone Tools
The MBone grew from a few dozen subnets in 1992 to over 3,000 four years
later [55]. At its inception, it connected a small community of collaborating researchers, but it expanded to include a much broader set of users and applications. It
was initially maintained by a few people who knew administrators at all the participating sites. Therefore, monitoring and debugging of the infrastructure developed
in an ad hoc manner.
As the MBone grew, it faced an increasing set of management challenges. To
meet these challenges, the researchers who managed and used it developed a broad
set of tools. While we avoid an exhaustive review of these tools, we give a few
representative examples here, which encompass both application-layer and network-layer
tools.
• mrinfo discovered the multicast topology by querying multicast routers for their neighbors.
• mtrace was used to discover the path packets traversed to reach a receiver from a source.
• rtpmon was an application-level monitoring tool that provided end-to-end performance measurements for a multicast group.
• The DVMRP Route Monitor [56] monitored routing exchanges between multicast routers in the MBone.
The tools mentioned here, and the many others that were developed (see [57, 58]
for a more complete list), provided great value to the early MBone users. They addressed real problems and allowed operators and users to understand, monitor, and
troubleshoot the experimental network. While in many cases they provided insight
and lessons that inform current efforts, they cannot form the basis for
a current multicast management strategy. Many of the tools use RTCP and monitor application performance. Others were built specifically to monitor mrouted, the
public domain multicast routing daemon used in the early MBone. Neither class of tool
supports the needs of large ISPs to monitor their multicast infrastructure. Instead, today’s multicast management and monitoring strategy must be built around tools that
work in the context of the multi-vendor commercial routers managed by the ISPs.

11.3.3.2 Information Sources
While the earlier experience with the MBone provided some valuable insight into
the challenges of managing multicast, it also showed the need for tools that
work with commercial routers and can be deployed by service providers
at scale. Such tools must work within the confines of the capabilities available on the
routers that support multicast. In this section, we discuss the options for gathering information
about multicast in order to motivate the kinds of solutions described
later.
As described in Section 11.2.3, route monitors provide enormous capability
with respect to monitoring unicast routing. BGP monitors peer with BGP speaking routers to collect routing updates and thereby monitor network reachability and
s