User Guide Network In R
Use R!

Douglas A. Luke
A User’s Guide to Network Analysis in R

Use R!
Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani
More information about this series at http://www.springer.com/series/6991

Use R!
Albert: Bayesian Computation with R (2nd ed. 2009)
Bivand/Pebesma/Gómez-Rubio: Applied Spatial Data Analysis with R (2nd ed. 2013)
Cook/Swayne: Interactive and Dynamic Graphics for Data Analysis: With R and GGobi
Hahne/Huber/Gentleman/Falcon: Bioconductor Case Studies
Paradis: Analysis of Phylogenetics and Evolution with R (2nd ed. 2012)
Pfaff: Analysis of Integrated and Cointegrated Time Series with R (2nd ed. 2008)
Sarkar: Lattice: Multivariate Data Visualization with R
Spector: Data Manipulation with R

Douglas A. Luke
Center for Public Health Systems Science
George Warren Brown School of Social Work
Washington University
St. Louis, MO, USA

ISSN 2197-5736
ISSN 2197-5744 (electronic)
Use R!
ISBN 978-3-319-23882-1
ISBN 978-3-319-23883-8 (eBook)
DOI 10.1007/978-3-319-23883-8
Library of Congress Control Number: 2015955739
Springer Cham Heidelberg New York Dordrecht London

© Springer International Publishing Switzerland 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

To my most important social network: Sue, Alina, and Andrew

Preface

In early 2000, Stephen Hawking said that “...the next century will be the century of complexity.” If his prediction is true, the implication is that we will need new scientific theories, data collection methods, and analytic techniques that are appropriate for the study of complex systems and behavior. Network science is one such approach that views the world through a network lens, where physical and social systems are made up of heterogeneous actors who are connected to one another through different types of relational ties. Network analysis is the set of analytic tools used to study these types of systems.

Over the past several decades network analysis has become an increasingly important part of the analytic toolbox for social, health, and physical scientists. Until recently, network analysis required specialized software, both for network data management and analyses. However, starting around 2000, network analytic tools became available in the R statistical programming environment. This not only made network analytic techniques more visible to the broader statistical community but also provided the breadth and power of R’s data management, graphic visualization, and general statistical modeling capabilities to the network analyst community.

As the title suggests, this book is a user’s guide to network analysis in R.
It provides a practical hands-on tour of the major network analytic tasks that can currently be done in R. The book concentrates on four primary tasks that a network analyst typically concerns herself with: network data management, network visualization, network description, and network modeling. The book includes all the R code that is used in the network analysis examples. It also comes with a set of network datasets that are used throughout the book. (See Chap. 1 for more details on the structure of the book, as well as instructions on how to obtain the network data.) The book is written for anybody who has an interest in doing network analysis in R. It can be used as a secondary text in a network science or analysis class or can simply serve as a reference for network techniques in R.

This book would not exist without the help, support, guidance, and mentoring I have received over the last 30 years from my own personal and professional social networks. In the mid-1980s I took a graduate network analysis class from Stan Wasserman at the University of Illinois in Champaign. I remember being excited about this new way to analyze data, but thought that I was not likely to ever use it in my career. However, my colleagues in psychology and public health encouraged me in my early work exploring how network analysis could answer important research and evaluation questions. These include Julian Rappaport, Ed Seidman, Bruce Rapkin, Kurt Ribisl, Sharon Homan, Ross Brownson, and Matt Kreuter. Whether they know it or not, I have been inspired and encouraged by an amazing group of network and systems scientists, including Tom Valente, Steve Borgatti, Martina Morris, Tom Snijders, Scott Leischow, Patty Mabry, Stephen Marcus, and Ross Hammond. My best network ideas have come from my friends and colleagues at the Center for Public Health Systems Science, particularly Bobbi Carothers, Amar Dhand, Chris Robichaux, and Nancy Mueller.
I am especially grateful to the students in my network analysis classes and workshops over the years; they have not only improved this book, but they have improved my thinking about network analysis. A very special thank you to Jenine Harris. Jenine was my first doctoral student; now I am inspired by the rigor and elegance of her own work in network science. I would also like to thank the Centers for Disease Control and Prevention, the National Institutes of Health, and the Missouri Foundation for Health for providing research and evaluation support that allowed me to develop and refine my approach to network analysis. Finally, my deepest thanks go to my family. They gave me specific suggestions about the content, provided me space and time to work hard on this book (including a crucial Father’s Day gift), and cheered me on when I most needed it. Thank you, Sue, Ali, and Andrew.

St. Louis, MO, USA
July, 2015
Douglas A. Luke

Contents

1 Introducing Network Analysis in R
  1.1 What Are Networks?
  1.2 What Is Network Analysis?
  1.3 Five Good Reasons to Do Network Analysis in R
    1.3.1 Scope of R
    1.3.2 Free and Open Nature of R
    1.3.3 Data and Project Management Capabilities of R
    1.3.4 Breadth of Network Packages in R
    1.3.5 Strength of Network Modeling in R
  1.4 Scope of Book and Resources
    1.4.1 Scope
    1.4.2 Book Roadmap
    1.4.3 Resources

Part I Network Analysis Fundamentals

2 The Network Analysis ‘Five-Number Summary’
  2.1 Network Analysis in R: Where to Start
  2.2 Preparation
  2.3 Simple Visualization
  2.4 Basic Description
    2.4.1 Size
    2.4.2 Density
    2.4.3 Components
    2.4.4 Diameter
  2.5 Clustering Coefficient

3 Network Data Management in R
  3.1 Network Data Concepts
    3.1.1 Network Data Structures
    3.1.2 Information Stored in Network Objects
  3.2 Creating and Managing Network Objects in R
    3.2.1 Creating a Network Object in statnet
    3.2.2 Managing Node and Tie Attributes
    3.2.3 Creating a Network Object in igraph
    3.2.4 Going Back and Forth Between statnet and igraph
  3.3 Importing Network Data
  3.4 Common Network Data Tasks
    3.4.1 Filtering Networks Based on Vertex or Edge Attribute Values
    3.4.2 Transforming a Directed Network to a Non-directed Network

Part II Visualization

4 Basic Network Plotting and Layout
  4.1 The Challenge of Network Visualization
  4.2 The Aesthetics of Network Layouts
  4.3 Basic Plotting Algorithms and Methods
    4.3.1 Finer Control Over Network Layout
    4.3.2 Network Graph Layouts Using igraph

5 Effective Network Graphic Design
  5.1 Basic Principles
  5.2 Design Elements
    5.2.1 Node Color
    5.2.2 Node Shape
    5.2.3 Node Size
    5.2.4 Node Label
    5.2.5 Edge Width
    5.2.6 Edge Color
    5.2.7 Edge Type
    5.2.8 Legends

6 Advanced Network Graphics
  6.1 Interactive Network Graphics
    6.1.1 Simple Interactive Networks in igraph
    6.1.2 Publishing Web-Based Interactive Network Diagrams
    6.1.3 Statnet Web: Interactive statnet with shiny
  6.2 Specialized Network Diagrams
    6.2.1 Arc Diagrams
    6.2.2 Chord Diagrams
    6.2.3 Heatmaps for Network Data
  6.3 Creating Network Diagrams with Other R Packages
    6.3.1 Network Diagrams with ggplot2

Part III Description and Analysis

7 Actor Prominence
  7.1 Introduction
  7.2 Centrality: Prominence for Undirected Networks
    7.2.1 Three Common Measures of Centrality
    7.2.2 Centrality Measures in R
    7.2.3 Centralization: Network Level Indices of Centrality
    7.2.4 Reporting Centrality
  7.3 Cutpoints and Bridges

8 Subgroups
  8.1 Introduction
  8.2 Social Cohesion
    8.2.1 Cliques
    8.2.2 k-Cores
  8.3 Community Detection
    8.3.1 Modularity
    8.3.2 Community Detection Algorithms

9 Affiliation Networks
  9.1 Defining Affiliation Networks
    9.1.1 Affiliations as 2-Mode Networks
    9.1.2 Bipartite Graphs
  9.2 Affiliation Network Basics
    9.2.1 Creating Affiliation Networks from Incidence Matrices
    9.2.2 Creating Affiliation Networks from Edge Lists
    9.2.3 Plotting Affiliation Networks
    9.2.4 Projections
  9.3 Example: Hollywood Actors as an Affiliation Network
    9.3.1 Analysis of Entire Hollywood Affiliation Network
    9.3.2 Analysis of the Actor and Movie Projections

Part IV Modeling

10 Random Network Models
  10.1 The Role of Network Models
  10.2 Models of Network Structure and Formation
    10.2.1 Erdős-Rényi Random Graph Model
    10.2.2 Small-World Model
    10.2.3 Scale-Free Models
  10.3 Comparing Random Models to Empirical Networks

11 Statistical Network Models
  11.1 Introduction
  11.2 Building Exponential Random Graph Models
    11.2.1 Building a Null Model
    11.2.2 Including Node Attributes
    11.2.3 Including Dyadic Predictors
    11.2.4 Including Relational Terms (Network Predictors)
    11.2.5 Including Local Structural Predictors (Dyad Dependency)
  11.3 Examining Exponential Random Graph Models
    11.3.1 Model Interpretation
    11.3.2 Model Fit
    11.3.3 Model Diagnostics
    11.3.4 Simulating Networks Based on Fit Model

12 Dynamic Network Models
  12.1 Introduction
    12.1.1 Dynamic Networks
    12.1.2 RSiena
  12.2 Data Preparation
  12.3 Model Specification and Estimation
    12.3.1 Specification of Model Effects
    12.3.2 Model Estimation
  12.4 Model Exploration
    12.4.1 Model Interpretation
    12.4.2 Goodness-of-Fit
    12.4.3 Model Simulations

13 Simulations
  13.1 Simulations of Network Dynamics
    13.1.1 Simulating Social Selection
    13.1.2 Simulating Social Influence

References

Chapter 1
Introducing Network Analysis in R

Begin at the beginning, the King said, very gravely, and go on till you come to the end: then stop. (Lewis Carroll, Alice in Wonderland)

1.1 What Are Networks?

This book is a user’s guide for conducting network analysis in the R statistical programming language. Networks are all around us. Humans naturally organize themselves in networked systems. Our families and friends form personal social networks around each of us. Neighborhoods and communities organize themselves in networked coalitions to advocate for change. Businesses work with (and against) each other in complex, interlocking networks of trade and financial partnerships. Public health is advanced through partnerships and coalitions of governmental and NGO organizations (Luke and Harris 2007). Nations are connected to one another through systems of migration, trade, and treaty obligations. Moreover, non-human networks exist almost anywhere you look. Our genes and proteins interact with one another through complex biological networks. The human brain is now viewed as a complex network, or ‘connectome’ (Sporns 2012). Similarly, human diseases and their underlying genetic roots are connected as a ‘diseasome’ (Barabási 2007). Animal species interact in many complex ways, one of which is a networked food-web that describes interactions in ‘who-eats-whom’ relationships. Information itself is networked. Our legal system is built on an interconnecting network of prior legal decisions and precedents. Social and scientific progress is driven by a diffusion of innovation process by which information is disseminated across connected social systems, whether they are Iowa corn farmers (Rogers 2003) or public health scientists (Harris and Luke 2009). It appears that one of the ways the universe is organized is with networks. So what is a network? Figures 1.1 and 1.2 present two examples of important and interesting social networks. Figure 1.1 presents the contact network of the 19 9–11 hijackers, based on the work of Valdis Krebs (2002). 
Every social network is made up of a set of actors (also called nodes) that are connected to one another via some type of social relationship (also called a tie). In the figure, nodes are the circles and the ties are the lines connecting some of the nodes. The network shows us that the hijackers had some contact with one another before September 11th, but the network is not very densely connected and there appears to be no prominent network member who is connected to all or even most of the other hijackers.

Fig. 1.1 Network of 9–11 hijackers (legend: AA11 (WTC North), AA77 (Pentagon), UA175 (WTC South), UA93 (Pennsylvania))

The second example in Fig. 1.2 is from a very different sort of social network. Here the nodes are members of the 2010 Netherlands FIFA World Cup team, who went on to lose in the final to Spain. The ties represent passes between the different players during the World Cup matches. The arrows show the directional pattern of the passes. We can see that the goalkeeper passed primarily to the defenders, and the forwards received passes primarily from the midfielders (except for #6, who appears to have a different passing pattern than the other two forwards).

These two examples may appear to have little in common. However, they both share a fundamental characteristic common to all social networks. The social patterns that are displayed in the network figures are not random. They reflect underlying social processes that can be explored using network science theories and methods. The terrorist network has no prominent leader and is not tightly interconnected because it makes the network harder to detect or disrupt. The pattern of passing ties in the soccer network reflects the assigned positions of the players, the rules of the game, and the strategies of the coach.
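
The node-and-tie vocabulary used for the two figures can be made concrete with a minimal base-R sketch. The four actors and their ties below are hypothetical and are not taken from the Krebs or FIFA data; the sketch only illustrates one common way to store a directed network (an adjacency matrix) and how density and per-node tie counts follow from it.

```r
# Hypothetical toy network: four actors (nodes) and directed ties among them.
actors <- c("A", "B", "C", "D")
adj <- matrix(0, nrow = 4, ncol = 4, dimnames = list(actors, actors))

# A 1 in row i, column j means actor i sends a tie to actor j.
adj["A", "B"] <- 1
adj["B", "C"] <- 1
adj["C", "A"] <- 1
adj["C", "D"] <- 1

n_nodes <- nrow(adj)
n_ties  <- sum(adj)

# Density of a directed network without self-ties:
# observed ties divided by the n * (n - 1) possible ties.
density <- n_ties / (n_nodes * (n_nodes - 1))

# Ties sent (out-degree) and ties received (in-degree) by each actor.
out_degree <- rowSums(adj)
in_degree  <- colSums(adj)

print(density)
print(out_degree)
print(in_degree)
```

A low density, together with no actor whose tie count approaches the number of other actors, is the numeric counterpart of a network that is sparsely connected and has no single prominent member.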
The network analysis does not ‘know’ about any of those rules or strategies. Yet, network analysis can be used to reveal these patterns that reflect the underlying rules and regularities.

Fig. 1.2 Network of Netherlands 2010 World Cup soccer team (legend: Defender, Forward, Goalkeeper, Midfielder; nodes numbered 1 to 11)

1.2 What Is Network Analysis?

Network science is a broad approach to research and scholarship that uses a relational lens to study and understand biological, physical, social, and informational systems. The primary tool for network scientists is network analysis, which is a set of methods that are used to (1) visualize networks, (2) describe specific characteristics of overall network structure as well as details about the individual nodes, ties, and subgroups within the networks, and (3) build mathematical and statistical models of network structures and dynamics. Because the core question of network science is about relationships, most of the methods used in network analysis are quite distinct from the more traditional statistical tools used by social and health scientists.

Network analysis as a distinct scientific enterprise with its own theories and methods grew out of developments in many other disciplines, particularly graph theory and topology in mathematics, the study of kinship systems in anthropology, and social groups and process from sociology and psychology. Although network analysis was not invented by one person at a specific place and time, the initial development of what we now recognize as modern network analysis can be traced back to the work of Jacob Moreno in the 1930s. He defined the study of social relations as sociometry, and founded the journal Sociometry that would publish the early studies in this area. He also invented the sociogram, which was a visual way to display network structures.
The first published sociogram appeared in the New York Times in 1933, and it was a network diagram of the friendship ties among a 4th grade class. (These data are available as part of the network dataset package that accompanies this book, see Sect. 1.4.3 below.) The theories and methods of network analysis were developed throughout the rest of the twentieth century, with important contributions from sociology, psychology, political science, business, public health, and computer science. Network science as an empirical practice was propelled by the development of a number of network specific software tools and packages, including UCINet, STRUCTURE, Negopy, and Pajek.

The interest in network science has exploded in the last 20–30 years, driven by at least three different factors. First, mathematicians, physicists, and other researchers developed a number of influential theories of network structure and formation that brought attention and energy to network science (see Chap. 10 for some discussion of these theories). Second, advances in computational power and speed allowed network methods to be applied to large and very large networks, such as the internet, the population of the planet, or the human brain. Finally, advances in statistical network theory allowed analysts for the first time to move beyond simple network description to be able to build and test statistical models of network structures and processes (see Chaps. 11 and 12).

1.3 Five Good Reasons to Do Network Analysis in R

As the title suggests, this book is designed as a general guide for how to do network analysis in the R statistical language and environment. Why is R an ideal platform for developing and conducting network analyses? There are at least five good reasons.

1.3.1 Scope of R

The R statistical programming language and environment comprise a vast integrated system of thousands of packages and functions that allow it to handle innumerable data management, analysis, or visualization tasks.
The R system includes a number of packages that are designed to accomplish specific network analytic tasks. However, by performing these network tasks within the R environment, the analyst can take advantage of any of the other capabilities of R. Most other network analysis programs (e.g., Pajek, UCINet, Gephi) are stand-alone packages, and thus do not have the advantages of working within an integrated statistical programming environment.

1.3.2 Free and Open Nature of R

One of the important reasons for R’s popularity and success is its free and open nature. This is formally ensured via the GNU General Public License (GPL) that R code is released under. More informally, there is a vast R user and developer community which is continually working to enhance and improve R base code and the thousands of R packages that can be freely accessed. The social network capabilities of R described in this book have, in fact, been developed by the R user community. This open nature of R facilitates faster (and arguably, cleaner and more powerful) development and dissemination of new statistical and data analytic techniques, such as these network analytic tools.

1.3.3 Data and Project Management Capabilities of R

Although many good network analysis programs can handle a wide variety of network descriptive statistics and visualization tasks, no other network package matches R’s power to handle the often complex data and project management tasks of larger-scale network analyses. First, as suggested above, network analysis in R can take advantage of the powerful data management, cleaning, import and export capabilities of base R. As described in Chap. 3, network analysis often starts by importing and transforming data from other sources into a form that can be analyzed by network tools.
All network packages have some data management capabilities, but no other program can match R’s breadth and depth. Second, when conducting sophisticated scientific or commercial network analyses, it is important to have the right project management tools to facilitate code storage and retrieval, managing analysis outputs such as statistical results and information graphics, and producing reports for internal and external audiences. Traditional statistical analysis platforms such as SAS and SPSS have these sorts of tools, but most network programs do not. By pairing R up with an integrated development environment (IDE) such as RStudio (http://rstudio.org/) and taking advantage of packages such as knitr and shiny, the user has the ability to manage any type of complex network project. In fact, the development and availability of these tools has been one of the driving forces of the reproducible research movement (Gentleman and Lang 2007), which emphasizes the importance of combining data, code, results, and documentation in permanent and shareable forms. One example of the power of the reproducible research tools accessible in R is this book, which was created entirely in RStudio.

1.3.4 Breadth of Network Packages in R

The primary reason R is ideal for network analysis is the breadth of packages that are currently available to manage network data and conduct network visualization, network description, and network modeling. There are dozens of network-related packages, and more are being created all the time. R network data can be managed and stored in R native objects by the network and igraph packages, and the data can be exchanged between formats with the intergraph package. Basic network analysis and visualization can be handled with the sna package contained within the much broader statnet suite of network packages, as well as within igraph.
More sophisticated network modeling can be handled by ergm and its associated libraries, and dynamic actor-based network models are produced by RSiena. Freestanding network analysis programs have many strengths (e.g., the visualization capabilities of Gephi), but no single program matches the combined power of the social network analysis packages contained in R.

1.3.5 Strength of Network Modeling in R

Finally, the particular network modeling strengths of R should be mentioned. R is the only generally available software package that includes comprehensive facilities to do stochastic network modeling (e.g., exponential random graph models), dynamic actor-based network models that allow study of how networks change over time, and other network simulation procedures.

1.4 Scope of Book and Resources

1.4.1 Scope

As the title suggests, the goal of this book is to provide a hands-on, practical guide to doing network analysis in the R statistical programming environment. It is hands-on in the sense that the book provides guidance primarily in the form of short network analysis code snippets applied to realistic network data. The results of the analyses follow immediately. All the code and data are available to the reader, so that it is easy to replicate what is shown in the book, experiment with your own data or code extensions, and thus facilitate learning.

The practical goal of the book is to demonstrate network analytic techniques in R that will be useful for a wide variety of data analysis and research goals. This includes data management, network visualization, computation of relevant network descriptive statistics, and performing mathematical, statistical, and dynamic modeling of networks. The intended audiences include students, analysts and researchers across a wide variety of disciplines, particularly the social, health, business, and engineering domains.

It is also useful to state what this book is not designed to do.
First, it does not provide an in-depth treatment of network science theories or history. There are many good books, papers, training courses, and online resources available that cover this material. For good general overviews, the classic text by Wasserman and Faust (1994) is still relevant, and John Scott provides a good, more current treatment (2012). For more in-depth treatment of network science and statistical theory, see Newman (2010) or Kolaczyk (2009). Finally, two edited volumes that have good coverage of the recent history of network science as well as well-executed examples of empirical network research are Newman et al. (2006) and Scott and Carrington (2011).

Second, this book is not in any way an adequate introduction to R programming and statistical analysis. Although every attempt is made to make each code example clear and succinct, a novice R user will find some of the techniques and code syntax hard to follow. In particular, understanding R’s capabilities for data management, graphics, and the object-oriented approach to statistical modeling will be very helpful for getting the most out of this user guide.

Thus, the book is designed for the interested student, analyst, or researcher who is familiar with R and has some understanding of network science theories and methods. It could serve as a secondary text for a graduate level class in network analysis. It also could be useful as a primer for an experienced R analyst who wants to incorporate network analysis into her programming and analytic toolbox.

1.4.2 Book Roadmap

The book is organized into four main sections, which correspond to the four fundamental tasks that network analysts will spend most of their time on: data management, network visualization, network description, and network modeling. The first section has two chapters that provide a simple introduction to basic network techniques, followed by a more in-depth presentation of data management issues in network analysis.
The three chapters in the Visualization section cover basic network graphics layout, network graphic design suggestions, and some discussion of advanced graphics topics and techniques. The Description and Analysis section has three chapters that cover the most widely used techniques for describing important network characteristics, including actor prominence, network subgroups and communities, and handling affiliation networks. The final section, Modeling, includes four chapters that present advanced techniques for mathematical modeling, statistical modeling, modeling of dynamic networks, and network simulations. Table 1.1 presents this roadmap.

Chapter                Packages                                      Datasets
Introduction                                                         FIFA Nether, Krebs
5 number summary       statnet, sna                                  Moreno
Network data           statnet, network, igraph                      DHHS, ICTS
Basic visualization    statnet, sna                                  Moreno, Bali
Graphic design         statnet, sna, igraph                          Bali
Advanced graphics      arcdiagram, circlize, visNetwork, networkD3   Simpsons, Bali
Prominence             statnet, sna                                  DHHS, Bali
Subgroups              igraph                                        DHHS, Moreno, Bali
Affiliation networks   igraph                                        hwd
Mathematical models    igraph                                        lhds
Stochastic models      ergm                                          TCnetworks
Dynamic models         RSiena                                        Coevolve
Simulations            igraph

Table 1.1 User’s Guide roadmap

1.4.3 Resources

The most important resource for this user guide is a collection of network datasets that have been curated and made available to the readers of this book. Over a dozen network datasets are included in the form of an R package called UserNetR. These datasets are used throughout the book to support the coding and analysis examples. The network data included in the UserNetR package mostly come from published network studies, while a few are created to help illustrate particular analytic options. Table 1.1 lists the names of the datasets that are featured in each chapter.

The UserNetR package is maintained on GitHub, and must be downloaded and installed to make the network data available.
This can be done using the following code. (The devtools package must also be installed if it is not on your system.)

library(devtools)
install_github("DougLuke/UserNetR")

Once this is done, the package must be loaded to make the various datafiles available. This can be done with the library() function, just like for any R package. This command will not always be explicitly shown throughout the book, so make sure to load the package prior to executing any of the included R code.

library(UserNetR)

Finally, the documentation for the UserNetR package can be viewed through the R help system.

help(package='UserNetR')

Part I Network Analysis Fundamentals

Chapter 2 The Network Analysis ‘Five-Number Summary’

There is nothing like looking, if you want to find something. You certainly usually find something, if you look, but it is not always quite the something you were after. (J.R.R. Tolkien – The Hobbit)

2.1 Network Analysis in R: Where to Start

How should you start when you want to do a network analysis in R? The answer to this question rests of course on the analytic questions you hope to answer, the state of the network data that you have available, and the intended audience(s) for the results of this work. The good news about performing network analysis in R is that, as will be seen in subsequent chapters, R provides a multitude of available network analysis options. However, it can be daunting to know exactly where to start.

In 1977, John Tukey introduced the five-number summary as a simple and quick way to summarize the most important characteristics of a univariate distribution. Networks are more complicated than single variables, but it is also possible to explore a set of important characteristics of a social network using a small number of procedures in R.
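For readers who have not used Tukey's univariate summary, the following minimal sketch shows what it returns. The vector x holds made-up illustrative values (an assumption, not data from this book); fivenum() is part of base R.

```r
# Hypothetical univariate sample (illustrative values only)
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)

# Tukey's five-number summary: minimum, lower hinge, median,
# upper hinge, and maximum
five <- fivenum(x)
five
```

The network version developed in this chapter follows the same spirit: a handful of quick numbers that describe the overall shape of the data.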
In this chapter, we will focus on two initial steps that are almost always useful for beginning a network analysis: simple visualization, and basic description using a ‘five-number summary.’ This chapter also serves as a gentle introduction to basic network analysis in R, and demonstrates how quickly this can be done.

2.2 Preparation

Similar to most types of statistical analysis using R, the first steps are to load appropriate packages (installing them first if necessary), and then make data available for the analyses. The statnet suite of network analysis packages will be used here for the analyses. The data used in this chapter (and throughout the rest of the book) are from the UserNetR package that accompanies the book. The specific dataset used here is called Moreno, and contains a friendship network of fourth grade students first collected by Jacob Moreno in the 1930s.

library(statnet)
library(UserNetR)
data(Moreno)

2.3 Simple Visualization

The first step in network analysis is often to just take a look at the network. Network visualization is critical, but as Chaps. 4, 5 and 6 indicate, effective network graphics take careful planning and execution to produce. That being said, an informative network plot can be produced with one simple function call. The only added complexity here is that we are using information about the network members’ gender to color code the nodes. The syntax details underlying this example will be covered in greater depth in Chaps. 3, 4 and 5.

gender <- Moreno %v% "gender"
plot(Moreno, vertex.col = gender + 2, vertex.cex = 1.2)

The resulting plot makes it immediately clear how the friendship network is made up of two fairly distinct subgroups, based on gender.
A quickly produced network graphic like this can often reveal the most important structural patterns contained in the social network.

2.4 Basic Description

Tukey’s original five-number summary was intended to describe the most important distributional characteristics of a variable, including its central tendency and variability, using easy to produce statistical summaries. Similarly, using only a few functions and lines of R code, we can produce a network five-number summary that tells us how large the network is, how densely connected it is, whether the network is made up of one or more distinct groups, how compact it is, and how clustered the network members are.

2.4.1 Size

The most basic characteristic of a network is its size. The size is simply the number of members, usually called nodes, vertices or actors. The network.size() function is the easiest way to get this. The basic summary of a statnet network object also provides this information, among other things. The Moreno network has 33 members, based on the network.size and summary calls. (Setting print.adj to FALSE suppresses some detailed adjacency information that can take up a lot of room.)

Fig. 2.1 Moreno sociogram

network.size(Moreno)

## [1] 33

summary(Moreno,print.adj=FALSE)

## Network attributes:
##   vertices = 33
##   directed = FALSE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##  total edges = 46
##    missing edges = 0
##    non-missing edges = 46
##  density = 0.0871
##
## Vertex attributes:
##
##  gender:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    2.00    1.52    2.00    2.00
##   vertex.names:
##    character valued attribute
##    33 valid vertex names
##
## No edge attributes

2.4.2 Density

Of all the basic characteristics of a social network, density is among the most important as well as being one of the easiest to understand.
Density is the ratio of the number of observed ties (also called edges, arcs, or relations) in a network to the maximum number of possible ties. Thus, density is a ratio that can range from 0 to 1. The closer to 1 the density is, the more interconnected is the network.

Density is relatively easy to calculate, although the underlying equation differs based on whether the network ties are directed or undirected. An undirected tie is one with no direction. Collaboration would be a good example of an undirected tie; if A collaborates with B, then by necessity B is also collaborating with A. Directed ties, on the other hand, have direction. Money flow is a good example of a directed tie. Just because A gives money to B does not necessarily mean that B reciprocates. For a directed network, the maximum number of possible ties among $k$ actors is $k(k-1)$, so the formula for density is:

$$\frac{L}{k(k-1)},$$

where $L$ is the number of observed ties in the network. Density, as defined here, does not allow for ties between a particular node and itself (called a loop).

For an undirected network the maximum number of ties is $k(k-1)/2$ because non-directed ties should only be counted once for every dyad (i.e., pair of nodes). So, density for an undirected network becomes:

$$\frac{2L}{k(k-1)}.$$

The information obtained in the previous section told us that the Moreno network has 33 nodes and 46 non-directed edges. We could then use R to calculate that by hand, but it is easier to simply use the gden() function.

den_hand <- 2*46/(33*32)
den_hand

## [1] 0.0871

gden(Moreno)

## [1] 0.0871

2.4.3 Components

A social network is sometimes split into various subgroups. Chapter 8 will describe how to use R to identify a wide variety of network groups and communities. However, a very basic type of subgroup in a network is a component. An informal definition of a component is a subgroup in which all actors are connected, directly or indirectly.
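To make this informal definition concrete, here is a minimal base R sketch that labels components with a breadth-first search over an adjacency matrix. The six-node network and the find_components() helper are hypothetical, written only for illustration; they are not part of statnet and are not claimed to be how statnet computes components.

```r
# Hypothetical undirected network: a-b-c form a chain, d-e are tied
# to each other, and f is an isolate (illustrative data only)
nodes <- c("a", "b", "c", "d", "e", "f")
adj <- matrix(0, 6, 6, dimnames = list(nodes, nodes))
adj["a", "b"] <- adj["b", "a"] <- 1
adj["b", "c"] <- adj["c", "b"] <- 1
adj["d", "e"] <- adj["e", "d"] <- 1

# Hypothetical helper: give every node a component ID by repeatedly
# running a breadth-first search from each not-yet-labeled node
find_components <- function(m) {
  n <- nrow(m)
  comp <- rep(NA_integer_, n)
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(comp[start])) next        # node already labeled
    current <- current + 1L
    comp[start] <- current
    queue <- start
    while (length(queue) > 0) {
      v <- queue[1]
      queue <- queue[-1]
      # neighbors of v that have not been reached yet
      nbrs <- which(m[v, ] > 0 & is.na(comp))
      comp[nbrs] <- current
      queue <- c(queue, nbrs)
    }
  }
  setNames(comp, rownames(m))
}

membership <- find_components(adj)
membership
n_components <- length(unique(membership))
n_components
```

In this toy network the isolate f counts as its own component, so three components are found. Every pair of nodes that shares a label can reach each other through some chain of ties.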
The number of components in a network can be obtained with the components function. (Note that the meaning of components is more complicated for directed networks. See help(components) for more information.)

components(Moreno)

## [1] 2

2.4.4 Diameter

Although the overall size of a network may be interesting, a more useful characteristic of the network is how compact it is, given its size and degree of interconnectedness. The diameter of a network is a useful measure of this compactness. A path is the series of steps required to go from node A to node B in a network. The shortest path is the one with the smallest number of steps. The diameter then for an entire network is the longest of the shortest paths across all pairs of nodes. This is a measure of compactness or network efficiency in that the diameter reflects the ‘worst case scenario’ for sending information (or any other resource) across a network. Although social networks can be very large, they can still have small diameters because of their density and clustering (see below).

The only complicating factor for examining the diameter of a network is that it is undefined for networks that contain more than one component. A typical approach when there are multiple components is to examine the diameter of the largest component in the network. For the Moreno network there are two components (see Fig. 2.1). The smaller component only has two nodes. Therefore, we will use the larger component that contains the other 31 connected students.

In the following code the largest component is extracted into a new matrix. The geodesics (shortest paths) are then calculated for each pair of nodes using the geodist() function. The maximum geodesic is then extracted, which is the diameter for this component. A diameter of 11 suggests that this network is not very compact. It takes 11 steps to connect the two nodes that are situated the furthest apart in this friendship network.
lgc <- component.largest(Moreno,result="graph")
gd <- geodist(lgc)
max(gd$gdist)

## [1] 11

2.5 Clustering Coefficient

One of the fundamental characteristics of social networks (compared to random networks) is the presence of clustering, or the tendency to form closed triangles. The process of closure occurs in a social network when two people who share a common friend also become friends themselves. This can be measured in a social network by examining its transitivity. Transitivity is defined as the proportion of closed triangles (triads where all three ties are observed) to the total number of open and closed triangles (triads where either two or all three ties are observed). Thus, like density, transitivity is a ratio that can range from 0 to 1. Transitivity of a network can be calculated using the gtrans() function. The transitivity for the 4th graders is 0.29, suggesting a moderate level of clustering in the classroom network.

gtrans(Moreno,mode="graph")

## [1] 0.286

In the rest of this book, we will examine in more detail how the power of R can be harnessed to explore and study the characteristics of social networks. The preceding examples show that basic plots and statistics can be easily obtained. The meaning of these statistics will always rest on the theories and hypotheses that the analyst brings to the task, as well as history and experience doing network analysis with other similar types of social networks.

Chapter 3 Network Data Management in R

Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. (Samuel Johnson)

3.1 Network Data Concepts

A major advantage of using R for network analysis is the power and flexibility of the tools for accessing and manipulating the actual network data. One of the things that I often tell my quantitative methods students is that they will typically spend the majority of their time dealing with data management tasks and challenges.
In fact, the time spent analyzing and modeling data is dwarfed by the time spent getting data ready for analyses. This is no different for network analysis. In fact, given the specialized nature of network data, the data management tasks loom even larger.

In this chapter we cover three main topics. First, the general nature of network data is explored and defined. Second, we learn how network data objects can be created and managed in R. Finally, a number of typical network data management tasks are illustrated through a set of examples.

3.1.1 Network Data Structures

For many types of data analysis the data are stored in rectangular data structures, where rows are used to depict cases or observations, and columns depict individual variables. Spreadsheets use this type of data organization, as well as most statistics packages such as SPSS. In R one of the fundamental data types is a ‘data frame,’ which uses this same rectangular format.

Networks, because of their need to depict more complicated relational structures, require a different type of data storage. That is, in rectangular data structures the fundamental piece of information is an attribute (column) of a case (row). In network analysis, the fundamental piece of information is a relationship (tie) between two members of a network.

Consider the following simple example of a directed network. The network graphic itself depicts all of the information about the network. It is made up of five nodes (named A through E), and there are a total of six directed ties. Because these are directed ties, we can call them arcs (as compared to non-directed edges).
Although the network diagram is an efficient way to communicate the network information to humans, computers need to use other methods to store, access, and operate on the underlying network data.

Fig. 3.1 Simple directed network

3.1.1.1 Sociomatrices

Another way to depict the network data that is more useful for computer storage is to arrange the information in a matrix. This type of matrix containing network information is a sociomatrix. Table 3.1 contains the sociomatrix that corresponds to Fig. 3.1. A sociomatrix is a square matrix where a 1 indicates a tie between two nodes, and a 0 indicates no tie. So in Table 3.1 we see that there is a 1 in cell 1,2, which indicates a tie going from node A to node B. The convention is that rows indicate the starting node, and columns indicate the receiving node. A sociomatrix is also sometimes called an adjacency matrix, because the 1s in the cells indicate which nodes are adjacent to one another in the network.

If the network is non-directed (only edges instead of arcs), then the sociomatrix would be symmetric around the diagonal. Here, however, cell 2,1 has a zero, indicating that there is not an arc that goes from node B back to node A. For simple networks, there are no self-loops, where a tie connects back to its own node. So, diagonals are all zeros for simple networks.

    A  B  C  D  E
A   0  1  1  0  0
B   0  0  1  1  0
C   0  1  0  0  0
D   0  0  0  0  0
E   0  0  1  0  0

Table 3.1 Sociomatrix of the example directed network

3.1.1.2 Edge-Lists

Sociomatrices are elegant ways to depict networks, and they are a common way that many network analysis programs store and manipulate network data. In particular, many basic network algorithms are based on mathematical or statistical operations on sociomatrices. For example, to find geodesic distances between all pairs of nodes in a network the underlying sociomatrices are multiplied together (Wasserman and Faust 1994). However, sociomatrices have one large disadvantage.
As networks get larger, sociomatrices become very sparse. That is, most of the matrix will be made up of empty cells (cells with 0s). Table 3.2 shows the dramatic increase in both the size and sparseness of a sociomatrix as the network size increases, keeping the average degree constant at 3. This poses challenges for data storage, data manipulation, and data display.

Nodes   Avg. degree   Edges   Density   Empty cells
10      3             15      0.33      70
100     3             150     0.03      9,700
1,000   3             1,500   0.00      997,000

Table 3.2 Demonstration of sparse sociomatrices

Fortunately, there is another way to depict network information that avoids this problem of sociomatrices. Table 3.3 presents the edge list format for the example network. As its name suggests, the edge list format depicts network information by simply listing every tie in the network. Each row corresponds to a single tie that goes from the node listed in the first column to the node listed in the second column. Although the size of the sociomatrix and the edge list matrix are similar for this small example (25 cells for the sociomatrix and 12 cells for the edge list matrix), edge lists become much more efficient for large networks. Referring back to Table 3.2, for a network with 1,000 nodes, the sociomatrix would have 1,000,000 cells. The edge list for this network, with nodes having average degree of 3, would only have 3,000 cells (1,500 edges between pairs of nodes).

From   To
A      B
A      C
B      C
B      D
C      B
E      C

Table 3.3 Edge list format for example directed network

3.1.2 Information Stored in Network Objects

Although basic matrices can be used to store some network information, R and other statistics packages use more complex data structures to contain a wide variety of network node, tie, metadata, and miscellaneous characteristics. In general, a network data object can contain up to five types of information, as listed in Table 3.4.
Type              Description                                         Required?
Nodes             List of nodes in network, along with node labels    Required
Ties              List of ties in the network                         Required
Node attributes   Attributes of the nodes                             Optional
Tie attributes    Attributes of the ties                              Optional
Metadata          Other information about the entire network          Depends

Table 3.4 Types of information contained in network data objects

First, a network data object must know which objects belong to the network. These are generally known as nodes (in statnet they are called vertices). The second required component in a network object is the list of ties that connect the nodes to one another. Without these two types of information, the data object is not really a network object.

In addition to node and tie listings, network data objects will often be able to store characteristics of those nodes and ties. For example, if the nodes in the network are people, then basic information on those people such as gender or income could be contained in the data object. Similarly, ties themselves may have characteristics such as strength or valence (e.g., positive vs. negative).

Finally, network data objects may also contain metadata about the whole network or other information that may be relevant or useful when accessing or analyzing the data. For example, statnet stores global information about the network as metadata, including whether the network is directed, whether loops are allowed, and whether the network is bipartite.

3.2 Creating and Managing Network Objects in R

Given R’s object-oriented design, it is not surprising that the main way that R expects to access network data is through some type of a network data object. As part of the statnet suite of packages, the network package defines a network class that is an object structure designed to hold network data.
Although statnet can recognize relational data that are stored in basic matrices or data frames, much of the power and flexibility of R’s network analyses is unlocked when using network data objects. For more detailed information about network objects in statnet, see Butts (2008).

3.2.1 Creating a Network Object in statnet

To create a network object, the identically-named network() function is called. This function has a number of options, but the most common way to use it is to feed relational data to it, typically an adjacency matrix or edge list. To see how this works we will continue with the example directed network from Fig. 3.1. First, we will create a network using an adjacency matrix.

netmat1 <- rbind(c(0,1,1,0,0),
                 c(0,0,1,1,0),
                 c(0,1,0,0,0),
                 c(0,0,0,0,0),
                 c(0,0,1,0,0))
rownames(netmat1) <- c("A","B","C","D","E")
colnames(netmat1) <- c("A","B","C","D","E")
net1 <- network(netmat1,matrix.type="adjacency")
class(net1)

## [1] "network"

summary(net1)

## Network attributes:
##   vertices = 5
##   directed = TRUE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##  total edges = 6
##    missing edges = 0
##    non-missing edges = 6
##  density = 0.3
##
## Vertex attributes:
##   vertex.names:
##    character valued attribute
##    5 valid vertex names
##
## No edge attributes
##
## Network adjacency matrix:
##   A B C D E
## A 0 1 1 0 0
## B 0 0 1 1 0
## C 0 1 0 0 0
## D 0 0 0 0 0
## E 0 0 1 0 0

The results of the class() and summary() calls show that we have successfully created a new network object. Also, this demonstrates that if the matrix has identical row and column names, they will be used as the labels for the nodes. We can also see that this is the same network as the earlier example by plotting it (Fig. 3.2).

gplot(net1, vertex.col = 2, displaylabels = TRUE)

The same network can be created using an edge list format. This will often be more convenient than adjacency matrices.
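The correspondence between the two formats can be sketched in base R alone, without any network package. The object names netmat, ids, cells, and edges below are chosen for illustration only; netmat simply repeats the adjacency matrix of the five-node example, and the approach shown (picking out the nonzero cells) is one simple option rather than what statnet does internally.

```r
# Adjacency matrix of the five-node example (rows send, columns receive)
netmat <- rbind(c(0,1,1,0,0),
                c(0,0,1,1,0),
                c(0,1,0,0,0),
                c(0,0,0,0,0),
                c(0,0,1,0,0))
ids <- c("A","B","C","D","E")
dimnames(netmat) <- list(ids, ids)

# Every nonzero cell (i, j) becomes one edge list row: from i, to j
cells <- which(netmat == 1, arr.ind = TRUE)
cells <- cells[order(cells[, "row"], cells[, "col"]), ]  # sort by sender
edges <- data.frame(from = ids[cells[, "row"]],
                    to   = ids[cells[, "col"]])
edges
nrow(edges)
```

The six rows of edges match the edge list in Table 3.3, while the 25-cell matrix needed 19 zeros to say the same thing.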
Not only are edge lists smaller than sociomatrices, but network data are often obtained naturally in this format. For example, email communications can be analyzed as networks, where each email corresponds to a tie from the email sender to the receiver. This leads easily to edge list node pairs.

netmat2 <- rbind(c(1,2),
                 c(1,3),
                 c(2,3),
                 c(2,4),
                 c(3,2),
                 c(5,3))
net2 <- network(netmat2,matrix.type="edgelist")
network.vertex.names(net2) <- c("A","B","C","D","E")
summary(net2)

## Network attributes:
##   vertices = 5
##   directed = TRUE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##  total edges = 6
##    missing edges = 0
##    non-missing edges = 6
##  density = 0.3
##
## Vertex attributes:
##   vertex.names:
##    character valued attribute
##    5 valid vertex names
##
## No edge attributes
##
## Network adjacency matrix:
##   A B C D E
## A 0 1 1 0 0
## B 0 0 1 1 0
## C 0 1 0 0 0
## D 0 0 0 0 0
## E 0 0 1 0 0

Fig. 3.2 Plot of new network object

This produces the same network as before. Notice that the edgelist was provided in the form of node ID numbers. To label the nodes properly, we used a special vertex attribute constructor, network.vertex.names.

We have seen that to create network objects in R we can use a workflow that takes data in a number of basic matrix formats and transforms them into the network class object. However, statnet also includes a number of tools that allow you to reverse this workflow, by coercing network data into other matrix formats.

as.sociomatrix(net1)

##   A B C D E
## A 0 1 1 0 0
## B 0 0 1 1 0
## C 0 1 0 0 0
## D 0 0 0 0 0
## E 0 0 1 0 0

class(as.sociomatrix(net1))

## [1] "matrix"

A more general coercion function is as.matrix(). It can be used to produce a sociomatrix or an edgelist matrix.
all(as.matrix(net1) == as.sociomatrix(net1))

## [1] TRUE

as.matrix(net1,matrix.type = "edgelist")

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    2
## [3,]    1    3
## [4,]    2    3
## [5,]    5    3
## [6,]    2    4
## attr(,"n")
## [1] 5
## attr(,"vnames")
## [1] "A" "B" "C" "D" "E"

This ability to go back and forth between network objects and more fundamental data structures such as sociomatrices and edgelist matrices gives the analyst great power and flexibility when managing network data. We will take advantage of these tools later in this chapter as well as throughout the book.

3.2.2 Managing Node and Tie Attributes

One of the major advantages of using network objects when doing network analysis in R rather than using simpler matrix objects is the ability to store additional attribute information about the nodes and ties within the same network object. The analyst typically knows much more about the members of a network than just the simple list of nodes and ties. These node or tie characteristics can be used in network visualization (see Chap. 5), network description, and network modeling (Chap. 11). For both nodes and ties, statnet provides a set of functions that can be used to create, delete, access, and list any attribute information of relevance. These functions have a lot of capabilities; see help(attribute.methods) for more details.

3.2.2.1 Node Attributes

In the following example we use two different methods to set a pair of node attributes (called vertex attributes by statnet). The first example uses the more formal method to assign gender codes to the nodes in net1. The second example uses a shorthand method to assign a numeric vector as an attribute. In this case we are storing the sum of the indegrees and outdegrees of each node as a new vertex attribute.
set.vertex.attribute(net1, "gender", c("F", "F", "M", "F", "M"))
net1 %v% "alldeg" <- degree(net1)
list.vertex.attributes(net1)

## [1] "alldeg"       "gender"       "na"
## [4] "vertex.names"

summary(net1)

## Network attributes:
##   vertices = 5
##   directed = TRUE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##  total edges = 6
##    missing edges = 0
##    non-missing edges = 6
##  density = 0.3
##
## Vertex attributes:
##
##  alldeg:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0     1.0     2.0     2.4     4.0     4.0
##
##  gender:
##    character valued attribute
##    attribute summary:
##    F M
##    3 2
##   vertex.names:
##    character valued attribute
##    5 valid vertex names
##
## No edge attributes
##
## Network adjacency matrix:
##   A B C D E
## A 0 1 1 0 0
## B 0 0 1 1 0
## C 0 1 0 0 0
## D 0 0 0 0 0
## E 0 0 1 0 0

In this example, we see that information obtained outside of the network (i.e., gender) or information obtained from the network itself (i.e., degree) can be used as node attributes. Once node attributes have been set, they can be examined with the list.vertex.attributes command (note the plural). Also, the summary of the network will provide some basic information about any stored attributes. To see the actual values stored in a vertex attribute, you can use the following two equivalent methods.

get.vertex.attribute(net1, "gender")

## [1] "F" "F" "M" "F" "M"

net1 %v% "alldeg"

## [1] 2 4 4 1 1

3.2.2.2 Tie Attributes

Information about tie characteristics can also be stored and managed in the network objects, using the similarly named set.edge.attribute and get.edge.attribute functions. In the following example we create a new edge attribute that contains a random number for each edge in the network, and then access that information.
list.edge.attributes(net1)

## [1] "na"

# One random value per edge, so the count comes from
# network.edgecount() rather than network.size()
set.edge.attribute(net1,"rndval",
                   runif(network.edgecount(net1),0,1))
list.edge.attributes(net1)

## [1] "na"     "rndval"

summary(net1 %e% "rndval")

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.163   0.165   0.220   0.382   0.476   0.980

summary(get.edge.attribute(net1,"rndval"))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.163   0.165   0.220   0.382   0.476   0.980

A more typical situation where you will want to create a new edge attribute is when you are creating or working with valued networks. A valued network is one where the network tie has some numeric value. For example, a resource exchange network may include not just whether there is a flow of money from one node to another, but the actual amount of that money. In statnet, the actual values of the valued ties are stored in an edge attribute.

To see how this works, consider our example network now as a friendship network, where the five network members were asked to indicate how much they liked one another, on a scale of 0 (not at all) to 3 (very much). The following example shows how we would proceed from the raw valued sociomatrix to storing the values in an edge attribute called ‘like.’

netval1 <- rbind(c(0,2,3,0,0),
                 c(0,0,3,1,0),
                 c(0,1,0,0,0),
                 c(0,0,0,0,0),
                 c(0,0,2,0,0))
netval1 <- network(netval1,matrix.type="adjacency",
                   ignore.eval=FALSE,names.eval="like")
network.vertex.names(netval1) <- c("A","B","C","D","E")
list.edge.attributes(netval1)

## [1] "like" "na"

get.edge.attribute(netval1, "like")

## [1] 2 1 3 3 2 1

The key here is the pair of options ignore.eval and names.eval. These two options, as set here, tell the network function to evaluate the actual values in the sociomatrix, and store those values in a new edge attribute called ‘like.’ Once values are stored in an edge attribute, the original valued matrix can be restored using an option of the as.sociomatrix coercion function.
as.sociomatrix(netval1)

##   A B C D E
## A 0 1 1 0 0
## B 0 0 1 1 0
## C 0 1 0 0 0
## D 0 0 0 0 0
## E 0 0 1 0 0

as.sociomatrix(netval1,"like")

##   A B C D E
## A 0 2 3 0 0
## B 0 0 3 1 0
## C 0 1 0 0 0
## D 0 0 0 0 0
## E 0 0 2 0 0

3.2.3 Creating a Network Object in igraph

The other major R package that can be used to store and manipulate network data is igraph, which is a comprehensive set of network data management and analytic tools that have been implemented in R, Python, and C/C++. More information can be obtained at igraph.org. To start working with igraph, the package needs to be installed and loaded. It contains a number of functions that have the same names as those found in the statnet suite of packages, so it is a good idea to detach statnet before loading igraph.

detach(package:statnet)
library(igraph)

For the most part, igraph can be used to store and access network, node, and edge information in similar ways as the network package. In particular, igraph network objects (called ‘graphs’) can be created from more basic sociomatrix or edge list data structures.

inet1 <- graph.adjacency(netmat1)
class(inet1)

## [1] "igraph"

summary(inet1)

## IGRAPH DN-- 5 6 --
## + attr: name (v/c)

str(inet1)

## IGRAPH DN-- 5 6 --
## + attr: name (v/c)
## + edges (vertex names):
## [1] A->B A->C B->C B->D C->B E->C

The summary information from an igraph graph object is slightly more cryptic than from a statnet network. After the ‘IGRAPH’ tag is listed (indicating that this is an igraph object), a series of codes are presented. In this case the ‘D’ indicates a directed graph, and the ‘N’ indicates that the vertices are named. Other codes might appear that would designate whether the graph is weighted (i.e., valued) or bipartite. After these codes the number of vertices (5) and edges (6) are then displayed. See the help entry for summary.igraph for more details.
The str() function provides slightly more information, including the edge list. Similarly, an igraph graph object can be created from an edge list.

inet2 <- graph.edgelist(netmat2)
summary(inet2)

## IGRAPH D--- 5 6 --

Node and tie attributes can be created, accessed, and transformed in similar ways as within statnet. (In fact, management of node and tie attributes is somewhat easier in igraph because of the underlying elegance of the accessor functions.) To create and use node attributes, the V() vertex accessor function is used. Similarly, to manage edge attributes, the E() edge accessor function is used. In this example we use these functions to set names for the nodes, and to set edge values for the observed ties.

V(inet2)$name <- c("A","B","C","D","E")
E(inet2)$val <- c(1:6)
summary(inet2)

## IGRAPH DN-- 5 6 --
## + attr: name (v/c), val (e/n)

str(inet2)

## IGRAPH DN-- 5 6 --
## + attr: name (v/c), val (e/n)
## + edges (vertex names):
## [1] A->B A->C B->C B->D C->B E->C

3.2.4 Going Back and Forth Between statnet and igraph

There will be times when you will want to use statnet network functions on network data stored in an igraph graph object, and vice versa. To facilitate this, the intergraph package can be used to transform network data objects between the two formats. In the following example, we transform the net1 data into the igraph format using the asIgraph function. If we wanted to go in the opposite direction, we would use asNetwork.
library(intergraph)
class(net1)

## [1] "network"

net1igraph <- asIgraph(net1)
class(net1igraph)

## [1] "igraph"

str(net1igraph)

## IGRAPH D--- 5 6 --
## + attr: alldeg (v/n), gender (v/c), na
## | (v/l), vertex.names (v/c), na (e/l),
## | rndval (e/n)
## + edges:
## [1] 1->2 3->2 1->3 2->3 5->3 2->4

3.3 Importing Network Data

Importing raw data into R for subsequent network analyses is relatively straightforward, as long as the external data are in edge list, adjacency list, or sociomatrix form (or can easily be transformed into such). This example creates an edge list that corresponds to the same example network from Sect. 3.2.1 and then saves it as an external CSV file. This file is then read in using read.csv and then turned into a network data object.

detach("package:igraph", unload=TRUE)
library(statnet)
netmat3 <- rbind(c("A","B"),
                 c("A","C"),
                 c("B","C"),
                 c("B","D"),
                 c("C","B"),
                 c("E","C"))
net.df <- data.frame(netmat3)
net.df

##   X1 X2
## 1  A  B
## 2  A  C
## 3  B  C
## 4  B  D
## 5  C  B
## 6  E  C

write.csv(net.df, file = "MyData.csv",
          row.names = FALSE)
net.edge <- read.csv(file="MyData.csv")
net_import <- network(net.edge,
                      matrix.type="edgelist")
summary(net_import)

## Network attributes:
##   vertices = 5
##   directed = TRUE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##  total edges = 6
##    missing edges = 0
##    non-missing edges = 6
##  density = 0.3
##
## Vertex attributes:
##   vertex.names:
##    character valued attribute
##    5 valid vertex names
##
## No edge attributes
##
## Network adjacency matrix:
##   A B C D E
## A 0 1 1 0 0
## B 0 0 1 1 0
## C 0 1 0 0 0
## D 0 0 0 0 0
## E 0 0 1 0 0

gden(net_import)

## [1] 0.3

The network package in the statnet suite can read in external network data that are in Pajek format (either Pajek .net or .paj files), using the read.paj() function.
The igraph package can also import Pajek files, as well as a few other formats including GraphML and UCINet DL files.

3.4 Common Network Data Tasks

The preceding sections covered the basic information needed to create and manage network data objects in R. However, the data management tasks for network analysis do not end there. There are any number of network analytic challenges that will require more sophisticated data management and transformation techniques. In the rest of this chapter, two such examples are covered: preparing subsets of network data for analysis by filtering on node and edge characteristics, and turning directed networks into non-directed networks.

3.4.1 Filtering Networks Based on Vertex or Edge Attribute Values

It is quite common to want to examine a subset of a network, either for quick visualization or for further analyses. There are many ways to define or identify interesting subnetworks in a larger network, and Chap. 8 covers many of them. However, as a basic data management task, you can filter a network based on values contained either in edge attributes or vertex attributes. For both of these cases, you will delete either the nodes or the edges, based on selection criteria that you set.

3.4.1.1 Filtering Based on Node Values

If a network object contains node characteristics, stored as vertex attributes, this information can be used to select a new subnetwork for analysis. In our example network we have the gender vertex attribute, so if you wanted to look at the subnetwork made up of females, you would use the following code (after switching back from igraph to statnet).

n1F <- get.inducedSubgraph(net1,
                 which(net1 %v% "gender" == "F"))
n1F[,]

##   A B D
## A 0 1 0
## B 0 0 1
## D 0 0 0

The get.inducedSubgraph() function returns a new network object that is filtered based on the vertex attribute criteria. This works because the %v% operator returns the vector of attribute values, and which() converts the logical comparison on those values into the vertex IDs of the matching nodes.
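The selection step can be illustrated without any network package. The following base-R sketch rebuilds the example sociomatrix from Sect. 3.2.1 by hand and uses a plain character vector as a stand-in for the gender vertex attribute (the names adj, gender, ids, and sub_adj are introduced here only for illustration). It is not how get.inducedSubgraph() is implemented; it only shows that the comparison yields a logical vector, that which() turns it into vertex IDs, and that keeping the matching rows and columns gives the same ties as the induced subgraph shown above.

```r
# Example sociomatrix from Sect. 3.2.1, rebuilt by hand (rows send ties).
adj <- rbind(c(0,1,1,0,0),
             c(0,0,1,1,0),
             c(0,1,0,0,0),
             c(0,0,0,0,0),
             c(0,0,1,0,0))
rownames(adj) <- colnames(adj) <- c("A","B","C","D","E")

# Stand-in for net1 %v% "gender": one value per vertex, in vertex order.
gender <- c("F","F","M","F","M")

is_female <- gender == "F"   # logical vector, one entry per vertex
ids <- which(is_female)      # positions of the TRUE entries (vertex IDs)

# Keeping only those rows and columns mimics an induced subgraph.
sub_adj <- adj[ids, ids]
sub_adj
```

Because the IDs are just positions in the vertex order, any vector of per-vertex values (character, numeric, or logical) can drive the same kind of selection.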
gplot(n1F,displaylabels=TRUE)

Fig. 3.3 Female subnetwork

The same process can work with numeric node characteristics. The following code will plot the subnetwork made up of the nodes that have degree greater than or equal to 2 (Fig. 3.4). (But note that the nodes in the new subnetwork will of course not have the same original degree values!) This works the same way but uses the %s% operator, which is a shortcut for the get.inducedSubgraph function.

deg <- net1 %v% "alldeg"
n2 <- net1 %s% which(deg > 1)
gplot(n2,displaylabels=TRUE)

Fig. 3.4 High degree subnetwork

3.4.1.2 Removing Isolates

Another common filtering task with networks is to examine the network after removing all the isolates (i.e., nodes with degree of 0). We could use the get.inducedSubgraph function from the previous section, but given that we want to delete certain nodes we can take a more direct approach.

For this short example, we will use the ICTS network dataset, which is available as part of the UserNetR package that accompanies this book. The members of this network are scientists, and they have a tie if they worked together on a scientific grant submission. Using the isolates() function, we can see that this network has a fair number of isolated nodes.

data(ICTS_G10)
gden(ICTS_G10)

## [1] 0.0112

length(isolates(ICTS_G10))

## [1] 96

The isolates() function returns a vector of vertex IDs. This can be fed to the delete.vertices() function. However, unlike most R functions we have seen, delete.vertices() does not return an object, but it directly operates on the network that is passed to it. For that reason, it is safer to work on a copy of the object.

n3 <- ICTS_G10
delete.vertices(n3,isolates(n3))
gden(n3)

## [1] 0.0173

length(isolates(n3))

## [1] 0

3.4.1.3 Filtering Based on Edge Values

A social network often contains valued ties.
For example, a resource exchange network may list not only who exchanges money (or some other resource) with each other, but the amount of money. Remember that in statnet information about ties is stored in edge attributes (see Sect. 3.2.2). When a network has valued ties, it is not unusual to want to examine the part of the network that only has certain values for those ties. For example, you might want to visualize or analyze only those persons in the resource exchange network who have given or received over a certain amount of money. For this you will need to filter the network using tie values contained in the appropriate edge attribute.

For this next example we will use a larger, more realistic social network. The DHHS Collaboration Network (DHHS) contains network data from a study of the relationships among 54 tobacco control experts working in 11 different agencies in the Department of Health and Human Services in 2005. The main relationship included in this dataset is collaboration: two members have a tie if they worked together in the past year. This tie is valued to capture differences in the strength of the collaboration. Specifically, the collaboration tie could take on one of four values: (1) Shared information only; (2) Worked together informally; (3) Worked together formally on a project; and (4) Worked together formally on multiple projects. We can see that the raw network is relatively dense, and because of that the network structure is somewhat hard to interpret when plotted (Fig. 3.5).

data(DHHS)
d <- DHHS
gden(d)

## [1] 0.312

op <- par(mar = rep(0, 4))
gplot(d,gmode="graph",edge.lwd=d %e% 'collab',
      edge.col="grey50",vertex.col="lightblue",
      vertex.cex=1.0,vertex.sides=20)
par(op)

The graphic is hard to interpret partly because of the high density, and partly because some edges are drawn thicker than others based on the value of the ‘collab’ attribute.
We may have a more interesting network graph (and one that is easier to interpret), if we only examine the network ties for formal collaboration. That is, we can filter the original network and show only those ties where collaboration is coded 3 or higher.

Fig. 3.5 DHHS collaborations

To understand how edge filtering works, it is important to remember how valued ties are stored in a network object. The ties themselves are stored as a binary indicator in the network object, while the values of those ties are stored in an edge attribute. We can see how this works for the DHHS Collaboration network. First, we examine the network ties for the first six members of the network. Then we determine where the collaboration values are stored, and then use that to view the tie values for the same set of six actors.

as.sociomatrix(d)[1:6,1:6]

##        ACF-1 ACF-2 AHRQ-1 AHRQ-2 AHRQ-3 AHRQ-4
## ACF-1      0     1      0      0      0      0
## ACF-2      1     0      0      0      0      0
## AHRQ-1     0     0      0      1      1      1
## AHRQ-2     0     0      1      0      1      1
## AHRQ-3     0     0      1      1      0      1
## AHRQ-4     0     0      1      1      1      0

list.edge.attributes(d)

## [1] "collab" "na"

as.sociomatrix(d,attrname="collab")[1:6,1:6]

##        ACF-1 ACF-2 AHRQ-1 AHRQ-2 AHRQ-3 AHRQ-4
## ACF-1      0     1      0      0      0      0
## ACF-2      1     0      0      0      0      0
## AHRQ-1     0     0      0      3      3      3
## AHRQ-2     0     0      3      0      3      2
## AHRQ-3     0     0      3      3      0      3
## AHRQ-4     0     0      3      2      3      0

The summary of the network object tells us that there are 447 ties in the DHHS network. We can easily see the distribution of tie values.

table(d %e%"collab")

##
##   1   2   3   4
## 163 111  94  79

This indicates that of the 447 ties, 163 involve sharing information only (1), 111 are informal collaboration (2), 94 are formal on a single project (3), and the final 79 ties are between DHHS members who have worked together formally on multiple projects (4).

Now we can filter the edges to only include formal collaboration ties. This takes three steps. First, a valued sociomatrix is created that contains the tie values stored in the ‘collab’ edge attribute.
Then we filter out the ties that we want to ignore. In this case the ties that are coded 1 and 2 are replaced with 0s. Then, we create a new network based on the filtered sociomatrix. The key here is that a tie will be created anywhere a non-zero value is found in d.val. Also, by using the ignore.eval and names.eval options we store the retained edge values in an edge attribute called ‘collab.’

d.val <- as.sociomatrix(d,attrname="collab")
d.val[d.val < 3] <- 0
d.filt <- as.network(d.val, directed=FALSE,
                     matrix.type="a",ignore.eval=FALSE,
                     names.eval="collab")

We can see that the new network has the same number of actors, but only 173 ties (corresponding to the original numbers for the 3 and 4-levels of collab). Also, not surprisingly, the density is now much lower.

summary(d.filt,print.adj=FALSE)

## Network attributes:
##   vertices = 54
##   directed = FALSE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##  total edges = 173
##    missing edges = 0
##    non-missing edges = 173
##  density = 0.121
##
## Vertex attributes:
##   vertex.names:
##    character valued attribute
##    54 valid vertex names
##
## Edge attributes:
##
##  collab:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    3.00    3.00    3.00    3.46    4.00    4.00

gden(d.filt)

## [1] 0.121

Now when the network is plotted we can examine a smaller set of ties for important structural information (Fig. 3.6).

op <- par(mar = rep(0, 4))
gplot(d.filt,gmode="graph",displaylabels=TRUE,
      vertex.col="lightblue",vertex.cex=1.3,
      label.cex=0.4,label.pos=5,
      displayisolates=FALSE)
par(op)

Note that the gplot() function itself has a limited ability to display only the ties that exceed some lower threshold, using the thresh option. For example, this command will display the same network as the previous code, without having to go through the steps to create a new filtered network (results not shown here).
Note that for this to work a valued sociomatrix has to be passed to gplot, not an actual network object.

op <- par(mar = rep(0, 4))
d.val <- as.sociomatrix(d,attrname="collab")
gplot(d.val,gmode="graph",thresh=2,
      vertex.col="lightblue",vertex.cex=1.3,
      label.cex=0.4,label.pos=5,
      displayisolates=FALSE)
par(op)

Fig. 3.6 DHHS formal collaborations

3.4.2 Transforming a Directed Network to a Non-directed Network

It is often the case that even though the raw data in a network analysis is made up of directed ties, the analyst wishes to consider the data as non-directed. This could happen for several reasons. First, although the network relationship is non-directed, the data collection procedures may result in directed ties. For example, in a survey of collaboration among organizational representatives, even though collaboration is non-directed (if agency A is collaborating with agency B, then B is also collaborating with A), the raw data matrices are not likely to be perfectly symmetric. That is, in self-report data, there may be error in the data or pairs of respondents may not agree with each other on collaboration status. Respondent A may believe that agency A collaborates with agency B, but respondent B may not believe the two agencies are collaborating. In any case, you end up with a directed network that you wish to ‘fix’ by transforming it into a non-directed network.

You may also wish to transform directed ties into non-directed ties for conceptual reasons, and not simply to fix data disagreements. For example, in studying trust relationships you collect data on directed perceptions of trust.
Here a tie between A and B indicates that A trusts B. This is directed, in the sense that just because A trusts B, that does not mean that B trusts A in return. However, you may wish to analyze this in a non-directed sense, where an edge exists between two actors if there is any trust relationship between the pair. So whether A trusts B, B trusts A, or even if there is a reciprocal trusting relationship between A and B, then you would treat A and B as having a trust relationship where you are ignoring the directionality of the trust.

For either of these reasons, R makes it easy to transform a directed network into a non-directed network. To do this you can use the symmetrize() function. The name of the function should remind you that when network data are stored in a sociomatrix, if the data are symmetric around the diagonal that indicates that the ties are non-directed.

net1mat <- symmetrize(net1,rule="weak")
net1mat

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    1    0    0
## [2,]    1    0    1    1    0
## [3,]    1    1    0    0    1
## [4,]    0    1    0    0    0
## [5,]    0    0    1    0    0

# directed=FALSE is needed so that the result is stored as non-directed
net1symm <- network(net1mat,matrix.type="adjacency",
                    directed=FALSE)
network.vertex.names(net1symm) <- c("A","B","C","D","E")
summary(net1symm)

## Network attributes:
##   vertices = 5
##   directed = FALSE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##  total edges = 5
##    missing edges = 0
##    non-missing edges = 5
##  density = 0.5
##
## Vertex attributes:
##   vertex.names:
##    character valued attribute
##    5 valid vertex names
##
## No edge attributes
##
## Network adjacency matrix:
##   A B C D E
## A 0 1 1 0 0
## B 1 0 1 1 0
## C 1 1 0 0 1
## D 0 1 0 0 0
## E 0 0 1 0 0

The symmetrize procedure is relatively straightforward, except it returns a sociomatrix (or, optionally, an edgelist). So we need to then turn it into a network object, as we have done previously. The ‘rule’ option gives you four different choices on how to symmetrize the ties.
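Two of these rules, ‘weak’ and ‘strong,’ reduce to element-wise logical operations on the sociomatrix and its transpose. The sketch below uses only base R and the example sociomatrix rebuilt by hand (the object names adj, weak, and strong are illustrative). It is meant to show the logic of the two rules, under the assumption of a binary sociomatrix, and not how symmetrize() is implemented.

```r
# Example directed sociomatrix from Sect. 3.2.1 (rows send ties).
adj <- rbind(c(0,1,1,0,0),
             c(0,0,1,1,0),
             c(0,1,0,0,0),
             c(0,0,0,0,0),
             c(0,0,1,0,0))
rownames(adj) <- colnames(adj) <- c("A","B","C","D","E")

# 'weak' rule: keep a tie if it exists in either direction (logical OR).
weak <- (adj | t(adj)) * 1

# 'strong' rule: keep a tie only if it is reciprocated (logical AND).
strong <- (adj & t(adj)) * 1

# Both results are symmetric around the diagonal.
isSymmetric(weak)
isSymmetric(strong)

# Each non-directed tie appears twice in a symmetric matrix.
n_weak   <- sum(weak) / 2
n_strong <- sum(strong) / 2
```

For the example network the weak rule keeps five ties, matching the symmetrized matrix shown above, while the strong rule keeps only the single reciprocated tie between B and C.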
The ‘weak’ rule corresponds to a Boolean ‘OR’ condition where a tie is created between nodes i and j if there is a directed tie either from i to j or from j to i. There is also a ‘strong’ rule, corresponding to a Boolean ‘AND’ where a tie is created between i and j only if there are directed ties from i to j and j to i. This creates a symmetric network where the only ties preserved are the fully reciprocated ties.

Part II Visualization

Chapter 4 Basic Network Plotting and Layout

Above all else, show the data. (Edward R. Tufte, The Visual Display of Quantitative Information)

4.1 The Challenge of Network Visualization

As suggested in Chap. 2, producing and examining a network plot is often one of the first steps in network analysis. The overall purpose of a network graphic (as with any information graphic) is to highlight the important information contained in the underlying data. However, there are innumerable ways to visually lay out network nodes and ties in two-dimensional space, and to use graphical elements (e.g., node size, line color, figure legend, etc.) to communicate the story in the network data. In the next three chapters we go over basic principles of effective network graph design, and how to produce effective network visualizations in R.

An effective network graphic will convey the important information in a social network, such as the overall structure, location of important actors in the network, presence of distinctive subgroups, etc. At the same time, the graphic should do its best to minimize irrelevant information. For example, tie length in a network graphic is arbitrary in the sense that the length of a tie is not meaningful. An effective network figure will be designed and laid out in a way that minimizes the chance that a viewer will misinterpret the meaning of tie lengths.
The purpose of this chapter is to introduce basic plotting techniques for networks in R, and discuss the various options for specifying the layout of the network on the screen or page. The following example shows how interpretation of a network graphic can be impeded or enhanced by its basic layout.

data(Moreno)
op <- par(mar = rep(0, 4),mfrow=c(1,2))
plot(Moreno,mode="circle",vertex.cex=1.5)
plot(Moreno,mode="fruchtermanreingold",vertex.cex=1.5)
par(op)

Fig. 4.1 Same network, different layouts

At first glance it may appear that the figures are showing two quite different networks. In fact, they are two different visual representations of the same underlying social network, in this case the friendship ties among a 4th grade class. Despite representing the same network data, the right-hand figure is easier for us to interpret. In particular, it is much easier to see that the network is made up of two separate components, and that the large component has two fairly distinct cohesive subgroups. That is, the important structural characteristics of the network are easier to determine with the second layout compared to the first.

Although it is possible to lay out a network in 3D-space, the vast majority of network visualizations are two-dimensional. Nodes are represented by shapes, typically circles, and ties are represented by straight or sometimes curved lines. The lines themselves can be tricky to interpret for somebody new to network visualization. In particular, the length of the line has no real meaning. Consider the following two graphs, which display the same simple network (Fig. 4.2). At a quick glance it might appear that node D is further away from B and C in the second graph.
But the ties simply indicate which nodes are adjacent to one another, so the length of each line does not communicate any substantive information.

Fig. 4.2 Line length is arbitrary

However, as the Moreno 4th grade friendship network example illustrated (Fig. 4.1), despite the arbitrary nature of some of the layout elements, the way a network is depicted in a graphic can enhance or obscure other important structural information. This is the fundamental challenge of network visualization: to reveal important structural characteristics of the network without distortion or, as Edward Tufte stated, “The minimum we should hope for with any display technology is that it should do no harm” (Tufte 1990).

4.2 The Aesthetics of Network Layouts

Although there are not in fact an infinite number of ways to display a network on a screen, the number of possibilities might as well be. (For example, consider a moderate sized network of 50 nodes, and a display grid of 10 by 10. In actuality, the display grid would be much larger than this. The first node in the network could be placed in any one of 100 positions, the 2nd node in 99 positions, and so on. In this example, there are 100!/50!, or about 3.1 × 10^93, different possible network layouts.) Most of the possibilities will produce ugly or confusing layouts, so there must be some way to pick a layout that has a better than average chance of being visually acceptable.

Fortunately, network and visualization scientists have studied what makes network graph layouts easier to understand and interpret. What has emerged from this line of work is a set of aesthetic principles that can be used to more effectively display networks. Network graphics are easier to understand if they follow as much as possible the following five guidelines:

• Minimize edge crossings.
• Maximize the symmetry of the layout of nodes.
• Minimize the variability of the edge lengths.
• Maximize the angle between edges when they cross or join nodes.
• Minimize the total space used for the network display.

A large number of approaches have been developed for automatic layout of network graphics. One general class of algorithms, called force-directed, has proven to be a flexible and powerful approach to automatic network layouts. These algorithms work iteratively to minimize the total energy in a network, where the energy can be defined in a number of ways. A popular approach is to have connected nodes have a spring-like attractive force, while simultaneously assigning repulsive forces to all pairs of nodes. The springs in this algorithm act to pull connected nodes closer to one another, while the repulsive forces push unconnected nodes away from each other. The resulting network system will move around and oscillate for a while before settling into a steady state that tends to minimize the energy in the network system. This describes how the algorithm works, but the remarkable feature is that the resulting network graph tends to produce displays that are aesthetically pleasing, in the sense described above (Fruchterman and Reingold 1991).

To see the positive results of using one of these algorithms, consider the comparison in Fig. 4.3. On the left-hand side, the Moreno network is displayed randomly. On the right-hand side we are using the Fruchterman-Reingold algorithm for the network display. Fruchterman and Reingold introduced one of the first force-directed network display algorithms, and it is still very widely used. In fact, it is the default algorithm used by the statnet network plotting functions. On the right-hand side the nodes are displayed more symmetrically, there are relatively fewer edge crossings, and the tie lengths are more uniform. All of this makes it easier to interpret the structural information contained in the network.
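The attraction and repulsion scheme described above can be sketched in a few lines of base R. The function below, toy_fr_layout(), is a hypothetical name introduced only for illustration; it is a simplified variant in the spirit of Fruchterman and Reingold, with assumed constants for the ideal distance, the starting step size, and the cooling rate, and it is not the implementation used by statnet. All pairs of nodes repel each other, tied nodes attract, and the maximum step shrinks at each iteration so that the layout settles.

```r
# Toy force-directed layout (illustrative only; not the statnet code).
toy_fr_layout <- function(adj, n_iter = 50, seed = 1) {
  set.seed(seed)                       # random start, made reproducible
  n <- nrow(adj)
  und <- (adj | t(adj)) * 1            # treat ties as non-directed
  pos <- matrix(runif(2 * n, -1, 1), ncol = 2)
  k <- sqrt(4 / n)                     # assumed ideal distance: sqrt(area / n)
  temp <- 0.5                          # maximum step size, cooled each round
  for (it in seq_len(n_iter)) {
    disp <- matrix(0, n, 2)            # net displacement for each node
    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        if (i == j) next
        delta <- pos[i, ] - pos[j, ]
        d <- max(sqrt(sum(delta^2)), 1e-6)
        # Repulsion between every pair of nodes.
        disp[i, ] <- disp[i, ] + (delta / d) * (k^2 / d)
        # Spring-like attraction along ties.
        if (und[i, j] == 1) {
          disp[i, ] <- disp[i, ] - (delta / d) * (d^2 / k)
        }
      }
    }
    # Move each node along its displacement, capped by the temperature.
    len <- pmax(sqrt(rowSums(disp^2)), 1e-6)
    pos <- pos + disp / len * pmin(len, temp)
    temp <- temp * 0.95
  }
  pos
}

# Example sociomatrix from Sect. 3.2.1, rebuilt by hand.
adj <- rbind(c(0,1,1,0,0),
             c(0,0,1,1,0),
             c(0,1,0,0,0),
             c(0,0,0,0,0),
             c(0,0,1,0,0))
xy <- toy_fr_layout(adj)
xy
```

The result is a matrix with one row per node and two columns of coordinates. Changing the seed argument changes the random starting positions, which is one reason why force-directed layouts generally differ from run to run.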
op <- par(mar = c(0,0,4,0),mfrow=c(1,2))
gplot(Moreno,gmode="graph",mode="random",
      vertex.cex=1.5,main="Random layout")
gplot(Moreno,gmode="graph",mode="fruchtermanreingold",
      vertex.cex=1.5,main="Fruchterman-Reingold")
par(op)

Fig. 4.3 Moreno network: random vs. Fruchterman-Reingold layouts

As stated above, a force-directed algorithm works by iteratively adjusting the overall network layout until some measure of overall network energy is minimized. The details of this are usually not of interest, but to see how this works in practice consider Fig. 4.4, which displays the Bali terrorist network. Starting from a circle layout, it shows how the Fruchterman-Reingold layout algorithm works through successive iterations, from 0 (the starting circle) to 50.

The Fruchterman-Reingold algorithm, like other force-directed approaches, is iterative and non-deterministic. That means that each time you run the plotting algorithm you will not get the exact same layout. However, you will get a layout that tends to be symmetrical, minimize edge crossings, etc.

Fig. 4.4 Iterative Fruchterman-Reingold algorithm (panels: Iteration = 0, 1, 5, 10, 25, 50)

4.3 Basic Plotting Algorithms and Methods

Network visualization in statnet is handled by two closely related functions, plot and gplot. The latter has more layout options, so it may be more generally useful. To use a different layout algorithm, it is as simple as specifying the appropriate layout with the mode option. Figure 4.5 shows six of the layout options for the gplot function.
op <- par(mar=c(0,0,4,0),mfrow=c(2,3))
gplot(Bali,gmode="graph",edge.col="grey75",
      vertex.cex=1.5,mode='circle',main="circle")
gplot(Bali,gmode="graph",edge.col="grey75",
      vertex.cex=1.5,mode='eigen',main="eigen")
gplot(Bali,gmode="graph",edge.col="grey75",
      vertex.cex=1.5,mode='random',main="random")
gplot(Bali,gmode="graph",edge.col="grey75",
      vertex.cex=1.5,mode='spring',main="spring")
gplot(Bali,gmode="graph",edge.col="grey75",
      vertex.cex=1.5,mode='fruchtermanreingold',
      main='fruchtermanreingold')
gplot(Bali,gmode="graph",edge.col="grey75",
      vertex.cex=1.5,mode='kamadakawai',
      main='kamadakawai')
par(op)

Fig. 4.5 Network layout options (panels: circle, eigen, random, spring, fruchtermanreingold, kamadakawai)

4.3.1 Finer Control Over Network Layout

The layout options provided in statnet (and igraph, see below) work algorithmically or heuristically, usually with some randomness. So, even with the same layout option, a different graphic layout will be produced each time the network is plotted. Fortunately, R provides a way to have exact control over the layout coordinates. This allows for exact positioning, or saving the layout coordinates after a particular network is plotted.

The coord option in the plot function is used for this. This option expects a matrix with two columns. Each row corresponds to one node, the first column gives the X coordinate, and the second column gives the Y coordinate. Also, the results of a plotting function can be saved to an object, which will contain the coordinates of the produced plot. This is demonstrated in the next example. Here, we produce an initial plot of the Bali network, saving the coordinates. We then stretch out the layout of the graph by multiplying the Y coordinates by a constant. Both plots are shown in the figure, along with axes to make it easier to see how the coordinates have changed (Fig. 4.6).
There are many other ways to use specific coordinates, but the main use is to preserve a particular layout for future production and examination.

mycoords1 <- gplot(Bali,gmode="graph",
                   vertex.cex=1.5)
mycoords2 <- mycoords1
mycoords2[,2] <- mycoords1[,2]*1.5
mycoords1

##             x     y
##  [1,]  -6.299 11.84
##  [2,]  -3.887 13.80
##  [3,]  -8.355  9.89
##  [4,]  -4.672 10.28
##  [5,]  -8.537 11.65
##  [6,]  -5.932 12.63
##  [7,]  -2.420 12.96
##  [8,]  -7.694 10.09
##  [9,]  -8.334 10.86
## [10,]  -0.935  8.08
## [11,]  -3.015  6.98
## [12,]  -1.863  7.10
## [13,]  -1.094  9.30
## [14,]  -2.061  8.51
## [15,]  -7.715 12.83
## [16,] -10.453 11.25
## [17,]  -8.357 12.43

mycoords2

##             x    y
##  [1,]  -6.299 17.8
##  [2,]  -3.887 20.7
##  [3,]  -8.355 14.8
##  [4,]  -4.672 15.4
##  [5,]  -8.537 17.5
##  [6,]  -5.932 18.9
##  [7,]  -2.420 19.4
##  [8,]  -7.694 15.1
##  [9,]  -8.334 16.3
## [10,]  -0.935 12.1
## [11,]  -3.015 10.5
## [12,]  -1.863 10.7
## [13,]  -1.094 14.0
## [14,]  -2.061 12.8
## [15,]  -7.715 19.2
## [16,] -10.453 16.9
## [17,]  -8.357 18.6

op <- par(mar=c(4,3,4,3),mfrow=c(1,2))
gplot(Bali,gmode="graph",coord=mycoords1,
      vertex.cex=1.5,suppress.axes = FALSE,
      ylim=c(min(mycoords2[,2])-1,max(mycoords2[,2])+1),
      main="Original coordinates")
gplot(Bali,gmode="graph",coord=mycoords2,
      vertex.cex=1.5,suppress.axes = FALSE,
      ylim=c(min(mycoords2[,2])-1,max(mycoords2[,2])+1),
      main="Modified coordinates")
par(op)

4.3.2 Network Graph Layouts Using igraph

The igraph package provides the user with a similar set of options for controlling the layouts of network graphics. The layout option is used to specify an existing layout function or refer to a set of vertex coordinates. See ?igraph.plotting for more information on plotting and layout options in igraph (Fig. 4.7).
detach(package:statnet)
library(igraph)
library(intergraph)
iBali <- asIgraph(Bali)
op <- par(mar=c(0,0,3,0),mfrow=c(1,3))
plot(iBali,layout=layout_in_circle,
     main="Circle")
plot(iBali,layout=layout_randomly,
     main="Random")
plot(iBali,layout=layout_with_kk,
     main="Kamada-Kawai")
par(op)

Fig. 4.6 Network layouts with modified coordinates (panels: Original coordinates, Modified coordinates)

Fig. 4.7 Network layout options with igraph (panels: Circle, Random, Kamada-Kawai)

Chapter 5 Effective Network Graphic Design

As with any graphic, networks are used in order to discover pertinent groups or to inform others of the groups and structures discovered. It is a good means of displaying structures. However, it ceases to be a means of discovery when the elements are numerous. The figure rapidly becomes complex, illegible and untransformable. (Jacques Bertin)

5.1 Basic Principles

Achieving effective network graphic design is not that different from any other type of information graphic. As Edward Tufte pointed out in his seminal The Visual Display of Quantitative Information, “Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.” Network graphics actually start out with an important advantage in that they typically have a high information/ink ratio.

The goal for any network graphic design should be to produce a figure that reveals the important or interesting information that is contained in the network data. To do this, the analyst must make decisions about every graphical element that can appear in the figure.
R, and the plotting functions contained in the statnet and igraph packages, give the analyst almost complete programmatic control over the appearance of the network graphic. The purpose of this chapter is to walk through many of the most useful design elements in network graphics, and discuss how to use them and why they should be used in certain ways.

5.2 Design Elements

Like any other type of information graphic, network visualizations are made up of a large number of distinct visual elements. These individual elements include things that are distinctive to network graphics, such as nodes and ties, as well as other elements common to most graphics, such as titles, legends, etc. The plotting functions in statnet and igraph provide a great deal of programmatic control to the user. Although a simple call to a plotting function is enough to produce a default network graphic, it is almost always the case that you will need to take time to set appropriate function options and develop some additional R code to produce an effective graphic. Some design decisions will be made on aesthetic grounds, while many others will be based on the most important pattern or story that you wish to convey with the graphic and that is supported by the underlying network data. The following sections present a quick tour of the most commonly used individual graphing elements for network visualizations. They will each be covered on their own in turn.

© Springer International Publishing Switzerland 2015. D.A. Luke, A User’s Guide to Network Analysis in R, Use R!, DOI 10.1007/978-3-319-23883-8_5

5.2.1 Node Color

By default, statnet produces a network graphic with red circles as nodes. To designate a different color, the vertex.col option is used in the gplot function. (The gmode option is also used here to tell statnet not to handle Bali as a directed graph.) So, for example, it is a simple matter to produce a plot with attractive light blue nodes (Fig.
5.1).

data(Bali)
gplot(Bali,vertex.col="slateblue2",gmode="graph")

Fig. 5.1 Bali network with light blue nodes

In general, all of the basic color-handling options of R are available for plotting networks. This opens up a lot of power and flexibility for graphic design, but to use color effectively will require some homework. In particular, it will be useful to read more in-depth treatments of color use in R (e.g., Murrell 2005). As the above example suggests, a color can be designated by its color name. To see all of the 657 possible color names recognized by R, use the colors() command. In addition to specifying the name, colors can be selected using Red-Green-Blue (RGB) triplets of intensities. Alternatively, the RGB specification can also be provided using a hexadecimal string of the form ‘#RRGGBB’, where each of the RR, GG, and BB parts of the string is a hexadecimal number that provides the red, green or blue intensity ranging from 00 to FF.

The following code will produce the same network graphic with the same light blue nodes (figure not shown), showing how you can obtain colors using the rgb and hexadecimal approaches. To get the appropriate rgb values for a particular color name, you can use the col2rgb() function. The hexadecimal codes were obtained at http://www.javascripter.net/faq/rgbtohex.htm.

col2rgb('slateblue2')
gplot(Bali,vertex.col=rgb(122,103,238,
      maxColorValue=255),gmode="graph")
gplot(Bali,vertex.col="#7A67EE",gmode="graph")

One less common color feature in R can come in handy for network diagrams, especially with large networks where the nodes overlap in the graphic. Normally, colors are fully opaque, so overlapping nodes in a graphic will lead to large color ‘blobs’ where it is hard to distinguish the nodes. However, it is possible to make use of partially transparent colors, using a built-in alpha transparency channel.
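The three ways of specifying a color described above can be cross-checked in base R, and the same rgb() call also accepts an alpha value. The following is a minimal sketch rather than part of the book's workflow: the variable names are hypothetical, and only the base functions col2rgb() and rgb() are used.

```r
# Look up the RGB intensities (0-255) behind a named color.
slate_rgb <- col2rgb("slateblue2")

# Rebuild the color from its RGB triplet; rgb() returns the hex string.
slate_hex <- rgb(slate_rgb["red", 1], slate_rgb["green", 1],
                 slate_rgb["blue", 1], maxColorValue = 255)

# The name, the triplet, and the hex string all describe the same color.
same_color <- all(col2rgb(slate_hex) == slate_rgb)

# Supplying alpha yields an eight-digit string (#RRGGBBAA); alpha is
# measured on the same scale as maxColorValue.
faded_blue <- rgb(0, 0, 139, alpha = 80, maxColorValue = 255)

print(slate_hex)
print(same_color)
print(faded_blue)
```

The eight-digit string is an ordinary R color value, so it can be supplied wherever a color name or six-digit hexadecimal string is accepted.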
The rgb() function can be used to specify the amount of transparency through its alpha argument, which runs from 0 (fully transparent) to the maximum color value (fully opaque). That maximum is 1 by default, and 255 when maxColorValue=255 is set as in the code below. See ?rgb for more details. Figure 5.2 shows the difference when using the alpha transparency color channel. Both graphics show the same random network of 300 nodes. (The layouts are different because each plot runs the Fruchterman-Reingold force-directed algorithm separately, from a random starting configuration.) The figure on the left is using a fully opaque dark blue color. The figure on the right is still using dark blue, but with an alpha value of 80 out of 255, or approximately 30 %. The overlapping nodes are much easier to see when transparent colors are used. Note that some graphics devices in R may not support transparent colors.

ndum <- rgraph(300,tprob=0.025,mode="graph")
op <- par(mar = c(0,0,2,0),mfrow=c(1,2))
gplot(ndum,gmode="graph",vertex.cex=2,
      vertex.col=rgb(0,0,139,maxColorValue=255),
      edge.col="grey80",edge.lwd=0.5,
      main="Fully opaque")
gplot(ndum,gmode="graph",vertex.cex=2,
      vertex.col=rgb(0,0,139,alpha=80,
                     maxColorValue=255),
      edge.col="grey80",edge.lwd=0.5,
      main="Partly transparent")
par(op)

Fig. 5.2 Alpha transparency channel example (panels: Fully opaque; Partly transparent)

In these previous examples, every node has the same color. A more important use of color is to communicate some characteristic of the node or network by having different nodes have different colors. Specifically, information stored in a categorical node attribute can often be communicated through judicious node color choices.

For example, the Bali terrorist network has the role vertex attribute which stores the categorical description of the role that each member played in the network. CT means a member of the command team, BM is a bomb maker, etc. (See ?Bali for more information.) Node colors can be used to effectively distinguish the network member roles. Since this information is already stored in a vertex attribute, statnet can use this to automatically pick node colors.
(This is only true for plot(), not gplot().)

rolelab <- get.vertex.attribute(Bali,"role")
op <- par(mar=c(0,0,0,0))
plot(Bali,usearrows=FALSE,vertex.cex=1.5,label=rolelab,
     displaylabels=TRUE,vertex.col="role")
par(op)

Figure 5.3 displays the Bali network with nodes colored according to the role each member played in the network. (The node labels are also printed out to facilitate interpretation. Node labeling will be discussed in Sect. 5.2.4.) The network is much more interpretable by using color coding in this way. For example, we can more easily understand the subgroup structure by noting the greater density among the members of Team Lima (TL, cyan), as well as among the bombmakers (BM, black).

However, by simply using the name of an existing vertex attribute, statnet picks node colors from the existing default color palette in R. Viewing this palette, we can see that “BM” is assigned to the color black because “BM” comes first alphabetically in the role attribute string, and black is the first entry in the color palette. “CT” comes second, so it is assigned red, which is the second entry in the palette, and so on.

palette()

## [1] "black"   "red"     "green3"  "blue"
## [5] "cyan"    "magenta" "yellow"  "gray"

Using the default palette has a number of disadvantages. First, it is limited to eight colors. (R will cycle through the set of eight colors if there are more than eight types of nodes to color.) Second, the default palette starts with black, which is often not a good color choice to include with other colors in a network graphic. More generally, the default colors do not represent an aesthetically pleasing or useful set of colors for displaying categorical classifications. A more flexible, and usually aesthetically more satisfying, approach is to set up your own color palette and then index into it for color selection.
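The idea of indexing into a palette can be tried out with base R alone before any network is plotted. The sketch below is illustrative only: the role vector is a hypothetical six-node example, not the real Bali data, and the color names are arbitrary choices.

```r
# Hypothetical role labels for six nodes (not the real Bali data).
role <- c("CT", "BM", "TL", "BM", "OA", "TL")

# A user-chosen palette with one color per category.
my_colors <- c("darkorange", "steelblue", "forestgreen", "purple")

# A factor stores integer codes (levels sorted alphabetically: BM, CT,
# OA, TL), so indexing the palette with it picks one color per node.
role_factor <- as.factor(role)
node_colors <- my_colors[role_factor]
print(node_colors)

# Indexing an unnamed palette with the raw character vector matches
# nothing and returns only missing values.
print(my_colors[role])
```

A vector built this way has one entry per node, which is the form a per-node color option expects.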
The RColorBrewer package provides a number of predesigned color palettes that are very useful when using color to distinguish between a relatively small set of categories. For more information, see ?RColorBrewer. More details on the ColorBrewer system are available at http://www.colorbrewer.org. (There are many other color and palette picking options available, for example see the interactive palette chooser at http://paletton.com.)

Fig. 5.3 Bali network colored by role

library(RColorBrewer)
display.brewer.pal(5, "Dark2")

Fig. 5.4 A set of five colors chosen from an RColorBrewer palette (Dark2, qualitative)

In the following code, a user-defined palette is created by selecting five colors from a larger palette called Dark2 provided by RColorBrewer. Once the palette has been defined, it can be used in the network plotting call. This approach produces more pleasing sets of colors, and is much more flexible than relying on the default color palette (Fig. 5.4).

Note that we convert the role vertex attribute character vector to a factor so that the indexing will work. A numeric vector stored as a vertex attribute, by contrast, does not have to be turned into a factor. In other words, the indexing works with factors or numeric vectors, but not character vectors (Fig. 5.5).

my_pal <- brewer.pal(5,"Dark2")
rolecat <- as.factor(get.vertex.attribute(Bali,"role"))
plot(Bali,vertex.cex=1.5,label=rolelab,
     displaylabels=TRUE,vertex.col=my_pal[rolecat])

Fig. 5.5 Bali network with better colors

5.2.2 Node Shape

In addition to using color to distinguish between different types of nodes, statnet can be directed to use different shapes for the nodes. This is mainly useful when there is a small number of node types.
It is also particularly useful for situations where you will not be able to use color to distinguish nodes (or help viewers who may be color-blind). Unfortunately, statnet has only a limited ability to distinguish nodes by shapes, by designating the number of sides used to plot the node polygon (normally, the number of sides is 50, which produces a circle). If the number of sides is 3 you get a triangle, 4 a square, and so on. This is only useful for a very small number of node types (Fig. 5.6). If you have a particular need to use node shapes in a network graphic, igraph is much more flexible in this regard. See Sect. 9.2.3 for an igraph plotting example with different node shapes.

op <- par(mar=c(0,0,0,0))
sidenum <- 3:7
plot(Bali,usearrows=FALSE,vertex.cex=4,
     displaylabels=FALSE,vertex.sides=sidenum[rolecat])
par(op)

Fig. 5.6 Bali network with different node shapes

5.2.3 Node Size

Network node sizes are controlled by the vertex.cex option in the statnet plot and gplot functions (similar to how sizes of graphic elements are controlled in the base R graphics system). The overall sizes of the nodes should be set so that the nodes are large enough to be distinguishable, but small enough that they do not extensively overlap. In the following ‘Goldilocks’ example, we can see how vertex.cex can be adjusted to find an effective node size (Fig. 5.7).

op <- par(mar = c(0,0,2,0),mfrow=c(1,3))
plot(Bali,vertex.cex=0.5,main="Too small")
plot(Bali,vertex.cex=2,main="Just right")
plot(Bali,vertex.cex=6,main="Too large")
par(op)

Rather than setting the same overall size for every node, it is often useful to use the node size in a network graphic to communicate some important quantitative characteristic. For example, nodes vary in their positions in the overall network. Some nodes are very central, while others are more peripheral.
Chapter 7 discusses node prominence and centrality in more detail, but for now we will simply calculate some node characteristics such that larger numbers indicate more central nodes. To set this up, we will calculate three different measures of node centrality. Each of these lines of code produces a vector of centrality measures for each node, and larger numbers indicate greater centrality.

deg <- degree(Bali,gmode="graph")
deg

##  [1]  9  4  9 15  9 10  3  9  9  5  5  5  5  5  9
## [16]  6  9

cls <- closeness(Bali,gmode="graph")
cls

##  [1] 0.696 0.552 0.696 0.941 0.696 0.727 0.533
##  [8] 0.696 0.696 0.571 0.571 0.571 0.571 0.571
## [15] 0.696 0.485 0.696

bet <- betweenness(Bali,gmode="graph")
bet

##  [1]  2.333  0.333  1.667 61.167  1.667  6.167
##  [7]  0.000  1.667  1.667  0.000  0.000  0.000
## [13]  0.000  0.000  1.667  0.000  1.667

Fig. 5.7 Adjusting overall node size (panels: Too small; Just right; Too large)

Once you have this node-level vector of quantitative information, it can be used to set the relative sizes of the nodes. This is done by using the same vertex.cex option as before, but instead of assigning a single number we assign the vector of node information.

op <- par(mar = c(0,0,2,1),mfrow=c(1,2))
plot(Bali,usearrows=TRUE,vertex.cex=deg,main="Raw")
plot(Bali,usearrows=FALSE,vertex.cex=log(deg),
     main="Adjusted")
par(op)

However, as we can see by comparing the two panels in Fig. 5.8, the raw numbers in the deg vector produce nodes that are much too large. They need to be adjusted, and in this case we get usable sizes by taking the log of the deg values. This results in node sizes where we can more easily see the nodes with higher degree relative to nodes with lower degree.

Fig. 5.8 Adjusting relative node size – Example 1 (panels: Raw; Adjusted)

The next two examples show other types of adjustments that might be necessary when setting relative node sizes.
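How strongly a transformation compresses the node sizes can be checked numerically before anything is plotted. The sketch below is only an illustration: deg_vals is a hypothetical plain vector holding the seventeen degree values reported above, so it runs without the Bali network object.

```r
# The seventeen degree values reported above, typed in as a plain vector.
deg_vals <- c(9, 4, 9, 15, 9, 10, 3, 9, 9, 5, 5, 5, 5, 5, 9, 6, 9)

# Range of sizes when the raw values are used as expansion factors.
raw_range <- range(deg_vals)

# Ranges after two common compressing transformations.
log_range  <- round(range(log(deg_vals)), 2)
sqrt_range <- round(range(sqrt(deg_vals)), 2)

print(raw_range)
print(log_range)
print(sqrt_range)
```

The log keeps the ordering of the nodes but shrinks the span from a factor of five to roughly a factor of two and a half, which is why the adjusted panel is easier to read.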
Using cls (closeness) we have the opposite problem from the previous example, where the node sizes start out too small. So an appropriate adjustment is to multiply the original values (Fig. 5.9). The bet vector (betweenness) provides a more complex example. First, the raw values vary across orders of magnitude (one node has a value of 61.2). In addition, some of the nodes have 0 for their bet values. These zeros would result in the nodes being plotted with 0 size, so we need to handle this by adding 1 to the entire vector before taking the square root (Fig. 5.10).

op <- par(mar = c(0,0,2,1),mfrow=c(1,2))
plot(Bali,usearrows=TRUE,vertex.cex=cls,main="Raw")
plot(Bali,usearrows=FALSE,vertex.cex=4*cls,
     main="Adjusted")
par(op)

op <- par(mar = c(0,0,2,1),mfrow=c(1,2))
plot(Bali,usearrows=TRUE,vertex.cex=bet,main="Raw")
plot(Bali,usearrows=FALSE,vertex.cex=sqrt(bet+1),
     main="Adjusted")
par(op)

Fig. 5.9 Adjusting relative node size – Example 2 (panels: Raw; Adjusted)

Fig. 5.10 Adjusting relative node size – Example 3 (panels: Raw; Adjusted)

The adjustments for relative node sizes can be tedious, although R does give you complete control over how to adjust the sizes. The following function can be used to save some time when figuring out the best node sizes. The function rescale() takes a vector of node characteristics (it can actually be any numeric vector), and rescales the values to fit between the low and high values.

rescale <- function(nchar,low,high) {
  min_d <- min(nchar)
  max_d <- max(nchar)
  # linear map: min_d goes to low, max_d goes to high
  rscl <- ((high-low)*(nchar-min_d))/(max_d-min_d)+low
  rscl
}

The next plot shows how the function works and rescales the raw degree values for the Bali network to set the node sizes to vary from one to six (Fig. 5.11).

plot(Bali,vertex.cex=rescale(deg,1,6),
     main="Adjusted node sizes with rescale function.")

Fig.
5.11 Rescaling the node size based on degree

5.2.4 Node Label

A network graphic is often more interesting and easier to interpret if nodes are labelled so that the audience can see who or what makes up the network. This is particularly helpful for smaller networks; if networks get too large then the labels themselves may get in the way of the network information. If a network object in statnet contains the special vertex attribute vertex.names, then this can be used to automatically display node labels when plotting. Other characteristics of the node labels can be controlled such as font size, color, and distance from node (Fig. 5.12).

get.vertex.attribute(Bali,"vertex.names")

##  [1] "Muklas"   "Amrozi"   "Imron"    "Samudra"
##  [5] "Dulmatin" "Idris"    "Mubarok"  "Husin"
##  [9] "Ghoni"    "Arnasan"  "Rauf"     "Octavia"
## [13] "Hidayat"  "Junaedi"  "Patek"    "Feri"
## [17] "Sarijo"

op <- par(mar = c(0,0,0,0))
plot(Bali,displaylabels=TRUE,label.cex=0.8,
     pad=0.4,label.col="darkblue")
par(op)

Fig. 5.12 Bali network with labelled nodes.

The automatic labels based on information stored in the vertex.names attribute may not be the most important or useful information. For example, in the case of the Bali network the actual names of the terrorists are not that interesting to most viewers. Fortunately, you can use other text information to label the nodes. We saw an example of this earlier in Fig. 5.3. In this case we are using the text stored in the role vertex attribute to label the nodes. The key here is to use the label option to specify what text vector to use for the labels (Fig. 5.13).

rolelab <- get.vertex.attribute(Bali,"role")
plot(Bali,usearrows=FALSE,label=rolelab,
     displaylabels=TRUE,label.col="darkblue")

Fig.
5.13 Bali network with role labels

5.2.5 Edge Width

If your network data include valued ties, or in general any quantitative information that can be related to ties between nodes, then you can communicate that information visually by altering the width of the displayed ties in a network graphic. For example, the strength of friendship ties might be known, or the amount of money that flows between organizations in a directed network might be measured. In these cases, thicker ties can denote greater strength or greater flow (Fig. 5.14).

The Bali network includes a tie attribute called IC, which is a simple five-level ordinal scale that was used to measure the amount of interaction between members of the network. This attribute can be used to set the width of the ties in the network visualization. In the example below the IC values are extracted from the stored edge attribute. This allows us to transform the vector to better distinguish among the five IC levels (by multiplying the vector by 1.5).

op <- par(mar = c(0,0,0,0))
IClevel <- Bali %e% "IC"
plot(Bali,vertex.cex=1.5,
     edge.lwd=1.5*IClevel)
par(op)

5.2.6 Edge Color

While edge width can be set to communicate quantitative information about network ties, the color of the edge can be set to communicate qualitative information about the tie, similar to how node colors can be set. For example, you could use different colors of line graphics to distinguish between positive and negative ties in a social network (Fig. 5.15).

The Bali network does not contain categorical or qualitative information stored in an edge attribute, so here we create a random categorical vector to demonstrate how to use different edge colors in a network graphic. For this example, we set up a color palette that can be used to index the correct color choice, based on the categorical edge vector. In this case blue will be used for edge type #1, red for edge type #2, and green for edge type #3.
This might reflect neutral ties (blue), negative ties (red), and positive ties (green). (Also see Fig. 7.8 in Chap. 7 for a more realistic example of using different line colors.)

n_edge <- network.edgecount(Bali)
edge_cat <- sample(1:3,n_edge,replace=TRUE)
linecol_pal <- c("blue","red","green")
plot(Bali,vertex.cex=1.5,vertex.col="grey25",
     edge.col=linecol_pal[edge_cat],edge.lwd=2)

Fig. 5.14 Bali network with edge widths indicating amount of interaction

5.2.7 Edge Type

While edge width can be set to communicate quantitative information about network ties, the type of the edge can be set to communicate qualitative information about the ties. For example, you could use different types of line graphics to distinguish between positive and negative ties in a social network. The Bali network does not contain categorical or qualitative information stored in an edge attribute, so here we create a random categorical vector to demonstrate how to use different edge types in a network graphic. Here three different line types are used (2 = dashed; 3 = dotted; 4 = dotdash). Also, the different line types do not show up clearly using plot(), so gplot() is used here (Fig. 5.16).

Fig. 5.15 Bali network with different edge colors

n_edge <- network.edgecount(Bali)
edge_cat <- sample(1:3,n_edge,replace=TRUE)
line_pal <- c(2,3,4)
gplot(Bali,vertex.cex=0.8,gmode="graph",
      vertex.col="gray50",edge.lwd=1.5,
      edge.lty=line_pal[edge_cat])

Although this works as intended, the resulting graphic is not very attractive and (in my mind) is hard to interpret. Different line types should be used sparingly, and probably only for very small networks with only two different line types. Most published network graphics stick to color and maybe line width to distinguish among different types of network ties.

Fig.
5.16 Bali network with different edge types

5.2.8 Legends

The examples above show how network graphic elements such as node color, node shape, node size, edge type, and edge width can be used to communicate important characteristics of the network. As with other types of information graphics, it is often useful to provide a legend so that the meaning of this information is clear to the user. The basic plotting functions contained in statnet do not have built-in functionality for providing a network graphic legend. Fortunately, it is easy to use the legend() function provided by basic R to add a legend to a network graphic.

In the example below we replicate the network graphic from Fig. 5.5 but add a legend to provide the node color key. We also scale the node sizes to reflect node prominence based on degree. See ?legend for more details on how to use legends. Figure 5.17, in fact, serves as a nice example of a carefully designed network graphic that could be used as a final product. It uses node size, node color, and a legend to efficiently and clearly communicate the most important information contained in the Bali network.

my_pal <- brewer.pal(5,"Dark2")
rolecat <- as.factor(get.vertex.attribute(Bali,"role"))
plot(Bali,vertex.cex=rescale(deg,1,5),
     vertex.col=my_pal[rolecat])
legend("bottomleft",legend=c("BM","CT","OA","SB","TL"),
       col=my_pal,pch=19,pt.cex=1.5,bty="n",
       title="Terrorist Role")

Fig. 5.17 Bali network with legend

Chapter 6
Advanced Network Graphics

One eye sees, the other feels. (Paul Klee)

As the previous two chapters demonstrate, both statnet and igraph have sophisticated plotting capabilities that can produce a very wide variety of network graphics. However, these plotting functions cannot meet all of the analytic or presentation needs. In particular, network scientists may wish to produce more specialized network graphics.
Also, while statnet and igraph excel at producing high-quality publication-ready network graphics, these graphics are static. Fortunately, developers have started exploring how to take network graphics and deliver them to web-based platforms where users can interact with the diagrams. This chapter explores a few of these more specialized network graphic techniques, as well as demonstrating how to produce some simple web-based interactive network diagrams.

6.1 Interactive Network Graphics

One of the useful features of many other network analysis packages such as UCINet and Pajek is the ability to produce network diagrams that are interactive at some level. For example, in Pajek a network visualization can be produced in a separate ‘Draw’ window, and then the user can interact with that window in various ways to edit or change the network graphic. These capabilities can be very useful for exploring the network, as well as fine-tuning a network graphic for subsequent dissemination. Although R’s programmatic framework allows for detailed control over all the elements of a network graphic, this is generally not made available to the user in an interactive way. There are a few exceptions to this, as well as some new packages that allow for creating interactive network diagrams that can be published to the web. In this section a few of these options are demonstrated.

© Springer International Publishing Switzerland 2015. D.A. Luke, A User’s Guide to Network Analysis in R, Use R!, DOI 10.1007/978-3-319-23883-8_6

6.1.1 Simple Interactive Networks in igraph

The igraph package includes the tkplot() function which supports simple interactive network plots through a Tk graphics window. Only some features of the network graphics can be modified.
A typical use for this feature is to produce the interactive graphic, adjust the node positions to improve the network layout, save the node position coordinates and then use the coordinates to produce a final (noninteractive) network diagram. This work flow is illustrated below with the Bali network; see Chap. 8 for a more in-depth example with these data.

library(intergraph)
library(igraph)
data(Bali)
iBali <- asIgraph(Bali)
Coord <- tkplot(iBali, vertex.size=3,
                vertex.label=V(iBali)$role,
                vertex.color="darkgreen")
# Edit plot in Tk graphics window before
# running next two commands.
MCoords <- tkplot.getcoords(Coord)
plot(iBali, layout=MCoords, vertex.size=5,
     vertex.label=NA, vertex.color="lightblue")

6.1.2 Publishing Web-Based Interactive Network Diagrams

Instead of building interactive network graphics within R itself, more people are beginning to look at ways to produce interactive graphics that are published on the Web, using frameworks like the D3 JavaScript library (http://d3js.org/) and Shiny (http://shiny.rstudio.com/). None of these approaches has yet come close to matching what a fully-developed network graphics application such as Gephi can do. However, I anticipate that we will be seeing rapid development of more R-connected approaches to web-based network visualization in the next few years.

The networkD3 package is a small set of functions that can be used to build simple interactive network graphics that can be displayed in shiny-aware documents (i.e., RStudio) or in HTML web-pages. The following code shows how simple it is to produce an interactive graphic. The first set of lines will send a graphic to the Viewer window if you run the commands within RStudio. The simpleNetwork() function expects the network data in the form of an edgelist stored in a dataframe. (The output from the examples in this section is not shown here, because it requires RStudio or a web browser to view.)
library(networkD3)
src <- c("A","A","B","B","C","E")
target <- c("B","C","C","D","B","C")
net_edge <- data.frame(src, target)
simpleNetwork(net_edge)

To save the interactive network to a freestanding HTML file, use the following code.

net_D3 <- simpleNetwork(net_edge)
saveNetwork(net_D3,file = 'Net_test1.html',
            selfcontained=TRUE)

The output from simpleNetwork is so simple that it mainly is useful as a proof-of-concept or tech demo. Slightly more sophisticated network graphics can be produced using the forceNetwork() function. For this example, we are using the Bali network again. The function expects data to be passed to it in two data frames. The ‘links’ dataframe will have the network data in edgelist format. The ‘nodes’ dataframe will have the node id and properties of the nodes. Currently only a categorical grouping variable is allowed. If the nodes have numeric ids, they must start at 0. So, the main work to use the function is putting the data into the correct format.

iBali_edge <- get.edgelist(iBali)
# shift the vertex ids so that they start at 0
iBali_edge <- iBali_edge - 1
iBali_edge <- data.frame(iBali_edge)
iBali_nodes <- data.frame(NodeID=as.numeric(V(iBali)-1),
                          Group=V(iBali)$role,
                          Nodesize=(degree(iBali)))
forceNetwork(Links = iBali_edge, Nodes = iBali_nodes,
             Source = "X1", Target = "X2",
             NodeID = "NodeID",Nodesize = "Nodesize",
             radiusCalculation="Math.sqrt(d.nodesize)*3",
             Group = "Group", opacity = 0.8,
             legend=TRUE)

Once again, this can be saved to an external file. Be careful: you will get an error if you try to overwrite an existing file, even if it is not open in your browser.
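One way to sidestep the overwrite problem mentioned above is to delete any earlier copy of the output file just before saving. The sketch below is only an illustration in base R: clear_old_output() is a hypothetical helper, not part of networkD3, and the demonstration uses a throwaway file in the session's temporary directory rather than a real network export.

```r
# Hypothetical helper: delete a stale output file so that a later save
# starts from a clean slate. Returns TRUE if no file remains at 'path'.
clear_old_output <- function(path) {
  if (file.exists(path)) {
    file.remove(path)
  }
  invisible(!file.exists(path))
}

# Demonstration with a throwaway file in the temporary directory.
demo_file <- file.path(tempdir(), "Net_test_demo.html")
writeLines("<html></html>", demo_file)
cleared <- clear_old_output(demo_file)
print(cleared)
```

Calling such a helper with the target file name immediately before the save step keeps repeated runs of a script from stopping on a leftover file.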
net_D3 <- forceNetwork(Links = iBali_edge,
             Nodes = iBali_nodes,
             Source = "X1", Target = "X2",
             NodeID = "NodeID",Nodesize = "Nodesize",
             radiusCalculation="Math.sqrt(d.nodesize)*3",
             Group = "Group", opacity = 0.8,
             legend=TRUE)
saveNetwork(net_D3,file = 'Net_test2.html',
            selfcontained=TRUE)

The visNetwork package is a similar set of tools that uses the vis.js javascript library (http://visjs.org/) to produce web-based interactive network graphics. This package also requires network data to be provided in a nodes data frame and an edges data frame. The nodes data frame should include an id column, and the edges data frame should have from and to columns. Using the Bali network, the following code sets up the data and produces a minimal example of an interactive network graphic. As in the previous example, this code produces an interactive network in the Viewer window of RStudio.

library(visNetwork)
iBali_edge <- get.edgelist(iBali)
iBali_edge <- data.frame(from = iBali_edge[,1],
                         to = iBali_edge[,2])
iBali_nodes <- data.frame(id = as.numeric(V(iBali)))
visNetwork(iBali_nodes, iBali_edge, width = "100%")

The visNetwork package has a large number of options that can be used to control the appearance of the network diagram, as well as for controlling how the plot can be embedded in Shiny web applications. See the package help file for more information, as well as a more in-depth demonstration of its capabilities available at http://dataknowledge.github.io/visNetwork/. The next code shows off some of these options.

iBali_nodes$group <- V(iBali)$role
iBali_nodes$value <- degree(iBali)
net <- visNetwork(iBali_nodes, iBali_edge,
                  width = "100%",legend=TRUE)
visOptions(net,highlightNearest = TRUE)

First, some of the display options are controlled by saving node or edge information into the nodes or edges data frames.
Here, the group variable stores the ‘role’ attribute, and the value variable is used to store the node sizes (in this case, the degree). The visNetwork() and visOptions() functions are used to display the network, add a legend based on the grouping variable, set default colors for each group, and then allow the user to highlight individual nodes and their immediate neighbors when clicking on a node in the diagram. As before, these interactive plots will appear in the Viewer window if you are using RStudio.

Once the plot has been designed, it can be exported to a freestanding webpage or embedded in other web platforms (e.g., with Shiny). This last example shows how to save the plot in a separate web file, using the saveWidget() function from the htmlwidgets package, which is installed when you install the visNetwork package. This example adds a set of navigation buttons to the final network plot that allows moving the network and zooming in or out.

net <- visNetwork(iBali_nodes, iBali_edge,
                  width = "100%",legend=TRUE)
net <- visOptions(net,highlightNearest = TRUE)
net <- visInteraction(net,navigationButtons = TRUE)
library(htmlwidgets)
saveWidget(net, "Net_test3.html")

6.1.3 Statnet Web: Interactive statnet with shiny

As evidence of the rapid development of interactive network tools, the Statnet development team has recently published a web-based version of their R network analytic tools using the shiny web application framework. Statnet Web can be used by connecting directly to the shinyapps.io server at https://statnet.shinyapps.io/statnetWeb. Or, the tools can be run locally by installing the statnetWeb package. In addition to producing basic network plots by selecting parameters and options from drop-down boxes, statnetWeb can produce a variety of network statistics as well as fit and test ERGMs (see Chap. 11).
Although web-based statnet does not give as much control over or reproducibility of network analytic results as a programming approach does, it is an impressive platform for quickly exploring network characteristics and will be useful for teaching as well as disseminating network analytic results.

library(statnetWeb)
run_sw()

6.2 Specialized Network Diagrams

Traditionally, network diagrams are plotted to illustrate fundamental network and node properties such as prominence (see Chap. 4). However, there are a number of more specialized plotting techniques that are appropriate for highlighting other important or interesting aspects of networks. Three of these approaches are demonstrated in this section: arc diagrams, chord diagrams, and heatmaps.

6.2.1 Arc Diagrams

Arc diagrams can be used when the positioning of nodes in a network is of less interest than the pattern of ties. Here is a simple example of an arc diagram, using the arcdiagram package. Note that this package has to be installed from GitHub.

library(devtools)
install_github("gastonstat/arcdiagram")

The set-up for this example includes loading all the required libraries, then creating an edgelist object for the arcplot() function. For this example, we are using the Simpsons dataset, which contains a set of (fictitious) network data that shows the primary interaction ties between 15 of the characters on the Simpsons television show.

library(arcdiagram)
library(igraph)
library(intergraph)
data(Simpsons)
iSimp <- asIgraph(Simpsons)
simp_edge <- get.edgelist(iSimp)

A basic arc diagram can be produced with one function call (Fig. 6.1).

arcplot(simp_edge)

The arc diagram can be enhanced in a number of ways to highlight node and other network characteristics. Here we define some subgroups in the network (1 = family, 2 = work, 3 = school, 4 = neighborhood) and use colors to distinguish the groups (colors taken from a palette at colorbrewer2.org).
Also, the degree of each node is used to adjust its size (Fig. 6.2).

s_grp <- V(iSimp)$group
s_col = c("#a6611a", "#dfc27d","#80cdc1","#018571")
cols = s_col[s_grp]
node_deg <- degree(iSimp)
arcplot(simp_edge, lwd.arcs=2, cex.nodes=node_deg/2,
        labels=V(iSimp)$vertex.names,
        col.labels="darkgreen",font=1,
        pch.nodes=21,line=1,col.nodes = cols,
        bg.nodes = cols, show.nodes = TRUE)

Fig. 6.1 Simpsons contact network

Fig. 6.2 Simpsons contact network – Version 2

6.2.2 Chord Diagrams

Chord diagrams are a specialized type of information graphic that uses a circular layout to display the interrelationships between data in a matrix. They have become particularly popular in genetics research. Because network information can be organized in matrices, chord diagrams are an interesting graphic option for network plots. This is especially true for valued (weighted) and directed networks, where the amount and direction of the ‘flows’ are of interest.

The circlize package, by Zuguang Gu, implements a variety of circular graphics, including chord diagrams. The package has a lot of features, giving the user great control over the graphical appearance. The included vignette, ‘Circular visualization of matrix,’ is suggested reading.

In this example, we return to the network of the 2010 Netherlands World Cup soccer team. Although Fig. 1.2 shows the basic pattern of passing flows between the eleven members of the team, it ignores the number of passes (stored in the edge attribute passes). Here we will create a chord diagram to further examine these patterns.

The first steps are to load the required packages and prepare the data. The main requirement is to have the network data in the form of a sociomatrix, with the entries corresponding to the strength or size of the tie if it is a valued network. The matrix will also have to have names assigned for the rows and columns. (In this example we have an N×N matrix, so the names will be the same for rows and columns. The circlize package can also be used for N×k matrices, so chord diagrams will also be useful for 2-mode affiliation networks, such as those discussed in Chap. 9.)

library(statnet)
library(circlize)
data(FIFA_Nether)
FIFAm <- as.sociomatrix(FIFA_Nether,attrname='passes')
names <- c("GK1","DF3","DF4","DF5","MF6",
           "FW7","FW9","MF10","FW11","DF2","MF8")
rownames(FIFAm) = names
colnames(FIFAm) = names
FIFAm
##      GK1 DF3 DF4 DF5 MF6 FW7 FW9 MF10 FW11 DF2
## GK1    0  42  67  21   2  27   7    5    2  17
## DF3   30   0  44  14  42  15   8    7   10  36
## DF4   38  43   0  57  18  11   7   21    1   7
## DF5    6  14  47   0  11  50  20   40    1   4
## MF6    9  28  25  10   0  41  28   37   14  34
## FW7    4  12   1  21  21   0  15   33    9  25
## FW9    0   0   1   8   7  12   0   31   16   7
## MF10   1  11  11  22  43  29  20    0   28  13
## FW11   3   2   2   3   7   6  11   15    0  21
## DF2   29  38   8   3  45  38  10   18   26   0
## MF8   12  25  26  38  23  13  12   32   11  24
##      MF8
## GK1    3
## DF3   29
## DF4   28
## DF5   42
## MF6   21
## FW7   18
## FW9    2
## MF10  21
## FW11  12
## DF2   15
## MF8    0

The sociomatrix reveals a number of ties that have very low numbers of passes. To make the subsequent graphics a little easier to interpret, we drop all ties with fewer than ten passes.
FIFAm[FIFAm < 10] <- 0
FIFAm
##      GK1 DF3 DF4 DF5 MF6 FW7 FW9 MF10 FW11 DF2
## GK1    0  42  67  21   0  27   0    0    0  17
## DF3   30   0  44  14  42  15   0    0   10  36
## DF4   38  43   0  57  18  11   0   21    0   0
## DF5    0  14  47   0  11  50  20   40    0   0
## MF6    0  28  25  10   0  41  28   37   14  34
## FW7    0  12   0  21  21   0  15   33    0  25
## FW9    0   0   0   0   0  12   0   31   16   0
## MF10   0  11  11  22  43  29  20    0   28  13
## FW11   0   0   0   0   0   0  11   15    0  21
## DF2   29  38   0   0  45  38  10   18   26   0
## MF8   12  25  26  38  23  13  12   32   11  24
##      MF8
## GK1    0
## DF3   29
## DF4   28
## DF5   42
## MF6   21
## FW7   18
## FW9    0
## MF10  21
## FW11  12
## DF2   15
## MF8    0

With a sociomatrix that has names assigned, a basic chord diagram can be produced by a simple call to the chordDiagram() function (Fig. 6.3).

chordDiagram(FIFAm)

Chord diagrams can contain a lot of information, especially for larger networks, so it is usually important to fine tune the plot to highlight the most important information. In this next plot, a number of options are used to make the graphic a little easier to interpret. First, colors are set so that players in the same position (Forward, Midfielder, etc.) have the same color. Then, because this is a directed network, flows (passes, in this case) go in both directions. The directional option is used so that the departing passes start further away from the outer circle, making it easier to see the difference between passes sent and passes received. Finally, the order option is used to sort the players by their position.

pos_order <- c("GK1","DF2","DF3","DF4","DF5",
               "MF6","MF8","MF10","FW7",
               "FW9","FW11")
grid.col <- c("#AA3939",rep("#AA6C39",4),
              rep("#2D882D",3),rep("#226666",3))
# name the colors by player so that each color follows its position group
names(grid.col) <- pos_order
chordDiagram(FIFAm,directional = TRUE,
             grid.col = grid.col,
             order = pos_order)

In the resulting chord diagram (Fig. 6.4), it is much easier to see the patterns of passes among the players. We can see that FW7 receives more than twice as many passes as either of the other two forwards.
Similarly, we can see that the goalkeeper’s favorite target is DF4, and that DF4 likes to pass frequently to DF5.

Fig. 6.3 Chord diagram of Netherlands 2010 soccer team, with all default options

Fig. 6.4 Chord diagram of Netherlands 2010 soccer team, with advanced options

6.2.3 Heatmaps for Network Data

Heatmaps are another example of a specialized graphic that can be used for networks, especially valued or weighted networks. Here, a heatmap is produced to highlight the players who are passing or receiving the most among the Netherlands teammates. First, a sociomatrix is created with the cells reflecting the tie weight, in this case ‘passes.’ Row and column names are defined for the margin labels.

data(FIFA_Nether)
FIFAm <- as.sociomatrix(FIFA_Nether,attrname='passes')
colnames(FIFAm) <- c("GK1","DF3","DF4","DF5",
                     "MF6","FW7","FW9","MF10",
                     "FW11","DF2","MF8")
rownames(FIFAm) <- c("GK1","DF3","DF4","DF5",
                     "MF6","FW7","FW9","MF10",
                     "FW11","DF2","MF8")

Once the data are set up, the heatmap is relatively easy to produce (Fig. 6.5). The colorRampPalette() function is used to designate a color range that will be used for the low and high ends of the values in the sociomatrix. (The color ranges chosen here were taken from color chooser tools at paletton.com.) The network data are directed, so it is important to remember that here the rows are the ‘passers’ and the columns are the ‘receivers.’

palf <- colorRampPalette(c("#669999", "#003333"))
heatmap(FIFAm[,11:1],Rowv = NA,Colv = NA,col = palf(60),
        scale="none", margins=c(11,11) )

Fig. 6.5 Heatmap of Netherlands 2010 soccer team number of passes

The heatmap also shows the same pattern of heavy passers as Fig. 6.4. The darkest square is for the passes from the goalkeeper to DF4.

6.3 Creating Network Diagrams with Other R Packages

6.3.1 Network Diagrams with ggplot2

Although ggplot2 is not designed to handle all of the requirements of a full-fledged network visualization package, some of its advanced graphics capabilities can be used to create specialized network plotting routines. The following example is based on code developed by David Sparks, and posted on the blog he runs with his colleague Christopher DeSante, is.R() (http://is-r.tumblr.com). The edgeMaker() function and supporting code can be used to create attractive and functional plots of directed networks using ‘tapered-intensity-curved’ edges. The bulk of the work is done by the edgeMaker() function, which creates the curved ties between each connected dyad.
edgeMaker <- function(whichRow,len=100, curved = TRUE){
  fromC <- layoutCoordinates[adjacencyList[whichRow,1],]
  toC <- layoutCoordinates[adjacencyList[whichRow,2],]
  graphCenter <- colMeans(layoutCoordinates)
  bezierMid <- c(fromC[1], toC[2])
  distance1 <- sum((graphCenter - bezierMid)^2)
  if(distance1 < sum((graphCenter - c(toC[1],
                                      fromC[2]))^2)){
    bezierMid <- c(toC[1], fromC[2])
  }
  bezierMid <- (fromC + toC + bezierMid) / 3
  if(curved == FALSE){bezierMid <- (fromC + toC) / 2}
  edge <- data.frame(bezier(c(fromC[1], bezierMid[1], toC[1]),
                            c(fromC[2], bezierMid[2], toC[2]),
                            evaluation = len))
  edge$Sequence <- 1:len
  edge$Group <- paste(adjacencyList[whichRow, 1:2],
                      collapse = ">")
  return(edge)
}

In addition to the core sna and ggplot2 packages, the Hmisc package is used; it provides the bezier() function called by edgeMaker().

library(sna)
library(ggplot2)
library(Hmisc)

As has been typical with the examples in this chapter, the network data have to be transformed to an edgelist format prior to using the plotting functions. For this example, we also drop the ties whose associated weight ‘passes’ is less than 10. Finally, the edgeMaker() function expects the edgelist object to be named ‘adjacencyList.’

data(FIFA_Nether)
fifa <- FIFA_Nether
fifa.edge <- as.edgelist.sna(fifa,attrname='passes')
fifa.edge <- data.frame(fifa.edge)
names(fifa.edge)[3] <- "value"
fifa.edge <- fifa.edge[fifa.edge$value > 9,]
adjacencyList <- fifa.edge

Now, we use edgeMaker() to create the curved edges. Also, gplot() (from sna) is called once to store the layout coordinates for the ggplot2 function. (This means that any set of coordinates can be fed to ggplot2.)

layoutCoordinates <- gplot(network(fifa.edge))
allEdges <- lapply(1:nrow(fifa.edge), edgeMaker,
                   len = 500, curved = TRUE)
allEdges <- do.call(rbind, allEdges)

Before producing the plot, we create an empty ggplot2 theme. This is used to clean up after producing the plot.
new_theme_empty <- theme_bw()
new_theme_empty$line <- element_blank()
new_theme_empty$rect <- element_blank()
new_theme_empty$strip.text <- element_blank()
new_theme_empty$axis.text <- element_blank()
new_theme_empty$plot.title <- element_blank()
new_theme_empty$axis.title <- element_blank()
# unit() replaces the hand-built "unit" structure, which current versions of grid no longer accept
new_theme_empty$plot.margin <- unit(c(0,0,-1,-1), "lines")

And now the final step is to create the plot using ggplot(). Familiarity with ggplot2 will help in understanding this code. The scale_colour_gradient() option controls the intensity of the gradient, and the scale_size() option controls the amount of the taper (Fig. 6.6).

zp1 <- ggplot(allEdges)
zp1 <- zp1 + geom_path(aes(x = x, y = y, group = Group,
                           colour=Sequence, size=-Sequence))
zp1 <- zp1 + geom_point(data = data.frame(layoutCoordinates),
                        aes(x = x, y = y),
                        size = 4, pch = 21,
                        colour = "black", fill = "gray")
zp1 <- zp1 + scale_colour_gradient(low = gray(0),
                                   high = gray(9/10),
                                   guide = "none")
zp1 <- zp1 + scale_size(range = c(1/10, 1.5),
                        guide = "none")
zp1 <- zp1 + new_theme_empty
print(zp1)

Fig. 6.6 Netherlands 2010 passing network with curved ties

Part III Description and Analysis

Chapter 7 Actor Prominence

. . .we’re now tied up with human beings, tied to you and forced to go on with this adventure according to the laws of visibility. (Jean Genet)

7.1 Introduction

Networks are interesting because of their specific structural patterns, and how those structures affect the members of the network. Stated more simply, networks affect their members based on where those members are located in the networks. A person who is connected to many other members of a network is likely to view the rest of the network quite differently from somebody who is relatively isolated from the other members. Network analysis provides many tools for viewing, analyzing, and assessing the locations of individual nodes and ties.
This is often the first type of network analysis that is performed once network data are obtained, beyond simple network description.

By examining the location of individual network members, we can assess the prominence of those members. An actor is prominent if the ties of the actor make that actor visible to the other members in the network (Knoke and Burt 1983). In the rest of this chapter, we will cover a number of the most common ways to assess network member prominence. For non-directed networks we will look at centrality, where we view a central actor as one who is involved in many (direct or indirect) ties. For directed networks, prominence is usually referred to as prestige; a prestigious actor is one who is the object of extensive ties.

This chapter will also cover how individual node-level measures of centrality and prestige can be aggregated into network-level centralization measures. An example of how to report the results of prominence analysis will be presented. Finally, there will be a short discussion of identifying cutpoints and bridges in networks. These are technically not measures of prominence, but are simple locational properties of individual nodes or ties, and as such are somewhat similar to the rest of the chapter’s subject matter.

© Springer International Publishing Switzerland 2015. D.A. Luke, A User’s Guide to Network Analysis in R, Use R!, DOI 10.1007/978-3-319-23883-8_7

7.2 Centrality: Prominence for Undirected Networks

It makes intuitive sense that a network member who is connected to many other members of the network is in a prominent position. For non-directed networks, we will say that this type of actor has high centrality, or that it is in a central position. However, there are a number of ways of operationalizing this type of prominence. In fact, there are dozens of centrality statistics available to the network analyst.
To see how we can come up with different types of centrality measures, consider the example network displayed in Fig. 7.1, based on the following simple sociomatrix, called net_mat.

##   a b c d e f g h i j
## a 0 1 1 0 0 0 0 0 0 0
## b 1 0 1 0 0 0 0 0 0 0
## c 1 1 0 1 1 0 1 0 0 0
## d 0 0 1 0 1 0 0 0 0 0
## e 0 0 1 1 0 1 0 0 0 0
## f 0 0 0 0 1 0 1 0 0 0
## g 0 0 1 0 0 1 0 1 0 0
## h 0 0 0 0 0 0 1 0 1 1
## i 0 0 0 0 0 0 0 1 0 0
## j 0 0 0 0 0 0 0 1 0 0

Fig. 7.1 Network graph example to demonstrate concepts of prominence

Which node is most central? Nodes c and g are both positioned in the center of the graph, but as we learned in Chap. 4, the location of nodes in network graphics may or may not hold any particular meaning. However, node c is directly connected to more network members than any other node, so in that sense we could view c as a central node. Alternatively, node g does not have as many direct network ties, but it is positioned in such a way that it connects two different parts of the network. In particular, the only way that information from nodes h, i, and j gets to the rest of the network is through node g. Finally, even though node g is only directly connected to three other nodes, it is positioned so that it is fairly close to every other node in the network. Specifically, node g can reach every other node in only one or two steps. That is, node g is connected to the rest of the network by paths of length one or two. So, in these two very different senses, node g can also be thought of as a central node. In the next three sections, we will cover the three most commonly used measures of centrality.

7.2.1 Three Common Measures of Centrality

7.2.1.1 Degree Centrality

The simplest measure of centrality by far is based on the notion that a node that has more direct ties is more prominent than nodes with fewer or no ties. Degree centrality, thus, is simply the degree of each node.
We first introduced node degree in Chap. 2. The degree of a node is the number of ties it has with other nodes. Following the notation of Wasserman and Faust (1994), degree centrality is defined as:

$$C_D(n_i) = d(n_i)$$

The network in Fig. 7.1 is simple enough that we could count up the node degrees by hand. However, here is how degree centrality can be calculated in statnet, assuming that we have the data stored in a network object called net.

net <- network(net_mat)
net %v% 'vertex.names'
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
degree(net, gmode="graph")
## [1] 2 2 5 2 3 2 3 3 1 1

The second line of code simply reminds you of the names of the nodes and their order. The degree() function calculates and returns the degree centrality scores for each node. The gmode option tells the function to treat the network object as a non-directed network (graph). (This option needs to be used, even if the network is created and stored as a non-directed network.) The results confirm what we had already suggested above. Node c has the highest degree centrality. It is connected to five other nodes in the network, more than any other node.

7.2.1.2 Closeness Centrality

Instead of examining only the direct connections of the nodes, we can focus on how close each node is to every other node in a network. This leads to the concept of closeness centrality, where nodes are more prominent to the extent they are close to all other nodes in the network. Here is the relatively more complicated equation for closeness centrality,

$$C_C(n_i) = \left[ \sum_{j=1}^{g} d(n_i, n_j) \right]^{-1}$$

where d is the path distance between two nodes. Closeness centrality, then, is the inverse of the sum of all the distances between node i and all the other nodes in the network.

closeness(net, gmode="graph")
## [1] 0.409 0.409 0.600 0.429 0.450 0.450 0.600
## [8] 0.474 0.333 0.333

This tells us that nodes c and g are tied with the highest closeness.
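As a check on the closeness scores above, the same quantities can be worked out by hand from the sociomatrix. The following is a minimal sketch in base R, assuming the twelve ties shown in the printed net_mat; the helper geodist_bfs() is a hypothetical name introduced only for this illustration and is not part of statnet. The sketch also makes one detail visible: the reported values lie between 0 and 1 because they correspond to the inverse of the summed distances multiplied by g − 1 (the standardized form of the index), not to the raw inverse.

```r
# Rebuild the example sociomatrix from its tie list (undirected ties)
nodes <- letters[1:10]
ties <- rbind(c("a", "b"), c("a", "c"), c("b", "c"), c("c", "d"),
              c("c", "e"), c("c", "g"), c("d", "e"), c("e", "f"),
              c("f", "g"), c("g", "h"), c("h", "i"), c("h", "j"))
net_mat <- matrix(0, 10, 10, dimnames = list(nodes, nodes))
net_mat[ties] <- 1            # index cells by (row name, column name)
net_mat[ties[, 2:1]] <- 1     # add the reverse direction to symmetrize

# Hypothetical helper: path distances from one node via breadth-first search
geodist_bfs <- function(adj, start) {
  dist <- rep(Inf, nrow(adj))
  dist[start] <- 0
  frontier <- start
  while (length(frontier) > 0) {
    # nodes adjacent to the current layer that have not been reached yet
    nbrs <- which(colSums(adj[frontier, , drop = FALSE]) > 0 &
                    is.infinite(dist))
    dist[nbrs] <- dist[frontier[1]] + 1
    frontier <- nbrs
  }
  dist
}

# Full distance matrix, one row per starting node
D <- t(sapply(seq_len(nrow(net_mat)),
              function(i) geodist_bfs(net_mat, i)))
dimnames(D) <- dimnames(net_mat)

g <- nrow(net_mat)
deg <- rowSums(net_mat)            # degree centrality
clo_raw <- 1 / rowSums(D)          # inverse of summed distances
clo_std <- (g - 1) / rowSums(D)    # standardized by g - 1

deg
round(clo_std, 3)
```

For node c, for example, the distances to the other nine nodes sum to 15, so the standardized score is 9/15 = 0.6, which agrees with the closeness() output.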
7.2.1.3 Betweenness Centrality

Betweenness centrality measures the extent that a node sits ‘between’ pairs of other nodes in the network, such that a path between the other nodes has to go through that node. A node with high betweenness is prominent, then, because that node is in a position to observe or control the flow of information in the network. The equation for betweenness centrality is

$$C_B(n_i) = \sum_{j<k} g_{jk}(n_i) / g_{jk}$$

2])
graph.density(iDHHS)
## [1] 0.153

To identify the k-core structure in the network, the graph.coreness function is used. It returns a vector listing the highest core that each vertex belongs to in the network. The results tell us the k-cores range from 1 to 6.

coreness <- graph.coreness(iDHHS)
table(coreness)
## coreness
##  1  2  3  4  5  6
##  7  6  2  5  2 26
maxCoreness <- max(coreness)
maxCoreness
## [1] 6

To better understand the k-core structure, we can plot the network using the k-core set information. This example illustrates how igraph uses the special vertex attributes name and color. The name attribute is used by default to label the nodes in a plot. Here we copy over the vertex names that were stored in the statnet vertex attribute vertex.names. The color vertex attribute is used to set the default colors of the nodes. Here we add 1 to the k-core values stored in the coreness vector as a quick and dirty way to pick different colors for each k-core, as well as to avoid black (Fig. 8.4).

Vname <- get.vertex.attribute(iDHHS,name='vertex.names',
                              index=V(iDHHS))
V(iDHHS)$name <- Vname
V(iDHHS)$color <- coreness + 1
op <- par(mar = rep(0, 4))
plot(iDHHS,vertex.label.cex=0.8)
par(op)

To help with the interpretation, we can label the nodes with their k-core membership value. Also, in this example we demonstrate an alternative way to automatically pick a distinctive set of colors for the nodes (Fig. 8.5).
colors <- rainbow(maxCoreness)
op <- par(mar = rep(0, 4))
plot(iDHHS,vertex.label=coreness,
     vertex.color=colors[coreness])
par(op)

Fig. 8.4 DHHS k-core structure

Fig. 8.5 DHHS k-core structure, with k-core membership values

This figure shows that the center of the network is made up primarily of the highest k-core. In this case, the 6-core comprises 26 of the 54 total nodes.

Because of the nested structure of k-cores, we can further examine the subgroup patterns by progressively ‘peeling away’ each of the lower k-cores in turn. To do this we take advantage of the induced.subgraph function (Fig. 8.6).

V(iDHHS)$name <- coreness
V(iDHHS)$color <- colors[coreness]
iDHHS1_6 <- iDHHS
iDHHS2_6 <- induced.subgraph(iDHHS,
                             vids=which(coreness > 1))
iDHHS3_6 <- induced.subgraph(iDHHS,
                             vids=which(coreness > 2))
iDHHS4_6 <- induced.subgraph(iDHHS,
                             vids=which(coreness > 3))
iDHHS5_6 <- induced.subgraph(iDHHS,
                             vids=which(coreness > 4))
iDHHS6_6 <- induced.subgraph(iDHHS,
                             vids=which(coreness > 5))
lay <- layout.fruchterman.reingold(iDHHS)
op <- par(mfrow=c(3,2),mar = c(3,0,2,0))
plot(iDHHS1_6,layout=lay,main="All k-cores")
plot(iDHHS2_6,layout=lay[which(coreness > 1),],
     main="k-cores 2-6")
plot(iDHHS3_6,layout=lay[which(coreness > 2),],
     main="k-cores 3-6")
plot(iDHHS4_6,layout=lay[which(coreness > 3),],
     main="k-cores 4-6")
plot(iDHHS5_6,layout=lay[which(coreness > 4),],
     main="k-cores 5-6")
plot(iDHHS6_6,layout=lay[which(coreness > 5),],
     main="k-cores 6-6")
par(op)

Fig. 8.6 Peeling away DHHS k-cores (panels: All k-cores, k-cores 2-6, k-cores 3-6, k-cores 4-6, k-cores 5-6, k-cores 6-6)

8.3 Community Detection

Both cliques and k-cores are examples of subgroup identification that rely entirely on the pattern of internal ties defining the particular groups. Network scientists have developed a wide variety of subgroup identification algorithms and heuristics that define groups not just based on the internal ties, but also the pattern of ties between different groups. That is, a subgroup in a network is a set of nodes that has a relatively large number of internal ties, and also relatively few ties from the group to other parts of the network. These approaches vary in their details, but they are all designed to identify internally cohesive subgroups that are somewhat separated or isolated from other groups or nodes.
These approaches are sometimes called community detection algorithms.

8.3.1 Modularity

An important characteristic of a network that is used in many community detection algorithms is that of modularity. Modularity is a measure of the structure of the network, specifically the extent to which nodes exhibit clustering where there is greater density within the clusters and less density between them (Newman 2006). Modularity can be used in an exploratory fashion, where an algorithm tries to maximize modularity and returns the node classification that is found to best explain the observed clustering. Conversely, modularity can be used in a descriptive fashion where the modularity statistic is calculated for any node classification variable of interest. For example, an analyst can calculate the modularity score for a friendship network given the gender of network members. Used this way, modularity reflects the extent to which gender explains the observed clustering among the friends in the network.

Modularity is a chance-corrected statistic, and is defined as the fraction of ties that fall within the given groups minus the expected such fraction if ties were distributed at random. The modularity statistic can range from −1/2 to +1. The closer to 1, the more the network exhibits clustering with respect to the given node grouping.

Consider the following simple example of a network with nine nodes. We have two categorical vertex attributes which each classify the nodes into three groups (Fig. 8.7).

g1 <- graph.formula(A-B-C-A,D-E-F-D,G-H-I-G,A-D-G-A)
V(g1)$grp_good <- c(1,1,1,2,2,2,3,3,3)
V(g1)$grp_bad <- c(1,2,3,2,3,1,3,1,2)
op <- par(mfrow=c(1,2))
plot(g1,vertex.color=(V(g1)$grp_good),
     vertex.size=20,
     main="Good Grouping")
plot(g1,vertex.color=(V(g1)$grp_bad),
     vertex.size=20,
     main="Bad Grouping")
par(op)

Fig. 8.7 Modularity example (panels: Good Grouping, Bad Grouping)

As the figure suggests, the clustering that is evident in the network is better accounted for by the grp_good node attribute compared to the grp_bad variable. This can be confirmed by calculating the modularity score provided by the modularity function in igraph.

modularity(g1,V(g1)$grp_good)
## [1] 0.417
modularity(g1,V(g1)$grp_bad)
## [1] -0.333

Real-world social networks are often characterized by clustering, but it is of course harder to judge the extent of the clustering by eye. Earlier in the chapter we saw that there was interesting subgroup structure contained in the DHHS network, and that this structure might be partially explained by the DHHS organization that the person worked for.

library(intergraph)
data(DHHS)
iDHHS <- asIgraph(DHHS)
table(V(iDHHS)$agency)
##
##  0  1  2  3  4  5  6  7  8  9 10
##  2  4 12  2  2  3  2 16  3  5  3
V(iDHHS)[1:10]$agency
## [1] 0 0 1 1 1 1 2 2 2 2
modularity(iDHHS,(V(iDHHS)$agency+1))
## [1] 0.14

The modularity function expects that the node grouping variable is numbered starting at 1 (and in fact the current version of the function will crash if community membership has zeros). In this case, agency is numbered starting at 0, so we add 1 to it before passing it to the modularity function. The modularity score of 0.14 suggests that the DHHS agency does explain some of the clustering that is present in the network. However, like most network descriptive statistics, the number in and of itself has little meaning. The interpretation of the network characteristic becomes more meaningful when it is compared to another relevant measure. For example, how does the modularity change over time? Or, how does this modularity score compare to the modularity score for a different vertex attribute on the same network?

As a comparison, we can look at the modularity scores for two other datasets included in UserNetR. For the Moreno data, we can see how gender accounts for subgroup structure.
For the Facebook network, we can use the group node attribute, which designates the type of social group the Facebook friends belong to (family, work, music, high school, etc.). The results show us that both the Moreno and Facebook social networks exhibit higher modularity than the DHHS network.

data(Moreno)
iMoreno <- asIgraph(Moreno)
table(V(iMoreno)$gender)
##
##  1  2
## 16 17
modularity(iMoreno,V(iMoreno)$gender)
## [1] 0.476
data(Facebook)
levels(factor(V(Facebook)$group))
## [1] "B" "C" "F" "G" "H" "M" "S" "W"
grp_num <- as.numeric(factor(V(Facebook)$group))
modularity(Facebook,grp_num)
## [1] 0.615

8.3.2 Community Detection Algorithms

The main reason that this chapter uses igraph is that it includes support for many if not most of the existing community detection approaches. Table 8.2 lists the currently supported algorithms, along with whether each function supports directed networks, weighted networks (networks with valued ties), and whether the algorithm can be used on networks with more than one component.

Name                 Function                  Directed  Weighted  Components
Edge-betweenness     cluster_edge_betweenness  T         T         T
Leading eigenvector  cluster_leading_eigen     F         F         T
Fast-greedy          cluster_fast_greedy       F         T         T
Louvain              cluster_louvain           F         T         T
Walktrap             cluster_walktrap          F         T         F
Label propagation    cluster_label_prop        F         T         F
InfoMAP              cluster_infomap           T         T         T
Spinglass            cluster_spinglass         F         T         F
Optimal              cluster_optimal           F         T         T

Table 8.2 Community detection functions in igraph

The basic workflow for conducting community detection in igraph is to run one of the community detection functions on a network and store the results in a communities class object. Then, the identified subgroups in the network can be explored using a number of igraph functions that know how to operate with communities objects. The networks can also be plotted easily to show the results of the community detection.
For example, consider the simple Moreno friendship network that is clearly divided into two subgroups based on gender. Community detection on this network proceeds as follows.

cw <- cluster_walktrap(iMoreno)
membership(cw)
## [1] 1 1 1 1 1 1 1 1 3 3 3 5 5 5 5 1 3 2 2 2 4 4 4
## [24] 2 2 2 2 2 2 2 2 6 6
modularity(cw)
## [1] 0.618

Modularity is fairly high, suggesting that the walktrap algorithm has done a good job at detecting subgroup structure. The membership function reveals that six different subgroups have been identified. These are best understood through visualization. If the plot function is passed a communities object along with the network it belongs to, then an attractive plot is produced with each subgroup getting its own color-coded shaded polygon (Fig. 8.8).

plot(cw, iMoreno)

Fig. 8.8 Community detection on Moreno network

Once you have a community detection subgroup solution, it can be examined like any membership vector. For example, you can compare it to an existing partition based on a node characteristic. In this case, how does a walktrap solution compare to the specific agency of the nodes in the DHHS network?

cw <- cluster_walktrap(iDHHS)
modularity(cw)
## [1] 0.165
membership(cw)
## [1] 4 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [24] 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2
## [47] 2 1 2 1 1 1 1 1
table(V(iDHHS)$agency,membership(cw))
##
##       1  2  3  4  5
##   0   0  0  0  1  1
##   1   4  0  0  0  0
##   2  12  0  0  0  0
##   3   2  0  0  0  0
##   4   2  0  0  0  0
##   5   3  0  0  0  0
##   6   2  0  0  0  0
##   7   0  0 16  0  0
##   8   0  3  0  0  0
##   9   3  2  0  0  0
##   10  3  0  0  0  0

A common practice is to use more than one community detection algorithm and compare the results. (Remember that only some algorithms can handle particular types of networks such as directed networks.)
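Because Table 8.2 indicates that the algorithms differ in the kinds of networks they accept, it can help to check a network's basic properties before choosing among them. The following is a minimal sketch on a small toy graph that is made up purely for illustration (it is not one of the UserNetR datasets); is_directed(), is_weighted(), components(), and induced_subgraph() are igraph functions. Restricting the analysis to the largest component is one possible way to handle an algorithm that Table 8.2 lists as not supporting multiple components, offered here as an assumption rather than a recommendation from the text.

```r
library(igraph)

# Toy undirected graph (made up for illustration):
# two triangles joined by the tie C-D, plus a separate pair G-H
g <- graph_from_literal(A-B-C-A, D-E-F-D, C-D, G-H)

is_directed(g)      # are ties directed?
is_weighted(g)      # is there an edge attribute named 'weight'?

comp <- components(g)
n_comp <- comp$no   # number of components
n_comp

# Keep only the largest component before running walktrap
keep <- which(comp$membership == which.max(comp$csize))
g_main <- induced_subgraph(g, keep)

cw_toy <- cluster_walktrap(g_main)
sizes(cw_toy)       # number of nodes in each detected subgroup
```

The same three checks can be run on any igraph object, such as those created with asIgraph(), to see which rows of Table 8.2 apply.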
The following examples explore how different algorithms find slightly different subgroups in the Bali terrorism network.

data(Bali)
iBali <- asIgraph(Bali)
cw <- cluster_walktrap(iBali)
modularity(cw)
## [1] 0.283
membership(cw)
## [1] 2 1 2 1 2 2 1 2 2 3 3 3 3 3 2 2 2
ceb <- cluster_edge_betweenness(iBali)
modularity(ceb)
## [1] 0.239
membership(ceb)
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1
cs <- cluster_spinglass(iBali)
modularity(cs)
## [1] 0.297
membership(cs)
## [1] 1 2 1 3 1 1 2 1 1 3 3 3 3 3 1 1 1
cfg <- cluster_fast_greedy(iBali)
modularity(cfg)
## [1] 0.263
membership(cfg)
## [1] 2 2 1 2 1 2 2 1 1 3 3 3 3 3 1 1 1
clp <- cluster_label_prop(iBali)
modularity(clp)
## [1] 0.239
membership(clp)
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1
cle <- cluster_leading_eigen(iBali)
modularity(cle)
## [1] 0.275
membership(cle)
## [1] 1 1 1 2 1 1 2 1 1 2 2 2 2 2 1 1 1
cl <- cluster_louvain(iBali)
modularity(cl)
## [1] 0.297
membership(cl)
## [1] 3 1 3 2 3 3 1 3 3 2 2 2 2 2 3 3 3
co <- cluster_optimal(iBali)
modularity(co)
## [1] 0.297
membership(co)
## [1] 1 2 1 3 1 1 2 1 1 3 3 3 3 3 1 1 1

These results show that all the detection algorithms identify either two or three subgroups. Modularity ranges from about 0.24 to 0.30. The community detection results can be compared to one another using a number of classification comparison metrics, including the adjusted Rand statistic.

table(V(iBali)$role,membership(cw))
##
##      1 2 3
##   BM 0 5 0
##   CT 1 2 0
##   OA 2 1 0
##   SB 0 1 1
##   TL 0 0 4
compare(as.numeric(factor(V(iBali)$role)),cw,
        method="adjusted.rand")
## [1] 0.35
compare(cw,ceb,method="adjusted.rand")
## [1] 0.616
compare(cw,cs,method="adjusted.rand")
## [1] 0.89
compare(cw,cfg,method="adjusted.rand")
## [1] 0.669

Finally, we can plot multiple solutions to better understand the similarities and differences among the different community detection algorithms (Fig. 8.9).
op <- par(mfrow=c(3,2),mar=c(3,0,2,0))
plot(ceb, iBali,vertex.label=V(iBali)$role,
     main="Edge Betweenness")
plot(cfg, iBali,vertex.label=V(iBali)$role,
     main="Fastgreedy")
plot(clp, iBali,vertex.label=V(iBali)$role,
     main="Label Propagation")
plot(cle, iBali,vertex.label=V(iBali)$role,
     main="Leading Eigenvector")
plot(cs, iBali,vertex.label=V(iBali)$role,
     main="Spinglass")
plot(cw, iBali,vertex.label=V(iBali)$role,
     main="Walktrap")
par(op)

Fig. 8.9 Community detection comparisons on Bali network

Chapter 9
Affiliation Networks

A tribe is a group of people connected to one another, connected to a leader, and connected to an idea. For millions of years, human beings have been part of one tribe or another. A group needs only two things to be a tribe: a shared interest and a way to communicate. (Seth Godin, Tribes: We Need You to Lead Us)

9.1 Defining Affiliation Networks

Until now, all the networks that we have examined have been based on direct ties. That is, the social ties connecting the actors in the social network have been confirmed through self-report, direct observation, or some other type of data collection that tells us how actors are directly connected to one another. However, social scientists are often interested in situations where there may be the opportunity for social relationships, but these relationships cannot be directly observed. In such situations, when actors occupy the same social setting, we may infer that there is an opportunity or potential for social connections. We call this new type of social network an affiliation network.
An affiliation network is a network where the members are affiliated with one another based on co-membership in a group, or co-participation in some type of event. For example, students who all belong to the same class can be thought of as being connected to one another, although we may not know whether they actually have direct social ties. The classic example from network science is the case of corporate interlocks. Company directors have the opportunity to interact with each other when they sit together on the same corporate board of directors. Moreover, the companies themselves can be seen to be connected through their shared director memberships. That is, when the same director sits on two different company boards, those companies are connected through that director. Sociologists and political scientists have used these types of affiliation networks to explain how companies tend to behave in similar ways to one another (Galaskiewicz 1985).

9.1.1 Affiliations as 2-Mode Networks

As a simple example of an affiliation network, consider the following data table of students grouped in classes (Table 9.1).

C1 <- c(1,1,1,0,0,0)
C2 <- c(0,1,1,1,0,0)
C3 <- c(0,0,1,1,1,0)
C4 <- c(0,0,0,0,1,1)
aff.df <- data.frame(C1,C2,C3,C4)
row.names(aff.df) <- c("S1","S2","S3","S4","S5","S6")

    C1 C2 C3 C4
S1   1  0  0  0
S2   1  1  0  0
S3   1  1  1  0
S4   0  1  1  0
S5   0  0  1  1
S6   0  0  0  1

Table 9.1 Students grouped by classes

This type of data matrix is called an incidence matrix, and it depicts how n actors belong to g groups. In this case we have six students grouped into four classes. An incidence matrix is similar to an adjacency matrix, but an adjacency matrix is an n x n square matrix where each dimension refers to the actors in the network.
An incidence matrix, on the other hand, is an n x g rectangular matrix with two different dimensions: actors and groups. For this reason, affiliation networks are also known as two-mode networks.

9.1.2 Bipartite Graphs

In affiliation networks, there are always two types of nodes: one type for the actors, and another type for the groups or events to which the actors belong. Ties then connect the actors to those groups. One consequence of this is that there are no direct ties among actors, and there are no direct ties between the groups. Figure 9.1, which shows the example student affiliation network, illustrates this defining characteristic of a bipartite graph. Both statnet and igraph have functionality built in to recognize and operate on affiliation networks. The process with igraph is a little more straightforward, so it will be used in this chapter.

library(igraph)
bn <- graph.incidence(aff.df)
plt.x <- c(rep(2,6),rep(4,4))
plt.y <- c(7:2,6:3)
lay <- as.matrix(cbind(plt.x,plt.y))
shapes <- c("circle","square")
colors <- c("blue","red")
plot(bn,vertex.color=colors[V(bn)$type+1],
     vertex.shape=shapes[V(bn)$type+1],
     vertex.size=10,vertex.label.degree=-pi/2,
     vertex.label.dist=1.2,vertex.label.cex=0.9,
     layout=lay)

9.2 Affiliation Network Basics

9.2.1 Creating Affiliation Networks from Incidence Matrices

An affiliation network can be stored as an igraph object in a few different ways. If the underlying data are available as an incidence matrix (e.g., Table 9.1), then this can be done in one line of code.

bn <- graph.incidence(aff.df)
bn
## IGRAPH UN-B 10 11 --
## + attr: type (v/l), name (v/c)
## + edges (vertex names):
## [1] S1--C1 S2--C1 S2--C2 S3--C1 S3--C2 S3--C3
## [7] S4--C2 S4--C3 S5--C3 S5--C4 S6--C4

The graph.incidence function takes a matrix or data.frame and transforms it into an affiliation network, reading the rows as actors and the columns as the groups or events.
Note that both the row and column names should be defined so that they can be correctly assigned to the individual actors and groups. By typing in the name of the igraph object, we get a cryptic two-line summary of the network. The 'B' in the 'UN-B' string tells us that this is a bipartite network. Furthermore, the second line shows that this network has two vertex attributes: name stores the name of the vertex, and type is a logical vector that igraph uses to distinguish between the two different types of nodes (students and classes, in this case).

Fig. 9.1 Affiliation network as bipartite graph

More information about the affiliation network can be obtained using traditional igraph functions.

get.incidence(bn)
##    C1 C2 C3 C4
## S1  1  0  0  0
## S2  1  1  0  0
## S3  1  1  1  0
## S4  0  1  1  0
## S5  0  0  1  1
## S6  0  0  0  1
V(bn)$type
## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [8]  TRUE  TRUE  TRUE
V(bn)$name
##  [1] "S1" "S2" "S3" "S4" "S5" "S6" "C1" "C2" "C3"
## [10] "C4"

Here we can see the underlying incidence matrix, and the vertex names and types. The correspondence between the names and the logical type vectors is clear, where all the students have type = FALSE, and the class nodes have type = TRUE.

9.2.2 Creating Affiliation Networks from Edge Lists

For larger networks, it is more common to have the underlying data available as an edge list. Edge lists can also be translated into an affiliation network, as long as nodes of one type (e.g., students) are only connected to nodes of the other type (e.g., classes). The following code constructs the same example affiliation network from edge list data.
el.df <- data.frame(rbind(c("S1","C1"),
                          c("S2","C1"),
                          c("S2","C2"),
                          c("S3","C1"),
                          c("S3","C2"),
                          c("S3","C3"),
                          c("S4","C2"),
                          c("S4","C3"),
                          c("S5","C3"),
                          c("S5","C4"),
                          c("S6","C4")))
el.df
##    X1 X2
## 1  S1 C1
## 2  S2 C1
## 3  S2 C2
## 4  S3 C1
## 5  S3 C2
## 6  S3 C3
## 7  S4 C2
## 8  S4 C3
## 9  S5 C3
## 10 S5 C4
## 11 S6 C4
bn2 <- graph.data.frame(el.df,directed=FALSE)
bn2
## IGRAPH UN-- 10 11 --
## + attr: name (v/c)
## + edges (vertex names):
## [1] S1--C1 S2--C1 S2--C2 S3--C1 S3--C2 S3--C3
## [7] S4--C2 S4--C3 S5--C3 S5--C4 S6--C4

The above creates a network object, but igraph does not know that it is a bipartite graph. (Note that the network description lacks the 'B' that indicates a bipartite graph.) To fix this, we can simply set the type vertex attribute. Once this is done, we have an affiliation network object that is formed from a bipartite graph.

V(bn2)$type <- V(bn2)$name %in% el.df[,1]
bn2
## IGRAPH UN-B 10 11 --
## + attr: name (v/c), type (v/l)
## + edges (vertex names):
## [1] S1--C1 S2--C1 S2--C2 S3--C1 S3--C2 S3--C3
## [7] S4--C2 S4--C3 S5--C3 S5--C4 S6--C4
graph.density(bn)==graph.density(bn2)
## [1] TRUE

9.2.3 Plotting Affiliation Networks

As with any type of network, affiliation networks can be plotted for visual inspection. However, it is useful to designate different node shapes and colors to make the affiliation structure easier to interpret. Here we will set the students to be blue circles, and the classes to be red squares. The code shows how the type attribute can be used as an index into both a shapes and a colors vector to select the appropriate shape and color for each node. Note that 1 is added to the type index because a logical vector is treated as 0 (FALSE) or 1 (TRUE) in arithmetic, whereas we want to select either the first or the second element of the shapes/colors vectors.
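The indexing idea can be seen in isolation with a small made-up type vector. The values below are hypothetical and are not taken from the bn object.

```r
# Hypothetical node types: two actors (FALSE) and one group (TRUE).
type <- c(FALSE, FALSE, TRUE)
shapes <- c("circle", "square")

# Adding 1 coerces the logical values to numbers and shifts them,
# so FALSE selects element 1 and TRUE selects element 2.
idx <- type + 1
idx
shapes[idx]

# An equivalent form that avoids the arithmetic.
ifelse(type, "square", "circle")
```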
shapes <- c("circle","square")
colors <- c("blue","red")
plot(bn,vertex.color=colors[V(bn)$type+1],
     vertex.shape=shapes[V(bn)$type+1],
     vertex.size=10,vertex.label.degree=-pi/2,
     vertex.label.dist=1.2,vertex.label.cex=0.9)

Fig. 9.2 Simple plot of affiliation network

9.2.4 Projections

By examining Fig. 9.2 we can see how classes are indirectly connected through their shared students. For example, classes 2 and 3 are indirectly connected through two shared students (S3 and S4). In comparison, classes 3 and 4 are also indirectly connected, but only via one shared student (S5). We can also focus the examination on the individual-level nodes. Here the figure reveals that students 2 and 4 are indirectly connected by their co-affiliation in class 2.

Examining both types of nodes in a two-mode network graphic is often the first step in studying an affiliation network. However, it is also useful to examine the direct connections among the nodes of one type at a time (classes and students, in this case). This can be done by extracting and visualizing the one-mode projections of the two-mode affiliation network. Every actor-by-event affiliation network can produce two one-mode networks, one of actors and one of the events or affiliations.

In igraph the projections can be obtained again by just one line of code. The bipartite.projection function returns a list of two igraph network objects. The first network is made up of the direct ties among the first mode (in our case students), and the second network shows the ties among the second mode (classes).
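For intuition about what a projection contains, the shared-membership counts can also be computed directly from the incidence matrix of Table 9.1 with matrix products. This is a minimal base-R sketch; the object names are made up for illustration and it does not use igraph.

```r
# Incidence matrix from Table 9.1: rows are students, columns are classes.
A <- matrix(c(1,0,0,0,
              1,1,0,0,
              1,1,1,0,
              0,1,1,0,
              0,0,1,1,
              0,0,0,1),
            nrow = 6, byrow = TRUE,
            dimnames = list(paste0("S", 1:6), paste0("C", 1:4)))

students_proj <- A %*% t(A)   # 6 x 6: classes shared by each pair of students
classes_proj  <- t(A) %*% A   # 4 x 4: students shared by each pair of classes

# The diagonals count each node's own memberships, so set them to zero.
diag(students_proj) <- 0
diag(classes_proj)  <- 0

students_proj
classes_proj
```

The off-diagonal entries are the counts of shared affiliations, for example the two students (S3 and S4) shared by classes 2 and 3.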
bn.pr <- bipartite.projection(bn)
bn.pr
## $proj1
## IGRAPH UNW- 6 8 --
## + attr: name (v/c), weight (e/n)
## + edges (vertex names):
## [1] S1--S2 S1--S3 S2--S3 S2--S4 S3--S4 S3--S5
## [7] S4--S5 S5--S6
##
## $proj2
## IGRAPH UNW- 4 4 --
## + attr: name (v/c), weight (e/n)
## + edges (vertex names):
## [1] C1--C2 C1--C3 C2--C3 C3--C4

Each of the list members can be accessed and treated like a typical igraph network object, either within the list, or by extracting the list member.

graph.density(bn.pr$proj1)
## [1] 0.533
bn.student <- bn.pr$proj1
bn.class <- bn.pr$proj2
graph.density(bn.student)
## [1] 0.533

The adjacency matrix of each one-mode projection can be obtained with the get.adjacency function. In the code below, notice how the edge attribute weight is specified. This produces a valued adjacency matrix, where the values indicate how many shared affiliations connect each pair of nodes. So, for example, the Class adjacency matrix indicates that classes 2 and 3 have a weight of 2. This reflects the observation we made earlier that classes 2 and 3 share two students (S3 and S4).

get.adjacency(bn.student,sparse=FALSE,attr="weight")
##    S1 S2 S3 S4 S5 S6
## S1  0  1  1  0  0  0
## S2  1  0  2  1  0  0
## S3  1  2  0  2  1  0
## S4  0  1  2  0  1  0
## S5  0  0  1  1  0  1
## S6  0  0  0  0  1  0
get.adjacency(bn.class,sparse=FALSE,attr="weight")
##    C1 C2 C3 C4
## C1  0  2  1  0
## C2  2  0  2  0
## C3  1  2  0  1
## C4  0  0  1  0

Each of the one-mode projections can, of course, be plotted for visual examination. Additionally, we can (at least for smaller networks) take advantage of the weight edge attribute to explore the relative strengths of the ties (Fig. 9.3).
shapes <- c("circle","square")
colors <- c("blue","red")
op <- par(mfrow=c(1,2))
plot(bn.student,vertex.color="blue",
     vertex.shape="circle",main="Students",
     edge.width=E(bn.student)$weight*2,
     vertex.size=15,vertex.label.degree=-pi/2,
     vertex.label.dist=1.2,vertex.label.cex=1)
plot(bn.class,vertex.color="red",
     vertex.shape="square",main="Classes",
     edge.width=E(bn.class)$weight*2,
     vertex.size=15,vertex.label.degree=-pi/2,
     vertex.label.dist=1.2,vertex.label.cex=1)
par(op)

Fig. 9.3 Plots of one-mode projections

9.3 Example: Hollywood Actors as an Affiliation Network

The data file hwd in the UserNetR package contains a larger and more interesting affiliation network that can be explored using these techniques. Hollywood actors are a good example of an affiliation network: actors are connected to one another through the movies in which they appear together. The hwd dataset is an igraph bipartite graph object. The data are originally from IMDB (www.imdb.com). The dataset contains the ten most popular movies (as judged by IMDB users) for each year from 1999 to 2014, and the first ten actors listed on each movie's IMDB page. In addition to the movie and actor names, each movie has the year of its release, its IMDB user rating, and the MPAA movie rating (i.e., G, PG, PG-13, and R) stored as a node characteristic.

9.3.1 Analysis of Entire Hollywood Affiliation Network

The first steps to analyze these data are to load the file, and explore the basic affiliation structure of the network.

data(hwd)
h1 <- hwd
h1
## IGRAPH UN-B 1365 1600 --
## + attr: name (v/c), type (v/l), year (v/n),
## | IMDBrating (v/n), MPAArating (v/c)
## + edges (vertex names):
## [1] Inception--Leonardo DiCaprio
## [2] Inception--Joseph Gordon-Levitt
## [3] Inception--Ellen Page
## [4] Inception--Tom Hardy
## [5] Inception--Ken Watanabe
## [6] Inception--Dileep Rao
## [7] Inception--Cillian Murphy
## + ... omitted several edges
V(h1)$name[1:10]
##  [1] "Inception"
##  [2] "Alice in Wonderland"
##  [3] "Kick-Ass"
##  [4] "Toy Story 3"
##  [5] "How to Train Your Dragon"
##  [6] "Despicable Me"
##  [7] "Scott Pilgrim vs. the World"
##  [8] "Hot Tub Time Machine"
##  [9] "Harry Potter and the Deathly Hallows: Part 1"
## [10] "Tangled"
V(h1)$type[1:10]
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [10] TRUE
V(h1)$IMDBrating[1:10]
##  [1] 8.8 6.5 7.8 8.4 8.2 7.7 7.5 6.5 7.7 7.9
V(h1)$name[155:165]
##  [1] "Notting Hill"
##  [2] "Eyes Wide Shut"
##  [3] "The Green Mile"
##  [4] "10 Things I Hate About You"
##  [5] "American Pie"
##  [6] "Girl, Interrupted"
##  [7] "Leonardo DiCaprio"
##  [8] "Joseph Gordon-Levitt"
##  [9] "Ellen Page"
## [10] "Tom Hardy"
## [11] "Ken Watanabe"

The summary description of h1 indicates that hwd is indeed a bipartite graph. We can surmise that the ties link each actor to the movie that actor was in. The description also reveals that the network has 1,365 nodes and 1,600 ties. This is a little harder to decipher for an affiliation network, but given what we already know we can figure out that there are 160 movie nodes, 1,205 actor nodes, and the 1,600 ties arise from each movie having links to just ten actors. (There are only 1,205 actors listed because some actors appear in more than one movie.)

The entire network is too large to show here, but we can examine a small subset of it before doing more focused analyses. As a first step, we can take advantage of igraph's ability to store plotting information within the network object itself. In this case, the node color and shape can be designated by defining these as vertex attributes. (Compare to how this was done in the previous plotting example, where the node colors and shapes were designated within the plot function call.)
V(h1)$shape <- ifelse(V(h1)$type==TRUE,
                      "square","circle")
V(h1)$shape[1:10]
##  [1] "square" "square" "square" "square" "square"
##  [6] "square" "square" "square" "square" "square"
V(h1)$color <- ifelse(V(h1)$type==TRUE,
                      "red","lightblue")

For the first plot, we will look at a subset of Martin Scorsese movies that were released in the past 15 years. This example also illustrates how to create a subgraph by extracting only the edges that are incident to vertices with certain properties (in this case the name matches one of the three listed Scorsese movies). The key here is the inc special function of the E() edge iterator. The inc function takes a vertex sequence as an argument, and returns the incident edges. In this case, we are extracting all of the edges that are incident to the three Scorsese movies. For more information see the igraph help entry on iterators, which can be found with help(E). The resulting graphic highlights the special role of Leonardo DiCaprio in these Scorsese movies, being the only actor to star in all three (Fig. 9.4).

h2 <- subgraph.edges(h1, E(h1)[inc(V(h1)[name %in%
        c("The Wolf of Wall Street", "Gangs of New York",
          "The Departed")])])
plot(h2, layout = layout_with_kk)

Fig. 9.4 Scorsese affiliation network

What can be learned from the entire Hollywood network? Most network descriptive statistics can be applied to affiliation networks, but they often need to be adjusted either in how they are constructed or how they are interpreted.
For example, the overall density of the affiliation network can be easily calculated, but it is not very meaningful given how the network data were collected (every actor by definition is connected to a movie) and that there can be no ties among either the movie nodes or among the actor nodes.

graph.density(h1)
## [1] 0.00172

Instead, node degree may be more informative, at least for actors. (In this dataset, every movie has the same degree = 10.) The degree function allows specification of which vertices to include, and that is used to select only the actors (for which the node characteristic type is FALSE).

table(degree(h1,v=V(h1)[type==FALSE]))
##
##   1   2   3   4   5   6   7   8
## 955 165  47  23  11   2   1   1
mean(degree(h1,v=V(h1)[type==FALSE]))
## [1] 1.33

This shows that the vast majority of actors only appeared in one movie, but there were 15 actors who each starred in five or more movies since 1999. On average, an actor starred in 1.3 movies. This information can then be used to identify the busiest actors of the past decade and a half. They owe a lot of thanks to Harry Potter and Batman!

V(h1)$deg <- degree(h1)
V(h1)[type==FALSE & deg > 4]$name
##  [1] "Leonardo DiCaprio" "Emma Watson"
##  [3] "Richard Griffiths" "Harry Melling"
##  [5] "Daniel Radcliffe"  "Rupert Grint"
##  [7] "James Franco"      "Ian McKellen"
##  [9] "Martin Freeman"    "Bradley Cooper"
## [11] "Christian Bale"    "Samuel L. Jackson"
## [13] "Natalie Portman"   "Brad Pitt"
## [15] "Liam Neeson"
busy_actor <- data.frame(cbind(
  Actor = V(h1)[type==FALSE & deg > 4]$name,
  Movies = V(h1)[type==FALSE & deg > 4]$deg
))
busy_actor[order(busy_actor$Movies,decreasing=TRUE),]
##                Actor Movies
## 5   Daniel Radcliffe      8
## 11    Christian Bale      7
## 1  Leonardo DiCaprio      6
## 2        Emma Watson      6
## 3  Richard Griffiths      5
## 4      Harry Melling      5
## 6       Rupert Grint      5
## 7       James Franco      5
## 8       Ian McKellen      5
## 9     Martin Freeman      5
## 10    Bradley Cooper      5
## 12 Samuel L. Jackson      5
## 13   Natalie Portman      5
## 14         Brad Pitt      5
## 15       Liam Neeson      5

This tells us who the busiest actors were. If we wanted to assess the popularity of the actors based on the popularity of the movies they appeared in, we could do this by accessing the characteristics of each movie that each actor starred in. This is slightly more complicated than the previous example, but can be done by utilizing igraph's abilities to identify the adjacent neighbors for any node in the graph. The following code loops through the actor nodes in the network, and sums up the IMDBrating for all the neighbors of each node. Note that the loop only assigns the summed IMDBrating scores for the actor nodes (which are listed after the first 160 movie nodes).

for (i in 161:1365) {
  V(h1)[i]$totrating <- sum(V(h1)[nei(i)]$IMDBrating)
}

Once we have this we can once again examine the most popular actors, where popularity is based on both the number of movies and the overall popularity of those movies.

max(V(h1)$totrating,na.rm=TRUE)
## [1] 60.9
pop_actor <- data.frame(cbind(
  Actor = V(h1)[type==FALSE & totrating > 40]$name,
  Popularity = V(h1)[type==FALSE &
                     totrating > 40]$totrating))
pop_actor[order(pop_actor$Popularity,decreasing=TRUE),]
##               Actor Popularity
## 3  Daniel Radcliffe       60.9
## 4    Christian Bale       55.5
## 1 Leonardo DiCaprio       49.6
## 2       Emma Watson         45
## 5         Brad Pitt       40.5

Finally, network characteristics can always be examined using more traditional graphical and statistical approaches.
For example, we can see if the busiest actors are starring in more popular movies, on average. First, we calculate an avgrating characteristic that is based on the mean IMDBrating, rather than the sum. Then, a simple scatterplot and regression are examined to see the relationship between number of movies and the average ratings of those movies. The results suggest that there is not a strong relationship between how busy an actor has been and the popularity of their movies. However, the scatterplot also suggests that actors who appear in less popular movies are most likely to appear in only one or two movies (Fig. 9.5).

for (i in 161:1365) {
  V(h1)[i]$avgrating <- mean(V(h1)[nei(i)]$IMDBrating)
}
num <- V(h1)[type==FALSE]$deg
avgpop <- V(h1)[type==FALSE]$avgrating
summary(lm(avgpop ~ num))
##
## Call:
## lm(formula = avgpop ~ num)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.986 -0.433  0.198  0.617  1.614
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   7.3387     0.0544  134.91   <2e-16
## num           0.0471     0.0353    1.34     0.18
##
## Residual standard error: 0.96 on 1203 df
## Multiple R-sq: 0.00148,Adjusted R-sq: 0.000653
## F-statistic: 1.79 on 1 and 1203 DF, p-value: 0.182
scatter.smooth(num,avgpop,col="lightblue",
               ylim=c(2,10),span=.8,
               xlab="Number of Movies",
               ylab="Avg. Popularity")

Fig. 9.5 Relationship between actor activity and popularity

9.3.2 Analysis of the Actor and Movie Projections

Following the same procedures as presented in Sect. 9.2.4, the two projections of the Hollywood affiliation network can be created and analyzed. This will produce an actor network where actors have ties if they starred together in the same movie, and a movie network where the movies are connected if they shared the same actors. The actor projection will thus have 1,205 nodes, and the movie projection network will have 160 nodes.

h1.pr <- bipartite.projection(h1)
h1.act <- h1.pr$proj1
h1.mov <- h1.pr$proj2
h1.act
## IGRAPH UNW- 1205 6903 --
## + attr: name (v/c), year (v/n), IMDBrating
## | (v/n), MPAArating (v/c), shape (v/c),
## | color (v/c), deg (v/n), totrating (v/n),
## | avgrating (v/n), weight (e/n)
## + edges (vertex names):
## [1] Leonardo DiCaprio--Joseph Gordon-Levitt
## [2] Leonardo DiCaprio--Ellen Page
## [3] Leonardo DiCaprio--Tom Hardy
## [4] Leonardo DiCaprio--Ken Watanabe
## [5] Leonardo DiCaprio--Dileep Rao
## + ... omitted several edges
h1.mov
## IGRAPH UNW- 160 472 --
## + attr: name (v/c), year (v/n), IMDBrating
## | (v/n), MPAArating (v/c), shape (v/c),
## | color (v/c), deg (v/n), totrating (v/n),
## | avgrating (v/n), weight (e/n)
## + edges (vertex names):
## [1] Inception--The Wolf of Wall Street
## [2] Inception--Django Unchained
## [3] Inception--The Departed
## [4] Inception--Gangs of New York
## [5] Inception--Catch Me If You Can
## + ... omitted several edges

In this figure, the entire movie network is presented, with node size based on the IMDBrating, so that more popular movies have larger nodes (Fig. 9.6).

op <- par(mar = rep(0, 4))
plot(h1.mov,vertex.color="red",
     vertex.shape="circle",
     vertex.size=(V(h1.mov)$IMDBrating)-3,
     vertex.label=NA)
par(op)

Fig. 9.6 Movie affiliation network

Some basic network descriptives provide more information about the Hollywood movie network. Although there are some isolated movies (i.e., movies that did not share actors with any of the other movies), most (148) of the movies form a large connected component.
graph.density(h1.mov)
## [1] 0.0371
no.clusters(h1.mov)
## [1] 12
clusters(h1.mov)$csize
##  [1] 148   1   1   1   1   1   1   2   1   1   1
## [12]   1
table(E(h1.mov)$weight)
##
##   1   2   3   4   5   6   7  10
## 411  21  12  16   6   1   2   3

The complete movie network can be filtered to examine the single large connected component. In the next figure the edge width has been set to equal the square root of the weight edge attribute. This results in the ties being thicker for movies that share more actors between them (Fig. 9.7).

h2.mov <- induced.subgraph(h1.mov,
            vids=clusters(h1.mov)$membership==1)
plot(h2.mov,vertex.color="red",
     edge.width=sqrt(E(h2.mov)$weight),
     vertex.shape="circle",
     vertex.size=(V(h2.mov)$IMDBrating)-3,
     vertex.label=NA)

Fig. 9.7 Largest component of movie affiliation network

The previous figure is still large and the relatively high density makes it somewhat challenging to interpret any interesting structural features. To help with that, we can identify the higher density cores of the graph, and use that to 'zoom in' on the more interconnected part of the network. (See Chap. 8 for more information.) This network is small enough that we can add node labels to help with the interpretation. This helps us see that the most tightly connected sections of the network correspond to popular movie series, in particular Harry Potter, Batman, Star Wars, and The Hobbit. This makes sense because movies in a series will naturally share many or most of the same actors (Fig. 9.8).
table(graph.coreness(h2.mov))
##
##  1  2  3  4  5  6  7
## 11  5 23 65 29  7  8
h3.mov <- induced.subgraph(h2.mov,
            vids=graph.coreness(h2.mov)>4)
h3.mov
## IGRAPH UNW- 44 158 --
## + attr: name (v/c), year (v/n), IMDBrating
## | (v/n), MPAArating (v/c), shape (v/c),
## | color (v/c), deg (v/n), totrating (v/n),
## | avgrating (v/n), weight (e/n)
## + edges (vertex names):
## [1] Inception--The Wolf of Wall Street
## [2] Inception--Django Unchained
## [3] Inception--The Dark Knight Rises
## [4] Inception--The Dark Knight
## [5] Inception--The Departed
## + ... omitted several edges
plot(h3.mov,vertex.color="red",
     vertex.shape="circle",
     edge.width=sqrt(E(h3.mov)$weight),
     vertex.label.cex=0.7,vertex.label.color="darkgreen",
     vertex.label.dist=0.3,
     vertex.size=(V(h3.mov)$IMDBrating)-3)

Fig. 9.8 Core movie affiliation network

Part IV Modeling

Chapter 10
Random Network Models

What is this? A center for ants? How can we be expected to teach children to learn how to read. . . if they can't even fit inside the building? (Derek Zoolander, in the movie Zoolander, after looking at a miniature model of a school building.)

10.1 The Role of Network Models

According to Linton Freeman (2004), modern social network analysis has four main characteristics:

1. It is motivated by a structural intuition based on ties linking social actors;
2. It is grounded in systematic empirical data;
3. It draws heavily on graphic imagery; and
4. It relies on the use of mathematical and/or computational models.

The preceding sections of this book focused on the first three elements of Freeman's characterization. The next four chapters now turn to consider his last point, the utility of modeling in network analysis. Scientific models are simplified descriptions of the real world that are used to predict or explain the characteristics or behavior of the phenomenon of interest. Models can be used in network science in the same way. With network models we can move beyond simple description to build and test hypotheses about network structures, formation processes, and network dynamics.

In this chapter, a number of basic mathematical models of network structure and formation are covered. These are important models in the history of network science, but they are still useful today to provide insight into fundamental properties of social networks, to serve as baseline or comparison models for empirical social networks, and to act as building blocks for more complex network simulations. Well over a dozen functions are provided in igraph for generating random networks based on a number of mathematical algorithms and heuristics. These all use 'game' as the final part of the function name, for example barabasi.game() produces scale-free random graphs based on the Barabási-Albert model (1999).
In the rest of this chapter, a number of important mathematical network models available in igraph are presented, along with some examples of how to use these models to explore network properties and as comparisons to observed, empirical social networks.

© Springer International Publishing Switzerland 2015. D.A. Luke, A User’s Guide to Network Analysis in R, Use R!, DOI 10.1007/978-3-319-23883-8_10

10.2 Models of Network Structure and Formation

10.2.1 Erdős-Rényi Random Graph Model

The earliest, and still one of the most important, mathematical models of network structure is the random graph model first developed by Paul Erdős and Alfréd Rényi in the late 1950s and early 1960s (Newman 2010). This is sometimes called the Poisson random graph model (because of the Poisson degree distribution of large random graphs), or sometimes even just the random graph model. In its simplest form, G(n, m), a random graph G is defined with n vertices and m edges placed at random among those vertices. A closely related model that is easier to work with is G(n, p), where instead of specifying m edges, each possible edge appears in the graph with probability p.

This random graph model is implemented in igraph with the erdos.renyi.game() function. A random graph is produced by specifying the size of the desired network, and either the number of edges or the probability of observing an edge. The type argument is used to specify whether the second argument should be interpreted as the probability of an edge p ('gnp'), or the number of edges m ('gnm').

library(igraph)
g <- erdos.renyi.game(n=12,10,type='gnm')
g
## IGRAPH U--- 12 10 -- Erdos renyi (gnm) graph
## + attr: name (g/c), type (g/c), loops (g/l),
## | m (g/n)
## + edges:
## [1] 4-- 5 3-- 6 2-- 8 1-- 9 8-- 9 8--10 1--11
## [8] 6--11 8--12 9--12

graph.density(g)
## [1] 0.152

The random nature of the graphs can be seen by producing and examining multiple graphs.
In each case the number of vertices is the same, but the ties are randomly determined (Fig. 10.1).

op <- par(mar=c(0,1,3,1),mfrow=c(1,2))
plot(erdos.renyi.game(n=12,10,type='gnm'),
     vertex.color=2,
     main="First random graph")
plot(erdos.renyi.game(n=12,10,type='gnm'),
     vertex.color=4,
     main="Second random graph")
par(op)

Fig. 10.1 Two random graphs

Despite the simplicity of the random graph model, it has led to a number of important discoveries about network structures. First, as suggested above, for large n the network will have a Poisson degree distribution (Fig. 10.2).

g <- erdos.renyi.game(n=1000,.005,type='gnp')
plot(degree.distribution(g),
     type="b",xlab="Degree",ylab="Proportion")

Fig. 10.2 Degree distribution for G with n = 1,000 and p = 0.005

More unexpectedly, it turns out that random graphs become entirely connected for fairly low values of average degree. That means even when edges are determined randomly, each individual network member does not have to be connected to too many other members for the network itself to be connected (i.e., the network has only one component). More precisely, if p is greater than $\frac{\ln n}{n}$, then the random graph is likely to be connected in one large component (Newman 2010). The average degree of a random graph, c, is related to graph size and edge probability:

$c = (n - 1)p$

The threshold average degree is therefore roughly ln n, which is about 4.6 for n = 100 and about 9.2 for n = 10,000. So across the range of network sizes typically seen in social network analysis (say, 100–10,000), the average degree required to have a completely connected network will be less than approximately 12. The following random graph simulation and plot demonstrate this relationship (Fig. 10.3).

crnd <- runif(500,1,8)
cmp_prp <- sapply(crnd,function(x)
  max(clusters(erdos.renyi.game(n=1000,
               p=x/999))$csize)/1000)
smoothingSpline <- smooth.spline(crnd,cmp_prp,
                                 spar=0.25)
plot(crnd,cmp_prp,col='grey60',
     xlab="Avg. Degree",
     ylab="Largest Component Proportion")
lines(smoothingSpline,lwd=1.5)

Fig. 10.3 Relationship of average degree and connectedness in random graphs

This demonstration requires some unpacking. First, a vector is created with 500 random values ranging from one to eight. This will be used as input to a function that will create 500 random graphs, with average degree varying from 1 to 8. Next, the sapply() function is used to call the random graph function repeatedly, and assign the results to cmp_prp. The function itself creates each random graph with 1,000 nodes, and with p equal to the desired average degree divided by 999. (From above, $p = \frac{c}{n-1}$.) Once the random graph is produced, the size of the largest component is calculated using the clusters() function. This number is then divided by the size of the network (1,000) to get the proportion of the network accounted for by the largest component. The results show that for random networks of 1,000 nodes, the network will be almost or completely connected when the average degree is larger than four or five.

Another surprising property of random graphs is that the connected random graphs are quite compact. That is, the diameter of the largest components in random graphs stays relatively small even for large networks.

n_vect <- rep(c(50,100,500,1000,5000),each=50)
g_diam <- sapply(n_vect,function(x)
  diameter(erdos.renyi.game(n=x,p=6/(x-1))))
library(lattice)
bwplot(g_diam ~ factor(n_vect),
       panel = panel.violin,
       xlab = "Network Size",
       ylab = "Diameter")

The above code runs a total of 250 simulations, producing random graphs from 50 to 5,000 nodes.
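The diameter simulation can also be summarized numerically rather than graphically. The following is a minimal sketch, assuming a smaller set of network sizes and fewer replicates so that it runs quickly; the seed and the names sizes, diam, and diam_by_size are illustrative and are not part of the original example.

```r
library(igraph)

set.seed(123)  # illustrative seed, used only for reproducibility
sizes <- rep(c(50, 100, 500, 1000), each = 20)

# Average degree is held at 6, so p = c/(n - 1) = 6/(n - 1)
diam <- sapply(sizes, function(n)
  diameter(erdos.renyi.game(n = n, p.or.m = 6/(n - 1), type = "gnp")))

# Median diameter for each network size
diam_by_size <- tapply(diam, sizes, median)
print(diam_by_size)
```

Reading the medians side by side gives a rough sense of how slowly the diameter grows relative to the number of nodes.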
As the plot shows, although the size of the graphs increases across two orders of magnitude, the diameter of the largest component in each graph increases much more slowly, from about five to ten (Fig. 10.4).

Fig. 10.4 Relationship of random graph size and diameter, for average degree = 6

These two characteristics of random graphs, being completely connected with low average degree and the diameter increasing slowly relative to graph size, may be partly responsible for some of the ‘small-world’ characteristics of real-world social networks (Newman 2010).

10.2.2 Small-World Model

The Erdős-Rényi random graph model has one major limitation in that it does not describe the properties of many real-world social networks. In particular, fully random graphs have degree distributions that do not match observed networks very well, and they also have quite low levels of clustering (transitivity).

One type of model, called the small-world model by Watts and Strogatz (1998), produces random networks that are somewhat more realistic than Erdős-Rényi graphs. In particular, small-world model networks have more realistic levels of transitivity along with small diameters.

The small-world model starts with a circle of nodes, where each node is connected to its c immediate neighbors (forming a regular lattice structure). Then, a small number of existing edges are rewired, where they are removed and then replaced with another tie that connects two random nodes. If the rewiring probability p is 0, then we end up with the original lattice network. When p is 1, then we have an Erdős-Rényi random graph. The main interesting discovery of Watts and Strogatz (and others) is that only a small fraction of ties needs to be rewired to dramatically reduce the diameter of the network.

Figure 10.5 shows how various small-world model networks look with different rewiring probabilities.
The watts.strogatz.game() function is called to produce a small-world network of 30 nodes. Setting the option nei=2 (for neighborhood) will start the network with each node tied to the closest two neighbors on either side. This results in each node having degree = 4.

g1 <- watts.strogatz.game(dim=1, size=30, nei=2, p=0)
g2 <- watts.strogatz.game(dim=1, size=30, nei=2, p=.05)
g3 <- watts.strogatz.game(dim=1, size=30, nei=2, p=.20)
g4 <- watts.strogatz.game(dim=1, size=30, nei=2, p=1)
op <- par(mar=c(2,1,3,1),mfrow=c(2,2))
plot(g1,vertex.label=NA,layout=layout_with_kk,
     main=expression(paste(italic(p)," = 0")))
plot(g2,vertex.label=NA,
     main=expression(paste(italic(p)," = .05")))
plot(g3,vertex.label=NA,
     main=expression(paste(italic(p)," = .20")))
plot(g4,vertex.label=NA,
     main=expression(paste(italic(p)," = 1")))
par(op)

Fig. 10.5 Small-world models with increasing rewiring probabilities

The following simulation and figure show how quickly rewiring reduces the diameter of a network in the small-world model. Working with a network with 100 nodes, each node starts out connected to its two neighbors on each side. The graph will thus have 200 edges. The starting diameter of the lattice network is 25 (getting from one node to the other side of the circle takes 25 steps).

g100 <- watts.strogatz.game(dim=1,size=100,nei=2,p=0)
g100
## IGRAPH U--- 100 200 -- Watts-Strogatz random grap
## + attr: name (g/c), dim (g/n), size (g/n),
## | nei (g/n), p (g/n), loops (g/l), multiple
## | (g/l)
## + edges:
##  [1]  1-- 2  2-- 3  3-- 4  4-- 5  5-- 6  6-- 7
##  [7]  7-- 8  8-- 9  9--10 10--11 11--12 12--13
## [13] 13--14 14--15 15--16 16--17 17--18 18--19
## [19] 19--20 20--21 21--22 22--23 23--24 24--25
## [25] 25--26 26--27 27--28 28--29 29--30 30--31
## [31] 31--32 32--33 33--34 34--35 35--36 36--37
## + ... omitted several edges

diameter(g100)
## [1] 25

The simulation is set to generate 300 networks, ten each for the number of edges to rewire ranging from 1 to 30. Because we know how many edges are in each graph (200), the rewiring probability can be calculated as the number of rewired edges divided by the total number of edges. If 30 edges are rewired, then the probability is 0.15.

p_vect <- rep(1:30,each=10)
g_diam <- sapply(p_vect,function(x)
  diameter(watts.strogatz.game(dim=1, size=100,
                               nei=2, p=x/200)))
smoothingSpline <- smooth.spline(p_vect, g_diam,
                                 spar=0.35)
plot(jitter(p_vect,1),g_diam,col='grey60',
     xlab="Number of Rewired Edges",
     ylab="Diameter")
lines(smoothingSpline,lwd=1.5)

The plot demonstrates that after rewiring only ten of the edges (p = 0.05), the diameter has shrunk by at least 60%, from 25 to about 10 (Fig. 10.6).

Fig. 10.6 Relationship of rewiring probability to network diameter for the small-world model

10.2.3 Scale-Free Models

An important limitation of the previous two mathematical network models is that they produce graphs with degree distributions that are not representative of many real-world social networks. Numerous studies, in fact, have shown that a wide variety of observed networks have heavy-tailed degree distributions that approximately follow a power law. These are typically called scale-free networks. For example, both the network of sexual partners and the World-Wide-Web exhibit this scale-free pattern (Broder et al. 2000; Liljeros et al. 2001). That is, some people have many sexual partners (high degree), but most people have a small number of sexual partners. Similarly, some websites have a very large number of other websites connected to them, but most websites have only a few connections.

How does this characteristic power-law feature of scale-free social networks arise?
A number of network scientists have explored this question, and have determined that a network formation process of cumulative advantage, or preferential attachment, can explain this. That is, as networks grow, new nodes are more likely to form ties with other nodes that already have many ties, due to their visibility in the network. This ‘rich-gets-richer’ phenomenon has been shown to lead to the power-law distribution in networks (de Solla Price 1976; Barabási and Albert 1999).

The preferential attachment model of Barabási and Albert is implemented in igraph with the barabasi.game() function. This is a more complicated algorithm than those for the previous models, partly because this is a network growth model, not just a static network structure model.

Figure 10.7 displays a 500-node network that is formed with this preferential attachment model. The default behavior of the algorithm is that as each new node is added to the network, it is connected to another node in the network, with probability proportional to the degree of that node. Thus, some nodes in the network will end up with many more ties than most of the other nodes. The code for this example highlights these hubs by coloring the nodes with degree > 9 red. Also, the nodes are sized based on their degree, using the rescale() function from Chap. 5.

g <- barabasi.game(500, directed = FALSE)
V(g)$color <- "lightblue"
V(g)[degree(g) > 9]$color <- "red"
node_size <- rescale(node_char = degree(g),
                     low = 2, high = 8)
plot(g, vertex.label = NA, vertex.size = node_size)

We can see the heavy-tailed distribution in a few ways. The median degree is 1, and the mean is close to 2. The highest degree is 27. The pattern is easily seen in Fig. 10.8. The left panel shows the raw degree distribution, while the right panel displays the same degree distribution, but on log scales. If the distribution follows a power law, then the data points should fall along a straight line (at least for the tail).
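The straight-line check described above can be made slightly more concrete by fitting a line to the degree distribution on the log-log scale. What follows is a rough sketch under several assumptions: a larger network (2,000 nodes) is used so that the tail is better populated, the seed and object names (g_pa, deg_tab, fit, slope) are illustrative, and an ordinary least-squares fit is only a crude diagnostic, not a proper estimate of a power-law exponent. igraph also provides fit_power_law() for a maximum-likelihood fit.

```r
library(igraph)

set.seed(2015)  # illustrative seed, used only for reproducibility
g_pa <- barabasi.game(2000, directed = FALSE)

# Proportion of nodes observed at each degree value
deg_tab <- table(degree(g_pa))
k    <- as.numeric(names(deg_tab))
prop <- as.numeric(deg_tab) / vcount(g_pa)

# Least-squares line on the log-log scale (rough check only)
fit   <- lm(log(prop) ~ log(k))
slope <- unname(coef(fit)[2])
print(slope)
```

A clearly negative slope with points scattered around the fitted line is consistent with a heavy-tailed distribution, although it does not by itself establish a power law.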
(Note that it is difficult to assess power-law relationships with small networks.)

median(degree(g))
## [1] 1

mean(degree(g))
## [1] 2

Fig. 10.7 Example scale-free network with default options

table(degree(g))
##
##   1   2   3   4   5   6   7   9  10  11  12  14
## 314  86  41  23  11  10   3   1   3   2   2   1
##  16  19  27
##   1   1   1

op <- par(mfrow=c(1,2))
plot(degree.distribution(g),xlab="Degree",
     ylab="Proportion")
plot(degree.distribution(g),log='xy',
     xlab="Degree",ylab="Proportion")
par(op)

Fig. 10.8 Degree distribution of scale-free model (linear and log-linear scales)

The user can adjust a number of parameters in the barabasi.game() function to produce a wide variety of preferential attachment networks. For example, the following code produces a network that might be viewed as a little more realistic than that shown in the previous figure. Here, instead of each new node connecting to exactly one other node, the out.dist option is used to specify a distribution of tie probabilities. In this case a new node will have a tie to 0 other nodes (isolate) 25% of the time, will be tied to one other node 50% of the time, and to two nodes 25% of the time. Also, the zero.appeal option is used to make it somewhat more likely that new nodes are connected to previously existing isolates. Figure 10.9 shows that the resulting graph has produced a number of isolates and slightly fewer nodes with high degree.

g <- barabasi.game(500, out.dist = c(0.25, 0.5, 0.25),
                   directed = FALSE, zero.appeal = 1)
V(g)$color <- "lightblue"
V(g)[degree(g) > 9]$color <- "red"
node_size <- rescale(node_char = degree(g),
                     low = 2, high = 8)
plot(g, vertex.label = NA, vertex.size = node_size)

Fig. 10.9 Example scale-free network with modified options

Finally, to illustrate how preferential attachment networks grow, the following figure shows what the networks look like at four different stages, with graph sizes of 10, 25, 50, and 100 nodes (Fig. 10.10).

g1 <- barabasi.game(10,m=1,directed=FALSE)
g2 <- barabasi.game(25,m=1,directed=FALSE)
g3 <- barabasi.game(50,m=1,directed=FALSE)
g4 <- barabasi.game(100,m=1,directed=FALSE)
op <- par(mfrow=c(2,2),mar=c(4,0,1,0))
plot(g1, vertex.label= NA, vertex.size = 3,
     xlab = "n = 10")
plot(g2, vertex.label= NA, vertex.size = 3,
     xlab = "n = 25")
plot(g3, vertex.label= NA, vertex.size = 3,
     xlab = "n = 50")
plot(g4, vertex.label= NA, vertex.size = 3,
     xlab = "n = 100")
par(op)

Fig. 10.10 Growth of networks using preferential attachment model

10.3 Comparing Random Models to Empirical Networks

The mathematical models described here, as well as many others, have been used to study the theoretical properties of networks. These models can also be useful as comparisons to empirical social networks. As a simple example of this, we can explore some basic network properties of the lhds data taken from Harris’ 2013 book on exponential random graph models (see Chap. 11). The lhds network is made up of communication ties among 1,283 leaders of local public health departments. The network has quite low density, although the average degree is over four (Fig. 10.11).

library(intergraph)  # provides asIgraph()
data(lhds)
ilhds <- asIgraph(lhds)
ilhds
## IGRAPH U--- 1283 2708 --
## + attr: title (g/c), hivscreen (v/c), na
## | (v/l), nutrition (v/c), popmil (v/n),
## | state (v/c), vertex.names (v/c), years
## | (v/n), na (e/l)
## + edges:
##  [1]  2--  10  2--  11  2--  19  2--  20  5--1003
##  [6]  5--   6  6--  11  6--  17 10--  11 11--  19
## [11] 11--  26  2--  12  6--  12 10--  12 11--  12
## [16] 12--  19 12--  26  9--  14 14--  15 14--  18
## [21] 14--  25 14--  27 14-- 226 14-- 414 14-- 697
## + ... omitted several edges

graph.density(ilhds)
## [1] 0.00329

mean(degree(ilhds))
## [1] 4.22

Fig. 10.11 Local health department communication network

The following code builds three network models that have the same size and approximately the same density as the lhds network. By comparing the characteristics of the network models with the empirical network, we can highlight the interesting or important characteristics of the empirical network that might be worth further exploration (by using, for example, the types of statistical modeling and simulation approaches presented in the next few chapters).

g_rnd <- erdos.renyi.game(1283,.0033,type='gnp')
g_smwrld <- watts.strogatz.game(dim=1,size=1283,
                                nei=2,p=.25)
g_prfatt <- barabasi.game(1283,out.dist=c(.15,.6,.25),
                          directed=FALSE,zero.appeal=2)

Table 10.1 presents some descriptive network statistics for the three models and the lhds network. Although each model captures some of the characteristics of the observed network, none of them matches across all the statistics. In particular, the lhds network has much higher transitivity than any of the models (Fig. 10.12).

Table 10.1 Comparison of model and empirical network characteristics

Name                      Size   Density   Avg. degree   Transitivity   Isolates
Erdos-Renyi               1283   0.003     4.404         0.002          21
Small world               1283   0.003     4.000         0.088          1
Preferential attachment   1283   0.002     2.195         0.003          109
Health department         1283   0.003     4.221         0.306          58

Fig. 10.12 Comparison of degree distributions for random models and empirical local health department network (panels: Erdos-Renyi Random Graph, Small World, Preferential Attachment, Local Health Dept. Network; horizontal axis: Degree)

Chapter 11 Statistical Network Models

Prediction and explanation are exactly symmetrical. Explanations are, in effect, predictions about what has happened; predictions are explanations about what’s going to happen. (John Rogers Searle)

11.1 Introduction

As suggested in the previous chapter, for most of the history of network science analysts were limited to network visualization and network description. Only in the past couple of decades have network statistical theory and computational power developed enough to allow for valid and feasible statistical modeling of networks. The primary barrier to statistical network modeling was the fundamental assumption of independence of observations that underlies much of traditional statistical theory. Networks by definition are non-independent. If you know that one actor is tied to another actor, you have information about the second actor that is dependent on the first. Over the years statistical theorists gradually developed increasingly sophisticated models that could be applied to empirical network data, including dyadic dependence and dyadic independence models (such as p*, see Harris 2013).

The focus of this chapter is on exponential random graph models (ERGMs), which have turned out to be the most powerful, flexible, and widely used modeling approach for building and testing statistical models of networks. An ERGM is a true generative statistical model of network structure and characteristics (Hunter et al. 2008). This means that inferential hypotheses can be proposed and tested. It is generative in the sense that characteristics of the individual elements in the network (i.e., actors) and local structural properties can be used to predict properties of the entire network (e.g., diameter, degree distribution, etc.).

ERGMs are popular for at least four reasons.
First, they can handle the complex dependencies of network data without the types of degeneracy problems that were frequently encountered in earlier network models. Second, ERGMs are flexible and can handle many different types of predictors and covariates. Third, the generative approach where overall network characteristics are predicted from individual actor and local structural properties enhances the validity of the models. Finally, ERGMs have been implemented in programming suites and statistical packages such as R, making it easier for applied analysts to build, test, and disseminate the results of their network models.

ERGMs are fit using Markov chain Monte Carlo maximum-likelihood estimation (MCMC). The estimates of the statistical parameters are based on an underlying simulation, where many (typically thousands of) networks are produced to reflect the particular model being tested. One way to express the basic ERGM model is:

$$P(y_{ij} = 1 \mid Y_{ij}^{C}) = \frac{1}{c} \exp\left\{ \sum_{k=1}^{K} \theta_k z_k(y) \right\}$$

This shows that the model is predicting the probability of a tie between actors i and j, conditional on the rest of the network (all other ties). The thetas ($\theta_k$) are the coefficients of the network statistics of interest, one for each of the K included statistics, $z_k(y)$. $\frac{1}{c}$ is simply a normalizing constant that ensures that the probabilities stay within 0 and 1. (See Harris 2013, for more details.)

ERGMs are implemented in the ergm package that is contained in the statnet suite of network analysis packages in R. It is actively maintained and developed by network scientists at the University of Washington, among others. More technical details of ergm are provided in excellent papers by Hunter and colleagues (2008) as well as Goodreau (2007).
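The conditional tie probability in an ERGM is often easier to interpret in log-odds form, which is how the model is commonly written in the ERGM literature (e.g., Hunter et al. 2008). The sketch below is offered as an aid to interpretation; the change-statistic notation $\delta_k(y)_{ij}$ and the symbols $y_{ij}^{+}$ and $y_{ij}^{-}$ (the network with the tie between i and j present or absent, all other ties unchanged) are introduced here and do not appear in the text above.

```latex
\operatorname{logit} P(y_{ij} = 1 \mid Y_{ij}^{C})
  = \log \frac{P(y_{ij} = 1 \mid Y_{ij}^{C})}{P(y_{ij} = 0 \mid Y_{ij}^{C})}
  = \sum_{k=1}^{K} \theta_k \, \delta_k(y)_{ij},
\qquad
\delta_k(y)_{ij} = z_k\!\left(y_{ij}^{+}\right) - z_k\!\left(y_{ij}^{-}\right)
```

Read this way, each $\theta_k$ is the change in the log-odds of a tie for a one-unit change in the corresponding statistic, much like a logistic regression coefficient. For a model whose only statistic is the number of edges, adding any tie changes that statistic by exactly one, so the conditional log-odds of every tie is simply the edges coefficient.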
As suggested above, one of the strengths of ERGMs is the ability to handle a wide variety of predictors. In fact, the documentation for ergm lists more than a hundred possible types of terms that can be included in an ergm model specification. It is easier to navigate the possibilities once one knows that all the possible predictors in an ERGM fall into one of four broad categories: node-level predictors, dyadic predictors, relational predictors, and local structural predictors. Table 11.1 lists these categories, along with some of the most commonly used ergm terms for each type. The first type of predictor is node or actor-level characteristics, where having a particular characteristic is hypothesized to affect the likelihood of observing a tie. For example, if you hypothesize that girls are more likely to make friends than boys in middle school, then you could use actor gender as a node-level predictor. Dyad-level predictors are used when you hypothesize that the characteristics of both actors in a dyad may influence the probability of observing a tie between those two actors. These types of predictors allow you to test hypotheses of assortative or disassortative mixing in networks, leading to patterns of homophily or heterophily. For example, if you think that friendships are more likely to be formed within the same grade in a middle school (assortative mixing), then you could use grade as a dyad-level predictor. The third type of predictor in ERGMs is a powerful option to use information about other relationships or ties when predicting the observed ties in a network. That is, you can use one type of network tie to predict a second type of tie (as long as they are both collected on the same set of network members). Finally, information about local structural properties of the observed network can be used as model covariates. 
This, for example, allows the network model to be conditioned on the observed degree distribution, or on the level of transitivity (closed triangles) that is observed.

Table 11.1 Common ERGM terms

Predictor type   Term
Node             nodefactor, nodecov
Dyad             nodemix, nodematch, absdiff
Relation         edgecov
Structure        gwdegree, gwdsp, gwesp

In the following examples, we will build a series of ERGMs that use predictors from each of these four broad types of ergm terms. For more detailed information about the terms included in ergm, see help('ergm-terms'). A more general overview is provided by Morris, Handcock, and Hunter (2008). Finally, ergm includes a helpful vignette that shows how the various terms are related to one another (vignette('ergm-term-crossRef')).

11.2 Building Exponential Random Graph Models

To explore a number of the stochastic modeling possibilities of the ergm package, we will use network data that describe interorganizational relationships among 25 agencies within the Indiana state tobacco control program in 2010. These data include three different types of interorganizational ties: frequency of contact, level of collaboration, and whether each pair of agencies communicated with one another about a particular evidence-based guideline published by the Centers for Disease Control and Prevention (CDC), called Best Practices for Tobacco Control. This latter relationship is conceptualized as a type of dissemination tie. The following example models will focus on predicting the pattern of these dissemination ties among the Indiana tobacco control organizations.

The data are included in the UserNetR package as a network list object called TCnetworks. The network data include a number of node characteristics (e.g., tob_yrs, which records how long an agency has been working in tobacco control), edge characteristics, and a sociomatrix (TCdist) which contains the geographic distance between each pair of agencies.
For more information, see the help file for TCnetworks.

Prior to modeling, the network data need to be extracted from the list object, and any preliminary descriptive and visualization tasks can be conducted.

library(statnet)   # network, sna, and ergm functions
library(UserNetR)  # contains TCnetworks
data(TCnetworks)
TCcnt <- TCnetworks$TCcnt
TCcoll <- TCnetworks$TCcoll
TCdiss <- TCnetworks$TCdiss
TCdist <- TCnetworks$TCdist
summary(TCdiss,print.adj=FALSE)
## Network attributes:
##   vertices = 25
##   directed = FALSE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##   title = IN_Diffusion
##  total edges = 103
##    missing edges = 0
##    non-missing edges = 103
##  density = 0.343
##
## Vertex attributes:
##
##  agency_cat:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    2.00    2.00    3.24    5.00    6.00
##
##  agency_lvl:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    2.00    2.04    3.00    3.00
##
##  lead_agency:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    0.00    0.00    0.04    0.00    1.00
##
##  tob_yrs:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.00    4.50    6.76    9.00   21.00
##  vertex.names:
##    character valued attribute
##    25 valid vertex names
##
## No edge attributes

A quick examination of the network reveals that it includes three types of organizations (local, state, and national), that it forms one fairly densely connected component, and that there is some variability of centrality across the network members (shown both by the betweenness centralization score and by the variability of the sizes of nodes in the figure, which are based on degree) (Fig. 11.1).
components(TCdiss)
## [1] 1

gden(TCdiss)
## [1] 0.343

centralization(TCdiss,betweenness,mode='graph')
## [1] 0.381

deg <- degree(TCdiss,gmode='graph')
lvl <- TCdiss %v% 'agency_lvl'
plot(TCdiss,usearrows=FALSE,displaylabels=TRUE,
     vertex.cex=log(deg),
     vertex.col=lvl+1,
     label.pos=3,label.cex=.7,
     edge.lwd=0.5,edge.col="grey75")
legend("bottomleft",legend=c("Local","State",
       "National"),
       col=2:4,pch=19,pt.cex=1.5)

Fig. 11.1 Indiana tobacco control dissemination network

11.2.1 Building a Null Model

It is often useful to start the modeling process by building a null model, one with no substantive or structural predictors. This can be used as a baseline model to judge how much subsequent models are improving. A null model typically only has one term, edges, which produces a random graph model that has, on average, the same number of edges as the observed network.

Fitting an ergm model uses syntax similar to other statistical modeling functions in R. The ergm function is called with a model formula. This formula lists the observed network as the dependent variable on the left-hand side, and all of the ergm model terms on the right-hand side. Other options can also be set by the user. The control option is used to pass control parameters to the ergm algorithm. Here we set the random number seed to ensure that the same results will be seen across multiple runs. The results of fitting the model are stored in a model object for further examination and analysis.
library(ergm)
DSmod0 <- ergm(TCdiss ~ edges,
               control=control.ergm(seed=40))
class(DSmod0)
## [1] "ergm"

summary(DSmod0)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula:   TCdiss ~ edges
##
## Iterations:  4 out of 20
##
## Monte Carlo MLE Results:
##       Estimate Std. Error MCMC % p-value
## edges   -0.648      0.122      0  <1e-04
##
##      Null Deviance: 416  on 300  degrees of freedom
##  Residual Deviance: 386  on 299  degrees of freedom
##
## AIC: 388    BIC: 392    (Smaller is better.)

A null model includes only the edges term. This acts as a type of intercept for the model, and ensures that the simulated networks have, on average, the same number of edges as the observed network. This can be seen by taking the logistic transformation of the edges parameter, which gives the overall density of the network. This demonstrates that the null model is constrained by the number of edges in the observed network.

plogis(coef(DSmod0))
## edges
## 0.343

11.2.2 Including Node Attributes

Once a null model is obtained, more interesting models can be fit using a wide variety of predictors. Following the order suggested by Table 11.1, we start with main effect terms based on individual node characteristics. In the Indiana network, we know which agency is the lead agency (receiving funding from CDC), and we also know how long each agency has been working in tobacco control. It might be reasonable to assume that agencies are more likely to be connected to the lead agency. It also would make sense that agencies with a longer history of tobacco control experience would be more likely to be connected to other agencies. The following scatterplot does suggest that there may be a relationship between experience and interorganizational connections (Fig. 11.2).
scatter.smooth(TCdiss %v% 'tob_yrs',
               degree(TCdiss,gmode='graph'),
               xlab='Years of Tobacco Experience',
               ylab='Degree')

Fig. 11.2 Basic association between years of experience and node degree (x-axis: Years of Tobacco Experience; y-axis: Degree)

Both of the above hypotheses can be formally tested. There are two main ergm model terms for testing node characteristic main effects, nodefactor and nodecov. nodefactor is used for a categorical attribute (e.g., lead agency), while nodecov is used for quantitative characteristics (such as tob_yrs).

DSmod1 <- ergm(TCdiss ~ edges +
               nodefactor('lead_agency') +
               nodecov('tob_yrs'),
               control=control.ergm(seed=40))
summary(DSmod1)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula:   TCdiss ~ edges + nodefactor("lead_agency") +
##     nodecov("tob_yrs")
##
## Iterations:  16 out of 20
##
## Monte Carlo MLE Results:
##                          Estimate Std. Error MCMC % p-value
## edges                     -1.6783     0.3293      0  <1e-04
## nodefactor.lead_agency.1  17.9366   933.5551      0  0.9847
## nodecov.tob_yrs            0.0599     0.0228      0  0.0091
##
##      Null Deviance: 416  on 300  degrees of freedom
##  Residual Deviance: 323  on 297  degrees of freedom
##
## AIC: 329    BIC: 341    (Smaller is better.)

The results show a number of interesting points. First, the number of years in tobacco control is positively and significantly associated with the likelihood of observing a tie between two agencies (remember that here a tie reflects a dissemination connection between two organizations). Second, although the parameter for the effect of being the lead agency is quite large, it is not significant. This appears to be due to low precision in the estimate. With only one lead agency in a small network of 25 members, there may not be the power to detect this effect.
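The imprecision of the lead-agency estimate can be checked by hand from the printed table. The sketch below recomputes the Wald ratio (estimate divided by standard error) for the two node-level terms of Model 1. It assumes a standard normal reference distribution, which is an assumption made here for illustration only; the p-values printed by summary() may rest on a slightly different reference distribution, so small discrepancies are to be expected.

```r
# Estimates and standard errors copied from the Model 1 summary above
est <- c(lead_agency = 17.9366, tob_yrs = 0.0599)
se  <- c(lead_agency = 933.5551, tob_yrs = 0.0228)

# Wald ratio: how many standard errors the estimate lies from zero
z <- est / se

# Two-sided p-value under an ASSUMED standard normal reference
p <- 2 * pnorm(-abs(z))

round(cbind(estimate = est, std.error = se, z = z, p = p), 4)
```

The lead-agency ratio is close to zero because its standard error is roughly fifty times larger than the estimate itself, which is what low precision means in this context.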
Finally, the AIC (Akaike information criterion) for this model with two predictors is lower than the AIC for the null model (329 versus 388). This shows that this model is doing a better job of explaining the data than the baseline model.

Similar to logistic regression analysis, we can estimate the probability of observing certain types of ties using the fitted parameter estimates. This requires using the logistic transformation to obtain numbers that can be properly interpreted as probabilities. For example, based on Model 1, the following code calculates the probability that there is a dissemination tie between two agencies, one with 5 years of tobacco control experience, the other with 10 years of experience, neither of which is the lead agency.

p_edg <- coef(DSmod1)[1]
p_yrs <- coef(DSmod1)[3]
plogis(p_edg + 5*p_yrs + 10*p_yrs)
## edges
## 0.314

The result is 0.31, which is just a little less than the overall density (i.e., overall probability of observing a tie) of the dissemination network.

11.2.3 Including Dyadic Predictors

A rich source of hypotheses for network structures derives from questions about homophily and heterophily. That is, are ties more or less likely between network members who are similar to each other on some characteristic (homophily) or dissimilar (heterophily)? This is a type of dyadic interaction predictor, and ergm includes a number of these terms.

The raw frequencies of observed ties among and between different types of actors in the network can be displayed with the mixingmatrix() function. Here, for example, we see that the most frequent dissemination ties (24) are observed between local and state-level agencies (see ?TCnetworks for the covariate code definitions). It can be somewhat complicated to interpret these raw frequency patterns; however, they can generate hypotheses about dyadic interrelationships that can be formally tested in the ERGM.
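To make concrete what a mixing matrix counts, here is a minimal base-R sketch on a toy undirected network. The adjacency matrix `adj` and the attribute vector `lvl` are hypothetical and are not the Indiana data; the tabulation is a hand-rolled illustration, not the implementation used by mixingmatrix().

```r
# Hypothetical toy network: 5 nodes, 5 undirected ties
ties <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
adj <- matrix(0, 5, 5)
adj[ties] <- 1
adj[ties[, 2:1]] <- 1          # mirror each tie so adj is symmetric

# Hypothetical categorical attribute (e.g., 1 = local, 2 = state, 3 = national)
lvl <- c(1, 1, 2, 2, 3)

# Take each undirected tie once (upper triangle only)
ut <- which(adj == 1 & upper.tri(adj), arr.ind = TRUE)
lo <- pmin(lvl[ut[, "row"]], lvl[ut[, "col"]])
hi <- pmax(lvl[ut[, "row"]], lvl[ut[, "col"]])

# Cross-tabulate the attribute values at the two ends of every tie
mm <- unclass(table(factor(lo, levels = 1:3), factor(hi, levels = 1:3)))

# Symmetric display: off-diagonal counts mirrored, diagonal counted once
mm_sym <- mm + t(mm) - diag(diag(mm))
mm_sym
```

Because each tie between two different categories appears in two cells of the symmetric display, summing the whole matrix double-counts those ties, which is the reason for the printed warning about marginal totals.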
mixingmatrix(TCdiss,'agency_lvl')
## Note:  Marginal totals can be misleading
##  for undirected mixing matrices.
##    1  2  3
## 1 13 24 14
## 2 24 16 23
## 3 14 23 13

mixingmatrix(TCdiss,'agency_cat')
## Note:  Marginal totals can be misleading
##  for undirected mixing matrices.
##    1  2  3 4 5  6
## 1  0 12  3 2 3  4
## 2 12 19 18 6 1 14
## 3  3 18  0 4 3  2
## 4  2  6  4 1 1  6
## 5  3  1  3 1 0  0
## 6  4 14  2 6 0  4

In the following models, the non-significant lead agency predictor is dropped. Three versions of Model 2 are estimated, showing different options for including the dyadic comparison of agency level as a predictor.

DSmod2a <- ergm(TCdiss ~ edges +
                nodecov('tob_yrs') +
                nodematch('agency_lvl'),
                control=control.ergm(seed=40))
summary(DSmod2a)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula:   TCdiss ~ edges + nodecov("tob_yrs") +
##     nodematch("agency_lvl")
##
## Iterations:  4 out of 20
##
## Monte Carlo MLE Results:
##                      Estimate Std. Error MCMC % p-value
## edges                 -2.4808     0.3413      0  <1e-04
## nodecov.tob_yrs        0.1133     0.0201      0  <1e-04
## nodematch.agency_lvl   0.6875     0.2770      0   0.014
##
##      Null Deviance: 416  on 300  degrees of freedom
##  Residual Deviance: 342  on 297  degrees of freedom
##
## AIC: 348    BIC: 359    (Smaller is better.)

Model 2a uses the basic nodematch term to include one network predictor that assesses the effect on the likelihood of a dissemination tie when both organizations are at the same level (e.g., both are local organizations). This is a homophily hypothesis, where we are testing whether organizations of the same type are more likely to communicate with one another. The positive and significant parameter indicates that there is a homophily effect here.
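The size of this homophily effect is easier to judge on the odds scale. The sketch below exponentiates the nodematch coefficient reported for Model 2a and attaches an approximate 95% interval. The interval relies on a normal approximation on the log-odds scale, which is an assumption made here for illustration rather than something reported in the model output.

```r
# Coefficient and standard error for nodematch('agency_lvl'),
# copied from the Model 2a summary
b_match  <- 0.6875
se_match <- 0.2770

# Exponentiate to move from the log-odds scale to an odds ratio
or_match <- exp(b_match)

# Approximate 95% interval (ASSUMES a normal approximation on the log-odds scale)
ci_match <- exp(b_match + c(-1, 1) * qnorm(0.975) * se_match)

round(c(odds_ratio = or_match, lower = ci_match[1], upper = ci_match[2]), 2)
```

Read this way, the odds of a dissemination tie are roughly twice as high when two agencies operate at the same level, holding years of experience constant.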
DSmod2b <- ergm(TCdiss ~ edges +
                nodecov('tob_yrs') +
                nodematch('agency_lvl',diff=TRUE),
                control=control.ergm(seed=40))
summary(DSmod2b)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula:   TCdiss ~ edges + nodecov("tob_yrs") +
##     nodematch("agency_lvl", diff = TRUE)
##
## Iterations:  4 out of 20
##
## Monte Carlo MLE Results:
##                        Estimate Std. Error MCMC % p-value
## edges                   -2.7792     0.3685      0  <1e-04
## nodecov.tob_yrs          0.1331     0.0217      0  <1e-04
## nodematch.agency_lvl.1   1.6145     0.4983      0  0.0013
## nodematch.agency_lvl.2  -0.2148     0.3974      0  0.5891
## nodematch.agency_lvl.3   1.3016     0.4422      0  0.0035
##
##      Null Deviance: 416  on 300  degrees of freedom
##  Residual Deviance: 330  on 295  degrees of freedom
##
## AIC: 340    BIC: 358    (Smaller is better.)

Model 2b shows how to test a hypothesis of differential homophily. Here, instead of one dyad term, there are three, one for each level of the agency level characteristic. Thus, this model now has three homophily terms: one for local agencies, one for state, and one for national. The results suggest that the overall homophily effect is seen mainly at the local and national levels.

DSmod2c <- ergm(TCdiss ~ edges +
                nodecov('tob_yrs') +
                nodemix('agency_lvl',base=1),
                control=control.ergm(seed=40))
summary(DSmod2c)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula:   TCdiss ~ edges + nodecov("tob_yrs") +
##     nodemix("agency_lvl", base = 1)
##
## Iterations:  4 out of 20
##
## Monte Carlo MLE Results:
##                    Estimate Std. Error MCMC % p-value
## edges               -1.1757     0.5372      0  0.0294
## nodecov.tob_yrs      0.1340     0.0222      0  <1e-04
## mix.agency_lvl.1.2  -1.5805     0.5501      0  0.0044
## mix.agency_lvl.2.2  -1.8354     0.6028      0  0.0025
## mix.agency_lvl.1.3  -1.5363     0.5659      0  0.0070
## mix.agency_lvl.2.3  -1.7073     0.5445      0  0.0019
## mix.agency_lvl.3.3  -0.3110     0.6119      0  0.6116
##
##      Null Deviance: 416  on 300  degrees of freedom
##  Residual Deviance: 330  on 293  degrees of freedom
##
## AIC: 344    BIC: 370    (Smaller is better.)

Finally, Model 2c shows how to include the most detailed tests for homophily and heterophily. The nodemix term includes dyadic comparisons for all the possible patterns of a categorical node attribute. The base option sets a reference category for the effects; otherwise all possible effects are included (which may lead to model stability problems). Here the reference category is (1,1), indicating the ties among the local-level agencies. Note that for smaller networks and categorical attributes with a large number of values, the mixing matrix may have empty cells (and the term will use up many more degrees of freedom). These can cause problems for model estimation and interpretation.

11.2.4 Including Relational Terms (Network Predictors)

The third type of predictor that can be included in ERGMs is a relational predictor, where information about ties among the network members is used to predict the likelihood of the dependent variable tie. This means that either other network variables can be used as predictors, or any other type of relational information among the actors. In this example, we use both a traditional network predictor and a relational quantitative variable as predictors.
First, we have the contact network variable, which measured the frequency of contact among the Indiana tobacco control agencies (1 = yearly; 2 = quarterly; 3 = monthly; 4 = weekly; and 5 = daily). Second, the physical distance (in miles) between each pair of agencies was calculated and stored in a statnet network object. For each of these predictors, the hypothesis is fairly evident. First, we expect that agencies that have more frequent contact with each other will be more likely to share information on the Best Practices guidelines. Second, we expect that the further apart two agencies are, the less likely they will be to disseminate information to each other.

To include relational quantitative predictors in an ERGM, the edgecov term can be used. Also, since these predictors are in the form of valued networks, the attr option points to the edge attribute that contains the appropriate quantitative information. First, we view a subset of the relational information for each of the predictors, then the ERGM is fit.
as.sociomatrix(TCdist,attrname = 'distance')[1:5,1:5]
##         1       2    3    4       5
## 1    0.00    1.94  492 1870    1.27
## 2    1.94    0.00  492 1869    1.91
## 3  491.87  491.97    0 2325  493.08
## 4 1869.88 1868.98 2325    0 1868.61
## 5    1.27    1.91  493 1869    0.00

as.sociomatrix(TCcnt,attrname = 'contact')[1:5,1:5]
##          ITPC Promotus RTI Quitline IDHA
## ITPC        0        5   4        4    3
## Promotus    5        0   3        4    0
## RTI         4        3   0        2    0
## Quitline    4        4   2        0    0
## IDHA        3        0   0        0    0

DSmod3 <- ergm(TCdiss ~ edges +
               nodecov('tob_yrs') +
               nodematch('agency_lvl',diff=TRUE) +
               edgecov(TCdist,attr='distance') +
               edgecov(TCcnt,attr='contact'),
               control=control.ergm(seed=40))
summary(DSmod3)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula:   TCdiss ~ edges + nodecov("tob_yrs") +
##     nodematch("agency_lvl", diff = TRUE) +
##     edgecov(TCdist, attr = "distance") +
##     edgecov(TCcnt, attr = "contact")
##
## Iterations:  5 out of 20
##
## Monte Carlo MLE Results:
##                         Estimate Std. Error MCMC % p-value
## edges                  -4.850619   0.629666      0  <1e-04
## nodecov.tob_yrs         0.128750   0.028270      0  <1e-04
## nodematch.agency_lvl.1  1.795102   0.625698      0  0.0044
## nodematch.agency_lvl.2 -0.646015   0.508164      0  0.2046
## nodematch.agency_lvl.3  1.721850   0.546870      0  0.0018
## edgecov.distance       -0.000184   0.000253      0  0.4683
## edgecov.contact         1.124253   0.146957      0  <1e-04
##
##      Null Deviance: 416  on 300  degrees of freedom
##  Residual Deviance: 237  on 293  degrees of freedom
##
## AIC: 251    BIC: 277    (Smaller is better.)

The results show that frequency of contact is positively associated with dissemination, as we hypothesized. On the other hand, physical distance between agencies does not appear to be associated with dissemination.
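The strength of the contact effect is easier to see as a set of predicted probabilities. The sketch below uses the Model 3 coefficients printed above for a hypothetical profile: two local agencies, each with 5 years of experience, at every contact level. Two simplifications are assumptions made here: the distance term is left out because its estimate is close to zero, and a contact value of 0 is read as no recorded contact, based on the zero entries in the contact matrix.

```r
# Coefficients copied from the Model 3 summary above
b <- c(edges = -4.850619, tob_yrs = 0.128750,
       match_local = 1.795102, contact = 1.124253)

# Hypothetical profile: two local agencies, each with 5 years of experience
yrs_sum <- 5 + 5

# Contact codes: 1 = yearly ... 5 = daily; 0 is ASSUMED to mean no contact
contact <- 0:5

# Linear predictor on the log-odds scale (distance term omitted for simplicity)
eta <- b[["edges"]] + b[["tob_yrs"]] * yrs_sum +
  b[["match_local"]] + b[["contact"]] * contact

# Logistic transformation gives tie probabilities
prob <- plogis(eta)
round(setNames(prob, paste0("contact_", contact)), 3)
```

Plotting `prob` against `contact` would show the familiar S-shaped logistic curve, with the predicted probability of a dissemination tie rising steadily as contact becomes more frequent.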
11.2.5 Including Local Structural Predictors (Dyad Dependency)

The final type of predictor that can be included in ERGMs is information about local structural properties. Remember that we want to model how an entire network looks and operates. By including some information about local structural tendencies (such as the tendency for directed ties to be reciprocated), we can often produce models that are much better fits to the observed, empirical network. These types of predictors lead to what are called dyadic-dependency models, and these present many more computational and statistical challenges (Harris 2013). From the applied analyst perspective, when using these types of predictors you should expect that the execution time may go up dramatically, and that you are much more likely to run into problems of model degeneracy and convergence failure.

In recent years, new types of local structural predictor terms have been discovered and developed that at least partially avoid the most problematic model stability and convergence issues (Snijders and Pattison 2006). The three most commonly used of these terms are GWDegree (geometrically weighted degree distribution), GWESP (geometrically weighted edgewise shared partnerships), and GWDSP (geometrically weighted dyadwise shared partnerships). See Harris' monograph (2013) for more details on how to choose and interpret these local structure predictors.

For our example, we include GWESP. This measures the effect of local clustering (or transitivity) on the likelihood of observing a dissemination tie. The model results do indicate that there is transitivity in the Indiana network, and that it is positively associated with dissemination.
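The phrase geometrically weighted can be made concrete with a few lines of arithmetic. The sketch below assumes the usual fixed-decay form of the GWESP statistic, in which an edge whose endpoints share i partners contributes exp(alpha) * (1 - (1 - exp(-alpha))^i); the helper `gw_weight` is a hypothetical name introduced only for this illustration, and the decay of 0.7 matches the value used in the model that follows.

```r
# Decay parameter used for gwesp() in the model that follows
alpha <- 0.7

# Contribution of one edge whose endpoints have i shared partners,
# ASSUMING the usual fixed-decay geometrically weighted form
gw_weight <- function(i, alpha) {
  exp(alpha) * (1 - (1 - exp(-alpha))^i)
}

shared <- 0:6
w <- gw_weight(shared, alpha)

# Marginal gain from each additional shared partner
gain <- diff(w)

round(rbind(shared = shared, weight = w, gain = c(NA, gain)), 3)
```

Under this form the first shared partner adds 1, and with a decay of 0.7 each further shared partner adds about half as much as the one before it, so heavily clustered dyads receive diminishing extra weight.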
DSmod4 <- ergm(TCdiss ~ edges +
               nodecov('tob_yrs') +
               nodematch('agency_lvl',diff=TRUE) +
               edgecov(TCdist,attr='distance') +
               edgecov(TCcnt,attr="contact") +
               gwesp(0.7, fixed=TRUE),
               control=control.ergm(seed=40))
## Starting maximum likelihood estimation via MCMLE:
## Iteration 1 of at most 20:
## The log-likelihood improved by 0.5168
## Step length converged once. Increasing MCMC sample size.
## Iteration 2 of at most 20:
## The log-likelihood improved by 0.03238
## Step length converged twice. Stopping.
##
## This model was fit using MCMC.  To examine model
## diagnostics and check for degeneracy, use the
## mcmc.diagnostics() function.

summary(DSmod4)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula:   TCdiss ~ edges + nodecov("tob_yrs") +
##     nodematch("agency_lvl", diff = TRUE) +
##     edgecov(TCdist, attr = "distance") +
##     edgecov(TCcnt, attr = "contact") +
##     gwesp(0.7, fixed = TRUE)
##
## Iterations:  2 out of 20
##
## Monte Carlo MLE Results:
##                         Estimate Std. Error MCMC % p-value
## edges                  -6.326172   0.960886      0  <1e-04
## nodecov.tob_yrs         0.097673   0.029612      0  0.0011
## nodematch.agency_lvl.1  1.557478   0.595676      0  0.0094
## nodematch.agency_lvl.2 -0.266426   0.471595      0  0.5725
## nodematch.agency_lvl.3  1.506054   0.526105      0  0.0045
## edgecov.distance       -0.000154   0.000243      0  0.5279
## edgecov.contact         1.039252   0.145160      0  <1e-04
## gwesp.fixed.0.7         0.877169   0.377682      0  0.0209
##
##      Null Deviance: 416  on 300  degrees of freedom
##  Residual Deviance: 231  on 292  degrees of freedom
##
## AIC: 247    BIC: 276    (Smaller is better.)

11.3 Examining Exponential Random Graph Models

11.3.1 Model Interpretation

The fitted ERGM objects contain a lot of information about the parameter estimates, simulated networks, and model fit.
The most important information is included in the model summary output. Pay particular attention to any messages about convergence issues, as they typically indicate estimation problems that should be addressed. The parameter estimates themselves, along with their standard errors, can be interpreted similarly to logistic regression parameters. The p-values are based on Wald tests, which use the ratio of each parameter estimate to its standard error. Although not presented in the default summary output, the individual parameter estimates can be exponentiated to produce odds ratios.

The AIC and BIC values are related to overall model fit, where lower numbers indicate better fit of the model to the observed network. Note how the AIC and BIC values were getting smaller as we progressively added predictors to the dissemination model. This is telling us that we were adding useful predictors to our ERGM.

Finally, as with any type of multivariate statistical model, it is often hard to understand the model by examining the individual parameter estimates. It is usually helpful, then, to use the fitted model to produce sets of forecasts that can be plotted or examined to see how the model works across different profiles or ranges of predictor values.

prd_prob1 <- plogis(-6.31 + 2*1*.099 + 1.52 +
                    4*1.042 + .858*(.50^4))
prd_prob1
## [1] 0.408
prd_prob2 <- plogis(-6.31 + 2*5*.099 +
                    1*1.042 + .858*(.50^4))
prd_prob2
## [1] 0.0144

As a simple example, based on our final model we predict the likelihood of a dissemination tie being observed between two agencies, where they both have been working in tobacco control for 1 year, they both are national-level agencies, they have weekly contact, and the network has an average level of transitivity. We ignore the effects of distance, given the small parameter value and lack of significance.
For that predictor profile, the probability of observing a dissemination tie is 41%. In comparison, for two agencies that have been working in tobacco control for 5 years, that are at different levels (local and national, for example), and only have contact yearly, the estimated probability is just 1.4%. (See Harris 2013 for more in-depth worked examples, as well as details on how the forecasting handles the local structural parameters such as GWESP here.)

11.3.2 Model Fit

The ergm package includes a number of tools that can be used to examine the fit of the network model to the data. First, make sure that there were no major problems with convergence, and that the parameter values, standard errors, and p-values make sense. AIC values can also be useful, especially examining how AIC (and BIC) is reduced as more predictors are added to the model.

The simulations underlying the MCMC algorithms also provide useful information for judging model fit. The following procedure compares selected network properties of the simulated networks based on our final model to those same network characteristics of the observed Indiana tobacco control network. Specifically, here we will examine the geodesic distances, the distribution of edgewise shared partners, the degree distribution, and the triad census (frequency of different patterns of triangles).
DSmod.fit <- gof(DSmod4,
                 GOF = ~distance + espartners +
                       degree + triadcensus,
                 burnin=1e+5, interval = 1e+5)
summary(DSmod.fit)
##
## Goodness-of-fit for minimum geodesic distance
##
##     obs min   mean max MC p-value
## 1   103  88 102.90 116       1.00
## 2   197 123 169.44 200       0.04
## 3     0   0   8.90  34       0.20
## 4     0   0   0.02   1       1.00
## Inf   0   0  18.74  69       0.80
##
## Goodness-of-fit for edgewise shared partner
##
##       obs min  mean max MC p-value
## esp0    1   0  0.76   3       1.00
## esp1    8   0  4.94  14       0.28
## esp2   10   3 10.28  21       1.00
## esp3    7   7 17.68  34       0.02
## esp4   15   8 20.21  32       0.24
## esp5   11   7 16.77  25       0.32
## esp6    9   4 12.83  22       0.48
## esp7   14   3  7.73  16       0.12
## esp8   14   0  5.08  13       0.00
## esp9    5   0  2.77   8       0.32
## esp10   4   0  1.74   7       0.28
## esp11   1   0  1.19   3       1.00
## esp12   1   0  0.61   4       0.88
## esp13   2   0  0.13   1       0.00
## esp14   1   0  0.17   2       0.32
## esp15   0   0  0.01   1       1.00
##
## Goodness-of-fit for degree
##
##    obs min mean max MC p-value
## 0    0   0 0.79   3       0.80
## 1    1   0 0.61   3       0.96
## 2    2   0 1.07   4       0.56
## 3    3   0 1.02   4       0.14
## 4    2   0 1.22   4       0.66
## 5    1   0 1.70   6       0.98
## 6    2   0 1.87   5       1.00
## 7    2   0 2.47   7       1.00
## 8    0   0 2.76   7       0.14
## 9    1   0 2.57   8       0.54
## 10   3   0 2.25   6       0.84
## 11   2   0 2.15   5       1.00
## 12   1   0 1.45   5       1.00
## 13   1   0 1.05   4       1.00
## 14   2   0 0.62   3       0.20
## 15   1   0 0.22   2       0.40
## 16   0   0 0.13   1       1.00
## 17   0   0 0.05   1       1.00
## 18   0   0 0.08   1       1.00
## 19   0   0 0.11   1       1.00
## 20   0   0 0.24   1       1.00
## 21   0   0 0.22   1       1.00
## 22   0   0 0.24   1       1.00
## 23   0   0 0.09   1       1.00
## 24   1   0 0.02   1       0.04
##
## Goodness-of-fit for triad census
##
##   obs min mean max MC p-value
## 0 832 616  756 911       0.20
## 1 759 787  881 944       0.00
## 2 517 407  502 625       0.72
## 3 192 114  161 221       0.18

The goodness-of-fit object stores the results of those comparisons. If the model is doing an adequate or good job of describing the observed network, then we would expect to see that the simulated networks look like the observed network.
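The MC p-value column in the output above can be reproduced in spirit with a few lines of base R. The sketch below uses a small hypothetical vector of simulated values, `sim`, for a single statistic, together with the observed count of dyads at geodesic distance 2. The two-sided convention shown (twice the smaller tail proportion, capped at 1) is an assumption made for illustration and should not be read as a statement of how gof() is implemented.

```r
# Hypothetical simulated values of one statistic (e.g., dyads at distance 2)
sim <- c(150, 160, 165, 170, 172, 175, 180, 185, 190, 200)
obs <- 197   # observed value, taken from the distance-2 row above

# Share of simulations at or below, and at or above, the observed value
p_low  <- mean(sim <= obs)
p_high <- mean(sim >= obs)

# Two-sided empirical p-value, capped at 1 (an ASSUMED convention)
mc_p <- min(1, 2 * min(p_low, p_high))

c(min = min(sim), mean = mean(sim), max = max(sim), mc_p = mc_p)
```

An observed value that sits in the far tail of the simulated distribution yields a small p-value, which signals that the model rarely generates networks with that feature.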
For each possible value of the selected network statistic, the frequency of the observed value is reported alongside the minimum, mean, and maximum values across 100 (by default) randomly simulated networks. The Monte Carlo empirical p-values are also reported; these are the proportion of the simulated values of the statistic that are at least as extreme as the observed value. Thus, small p-values (traditionally <0.05) indicate cases where the model is not able to produce the particular network characteristic (i.e., poor fit). Examination of DSmod.fit suggests that our relatively simple model with seven predictors is doing a fairly good job of capturing the structural patterns in TCdiss. Out of 50 network statistics, only four show poor fit.

In addition to examining the actual goodness-of-fit object, ergm can also easily produce informative plots of the goodness-of-fit. The resulting plot produces one panel for each of the four network statistics. Each panel includes box-plots and 95% empirical confidence intervals (light grey lines) that show the variability of the individual network statistic across the simulated networks. The thick black line indicates the value of the same statistic for the observed network. A good-fitting model will thus have the black line sitting inside the confidence-range bands. In Fig. 11.3 we can see that our final model does a generally good job, but we can also see that our model tends to produce networks that underestimate the number of dyads with geodesics of two, while overestimating the dyads with geodesics of three.

op <- par(mfrow=c(2,2))
plot(DSmod.fit,cex.axis=1.6,cex.label=1.6)
par(op)

11.3.3 Model Diagnostics

More detailed diagnostic information about the model estimation can be produced with a call to the mcmc.diagnostics() function.
This is useful for seeing how the MCMC estimation process is running 'under the hood,' and is particularly important if you run into convergence problems. The diagnostics report includes statistical details for all of the covariates in the model, and also reports information on how the model behaves over time. The plots produced display the MCMC chain over time, and a resulting histogram. Both types of plots should show estimates that are centered around 0 (Fig. 11.4). (Because of its length, only the diagnostics plots are shown here.)

mcmc.diagnostics(DSmod4)

Fig. 11.3 Goodness-of-fit plots for final tobacco control model (panels: minimum geodesic distance, edge-wise shared partners, degree, triad census)

Fig. 11.4 MCMC diagnostics (partial)

11.3.4 Simulating Networks Based on Fit Model

The fitted ERGM can be used to produce one or more simulated networks that can then be examined or analyzed as if they were observed networks. For example, this is what a simulated network based on our final model looks like compared to the observed network. Note that the simulated network will have the same node attributes as the empirical network (Fig. 11.5).

sim4 <- simulate(DSmod4, nsim=1, seed=569)
summary(sim4,print.adj=FALSE)
## Network attributes:
##   vertices = 25
##   directed = FALSE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##   title = IN_Diffusion
##  total edges = 115
##    missing edges = 0
##    non-missing edges = 115
##  density = 0.383
##
## Vertex attributes:
##
##  agency_cat:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    2.00    2.00    3.24    5.00    6.00
##
##  agency_lvl:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    2.00    2.04    3.00    3.00
##
##  lead_agency:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    0.00    0.00    0.04    0.00    1.00
##
##  tob_yrs:
##    numeric valued attribute
##    attribute summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.00    4.50    6.76    9.00   21.00
##   vertex.names:
##    character valued attribute
##    25 valid vertex names
##
## No edge attributes

op <- par(mfrow=c(1,2),mar=c(0,0,2,0))
lvlobs <- TCdiss %v% 'agency_lvl'
plot(TCdiss,usearrows=FALSE,
     vertex.col=lvlobs+1,
     edge.lwd=0.5,edge.col="grey75",
     main="Observed TC network")
lvl4 <- sim4 %v% 'agency_lvl'
plot(sim4,usearrows=FALSE,
     vertex.col=lvl4+1,
     edge.lwd=0.5,edge.col="grey75",
     main="Simulated network - Model 4")
par(op)

Fig. 11.5 Comparison of simulated network to observed tobacco control network

Chapter 12
Dynamic Network Models

What came first–the music or the misery?
Did I listen to the music because I was miserable? Or was I miserable because I listened to the music? Do all those records turn you into a melancholy person? (Nick Hornby, High Fidelity.)

12.1 Introduction

Exponential random graph models, as presented in Chap. 11, allow for sophisticated and powerful modeling of network structures and relationships. Generative models of networks can be built using a wide variety of predictors, including node characteristics, dyad characteristics, local structural characteristics, and even other network relations. Substantive hypotheses can be tested with ERGMs, and estimated models can be explored with the rich simulation and goodness-of-fit tools that are provided by the ergm package.

However, ERGMs are generally limited to cross-sectional network data. Social networks, by their very nature, are dynamic. In particular, network ties are formed, maintained, and sometimes dissolved over time. These dynamic tie processes may be driven by a number of social processes, including characteristics of the actors, dyads, and local network structures. For example, one student may become a friend with another student partly because of the characteristics of the alter (e.g., attractiveness), partly because of their own similarity on some behavioral characteristic (e.g., they both like the same type of music), or because of other local network structures (e.g., they both are already friends with the same other student). This chapter covers stochastic actor-based models for network dynamics that are included in the RSiena package, and which can be used to build models and test hypotheses about networks as they change over time.

12.1.1 Dynamic Networks

Networks can change over time in two fundamental ways. First, networks can grow or shrink over time, leading to changes in network composition. Second, as suggested above, network ties can change among the network members.
The modeling methods discussed in this chapter apply primarily to changes of the second type. (Although RSiena can handle networks that have some changes in network composition, those changes themselves are not modeled.)

© Springer International Publishing Switzerland 2015. D.A. Luke, A User's Guide to Network Analysis in R, Use R!, DOI 10.1007/978-3-319-23883-8_12

One of the challenges and opportunities for modeling network dynamics is that overall network characteristics can often be the result of multiple underlying social mechanisms. For example, homophily is the tendency for people (or other social entities) to associate with others who are similar to them. Social networks tend to exhibit strong homophily, and this pattern has been observed for decades across many areas of social and health sciences. There are at least two social mechanisms that can account for homophily: social selection and social influence. Social selection occurs when an actor selects or forms a new social tie with another actor who is similar to her on some relevant characteristic. Social influence, on the other hand, acts across existing social ties. Social influence occurs when the behavior of one actor is changed to become more similar (or dissimilar) to the behavior of one or more other actors.

Fig. 12.1 Comparison of social selection to social influence (rows: Social Selection, Social Influence; columns: Time 1, Time 2)

Figure 12.1 visually depicts these two mechanisms. The colors designate some node characteristic, such as smoking status. Blue nodes are smokers and green nodes are non-smokers, for example. The top row illustrates social selection: the focal actor (large blue node) starts out unconnected from the network. By time 2, the actor has formed two new ties with other actors who are the same as him regarding smoking status. The second row shows how social influence operates.
Here, the focal actor starts out connected to the network, but differs from his friends on smoking status. At time 2 he has changed his behavior to match that of his friends. Note that the time 2 networks are identical, and show strong homophily. However, we see that this homophily can arise from two quite different mechanisms.

Distinguishing these mechanisms with real network data has important scientific and applied consequences. For example, public health policies could use word-of-mouth communication strategies to disseminate prevention messages across social networks. This policy would likely have greater effect if social influence mechanisms are the primary dynamic, where social ties are already in place and messages are more likely to pass from person to person. In contrast, if social selection is the primary mechanism, then a word-of-mouth campaign may have less influence, because individuals are choosing their new relationships based on behavioral similarity.

Disentangling the effects of social influence and social selection is not possible with most network modeling techniques such as ERGM. This is mainly because ERGMs are generally limited to cross-sectional network data. Accurately assessing dynamic network mechanisms that operate over time requires dynamic modeling techniques such as those in RSiena.

12.1.2 RSiena

SIENA stands for Simulation Investigation for Empirical Network Analysis, and is a set of analytic tools that can be used to model longitudinal network data, according to the stochastic actor-oriented model (SAOM) of Snijders and his colleagues (Snijders et al. 2010). RSiena is the R package that contains the SIENA model estimation functions, as well as a wide variety of supporting tools to plot, diagnose, and examine the estimated models and simulated networks.
The core SAOM is a type of actor or agent-based simulation model – it uses estimation techniques following a Markov process that assumes that future changes in network states (i.e., formation or dissolution of a tie) are based probabilistically on the current state of the complete network (Snijders et al. 2010).

RSiena combines the power of stochastic network modeling with longitudinal analysis. This opens up many analytic possibilities. RSiena can be used to model the evolution of one-mode networks, two-mode networks (see Chap. 9), and the co-evolution of one-mode or two-mode networks with behavior. This last type of model is what allows examination of social influence and social selection processes in the joint evolution of friendship and smoking, for example.

With this power comes a fair amount of complexity. In particular, handling missing data, analyzing longitudinal network data when there are changes in composition (i.e., nodes that enter or leave over time), selecting from hundreds of potential parameter effects to include in the model, and dealing with estimation convergence issues are all challenges that are beyond the ability of this short chapter to handle in any detail. Instead, this chapter presents an introduction to RSiena analysis, using an example longitudinal network dataset that has been constructed to help illustrate a basic approach to dynamic network modeling. This should help readers get started using RSiena, but it is important to consult the many excellent papers, tutorials, and documentation that are available. The manual for RSiena is a good place to start; it is available at http://www.stats.ox.ac.uk/~snijders/siena/siena_r.htm.

12.2 Data Preparation

RSiena requires longitudinal (or panel) network data that have been collected from the same network at two, or preferably more, timepoints.
The Coevolve dataset that is included in UserNetR has been developed to support exploration of a simple co-evolution model. The Coevolve data are in the form of a list of four igraph networks. These are friendship networks among 37 students, measured at four points in time (or waves). These data are based on an actual school-based directed friendship network presented in Valente (2010). The original network was cross-sectional. Here, we have added three additional fictional waves of data that show changes both in tie formation as well as changes in a fictional smoking status variable.

In constructing the Coevolve networks, the following informal change rules were used to create the new waves:

1. At each wave one smoker was randomly changed to non-smoker and three non-smokers were changed to smokers, for a net gain of two smokers. The new smokers were more likely to be network members who were connected to other smokers.
2. At each wave, 10% of the existing ties were randomly deleted. Then, the same number of new ties were formed. This maintains the same overall density over time.
3. When adding new directed ties, the following rules were used:
   (a) Pick somebody who has the same smoking status
   (b) Pick somebody who is popular (i.e., high indegree)
   (c) Reciprocate an existing tie

These informal rules were used to build in dynamics that are somewhat realistic, but also simple enough to detect with a basic RSiena model.

To examine the networks, they can be plotted using igraph. The data are stored as a list object, so they should be extracted first.

library(igraph)
library("UserNetR")
data(Coevolve)
fr_w1 <- Coevolve$fr_w1
fr_w2 <- Coevolve$fr_w2
fr_w3 <- Coevolve$fr_w3
fr_w4 <- Coevolve$fr_w4

Figure 12.2 shows the four waves of friendship data. Node shape conveys gender (circle = female; square = male) and smoking status is conveyed by node color (green = non-smoker; blue = smoker).
The increase in smoking status over time is fairly evident, and smoking status does seem to become more clustered, at least for males.

colors <- c("darkgreen","SkyBlue2")
shapes <- c("circle","square")
coord <- layout.kamada.kawai(fr_w1)
op <- par(mfrow=c(2,2),mar=c(1,1,2,1))
plot(fr_w1,vertex.color=colors[V(fr_w1)$smoke+1],
     vertex.shape=shapes[V(fr_w1)$gender],
     vertex.size=10,main="Wave 1",vertex.label=NA,
     edge.arrow.size=0.5,layout=coord)
plot(fr_w2,vertex.color=colors[V(fr_w2)$smoke+1],
     vertex.shape=shapes[V(fr_w2)$gender],
     vertex.size=10,main="Wave 2",vertex.label=NA,
     edge.arrow.size=0.5,layout=coord)
plot(fr_w3,vertex.color=colors[V(fr_w3)$smoke+1],
     vertex.shape=shapes[V(fr_w3)$gender],
     vertex.size=10,main="Wave 3",vertex.label=NA,
     edge.arrow.size=0.5,layout=coord)
plot(fr_w4,vertex.color=colors[V(fr_w4)$smoke+1],
     vertex.shape=shapes[V(fr_w4)$gender],
     vertex.size=10,main="Wave 4",vertex.label=NA,
     edge.arrow.size=0.5,layout=coord)
par(op)

Table 12.1 presents some basic descriptive statistics of the Coevolve network data. As can be seen, the size and density of the networks remain constant, but the number of smokers increases over time, as does the modularity based on smoking status. Typically, detailed examination of network graphics and descriptive statistics would happen prior to jumping into the dynamic modeling.

Table 12.1 Characteristics of Coevolve networks across four waves

Wave     Size   Density   Avg.InDegree   Smokers   Modularity
Wave 1   37     0.134     4.838          8         0.001
Wave 2   37     0.134     4.838          10        0.044
Wave 3   37     0.134     4.838          12        0.077
Wave 4   37     0.134     4.838          14        0.129

The plots and descriptive statistics suggest that there are some network and behavioral dynamics that can be modeled using RSiena. In the rest of this chapter we will explore a simple co-evolution model of friendship ties and smoking behavior. The outline of the model building process has three main steps: (1) data preparation; (2) model estimation; and (3) model exploration and testing.
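The size, density, and average indegree reported in Table 12.1 are simple functions of the adjacency matrix. The following is a minimal sketch in base R of how such descriptives can be computed. The 4-node matrix `adj` and the helper `describe_net` are hypothetical illustrations, not part of UserNetR or the Coevolve data.

```r
# Hypothetical 4-node directed network (NOT the Coevolve data);
# rows send ties, columns receive them
adj <- matrix(c(0, 1, 1, 0,
                0, 0, 1, 0,
                1, 0, 0, 0,
                0, 0, 1, 0),
              nrow = 4, byrow = TRUE)

# Hypothetical helper: basic descriptives of a binary directed network
describe_net <- function(m) {
  n <- nrow(m)
  ties <- sum(m)                        # number of directed ties
  c(size = n,
    density = ties / (n * (n - 1)),     # self-ties are not possible
    avg_indegree = mean(colSums(m)))    # column sums are the indegrees
}

stats <- describe_net(adj)
print(round(stats, 3))
```

Because every directed tie adds exactly one to some node's indegree, the average indegree is just the number of ties divided by the number of nodes. This is why a constant tie count across waves yields both a constant density and a constant average indegree in Table 12.1.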
Before starting, make sure to download and install the RSiena package. Like most R packages, it is available through the CRAN repository. However, the RSiena developers make newer versions of the package available at their website: http://www.stats.ox.ac.uk/~snijders/siena. This version, called RSienaTest, is the most up to date version and is used here. (For example, at the time of writing this chapter RSiena from CRAN was version 1.1-232, while RSienaTest was version 1.1-284.)

Fig. 12.2 Changes over time in Coevolve networks (panels: Wave 1, Wave 2, Wave 3, Wave 4)

RSiena does not use the traditional method of specifying a statistical formula for a modeling function, the way that ergm, lm, or in fact most other R statistical modeling procedures do. Instead, the modeling function (called siena07) is applied to a set of RSiena objects, which must minimally include a data object (containing all network and covariate data), an effects object (containing all of the parameter effects to be included in the model), and an algorithm object (which controls most of the modeling options). The first step, then, is packaging the network data in a way that RSiena can understand.

RSiena can handle six different types of variables. A network variable is the basic dependent variable in an RSiena model, and can be a one-mode or two-mode network. A behavior variable is another type of dependent variable. It is a node characteristic that changes over time, the evolution of which may be considered in a co-evolutionary model. In our example dataset, smoking status will be handled as a behavior variable. Then there are four types of variables that are all handled as covariates. A coCovar is a constant node attribute that does not change over time (e.g., gender). A varCovar, on the other hand, is an attribute that does change over time.
(Note that a behavior variable is a type of varying covariate, but one that is being treated as a dependent variable.) Similar to ergm, RSiena can also handle dyadic covariates. A coDyadCovar is a constant dyadic covariate (for example, a kinship relationship), while varDyadCovar is a dyadic covariate that changes over time.

For our Coevolve data, we have one dependent network variable (the friendship ties), one constant covariate (gender), and one varying covariate that will be handled as a behavior dependent variable (smoking status). RSiena cannot handle igraph or statnet data directly; it expects data in the form of raw arrays, matrices, or vectors. So, some of the data management approaches covered in Chap. 3 will be useful here. The first step is to transform the data into raw sociomatrices.

library(RSienaTest)
matw1 <- as.matrix(get.adjacency(fr_w1))
matw2 <- as.matrix(get.adjacency(fr_w2))
matw3 <- as.matrix(get.adjacency(fr_w3))
matw4 <- as.matrix(get.adjacency(fr_w4))
matw1[1:8,1:8]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    0    0    0    0    0    0    1    1
## [2,]    0    0    0    0    0    0    1    0
## [3,]    0    1    0    0    0    0    0    0
## [4,]    0    0    0    0    0    0    1    0
## [5,]    0    0    0    0    0    0    0    0
## [6,]    0    0    0    0    0    0    1    0
## [7,]    1    0    0    0    0    1    0    1
## [8,]    1    0    0    0    0    0    1    0

Then, a dependent variable object is created with sienaDependent. This expects a stacked array, with each slice of the array corresponding to one of the network waves. By default, sienaDependent expects data in the form of a sparse matrix (from the Matrix package). Here, the data are simple full matrices, so the sparse option must be set to false.

fr4wav <- sienaDependent(array(c(matw1,matw2,matw3,matw4),
                         dim=c(37,37,4)),sparse=FALSE)
class(fr4wav)
## [1] "sienaDependent"
fr4wav
## Type         oneMode
## Observations 4
## Nodeset      Actors (37 elements)

The only problem with this approach is that these sociomatrices will get quite large for larger networks. As discussed in Chap. 3, edgelists are preferable for handling large networks.
RSiena cannot natively handle edgelists, so they must be transformed into sparse matrices as implemented by the Matrix package. The following code produces the same friendship RSiena dependent variable, but via edgelists instead of sociomatrices.

library(Matrix)
w1 <- cbind(get.edgelist(fr_w1), 1)
w2 <- cbind(get.edgelist(fr_w2), 1)
w3 <- cbind(get.edgelist(fr_w3), 1)
w4 <- cbind(get.edgelist(fr_w4), 1)
w1s <- spMatrix(37, 37, w1[,1], w1[,2], w1[,3])
w2s <- spMatrix(37, 37, w2[,1], w2[,2], w2[,3])
w3s <- spMatrix(37, 37, w3[,1], w3[,2], w3[,3])
w4s <- spMatrix(37, 37, w4[,1], w4[,2], w4[,3])
fr4wav2 <- sienaDependent(list(w1s,w2s,w3s,w4s))
fr4wav2
## Type         oneMode
## Observations 4
## Nodeset      Actors (37 elements)

Once the RSiena dependent variable is constructed, other data objects such as covariates can be created. Gender is stored in the igraph networks as a vertex characteristic, so it is easy to extract that to create the coCovar object. Gender is coded 1 for females and 2 for males. The default for RSiena is to center any covariate. This does not make much sense here, so it is turned off.

gender_vect <- V(fr_w1)$gender
table(gender_vect)
## gender_vect
##  1  2
## 22 15
gender <- coCovar(gender_vect,centered=FALSE)
gender
##  [1] 1 1 1 1 2 1 1 1 1 2 1 2 1 2 2 2 1 1 1 1 1 2 1
## [24] 2 2 2 1 2 2 1 2 1 1 1 1 2 2
## attr(,"class")
## [1] "coCovar"
## attr(,"centered")
## [1] FALSE
## attr(,"nodeSet")
## [1] "Actors"

Smoking status is our behavior variable for the co-evolution model. RSiena expects an N×W matrix, with N (actor) rows and W (wave) columns. Again, we extract this information from the igraph objects. Because a behavior variable is a type of dependent variable, the sienaDependent function is used, but we specify that this is a behavior variable.
smoke <- array(c(V(fr_w1)$smoke,V(fr_w2)$smoke,
                 V(fr_w3)$smoke,V(fr_w4)$smoke),dim=c(37,4))
smokebeh <- sienaDependent(smoke,type = "behavior")
smokebeh
## Type         behavior
## Observations 4
## Nodeset      Actors (37 elements)

Finally, all the individual variable objects are packaged together into a single RSiena data object.

friend <- sienaDataCreate(fr4wav,smokebeh,gender)
friend
## Dependent variables:  fr4wav, smokebeh
## Number of observations: 4
##
## Nodeset          Actors
## Number of nodes  37
##
## Dependent variable fr4wav
## Type               oneMode
## Observations       4
## Nodeset            Actors
## Densities          0.13 0.13 0.13 0.13
##
## Dependent variable smokebeh
## Type               behavior
## Observations       4
## Nodeset            Actors
## Range              0 - 1
##
## Constant covariates:  gender

Once this object is created, a basic descriptive report can be generated that provides important information to examine prior to modeling.

print01Report(friend,modelname = 'Coevolve Example' )

The results of this command are not directed to the console or saved in an R object. Instead an external text file is created in the working directory, named in this case ‘Coevolve Example.out’. It can be viewed using any text editor such as Notepad. The report contains a variety of information about the RSiena data, including information about any missing data, the degree distribution pattern observed across the waves of network data, and summary information about each dependent variable and covariate.

A critical piece of information is found near the bottom of the report under the heading ‘Change in Networks.’ The Jaccard index, which is a measure of similarity, is calculated on the tie variables for each consecutive pair of waves. Although there needs to be enough change between the observation periods to allow for modeling, too much change would imply that the assumption of gradual change is not tenable. The authors of RSiena suggest that Jaccard values should be higher than 0.3 (Snijders et al. 2010).
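To make the Jaccard index concrete, the following base R sketch computes it for a pair of waves as the number of ties present in both waves divided by the number of ties present in at least one of them. The function name `jaccard` and the two 4-actor matrices are hypothetical illustrations; they are not RSiena functions or the Coevolve data.

```r
# Hypothetical helper: Jaccard index for two binary adjacency matrices
# of the same dimension (ties in both waves / ties in at least one wave)
jaccard <- function(m1, m2) {
  both   <- sum(m1 == 1 & m2 == 1)
  either <- sum(m1 == 1 | m2 == 1)
  both / either
}

# Two made-up waves for 4 actors (NOT the Coevolve data)
wave_a <- matrix(c(0, 1, 1, 0,
                   0, 0, 1, 0,
                   1, 0, 0, 0,
                   0, 0, 1, 0), nrow = 4, byrow = TRUE)
wave_b <- matrix(c(0, 1, 0, 0,
                   0, 0, 1, 0,
                   1, 0, 0, 1,
                   0, 0, 1, 0), nrow = 4, byrow = TRUE)

# 4 ties persist, 1 is dissolved, 1 is newly formed
j_ab <- jaccard(wave_a, wave_b)
print(round(j_ab, 3))
```

Assuming the sociomatrices created earlier in this section are available, the same helper could be applied to consecutive waves, for example jaccard(matw1, matw2), as a cross-check on the values printed in the report.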
For the Coevolve data, we see Jaccard values of greater than 0.8, suggesting a fairly high level of stability over time.

12.3 Model Specification and Estimation

12.3.1 Specification of Model Effects

Once the RSiena data have been put into the correct format, model specification and building can proceed. In Chap. 11 we saw that stochastic network models can have a wide variety of parameters that test hypotheses about node attributes, similarity of node attributes between dyads, tie attributes, and local network structural properties. Longitudinal network models allow for an even larger set of potential parameters, and choosing the theoretically appropriate set of parameters can be challenging. In this chapter we will explore only a very small set of parameters. For more detailed guidance, the RSiena documentation should be read closely (especially Chap. 5 – Model specification).

The first step for model specification is to create an effects specification object with a minimal set of parameters.

frndeff <- getEffects( friend )
frndeff
##   name     effectName
## 1 fr4wav   constant fr4wav rate (period 1)
## 2 fr4wav   constant fr4wav rate (period 2)
## 3 fr4wav   constant fr4wav rate (period 3)
## 4 fr4wav   outdegree (density)
## 5 fr4wav   reciprocity
## 6 smokebeh rate smokebeh (period 1)
## 7 smokebeh rate smokebeh (period 2)
## 8 smokebeh rate smokebeh (period 3)
## 9 smokebeh behavior smokebeh linear shape
##   include fix   test  initialValue parm
## 1 TRUE    FALSE FALSE  2.004       0
## 2 TRUE    FALSE FALSE  2.004       0
## 3 TRUE    FALSE FALSE  2.004       0
## 4 TRUE    FALSE FALSE -0.807       0
## 5 TRUE    FALSE FALSE  0.000       0
## 6 TRUE    FALSE FALSE  0.208       0
## 7 TRUE    FALSE FALSE  0.208       0
## 8 TRUE    FALSE FALSE  0.208       0
## 9 TRUE    FALSE FALSE  0.562       0

This is a basic model that only includes a small number of default effects, notably the outdegree and reciprocity effects. Typically we will add other effects that we are interested in testing or exploring.
What are those effects and how are they specified? All the effects that are available given the structure of the friend data set can be seen using the effectsDocumentation function. Here is one place (of many) where RSiena shows its non-R roots. Instead of sending the results to the R console, or creating a new R object, this function creates an HTML file that can then be opened in your browser.

effectsDocumentation(frndeff)

The effects documentation report is a type of ‘effects dictionary’ that provides all the information necessary for selecting appropriate model parameters. However, the report can be daunting – for this simple example dataset with four timepoints, one covariate (gender) and one behavior dependent variable (smokebeh), there are over 400 possible effects!

Table 12.2 presents a set of effects that can be tested, along with the information from the effects documentation that is used to correctly specify the effects in the model estimation function. To understand this information, please also refer to the ‘frndeff.html’ file that is produced by the effectsDocumentation() function. Because we are exploring a co-evolutionary model, we will examine effects on the likelihood of tie formations (the fr4wav dependent variable), and the effects on changes in behavior (the smokebeh dependent variable). So, there are two broad types of effects, and they are distinguished by the ‘Name’ column in the effects documentation report (‘ED name’ in Table 12.2). The specific effect that will be included is given by the term in the ‘shortName’ column. Many of these effects will need to refer to a specific covariate or dependent variable; this is typically specified by the term in the ‘interaction1’ column. To help navigate this information for the example, Table 12.2 also includes the particular row number from the complete effects documentation report, but this will be accurate only if the data have been prepared exactly the same way as in this chapter.

Table 12.2 RSiena effects for Coevolve data

Effect                      Type         ED name    ED shortName   ED interaction1   ED row
1 - Gender homophily        Selection    fr4wav     sameX          gender            139
2 - Ego smoking effect      Selection    fr4wav     egoX           smokebeh          200
3 - Alter smoking effect    Selection    fr4wav     altX           smokebeh          197
4 - Smoking homophily       Selection    fr4wav     sameX          smokebeh          212
5 - Avg. alter influence    Influence    smokebeh   avSim          fr4wav            321
6 - Total alter influence   Influence    smokebeh   totSim         fr4wav            324
7 - Reciprocity             Structural   fr4wav     recip          NA                14
8 - Transitivity            Structural   fr4wav     transTrip      NA                17

Eight different hypotheses will be tested with the eight effects listed in the table. First, based on the pattern evident in Fig. 12.2, there appears to be a strong gender homophily effect, where students are much more likely to be friends with other students of the same gender. This is a type of social selection effect: we hypothesize that the likelihood of an ego forming a new friendship tie is higher with an alter who has the same gender. The RSiena term that will be used is ‘sameX.’

Next, two different social selection main effects are hypothesized. The first is the hypothesis that based on the ego’s smoking status, he or she is more or less likely to form a friendship tie (egoX). Conversely, the likelihood of forming a friendship tie may be related to the smoking status of alters (altX). This may occur, for example, in schools that have a strong pro-smoking culture. I may want to be a friend with somebody because they smoke and smoking is ‘cool,’ regardless of whether I smoke or not. Note that we have no reason to assume either of these hypotheses is true, given how these data were constructed.

Finally, the last social selection hypothesis is another homophily effect, but this time in reference to smoking.
The hypothesis is that new friendship ties are more likely to be formed between two students who have the same smoking status. Note that this hypothesis uses the same shortName (sameX) as the gender homophily effect. In this case the interaction1 term is used to indicate that the homophily relationship refers to smoking (instead of gender).

The next two hypotheses are about potential social influences on behavior. They are similar to each other in that they focus on whether changes in smoking status can be explained by patterns of smoking of the ego’s friends (to whom the ego is tied). The first hypothesis is that the likelihood of changing behavior is related to the average similarity of smoking status across all tied alters (avSim). The second hypothesis is similar, but assumes that the influence is based on the total similarity across alters, instead of the average (totSim). (Total similarity captures the effect of having a larger personal social network that influences behavior.)

The last two hypotheses are local structural effects. Here we will model the tendency for friendship ties to be reciprocated (recip) as well as the general pattern of transitivity (transTrip).

These effects are specified in an RSiena effects object. Typically, effects are added one at a time using the includeEffects() function. The following code adds the effects that correspond to the eight hypotheses just described.
frndeff <- getEffects( friend )
frndeff <- includeEffects(frndeff,sameX,
             interaction1="gender",name="fr4wav")
##   effectName  include fix   test  initialValue
## 1 same gender TRUE    FALSE FALSE 0
##   parm
## 1 0
frndeff <- includeEffects(frndeff,egoX,
             interaction1="smokebeh",name="fr4wav")
##   effectName   include fix   test  initialValue
## 1 smokebeh ego TRUE    FALSE FALSE 0
##   parm
## 1 0
frndeff <- includeEffects(frndeff,altX,
             interaction1="smokebeh",name="fr4wav")
##   effectName     include fix   test  initialValue
## 1 smokebeh alter TRUE    FALSE FALSE 0
##   parm
## 1 0
frndeff <- includeEffects(frndeff,sameX,
             interaction1="smokebeh",name="fr4wav")
##   effectName    include fix   test  initialValue
## 1 same smokebeh TRUE    FALSE FALSE 0
##   parm
## 1 0
frndeff <- includeEffects(frndeff,avSim,
             interaction1="fr4wav",name="smokebeh")
##   effectName                           include
## 1 behavior smokebeh average similarity TRUE
##   fix   test  initialValue parm
## 1 FALSE FALSE 0            0
frndeff <- includeEffects(frndeff,totSim,
             interaction1="fr4wav",name="smokebeh")
##   effectName                         include
## 1 behavior smokebeh total similarity TRUE
##   fix   test  initialValue parm
## 1 FALSE FALSE 0            0
frndeff <- includeEffects(frndeff,recip,transTrip,
             name="fr4wav")
##   effectName          include fix   test
## 1 reciprocity         TRUE    FALSE FALSE
## 2 transitive triplets TRUE    FALSE FALSE
##   initialValue parm
## 1 0            0
## 2 0            0
frndeff
##    name     effectName
## 1  fr4wav   constant fr4wav rate (period 1)
## 2  fr4wav   constant fr4wav rate (period 2)
## 3  fr4wav   constant fr4wav rate (period 3)
## 4  fr4wav   outdegree (density)
## 5  fr4wav   reciprocity
## 6  fr4wav   transitive triplets
## 7  fr4wav   same gender
## 8  fr4wav   smokebeh alter
## 9  fr4wav   smokebeh ego
## 10 fr4wav   same smokebeh
## 11 smokebeh rate smokebeh (period 1)
## 12 smokebeh rate smokebeh (period 2)
## 13 smokebeh rate smokebeh (period 3)
## 14 smokebeh behavior smokebeh linear shape
## 15 smokebeh behavior smokebeh average similarity
## 16 smokebeh behavior smokebeh total similarity
##    include fix   test  initialValue parm
## 1  TRUE    FALSE FALSE  2.004       0
## 2  TRUE    FALSE FALSE  2.004       0
## 3  TRUE    FALSE FALSE  2.004       0
## 4  TRUE    FALSE FALSE -0.807       0
## 5  TRUE    FALSE FALSE  0.000       0
## 6  TRUE    FALSE FALSE  0.000       0
## 7  TRUE    FALSE FALSE  0.000       0
## 8  TRUE    FALSE FALSE  0.000       0
## 9  TRUE    FALSE FALSE  0.000       0
## 10 TRUE    FALSE FALSE  0.000       0
## 11 TRUE    FALSE FALSE  0.208       0
## 12 TRUE    FALSE FALSE  0.208       0
## 13 TRUE    FALSE FALSE  0.208       0
## 14 TRUE    FALSE FALSE  0.562       0
## 15 TRUE    FALSE FALSE  0.000       0
## 16 TRUE    FALSE FALSE  0.000       0

12.3.2 Model Estimation

The last step before estimating the model is to set up any algorithm options that are required. In this case, other than specifying a title string for the model, the algorithm object will include all default options.

myalgorithm <- sienaAlgorithmCreate(projname='coevolve')

An RSiena model is estimated using the siena07() function, by passing it the algorithm, data, and effects objects that were already created. Other options can be specified as well to control the estimation process. In the following example, the batch and verbose options are used to simplify the output. When run interactively, you may want to change these settings (for example, batch=FALSE to display the progress window, or verbose=TRUE for more detailed console output). The last three options are used to speed up the estimation process by using multiple cores of the computer’s CPU, if available. Here, three cores, out of four, are used. Be careful about using all the CPU cores, in case other processes are being run by the operating system. See the help file for more technical details.

set.seed(999)
RSmod1 <- siena07( myalgorithm, data = friend,
                   effects = frndeff,batch=TRUE,
                   verbose=FALSE,useCluster=TRUE,
                   initC=TRUE,nbrNodes=3)

12.4 Model Exploration

12.4.1 Model Interpretation

As is typical with any R estimation technique, the results of the modeling can be explored by viewing the contents of the model fit object. Either list the RSiena fit object directly, or use the summary() command for a more detailed output.
(The formatting of the output listed here has been edited for length and legibility.)

summary(RSmod1)
## Estimates, standard errors and convergence t-ratios
##
## Network Dynamics
##  1. rate constant fr4wav rate (period 1)
##  2. rate constant fr4wav rate (period 2)
##  3. rate constant fr4wav rate (period 3)
##  4. eval outdegree (density)
##  5. eval reciprocity
##  6. eval transitive triplets
##  7. eval same gender
##  8. eval smokebeh alter
##  9. eval smokebeh ego
## 10. eval same smokebeh
##
##      Estimate    Standard    Convergence
##                  Error       t-ratio
##
##  1.   1.1572   ( 0.2073 )    -0.0491
##  2.   1.1410   ( 0.1990 )    -0.0114
##  3.   1.1366   ( 0.1948 )    -0.0273
##  4.  -2.9556   ( 0.3949 )     0.0141
##  5.   0.7990   ( 0.2539 )    -0.0805
##  6.   0.0860   ( 0.0785 )    -0.0508
##  7.   1.1429   ( 0.3174 )    -0.1147
##  8.   0.6885   ( 0.3792 )    -0.0122
##  9.  -0.0847   ( 0.2878 )     0.0165
## 10.   1.0975   ( 0.4373 )    -0.1392
##
## Behavior Dynamics
## 11. rate rate smokebeh (period 1)
## 12. rate rate smokebeh (period 2)
## 13. rate rate smokebeh (period 3)
## 14. eval behavior smokebeh linear shape
## 15. eval behavior smokebeh average similarity
## 16. eval behavior smokebeh total similarity
##
##      Estimate    Standard    Convergence
##                  Error       t-ratio
##
## 11.   0.3028   (  0.1684 )    0.0082
## 12.   0.3485   (  0.1963 )    0.0528
## 13.   0.3363   (  0.1949 )    0.0406
## 14.   4.0791   (  8.7065 )   -0.0218
## 15.  18.7278   ( 53.9223 )   -0.1372
## 16.  -1.4614   (  6.9254 )    0.0824
##
## Total of 2340 iteration steps.
##

For a coevolution model, the parameter estimates are presented in two sections. The network dynamics section contains the estimates pertaining to tie formation (i.e., the fr4wav dependent variable). Conversely, the behavior dynamics section contains estimates related to changes in the network member behavior variable (here, smoking status).

The convergence t-ratios are not traditional t-statistics assessing the size of the parameter estimates.
Instead, they represent tests of the lack of convergence for each estimate, so small values indicate good convergence. The RSiena manual suggests that absolute values less than 0.10 indicate excellent convergence, and absolute values less than 0.15 are reasonable. Here we see that most of the parameters have excellent convergence, while a few (same gender and same smokebeh in the network dynamics section, and average similarity in the behavior dynamics section) show only reasonable convergence.

The rate estimates correspond to the estimated number of opportunities for change per actor for each period (where period 1 is the time from wave 1 to wave 2). The eval estimates are the weights in the network evaluation function. The exact calculations needed for a precise interpretation of these effects are complicated; see Snijders et al. (2010) for more details. However, they represent the relative ‘attractiveness’ of a particular network state for each actor. For example, the positive estimate for same gender (1.14) indicates that actors are more likely to form new ties (or maintain existing ties) with other actors who have the same gender as them.

The significance of these evaluation function weights can be determined by dividing the estimates by their standard errors. These ratios are approximately distributed as t-statistics, so absolute values greater than 2 are significant at the 0.05 significance level. For our example, we can see that friendship formation is more likely with alters who have the same gender and same smoking status as the ego. Conversely, it appears that the main effects of ego smoking and alter smoking are not significant predictors of tie formation. Outdegree and reciprocity are significant structural predictors, but transitivity is not.

The behavior dynamics results suggest that smoking status is increasing over time (linear shape). The large positive estimate for average similarity would normally suggest that changes in smoking status are driven by the overall similarity of the smoking status for all tied alters.
However, this estimate has a very large standard error, so the evaluation function weight is not being estimated with much precision. This is not too surprising, because detecting changes in actor behavior is harder than detecting changes in tie formation. There is greater power to detect tie changes because of the greater number of potential ties (on the order of the square of the size of the network for each period). Conversely, the number of behavior changes is on the order of the simple number of network members for each period. This is especially true for this example, where the network is relatively small and only a few changes in smoking status happened at each wave. So, although we know these smoking changes are ‘real’, the analysis does not have the required power to detect the effects.

Typically, we will make adjustments and build subsequent models based on what we learned from earlier models. For this example, we will drop a few non-significant predictors. To do this we simply update the effects object with either new predictors, or by listing the predictors that we would like to drop. A dropped predictor is indicated by the ‘include = FALSE’ option. Here we drop the total similarity predictor for the behavior variable as well as the transitivity predictor.
frndeff2 <- includeEffects(frndeff,totSim,
              interaction1="fr4wav",
              name="smokebeh",
              include=FALSE)
## [1] effectName   include      fix
## [4] test         initialValue parm
## <0 rows> (or 0-length row.names)
frndeff2 <- includeEffects(frndeff2,transTrip,
              name="fr4wav",
              include=FALSE)
## [1] effectName   include      fix
## [4] test         initialValue parm
## <0 rows> (or 0-length row.names)
frndeff2
##    name     effectName
## 1  fr4wav   constant fr4wav rate (period 1)
## 2  fr4wav   constant fr4wav rate (period 2)
## 3  fr4wav   constant fr4wav rate (period 3)
## 4  fr4wav   outdegree (density)
## 5  fr4wav   reciprocity
## 6  fr4wav   same gender
## 7  fr4wav   smokebeh alter
## 8  fr4wav   smokebeh ego
## 9  fr4wav   same smokebeh
## 10 smokebeh rate smokebeh (period 1)
## 11 smokebeh rate smokebeh (period 2)
## 12 smokebeh rate smokebeh (period 3)
## 13 smokebeh behavior smokebeh linear shape
## 14 smokebeh behavior smokebeh average similarity
##    include fix   test  initialValue parm
## 1  TRUE    FALSE FALSE  2.004       0
## 2  TRUE    FALSE FALSE  2.004       0
## 3  TRUE    FALSE FALSE  2.004       0
## 4  TRUE    FALSE FALSE -0.807       0
## 5  TRUE    FALSE FALSE  0.000       0
## 6  TRUE    FALSE FALSE  0.000       0
## 7  TRUE    FALSE FALSE  0.000       0
## 8  TRUE    FALSE FALSE  0.000       0
## 9  TRUE    FALSE FALSE  0.000       0
## 10 TRUE    FALSE FALSE  0.208       0
## 11 TRUE    FALSE FALSE  0.208       0
## 12 TRUE    FALSE FALSE  0.208       0
## 13 TRUE    FALSE FALSE  0.562       0
## 14 TRUE    FALSE FALSE  0.000       0
myalgorithm <- sienaAlgorithmCreate(projname='coevol2')

Now the next model can be estimated. RSiena allows us to use the estimates obtained from a previous model as the starting values for the new model estimation. In this case we specify that the starting values should be based on the estimates contained in RSmod1, using the prevAns option. This is also sometimes helpful for improving the convergence of the individual weight estimates. Finally, in this second model, we use the returnDeps option.
This stores some required auxiliary statistics on the simulated dependent variables for use in a subsequent exploration of goodness-of-fit.

set.seed(999)
RSmod2 <- siena07(myalgorithm,data = friend,
                  effects = frndeff2,
                  prevAns=RSmod1,batch=TRUE,
                  verbose=FALSE,useCluster=TRUE,
                  initC=TRUE,nbrNodes=3,
                  returnDeps=TRUE)
summary(RSmod2)
## Estimates, standard errors and convergence t-ratios
##
## Network Dynamics
##  1. rate constant fr4wav rate (period 1)
##  2. rate constant fr4wav rate (period 2)
##  3. rate constant fr4wav rate (period 3)
##  4. eval outdegree (density)
##  5. eval reciprocity
##  6. eval same gender
##  7. eval smokebeh alter
##  8. eval smokebeh ego
##  9. eval same smokebeh
##
##      Estimate    Standard    Convergence
##                  Error       t-ratio
##
##  1.   1.1392   ( 0.223 )     -0.0454
##  2.   1.1321   ( 0.213 )      0.0056
##  3.   1.1260   ( 0.200 )     -0.0086
##  4.  -3.0585   ( 0.422 )      0.0695
##  5.   0.8387   ( 0.247 )      0.0807
##  6.   1.4113   ( 0.303 )      0.0423
##  7.   0.7123   ( 0.417 )      0.0127
##  8.  -0.0733   ( 0.307 )      0.0721
##  9.   1.2041   ( 0.486 )      0.0303
##
## Behavior Dynamics
## 10. rate rate smokebeh (period 1)
## 11. rate rate smokebeh (period 2)
## 12. rate rate smokebeh (period 3)
## 13. eval behavior smokebeh linear shape
## 14. eval behavior smokebeh average similarity
##
##      Estimate    Standard     Convergence
##                  Error        t-ratio
##
## 10.   0.3076   (   0.170 )     0.0508
## 11.   0.3680   (   0.249 )     0.0212
## 12.   0.3493   (   0.181 )     0.0178
## 13.   6.6261   (  37.771 )    -0.0061
## 14.  18.6088   ( 111.033 )     0.0270
##
## Total of 2140 iteration steps.
##

By dropping some non-significant variables and starting with previously estimated weight estimates, we have improved the convergence; now all effects have excellent convergence. The similarity effect on smoking behavior is still positive, but also still with a large standard error. A larger network, or more waves of data, would likely be required to improve the standard error.
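The significance rule described in this section (divide each evaluation function weight by its standard error and compare the absolute value to 2) can be sketched in base R. In this illustration the estimates and standard errors are copied by hand from the RSmod2 summary; the short labels and the object names `est`, `se`, and `result` are hypothetical and are not produced by RSiena. Rate parameters are left out because the rule is stated for evaluation function weights.

```r
# Evaluation-function estimates and standard errors, typed in by hand
# from the printed RSmod2 summary (labels are illustrative only)
est <- c(outdegree      = -3.0585,
         reciprocity    =  0.8387,
         same_gender    =  1.4113,
         smoke_alter    =  0.7123,
         smoke_ego      = -0.0733,
         same_smoke     =  1.2041,
         linear_shape   =  6.6261,
         avg_similarity = 18.6088)
se <- c(0.422, 0.247, 0.303, 0.417, 0.307, 0.486, 37.771, 111.033)

# Ratio of estimate to standard error, flagged when |t| > 2
t_ratio <- est / se
result <- data.frame(estimate    = est,
                     se          = se,
                     t           = round(t_ratio, 2),
                     significant = abs(t_ratio) > 2)
print(result)
```

Under this rule the outdegree, reciprocity, same gender, and same smoking status weights are flagged as significant, while the two smoking main effects and both behavior weights are not, which matches the interpretation given in the text.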
12.4.2 Goodness-of-Fit

Similar to the ergm package, RSiena includes graphical and diagnostic facilities for exploring the goodness-of-fit of an estimated model to the observed network. As in ergm, goodness-of-fit statistics can be calculated on properties of the simulated networks that are not formally included as predictors in the fitted models. This allows us to assess the extent to which the model can produce simulated networks that ‘look like’ the observed network.

Goodness-of-fit is assessed in RSiena using the sienaGOF function. A network descriptive statistic is specified and is calculated across the simulated networks at the end of each period. These values are then compared to the observed network using the Mahalanobis distance. In the current version of the RSienaTest package only a small number of network statistics are built into the GOF function. One of these is the indegree distribution. In the following code, we also constrain the possible values of indegree to the range 1–10 (using the levls option). This matches the range of indegrees in the observed friendship network at wave 4.

table(degree(fr_w4,mode="in"))
##
##  1  2  3  4  5  6  7  8  9 10
##  2  3  7  6  5  4  6  2  1  1

gofi <- sienaGOF(RSmod2, IndegreeDistribution,
                 levls=1:10,verbose=FALSE,
                 join=TRUE, varName="fr4wav")

Once the GOF object is created, it can be used to plot the goodness-of-fit information. In Fig. 12.3, violin plots are used to display the variability of the descriptive statistic across the simulated networks, in this case the indegrees. The dashed grey lines represent the empirical 95 % confidence interval. The red circles are the values of the descriptive statistic for the observed network. When the circles fit inside the confidence intervals we interpret that as evidence of good fit. In this case, our second model does an excellent job of producing simulated networks that have the same or similar indegree distributions.
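The Mahalanobis comparison mentioned above can be illustrated with base R alone. The sketch below is not the sienaGOF implementation: the simulated statistics are made-up Poisson counts, the names sim_stats, obs_stats, and p_value are hypothetical, and the Monte Carlo p-value (the share of simulated networks at least as far from the simulation mean as the observed network) is our assumption about the general idea, which may differ in detail from what RSiena computes.

```r
set.seed(123)
n_sims <- 1000

# Hypothetical statistics: one row per simulated network,
# one column per level of the descriptive statistic.
lam <- c(3, 7, 6, 5, 4)
sim_stats <- matrix(rpois(n_sims * 5, lambda = lam),
                    nrow = n_sims, byrow = TRUE)

# Hypothetical observed values of the same five statistics
obs_stats <- c(3, 7, 6, 5, 4)

# Centre and covariance of the simulated statistics
center <- colMeans(sim_stats)
covm   <- cov(sim_stats)

# Squared Mahalanobis distance of each simulation, and of the
# observed network, from the simulation mean
d_sim <- mahalanobis(sim_stats, center, covm)
d_obs <- mahalanobis(obs_stats, center, covm)

# Share of simulations at least as extreme as the observation
p_value <- mean(d_sim >= d_obs)
p_value
```

Read this way, a large value suggests that the observed network sits well inside the cloud of simulated networks, and a value near zero suggests that it sits outside it.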
plot(gofi)

[Figure: violin plots titled "Goodness of Fit of IndegreeDistribution"; x-axis: indegree 1 to 10; y-axis: Statistic; p: 0.804]
Fig. 12.3 Goodness-of-fit for indegree

Although only a few statistics are currently built into sienaGOF, RSiena supports user-supplied statistic functions. This makes it easy to assess goodness-of-fit with almost any network characteristic of interest. The following example (taken from the sienaGOF-auxiliary help file) shows how to define and then use the Holland and Leinhardt triad census (1978). The triad census gives the frequency distribution of all possible triads in a directed network. The census uses a 3-digit numeric code to designate one of the 16 possible patterns of a triad. The first digit indicates the number of reciprocated ties in the triad, the second digit is the number of one-way ties, and the third digit is the number of empty ties. So, 300 is the code for a triad connected by 3 reciprocal ties, while 003 is the code for a completely unconnected triad. See Wasserman and Faust (1994) for more information.

In the following code, a new TriadCensus function is created. This function uses the existing triad.census function contained in the statnet package. (Note that igraph could also be used to calculate descriptive statistics to be used by sienaGOF.) This wrapper function is then used by sienaGOF.

TriadCensus <- function(i,data,sims,wave,
                        groupName,varName,levls=1:16){
  unloadNamespace("igraph") # to avoid package clashes
  require(sna)
  require(network)
  x <- networkExtraction(i,data,sims,wave,
                         groupName,varName)
  if (network.edgecount(x) <= 0){x <- symmetrize(x)}
  # because else triad.census(x) will lead to an error
  tc <- sna::triad.census(x)[1,levls]
  # names are transferred automatically
  tc
}

The GOF information stored in the fit object can also be directly examined or summarized. The information can be plotted as before.
For this type of auxiliary statistic it is advisable in the plot to center and scale. Here we see that our second model does a pretty good job of recreating the observed pattern of directed triads. It only fails in two cases: 021C (the model overestimates the number of this type of triad with two directed ties), and 120U (the model underestimates this type of triad with one reciprocal tie and two directed ties) (Fig. 12.4).

goftc <- sienaGOF(RSmod2, TriadCensus,
                  varName="fr4wav",
                  verbose=FALSE, join=TRUE)
descriptives.sienaGOF(goftc)
##              003  012  102 021D 021U 021C 111D
## max        12416 6506 4301  203  273  425  454
## perc.upper 12199 6248 4055  183  247  360  431
## mean       11830 5895 3775  152  208  302  390
## median     11832 5900 3774  151  208  301  390
## perc.lower 11474 5538 3513  124  173  252  350
## min        11289 5235 3369  111  153  199  328
## obs        11771 6018 3816  137  213  232  353
##            111U  030T  030C 201 120D 120U 120C
## max         336 100.0 19.00 182 80.0 54.0 85.0
## perc.upper  301  88.0 15.00 157 70.0 46.0 69.0
## mean        262  68.3  8.11 127 57.5 35.9 53.7
## median      263  68.0  8.00 126 58.0 36.0 54.0
## perc.lower  227  51.0  3.00 101 45.0 25.0 41.0
## min         214  39.0  1.00  91 41.0 19.0 35.0
## obs         245  82.0  8.00 119 61.0 49.0 65.0
##            210  300
## max        128 67.0
## perc.upper 118 55.0
## mean       102 43.3
## median     102 43.0
## perc.lower  85 32.0
## min         74 25.0
## obs        105 36.0

plot(goftc, center=TRUE, scale=TRUE)

[Figure: violin plots titled "Goodness of Fit of TriadCensus"; x-axis: the 16 triad types from 003 to 300; y-axis: Statistic (centered and scaled); p: 0.001]
Fig. 12.4 Goodness-of-fit for triad census

12.4.3 Model Simulations

The actual simulated networks can be accessed if the returnDeps option has been set to true when estimating the model. (This can take a lot of memory for larger networks, so the default is false.)
The simulated network information is stored in a nested list object (called sims) within the general fitted model object. The information is organized as an edgelist for each simulated network for each period. The default number of simulated networks is 1,000 (set by the n3 parameter in the sienaAlgorithmCreate function). Within a particular simulation run, one predicted network is created at the end of each period. So, in this example, with four waves of data there will be three simulated networks, one each for the ends of periods one, two, and three. The general structure of the sims information can be seen with the following code. Here, the 500th simulation run is being accessed.

str(RSmod2$sims[[500]])
## List of 1
##  $ Data1:List of 2
##   ..$ fr4wav  :List of 3
##   .. ..$ 1: int [1:188, 1:3] 1 1 2 2 2 2 3 3 3 3...
##   .. ..$ 2: int [1:185, 1:3] 1 1 1 2 2 2 2 2 3 3...
##   .. ..$ 3: int [1:176, 1:3] 1 1 1 2 2 2 2 2 2 2...
##   ..$ smokebeh:List of 3
##   .. ..$ 1: int [1:37] 1 0 0 0 0 1 0 0 0 0 ...
##   .. ..$ 2: int [1:37] 1 1 0 0 0 1 0 0 0 0 ...
##   .. ..$ 3: int [1:37] 1 0 0 1 0 1 0 0 0 0 ...

The actual edgelist can be obtained as follows, again for the 500th run. The first index specifies the run number, the second index is the number of the group (RSiena can do multiple-group estimation), the third index is the number of the dependent variable, and the last index is the period number (or, equivalently, the wave number minus 1). For a coevolution model there are two dependent variables. The first one is the tie variable fr4wav, and the second is the behavior dependent variable smokebeh. This code, then, provides the friendship tie edgelist for the 500th run and the third period (limited to the first 25 cases).
RSmod2$sims[[500]][[1]][[1]][[3]][1:25,]
##       [,1] [,2] [,3]
##  [1,]    1    7    1
##  [2,]    1    8    1
##  [3,]    1   11    1
##  [4,]    2    7    1
##  [5,]    2   17    1
##  [6,]    2   21    1
##  [7,]    2   30    1
##  [8,]    2   32    1
##  [9,]    2   35    1
## [10,]    2   37    1
## [11,]    3   13    1
## [12,]    3   17    1
## [13,]    3   20    1
## [14,]    3   21    1
## [15,]    3   33    1
## [16,]    4   13    1
## [17,]    4   17    1
## [18,]    4   21    1
## [19,]    4   27    1
## [20,]    5   10    1
## [21,]    5   22    1
## [22,]    5   25    1
## [23,]    5   28    1
## [24,]    5   29    1
## [25,]    6    9    1

The simulated smoking status information for the same run and wave is then:

RSmod2$sims[[500]][[1]][[2]][[3]]
##  [1] 1 0 0 1 0 1 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 0
## [24] 1 0 0 1 0 0 0 0 1 1 1 0 0 1

Using the above information, you can access and transform the data into a statnet or igraph network object. For example, the following code uses igraph to create, examine, and plot one of the simulated networks from the second RSiena model (Fig. 12.5).

library(igraph)
el <- RSmod2$sims[[500]][[1]][[1]][[3]]
sb <- RSmod2$sims[[500]][[1]][[2]][[3]]
fr_w4_sim <- graph.data.frame(el,directed = TRUE)
V(fr_w4_sim)$smoke <- sb
V(fr_w4_sim)$gender <- V(fr_w4)$gender
fr_w4_sim
## IGRAPH DN-- 37 176 --
## + attr: name (v/c), smoke (v/n), gender
## | (v/n), V3 (e/n)
## + edges (vertex names):
##  [1] 1 ->7  1 ->8  1 ->11 2 ->7  2 ->17 2 ->21
##  [7] 2 ->30 2 ->32 2 ->35 2 ->37 3 ->13 3 ->17
## [13] 3 ->20 3 ->21 3 ->33 4 ->13 4 ->17 4 ->21
## [19] 4 ->27 5 ->10 5 ->22 5 ->25 5 ->28 5 ->29
## [25] 6 ->9  6 ->11 6 ->15 6 ->34 7 ->1  7 ->8
## [31] 7 ->13 7 ->18 8 ->1  8 ->4  8 ->7  8 ->17
## [37] 8 ->19 8 ->21 8 ->30 8 ->32 9 ->6  9 ->7
## + ... omitted several edges

modularity(fr_w4_sim,membership = V(fr_w4_sim)$smoke+1)
## [1] 0.112
modularity(fr_w4,membership = V(fr_w4)$smoke+1)
## [1] 0.129

colors <- c("darkgreen","SkyBlue2")
coord <- layout.kamada.kawai(fr_w4)
op <- par(mfrow=c(1,2),mar=c(1,1,2,1))
plot(fr_w4,vertex.color=colors[V(fr_w4)$smoke+1],
     vertex.shape=shapes[V(fr_w4)$gender],
     vertex.size=10,main="Observed - Wave 4",
     vertex.label=NA,
     edge.arrow.size=0.5,layout=coord)
plot(fr_w4_sim,
     vertex.color=colors[V(fr_w4_sim)$smoke+1],
     vertex.shape=shapes[V(fr_w4_sim)$gender],
     vertex.size=10,main="Simulated - Wave 4",
     vertex.label=NA,
     edge.arrow.size=0.5,layout=coord)
par(op)

[Figure: two network plots side by side, "Observed - Wave 4" and "Simulated - Wave 4"]
Fig. 12.5 Comparison of observed network to simulated network

Chapter 13
Simulations

Sometimes, if you want to change a man’s mind, you have to change the mind of the man next to him first. (Megan Whalen Turner – The King of Attolia)

13.1 Simulations of Network Dynamics

Chapter 10 illustrated how R tools can be used to simulate networks with specific structures, often based on particular network science models. These modeled networks are useful in that they reveal social structures that may reflect reality, or are interesting for purely theoretical reasons. In any case, these are static networks.

However, social networks are dynamic. Social networks can grow or shrink over time, and their composition can similarly change. For example, friendship networks grow as people expand their friendship group, and get smaller if friends move away or friendships cool. The composition of friendship networks may change dramatically during transitions (e.g., from middle school to high school or from college to work). This type of network dynamics is captured by changes in node composition (which people are in the network) and by changes in the pattern of ties connecting the nodes.
A second type of network dynamics is when some characteristic or behavior of members in a social network is influenced by the structural properties of the network itself. For example, in public health it has long been known that adolescents are more likely to start smoking if they have friends or family members in their social networks who smoke.

In the rest of this chapter two detailed examples will be presented that illustrate how R can be used to model these two broad types of social dynamics. The ability to build simulations of network dynamics reflects a particular strength of R. In particular, these network simulations are possible because of the integration of data management, statistical programming, and network analysis in R. These types of models are not possible to do in any traditional network analysis package such as Pajek, UCINet, Gephi, or NodeXL.

13.1.1 Simulating Social Selection

When network structures change over time, we assume that there is an underlying process of social selection that determines how new ties are formed or dissolved. (Although this process may be at least partly random, we usually want to propose a more interesting model where tie formation is driven by node characteristics, local network structure, or network history.) Social selection has been studied extensively in the social sciences, and it has been shown to partly explain homophily, which is the general pattern that similar types of people tend to be connected to each other (McPherson et al. 2001; also see Fig. 12.1).

In this section a dynamic model of social selection is constructed. The model is deliberately kept simple, both to make it easier to understand and to reflect good model building practices.
In particular, it is generally a good idea when exploring computational simulations to start simple, and only add complexity once the simple model is fully understood. Also, this model is not built to emphasize efficiency (speed of execution). Efficiency could be built in later when the simulation has been fully tested and the analyst is ready to scale up the simulation to handle larger runs.

For our model, we will assume that we have a friendship network where friendship ties can change over time, and that these changes are driven by the similarity (and dissimilarity) among network members on some abstract node characteristic. This characteristic can be thought of as a behavior, or also possibly an attitude or opinion. The characteristic is also quantitative, so that we can think of network members having more or less of this characteristic. For example, the characteristic might represent physical activity behavior, where some network members have more of this characteristic (they exercise more) or less of it. In this example, we will use the simulation model to understand how overall network homophily changes over time based on individual changes in network tie formation and dissolution.

13.1.1.1 Setting Up the Simulation

To start building and testing our simulation, we need to have a network to work with. We also want to define some basic network characteristics and model parameters that will be used as we proceed. We will start with a simple random network that has 25 members.

library(igraph)
N <- 25
netdum <- erdos.renyi.game(N, p=0.10)
graph.density(netdum)
## [1] 0.1
mean(degree(netdum))
## [1] 2.4

The network also needs to include a node characteristic that will be used to drive the network changes. This abstract behavior (Bh) can range from 0 to 1 to capture diversity from low to high. To help with interpretation later on, a categorical node characteristic (BhCat) is also calculated.
The categorical variable has five levels, and can be thought of as ‘very low,’ ‘somewhat low,’ ‘medium,’ ‘somewhat high,’ and ‘very high.’

Bh <- runif(N,0,1)
BhCat <- cut(Bh, breaks=5, labels = FALSE)
V(netdum)$Bh <- Bh
V(netdum)$BhCat <- BhCat
table(V(netdum)$BhCat)
##
## 1 2 3 4 5
## 8 5 2 4 6

Here is what the network looks like after setting up a color palette that maps onto the five levels of BhCat. (See Chap. 5 for details on how color palettes work.)

library(RColorBrewer)
my_pal <- brewer.pal(5, "PiYG")
V(netdum)$color <- my_pal[V(netdum)$BhCat]
crd_save <- layout.auto(netdum)
plot(netdum, layout = crd_save)

13.1.1.2 Creating an Update Function

To start the simulation building process, we will work from the inside out. That is, we start by building the inner workings that allow a network to change, and then build a simulation framework around that.

The core of a dynamic network model is allowing a network to change over time. Time is a rather abstract or slippery concept in this context, but essentially we want to be able to change a network and observe those changes. A good place to start is to define a function that updates the network structure, in this case by forming or dissolving a single tie between two nodes.

Before the formal function is written, we can manually work through the process by which we would like to form a new tie or dissolve an existing tie. Starting with removing a tie is slightly easier. First, we need a way to identify all the existing ties for a particular node. In igraph there are a couple of equivalent ways to extract the adjacency list for a node in a network, which is a list of its direct ties. Here are the nodes that are tied to Node 24 in netdum. (To save some syntax space and protect us against inadvertent changes to the original data, the network is copied first.)

Fig.
13.1 Random test network with five levels of behavior indicated

g <- netdum
get.adjlist(g)[24]
## [[1]]
## + 4/25 vertices:
## [1]  2  9 15 22
g[[24,]]
## [[1]]
## + 4/25 vertices:
## [1]  2  9 15 22

Referring back to Fig. 13.1, you can see that Node 24 is connected to four other nodes (2, 9, 15, and 22). The get.adjlist() syntax is easier to decipher than the double-bracket shortcut, so that will be used hereafter.

In our model, we will want changes in ties to be driven by our node characteristic (Bh). Specifically, a reasonable model might suppose that friendship ties are more likely to be dissolved when the two friends are more dissimilar to each other on the behavior of interest. To model this we want to be able to compare the behavioral level of a particular network member with the behaviors of all of her friends. This is easy to do, building on the previous syntax. This uses the igraph vertex extractor function V(). Also, the get.adjlist() function returns a list, so the unlist() function is used to obtain a simple numeric vector. Finally, the stored adjacency list is used to filter the network to only display the behavior values for the nodes adjacent to Node 24. It looks like Node 24 has low amounts of exercise, which is similar to the level of Node 15. Node 24 is most dissimilar to Node 9, who has a very high level of exercise.

V(g)[24]$Bh
## [1] 0.192
V_adj <- unlist(get.adjlist(g)[24])
V(g)[V_adj]$Bh
## [1] 0.389 0.952 0.114 0.325

All of this can be combined into a simple dissimilarity vector that measures the absolute values of the differences between the Bh values of Node 24 and its adjacent nodes.

BhDiff <- abs(V(g)[V_adj]$Bh - V(g)[24]$Bh)
BhDiff
## [1] 0.1966 0.7604 0.0783 0.1327

The next step is to use this information to select a tie that should be removed from the network.
This can be done manually by identifying the two nodes and assigning FALSE to the node pair (again, by making a copy first):

gdum <- g
gdum[24,9] <- FALSE
get.adjlist(gdum)[24]
## [[1]]
## + 3/25 vertices:
## [1]  2 15 22

A more programmatic way to select the tie to remove is to identify the most dissimilar pair of nodes. This can be done using index filtering (Fig. 13.2).

gdum <- g
V_sel <- V_adj[BhDiff == max(BhDiff)]
gdum[24,V_sel] <- FALSE
get.adjlist(gdum)[24]
## [[1]]
## + 3/25 vertices:
## [1]  2 15 22
plot(gdum, layout = crd_save)

Fig. 13.2 Test network with one tie (24-9) removed

However, in our model we won’t always want to remove the tie from the most dissimilar pair of nodes. Instead, we would like to randomly remove ties where the probability is weighted by the amount of dissimilarity. This adds some randomness and heterogeneity to the model. Once we have the dissimilarity vector, it can be used to do this type of random tie selection.

gdum <- g
V_sel <- sample(V_adj,1,prob=BhDiff)
gdum[24,V_sel] <- FALSE
get.adjlist(gdum)[24]
## [[1]]
## + 3/25 vertices:
## [1]  2 15 22

The sampling function selects one node from the list of adjacent nodes, with probability weighted by the dissimilarity vector. So, the more dissimilar the node pair is, the more likely it will be selected to be removed. To check that the sampling is working properly, we can sample multiple times and look at the selection distribution. This shows that Node 9 is selected most often, and Node 15 least often, which matches expectations based on the similarity on Bh values.

smplCheck <- sample(V_adj,500,replace=TRUE,prob=BhDiff)
table(smplCheck)
## smplCheck
##   2   9  15  22
##  91 327  35  47

Adding a new tie from a particular node to any other node proceeds in a similar fashion.
There are two main differences. First, instead of picking the most dissimilar pair of nodes, we want to pick two nodes that are close to each other on the Bh characteristic. Second, because we want to add a new tie, we need a way to select all of the non-adjacent nodes (i.e., nodes that are not directly tied to the target node).

The non-adjacent nodes can be selected by removing the adjacent nodes as well as the target vertex ID from a list of all the nodes. Nodes in igraph are numbered from 1 to the total size of the network. So, if we are still interested in Node 24, the following finds all non-adjacent nodes. This uses the vector indexing facility of R where values are dropped if a negative sign is used.

vtx <- 24
nodes <- 1:vcount(g)
V_nonadj <- nodes[-c(vtx,V_adj)]
V_nonadj
##  [1]  1  3  4  5  6  7  8 10 11 12 13 14 16 17 18
## [16] 19 20 21 23 25

Following the same logic as before, we can now randomly create a new tie, based on the similarity between a pair of nodes. Each absolute difference is subtracted from 1, so the vector BhDiff2 now contains similarity scores.

BhDiff2 <- 1-abs(V(g)[V_nonadj]$Bh - V(g)[vtx]$Bh)
BhDiff2
##  [1] 0.912 0.546 0.956 0.771 0.906 0.967 0.712
##  [8] 0.900 0.207 0.663 0.859 0.309 0.997 0.194
## [15] 0.305 0.480 0.433 0.486 0.820 0.249
Sel_V <- sample(V_nonadj,1,prob=BhDiff2)
gnew <- g
gnew[vtx,Sel_V] <- TRUE
get.adjlist(gnew)[vtx]
## [[1]]
## + 5/25 vertices:
## [1]  2  9 10 15 22

The following code assigns a different color (“darkred”) to the newly added tie, so that it can be seen more easily in the plot (Fig. 13.3).

E(gnew)$color <- "grey"
E(gnew, P = c(vtx, Sel_V))$color <- "darkred"
plot(gnew, layout = crd_save)

All of this is preparation for creating a simple update function that can be called within a larger network simulation. The function that follows accepts a network (igraph) object and a target vertex.
It first checks whether the passed vertex is an isolate. If it is, the function silently returns the unaltered network (because you can’t remove a tie from an isolate). If the vertex is not an isolate, then the function randomly removes an existing tie with probability based on the dissimilarity between all tied pairs. It then adds a new tie with probability based on the similarity of all non-tied pairs. The two operations are combined into the same function so that the returned network object has the same density (i.e., number of total ties) as the original network. This makes the subsequent simulation easier to interpret. Note that this function relies on the igraph object already having the vertex attribute Bh defined.

Sel_update <- function(g,vtx){
  V_adj <- neighbors(g,vtx)
  if(length(V_adj)==0) return(g)
  BhDiff1 <- abs(V(g)[V_adj]$Bh - V(g)[vtx]$Bh)
  Sel_V <- sample(V_adj,1,prob=BhDiff1)
  g[vtx,Sel_V] <- FALSE
  nodes <- 1:vcount(g)
  V_nonadj <- nodes[-c(vtx,V_adj)]
  BhDiff2 <- 1-abs(V(g)[V_nonadj]$Bh - V(g)[vtx]$Bh)
  Sel_V <- sample(V_nonadj,1,prob=BhDiff2)
  g[vtx,Sel_V] <- TRUE
  g
}

Fig. 13.3 Test network with one new tie added to Node 24

To test the function, pass it an igraph object along with the vertex whose ties are to be updated.

gtst <- g
node <- 24
gnew <- Sel_update(g,node)
neighbors(gtst,node)
## + 4/25 vertices:
## [1]  2  9 15 22
neighbors(gnew,node)
## + 4/25 vertices:
## [1]  2  7 15 22

13.1.1.3 Building a Simple Simulation of Social Selection

Now that an update function is available that randomly adds and drops ties in the abstract friendship network, a dynamic simulation model of social selection can be built. The following simulation model is simple, but has all of the elements of a dynamic network model. It operates on a network, makes changes over time, and those changes are observable.
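Before the update function is wrapped in a simulation loop, one edge case in the weighted tie selection is worth keeping in mind. When a plain numeric vector of length one is passed to base R's sample(), R treats the value as the upper end of the range 1:x rather than as the only candidate, so a node with exactly one neighbor can trigger an error or an unintended draw. The sketch below works on ordinary numeric vectors, not igraph vertex sequences, and the helper name pick_one is our own; it follows the index-sampling idiom described in the sample() help page and assumes that at least one weight is positive.

```r
# Draw one element of x with the given weights by sampling an index
# instead of the values, so a length-one x is returned as is.
pick_one <- function(x, prob) {
  x[sample.int(length(x), 1, prob = prob)]
}

# Neighbors of Node 24 and their dissimilarity weights, from the text
V_adj  <- c(2, 9, 15, 22)
BhDiff <- c(0.1966, 0.7604, 0.0783, 0.1327)

set.seed(1)
draws <- replicate(200, pick_one(V_adj, BhDiff))
table(draws)

# A node with a single neighbor: the helper returns that neighbor
single <- pick_one(9, 0.76)
single

# The same call through sample() fails, because 9 is read as 1:9
# and the weight vector then has the wrong length
res <- tryCatch(sample(9, 1, prob = 0.76), error = function(e) e)
inherits(res, "error")
```

Whether this guard is needed depends on how the neighbor set is stored, so it is best treated as a defensive option rather than as a required change to the update function.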
In a dynamic model, time can be realistic or highly abstract. For the social selection model presented here, an abstract notion of time is used. Specifically, nodes will be selected randomly for updating, and we assume that there is some passage of time between each update. However, no further specific characteristics of time are provided or needed.

The following function Sel_sim encapsulates the social selection simulation. The function accepts an igraph network object and the number of desired updates. It starts by defining a list object that will be used to store the updated networks. Then, inside a loop that runs for the number of desired updates, a random node is selected and the update function is called, which removes an existing tie and adds a new one. The updated network is stored in the list object after each step, and after the loop is finished the entire network list is returned.

Sel_sim <- function(g,upd){
  g_lst <- lapply(1:(upd+1), function(i) i)
  g_lst[[1]] <- g
  for (i in 1:upd) {
    gnew <- g_lst[[i]]
    node <- sample(1:vcount(g),1)
    gupd <- Sel_update(gnew,node)
    g_lst[[i+1]] <- gupd
  }
  g_lst
}

This simulation function is inefficient in a few ways. First, the loop could possibly be replaced with a vectorized function. This could speed up the function, with some loss in readability. More importantly, the function stores the entire network for each update step. This will result in very large objects being returned by the function, based on the size of the network and number of updates. It is helpful at early stages of simulation development to preserve the whole networks, so that they can be examined. However, in later stages it would be typical to change the function to only return specific information about the networks, rather than the networks themselves.

The next step is to create a larger random network that will be used as input for the social selection simulation.
N <- 100
netdum <- erdos.renyi.game(N, p=0.10)
graph.density(netdum)
## [1] 0.0949
mean(degree(netdum))
## [1] 9.4
Bh <- runif(N,0,1)
BhCat <- cut(Bh, breaks=5, labels = FALSE)
V(netdum)$Bh <- Bh
V(netdum)$BhCat <- BhCat
table(V(netdum)$BhCat)
##
##  1  2  3  4  5
## 13 19 32 13 23

Now that a starting network has been created, the actual simulation is run and the results stored in a list of network objects. In this case, the simulation starts with the netdum network object, and it is run for 500 updates. The returned list of network objects contains the original network in the first position, and then 500 additional networks which correspond to each update in the simulation.

set.seed(999)
g_lst <- Sel_sim(netdum,500)
length(g_lst)
## [1] 501
summary(g_lst[[1]])
## IGRAPH U--- 100 470 -- Erdos renyi (gnp) graph
## + attr: name (g/c), type (g/c), loops (g/l),
## | p (g/n), Bh (v/n), BhCat (v/n)

13.1.1.4 Interpreting the Results of the Simulation

The results of the simulation should always be examined first to determine that the simulation ran as expected, and then relevant characteristics of the networks can be studied to see what patterns have emerged from the simulation modeling. Then it can be determined if the results inform some research question or hypothesis about the network dynamics.

A simple methods check for this simulation is that if the simulation worked as intended, then we should see no changes in density over the simulated networks. However, we should not see the same patterns of direct ties for any particular node in the network. (Note that the tie patterns will only be different for a particular node if that node was selected to be updated in the simulation. This is one reason to make sure to run the simulation many more times than the size of the network, to help ensure that all or at least most of the nodes have been updated.)
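The advice to run many more updates than there are nodes can be quantified with a short calculation. The sketch below assumes, as in Sel_sim, that each update picks one of the N nodes uniformly at random, so the chance that a given node is never picked in upd updates is (1 - 1/N)^upd. The names p_never, expected_never, and n_never are our own, and the calculation concerns only which node is chosen for updating.

```r
N   <- 100   # nodes in the network
upd <- 500   # number of updates

# Probability that one particular node is never chosen for updating
p_never <- (1 - 1/N)^upd
p_never

# Expected number of nodes that are never chosen
expected_never <- N * p_never
expected_never

# Empirical check: draw the chosen nodes once and count the nodes
# that never appear
set.seed(1)
picks   <- sample(1:N, upd, replace = TRUE)
n_never <- N - length(unique(picks))
n_never
```

With 500 updates on 100 nodes the expected number of never-chosen nodes is below one, which is why the density and neighbor checks that follow compare the first network with the last one.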
graph.density(g_lst[[1]])
## [1] 0.0949
graph.density(g_lst[[501]])
## [1] 0.0949
neighbors(g_lst[[1]],1)
## + 5/100 vertices:
## [1] 16 33 51 65 72
neighbors(g_lst[[501]],1)
## + 3/100 vertices:
## [1]  4 65 66

After it has been determined that the simulation is running properly, then more substantive assessments can be done. A basic hypothesis for this simple example is that we would expect the network to become more homophilous over time. This is because the tie updates are partially driven by the similarities of the nodes on the abstract behavioral characteristic. In this case, network modularity is a useful metric (see Chap. 8 for more information about modularity).

modularity(g_lst[[1]],BhCat)
## [1] -0.0221
modularity(g_lst[[501]],BhCat)
## [1] 0.168

Here we can see that the modularity was lower in the starting network compared to the final updated network. The higher modularity at the end tells us that ties between nodes in the same BhCat category are relatively more likely than ties between different categories. That is, connected nodes are more similar to each other than when the simulation started, thus demonstrating homophily.

This argument is more convincing if we use much more of the data provided by the simulation. The following plots show that there is substantial variability of modularity across the steps of the simulation. More importantly, modularity increases over time until near the end of the simulation run (Fig. 13.4).

sim_stat <- unlist(lapply(g_lst, function(u)
  modularity(u,V(u)$BhCat)))
op <- par(mfrow=(c(1,2)))
plot(density(sim_stat),main="",xlab="Modularity")
plot(0:500,sim_stat,type="l",
     xlab="Simulation Step",ylab="Modularity")
par(op)

13.1.2 Simulating Social Influence

This next example focuses on a second dynamic network process that can explain homophily in social networks – namely, social influence. Social influence is the process by which behaviors (or attitudes, opinions, etc.)
of an individual are influenced by the behaviors of those other persons who are close to them in their social network. Here, we develop a simple model of this process where an abstract behavior (Bh) for a particular member of a social network is influenced by the average behaviors of all of those to whom the person is directly tied. This constitutes a simple model of peer social influence.

To make this slightly more realistic (and interesting) we will also build into the model the concept of a tolerance region. Every member of the network has a tolerance range (Tl). If an adjacent member has a value of Bh that falls outside of the focal member’s tolerance range, then that adjacent member’s behavior will not influence the focal member. So, for example, if we again think of Bh as a measure of the level of physical activity, then in the following model the levels of physical activity of a network member’s friends will influence her own level of activity, but only when those friends’ levels of activity are somewhat close to her own.

13.1.2.1 Setting Up the Simulation

The model building process is similar to the previous example, so we can proceed with slightly less exposition. The first step is to set up an example network. The only difference here is that we add a new vertex characteristic, Tl, the tolerance range for each network member. To start with we assume that every member has the same tolerance range of 0.20. (Remember that because of the random nature of the simulation, your network values will not match what is presented below.)

Fig.
13.4 Modularity over time in the social selection simulation

N <- 25
netdum <- erdos.renyi.game(N, p=0.10)
Bh <- runif(N,0,1)
BhCat <- cut(Bh, breaks=5, labels = FALSE)
V(netdum)$Bh <- Bh
V(netdum)$BhCat <- BhCat
V(netdum)$Tl <- 0.20
V(netdum)$Tl[1:10]
## [1] 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2

13.1.2.2 Creating an Update Function

The first modeling step is once again to define a core update function. In this case, instead of updating ties, for social influence the vertex attribute Bh needs to be updated.

g <- netdum
V_adj <- neighbors(g,24)
V_adj
## + 4/25 vertices:
## [1]  2  9 15 22
V(g)[24]$Bh
## [1] 0.192
V(g)[V_adj]$Bh
## [1] 0.389 0.952 0.114 0.325

For Node 24 in the example network netdum, Bh starts quite low (0.19). The Bh values for Node 24’s four neighbors range from 0.11 to 0.95. A simple way to update the Bh value for Node 24 would be to take the mean of its starting Bh value and the average Bh value of all the neighbors. The new Bh value is 0.32, that is, 0.5 × (0.192 + 0.445), where 0.445 is the mean of the four neighbors’ values. The new value therefore reflects the Bh values of all four neighbors, and it has been particularly influenced by the high Bh value of Node 9 (0.95).

newval [...]
[...] 0,
    new_Bh <- .5*(V_Bh + mean(N_Bh[abs(N_Bh-V_Bh) < TL])),
    new_Bh <- V_Bh
  )
  new_Bh
}

Testing out the new update function, it returns the correct value for Node 24.

newval3 <- Inf_update(g,24)
newval3
## [1] 0.234

13.1.2.3 Building the Simulation of Social Influence

Now that the social influence update function has been created, it can be put inside a function that runs the dynamic network model. This is similar to the social selection model in the previous section, with one big difference. It makes more sense to have every node in the network be influenced at the same time, because we assume that social influence is more of a continuous process. That means that instead of selecting nodes randomly to get updated, here we will update every node in the network for every run of the simulation.
(Once again, the Inf_sim function presented below could be made more efficient in a number of ways.)

    Inf_sim <- function(g,runs){
      g_lst <- vector("list", runs+1)   # starting network plus one network per run
      g_lst[[1]] <- g
      for (i in 1:runs) {
        gnew <- g_lst[[i]]
        # Update Bh for every node in turn
        for (j in 1:length(V(g))) {
          V(gnew)[j]$Bh <- Inf_update(g=gnew,vtx=j)
        }
        g_lst[[i+1]] <- gnew
      }
      g_lst
    }

The following code sets up the same type of random starting network, and then runs the influence model 50 times on the network. The simulation does not need to be run as long as the selection model, because every node gets updated for every run of the influence model.

    set.seed(999)   # set before the random draws so the starting network is reproducible
    N <- 100
    netdum <- erdos.renyi.game(N, p=0.10)
    Bh <- runif(N,0,1)
    BhCat <- cut(Bh, breaks=5, labels = FALSE)   # recomputed for the new Bh values
    V(netdum)$Tl <- .20
    V(netdum)$Bh <- Bh
    V(netdum)$BhCat <- BhCat
    g_lst <- Inf_sim(netdum,50)

13.1.2.4 Interpreting the Results of the Simulation

As the influence model runs, we might expect the variability of Bh to decrease because of the way that network members are adjusting their behaviors based on the average of their selected neighbors' behaviors (Fig. 13.5).
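One way to put a number on this expectation is to track the standard deviation and range of Bh after each run. The sketch below is a self-contained base-R stand-in, not the chapter's igraph code: the network is a plain adjacency matrix, and the names adj, bh, tol, and sweep_once are illustrative assumptions. Like the loop in Inf_sim, it visits nodes in order, so later nodes see values that were already updated in the same run.

```r
set.seed(1)
n   <- 100
tol <- 0.20

# Random undirected network with tie probability 0.10, as a 0/1 adjacency matrix
adj <- matrix(0, n, n)
adj[upper.tri(adj)] <- rbinom(n * (n - 1) / 2, 1, 0.10)
adj <- adj + t(adj)

bh <- runif(n)   # starting behavior values

# One run of the model: update every node in turn
sweep_once <- function(bh, adj, tol) {
  for (j in seq_along(bh)) {
    nb   <- bh[adj[j, ] == 1]            # neighbors' current values
    near <- nb[abs(nb - bh[j]) < tol]    # only those within tolerance
    if (length(near) > 0) bh[j] <- 0.5 * (bh[j] + mean(near))
  }
  bh
}

runs <- 50
sd_by_run     <- numeric(runs + 1)
spread_by_run <- numeric(runs + 1)
sd_by_run[1]     <- sd(bh)
spread_by_run[1] <- diff(range(bh))
for (i in seq_len(runs)) {
  bh <- sweep_once(bh, adj, tol)
  sd_by_run[i + 1]     <- sd(bh)
  spread_by_run[i + 1] <- diff(range(bh))
}

# Standard deviation at the same checkpoints used for the density plots
round(sd_by_run[c(1, 6, 11, 16, 26, 51)], 4)

# With the igraph list from the text, the analogous summary would be:
# sapply(g_lst, function(g) sd(V(g)$Bh))
```

Because each update is an average of values already present in the network, the range of Bh cannot grow from one run to the next; how quickly the standard deviation falls depends on the tolerance and on the density of the network.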
    op <- par(mfrow=c(3,2))
    plot(density(V(g_lst[[1]])$Bh),xlim=c(-.2,1.2),
         main="Original network")
    plot(density(V(g_lst[[6]])$Bh),xlim=c(-.2,1.2),
         main="After 5 runs")
    plot(density(V(g_lst[[11]])$Bh),xlim=c(-.2,1.2),
         main="After 10 runs")
    plot(density(V(g_lst[[16]])$Bh),xlim=c(-.2,1.2),
         main="After 15 runs")
    plot(density(V(g_lst[[26]])$Bh),xlim=c(-.2,1.2),
         main="After 25 runs")
    plot(density(V(g_lst[[51]])$Bh),xlim=c(-.2,1.2),
         main="After 50 runs")
    par(op)

Fig. 13.5 Variability of Bh over time (density plots of Bh in six panels: Original network, After 5 runs, After 10 runs, After 15 runs, After 25 runs, After 50 runs)

The plots show us a few interesting things. First, it takes a few runs of the model to start shifting the behavioral variability. However, the homogenization process does not require 50 runs; by Run 25 almost all of the nodes have shifted their behavior to match those in the middle. We can also see that the network has become more homophilous. Figure 13.6 shows that by Run 25 most of the nodes fall into the middle Bh category, whereas they started out evenly distributed among the five categories.
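The shift into the middle category can also be checked with a simple count per category. The sketch below uses the same five breaks as the chapter's code, but the helper bh_cat_counts and the two Bh vectors are illustrative stand-ins (a uniform sample and an artificially concentrated sample), not output from the simulation.

```r
breaks <- c(0, .2, .4, .6, .8, 1)

# Count how many Bh values fall into each of the five categories
# (bh_cat_counts is a hypothetical helper name)
bh_cat_counts <- function(bh) {
  cats <- cut(bh, breaks = breaks, labels = FALSE, include.lowest = TRUE)
  tabulate(cats, nbins = length(breaks) - 1)
}

set.seed(2)
bh_start <- runif(100)                                # stand-in for the starting values
bh_late  <- pmin(pmax(rnorm(100, 0.5, 0.02), 0), 1)   # stand-in for homogenized values

bh_cat_counts(bh_start)   # spread across the five categories
bh_cat_counts(bh_late)    # concentrated in the middle category

# With the igraph list from the text, the analogous tables would be:
# table(V(g_lst[[1]])$BhCat); table(V(g_lst[[26]])$BhCat)
```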
    V(g_lst[[1]])$BhCat <- cut(V(g_lst[[1]])$Bh,
                               breaks=c(0,.2,.4,.6,.8,1),
                               labels = FALSE)
    V(g_lst[[26]])$BhCat <- cut(V(g_lst[[26]])$Bh,
                                breaks=c(0,.2,.4,.6,.8,1),
                                labels = FALSE)
    V(g_lst[[51]])$BhCat <- cut(V(g_lst[[51]])$Bh,
                                breaks=c(0,.2,.4,.6,.8,1),
                                labels = FALSE)
    # my_pal is assumed to be the five-color palette defined earlier in the chapter
    V(g_lst[[1]])$color <- my_pal[V(g_lst[[1]])$BhCat]
    V(g_lst[[26]])$color <- my_pal[V(g_lst[[26]])$BhCat]
    op <- par(mfrow=c(1,2),mar=c(0,0,2,0))
    plot(g_lst[[1]],vertex.label=NA,
         main="Original network")
    plot(g_lst[[26]],vertex.label=NA,
         main="Network after Run 25")
    par(op)

Fig. 13.6 Greater homophily over time (two panels: Original network, Network after Run 25)

These simulations are intended as simple examples that illustrate the power of R for exploring network and behavioral dynamics. They can be extended in a number of ways, both to learn how to build such simulations and to apply them to more serious scientific questions. There are at least two simple ways to extend the examples presented here. First, in the social influence simulation, instead of having every network member start with the same tolerance range (0.20), the effects of heterogeneous tolerance values on characteristics of social influence could be explored. (Or, you can explore the effects of decreasing or increasing the tolerance levels.) Second, the examples presented here separate out social influence and social selection processes. The two simulations could be combined to explore the characteristics of simultaneous selection and influence processes. (This is examined in Chap. 12 from a statistical modeling perspective.)