MARS Manual
Multiversion Asynchronous Replicated Storage

Thomas Schöbel-Theuer (tst@1und1.de)

Version 0.1-64

Copyright (C) 2013-16 Thomas Schöbel-Theuer
Copyright (C) 2013-16 1&1 Internet AG (see http://www.1und1.de, shortly called 1&1 in the following).

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

Abstract

MARS is a block-level storage replication system for long distances / flaky networks, released under the GPL. It runs as a Linux kernel module. The sysadmin interface is similar to DRBD¹, but its internal engine is completely different from DRBD: it works with transaction logging, similar to some database systems. Therefore, MARS can provide stronger consistency guarantees. Even in case of network bottlenecks / problems / failures, the secondaries may become outdated (reflect an older state), but never become inconsistent. In contrast to DRBD, MARS preserves the order of write operations even when the network is flaky (Anytime Consistency). The current version of MARS supports k > 2 replicas and works asynchronously. Therefore, application performance is completely decoupled from any network problems. Future versions are planned to also support synchronous or near-synchronous modes.

MARS supports a new method for building Cloud Storage / Software Defined Storage, called LV Football. It comes with some automation scripts, leading to functionality similar to Kubernetes, but devoted to stateful LVs over virtual LVM pools in the petabytes range.

¹ Registered trademarks are the property of their respective owner.

Contents

1. Architectures of Cloud Storage / Software Defined Storage / Big Data
   1.1. What is Architecture
   1.2. What is Cloud Storage
   1.3. Granularity at Architecture
   1.4. Replication vs Backup
      1.4.1. Example: Point-in-time Replication via ZFS Snapshots
   1.5. Local vs Centralized Storage
      1.5.1. Internal Redundancy Degree
      1.5.2. Capacity Differences
      1.5.3. Caching Differences
      1.5.4. Latencies and Throughput
      1.5.5. Reliability Differences CentralStorage vs Sharding
      1.5.6. Proprietary vs OpenSource
   1.6. Distributed vs Local: Scalability Arguments from Architecture
      1.6.1. Variants of Sharding
      1.6.2. FlexibleSharding
      1.6.3. Principle of Background Migration
   1.7. Cost Arguments
      1.7.1. Cost Arguments from Technology
      1.7.2. Cost Arguments from Architecture
   1.8. Reliability Arguments from Architecture
      1.8.1. Storage Server Node Failures
         1.8.1.1. Simple intuitive explanation
         1.8.1.2. Detailed explanation
      1.8.2. Optimum Reliability from Architecture
      1.8.3. Error Propagation to Client Mountpoints
      1.8.4. Similarities and Differences to Copysets
   1.9. Performance Arguments from Architecture
   1.10. Scalability Arguments from Architecture
      1.10.1. Example Failures of Scalability
      1.10.2. Properties of Storage Scalability
         1.10.2.1. Influence Factors at Scalability
         1.10.2.2. Example Scalability Scenario
      1.10.3. Scalability of Filesystem Layer vs Block Layer
   1.11. Recommendations for Designing and Operating Storage Systems
2. Use Cases for MARS vs DRBD
   2.1. Network Bottlenecks
      2.1.1. Behaviour of DRBD
      2.1.2. Behaviour of MARS
   2.2. Long Distances / High Latencies
   2.3. Explanation via CAP Theorem
      2.3.1. CAP Differences between DRBD and MARS
      2.3.2. CAP Commonalities between DRBD and MARS
   2.4. Higher Consistency Guarantees vs Actuality
3. Quick Start Guide
   3.1. Preparation: What you Need
   3.2. Setup Primary and Secondary Cluster Nodes
      3.2.1. Kernel and MARS Module
      3.2.2. Setup your Cluster Nodes
   3.3. Creating and Maintaining Resources
   3.4. Keeping Resources Operational
      3.4.1. Logfile Rotation / Deletion
      3.4.2. Switch Primary / Secondary Roles
         3.4.2.1. Intended Switching / Planned Handover
         3.4.2.2. Forced Switching
      3.4.3. Split Brain Resolution
      3.4.4. Final Destruction of a Damaged Node
      3.4.5. Online Resizing during Operation
   3.5. The State of MARS
   3.6. Inspecting the State of MARS
4. Basic Working Principle
   4.1. The Transaction Logger
   4.2. The Lamport Clock
   4.3. The Symlink Tree
   4.4. Defending Overflow of /mars/
      4.4.1. Countermeasures
         4.4.1.1. Dimensioning of /mars/
         4.4.1.2. Monitoring
         4.4.1.3. Throttling
      4.4.2. Emergency Mode and its Resolution
5. The Macro Processor
   5.1. Predefined Macros
      5.1.1. Predefined Complex and High-Level Macros
      5.1.2. Predefined Primitive Macros
         5.1.2.1. Intended for Humans
         5.1.2.2. Intended for Scripting
   5.2. Creating your own Macros
      5.2.1. General Macro Syntax
      5.2.2. Calling Builtin / Primitive Macros
      5.2.3. Predefined Variables
   5.3. Scripting HOWTO
6. The Sysadmin Interface (marsadm and /proc/sys/mars/)
   6.1. Cluster Operations
   6.2. Resource Operations
      6.2.1. Resource Creation / Deletion / Modification
      6.2.2. Operation of the Resource
      6.2.3. Logfile Operations
      6.2.4. Consistency Operations
   6.3. Further Operations
      6.3.1. Inspection Commands
      6.3.2. Setting Parameters
         6.3.2.1. Per-Resource Parameters
         6.3.2.2. Global Parameters
      6.3.3. Waiting
      6.3.4. Low-Level Expert Commands
      6.3.5. Senseless Commands (from DRBD)
      6.3.6. Forbidden Commands (from DRBD)
   6.4. The /proc/sys/mars/ and other Expert Tweaks
      6.4.1. Syslogging
         6.4.1.1. Logging to Files
         6.4.1.2. Logging to Syslog
         6.4.1.3. Tuning Verbosity of Logging
      6.4.2. Tuning the Sync
7. Tips and Tricks
   7.1. Avoiding Inappropriate Clustermanager Types for Medium and Long-Distance Replication
      7.1.1. General Cluster Models
      7.1.2. Handover / Failover Reasons and Scenarios
      7.1.3. Granularity and Layering Hierarchy for Long Distances
      7.1.4. Methods and their Appropriateness
         7.1.4.1. Failover Methods
         7.1.4.2. Handover Methods
         7.1.4.3. Hybrid Methods
      7.1.5. Special Requirements for Long Distances
   7.2. systemd Templates
      7.2.1. Why systemd?
      7.2.2. Working Principle of the systemd Template Engine
      7.2.3. Example systemd Templates
      7.2.4. Handover involving systemd
   7.3. Creating Backups via Pseudo Snapshots
8. LV Football / VM Football / Container Football
   8.1. Football Overview
   8.2. HOWTO instantiate / customize Football
      8.2.1. Block Device Layer
      8.2.2. Mechanics Layer of Cluster Operations
      8.2.3. Mechanics Layer of Football Operations
         8.2.3.1. Configuring and Overriding Variables
         8.2.3.2. football-basic.sh Customization
9. MARS for Developers
   9.1. Motivation / Politics
   9.2. Architecture Overview
   9.3. Some Architectural Details
      9.3.1. MARS Architecture
      9.3.2. MARS Full Architecture (planned)
   9.4. Documentation of the Symlink Trees
      9.4.1. Documentation of the MARS Symlink Tree
   9.5. XIO Worker Bricks
   9.6. StrategY Worker Bricks
   9.7. The XIO Brick Personality
   9.8. The Generic Brick Infrastructure Layer
   9.9. The Generic Object and Aspect Infrastructure
A. Technical Data MARS
B. Handout for Midnight Problem Solving
   B.1. Inspecting the State of MARS
   B.2. Replication is Stuck
   B.3. Resolution of Emergency Mode
   B.4. Resolution of Split Brain and of Emergency Mode
   B.5. Handover of Primary Role
   B.6. Emergency Switching of Primary Role
C. Alternative Methods for Split Brain Resolution
D. Alternative De- and Reconstruction of a Damaged Resource
E. Cleanup in case of Complicated Cascading Failures
F. Experts only: Special Trick Switching and Rebuild
G. Mathematical Model of Architectural Reliability
   G.1. Formula for DRBD / MARS
   G.2. Formula for Unweighted BigCluster
   G.3. Formula for SizeWeighted BigCluster
H. Command Documentation for Userspace Tools
   H.1. marsadm --help
   H.2. football.sh --help
   H.3. football.sh --help --verbose
   H.4. screener.sh --help
   H.5. screener.sh --help --verbose
I. Football Redundancy Diagrams
   I.1. Parallel migrate
   I.2. Stepwise migrate
   I.3. Plain shrink
   I.4. Full migrate+shrink
   I.5. Stepwise migrate+shrink
   I.6. FAST migrate+shrink
J. GNU Free Documentation License

1. Architectures of Cloud Storage / Software Defined Storage / Big Data

Datacenter architects have no easy job. Building up some petabytes of data in the wrong way can easily endanger a company, as will be shown later. There are some architectural laws to know and some rules to follow. First, we need to take a look at the most general possibilities for how storage can be architecturally designed:

   Storage
      Local Storage (1:1 or 1:k)
      Network Storage
         Centralized Storage (1:n)
         Distributed Storage (n:m)

The topmost question is: do we always need to access bigger masses of (typically unstructured) data over a network? There is a common belief that both reliability and scalability can only be achieved this way. In the past, local storage has often been viewed as "too simple" to provide both enterprise-grade reliability and scalability. In the past, this was sometimes true. However, with the advent of a new method called "LV Football" this picture has changed, see chapter 8. We will later review what level of reliability and scalability can be achieved with each of the fundamental models mentioned here.

1.1. What is Architecture

From https://en.wikipedia.org/wiki/Software_architecture:

Software architecture refers to the high level structures of a software system and the discipline of creating such structures and systems.
Throughout this paper, the term "architecture" is strictly separated from "implementation". Either of them can relate to both hardware and software in general. Confusion of "architecture" with "implementation" is a major source of ill-designs, which then often cause major product flaws and/or operational problems. Be sure to understand the difference.

Another source of costly ill-designs is starting with a particular implementation in mind, and not sufficiently reasoning about its fundamental architecture. Recommended best practice is to (1) look at the problem space, then (2) consider a set of architectural solution classes, and (3) look at the mappings between them. This means: start with architectural requirements for a particular application area (typically covering multiple use cases), then look at multiple solution architectures, and only after the former has been understood go down to a set of potential implementations.

Starting with a particular single solution in mind is almost a guarantee for a non-optimum solution, or even a failed project, or even a disaster at company level when enterprise-critical mass data is involved. Nevertheless, don't think in waterfall models. Always work iteratively and evolutionarily, but obey the principle that any bug in an architectural ill-design cannot be fixed by the best implementation in the world. Be sure to understand the fundamental difference between architecture and its (multiple / alternative) implementations by their respective reach.

1.2. What is Cloud Storage

According to a popular definition from https://en.wikipedia.org/wiki/Cloud_storage (retrieved June 2018), cloud storage is

(1) Made up of many distributed resources, but still acts as one.
(2) Highly fault tolerant through redundancy and distribution of data.
(3) Highly durable through the creation of versioned copies.
(4) Typically eventually consistent with regard to data replicas.

Notice that the term "network" does not occur in this definition. However, the term "distributed resources" implies some(!) kind of network. Important: the definition does not imply any specific type of network, such as a storage network which must be capable of transporting masses of IO operations in realtime. We are free to use other types of networks, such as replication networks, which need not be dimensioned for realtime IO traffic, but are usable for background data migration, even over long distances, where the network typically has some bottlenecks.

Notice also that the definition says nothing about the time scale of operations¹. We are free to implement certain operations, such as background data migrations, on a rather long timescale (from a human point of view). Example: increasing the number of replicas in an operational Ceph cluster, already containing a few hundreds of terabytes of data, will not only require additional storage hardware, but will also take a rather long time, implied by the very nature of such reorganisational tasks.
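The following hedged sketch merely illustrates the kind of operation meant here; the pool name webdata is an assumption and not taken from this manual. Raising the replica count of an existing Ceph pool is a one-line configuration change, but the resulting backfill is exactly the kind of slow, non-realtime background work described above:

    # Hypothetical example: raise the replica count of an assumed pool "webdata".
    ceph osd pool set webdata size 3
    # Watch the resulting backfill / recovery; on hundreds of terabytes this
    # background reorganisation can run for many hours or days.
    ceph status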
The famous CAP theorem is one of the motivations behind requirement (4), "eventually consistent". This is not an accident. There is a reason for it, although it is not a hard requirement. Strict consistency is not needed for many applications running on top of cloud storage. In addition, the CAP theorem and some other theorems cited at https://en.wikipedia.org/wiki/CAP_theorem tell us that strict consistency would be difficult and expensive to achieve at global level in a bigger distributed system, and only at the cost of other properties. More detailed explanations are in section 2.3 on page 50.

¹ Notice: go down to a time scale of microseconds. You will then notice that typical IO operations require several hundreds of machine instructions between IO request submission and the corresponding IO request completion. This is not only true for local IO. In network clusters like Ceph, it will even involve creation of network packets, and lead to additional IO latencies implied by the network packet transfer latencies.

There are some consequences from this definition of Cloud Storage, for each of our high-level storage architectures:

Distributed Storage, in particular BigCluster architectures (see section 1.6): many of them (with few exceptions) conform to all of these requirements. Typical granularity is objects, or chunks, or other relatively small units of data.

Centralized Storage: does not conform to (1) and to (4) by definition². By introduction of synchronous or asynchronous replication, it can be made to almost conform, except for (1), where some concept mismatches remain (probably resolvable by going to a RemoteSharding model on top of CentralStorage, where CentralStorage is only a sub-component). Typical granularity is replication of whole internal storage pools, or of LVs, or of filesystem data.

LocalStorage, and some further models like RemoteSharding (see section 1.6.1):
(1) can be achieved at LV granularity with Football (see chapter 8), which creates a Big Virtual LVM Pool.
(2) can be achieved at disk granularity with local RAID, and at LV granularity with DRBD or MARS.
(3) can be achieved at LV granularity with LVM snapshots, and/or ZFS (or other filesystem) snapshots, and/or above the filesystem layer by addition of classical backup (see the sketch at the end of this section).
(4) at least Eventually Consistent or better can alternatively be achieved by
   (4a) DRBD, which provides strict consistency during connected state, but only works reliably with passive crossover cables over short distances (see CAP theorem in section 2.3 on page 50), or
   (4b) MARS, which works over long distances and provides two different consistency guarantees at different levels, both at the same time:
      locally: strict local consistency at LV granularity, also within each of the LV replicas;
      globally: eventually consistent between different LV replicas. However, in case of a planned handover, it is also strictly consistent at a global level, but may need some extra time for catching up.

Notice: BigCluster architectures create virtual storage pools out of physically distributed storage servers. For fairness reasons, creation of a big virtual LVM pool must be considered another valid Cloud Storage model, matching the above definition of Cloud Storage. The main architectural difference is granularity, as explained in section 1.3, and the stacking order of sub-components. Notice that Football creates location transparency inside of the distributed virtual LVM pool. This is an important (though not always required) basic property of any type of clusters and/or grids.
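As a minimal sketch of point (3) above (a point-in-time copy at LV granularity via an LVM snapshot), the following commands illustrate the idea; the volume group and LV names are assumptions, not taken from this manual:

    # Create a temporary snapshot of an assumed LV vg00/webdata,
    # run the backup against it, then drop the snapshot again.
    lvcreate --snapshot --size 50G --name webdata-snap /dev/vg00/webdata
    # ... back up /dev/vg00/webdata-snap here ...
    lvremove -y /dev/vg00/webdata-snap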
1.3. Granularity at Architecture

Here are the most important architectural differences between object-based storages and LV-based (Logical Volume) storages:

|                             | Objects                    | LVs                       |
|-----------------------------|----------------------------|---------------------------|
| Granularity                 | small (typically KiB)      | huge (several TiB)        |
| Number of instances         | very high                  | low to medium             |
| Typical access              | random keys                | named                     |
| Update in place             | no                         | yes                       |
| Resize during operation     | no                         | yes                       |
| Object support              | native                     | on top of                 |
| LV support                  | on top of                  | native                    |
| Filesystem support          | on top of                  | on top of                 |
| Scalable                    | at cluster                 | both cluster and grid     |
| Location distances          | per datacenter / on campus | long distances possible   |
| Centralized pool management | per cluster                | Football uniting clusters |
| Easy sharding support       | cumbersome                 | yes                       |

² Notice that sharding on top of CentralStorage is no longer a CentralStorage model by definition, but a RemoteSharding model according to section 1.6.1.

1.4. Replication vs Backup

Intuitively, data backup and data replication are two different solution classes, addressing different problems. However, there exist descriptions where both solution classes overlap, as well as their corresponding problem classes. For example, backup as explained in https://en.wikipedia.org/wiki/Backup could be seen as also encompassing some types of storage replication explained in https://en.wikipedia.org/wiki/Replication_(computing). For a rough comparison of typical implementations, see the following typical differences:

|                                          | Backup                | Replication  |
|------------------------------------------|-----------------------|--------------|
| Fast handover (planned)                  | no, or cumbersome     | yes          |
| Fast failover (unplanned)                | no, or cumbersome     | yes          |
| Protection from physical failures        | yes                   | yes          |
| Protection from logical data corruption  | yes                   | typically no |
| Disaster Recovery Time (MTTR)            | typically (very) slow | fast         |

Because of these typical differences, enterprise-critical data typically deserves both solution classes. Confusion of solution classes and/or their corresponding problem classes / properties can be harmful to enterprises and to the careers of the persons responsible.

1.4.1. Example: Point-in-time Replication via ZFS Snapshots

Some ZFS advocates believe that ZFS snapshots, which were originally designed for backup-like use cases, are also an appropriate solution for achieving geo-redundancy. The basic idea is to run incremental ZFS snapshots in an endless loop, e.g. via some simple scripts, shipping them to another host where the snapshots are then applied to another ZFS instance. When there is little data to be shipped, loop cycle times can go down to a few seconds. When much data is written at the primary site, loop cycle times will rise.
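For illustration only, here is a hedged, minimal sketch of such a snapshot-shipping loop; the dataset name tank/webdata and the remote host backuphost are assumptions, and a production setup would need far more error handling:

    #!/bin/bash
    # Naive incremental ZFS replication loop as described above (sketch only).
    SRC=tank/webdata          # assumed source dataset
    DST=backuphost            # assumed remote host receiving the stream
    PREV=""                   # last snapshot that was shipped successfully
    while true; do
        SNAP="$SRC@repl-$(date +%s)"
        zfs snapshot "$SNAP"
        if [ -z "$PREV" ]; then
            # initial full transfer
            zfs send "$SNAP" | ssh "$DST" zfs receive -F tank/webdata
        else
            # incremental transfer relative to the last shipped snapshot
            zfs send -i "$PREV" "$SNAP" | ssh "$DST" zfs receive -F tank/webdata
        fi
        [ -n "$PREV" ] && zfs destroy "$PREV"
        PREV="$SNAP"
        sleep 5               # effective cycle time grows with the amount of new data
    done

Note that the -F option on the receiving side silently rolls back any modifications made on the secondary before applying the stream, which already hints at the missing protection against illegal data modification discussed below.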
The following table tries to explain why geo-redundancy is not as simple to achieve as believed, at least not without sophisticated additional means³:

³ ZFS advocates often argue with many features which aren't present in other filesystem types. The table shows some dimensions not dealing with properties of local filesystems, but with problems / tasks arising in long-distance distributed systems involving masses of enterprise-critical storage.

|                                            | DRBD        | MARS  | ZFS         |
|--------------------------------------------|-------------|-------|-------------|
| Synchronicity (on average)                 | yes         | delay | delay × 1.5 |
| Generic solution                           | yes         | yes   | FS-specific |
| Granularity                                | LVs         | LVs   | subvolumes  |
| Built-in snapshots                         | no          | no    | yes         |
| Long distances                             | no          | yes   | yes         |
| Built-in primary/secondary roles           | yes         | yes   | no          |
| Built-in handover (planned)                | mostly      | yes   | no          |
| Built-in failover (unplanned)              | yes         | yes   | no          |
| Built-in data overflow handling            | unnecessary | yes   | no, missing |
| Unnoticed data loss due to overflow        | no          | no    | possible    |
| Split-brain awareness                      | yes         | yes   | no          |
| Execute split-brain resolution             | yes         | yes   | no          |
| Protect against illegal data modification  | yes         | yes   | no          |

The last item means that ZFS by itself does not protect against amok-running applications modifying the secondary (backup) side in parallel to the replication process (at least not by default). Workarounds may be possible, but are not easy to create and to test for enterprise-critical applications.

1.5. Local vs Centralized Storage

There is some old-fashioned belief that only centralized storage systems, as typically sold by commercial storage vendors, can achieve a high degree of reliability, while local storage is inferior by far. In the following, we will see that this is only true for an unfair comparison involving different classes of storage systems.

1.5.1. Internal Redundancy Degree

Centralized commercial storage systems are typically built up from highly redundant internal components:

1. Redundant power supplies with UPS.
2. Redundancy at the storage HDDs / SSDs.
3. Redundancy at internal transport busses.
4. Redundant RAM / SSD caches.
5. Redundant network interfaces.
6. Redundant compute heads.
7. Redundancy at control heads / management interfaces.

What about local hardware RAID controllers? Many people think that these relatively cheap units are massively inferior at practically each of these points. However, please take a really deep look at what classical RAID chip manufacturers like LSI / Avago / Broadcom and their competitors are offering as configuration variants of their top-notch models. The following enumeration is in the same order as above (item by item):

1. Redundant hardware RAID cards with BBU caches, each with local goldcaps surviving power outages, their BBU caches cross-coupled via high-speed interconnects.
2. HDD / SSD redundancy: almost any RAID level you can think of.
3. Redundant SAS cross-cabling: any head can access any device.
4. BBU caches are redundant and cross-coupled, similarly to RDMA. When SSD caches are added to both cards, you also get redundancy there.
5. When using cross-coupled redundant cards, you automatically get redundant host bus interfaces (HBAs).
6. The same story: you also get two independent RAID controller instances which can do RAID computations independently from each other. Some implementations do this even in hardware (ASICs).
7. Ditto: both cards may be plugged into two different servers, thereby creating redundancy at control level. As a side effect, you may also get functionality similar to DRBD.

If you compare typical prices for both competing systems, you will notice a huge difference. See also section 1.7.

1.5.2. Capacity Differences

There is another hard-to-die myth: commercial storage would provide higher capacity. Please read the data sheets. It is possible (but not generally recommended) to put several hundreds of spindles into several external HDD enclosures, and then connect them to a redundant cross-coupled pair of RAID controllers via several types of SAS busses.
By filling a rack this way, you can easily reach similar, if not higher, capacities than commercial storage boxes, for a fraction of the price. However, this is not the recommended way for general use cases (but could be an option for low demands like archiving). The big advantage of RAID-based local storage is massive scale-out by sharding, as explained in section 1.6.

1.5.3. Caching Differences

A frequent argument is that centralized storage systems have bigger caches than local RAID systems. While this argument is often true, it neglects an important point: local RAID systems often don't need bigger caches, because they are typically located at the bottom of a cache hierarchy, playing only a particular role in that hierarchy. There exist further caches which are erroneously not considered by such an argument!

Example (see also section 1.9 for more details): at 1&1 Shared Hosting Linux (ShaHoLin), a typical LXC container containing several thousand to tens of thousands of customer home directories creates a long-term average(!) IOPS load at block layer of about 70 IOPS. No, this isn't a typo. It is not 70,000 IOPS. It is only 70 IOPS. Linux kernel experts know why I am not kidding. The standard Linux kernel has two main caches, the Page Cache for file content, and the Dentry Cache (plus Inode slave cache) for metadata. Both caches reside in RAM, which is the fastest type of cache you can get. Nowadays, typical servers have several hundreds of gigabytes of RAM, sometimes even up to terabytes, resulting in an incredible caching behaviour which can be measured by those people who know how to do it (caution: it can easily be done wrongly; see the sketch at the end of this section). Many people neglect these caches, sometimes not knowing of their existence, and falsely assume that 1 application read() or write() operation will also lead to 1 IOPS at block layer. As a consequence, they are demanding 50,000 IOPS or 100,000 or even 1,000,000 IOPS.

Some (but not all) commercial storage systems can deliver similar IOPS rates, because they have internal RAM caches in the same order of magnitude. People who are buying such systems typically fall into some of the following classes (the list is probably incomplete):

• Some people know this, but price does not matter: the more caches, the better. Wasted money for doubled caches does not count for them, or is even viewed as an advantage to them (personally). Original citation of an anonymous person: "only the best and the most expensive storage is good enough for us".

• They are using NFS, which has extremely poor filesystem caching behaviour because the Linux nfs client implementation does not take full advantage of the dentry cache. Sometimes people know this, sometimes not. It seems that few people have read an important paper on the Linux implementation of nfs. Please search the internet for "Why nfs sucks" from Olaf Kirch (who is one of the original Linux nfs implementors), and read it. Your opinion about nfs might change.

• They have transactional databases, where high IOPS may really be needed, but exceptionally(!) for this class of application. For very big enterprise databases like big SAP installations, there may be a very valid justification for big RAM caches at storage layers. However: smaller transactional loads, as in webhosting, are often (not always) hammering a low number of hot spots, where big caches are not really needed. Relatively small BBU caches of RAID cards will do it also.
Often people don't notice this because they don't measure the working-set behaviour of their application, as could be done for example with blkreplay (see https://blkreplay.org).

• They do not notice that well-tuned filesystem caches over iSCSI typically demand much less IOPS, sometimes by several orders of magnitude, and are wasting money on caches in commercial boxes they don't need (classical over-engineering).

Anyway, local storage can be augmented with various types of local caches in various dimensionings. However, there is no point in accessing the fastest possible type of RAM cache remotely over a network. Even expensive hardware-based RDMA (e.g. over Infiniband) cannot deliver the same performance as directly caching your data in the same RAM where your application is running. The Dentry Cache in the Linux kernel provides highly optimized shared metadata in SMP and NUMA systems (nowadays scaling to more than 100 processor cores), while the Page Cache provides shared memory via the hardware MMU. This is crucial for the performance of classical local filesystems. The physical laws of Einstein and others tell us that neither this type of caching, nor its shared memory behaviour, can be transported over whatever type of network without causing performance degradation.
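To substantiate block-layer IOPS figures like the ones quoted above, a minimal (and deliberately simplistic) approach is to sample the block devices with iostat from the sysstat package; this sketch is an assumption about tooling, not a procedure prescribed by this manual:

    # Report extended per-device statistics every 60 seconds; the r/s and w/s
    # columns of the device backing your containers show the real block-layer
    # IOPS that remain after the page cache and dentry cache have done their work.
    iostat -x 60

Comparing these numbers with the read()/write() rates seen at application level gives a rough idea of the cache hit ratio, and thus of how much IOPS capacity the storage below actually has to deliver.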
1.5.4. Latencies and Throughput

First of all: today there exist only a small number of HDD manufacturers in the world. The number of SSD manufacturers will likely decline in the long run. Essentially, commercial storage vendors are more or less selling you the same HDDs or SSDs as you could buy and deploy yourself. If at all, there are only some minor technical differences. In the meantime, many people agree with a Google paper that the ratio of market prices (price per terabyte) between HDDs and SSDs is unlikely to change in a fundamental⁴ way during the next 10 years. Thus, most large-capacity enterprise storage systems are built on top of HDDs. Typically, HDDs and their mechanics form the overall bottleneck.

• By construction, a local HDD attached via HBAs or a hardware RAID controller will show the least additional overhead in terms of additional latencies and throughput degradation caused by the attachment.
• When the same HDD is indirectly attached via Ethernet or Infiniband or another rack-to-rack transport, both latencies and throughput will become worse. Depending on further factors and influences, the overall bottleneck may shift to the network.

The laws of information transfer tell us: with increasing distance, both latencies (laws of Einstein) and throughput (laws of energy needed for compensation of the SNR = signal-to-noise ratio) become worse. Distance matters. And the number of intermediate components, like routers / switches and their queuing, matters too. This means that local storage always has an advantage over any attachment via a network. Centralized storages are bound to some network, and thus suffer from disadvantages in terms of latencies and throughput.

⁴ In folklore, there exists a fundamental empirical law, fuzzily called "Storage Pyramid" or "Memory Hierarchy Law" or similar, which is well-known at least in German OS academic circles. The empirical law (extrapolated from observations, similarly to Moore's law) tells us that faster storage technology is always more expensive than slower storage technology, and that capacities of faster storage are typically always smaller than capacities of slower storage. This observation has been roughly valid for more than 50 years now. You can find it in several German lecture scripts. Unfortunately, the Wikipedia article https://en.wikipedia.org/wiki/Memory_hierarchy (retrieved in June 2018) does not cite this very important fundamental law about costs. In contrast, the German article https://de.wikipedia.org/wiki/Speicherhierarchie about roughly the same subject mentions "Kosten", which means "cost", and "teuer", which means "expensive".

What is the expected long-term future? Will additional latencies and throughput of centralized storages become better over time? It is difficult to predict the future. Let us first look at the past evolution. The following graphics takes its numbers from the Wikipedia articles https://en.wikipedia.org/wiki/List_of_device_bit_rates and https://en.wikipedia.org/wiki/History_of_hard_disk_drives, showing that HDD capacities have grown over-proportionally by about 2 orders of magnitude over about 30 years, when compared to the relative growth of network bandwidth. Effects caused by decreasing form factors have been neglected, which would even amplify the trend. For fairness, bundling of parallel disks or parallel communication channels⁵ has been ignored. All comparisons are on a logarithmic y-axis scale:

[Figure: "Long-Term Development of HDD Capacity vs Network Bandwidth": HDD.capacity, Ethernet.rates and Infiniband.rates plotted from 1970 to 2020, capacity in MByte and bit rates in MBit/s, logarithmic y axis.]

What does this mean when extrapolated into the future? It means that concentrating more and more capacity into a single rack due to increasing data density will likely lead to more problems in the future. Accessing more and more data over the network will become increasingly difficult when concentrating high-capacity HDDs or SSDs⁶ into the same space volume as before. In other words: centralized storages are no good idea today, and will likely become an even worse idea in the future.

Example: there was a major incident at a German web hosting company at the beginning of the 2000s. Their entire webhosting main business was running on a single proprietary, highly redundant CentralStorage solution, which failed. Restore from backup took way too long from the viewpoint of a huge number of customers, leading to major press attention. Before this incident, they were the #1 webhoster in Germany. A few years later, 1&1 was the #1 instead. You can speculate whether this has to do with the incident. But anyway, the later geo-redundancy strategy of 1&1, based on a sharding model (originally using DRBD, later MARS), was motivated by conclusions drawn from this incident.

Another example: in the 1980s, a CentralStorage "dinosaur⁷" architecture called SLED = Single Large Expensive Disk was propagated with huge marketing noise and effort, but its historic fate was predictable for real experts not bound to particular interests: SLED finally lost against its contemporary RAID competition. Nowadays, many people don't even remember the term SLED. The future is likely dominated by scaling-out architectures like sharding, as explained in section 1.6.

⁵ It is easy to see that the slopes of HDD.capacity vs Infiniband.rates are different. Parallelizing by bundling of Infiniband wires will only lift the line a little upwards, but will not alter its slope in logarithmic scale. For extrapolated time t → ∞, the extrapolated empirical long-term behaviour is rather striking.

⁶ It is difficult to compare the space density of contemporary SSDs in a fair way. There are too many different form factors. For example, M2 cards typically consume even less cm³/TB than classical 2.5 inch form factors. This trend is likely to continue in the future.

⁷ With the advent of NVMe, SSDs are almost directly driven by DMA. Accessing any high-speed DMA device by default via a network is a foolish idea, similarly foolish to playing games via an expensive high-end gamer graphics card which is indirectly attached via RDMA, or even via Ethernet. Probably no serious gamer would ever try to do that. But some storage vendors do, for strategic reasons. Probably for their own survival, their customers are to be misguided into overlooking the blinking red indicators that centralized SSD storage is likely nothing but an expensive dead end in the history of dinosaur architectures.
1.5.5. Reliability Differences CentralStorage vs Sharding

In this section, we look at fatal failures only, ignoring temporary failures. A fatal failure of a storage is an incident which needs to be corrected by restore from backup. By definition, even a highly redundant CentralStorage is nevertheless a SPOF = Single Point of Failure. This also applies to fatal failures. Some people incorrectly argue with redundancy. However, the problem is that any system, even a highly redundant one, can fail fatally. There exists no perfect system on earth. One of the biggest known sources of fatal failure is human error.

In contrast, sharded storage (for example the LocalSharding model, see also section 1.6.1) has MPOF = Multiple Points Of Failure. It is unlikely that many shards fail fatally at the same time, because shards are independent⁸ from each other by definition.

What is the difference from the viewpoint of customers of the services? When a CentralStorage fails fatally, a huge number of customers will be affected for a long time (see the example of the German webhoster mentioned in section 1.5.4). Reason: restore from backup will take extremely long because huge masses of data have to be restored. MTBF = Mean Time Between Failures is (hopefully) longer thanks to redundancy, but MTTR = Mean Time To Repair is also very long. With (Local)Sharding, the risk of some fatal incident somewhere in the sharding pool is higher, but the size of such an incident is smaller in three dimensions at the same time:

1. There are far fewer customers affected (typically only 1 shard out of n shards).
2. MTTR = Mean Time To Repair is typically much better because there is much less data to be restored.
3. Residual risk plus resulting fatal damage by un-repairable problems is thus lower.

What does this mean from the viewpoint of an investor of a big "global player" company? As promised by the vendors, let us assume that failures of CentralStorage might occur less frequently. But when one happens on enterprise-critical mass data, the stock exchange value of the affected company will be exposed to a hazard. This is not bearable from the viewpoint of an investor. In contrast, the (Local)Sharding model distributes the unavoidable incidents (because perfect systems do not exist, and perfect humans do not exist) to a lower number of customers with higher frequency, such that the total impact onto the business becomes bearable.
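A back-of-the-envelope illustration with purely hypothetical numbers (not taken from this manual): assume n = 100 shards, each holding 1/100 of the customers and restorable from backup within about 5 hours, versus one CentralStorage holding everything and needing about 14 days (336 hours) for a full restore. A fatal shard incident then costs 1% of the customers × 5 hours = 0.05 customer-population-hours of outage, while a fatal CentralStorage incident costs 100% × 336 hours = 336 customer-population-hours, i.e. roughly 6700 times the customer-weighted damage of a single shard incident.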
Risk analysis of enterprise-critical use cases is summarized in the following table:

|                                     | CentralStorage | (Local)Sharding           |
|-------------------------------------|----------------|---------------------------|
| Probability of some fatal incident  | lower          | higher                    |
| # Customers affected                | very high      | very low                  |
| MTBF per storage                    | higher         | lower                     |
| MTTR per storage                    | higher         | lower                     |
| Unrepairable residual risk          | higher         | lower                     |
| Total impact                        | higher         | lower                     |
| Investor's risk                     | unbearable     | stock exchange compatible |

⁸ When all shards reside in the same datacenter, there exists a SPOF by power loss or other impacts onto the whole datacenter. However, this applies to both the CentralStorage and the LocalSharding model. In contrast to CentralStorage, LocalSharding can be more easily distributed over multiple datacenters.

Summary: CentralStorage is something for

• small to medium-sized companies which don't have the manpower and the skills for professionally building and operating a (Local)Sharding (or similar) system for the enterprise-critical mass data their business relies upon;
• monolithic enterprise applications like classical SAP which are anyway bound to a specific vendor, where you cannot select a different solution (so-called Vendor Lock-In);
• cases where your application is not shardable by construction (cf. section 1.6), or where sharding it would require too high an effort, and where going to BigCluster⁹ (e.g. Ceph / Swift / etc, see section 1.8 on page 25) is not an option either.

If you have an already sharded system, e.g. in webhosting, don't convert it to a non-shardable one, and don't introduce SPOFs needlessly. You would introduce technical debt which is likely to hurt back sometime in the future! As a real big "global player", or as a company being part of such a structure, you should be careful when listening to "marketing drones" of proprietary CentralStorage vendors. Always check your concrete use case. Never believe in wrongly generalized claims, which are only valid in some specific context, but do not really apply to your use case. It could be about your life.

1.5.6. Proprietary vs OpenSource

In theory, the following dimensions are orthogonal to each other:

Architecture: LocalStorage vs CentralStorage vs DistributedStorage
Licensing: Proprietary vs OpenSource

In practice, however, many vendors of proprietary storage systems select the CentralStorage model. This way, they can avoid inter-operability with their competitors. This opens the door for the so-called Vendor Lock-In. In contrast, the OpenSource community is based on cooperation. Opting for OpenSource means that you can combine and exchange numerous components with each other. Key OpenSource players are basing their business on the usefulness of their software components for you, their customer. Please search the internet for further explanations from Eric S. Raymond. Therefore, interoperability is a must in the opensource business. For example, you can relatively easily migrate between DRBD and MARS, forth and back, see section 3.2. The generic block devices provided by both DRBD and MARS (and by the kernel LVM2 implementation, and many others ...) can interact with zillions of filesystems, VMs, applications, and so forth.

Summary: genericity is a highly desired property in OpenSource communities, while proprietary products often try to control their usage by limiting technical interoperability at certain layers, and/or legally by contracts. Trying to do so with OpenSource would make no sense, because you, the customer, are the real king who can really select and combine components.
You can form a really customized system for your real needs, not one that is merely promised but not always actually delivered by so-called "marketing drones" from commercial vendors, who actually prefer the needs of their employer over yours.

There is another fundamental difference between proprietary software and OpenSource: the former is bound to some company, which may vanish from the market. Commercial storage systems may be discontinued. This can be a serious threat to your business relying on the value of your data. In particular, buying storage systems from small vendors may increase this risk¹⁰. OpenSource is different: it cannot die, even if the individual, or the (small) company, which produced it no longer exists. The source code is public. It just could get outdated over time. However, as long as there is enough public interest, you will always find somebody who is willing to adapt and to maintain it. Even if you were the only one having such an interest, you can hire a maintainer for it, specifically for your needs. You aren't helpless.

⁹ Theoretically, BigCluster can be used to create 1 single huge remote LV (or 1 single huge remote FS instance) out of a pool of storage machines. Double-check, better triple-check, that such a big logical SPOF is really needed and cannot be circumvented by any means. Only in such a case, the current version of MARS cannot help (yet), because its current focus is on a big number of machines each having relatively small LVs. At 1&1 ShaHoLin, the biggest LVs are 40 TiB at the moment, running for years now, and bigger ones are certainly possible. Only when current local RAID technology with external enclosures cannot easily create a single LV in the petabyte scale is BigCluster probably the better solution (cf. section 1.8 on page 25).

1.6. Distributed vs Local: Scalability Arguments from Architecture

Datacenters aren't usually operated for fun or as a hobby. Scalability of an architecture (cf. section 1.1) is very important, because it can seriously limit your business. Overcoming architectural ill-designs can grow extremely cumbersome and costly. Many enterprise system architects start with a particular architecture in mind, called "Big Cluster". There is a common belief that otherwise scalability could not be achieved:

[Figure: "Badly Scaling Architecture: Big Cluster": users reach O(n) frontend servers over the Internet (O(n·k) connections); the frontends access O(n) storage servers over an internal storage (or FS) network requiring cross-bar-like O(n²) realtime access; everything doubled (x2) for geo-redundancy.]

The crucial point here is the storage network: n storage servers are interconnected with m = O(n) frontend servers, in order to achieve properties like scalability, failure tolerance, etc. Since any of the m frontends must be able to access any of the n storages in realtime, the storage network must be dimensioned for O(n · m) = O(n²) network connections running in parallel. Even if the total network throughput scales only with O(n), nevertheless O(n²) network connections have to be maintained by connection-oriented protocols and at various layers of the operating software. The network has to switch the packets from n sources to m destinations (and on their way back) in realtime. This cross-bar functionality in realtime makes the storage network complicated and expensive.
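To make the quadratic growth concrete with purely hypothetical numbers: with n = 200 storage servers and m = 200 frontends, up to n · m = 40,000 realtime connections have to be dimensioned, monitored and kept alive, while the sharding model described below only needs O(n · k) replication connections for some small constant k, and none of them in realtime.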
Some further factors increase the costs of storage networks:

• In order to limit error propagation from other networks, the storage network is often built as a physically separate = dedicated network.
• Because storage networks react heavily to high latencies and packet loss, they often need to be dimensioned for the worst case (load peaks, packet storms, etc), needing some of the best = typically most expensive components for reducing latency and increasing throughput. Dimensioning for the worst case instead of an average case plus some safety margins is nothing but expensive overdimensioning / over-engineering.
• When multipathing is required for improving fault tolerance of the storage network itself, these efforts will even double.
• When geo-redundancy is required, the total effort may easily more than double another time, because in cases of disasters like terrorist attacks the backup datacenter must be prepared for taking over for multiple days or weeks.

¹⁰ There is a risk of a domino effect: once there is a critical incident on highly redundant CentralStorage boxes from a particular (smaller) vendor, this may lead to major public media attention. This may form the root cause for such a vendor to vanish from the market. Thus you may be left alone with a buggy system, even if you aren't the victim of the concrete incident. In contrast, bugs in an OpenSource component can be fixed by a larger community of interested people, or by yourself if you hire somebody for this.

Fortunately, there is an alternative called "Sharding Architecture" or "Shared-nothing Architecture".

Definition of Sharding. Notice that the term "Sharding" originates from database architecture (https://en.wikipedia.org/wiki/Shard_(database_architecture)), where it has a slightly different meaning than used here. Our usage of the term "sharding" reflects slightly different situations in some webhosting companies¹¹, and can certainly be transferred to some more application areas. Our more specific use of the term "sharding" has the following properties, all at the same time:

1. User / customer data is partitioned. This is very similar to database sharding. However, the original database term also allows some data to remain unpartitioned. In webhosting, suchlike may exist also, but typically only for system data, like OS images, including large parts of their configuration data. Suchlike system data is typically replicated from a central "golden image" in an offline fashion, e.g. via regular rsync cron jobs, etc. Typically, it comprises only a few gigabytes per instance and is mostly read-only with a slow change rate, while total customer data is typically in the range of some petabytes with a higher total change rate.

2. Servers have no single point of contention, and thus are completely independent from each other, like in shared-nothing architectures (https://en.wikipedia.org/wiki/Shared-nothing_architecture). However, the original term "shared-nothing" has also been used for describing replicas, e.g. DRBD mirrors.
In our context of "sharding", the shared-nothing principle only refers to the "no single point of contention" principle at partitioning level, which means it only refers to the partitioning of the user data, but not to their replicas. Shared-nothing replicas in the sense of DRBD may also be present (and in fact they are at 1&1 Shared Hosting Linux), but these replicas are not meant by our usage of the term "sharding". Customer data replicas form an independent dimension called the "replication layer". The replication layer also obeys the shared-nothing principle in the original sense, but it is not meant by our term "sharding", in order to avoid confusion¹² between these two independent dimensions.

Our sharding model does not need a dedicated storage network at all, at least when built and dimensioned properly. Instead, it should have (but does not always need) a so-called replication network, which can, when present, be dimensioned much smaller because it needs neither realtime operation nor scalability to O(n²):

¹¹ According to https://en.wikipedia.org/wiki/Shared-nothing_architecture, Google also uses the term "sharding" for a particular "shared-nothing architecture". Although our above definition of "sharding" does not fully comply with its original meaning, a similar usage by Google probably means that our usage of the term is not completely uncommon.

¹² Notice that typically BigCluster architectures are also abstracting away their replicas when talking about their architecture.

[Figure: "Well-Scaling Architecture: Sharding": users reach combined Storage + Frontend servers directly over the Internet; the servers are connected only by a smaller O(n) replication network for batch migration (traffic shaping possible, big scale-out, local scalability via spare RAID slots, doubled (x2) for geo-redundancy); the method really scales to petabytes.]

Sharding architectures are extremely well suited when both the input traffic and the data are already partitioned. For example, when several thousands or even millions of customers operate on disjoint data sets, like in web hosting where each webspace resides in its own home directory, or when each of millions of MySQL database instances has to be isolated from its neighbour. Masses of customers also appear at cloud storage applications like Cloud Filesystems (e.g. Dropbox or similar). Even in cases when any customer may potentially access any of the data items residing in the whole storage pool (e.g. like in a search engine), sharding can often be applied. The trick is to create some relatively simple content-based dynamic switching or redirect mechanism in the input network traffic, similar to HTTP load balancers or redirectors (see the sketch below). Only when partitioning of input traffic plus data is not possible in a reasonable way do big cluster architectures, as implemented for example in Ceph or Swift (and partly even possible with MARS when restricted to the block layer), have a very clear use case. In the following sections, we will see: when sharding is possible, it is the preferred model due to reliability and cost and performance reasons. Another good explanation can be found at http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/.
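A hedged sketch of such a content-based redirect mechanism (all names and the shard count are assumptions, not taken from this manual): a stable hash of the customer identifier selects one of n shards, exactly as an HTTP redirector or load balancer rule could do it.

    #!/bin/bash
    # Map a customer name to one of N shards with a stable hash (sketch only).
    customer="$1"
    shards=8                                   # assumed number of shards
    hash=$(printf '%s' "$customer" | md5sum | cut -c1-8)
    idx=$(( 0x$hash % shards ))
    echo "redirect to https://shard${idx}.example.com/"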
1.6.1. Variants of Sharding

LocalSharding The simplest possible sharding architecture is simply putting both the storage and the compute CPU power onto the same iron. Example: at 1&1 Shared Hosting Linux (ShaHoLin), we have dimensioned several variants of this. (a) We are using 1U pizza boxes with local hardware RAID controllers with fast hardware BBU cache and up to 10 local disks for the majority of LXC container instances where the “small-sized” customers (up to ~100 GB webspace per customer) are residing. Since most customers have very small home directories with extremely many but small files, this is a very cost-efficient model. (b) Less than 1 permille of all customers have > 250 GB (up to 2 TB) per home directory. For these few customers we are using another dimensioning variant of the same architecture: 4U servers with 48 high-capacity spindles on 3 RAID sets, delivering a total PV capacity of ~300 TB, which are then cut down to ~10 LXC containers of ~30 TB each. In order to operate this model at a bigger scale, you should consider the “container football” method as described in section 1.6.3 and in chapter 8 on page 130.

RemoteSharding This variant needs a (possibly dedicated) storage network, which is however only O(n) in total. Each storage server exports a block device over iSCSI (or over another transport) to at most O(k) dedicated compute nodes where k is some constant. Hint 1: it is advisable to build this type of storage network with local switches and no routers in between, in order to avoid O(n²)-style network architectures and traffic. This reduces error propagation upon network failures. Keep the storage and the compute nodes locally close to each other, e.g. in the same datacenter room, or even in the same rack. Hint 2: additionally, you can provide some (low-dimensioned) backbone for exceptional(!) cross-traffic between the local storage switches. Don’t plan to use any realtime cross-traffic regularly, but only in clear cases of emergency!

FlexibleSharding This is a dynamic combination of LocalSharding and RemoteSharding, dynamically re-configurable, as explained below.

BigClusterSharding The sharding model can also be placed on top of a BigCluster model, or possibly “internally” in such a model, leading to a similar effect. Whether this makes sense needs some discussion. It can be used to reduce the logical BigCluster size from O(n) to some O(k), such that it is no longer a “big cluster” but a “small cluster”, thus reducing the serious problems described in section 1.8 to some degree. This could make sense in the following use cases:
• When you already have invested into a big cluster, e.g. Ceph or Swift, which does not really scale and/or does not really deliver the expected reliability. Some possible reasons for this are explained in section 1.8.
• When you really need a single LV which is necessarily bigger than can be reasonably built on top of local LVM. This means, you are likely claiming that you really need strict consistency as provided by a block device on more than 1 PB with current technology (2018). Examples are very big enterprise databases like classical SAP (c.f. section 1.5), or if you really need POSIX compliance on a single big filesystem instance. Be conscious when you think this is the only solution to your problem. Double-check or triple-check whether there is really no other solution than creating such a huge block device and/or such a huge filesystem instance.
Such huge SPOFs tend to create similar problems as described in section 1.8, for similar reasons.

When building a new storage system, be sure to check the following use cases. You should seriously consider a LocalSharding / RemoteSharding / FlexibleSharding model in favor of BigClusterSharding when ...
• ... more than 1 LV instance would be placed onto your “small cluster” shards. Then a {Local,Remote,Flexible}Sharding model could likely be used instead. Then the total overhead (total cost of ownership) introduced by a BigCluster model but actually stripped down to a “SmallCluster” implementation / configuration should be examined separately. Does it really pay off?
• ... there are legal requirements that you can tell at any time where your data is. Typically, this is anything but easy on a BigCluster model, even when stripped down to SmallCluster size.

1.6.2. FlexibleSharding

Notice that MARS’ new remote device feature from the 0.2 branch series (which is kind of a replacement for iSCSI) could be used for implementing some sort of “big cluster” model at block layer. Nevertheless, such models re-introducing some kind of “big dedicated storage network” into MARS operations are not the preferred model. The following is a super-model which combines both the “big cluster” and the sharding model at block layer in a very flexible way. The following example shows only two servers from a pool consisting of hundreds or thousands of servers:

[Figure: “Flexible MARS Sharding + Cluster-on-Demand” — two hypervisors, each with local RAID and LVs (LV1 … LV5) running VMs locally; iSCSI or the MARS remote device is used only for exceptions (here LV3 is exported to the right hypervisor); any hypervisor works in client and/or server role, and preferably locally, at the same time.]

The idea is to use iSCSI or the MARS remote device only where necessary. Preferably, local storage is divided into multiple Logical Volumes (LVs) via LVM, which are directly used locally by Virtual Machines (VMs), such as KVM or filesystem-based variants like LXC containers. In the above example, the left machine has relatively less CPU power or RAM than storage capacity. Therefore, not all LVs could be instantiated locally at the same time without causing operational problems, but some of them can be run locally. The example solution is to exceptionally(!) export LV3 to the right server, which has some otherwise unused CPU and RAM capacity. Notice that local operation of VMs doesn’t produce any storage network traffic at all. Therefore, this is the preferred runtime configuration. Only in cases of resource imbalance, such as (transient) CPU or RAM peaks (e.g. caused by DDOS attacks), some VMs or containers may be run somewhere else over the network. In a well-balanced and well-dimensioned system, this will be the vast minority, and should only be used for dealing with transient load peaks etc.

Running VMs directly on the same servers as their storage is a major cost reducer. You simply don’t need to buy and operate n + m servers, but only about max(n, m) + m · ε servers, where ε corresponds to some relatively small extra resources needed by MARS. In addition to this and to reduced networking costs, there are further cost savings at power consumption, air conditioning, Height Units (HUs), number of HDDs, operating costs, etc as explained below in section 1.7.
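As a hedged numeric illustration of the n + m versus max(n, m) + m · ε formula above (round numbers invented for this example, not measured values): with n = m = 100 and ε ≈ 0.05, a classical split into dedicated storage and compute servers needs n + m = 200 machines, while the FlexibleSharding model needs only about max(n, m) + m · ε = 100 + 5 = 105 machines — roughly a factor of 2, before even counting the savings on the storage network itself.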
1.6.3. Principle of Background Migration

The sharding model needs a different approach to load balancing of storage space than the big cluster model. There are several possibilities at different layers, each addressing different granularities:
• Moving customer data at filesystem or database level via rsync or mysqldump or similar. Example: at 1&1 Shared Hosting Linux, we have about 9 million customer home directories. We also have a script movespace.pl using incremental tar for their moves. Now, if we tried to move around all of them this way, it could easily take years or even decades for millions of extremely small home directories, due to overhead like DNS updates etc. However, there exists a small handful of large customer home directories in the terabyte range. For these, and only for these, it is a clever idea to use movespace.pl, because thereby the used space can be regulated at a finer granularity than at LV level.
• Dynamically growing the sizes of LVs during operations: lvresize followed by marsadm resize followed by xfs_growfs or similar operations.
• Moving whole LVs via MARS, as shown in the following example:

[Figure: “Flexible MARS Background Migration” — LV3 on the left hypervisor (primary) is replicated via MARS to LV3’ on the right hypervisor (secondary) while the VMs keep running; any number of replicas k = 1, 2, 3, … is dynamically creatable at any time and anywhere, and any hypervisor may be source or destination of some LV replicas at the same time.]

The idea is to dynamically create additional LV replicas for the sake of background migration. Examples:
• In case you had no redundancy at LV level before, you have k = 1 replicas during ordinary operation. If not yet done, you should transparently introduce MARS into your LVM-based stack by using the so-called “standalone mode” of MARS. When necessary, create the first MARS replica with marsadm create-resource on your already existing LV data, which is retained unmodified, and restart your application again. Now, for the sake of migration, you just create an additional replica at another server via marsadm join-resource there and wait until the second mirror has been fully synced in the background, while your application is running and while the contents of the LV are modified in parallel by your ordinary applications. Then you do a primary handover to your mirror. This is usually a matter of minutes, or even seconds. Once the application runs again at the new location, you can delete the old replica via marsadm leave-resource and lvremove (a command-level sketch of this procedure follows after this list). Finally, you may re-use the freed-up space for something else (e.g. lvresize of another LV followed by marsadm resize followed by xfs_growfs or similar). For the sake of some hardware lifecycle, you may run a different strategy: evacuate the original source server completely via the above MARS migration method, and eventually decommission it.
• In case you already have a redundant LV copy somewhere, you should run a similar procedure, but starting with k = 2 replicas, and temporarily increasing the number of replicas to either k′ = 3 when moving each replica step-by-step, or you may even directly go up to k′ = 4 when moving pairs at once. Example: see football.sh in the football/ directory of MARS, which is a checkout of the Football sub-project (see chapter 8).
• When already starting with k > 2 LV replicas in the starting position, you can do the same analogously, or you may then use a lesser variant. For example, we have some mission-critical servers at 1&1 which are running k = 4 replicas all the time on relatively small but important LVs for extremely increased safety. Only in such a case, you may have the freedom to temporarily decrease from k = 4 to k′ = 3 and then go up to k″ = 4 again. This has the advantage of requiring less temporary storage space for swapping some LVs.
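The following is a minimal command-level sketch of the k = 1 migration example above. The volume group, LV and server names (vg00, lv-cust42, servers A and B) are hypothetical, and the exact invocations may vary with your MARS version, so consult the marsadm reference chapters before using anything like this in production:

# On the old server A: put the already existing LV under MARS control
# (the LV data is retained unmodified, cf. the "standalone mode" above)
marsadm create-resource lv-cust42 /dev/vg00/lv-cust42

# On the new server B: provide an equally sized LV and join it as a second replica
lvcreate -n lv-cust42 -L 30T vg00
marsadm join-resource lv-cust42 /dev/vg00/lv-cust42
# ... wait until the background sync has finished (check the progress, e.g. with marsadm view) ...

# Handover of the primary role to B (usually a matter of seconds to minutes)
marsadm primary lv-cust42          # on B

# Remove the old replica on A and free the space for re-use
marsadm leave-resource lv-cust42   # on A
lvremove /dev/vg00/lv-cust42       # on A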
1.7. Cost Arguments

A common prejudgement is that “big cluster” is the cheapest scaling storage technology when built on so-called “commodity hardware”. While this is very often true for the “commodity hardware” part, it is often not true for the “big cluster” part. But let us first look at the “commodity” part.

1.7.1. Cost Arguments from Technology

Here are some rough market prices for basic storage as determined around end of 2016 / start of 2017:

Technology | Enterprise-Grade | Price in € / TB over 5 years lifetime
Consumer SATA disks via on-board SATA controllers | no (small-scale) | < 30 possible
SAS disks via SAS HBAs (e.g. in external 14" shelves) | halfway | < 80
SAS disks via hardware RAID + LVM (+DRBD/MARS) | yes | 80 to 150
Commercial storage appliances via iSCSI | yes | around 1000
Cloud storage, S3 | yes | 3000 to 8000

You can see that any self-built and self-administered storage (whose price varies with slower high-capacity disks versus faster low-capacity disks) is much cheaper than any commercial offering, by about a factor of 10 or even more. If you need to operate several petabytes of data, self-built storage is always cheaper than a commercial one, even if additional manpower is needed for commissioning and operating. You don’t have to pay the shareholders of the storage provider. Here we just assume that the storage is needed permanently for at least 5 years, as is the case in web hosting, databases, backup / archival systems, and many other application areas.

Commercial offerings of cloud storage are heavily over-hyped. Some people apparently don’t know that the generic term “Cloud Storage” refers to a storage class, not to a particular instance like the original Amazon S3, and that it is possible to build and operate almost any instance of any storage class yourself. From a commercial perspective, outsourcing of huge masses of enterprise-critical storage (to whatever class of storage) usually pays off only when your storage demands are either relatively low, or are extremely varying over time, and/or when you need some extra capacity only temporarily for a very short time.

1.7.2. Cost Arguments from Architecture

In addition to basic storage prices, many further factors come into play when roughly comparing big cluster architectures versus sharding. The following table bears the unrealistic assumption that BigCluster can be reliably operated with 2 replicas (the suffix ×2 means with additional geo-redundancy):

| BC | SHA | BC×2 | SHA×2
# of Disks | >200% | <120% | >400% | <240%
# of Servers | ≈ ×2 | ≈ ×1.1 possible | ≈ ×4 | ≈ ×2.2
Power Consumption | ≈ ×2 | ≈ ×1.1 | ≈ ×4 | ≈ ×2.2
HU Consumption | ≈ ×2 | ≈ ×1.1 | ≈ ×4 | ≈ ×2.2

As shown in section 1.8, two replicas are typically not sufficient for BigCluster. Even addicts of BigCluster typically recommend 3 replicas in some so-called “best practices”, leading to the following more realistic table:

| BC | SHA | BC×2 | SHA×2
# of Disks | >300% | <120% | >600% | <240%
# of Servers | ≈ ×3 | ≈ ×1.1 possible | ≈ ×6 | ≈ ×2.2
Power Consumption | ≈ ×3 | ≈ ×1.1 | ≈ ×6 | ≈ ×2.2
HU Consumption | ≈ ×3 | ≈ ×1.1 | ≈ ×6 | ≈ ×2.2

The crucial point is not only the number of extra servers needed for dedicated storage boxes, but also the total number of HDDs. While big cluster implementations like Ceph or Swift can theoretically use some erasure encoding for avoiding full object replicas, their practice as seen in internal 1&1 Ceph clusters is similar to RAID-10, but just on objects instead of block-based sectors. Therefore a big cluster typically needs >300% disks to reach the same net capacity as a simple sharded cluster. The latter can typically take advantage of hardware RAID-60 with a significantly smaller disk overhead, while providing sufficient failure tolerance at disk level.
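A hedged numeric illustration of the disk factor (round numbers invented for this example): to provide 1 PB of net capacity with 10 TB disks, a BigCluster with 3 full object replicas needs more than 3 PB of raw capacity, i.e. more than 300 disks, whereas a sharded pool on hardware RAID-60 with roughly 20% parity and spare overhead needs about 1.2 PB raw, i.e. about 120 disks — which is exactly the >300% versus <120% relation from the table above.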
There is a surprising consequence from this: geo-redundancy is not as expensive as many people believe. It just needs to be built with the proper architecture. A sharded geo-redundant pool based on hardware RAID-60 (last column “SHA×2”) typically costs less than a non-geo-redundant big cluster with the typically needed / recommended number of replicas (column “BC”). A geo-redundant sharded pool even provides better failure compensation (see section 1.8). Notice that geo-redundancy implies by definition that an unforeseeable full datacenter loss (e.g. caused by disasters like a terrorist attack or an earthquake) must be compensated for several days or weeks. Therefore it is not sufficient to take a big cluster and just spread it over two different locations. In any case, a MARS-based geo-redundant sharding pool is cheaper than using commercial storage appliances, which are much more expensive by their nature.

1.8. Reliability Arguments from Architecture

A contemporary common belief is that big clusters and their random replication methods would provide better reliability than anything else. There are some practical observations at 1&1 and its daughter companies which cannot confirm this. Similar experiences are part of a USENIX paper about copysets, see https://www.usenix.org/system/files/conference/atc13/atc13-cidon.pdf. Their proposed solution is different from the solution proposed here, but interestingly their problem analysis part contains not only similar observations, but also comes to similar conclusions about random replication. Citation from the abstract: However, random replication is almost guaranteed to lose data in the common scenario of simultaneous node failures due to cluster-wide power outages. [emphasis added by me]

Stimulated by our practical experiences even in truly less disastrous scenarios than mass power outage, theoretical explanations were sought. Surprisingly, they show that LocalSharding is superior to true big clusters under practically important preconditions. Here is an intuitive explanation. A detailed mathematical description of the model can be found in appendix G on page 154.

1.8.1. Storage Server Node Failures

1.8.1.1. Simple intuitive explanation

Block-level replication systems like DRBD are constructed for failover in local redundancy scenarios. Or, when using MARS, even for geo-redundant failover scenarios. They are traditionally dealing with pairs of servers, or with triples, etc. In order to get a storage incident with them, both sides of a DRBD or MARS small-cluster (also called shard) must have an incident at the same time.
In contrast, big clusters are spreading their objects over a huge number of nodes O(n), with some redundancy degree k denoting the number of replicas. As a consequence, any k node failures out of O(n) will produce an incident. For example, when k = 2 and n is equal for both models, then any combination of two node failures occurring at the same time will lead to an incident:

[Figure: “Reliability of Architectures: NODE failures” — 2 node failures mean that ALL of their disks are unreachable. With DRBD or MARS simple pairs there is a low probability of hitting exactly the same pair, and even then only 1 shard is affected, i.e. low total downtime and no customer-visible incident. With a big storage cluster (e.g. Ceph, Swift, …) of the same n over an O(n²) network, k = 2 replicas are not enough: because objects are randomly distributed across the whole cluster, there is a higher probability of hitting any 2 nodes, and then O(n) clients are affected, i.e. an INCIDENT with much higher total downtime; k >= 3 replicas are needed here.]

Intuitively, it is easy to see that hitting both members of the same pair at the same time is less likely than hitting any two nodes of a big cluster. If you are curious about some concrete numbers, read on.

1.8.1.2. Detailed explanation

For the sake of simplicity, the following more detailed explanation is based on the following assumptions:
• We are looking at storage node failures only.
• Disk failures are regarded as already solved (e.g. by local RAID-6 or by the well-known compensation mechanisms of big clusters). Only in case they don’t work, they are mapped to node failures, and are already included in the probability of storage node failures.
• We only look at data replication with a redundancy degree of a relatively small k. CRC methods are not used across storage nodes, but may be present internally at some storage nodes, e.g. RAID-5 or RAID-6 or similar methods. Notice that CRC methods generally involve very high overhead, and won’t even work in realtime across long distances (geo-redundancy).
• We restrict ourselves to temporary / transient failures, without regarding permanent data loss. Otherwise, the differences between local-storage sharding architectures and big clusters would become even worse. When losing some physical storage nodes forever in a big cluster, it is typically anything but easy to determine which data of which application instances / customers have been affected, and which will need a restore from backup.
• Storage network failures (as a whole) are ignored. Otherwise a fair comparison between the architectures would become difficult. If they were taken into account, the advantages of LocalSharding would become even bigger.
• We assume that the storage network (when present) forms no bottleneck. Network implementations like TCP/IP versus Infiniband or similar are thus ignored.
• Software failures / bugs are also ignored. We only compare architectures here, not their various implementations.
• The x axis shows the number of basic storage units n from an application perspective, meaning “usable storage” or “net amount of storage”. For simplicity of the model, one basic application storage unit equals the total disk space provided by one physical storage node in the special case of k = 1 replicas. Attention! When increasing the number of replicas k, the total number of storage nodes needs to be increased accordingly.
Typically, you will need to deploy k · n physical storage nodes in order to get n net storage units from a user’s perspective. Attention! k has a strong influence on the price tag of any of the competing architectures. You cannot assume an “infinite amount of money”. Therefore, only relatively small k are bearable for business cases.
• We assume that the number of application instances is linearly scaling with n. For simplicity, we assume that the number of applications running on the whole pool is exactly n.
• We assume that the storage nodes are (almost completely) filled with data (sectors with RAID, and/or objects with BigCluster).
• We assume that the number of sectors / objects per storage node is “very large”. Some examples: a logical volume of 4 TB has 1,000,000,000 sectors or objects, each 4 KB in size. A physical storage node providing 40 TB of storage will then provide 10 billion sectors / objects.
• For the BigCluster architecture, we assume that all objects are always distributed to O(n) nodes. For simplicity of the model, we assume a distribution via a uniform hash function. If other hash functions were used (e.g. distributing only to a constant number of nodes), it would no longer be a big cluster architecture in our sense. In the following example, we assume a uniform object distribution to exactly n nodes. Notice that any other n′ = O(n) with n′ < n will produce similar results for n′ → ∞, but may be better in detail for smaller n′.
• When random distribution / random replication methods are used at BigCluster object stores, we assume that for any pair (or k-tuple) of storage nodes, the total number of objects is so high that there always exist some objects which are present at all of the nodes of any pair / k-tuple for any reasonable (small) k. This means, we assume not only uniformity in random replication, but also that the total number of objects is practically “infinite” compared to relatively small practical values of k. For mathematically interested readers: be careful when trying to argue with the probability of hitting some object intersection for some given k-tuple of storage nodes while n is a growing parameter. Even when such a single probability is declining with both k and n growing, and even when the single probability for the existence of an intersection at some point gets lower than 1, this has an impact on the total incident probability of the whole BigCluster. In general, the number of such tuples is growing with the binomial coefficient C(k·n, k), i.e. very rapidly. So, don’t forget to sum up all probabilities even if a single one appears to be “negligible”.
• For the LocalSharding (DRBDorMARS) architecture, we assume that only local storage is used. For higher replication degrees k = 2, …, the only occurring communication is among the pairs / triples / and so on (shards), but no communication to other shards is necessary.
• For simplicity of the example, we assume that any single storage server node used in either architecture, including all of its local disks, has a reliability of 99.99% (four nines). This means, the probability of a storage node failure is uniformly assumed as p = 0.0001.
• This means, during an observation period of T = 10,000 operation hours, we will have a total downtime of 1 hour per server on statistical average. For simplicity, we assume that the failure probability of a single server depends neither on previous failures nor on the operating conditions of any other server.
It is known that this is not true in general, but otherwise our model would become extremely complex.
• More intuitively, our observation period of T = 10,000 operation hours corresponds to about 13 months, or slightly more than a year.
• Consequence: when operating a pool of 10,000 storage servers, then on statistical average there will almost always be one node which has failed. This is like a “permanent incident” which has to be solved by the competing storage architectures.
• Hint: the term “statistical average” is somewhat vague here, in order to not confuse readers [13]. A more elaborate statistical model can be found in appendix G on page 154.

[13] The problem is that sometimes more servers than average can be down, and sometimes less. Average values should not be used in the mathematical model, but exact ones. However, humans can often imagine things better when provided with “average behaviour”, so we use it here just for the sake of ease of understanding.

Let us start the comparison with a simple corner case: plain old servers with no further redundancy, other than their local RAIDs. This naturally corresponds to k = 1 replicas when using the DRBDorMARS architecture. Now we apply the corner case of k = 1 replicas to both architectures, i.e. also to BigCluster, in order to shed some light on the fundamental properties of the architectures. Under the precondition of k = 1 replicas, an incident of each one of the n servers has two possible ways to influence the downtime from an application’s perspective:

1. Downtime of 1 storage node only influences 1 application unit depending on 1 basic storage unit. This is the case with the DRBDorMARS model, because there is no communication between shards, and we assumed that 1 storage server unit also carries exactly 1 application unit.

2. Downtime of 1 storage node will tear down more than 1 application unit, because each of the application units has spread its storage to more than 1 storage node via uniform hashing, as is the case at BigCluster.

For ease of understanding, let us zoom into the special case n = 2 and k = 1 for a moment. These are the smallest numbers where you can already see the effect. In the following tables, we denote the 4 possible status combinations of 2 servers A and B, where the cells show the number of application units influenced:

LocalSharding (DRBDorMARS): | B up | B down
A up | 0 | 1
A down | 1 | 2

BigCluster: | B up | B down
A up | 0 | 2
A down | 2 | 2

What is the heart of the difference? While a node failure at LocalSharding (DRBDorMARS) will tear down only the local application, the teardown produced by BigCluster will spread to all of the n = 2 application units, because of the uniform hashing and because we have only k = 1 replica. Would it help to increase both n and k to larger values? In the following graphics, the thick red line shows the behaviour for k = 1 PlainServers (which is the same as k = 1 DRBDorMARS) with increasing number of storage units n, ranging from 1 to 10,000 storage units = number of servers for k = 1. Higher values of k ∈ [1, 4] are also displayed. All lines corresponding to the same k are drawn in the same color. Notice that both the x and y axis are logscale:
[Figure: “SERVICE Comparison of Reversible StorageNode Failures” — expected average SERVICE downtime in hours per ~year (log scale) over the number of basic storage-server units = pairs or triples etc = application instances (log scale, 1 to 100,000). Curves: available-uptime-limit, PlainServers-with-Local-RAID, BigCluster-1-Replica, DRBD-or-MARS-2/3/4-Replica, BigCluster-2/3/4-Replica and their SizeWeighted variants. Thin solid lines = probability of SOME SERVICE incident; dashed lines = SIZE argument, i.e. objects are assumed completely independent.]

When you look at the thin solid BigCluster lines for k = 2, … drawn in different colors, you may wonder why they are altogether converging to the thin red BigCluster line, which corresponds to k = 1 BigCluster. And they also converge against the grey dotted topmost line indicating the total possible uptime of all applications (depending on x). It can be explained as follows:

The x axis shows the number of basic storage units. When you have to create 10,000 storage units with a replication degree of k = 2 replicas, then you will have to deploy k · 10,000 = 20,000 servers in total. When operating a pool of 20,000 servers, on statistical average 2 of them will be down at any given point in time. However, 2 is the same number as the replication degree k. Because our BigCluster model as defined above will distribute all objects to all servers uniformly, there will almost always exist some objects for which no replica is available at any given point in time. This means, you will almost always have a permanent incident involving the same number of nodes as your replication degree k, and in turn some of your objects will not be accessible at all. This means, at x = 10,000 storage units you will lose almost any advantage from increasing the number of replicas. Adding more replicas will no longer help at x ≥ 10,000 storage units.

Notice that the solid lines are showing the probability of some incident, disregarding the size of the incident. What about the dashed lines showing much better behaviour for BigCluster? Under some further preconditions, it would be possible to argue with the size of incidents. However, now a big fat warning. When you are responsible for operations of thousands of servers, you should be very conscious about these preconditions. Otherwise you could risk your career. In short:
• When your application, e.g. a smartphone app, consists of accessing only 1 object during a reasonably long timeframe, you can safely assume that there is no interdependency between all of your objects. In addition, you have to assume (and you should check) that your cluster operating software as a whole does not introduce any further hidden / internal interdependencies. Only in this case, and only then, you can take the dashed lines, arguing with the number of inaccessible objects instead of with the number of basic storage units.
• Whenever your application uses bigger structured logical objects, such as filesystems or block devices or whole VMs / containers, then you will likely get interdependent objects at your big cluster storage layer.
Practical example: experienced sysadmins will confirm that even a data loss rate of only 1/1,000,000 of blocks in a classical Linux filesystem like xfs or ext4 will likely imply the need of an offline filesystem check (fsck), which is a major incident for the affected filesystem instances. Theoretical explanation: servers are running for a very long time, and filesystems are typically also mounted for a long time. Notice that the probability of hitting any vital filesystem data roughly equals the probability of hitting any other data. Sooner or later, any defective sector in the metadata structures or in freespace management etc will stop your whole filesystem, and in turn will stop your application instance(s) running on top of it. Similar arguments hold for transient failures: most classical filesystems are not constructed for compensation of hanging IO, typically leading to system hangs.

Blindly taking the dashed lines will expose you to a high risk of error. Practical experience shows that there are often hidden dependencies in many applications, often also at application level. You cannot necessarily see them when inspecting their data structures! You will only notice some of them by analyzing their runtime behaviour, e.g. with tools like strace. Notice that in general the runtime behaviour of an arbitrary program is undecidable. Be cautious when drawing assumptions out of thin air! Conversely, the assumption that any inaccessible object may halt your application might be too strong for some use cases. Therefore, some practical behaviour may be in between the solid thin lines and the dashed lines of some given color. Be extremely careful when constructing such an intermediate case. The above example of a loss rate of 1/1,000,000 of sectors in a classical filesystem should not be extended to lower values like 1/1,000,000,000 without knowing exactly how the filesystem works, and how it will react in detail [14]. The grey zone between the extreme cases thin solid vs dashed is a dangerous zone! If you want to stay on the safe side, simply obey the fundamental law as explained in the next section:

1.8.2. Optimum Reliability from Architecture

Another argument could be: don’t distribute the BigCluster objects to exactly n nodes, but to fewer nodes. Would the result be better than DRBDorMARS LocalSharding? When distributing to O(k′) nodes with some constant k′, we no longer have a BigCluster architecture, but a mixed BigClusterSharding form. As can be generalized from the above tables, the reliability of any BigCluster on k′ > k nodes is always worse than that of LocalSharding on exactly k nodes, where k is also the redundancy degree. In general:

The LocalSharding model is the optimum model for reliability of operation, compared to any other model truly distributing its data and operations over truly more nodes, like RemoteSharding or BigClusterSharding or BigCluster does.

There exists no better model because shards consisting of exactly k nodes, where k is the redundancy degree, are already the smallest possible shards under the assumptions of section 1.8.1.2. Any other model truly involving k′ > k nodes for distribution of objects at any shard is always worse in the dimension of reliability. Thus the above sentence follows by induction.

[14] In general, it is insufficient to analyze the logical dependencies inside of a filesystem instance, such as which inode contains some pointers to which other filesystem objects, etc.
There exist further runtime dependencies, such as nr_requests block-layer restrictions on IO queue depths, and/or capabilities / limitations of the hardware, and so on. Trying to model all of these influences in a reasonable way could be a major research undertaking outside the scope of this MARS manual.

The above sentence is formulating a fundamental law of storage systems.

1.8.3. Error Propagation to Client Mountpoints

The following is only applicable when filesystems (or their objectstore counterparts) are exported over a storage network, in order to be mounted in parallel at O(n) mountpoints each. In such a scenario, any problem / incident inside of your storage pool for the filesystem instances will be spread to O(n) clients, leading to an increase of the incident size by a factor of O(n) when measured in the number of affected mountpoints:

[Figure: “MOUNTPOINTS Comparison of Reversible StorageNode Failures” — expected average MOUNTPOINTS downtime in hours per ~year (log scale) over the number of basic storage-server units = pairs or triples etc = application instances (log scale, 1 to 100,000); same curves as in the previous graph, but now thin solid lines = MOUNTPOINTS affected, dashed lines = SIZE argument, i.e. objects are assumed completely independent.]

As a result, we now have a total of O(n²) mountpoints = our new basic application units. Such O(n²) architectures quickly become even worse than before. Thus a clear warning: don’t try to build systems in such a way. Notice: DRBD or MARS are traditionally used for running the application on the same box as the storage. Thus they are not vulnerable to these kinds of failure propagation over the network. Even with traditional iSCSI exports over DRBD or MARS, you won’t have suchalike problems. Your only chance to increase the error propagation are O(n) NFS or glusterfs exports to O(n) clients, leading to a total number of O(n²) mountpoints, or similar setups. Clear advice: don’t do that. It’s a bad idea.

1.8.4. Similarities and Differences to Copysets

This section is mostly of academic interest. You can skip it when looking for practical advice. The USENIX paper about copysets (see https://www.usenix.org/system/files/conference/atc13/atc13-cidon.pdf) relates to the Sharding model in the following way:

Similarities: The concept of random replication of the storage data to a large number of machines will reduce reliability. When choosing too big sets of storage machines, the storage system as a whole will become practically unusable. This is common ground between the USENIX paper and the Sharding Approach as propagated here.

Differences: The USENIX paper and many other Cloud Storage approaches presume that there exists a storage network, allowing real-time distribution of replicas over this kind of network. In contrast, the Sharding Approach to Cloud Storage tries to avoid real-time storage networks as much as possible. Notice that RemoteSharding and further variants (including future improvements) do not preclude it, but are trying to avoid real-time storage network traffic. Instead, the load-balancing problem is addressed via background data migration.
This changes the temporal granularity of data access: many real-time accesses are shifted over to migration processes, which in turn relaxes the requirements on the network. In detail, there are some more differences to the USENIX paper. Some examples:
• Terminology: the scatter width S is defined (see page 39 of the paper) as: each node’s data is split uniformly across a group of S other nodes. In contrast, we neither assume uniformity, nor do we require the data to be distributed to other nodes. By using the term “other”, the USENIX paper (as well as many other BigCluster approaches) is probably presuming something like a distinction between “client” and “server” machines: while data processing is done on a “client”, data storage is on a “server”.
• We don’t disallow this in variants like RemoteSharding or FlexibleSharding and so on, but we gave some arguments why we are trying to avoid this.
• It seems that some definitions in the USENIX paper may implicitly relate to “each chunk”. In contrast, the Sharding Approach typically relates to LVs (logical volumes), which could however be viewed as a special case of “chunk”, e.g. by minimizing the number of chunks in a system. However notice: there exist definitions of “chunk” where it is the basic transfer unit. An LV has the fundamental property that small-granularity updates in place (at any offset inside the LV) can be executed.
• Notice: we do not preclude further fine-grained distribution of LV data, but this is something which should be avoided if not absolutely necessary. Preferred method in typical practical use cases: some storage servers may have some spare RAID slots to be populated later, by resizing the PVs = Physical Volumes before resizing LVs.
• Notice that a typical local RAID system is also a Distributed System, according to some reasonable definition. Typical RAID implementations just involve SAS cables instead of Ethernet cables or Infiniband cables. Notice that this also applies to many “Commodity Hardware” approaches, like Ceph storage nodes driving dozens of local HDDs connected over SAS or SATA. The main difference is just that a hardware HBA = Host Bus Adapter is used instead of a hardware RAID controller. Instead of Ethernet switches, SAS multiplexers in backplanes are used. Anyway, this forms a locally distributed sub-system.
• Future variants of the Sharding Approach might extend this already present locally Distributed System to a somewhat wider one. For example, creation of a local LV (called “disk” in MARS terminology) could be implemented by a subordinate DRBD instance implementing a future RAID-10 mode over local Infiniband or crossover Ethernet cables, avoiding local switches. While DRBD would essentially create the “local” LV, the higher-level MARS instance would then be responsible for its wide-distance replication. See chapter 2 about use cases of MARS vs DRBD. Potential future use cases could be extremely huge LVs where external SAS disk shelves are no longer sufficient to get the desired capacity.
• The USENIX paper needs to treat the following parameters as more or less fixed (or only slowly changeable) constants, given by the system designer: the replication degree R, and the scatter width S. In contrast, the replication degree k of our Sharding Approach is not necessarily firmly given by the system, but can be dynamically changed at runtime on a per-LV basis.
For example, during background migration via MARS the command marsadm join-resource is used for creating additional per-LV replicas. However notice: this freedom is limited by the total number of deployed hardware nodes. If you want k = 3 replicas at the whole pool, then you will need to (dynamically) deploy at least about k · x nodes in general.
• The USENIX paper defines its copysets on a per-chunk basis. Similarly to before, we can transfer this definition to a Sharding Approach by relating it to a per-LV basis. As a side effect, a copyset can then trivially become identical to S when the definition of S is also changed to a per-LV basis, analogously. In the Sharding Approach, a distinction is not absolutely necessary, while the USENIX paper has to invest some effort into clarifying the relationship between S and copysets as defined on a BigCluster model.
• Neglecting the mentioned differences, we see our typical use case (LocalSharding) as roughly equivalent to S = R in the terminology of the USENIX paper, or to S = k (our number of replicas) in our terminology.
• This means: we try to minimize the size of S for any given per-LV k, which will lead to the best possible reliability (under the conditions described in section 1.8.1.2), as has been shown in section 1.8.2.

1.9. Performance Arguments from Architecture

Some people think that replication is easily done at filesystem layer. There exist lots of cluster filesystems and other filesystem-layer solutions which claim to be able to replicate your data, sometimes even over long distances. Trying to replicate several petabytes of data, or some billions of inodes, is however a much bigger challenge than many people can imagine. Choosing the wrong layer for mass data replication may get you into trouble. Here is an architectural-level (cf. section 1.1) explanation why replication at the block layer is easier and less error-prone:

[Figure: “Replication at Block Level vs FS Level” — the layer stack of a standalone Linux system: Application Layer (Apache, PHP, mail queues, etc; userspace) = potential cut point A for a distributed system; Filesystem Layer (xfs, ext4, btrfs, zfs, … vs nfs, Ceph, Swift, …; ~25 operation types, ~100,000 ops/s) = potential cut point B, leading to DSM = Distributed Shared Memory and thus to a cache coherence problem; Caching Layer (Page Cache, Dentry Cache, …) providing about a 1:100 reduction; Block Layer (LVM, DRBD / MARS; 2 operation types (r/w), ~1,000 ops/s) = potential cut point C, which gives replication of VMs for free; Hardware (hardware RAID, BBU, …).]

The picture shows the main components of a standalone Unix / Linux system. In the late 1970s / early 1980s, a so-called Buffer Cache was introduced into the architecture of Unix. Today’s Linux has refined the concept to various internal caches such as the Page Cache (for data) and the Dentry Cache (for metadata). All these caches serve one main purpose [15]: they reduce the load onto the storage by exploitation of fast RAM. A well-tuned cache can yield high cache hit ratios, typically 99%. In some cases (as observed in practice) even more than 99.9%.

[15] Another important purpose is providing shared memory.

Now start distributing the system over long distances. There are potential cut points A and B and C [16]. Cut point A is application specific, and can have advantages because it has knowledge of the application.
For example, replication of mail queues can be controlled in a much more fine-grained way than at filesystem or block layer. Cut points B and C are generic, supporting a wide variety of applications, without altering them. Cutting at B means replication at filesystem level. C means replication at block level. When replicating at B, you will notice that the caches are below your cut point. Thus you will have to re-implement distributed caches, and you will have to maintain cache coherence. When replicating at C, the Linux caches are above your cut point. Thus you will receive much less traffic, typically already reduced by a factor of 100, or even more. This is much easier to cope with. You will also profit from journalling filesystems like ext4 or xfs. In contrast, truly distributed [17] journalling is typically not available with distributed cluster filesystems.

A potential drawback of block layer replication is that you are typically limited to active-passive replication. Active-active operation is not impossible at block layer (see combinations of DRBD with ocfs2), but less common, and less safe to operate. This limitation isn’t necessarily caused by the choice of layer. It is simply caused by the laws of physics: communication is always limited by the speed of light. A distributed filesystem is nothing but a logically distributed shared memory (DSM). Some decades of research on DSM have shown that there exist applications / workloads where the DSM model is inferior to the direct communication paradigm. Even in short-distance / cluster scenarios. Long-distance DSM is extremely cumbersome. Therefore: you simply shouldn’t try to solve long-distance communication needs via communication over filesystems. Even simple producer-consumer scenarios (one-way communication) are less performant (e.g. when compared to plain TCP/IP) when it comes to distributed POSIX semantics. There is simply too much synchronisation overhead at metadata level.

If you have a need for mixed operations at different locations in parallel: just split your data set into disjoint filesystem instances (or database / VM instances, etc). All you need is careful thought about the appropriate granularity of your data sets (such as well-chosen sets of user home-directory subtrees, or database sets logically belonging together, etc). Replication at filesystem level is often at single-file granularity. If you have several millions or even billions of inodes, you may easily find yourself in a snakepit. Conclusion: active-passive operation over long distances (such as between continents) is even an advantage. It keeps you from trying bad / almost impossible things.

1.10. Scalability Arguments from Architecture

Some people are talking about scalability by (1) looking at a relatively small example cluster implementation of their respective (pre-)chosen architecture having n machines or n network components or running n application instances, and then (2) extrapolating its behaviour to bigger n. They think if it runs with small n, it will also run for bigger n. This way of thinking and acting is completely broken, and can endanger both companies and careers. This is not only because of confusion of “architecture” with “implementation”, cf. section 1.1. It is also fundamentally broken because it assumes some “linearity” in a field which is non-linear by definition. If scalability were linear, the term would not be useful at all, because there would be no limit.
However, limits exist in practice, and the term “scalability” is the means for describing the behaviour at or around the limit. Another incorrect way of ill-defining the term “scalability” is looking at some relatively big example cluster which is working in practice. Arguing with an example of a working system is wrong by construction. Every storage system on this globe always has some scalability limit, somewhere. Scalability is always a non-linear behaviour. In order to find the practical limit, you must reach it.

[16] In theory, there is another cut point D by implementing a generically distributed cache. There exists some academic research on this, but practically usable enterprise-grade systems are rare and not widespread.
[17] In this context, “truly” means that the POSIX semantics would always be guaranteed cluster-wide, even in case of partial failures. In practice, some distributed filesystems like NFS don’t even obey the POSIX standard locally on 1 standalone client. We know of projects which have failed right because of this.

Therefore, examples are principally insufficient for proving scalability, as well as for comparing the scalability of architectures and/or of certain implementations. Examples can only be used for disproving scalability.

1.10.1. Example Failures of Scalability

The following description is a must-read for sysadmins and system architects, and also for managers who are responsible. The numbers and some details are from my memory, thus it need not be 100% accurate in all places. It is about an operation environment for a new product, which was a proprietary web page editor running under a complicated variant of a LAMP stack. The setup started with a BigCluster architecture, but actually sized as a “SmallCluster” implementation.

Setup 1 (NFS) The first setup consisted of n = 6 storage servers, each replicated to another datacenter via DRBD. Each was exporting its filesystems via NFS to about the same number of client servers, where Apache/PHP was supposed to serve the HTTP requests from the customers, which were entering the client cluster via an HTTP load balancer. The load balancer was supposed to spread the HTTP load to the client servers in a round-robin fashion. At this point, eager readers may notice some similarity with the error propagation problem treated in section 1.8.3 on page 31. Notice that this is about scalability instead, but you should compare with that, to find some similarities.

After the complicated system was built up and was working well enough, the new product was launched via a marketing campaign with free trial accounts, limited in time. So the number of customers was ramping up from 0 to about 20,000 within a few weeks. When about 20,000 customers were running on the client machines, system hangs were noticed, also from a customer’s perspective. When too many customers were pressing the “save” button in parallel on reasonably large web page projects, a big number of small files, including a huge bunch of small image files, was generated over a short period of time. A few customers were pressing the “save” button several times a minute, each time re-creating all of these files again and again from the proprietary web page generator. Result: the system appeared to hang. However, all of the servers, including the storage servers, were almost idle with respect to CPU consumption. RAM sizes were also no problem.
After investigating the problem for a while, it was noticed that the network was the bottleneck, but not in terms of throughput. The internal sockets were forming some queues which were delaying the NFS requests in some ping-pong like fashion, almost resulting in a “deadlock” from a customer’s perspective (a better term would be distributed livelock or distributed thrashing).

Setup 2 (ocfs2) Due to some external investigations and recommendations, the system was converted from NFS to ocfs2. Now DRBD was operated in active-active mode. Only one system software component was replaced with another one, without altering the BigCluster architecture, and without changing the number of servers, which remained a stripped-down SmallCluster implementation. Result: the problem with the “hangs” disappeared. However, after the number of customers had exceeded the next scalability limit of about 30,000 customers, the “hang” problem appeared once again, in a similar way. The system showed systematic incidents again.

Setup 3 (glusterfs as a substitute for NFS) After investigating the network queueing behaviour and the lock contention problems of ocfs2, the next solution was glusterfs. However, when the number of customers exceeded the next scalability limit, which was about 50,000 customers, some of them hammering the cluster with their “save” button, the “hangs” appeared again.

Setup 4 (glusterfs replication as a substitute for DRBD) After analyzing the problem once again, it was discovered by accident that drbdadm disconnect appeared to “solve” the problem. Therefore DRBD was replaced with glusterfs replication. There exists a glusterfs feature allowing replication of files at filesystem level. This attempt immediately resulted in an almost fatal disaster, and thus was stopped immediately: the cluster completely broke down. Almost nothing was working anymore. The problem was even worse: switching off the glusterfs replication and rolling back to DRBD did not work. The system remained unusable. As a temporary workaround, drbdadm disconnect improved the situation enough for some limping operation. Retrospective explanation: some of the reasons can be found in section 2.1.1 on page 44. glusterfs replication does not scale at all because it stores its replication information at per-inode granularity in EAs (extended attributes), which must necessarily be worse than DRBD, because there were some hundreds of millions of them in total as reported by df -i (see the cut point discussion in section 1.9 on page 33). Overnight, in some cron jobs, these EAs had to be deleted in reasonably sized batches in order to become more or less “operable” again.

Setup 5 (Sharding on top of DRBD) After the almost fatal incident had been resolved to a less critical one, the responsibility for the setup was taken over by another person. After the O(n²) behaviour from section 1.6 on page 18 had been understood, and after it was clear that sharding is only O(k) from a customer’s perspective, it was the final solution. Now the problem was resolved at architectural level, no longer by just replacing some components with some others. The system was converted to a variant of a RemoteSharding model (see section 1.6.1 on page 20), and some migrate scripts were introduced for load balancing of customer home directories and databases between shards.
As a side effect, the load balancer took on a new role: instead of spreading all of the HTTP requests to all of the client servers in a round-robin fashion, it now acted as a redirection mechanism at shard granularity, e.g. when one of the client servers was handed over to another one for maintenance.

Retrospective explanation: DRBD was definitely not the real reason for the critical incident. The replication traffic per shard is so low on average that until today, no replacement by MARS was absolutely necessary [18], although the distance is over 50 km. If you wonder why such low write traffic demands can cause such a big incident: look at the cache reduction graphics in section 1.9 on page 33. Today, the “save” buttons of the customers are just triggering some extra writebacks from the Page Cache of the kernel into the block layer, after some delay. These writebacks are not performance critical in reality, because the Page Cache is running them asynchronously in the background. In contrast, distributed filesystems like NFS or ocfs2 or glusterfs are not working asynchronously in many places, but will often schedule their requests synchronously into ordinary network queues, which form a sequential bottleneck, competing with other high-frequency filesystem operations. In addition, the “save” button triggers masses of metadata / inode updates in a short time, often residing in the same directory. Such a directory may thus form a “global” bottleneck. When suchalike competing metadata updates are distributed via a round-robin load balancer, the problem can easily become critical through the cache coherence problem. While local filesystems can smoothen such application behaviour via the Dentry Cache plus Inode Cache, which also show some asynchronous writeback behaviour, network filesystems are often unable to deal with this performantly. Although DRBD has a similar sequential bottleneck at the low-frequency block layer by its write-through strategy into its replica, this does not really matter: all other writebacks from the Page Cache are also started asynchronously, and triggered low-frequently, and occur after some delay (which in turn will smoothen the spikes caused by mass dirtification of many small files and inodes in a short time as caused by the “save” button), and thus are not really performance critical for this particular use case.

This is a striking example why careful selection of the granularity level (filesystem vs block layer) is essential. This is also a striking example why asynchronous operation can form a huge advantage in certain use cases. The sharding setup has been working until today, scaling up to the current number of customers, which is more than an order of magnitude higher, in the range of about a million customers. Of course, the number of shards had to be increased, but this is just what sharding is about.

[18] Many sysadmins are running a conservative strategy: never touch a running system...

1.10.2. Properties of Storage Scalability

1.10.2.1. Influence Factors at Scalability

In general, scalability of storage systems depends on the following factors (list may be incomplete):

1. The application class, in particular its principal workingset behaviour (in both dimensions: temporal and spatial). More explanations about workingsets can be found at http://blkreplay.org.
The size x of the application data and/or the number of application instances (possibly also denoted by x), and the amount of storage needed for it (could be also termed x). Besides the data itself, the corresponding metadata (inodes, indexes, etc) can form an important factor, or can even dominate the whole story. Typically, critical datacenter application data is tremendously differently sized from workstation data. Caution! Many people think erronously that scalability would be linearly depending on x. However, as is known at least since the 1960s (read some ancient papers from Saltzer and/or from Denning), scalability is never linear, but sometimes even disruptive, in particular when RAM size is the bottleneck. IO queues and/or networking queues are often also reacting to overload in a disruptive fashion. This means: after exceeding the scalability limit of a particular system for its particular class of applications, the system will always break down from a customer’s perspective, sometimes almost completely, and sometimes even fatally . On the other hand, some other systems are reacting with graceful degradation. Whether a particular systems reacts to a particular type of (over)load, either with graceful degradation, or with fatal disruption, or with some intermediate behaviour, is some sort of “quality property” of the system and/or of the application. EVERY SYSTEM, even sharded systems, and even the internet as a whole, has always some scalability limit somewhere. There exists no “inifinitely scaling” system on earth! 3. The distribution of the application behaviour in both timely and locality dimensions. Depending on the application class, this is often an exponential distribution according to Zipf’s law. By falsely assuming an equal distribution (or a Gaussian distribution) instead of actually measuring the distribution in both dimensions, you can easily induce zillions of costly problems for big x, or even fatal failure of the whole system / project. 37 1. Architectures of Cloud Storage / Software Defined Storage / Big Data 4. The transformation of the application workingset behaviour at architectural level, sometimes caused by certain components resp their specific implementation or parameterization. Examples are intermediate virtualization layers, e.g. vmware *.vmdk or KVM *.qcow2 container formats which can completely change the game, not only in extreme cases. Another example is random distribution to object stores, which can turn some uncomplicated sequential workloads into highly problematic random IO workloads. Don’t overlook such potential pitfalls! 5. The storage architecture to be chosen, such as CentralStorage vs BigCluster vs *Sharding. Choice of the wrong architecture can be fatal for big n and/or for certain timely / spatial application behaviour. Changing an architecture during operations on some petabytes of data and/or some billions of inodes can be almost impossible, and/or can consume a lot of time and money. 6. The number of storage nodes n. In some architectures, addition of more nodes can make the system worse instead of better, c.f. section 1.8 on page 25. 7. In case of architectures relying on a storage network: choice of layer for cut point, e.g. filesystem layer vs block layer, see section 1.9 on page 33, and/or introduction of an additional intermediate object storage layer (which can result in major degradation from an architectural view). 
Due to fundamental differences in distributed vs local cache coherence, suchalike can have a tremendous effect on scalability. 8. The implementation of the architecture. Be sure to understand the difference between an architecture and an implementation of that architecture. 9. The size and types / properties of various caches at various layers. You need to know the general properties of inclusive vs exclusive cache architecture. You absolutely need to know what thrashing is, and under which conditions it can occur. It is advantagous for system architects to know19 pre-loading strategies, as well as replacement strategies. It is advantageous to know what LRU or MFU means, what their induced overhead is, and how they really work on actual data, not just on some artificial lab data. You also should know what an anomaly is, and how it can be produced not only by FIFO strategies, but also by certain types of ill-designed multi-layer caching. Beware: there are places where FIFO-like behaviour is almost impossible to avoid, such as networks. All of these is outside the scope of this MARS manual. You should measure, when possible, the overhead of cache implementations. I know of examples where caching is counter-productive. For example, certain types and implementations of SSD caches are over-hyped. Removing a certain cache will then improve the situation. Notice: caches are conceptually based on some type of associative memory, which is either very costly when directly implemented in hardware, or can suffer from tremendous performance penalties when implemented inappropriately in software. 10. Hardware dimensioning of the implementation: choice of storage hardware, for each storage node. This includes SSDs vs HDDs, their attachment (e.g. SAS multiplexing bottlenecks), RAID level, and controller limitations, etc. 11. Only for architectures relying on a storage network: network throughput and network latencies, and network bottlenecks, including the queueing behaviour / congestion control / packet loss behaviour upon overload. The latter is often neglected, leading to unexpected behaviour at load peaks, and/or leading to costly over-engineering (examples see section 1.10.1 on page 35). 12. Hidden bottlenecks of various types. A complete enumeration is almost impossible, because there are too many “opportunities”. To reduce the latter, my general advice is to try to build bigger systems as simple as possible. This is why you should involve some real experts in storage systems, at least on critical enterprise data. 19 Reading a few Wikipedia articles does not count as “knowledge”. You need to be able to apply your knowdedge to enterprise level systems (as opposed to workstation-sized systems), sustainable and reproducible. Therefore you need to have actually worked in the matter and gained some extraordinary experiences, on top of deep understanding of the matter. 38 1.10. Scalability Arguments from Architecture Any of these factors can be dangerous when not carefully thought about and treated, depending on your use case. 1.10.2.2. Example Scalability Scenario To get an impression what “enterprise critical data” can mean in a concrete example, here are some characteristic numbers on 1&1 ShaHoLin (Shared Hosting Linux) around spring 2018, which would be the input parameters for any potential solution architecture CentralStorage vs BigCluster vs Sharding: • About 9 millions of customer homedirectories. • About 10 billions of inodes, with daily incremental backup. 
• More than 4 petabytes of net data (total df filling level) in spring 2018, with a growth rate of 21% per year.
• All of this permanently replicated into a second datacenter.
• Webhosting very close to 24/7/365. For maintenance, any resource must be switchable to the other datacenter at any time, independently from other resources, while in catastrophic failure scenarios all resources must be switchable within a short time.
For simplicity of our sandbox game, we assume that all of this is in one campus. In reality, about 30% resides on another continent. Introducing this as an additional input parameter would not fundamentally change the game. Many other factors, like dependencies on existing infrastructure, are also neglected.
Theoretical Solution: CentralStorage
Let us assume somebody would try to operate this on classical CentralStorage, and let us assume that migration of this amount of data, including billions of inodes, would be no technical problem. What would be the outcome? With current technology, finding a single CentralStorage appliance would be anything but easy. Dimensioning would be needed for the lifetime of such a solution, which is at least 5 years. In five years, the data would grow by a factor of about 1.21^5 ≈ 2.6, which is then about 10.5 petabytes. This is only the net capacity; at the hardware layer much more is needed for spare space and for local redundancy. The single CentralStorage instance will need to scale up to at least this number, in one datacenter (under the simplified game assumptions). The current number of client LXC containers is about 2600, independent of location. You will have to support growth in their number. For maintenance, any of these needs to be switchable to a different location at any time. The number of bare metal servers running them can vary with hardware architecture / hardware lifecycle, and with growth. You will need to dimension a dedicated storage network for all of this. If you find a solution which can do this with current CentralStorage technology for the next 5 years, then you will have to ensure that restore from backup20 can be done in less than 1 day in case of a fatal disaster; see also the treatment of CentralStorage reliability in section 1.5.5 on page 16. Notice that the current self-built backup solution for a total of 15 billion inodes is based on a sharding model; converting this to some more or less centralized solution would turn out to be another challenge. Attention! Buying 10 or 50 or 100 CentralStorage instances does not count as a CentralStorage architecture. By definition, suchalike would be RemoteSharding instead. Notice that the current 1&1 solution is already a mixture of LocalSharding and RemoteSharding, so you would win nothing at architectural level. In your business case, you would need to justify the price difference between the current component-based hardware solution (horizontally extensible by scale-out) and CentralStorage, which is about a factor of 10 per terabyte according to the table in section 1.7. Even if you manage to find a vendor who is willing to subsidize down to a factor of only 3, this is not all you need. You need to add the costs for the dedicated storage network.
20 Local snapshots, whether LVM or via some COW filesystem, do not count as backups! You need a logical copy, not a physical one, in case your production filesystem instance gets damaged.
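As a quick sanity check of the dimensioning above, the 5-year growth figure can be reproduced with a one-liner (a minimal sketch; the 4 PB starting point, the 21% yearly growth and the 5-year appliance lifetime are the numbers assumed in this section):

    # net data growth over a 5-year appliance lifetime at 21% per year
    awk 'BEGIN { pb = 4; for (y = 1; y <= 5; y++) pb *= 1.21; printf "%.1f PB net after 5 years\n", pb }'
    # prints about 10.4 PB; the gross hardware capacity must be noticeably
    # larger due to spare space and local redundancy.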
On top of this, you need to account for the migration costs after the lifetime of 5 years has passed, where the full data set needs to be migrated to a successor storage system. Notice that classical argumentations with manpower will not work. The current operating team is about 10 persons, with no dedicated storage admin. This relatively small team is not only operating a total of more than 6,000 shared boxes in all datacenters, but also some tenthousands of managed dedicated servers, running essentially the same software stack, with practically fully automated mass deployment. Most of their tasks are related to central software installation, which is then automatically distributed, and to operation / monitoring / troubleshooting of masses of client servers. Storage administration tasks in isolation are costing only a fraction of this. Typical claims that CentralStorage would require less manpower will not work here. Almost everything which is needed for mass automation is already automated. Neglecting the tenthousands of managed dedicated servers would be a catastrophic illdesign. Their hardware is already given, by existing customer contracts, some of them decades old. You simply cannot fundamentally change the hardware of these customers including their dedicated local disks, which is their main selling point. You cannot simply convert them to a shared CentralStorage, even if it would be technically possible, and if it would deliver similar IOPS rates than tenthousands of local spindles (and if you could reach the bundled performance of local SSDs from newer contracts), and even if you would introduce some interesting storage classes for all of this. A dedicated server on top of a shared storage is no longer a dedicated one. You would have to migrate these customers to another product, with all of its consequences. Alone for these machines, most 21 of the current automation of LocalStorage is needed anyway, although they are not geo-redundant at current stage. Conclusion: CentralStorage is simply unrealistic. Theoretical Solution: BigCluster The main problem of BigCluster is reliability, as explained intuitively in section 1.8 on page 25 and mathematically in appendix G on page 154, and as observed in numerous installations not working as expected. Let us assume that all of these massive technical problems were solved, somehow. Then the business case would have to deal with the following: The total number of servers would need to be roughly doubled. Not only their CAPEX, but also the corresponding OPEX (electrical power, rackspace, manpower) would increase. Alone their current electrical power cost, including cooling, is more than the current sysadmin manpower cost. Datacenter operations would also increase. On top, a dedicated storage network and its administration would also be needed. With respect to the tenthousands of managed dedicated servers and their customer contracts, a similar argument as above holds. You simply cannot convert them to BigCluster. Conclusion: BigCluster is also unrealistic. There is nothing to win, but a lot to loose. Current Solution: LocalSharding, sometimes RemoteSharding Short story: it works since decades, and is both cheap and robust since geo-redundancy had been added around 2010. With the advent of Football (see chapter 8 on page 130), the LocalSharding architecture is raising up on par with the most important management abilities of CentralStorage and BigCluster / Software Defined Storage. 
The story with the tenthousands of managed dedicated servers is arguing vice versa: without the traditional ShaHoLin sharding architecture and all of its automation, including the newest addition called Football, the product “managed dedicated servers” would not be possible in this scale. Summay: the sharded “shared” product enables another “dedicated” product, which is sharded by definition, and it actually is known to scale up by at least another order of magnitude (in terms of number of servers). 21 Only a few out of >1000 self-built or customized Debian packages are dealing with MARS and/or with the clustermanager cm3. 40 1.10. Scalability Arguments from Architecture 1.10.3. Scalability of Filesystem Layer vs Block Layer Following factors are responsible for better architectural (cf section 1.1) scalability of the block layer vs the filesystem layer, at least in many cases, with a few exceptions (list may be incomplete): 1. Granularity of access: metadata is often smaller than the content data it refers to, but access to data is typically not possible without accessing corresponding metadata first. When masses of metadata are present (e.g. some billions of inodes as in section 1.10.2.2), and when it is accessed more frequently than the corresponding data (e.g. in stateless designs like Apache), it is likely to become the bottleneck. Neglecting metadata and its access patterns is a major source of ill-designs. I know of projects which have failed (in their original setup) because of this. Repair will typically involve some non-trivial architectural changes. By default, the block layer itself has almost no metadata at all (or only tiny ones, such as describing a whole block device). Therefore it has an inherent advantage over the filesystem layer in such use cases. 2. Caching: shared memory caches in kernelspace (page cache + dentry cache) vs distributed caches over network. See the picture in section 1.9 on page 33. There exist examples where shared distributed caches do not work at all. I know of several projects which have failed. Another project than mentioned in section 1.10.1 on page 35 has failed because of violations of POSIX filesystem semantics. 3. Only in distributed systems: the cache coherence problem, both on metadata and on data. Depending on load patterns, this can lead to tremendous performance degradation, see example in section 1.10.1. 4. Dimensioning of the network: throughput, latencies, queueing behaviour. There exist a few known exceptions (list may be incomplete, please report further examples if you know some): • Databases: these are typically operating on specific container formats, where no frequent external metadata access is necessary, and where no sharing of the container as such is necessary. Typically, there is no big difference between storing them in block devices vs local filesystems. Exception from the exception: MyISAM is an old design from the 1980s, originally based on DBASE data structures. Don’t try to access them over NFS or similar. Or, better, try to avoid them at all if possible. • VM images: these are logical BLOBS, so there is typically no big difference whether you have an intermediate true filesystem layer, or not. Filesystems on top of object stores are no true intermediate filesystems. They are violating Dijkstra’s important layering rules, as stated in his famous articles on THE. A similar argument holds for block devices on top of object stores. Intermediate container formats like *.vmdk or *.qcow2 can also act as game changers. 
This does not mean that you have to avoid them at all. However, be sure to check their influence, and don’t forget their workingset and their caching behaviour (which can go both into positive and into negative direction), in order to really know what you are doing! 41 1. Architectures of Cloud Storage / Software Defined Storage / Big Data 1.11. Recommendations for Designing and Operating Storage Systems In order of precedence, do the following: 1. Fix and/or limit and/or tune the application. Some extreme examples: • When you encounter a classical Unix fork bomb, you have no chance against it. Even the “best and the most expensive hardware” is unable to successfully run a fork bomb. The only countermeasure is limitation of resources. Reason: unlimited resources do not exist on earth. • If you think that this were only of academic interest: several types of internet DDOS attacks are acting like a fork bomb, and Apache is also acting similar to a fork bomb when not configured properly. This is not about academics, it is about your survival (in the sense of Darwin). • If you think it cannot hurt you because you are running fast-cgi or another application scheme where forks are not part of the game (e.g. databases and many others): please notice that network queues are often acting as a replacement for processes. Overflow of queues can have a similar effect than fork bombs from the viewpoint of customers: they simply don’t get the service they are expecting. • Real-life example: some percentage of WordPress customers are typically and systematically misconfiguring their wp-cron cron jobs. They create backups of their website, which include their old backups. Result: in each generation of the backups, the needed disk space will roughly double. Even if you had “unlimited storage” on top of the “best and the most expensive storage system”, and even if you would like to give “unlimited storage” to your customers, it simply cannot work at all. Exponential growth is exponential growth. After a few months of this kind of daily backup, you would need more storage than atoms exist in the whole universe. You must introduce some quota limits somewhere. And you must ensure that the wp-cron misconfiguration is fixed, whoever is responsible for fixing it. • Another WordPress example: the wp-cron configuration syntax is not easily understandable by laymen. It is easy to misconfigure such that a backup is created once per minute. As long as the website is very small, this will not even be noticed by sysadmins. However, for bigger websites (and they are typically growing over time), the IO load may increase to a point until even asynchronous replication over 10Gig interfaces cannot catch up. Even worse: the next run of wp-cron may start before the old one has finished within a minute. Again, there is no chance except fixing the root cause at application level. 2. Choose the right overall architecture (not limited to storage). An impressive example for architectural (cf section 1.1) ill-design can be found in section 1.10.1. Important explanations are in section 1.10.2, in particular subsection 1.10.2.1 on page 37, and section 1.10.3 on the previous page. A strategic example is in subsection 1.10.2.2. It is absolutely necessary to know the standard cache hierarchy of Unix (similarly also found in Windows) from section 1.9 on page 33. More explanations are in this manual at many places. In general, major ill-designs of overall architectures (end-to-end) cannot be fixed at component level. 
Even the “best tuning of the world” executed by the “best tuning expert” on top of the “best and most expensive storage components and the best storage network of the world” cannot compensate major ill-designs, such as O(n2 ) behaviour. Similarly for reliability: if you have problems with too many and/or too large incidents affecting too many customers, read sections 1.8 on page 25 and 1.5.5 on page 16. 3. Choice and tuning of components. No further explanations necessary, because most people already know this. In case you 42 1.11. Recommendations for Designing and Operating Storage Systems think this is the only way: no, it is typically the worst and typically only the last resort when compared to the previous enumeration items. Exception: choice of wrong components with insufficient properties for your particular application / use case. But this is an architectural problem in reality. 43 2. Use Cases for MARS vs DRBD DRBD has a long history of successfully providing HA features to many users of Linux. With the advent of MARS, many people are wondering what the difference is. They ask for recommendations. In which use cases should DRBD be recommended, and in which other cases is MARS the better choice? The following table is a short guide to the most important cases where the decision is rather clear: Use Case server pairs, each directly connected via crossover cables active-active / dual-primary, e.g. gfs2, ocfs2 distance > 50km > 100 server pairs over a short-distance shared line all else / intermediate cases Recommendation DRBD DRBD MARS MARS read the following details There exist some use cases where DRBD is clearly better than MARS. 1&1 has a long history of experiences with DRBD where it works very fine, in particular coupling Linux devices rack-torack via crossover cables. DRBD is just constructed for that use case (RAID-1 over network). In such a scenario, DRBD is better than MARS because it uses up less disk space resources. In addition, newer DRBD versions can run over high-speed but short-distance interconnects like Infiniband (via the SDP protocol). Another use case for DRBD is active-active / dual-primary mode, e.g. ocfs21 over short2 distances. On the other hand, there exist other use cases where DRBD did not work as expected, leading to incidents and other operational problems. We analyzed them for our specific use cases. The later author of MARS came to the conclusion that they could only be resolved by fundamental changes in the overall architecture of DRBD. The development of MARS started at the personal initiative of the author, first in form of a personal project during holidays, but later picked up by 1&1 as an official project. MARS and DRBD simply have different application areas. In the following, we will discuss the pros and cons of each system in particular situations and contexts, and we shed some light at their conceptual and operational differences. 2.1. Network Bottlenecks 2.1.1. Behaviour of DRBD In order to describe the most important problem we found when DRBD was used to couple whole datacenters (each encompassing thousands of servers) over metro distances, we strip down that complicated real-life scenario to a simplified laboratory scenario in order to demonstrate the effect with minimal means. 1 Notice that ocfs2 is appearantly not constructed for long distances. 
1&1 has some experiences on a specific short distance cluster where the ocfs2 / DRBD combination scaled a little bit better than NFS, but worse than glusterfs (using 2 clients in both cases – notice that glusterfs showed extremely bad performance when trying to enable active-active glusterfs replication between 2 server instances, therefore we ended up using active-passive DRBD replication below a single glusterfs server). Conclusion: NFS < ocfs2 < glusterfs < sharding. We found that glusterfs on top of active-passive DRBD scalability was about 2 times better than NFS on top of active-passive DRBD, while ocfs2 on top of DRBD in active-active mode was somewhere inbetween. All cluster comparisons with an increasing workload over time (measured as number of customers which could be safely operated). Each system was replaced by the next one when the respective scalability was at its respective end, each time leading to operational problems. The ultimate solution was to replace all of these clustering concepts by the general concept of sharding. 2 Active-active won’t work over long distances at all because of high network latencies (cf chapter 1). Probably, for replication of whole clusters over long distances DRBD and MARS could be stacked: using DRBD on top for MARS for active-active clustering of gfs2 or ocfs2, and a MARS instance below for failover of one of the DRBD replicas over long distances. 44 2.1. Network Bottlenecks Notice that the following DRBD effect does not appear at crossover cables. The following scenario covers a non-standard case of DRBD. DRBD works fine when no network bottleneck appears! The following picture illustrates an effect which has been observed in 1&1 datacenters when running masses of DBRD instances through a single network bottleneck. In addition, the effect is also reproducible by an elder version of the MARS test suite3 : network throughput additional throughput needed for re−sync, not possible DRBD throughput automatic re−connect automatic disconnect (p ot en tia l) in ci de nt −> wanted application throughput, not possible dec rea s mirror inconsistency ... ing thr oug hpu t li mit time The simplified scenario is the following: 1. DRBD is loaded with a low to medium, but constant rate of write operations for the sake of simplicity of the scenario. 2. The network has some throughput bottleneck, depicted as a red line. For the sake of simplicity, we just linearly decrease it over time, starting from full throughput, down to zero. The decrease is very slowly over time (some minutes, or even hours). What will happen in this scenario? As long as the actual DRBD write throughput is lower than the network bandwidth (left part of the horizontal blue line), DRBD works as expected. Once the maximum network throughput (red line) starts to fall short of the required application throughput (first blue dotted line), we get into trouble. By its very nature, DRBD works synchronously. Therefore, it must transfer all your application writes through the bottleneck, but now it is impossible4 due to the bottleneck. As a consequence, the application running on top of DRBD will see increasingly higher IO latencies and/or stalls / hangs. We found practical cases (at least with former versions of DRBD) where IO latencies exceeded practical monitoring limits such as 5 s by far, up to the range of minutes. As an experienced sysadmin, you know what happens next: your application will run into an incident, and your customers will be dissatisfied. 
In order to deal with such situations, DRBD has lots of tuning parameters. In particular, the timeout parameter and/or the ping-timeout parameter will determine when DRBD will give up in such a situation and simply drop the network connection as an emergency measure. Dropping the network connection is roughly equivalent to an automatic disconnect, followed by an automatic re-connect attempt after connect-int seconds. During the dropped connection, the incident will appear as being resolved, but at some hidden cost5 . 3 The effect has been demonstrated some years ago with DRBD version 8.3.13. By construction, is is independent from any of the DRBD series 8.3.x, 8.4.x, or 9.0.x. 4 This is independent from the DRBD protocols A through C, because it just depends on an informationtheoretic argument independently from any protocol. We have a fundamental conflict between network capabilities and application demands here, which cannot be circumvented due to the synchronous nature of DRBD. 5 By appropriately tuning various DRBD parameters, such as timeout and/or ping-timeout, you can keep the impact of the incident below some viable limit. However, the automatic disconnect will then happen 45 2. Use Cases for MARS vs DRBD What happens next in our scenario? During the disconnect, DRBD will record all positions of writes in its bitmap and/or in its activity log. As soon as the automatic re-connect succeeds after connect-int seconds, DRBD has to do a partial re-sync of those blocks which were marked dirty in the meantime. This leads to an additional bandwidth demand6 as indicated by the upper dotted blue box. Of course, there is absolutely no chance to get the increased amount of data through our bottleneck, since not even the ordinary application load (lower dotted lines) could be transferred. Therefore, you run at a very high risk that the re-sync cannot finish before the next timeout / ping-timeout cycle will drop the network connection again. What will be the final result when that risk becomes true? Simply, your secondary site will be permanently in state inconsistent. This means, you have lost your redundancy. In our scenario, there is no chance at all to become consistent again, because the network bottleneck declines more and more, slowly. It is simply hopeless, by construction. In case you lose your primary site now, you are lost at all. Some people may argue that the probability for a similar scenario were low. We don’t agree on such an argumentation. Not only because it really happens in pratice, and it may even last some days until problems are fixed. In case of rolling disasters, the network is very likely to become flaky and/or overloaded shortly before the final damage. Even in other cases, you can easily end up with inconsistent secondaries. It occurs not only in the lab, but also in practice if you operate some hundreds or even thousands of DRBD instances. The point is that you can produce an ill behaviour systematically just by overloading the network a bit for some sufficient duration. When coupling whole datacenters via some thousands of DRBD connections, any (short) network loss will almost certainly increase the re-sync network load each time the outage appears to be over. As a consequence, overload may be provoked by the re-sync repair attempts. This may easily lead to self-amplifying throughput storms in some resonance frequency (similar to self-destruction of a bridge when an army is marching over it in lockstep). 
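For reference, these knobs live in the DRBD resource configuration. A hypothetical DRBD 8.4-style excerpt (the resource name and all values are made up for illustration; timeout and ping-timeout are given in tenths of a second, and the exact section layout differs between DRBD versions):

    resource r0 {
      net {
        timeout      60;    # 6 s: give up waiting for the peer, drop the connection
        ping-timeout 5;     # 0.5 s: peer-alive check
        connect-int  10;    # retry the connection every 10 s after a drop
      }
      disk {
        resync-rate  100M;  # static cap on re-sync bandwidth
        c-plan-ahead 20;    # or let the dynamic re-sync controller adapt the rate
      }
    }

As the surrounding text explains, none of these settings can remove the underlying conflict; they only shift when the disconnect happens and how aggressively the re-sync competes with application traffic.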
The only way for reliable prevention of loss of secondaries is to start any re-connect only in such situations where you can predict in advance that the re-sync is guaranteed to finish before any network bottleneck / loss will cause an automatic disconnect again. We don’t know of any method which can reliably predict the future behaviour of a complex network. Conclusion: in the presence of network bottlenecks, you run a considerable risk that your DRBD mirrors get destroyed just in that moment when you desperately need them. Notice that crossover cables usually never show a behaviour like depicted by the red line. Crossover cables are passive components which normally7 either work, or not. The binary connect / disconnect behaviour of DRBD has no problems to cope with that. earlier and more often in practice. Flaky or overloaded networks may easily lead to an enormous number of automatic disconnects. 6 DRBD parameters sync-rate resp resync-rate may be used to tune the height of the additional demand. In addition, the newer parameters c-plan-ahead, c-fill-target, c-delay-target, c-min-rate, c-max-rate and friends may be used to dynamically adapt to some situations where the application throughput could fit through the bottleneck. These newer parameters were developed in a cooperation between 1&1 and Linbit, the maker of DRBD. Please note that lowering / dynamically adapting the resync rates may help in lowering the probability of occurrences of the above problems in practical scenarios where the bottleneck would recover to viable limits after some time. However, lowering the rates will also increase the duration of re-sync operations accordingly. The total amount of re-sync data simply does not decrease when lowering resync-rate; it even tends to increase over time when new requests arrive. Therefore, the expectancy value of problems caused by strong network bottlenecks (i.e. when not even the ordinary application rate is fitting through) is not improved by lowering or adapting resync-rate, but rather the expectancy value mostly depends on the relation between the amount of holdback data versus the amount of application write data, both measured for the duration of some given strong bottleneck. 7 Exceptions might be mechanical jiggling of plugs, or electro-magnetical interferences. We never noticed any of them. 46 2.1. Network Bottlenecks or Linbit recommends a workaround for the inconsistencies during re-sync: LVM snapshots. We tried it, but found a performance penalty which made it prohibitive for our concrete application. A problem seems to be the cost of destroying snapshots. LVM uses by default a BOW strategy (Backup On Write, which is the counterpart of COW = Copy On Write). BOW increases IO latencies during ordinary operation. Retaining snapshots is cheap, but reverting them may be very costly, depending on workload. We didn’t fully investigate that effect, and our experience is a few years old. You might come to a different conclusion for a different workload, for newer versions of system software, or for a different strategy if you carefully investigate the field. DRBD problems usually arise only when the network throughput shows some “awkward” analog behaviour, such as overload, or as occasionally produced by various switches / routers / transmitters, or other potential sources of packet loss. 2.1.2. 
Behaviour of MARS The behaviour of MARS in the above scenario:
[Figure: MARS under a decreasing network throughput limit. The MARS application throughput, recorded in the transaction log, stays constant, while the replication network throughput follows the decreasing throughput limit.]
When the network is restrained, an asynchronous system like MARS will continue to serve the user IO requests (dotted green line) without any impact / incident, while the actual network throughput (solid green line) follows the red line. In the meantime, all changes to the block device are recorded in the transaction logfiles. Here is one point in favour of DRBD: MARS stores its transaction logs on the filesystem /mars/. When the network bottleneck lasts very long (some days or even some weeks), the filesystem will eventually run out of space. Section 4.4 discusses countermeasures against that in detail. In contrast to MARS, DRBD allocates its bitmap statically at resource creation time. It uses up less space, and you don’t have to monitor it for (potential) overflows. The space for transaction logs is the price you have to pay if you want or need anytime consistency, or asynchronous replication in general. In order to really grasp the heart of the difference between synchronous and asynchronous replication, we look at the following modified scenario:
[Figure: MARS under a flaky throughput limit. The network throughput limit fluctuates; the MARS application throughput stays constant, while the MARS network throughput alternately lags behind and catches up. The shaded lag-behind periods correspond to phases where a DRBD secondary would be inconsistent.]
This time, the network throughput (red line) is varying8 in some unpredictable way. As before, the application throughput served by MARS is assumed to be constant (dotted green line, often superseded by the solid green line). The actual replication network throughput is depicted by the solid green line. As you can see, a network dropdown undershooting the application demand has no impact on the application throughput, but only on the replication network throughput. Whenever the network throughput is held back due to the flaky network, it simply catches up as soon as possible by overshooting the application throughput. The amount of lag-behind is visualized as a shaded area: downward shading (below the application throughput) means an increase of the lag-behind, while the upward shaded areas (beyond the application throughput) indicate a decrease of the lag-behind (catch-up). Once the lag-behind has been fully caught up, the network throughput suddenly jumps back to the application throughput (here visible in two cases). Note that the existence of lag-behind areas roughly corresponds to DRBD disconnect states, and in turn to DRBD inconsistent states of the secondary as long as the lag-behind has not been fully caught up. The very rough9 duration of the corresponding DRBD inconsistency phase is visualized as a magenta line at the time scale.
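Since the transaction logs live in an ordinary filesystem, monitoring /mars/ is straightforward. A minimal sketch of a cron-able space check (the 80% threshold and the mail address are made-up examples; section 4.4 describes the real countermeasures once space gets tight):

    #!/bin/bash
    # warn when the /mars/ filesystem exceeds a chosen filling level
    LIMIT=80    # percent, example value
    USED=$(df --output=pcent /mars/ | tail -n1 | tr -dc '0-9')
    if [ "$USED" -ge "$LIMIT" ]; then
        echo "/mars/ is ${USED}% full on $(hostname)" |
            mail -s "MARS transaction log space warning" admin@example.com
    fi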
MARS utilizes the existing network bandwidth as best as possible in order to pipe through as much data as possible, provided that there exists some data requiring expedition. Conceptually, there exists no better way due to information-theoretic limits (besides data compression). Note that, on average over a longer period of time, the network must have enough capacity for transporting all of your data. MARS cannot magically break through information-theoretic limits. It cannot magically transport gigabytes of data over modem lines. Only relatively short network problems / packet loss can be compensated.
8 In real life, many long-distance lines or even some heavily used metro lines usually show fluctuations of their network bandwidth by an order of magnitude, or even higher. We have measured them. The overall behaviour can be characterized as “chaotic”.
9 Of course, this visualization is not exact. On the one hand, the DRBD inconsistency phase may start later than depicted here, because it only starts after the first automatic disconnect, upon the first automatic re-connect. In addition, the amount of resync data may be smaller than the amount of corresponding MARS transaction logfile data, because the DRBD bitmap will coalesce multiple writes to the same block into one single transfer. On the other hand, DRBD will transfer no data at all during its disconnected state, while MARS continues doing its best. This leads to a prolongation of the DRBD inconsistent phase. Depending on properties of the workload and of the network, the real duration of the inconsistency phase may be either shorter or longer.
In case of lag-behind, the version of the data replicated to the secondary site corresponds to some time in the past. Since the data is always transferred in the same order as originally submitted at the primary site, the secondary never becomes inconsistent. Your mirror always remains usable. Your only potential problem could be the outdated state, corresponding to some state in the past. However, the “as-best-as-possible” approach to the network transfer ensures that your version is always as up-to-date as possible, even under ill-behaving network bottlenecks. There is simply no better way to do it. In the presence of temporary network bottlenecks such as network congestion, there exists no better method than prescribed by the information-theoretic limit (red line, neglecting data compression). In order to get all of your data through the line, the network must become healthy again at some point. Otherwise, data will be recorded until the capacity of the /mars/ filesystem is exhausted, leading to an emergency mode (see section B.3). MARS’ property of never sacrificing local data consistency (at the possible cost of actuality, as long as you have enough capacity in /mars/) is called Anytime Consistency. Even when the capacity of /mars/ is exhausted and emergency mode is entered, the replicas will not become inconsistent by themselves. However, when the emergency mode is later cleaned up for a replica, it will become temporarily inconsistent during the fast full sync. Details are in section B.3. Conclusion: you can even use traffic shaping on MARS’ TCP connections in order to globally balance your network throughput (of course at the cost of actuality, but without sacrificing local data consistency). If you tried to do the same with DRBD, you could easily provoke a disaster. MARS simply tolerates any network problems, provided that there is enough disk space for transaction logfiles.
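If you want to exploit the traffic-shaping option mentioned above, any standard Linux mechanism will do, because MARS replication is plain TCP. A minimal sketch using tc with HTB (the interface name, the 200 Mbit/s cap and the port number 7777 are invented for illustration — check which TCP port your MARS installation actually uses):

    # cap (assumed) MARS replication traffic on eth1 at 200 Mbit/s, leave the rest alone
    tc qdisc add dev eth1 root handle 1: htb default 20
    tc class add dev eth1 parent 1: classid 1:10 htb rate 200mbit ceil 200mbit
    tc class add dev eth1 parent 1: classid 1:20 htb rate 800mbit
    tc filter add dev eth1 parent 1: protocol ip u32 match ip dport 7777 0xffff flowid 1:10

Because MARS only sacrifices actuality, such a cap merely increases the lag-behind during busy periods; doing the same to DRBD would provoke exactly the disconnect / re-sync cycles described in section 2.1.1.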
Even in case of completely filling up your disk with transaction logfiles after some days or weeks, you will not lose local consistency anywhere (see section 4.4). Finally, here is yet another scenario where MARS can cope with the situation:
[Figure: MARS under a constant network throughput limit with a peaky application workload. The application throughput shows heavy peaks; the MARS replication network throughput is clipped at the constant limit during the peaks and drains the backlog shortly afterwards.]
This time, the network throughput limit (solid red line) is assumed to be constant. However, the application workload (dotted green line) shows some heavy peaks. We know from our 1&1 datacenters that such application behaviour is very common (e.g. in case of certain kinds of DDOS attacks etc). When the peaks exceed the network capacities for some short time, the replication network throughput (solid green line) will be limited for a short time, stay a little bit longer at the limit, and finally drop down again to the normal workload. In other words, you get a flexible buffering behaviour, coping with the peaks. Similar scenarios (where both the application workload has peaks and the network is flaky to some degree) are rather common. If you used DRBD there, you would be likely to run into regular application performance problems and/or frequent automatic disconnect cycles, depending on the height and on the duration of the peaks, and on network resources.
2.2. Long Distances / High Latencies
In general and in some theories, latencies are conceptually independent of throughput, at least to some degree. All 4 possible combinations exist:
1. There exist communication lines with high latencies but also high throughput. Examples are raw fibre cables on the bottom of the Atlantic.
2. High latencies on low-throughput lines are very easy to achieve. If you have never seen it, you have never run interactive vi over ssh in parallel to downloads on an old-fashioned modem line.
3. Low latencies need not be incompatible with high throughput. See Myrinet, InfiniBand or high-speed point-to-point interconnects, such as modern RAM buses.
4. Low latency combined with low throughput is also possible: in an ATM system (or another pre-reservation system for bandwidth), just increase the multiplex factor on low-capacity but short lines, which is only possible at the cost of assigned bandwidth.
In Internet practice, however, it is very likely that high latencies will also lead to worse throughput, because of the congestion control algorithms running all over the world. We have experimented with extremely large TCP send/receive buffers plus various window sizes and congestion control algorithms over long-distance lines between the USA and Europe. Yes, it is possible to improve the behaviour to some degree. But magic does not happen. Natural laws will always hold. You simply cannot travel faster than the speed of light. Our experience leads to the following rule of thumb, not formally proven by anything, but just observed in practice: In general10, synchronous data replication (not limited to applications of DRBD) works reliably only over distances < 50 km, or sometimes even less. There may be some exceptions, e.g. when dealing with low-end workstation loads. But when you are responsible for a whole datacenter and/or some centralized storage units, don’t waste your time by trying (almost) impossible things. We recommend using MARS in such use cases. 2.3.
Explanation via CAP Theorem C = Consistency A = Availability P = Partitioning Tolerance The famous CAP theorem, also called Brewer’s theorem, is important for a deeper understanding of the differences between DRBD and MARS. A good explanation can be found at https://en.wikipedia.org/wiki/CAP_theorem (retrieved July 2018). 10 We have heard of cases where even less than 50 km were not working with DRBD. It depends on application workload, on properties of the line, and on congestion caused by other traffic. Some other people told us that according to their experience, much lesser distances should be considered operable, only in the range of a few single kilometers. However, they agree that DRBD is rock stable when used on crossover cables. 50 2.3. Explanation via CAP Theorem The CAP theorem states that only 2 out of 3 properties can be achieved at the same time, when a Distributed System is under pressure: C = Consistency means Strict Consistency at the level of the distributed system (which is not the same as strict consistency inside of one of the local systems), A = Availability = intuitively clear from a user’s perspective, and P = Partitioning Tolerance = the network may have its own outages at any time (which is a negative criterion). As explained in the Wikipedia article, the P = Partitioning Tolerance is a property which is imporant at least in wide-distance data replication scenarios, and possibly in some other scenarios. 2.3.1. CAP Differences between DRBD and MARS If you are considering only short distances like passive crossover cables between racks, then (and only then) you may assume(!) that P is not required. Then, and only then, you can get both A and C at the same time, without sacrificing P, because P is already for free by assumption. In such a crossover cable scenario, getting all three C and A and P is possible, similarly to an explanation in the Wikipedia article. This is the classical use case for DRBD: when both DRBD replicas are always staying physically connected via a passive crossover cable (which is assumed to never break down), you can get both strict global consistency and availability, even in cases where one of the DRBD nodes is failing11 . Both C and A are provided by DRBD during connected state, while P is assumed to be provided by a passive component. By addition of iSCSI failover, A can be achieved even in case of single storage node failures, while retaining C from the viewpoint12 of the application. This is explained by the thick line in the following variant of the graphics, which is only valid for crossover cables where P need not be guaranteed by the replication because it is already assumed for free: C = Consistency A = Availability P = Partitioning Tolerance Now look at the case of a truly Distributed System, where P cannot be assumed as for free. For example, try to use DRBD in a long-distance replication scenario. There we cannot assume P as already given. We must tolerate replication network outages. DRBD is reacting to this differently in two different modes. First we look at the (short) time interval before DRBD recognizes the replication network incident, and before it leaves the connected state. During this phase, the application IO will hang for some time, indicating the (temporary) sacrifice (from a user’s perspective) by a red X: C = Consistency X A = Availability P = Partitioning Tolerance 11 In addition, you will need some further components like Pacemaker, iSCSI failover, etc. 
the CAP theorem does not deal with node failures, only with network failures. Node failures would always violate C by some “strong” definition. By some “weaker” definition, the downtime plus recovery time (e.g. DRBD re-sync) can be taken out of the game. Notice: while a node can always “know” whether it has failed (at least after reboot), network failures cannot be distinguished from failures of remote nodes in general. Therefore node failures and network failures are fundamentally different by their nature. 12 Notice: 51 2. Use Cases for MARS vs DRBD Because Availability is one of the highest goods of enterprise-critical IT operations, you will typically configure DRBD such that it automatically switches to some variant of a disconnected state after some timeout, thereby giving up consistency between both replicas. The red X indicates not only loss of global strict consistency in the sense of the CAP theorem, but also that your replica will become Inconsistent during the following re-sync: X C = Consistency A = Availability P = Partitioning Tolerance You may wonder what the difference to MARS is. As explained in section 1.2, MARS is not only intended for wide distances, but also for Cloud Storage where no strict consistency is required at global level by definition, but instead Eventually Consistent is the preferred model for the Distributed System. Therefore, strict consistency (in the sense of the CAP theorem) is not required by definition. Therefore, the red X is not present in the following graphics, showing the state where MARS is remaining locally consistent all the time13 , even when a network outage occurs: C = Consistency A = Availability P = Partitioning Tolerance Notice: MARS does not guarantee strict consistency between LV replicas at the level of the Distributed System, but only Eventually Consistent. However, at the same time it also guarantees strict consistency locally, and even at each of the passive replicas, each by each. Don’t confuse these different levels. There are different consistency guarantees at different levels, at the same time. This might be confusing if you are not looking at the system at different levels: (1) overall Distributed System versus (2) each of the local system instances. Why does MARS this? Because a better way is not possible at all. The CAP theorem tells us that there exists no better way when both A have to be guaranteed (as almost everywhere in enterprise-critical IT operations), and P has to be ensured in datacenter disaster scenarios or some other scenarios. Similarly to natural laws like Einstein’s laws of the speed of light, there does not exist a better way! Conclusion from the CAP theorem: when P is a hard requirement, don’t use DRBD (or other synchronous replication implementations) for long-distance and/or Cloud Storage scenarios. The red X is in particular problematic during re-sync, after the network has become healthy again (cf section 2.1.1). MARS has no red X at C because of its Anytime Consistency, 13 Notice that the initial full sync is not considered here, neither for DRBD, nor for MARS. Setup of the Distributed System is its own scenario, not considered here. Repair of a damaged system is also a different scenario, also not considered here. Notice the MARS’ emergency mode also belongs to the class of “damages”, as well as DRBD’ disk failure modes, where is has some additional functionality compared to the current version of MARS. 52 2.3. 
Explanation via CAP Theorem which refers to local consistency, and which is violated by DRBD during certain important phases of its regular operation. Another conclusion from the CAP theorem: when A+C is a hard requirement, and when P can be faithfully assumed as already given by passive crossover cables, then don’t use the current version of MARS. Use DRBD instead. If you think that you require alle three properties C+A+P, but you don’t have passive crossover cables over short distances, you are requiring something which is impossible. There exists no solution, with whatever component, or from whatever commercial storage vendor. The CAP theorem is as hard as Einstein’s natural laws are. Rethink your complete concept, from end to end. Something is wrong, somewhere. Ignoring this on enterprise-critical use cases can endanger a company and/or your career. 2.3.2. CAP Commonalities between DRBD and MARS In this subsection, we look at the case that P is not for free, but has to be ensured by the Distributed Storage system. You may have noticed that MARS’ ordinary CAP behaviour is similar to DRBD’s CAP picture in disconnected state, or during similar states when the replication network is interrupted. Replication network interruption is also known as “Network Partitioning”. This is where property P = Partitioning Tolerance comes into play. When a network partition has actually occurred, both DRDB and MARS allow you to do the same: you may forcefully switch the primary role, which means activation of a former secondary node. In such a situation, you can issue commands like drbdadm primary --force or marsadm primary --force. It is no accident that both commands are looking similar to each other. The outcome will be the same: you will most likely get a SplitBrain situation. The possibility of getting a split brain is no specific property of neither DRBD nor MARS. It will also happen with any other replication system, whether synchronous or asynchronous. It is one of the consequences from the CAP theorem when (1a) P has to be assured, and (1b) a network partition has actually occurred, and (2) when A = Availability is enforced at both sides of the network partition. The result is that C = global Consistency is violated, by creation of two or more versions of the data. Careful: at least for some application classes, it is a bad idea to systematically create split brain via automatic cluster managers, e.g. Pacemaker or similar. As explained in section 7.1 on page 116, some cluster managers were originally constructed for truly shared disk scenarios, where no split brain can occur by construction. Using them in masses on versioned data in truly distributed systems can result in existential surprises, once a bigger network partition and/or a flaky replication networks triggers them in masses, and at some moments where you didn’t really want to do what they now are doing automatically, and in masses. Split brain should not be provoked when not absolutely necessary. Split brain resolution is all else but easy in general. When the data is in a generic block device, you typically will have no general means for merging both versions. This means, split brain resolution is typically only possible by throwing away some of the versions. This kind of split brain resolution problem is no specific property of DRBD or of MARS. It is a fundamental property of generic block devices. DRBD and MARS have some commands like drbdadm invalidate or marsadm invalidate for this. Again, the similarity is no accident. 
Notice that classical filesystems aren’t typically better than raw block devices. There are even more possibilities for tricky types of conflicts (e.g. on path names in addition to file content). Similary, BigCluster object stores are often suffering from similar (or even worse) problems, because higher application layers may have some hidden internal dependencies between object 53 2. Use Cases for MARS vs DRBD versions, while the object store itself is agnostic of version dependencies in general14 . When stacking block devices or filesystems (or something else) on top of some BigCluster object store, the latter will not magically resolve any split brain for you. Check whether your favorite object store implementation has some kind of equivalent of a primary --force command, and some equivalent15 of an invalidate command. If it doesn’t have one, or only a restricted one, you should be alerted. In case of a long-lasting storage network partition, you might need suchalike desperately for ensuring A, even at the cost of C. Check: whether you need this is heavily depending on the application class (see also the Cloud Storage definition in section 1.2 on page 9, or look at webhosting, etc). When you would need it, but you are not prepared for suchalike scenarios at your enterprise-critical data, it could cost you a lot of money and/or reputation and/or even your existence. Notice: the concept of SplitBrain is occurring almost everywhere in truly Distributed Systems when C can be violated in favour of A+P. It is a very general consequence16 of the CAP theorem. The only reliable way for avoiding split brain in truly distributed systems would be: don’t insist on A = Availability. Notice that there exist some application classes, like certain types of banking, where C is typically a higher good than A. Notice that both DRBD and MARS are supporting this also: just don’t add the option --force to the primary switch command. However: even in banking, some extremely extraordinary scenarios might occur, where sacrifice of C in favour of A could be necessary (e.g. when manual cleanup of C is cheaper than long-lasting violations of A). Good to know that both DRBD and MARS have some emergency measure for killing C in favour of A! 2.4. Higher Consistency Guarantees vs Actuality We already saw in section 2.1 that certain types of network bottlenecks can easily (and reproducibly) destroy the consistency of your DRBD secondary, while MARS will preserve local consistency at the cost of actuality (anytime consistency). Some people, often located at database operations, are obtrusively arguing that actuality is such a high good that it must not be sacrificed under any circumstances. Anyone arguing this way has at least the following choices (list may be incomplete): 1. None of the above use cases for MARS apply. For instance, short distance replication over crossover cables is sufficient (which occurs very often), or the network is reliable enough such that bottlenecks can never occur (e.g. because the total load is extremely low, or conversely the network is extremely overengineered / expensive), or the occurrence of bottlenecks can provably be taken into account. In such cases, DRBD is clearly the better solution than MARS, because it provides better actuality than the current version of MARS, and it uses up less disk resources. 2. In the presence of network bottlenecks, people didn’t notice and/or didn’t understand and/or did under-estimate the risk of accidental invalidation of their DRBD secondaries. 
They should carefully check that risk. They should convince themselves that the risk is really bearable. Once they are hit by a systematic chain of events which reproducibly provokes the bad effect, it is too late [17].

3. In the presence of network bottlenecks, people found a solution such that DRBD does not automatically re-connect after the connection has been dropped due to network problems (cf. the ko-count parameter). So the risk of inconsistency appears to have vanished. In some cases, people did not notice that the risk has not completely vanished [18], and/or they did not notice that the actuality produced by DRBD is now even drastically worse than that of MARS (in the same situation). It is true that DRBD provides better actuality in connected state, but for a full picture the actuality in disconnected state should not be neglected [19]. So they didn't notice that their argumentation on the importance of actuality may be fundamentally wrong. A possible way to overcome that may be re-reading section 2.1.2 and comparing its outcome with the corresponding outcome of DRBD in the same situation.

4. People are stuck in contradictory requirements because the current version of MARS does not yet support synchronous or pseudo-synchronous operation modes. This should be resolved some day.

A common misunderstanding concerns the actuality guarantees provided by filesystems. The buffer cache / page cache uses a writeback strategy by default, for performance reasons. Even modern journalling filesystems will (by default) provide only consistency guarantees, but no strong actuality guarantee. In case of power loss, some transactions may even be rolled back in order to restore consistency. According to POSIX [20] and other standards, the only reliable way to achieve actuality is the use of system calls like sync(), fsync(), fdatasync(), flags like O_DIRECT, or similar. For performance reasons, the vast majority of applications don't use them at all, or use them only sparingly! It makes no sense to require strong actuality guarantees from any block layer replication (whether DRBD or future versions of MARS) while higher layers such as filesystems or even applications are already sacrificing them! In summary, the anytime consistency provided by MARS is an argument you should consider, even if you need an extra hard disk for transaction logfiles.

[14] There exist many types of potential dependencies between objects. Temporal ones are easy to capture, but in general this is not sufficient for everything.
[15] Notice: BigCluster architectures typically distinguish between client servers and storage servers. This typically introduces some more possibilities into the game, such as forced client failover independently from forced storage failover.
[16] There exist only few opportunities for generic conflict resolution, even in classical databases where some knowledge about the structure of the data is available. Typically, there are further hidden dependencies. Lossless SplitBrain resolution will thus need to be implemented at the application layer, if it is possible at all.
[17] Some people seem to need a bad experience before they grasp the difference between risk caused by reproducible effects and plain bad luck.
[18] Hint: what's the conceptual difference between an automatic and a manual re-connect?
Yes, you can try to lower the risk in some cases by shifting it to human analysis and human decisions, but did you take into account the possibility of human error?
[19] Hint: a potential hurdle is that the current format of /proc/drbd displays neither the timestamp of the first relevant network drop nor the total amount of lag-behind user data (which is not the same as the number of dirty bits in the bitmap), while marsadm view can display both. So it is difficult to judge the risks. Inspecting DRBD messages in the syslog may help, but quantification could remain hard.
[20] The above argumentation also applies analogously to Windows filesystems.

3. Quick Start Guide

This chapter is for impatient but experienced sysadmins who already know DRBD. For more complete information, refer to chapter The Sysadmin Interface (marsadm and /proc/sys/mars/).

3.1. Preparation: What you Need

Typically, you will use MARS at servers in a datacenter for replication of big masses of data. Typically, you will use MARS for replication between multiple datacenters, when the distances are greater than ≈ 50 km. Many other solutions, even from commercial storage vendors, will not work reliably over large distances when your network is not extremely reliable, or when you try to push huge masses of data from high-performance applications through a network bottleneck. If you have ever encountered such problems (or want to avoid them in advance), MARS is for you. You can use MARS both at dedicated storage servers (e.g. for serving Windows clients) and at standalone Linux servers where CPU and storage are not separated.

In order to protect your data from low-level disk failures, you should use a hardware RAID controller with BBU. Software RAID is explicitly not recommended, because it generally provides worse performance due to the lack of a hardware BBU (for some benchmark comparisons with/without BBU, see https://github.com/schoebel/blkreplay/raw/master/doc/blkreplay.pdf). Don't set your hardware BBU cache to "writethrough" mode. This may lead to tremendous performance degradation. Use the "writeback" strategy instead. It should be operationally safe, because in case of power loss the BBU cache content will be preserved thanks to the battery, and/or thanks to goldcaps saving the cache content into some flash chips.

For better performance, use newer MARS versions from branch mars0.1a.y or later. You may also play around with /proc/sys/mars/aio_sync_mode when actuality is less important. Further tuning of /proc/sys/mars/io_tuning/ and many more tunables is currently only recommended for experts. Future versions of MARS are planned to provide better performance with software RAID.

Typically, you will need more than one RAID set [1] for big masses of data. Therefore, use of LVM is also recommended [2] for your data.

MARS' tolerance of networking problems comes at some cost. You will need some extra space for the transaction logfiles of MARS, residing on the /mars/ filesystem. The exact space requirements for /mars/ depend on the average write rate of your application, not on the size of your data. We found that only a few applications write more than 1 TB per day. Most write even less than 100 GB per day. Usually, you want to dimension /mars/ such that you can survive a network loss lasting 3 days / about one weekend. This can be achieved with current technology rather easily: as a simple rule of thumb, just use one dedicated disk having a capacity of 4 TB or more.
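As a back-of-the-envelope check of this rule of thumb, the following sketch estimates how many days of network outage a given /mars/ capacity can absorb. The 100 GB/day write rate and the 4 TB disk are merely the example figures from the text above, not measurements; substitute your own numbers.

    #!/bin/bash
    # Hypothetical sizing helper for the /mars/ filesystem.
    write_rate_gb_per_day=100   # average write rate of your application
    mars_size_gb=4000           # usable capacity of /mars/
    days=$(( mars_size_gb / write_rate_gb_per_day ))
    echo "/mars/ of ${mars_size_gb} GB survives roughly ${days} days of network outage"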
Typically, that will provide you with plenty of headroom even for bigger networking incidents. Dedicated disks for /mars/ have another advantage: their mechanical head movement is completely independent from your data head movements. For best performance, attach that dedicated disk to your hardware RAID controller with BBU, building a separate RAID set (even if it consists only of a single disk – notice that the hardware BBU is the crucial point). If you are concerned about reliability, use two disks switched together as a relatively small RAID-1 set. For extremely high performance demands, you may consider (and check) RAID-10. Since the transaction logfiles are highly sequential in their access pattern, a cheap but high-capacity SATA disk (or nearline-SAS disk) is usually sufficient. At the time of this writing, standard SATA SSDs have not (yet) shown themselves preferable: although they offer a high random IOPS rate, their sequential throughput is worse, and their long-term stability is questioned by many people. However, as technology evolves and becomes more mature, this could change in the future.

[1] For low-cost storage, RAID-5 is no longer regarded as safe for today's typical storage sizes, because the error rate is regarded as too high. Therefore, use RAID-6. If you need more than 15 disks in total, create multiple RAID sets (each having at most 15 disks, better about 12 disks) and stripe them via LVM (or via your hardware RAID controller if it supports RAID-60).
[2] You may also combine MARS with commercial storage boxes connected via Fibrechannel or iSCSI, but we have no operational experience at 1&1 with such setups yet.

Use ext4 for /mars/. Avoid ext3, and don't use xfs [3] at all. Notice that the filesystem /mars/ has nothing to do with an ordinary filesystem. It is completely reserved for MARS-internal purposes, namely as a storage container for MARS' persistent data. It does not obey any userspace rules like the FHS (filesystem hierarchy standard), and it should not be accessed by any userspace tool except the official marsadm tool. Its internal data format should be regarded as a black box by you. The internal data format may change in the future, or the complete /mars/ filesystem may even be replaced by a totally different container format, while the official marsadm interface is supposed to remain stable. That said, you might look into its contents by hand out of curiosity or for debugging purposes, and only as root. But don't program any tools / monitoring scripts / etc bypassing the official marsadm tool.

Like DRBD, the current version of MARS has no security built in. MARS assumes that it is running in a trusted network. Anyone who can connect to the MARS ports (default 7777 to 7779) can potentially break in and become root! Therefore, you must protect your network by appropriate means, such as firewalling and/or an encrypted VPN. Currently, MARS provides no shared secret like DRBD, because a simple shared secret is way too weak to provide any real security (potentially misleading people about the real level of security). Future versions of MARS should provide at least 2-factor authorization, and encryption via dynamic session keys. Until that is implemented, use a secured VPN instead! And don't forget to audit it for security holes!
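As one possible way of implementing such protection, the following iptables sketch restricts the MARS ports to a single replication peer. The peer address 192.0.2.10 and the choice of iptables are illustrative assumptions, not part of MARS itself; nftables, a dedicated replication VLAN, or a VPN are equally valid approaches.

    # Hypothetical example: only the replication peer may reach the MARS ports.
    iptables -A INPUT -p tcp --dport 7777:7779 -s 192.0.2.10 -j ACCEPT
    iptables -A INPUT -p tcp --dport 7777:7779 -j DROP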
3.2. Setup Primary and Secondary Cluster Nodes

If you already use DRBD, you may migrate to MARS (or even back from MARS to DRBD) if you use external [4] DRBD metadata (which is not touched by MARS).

3.2.1. Kernel and MARS Module

The MARS kernel module should be available or can be built via one of the following methods:

1. As an external Debian or rpm kernel module, as provided by a package contributor (or hopefully by standard distros in the future).

2. As a separate kernel module, only for experienced [5] sysadmins: see file Makefile.dist (tested with some older versions of Debian; may need some extra work with other distros).

3. Build for senior sysadmins or developers, in place in the kernel source tree: first apply 0001-mars-minimum-pre-patch-for-mars.patch and 0001-mars-SPECIAL-for-in-tree-build.patch or similar, then cd block/ && git clone --recurse-submodules https://github.com/schoebel/mars. Then cd .. and build your kernel as usual. Config options for MARS should appear under "Enable the block layer". Just activate MARS as a kernel module via "m" (don't try a fixed compile-in), and leave all other MARS config options at their defaults (unless you know what you are doing). A condensed command sketch follows at the end of this subsection.

Further / more accurate / latest instructions can be found in README and in INSTALL. You must install not only the kernel and the mars.ko kernel module on all of your cluster nodes, but also the marsadm userspace tool.

Starting with mars0.1stable38 and other branches having merged this feature, a pre-patch for vanilla kernels 3.2 through 4.4 is no longer needed. However, IO performance is currently somewhat worse when the pre-patch is not applied. This will be addressed in a later release. Therefore, application of the pre-patch to the kernel is recommended for large-scale production systems for now. Kernel pre-patches can be found in the pre-patches/ subdirectory of the MARS source tree. The following types of pre-patches exist:

• 0001-mars-minimum-pre-patch-for-mars.patch or similar. Please prefer this one (when present for your kernel version) over 0001-mars-generic-pre-patch-for-mars.patch or similar. The latter should not be used anymore, except for testing or as an emergency fallback.

• 0001-mars-SPECIAL-for-in-tree-build.patch or similar. This is only needed when building the MARS kernel module together with all other kernel modules in a single make pass. For separate external module builds, this patch must not be applied (but the pre-patch should be, when possible). When using this patch, please also apply the aforementioned pre-patch, because your kernel is patched anyway.

[3] It seems that the late internal resource allocation strategy of xfs (or some other, currently unknown cause) is responsible for some resource deadlocks which appear only with xfs, and only under extremely high IO load in combination with high memory pressure.
[4] Internal DRBD metadata should also work as long as the filesystem inside your block device / disk already exists and is not re-created. The latter would destroy the DRBD metadata, but even that will not hurt you really: you can always switch back to DRBD using external metadata, as long as you have some small spare space somewhere.
[5] You should be familiar with the problems arising from the orthogonal combination of different kernel versions with different MARS module versions and with different marsadm userspace tool versions at the package management level. Hint: modinfo is your friend.
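Here is the condensed, hypothetical command sequence for the in-tree build announced under method 3 above. The kernel tree path and the exact patch file names are placeholders and must match your kernel version; treat this as a sketch rather than verified build instructions, and consult README and INSTALL for the authoritative procedure.

    cd /usr/src/linux                 # your kernel source tree (placeholder path)
    patch -p1 < /path/to/mars/pre-patches/0001-mars-minimum-pre-patch-for-mars.patch
    patch -p1 < /path/to/mars/pre-patches/0001-mars-SPECIAL-for-in-tree-build.patch
    cd block/
    git clone --recurse-submodules https://github.com/schoebel/mars
    cd ..
    make menuconfig                   # under "Enable the block layer", set MARS to "m"
    make -j"$(nproc)" && make modules_install && make install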
Starting from version mars0.1stable56 or mars0.1abeta8, submodules have been added to the github repo of MARS. If you have an old checkout, please say git pull --recurse-submodules=yes or similar. Otherwise you may miss an important future part of the MARS release, without notice (depending on your local git version and its local configuration).

3.2.2. Setup your Cluster Nodes

For your cluster, you need at least two nodes. In the following, they will be called A and B. In the beginning, A will have the primary role, while B will be your initial secondary. The roles may change later.

1. You must be root.

2. On each of A and B, create the /mars/ mountpoint.

3. On each node, create an ext4 filesystem on your separate disk / RAID set via mkfs.ext4 (for requirements on size etc. see section Preparation: What you Need).

4. On each node, mount that filesystem to /mars/. It is advisable to add an entry to /etc/fstab.

5. For security reasons, execute chmod 0700 /mars everywhere after /mars/ has been mounted. If you forget this step, any following marsadm command will print a warning, but will fix the problem for you.

6. On node A, say marsadm create-cluster. This must be done exactly once, on exactly one node of your cluster. Never do this twice or on different nodes, because that would create two different clusters which would have nothing to do with each other. The marsadm tool protects you against accidentally joining / merging two different clusters. If you accidentally created two different clusters, just umount that /mars/ partition and start over with step 3 at that node.

7. On node B, you must have a working ssh connection to node A (as root). Test it by saying ssh A w on node B. It should work without entering a password (otherwise, use ssh-agent to achieve that). In addition, rsync must be installed.

8. On node B, say marsadm join-cluster A

9. Only after that [6], do modprobe mars on each node.

3.3. Creating and Maintaining Resources

In the following example session, a block device /dev/lv-x/mydata (called the disk for short) must already exist on both nodes A and B, having the same [7] size. For the sake of simplicity, the disk (underlying block device), its later logical resource name, and its later virtual device name will all be named uniformly by the same suffix mydata. In general, you might name each of them differently, but that is not recommended since it may easily lead to confusion in larger installations.

You may already have some data inside your disk /dev/lv-x/mydata at the initially primary side A. Before using it for MARS, it must be unused for any other purpose (such as being mounted, or used by DRBD, etc). MARS will require exclusive access to it.

1. On node A, say marsadm create-resource mydata /dev/lv-x/mydata. As a result, a directory /mars/resource-mydata/ will be created on node A, containing some symlinks. Node A will automatically start in the primary role for this resource. Therefore, a new pseudo-device /dev/mars/mydata will also appear after a few seconds. Note that the initial contents of /dev/mars/mydata will be exactly the same as in your pre-existing disk /dev/lv-x/mydata. If you like, you may already use /dev/mars/mydata for mounting your already pre-existing data, or for creating a fresh filesystem, or for exporting via iSCSI, and so on. You may even do so before any other cluster node has joined the resource (so-called "standalone mode").
But you can also do so later, after setup of (one or many) secondaries.

2. Wait a few seconds until the directory /mars/resource-mydata/ and its symlink contents also appear on cluster node B. The command marsadm wait-cluster may be helpful.

3. On node B, say marsadm join-resource mydata /dev/lv-x/mydata. As a result, the initial full-sync from node A to node B should start automatically. Of course, the old contents of your disk /dev/lv-x/mydata at side B (and only there!) are overwritten by the version from side A. Since you are an experienced sysadmin, you knew that, and it was just the effect you deliberately wanted to achieve. If you didn't check that your old contents didn't contain any valuable data (or if you accidentally provided a wrong disk device argument), it is too late now. The marsadm command checks that the disk device argument is really a block device, and that exclusive access to it is possible (as well as some further safety checks, e.g. matching sizes). However, MARS cannot know the purpose of your generic block device. MARS (as well as DRBD) is completely ignorant of the contents of a generic block device; it does not interpret it in any way. Therefore, you may use MARS (as well as DRBD) for mirroring Windows filesystems, or raw devices from databases, or virtual machines, or whatever. Check that the state Orphan is left after a while on B. Notice that join-resource only starts a new replica; it does not wait for its completion. Hint: by default, MARS uses the so-called "fast fullsync" algorithm. It works similarly to rsync: it first reads the data on both sides and computes an md5 checksum for each block. Heavy-weight data is only transferred over the long-distance network upon checksum mismatch. This is extremely fast if your data is already (almost) identical on both sides. Conversely, if you know in advance that your initial data is completely different on both sides, you may choose to switch off the fast fullsync algorithm via echo 0 > /proc/sys/mars/do_fast_fullsync in order to save the additional IO overhead and network latencies introduced by the separate checksum comparison steps.

4. Optionally, only for experienced sysadmins who really know what they are doing: if you will create a new filesystem on /dev/mars/mydata after(!) having created the MARS resource as well as after having already joined it on every replica, you may abandon the fast fullsync phase before creating the fresh filesystem, because the old content of /dev/mars/mydata will then be just garbage not used by the freshly created filesystem [8]. Then, and only then, you may say marsadm fake-sync mydata in order to abort the sync operation. Never do a fake-sync unless you are absolutely sure that you really don't need to sync the data! Otherwise, you are guaranteed to have produced harmful inconsistencies. If you accidentally issued fake-sync, you may start the fast fullsync over at your secondary side by saying marsadm invalidate mydata (analogously to the corresponding DRBD command).

[6] In fact, you may already modprobe mars at node A after the marsadm create-cluster. Just don't do any of the *-cluster operations while the kernel module is loaded. All other operations should have no such restriction.
[7] Actually, the disk at the initially secondary side may be larger than that at the initially primary side. This will waste space and is therefore not recommended.
[8] It is vital that the transaction logfile contents created by mkfs are fully propagated to the secondaries and then replayed there. Analogously, another exception is also possible, but at your own risk (be careful, really!): when migrating your data from DRBD to MARS, and you have ensured that (1) at the end of using DRBD both your replicas were really equal (you should have checked that), and (2) before and after setting up any side of MARS (create-resource as well as join-resource) nothing has been written to it at all (i.e. no usage of either /dev/lv/mydata or /dev/mars/mydata has occurred in any way), the first transaction logfile /mars/resource-mydata/log-000000001-$primary created by MARS will be empty. Check whether this is really true! Then, and only then, you may also issue a fake-sync.
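Putting sections 3.2.2 and 3.3 together, a minimal example session might look like the following sketch. It merely collects the commands explained above, using nodes A and B and the disk /dev/lv-x/mydata from the running example; the device name /dev/sdX for the dedicated /mars/ disk is a placeholder.

    # on both A and B:
    mkdir /mars
    mkfs.ext4 /dev/sdX
    mount /dev/sdX /mars && chmod 0700 /mars

    # on A only:
    marsadm create-cluster
    # on B only (root ssh to A and rsync must work):
    marsadm join-cluster A

    # on each node, only after the *-cluster commands:
    modprobe mars

    # on A (node A becomes primary, /dev/mars/mydata appears):
    marsadm create-resource mydata /dev/lv-x/mydata
    # on B (starts the initial full-sync):
    marsadm wait-cluster
    marsadm join-resource mydata /dev/lv-x/mydata
    marsadm view mydata    # watch the sync progress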
3.4. Keeping Resources Operational

3.4.1. Logfile Rotation / Deletion

As explained in section The Transaction Logger, all changes to your resource data are recorded in transaction logfiles residing on the /mars/ filesystem. These files are always growing over time. In order to avoid filesystem overflow, the following must be done in regular time intervals:

1. marsadm log-rotate all
This starts appending to a new logfile on all of your resources. The logfiles are automatically numbered by an increasing 9-digit logfile number. This will suffice for many centuries, even if you were to log-rotate once a minute. Practical frequencies for logfile rotation are more like once an hour, or every 10 minutes on highly loaded storage servers.

2. marsadm log-delete-all all
This determines all logfiles from all resources which are no longer needed (i.e. which are fully replayed on all relevant secondaries). All superfluous logfiles are then deleted, including all copies on all secondaries.

The current version of MARS deletes either all replicas of a logfile everywhere, or none of them. This is a simple rule, but it has the drawback that one node may hinder other nodes from freeing space in /mars/. In particular, the command marsadm pause-replay $res (as well as marsadm disconnect $res) will freeze the space reclamation in the whole cluster when the pause lasts very long. During such space accumulation, the number of so-called deletions will also accumulate in /mars/todo-global/ and sibling directories. In very big installations consisting of thousands of nodes, it is a good idea to regularly monitor the number of deletions, similarly to the following: $(find /mars/ -name "delete-*" | wc -l) should not exceed a limit of ~150 entries.

Please prefer the short form marsadm cron as an equivalent to scripting the two separate commands marsadm log-rotate all and marsadm log-delete-all all. The short form is not only easier to remember, but also future-proof in case some new MARS features should be implemented in the future. Best practice is to run marsadm cron in a cron job, such as /etc/cron.d/mars. An example cronjob can be found in the userspace/cron.d/ subdirectory of the git repo. In addition, you should establish some regular monitoring of the free space present in the /mars/ filesystem. More detailed information about avoidance of /mars/ overflow is in section 4.4.
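For illustration, a cron entry along these lines could be used. This is only a sketch: the 10-minute interval is just one of the frequencies suggested above, and the example shipped in userspace/cron.d/ of the git repo is the authoritative template.

    # /etc/cron.d/mars -- rotate and clean up MARS transaction logfiles
    */10 * * * *  root  marsadm cron
    # monitoring hint from the text: the number of pending deletions, i.e.
    #   find /mars/ -name "delete-*" | wc -l
    # should stay below roughly 150 entries.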
3.4.2. Switch Primary / Secondary Roles

Scenario                            | Method             | Reconstruction Phase
Planned Handover                    | Intended Switching | -
Temporary Node or Network Failure   | Forced Switching   | Split Brain Resolution
Dead Node                           | Forced Switching   | Destruction of Damaged Node / Rebuild of Redundancy

In contrast to DRBD, MARS distinguishes between intended and forced switching. This distinction is necessary due to differences in the communication architecture (asynchronous vs synchronous communication, see sections 4.2 and 4.3). Asynchronous communication means that (in the worst case) a message may take an (almost) arbitrary time to propagate to another node in a distorted network. As a consequence, the risk of accidentally creating an (unintended) split brain is increased (compared to a synchronous system like DRBD). In order to minimize this risk, MARS has invested a lot of effort into an internal handover protocol which is used when you start an intended primary switch.

3.4.2.1. Intended Switching / Planned Handover

Before starting a planned handover from your old primary A to a new primary B, you should check the replication of the resource. As a human, use marsadm view mydata. For scripting, use the macros from section 5.1.2 (see also section 5.3; an example can be found in contrib/example-scripts/check-mars-switchable.sh). The network should be OK, and the amount of replication delay should be as low as possible. Otherwise, handover may take a very long time.

Best practice is to prepare a planned handover by the following steps:

1. Check the network and the replication lag. It should be low (a few hundred megabytes, or a low number of gigabytes - see also the rough time forecast shown by marsadm view mydata when there is a larger replication delay, or directly access the forecast by marsadm view-replinfo).

2. Only when the systemd method from section 7.2 is not used: stop your application, then umount /dev/mars/mydata on host A.

3. Optionally: when the systemd method from section 7.2 is not used, and when scripting something else, or when typing extremely fast by hand, or for better safety: say marsadm wait-umount mydata on host B. When your network is OK, the propagation of the device usage state [9] should take only a few seconds. Otherwise, check for network problems or any other problems. This step is not really necessary, because marsadm primary will also wait for the umount before it proceeds. However, scripting this intermediate step gives you some more options: if the umount takes too long, you may program a different action, like re-starting at the old primary, or the contrary, a forced umount, or even continuing with a forceful failover instead (see section 3.4.2.2).

4. Optionally, and when the systemd method from section 7.2 is not used: on host B, wait until marsadm view mydata (or view-diskstate) shows UpToDate. It is possible to omit this step, but then you have no control over the duration of the handover, and in case of any transfer problems, disk space problems, etc. you potentially risk producing a split brain (although marsadm will do its best to avoid it). Doing the wait by yourself, before starting marsadm primary, has a big advantage: you can abort the handover cycle at any time, just by re-mounting the device /dev/mars/mydata at the old primary A again, and by re-starting your application. Once you have started marsadm primary on host B, you might have to switch back, or possibly even via primary --force (see sections 3.4.2.2 and 3.4.3).
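As a rough sketch of this preparation sequence plus the actual switch described next (without systemd integration; check-mars-switchable.sh in contrib/example-scripts/ is the official reference for the check part, and all names follow the running example):

    # on host A (old primary): stop your application first, then
    umount /dev/mars/mydata

    # on host B (new primary):
    marsadm wait-umount mydata
    marsadm view mydata        # repeat until the replica reports UpToDate
    marsadm primary mydata     # the actual handover, as explained below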
Switching the roles is very similar to DRBD: just issue the command

• marsadm primary mydata

on your formerly secondary node B. In combination with a properly set-up systemd method (see section 7.2), this will even automatically start your application at the new site. The most important difference to DRBD: don't use an intermediate marsadm secondary mydata anywhere. Although it would be possible, it has some disadvantages. Always switch directly!

In contrast to DRBD, MARS remembers the designated primary, even when your system crashes and reboots. While in case of a crash you have to re-setup DRBD with commands like drbdadm up ...; drbdadm primary ..., MARS will automatically resume its former roles just by saying modprobe mars. In combination with a properly set-up systemd method (see section 7.2), this will even automatically re-start your application.

Another fundamental difference to DRBD: when the network is healthy, there can only exist one designated primary at a time (modulo some communication delays caused by the "eventually consistent" communication model, see section 4.2). By saying marsadm primary mydata on host B, all other hosts (including A) will automatically go into secondary role after a while! You simply don't need an intermediate marsadm secondary mydata for planned handover!

The preconditions for a plain marsadm primary (without systemd) are that you are up, which means in attached and connected state (cf. marsadm up), that you are no longer a sync target, and (only when systemd isn't configured to automatically stop the application at the old site) that any old primary (in this case A) no longer uses its /dev/mars/mydata device, and that the network is healthy. If some (parts of) logfiles are not yet (fully) transferred to the new primary, you will need enough space on /mars/ at the target side. If one of the preconditions described in section 6.2.2 is violated, marsadm primary may refuse to start. These preconditions try to protect you from doing silly things, such as accidentally provoking a split brain error state.

We try to avoid split brain as best as we can. Therefore, we distinguish between intended and emergency switching. Intended switching will try to avoid split brain as best as it can. Don't rely on split brain avoidance, in particular when scripting any higher-level applications such as cluster managers (cf. section 5.3). marsadm does its best, but at least in case of (unnoticed) network outages / partitions (or extremely, really extremely slow / overloaded networks), an attempt to become UpToDate may fail.

[9] Notice that the usage check for /dev/mars/mydata on host B is based on the open count transferred from the other node A. Since MARS operates asynchronously (in contrast to DRBD), it may take some time until our node B knows that the device is no longer used at A. This can lead to a race condition if you automate an intended takeover with a script like ssh root@A "umount /dev/mars/mydata"; ssh root@B "marsadm primary mydata", because your second ssh command may be faster than the internal MARS symlink tree propagation (cf. section 4.3). In order to prevent such races, you are strongly advised to use the command marsadm wait-umount mydata on node B before trying to become primary. See also section 5.3.
If you want to ensure that no split brain can result from intended primary switching, please obey the best practices from above, and please give the primary command only after your secondary is known [10] to be really UpToDate (see marsadm wait-cluster and marsadm view and other macros described in section 3.6). A very rough estimation of the time to become UpToDate is displayed by marsadm view mydata or other macros (e.g. view-replinfo). However, on very flaky networks, the estimation may not only flicker a lot, but also be inaccurate.

Planned handover is refused by default when some sync is running somewhere. By adding the option --ignore-sync, you are no longer protected by this safety measure, and you are willing to accept that any already running syncs will restart from point 0, in order to ensure consistency.

3.4.2.2. Forced Switching

In case the connection to the old primary is lost for whatever reason, we just don't know anything about its current state (which may deviate from its last known state). The following command sequence will skip many checks (essentially you just need to be attached and you must not be a current sync target) and tell your node to become primary forcefully:

• marsadm pause-fetch mydata
Notice that this is similar to the drbdadm disconnect mydata you are probably used to from DRBD. For better compatibility with DRBD, you may use the alternate syntax marsadm disconnect mydata instead. However, there is a subtle difference to DRBD: DRBD will drop both sides of its single bi-directional connection and no longer try to reconnect from either side, while pause-fetch is equivalent to pause-fetch-local, which instructs only the local host to stop fetching logfiles. Other members of the cluster, including the former primary, are not instructed to do so. They may continue fetching logfiles over their own private TCP connections, potentially using many connections in parallel, and potentially even from any other member of the resource, if they think they can get the data from there. In order to instruct [11] all members of the resource to stop fetching logfiles, you may use marsadm pause-fetch-global mydata instead (cf. section 6.2.2).

• marsadm primary mydata --force
This is the forceful failover. Depending on the current replication lag, you may lose some data. Use --force only if you know what you are doing! When systemd is configured properly (see section 7.2), your application will start automatically at the new primary site. When the network is interrupted, the old primary site cannot know this, and will continue running. Once the metadata exchange is working again (by default on port 7777), the old site will be automatically shut down by its local systemd configuration, when configured properly (see section 7.2). In contrast to the planned handover from section 3.4.2.1, this may happen much later. In case of long-lasting network outages, even days or weeks! Running both sites in parallel for a long time may seriously damage your business. Ensure that no customer traffic can go to the old site! Be sure to configure your BGP in a proper way, such that only the new site, and only the new site, will receive any customer traffic from both inside and outside networks, like the internet.

[10] As noted in many places in this manual, checking this cannot be done by looking at the local state of a single cluster node. You have to check several nodes. marsadm can only check the local node reliably!
[11] Notice that not all such instructions may arrive at all sites when the network is interrupted (or extremely slow).
• marsadm resume-fetch mydata
As such, the new primary does not really need this, because primaries produce their own logfiles and have no need for fetching. This only undoes the previous pause-fetch, in order to avoid future surprises when the new primary changes to secondary mode again at some point (in the far-distant future) and you have forgotten that fetching had been switched off.

When using --force, many precondition checks and other internal checks are skipped, in particular the internal handover protocol for split brain avoidance. Therefore, use of --force is likely to provoke a split brain. Split brain is always an erroneous state which should never be entered deliberately! Once you have entered it accidentally, you must resolve it ASAP (see section 3.4.3), otherwise you cannot operate your resource in the long term. In order to keep you from giving an accidental --force, the precondition is different: --force works only in locally disconnected state. This is similar to DRBD.

Remember: marsadm primary without --force tries to prevent split brain as best as it can. Use of the --force option will almost certainly provoke a split brain, at least if the old primary continues to operate on its local /dev/mars/mydata device. Therefore, you are strongly advised to do this only after

1. marsadm primary without --force has failed for no good reason [12], and

2. you are sure you really want to switch, even when that eventually leads to a split brain. You also declare that you are willing to do manual split-brain resolution as described in section 3.4.3, or even destruction / reconstruction of a damaged node as described in section 3.4.4.

Notice: in case of connection loss (e.g. networking problems / network partitions), you may not be able to reliably detect whether a split brain actually resulted, or not.

Some Background
Split brain situations are handled differently by MARS than by DRBD. When two primaries are accidentally active at the same time, each of them writes into different logfiles /mars/resource-mydata/log-000000001-A and /mars/resource-mydata/log-000000001-B, where the origin host is always recorded in the filename. Therefore, both nodes can theoretically run in primary mode independently from each other, at least for some time. They might even log-rotate independently from each other. However, this is really not a good idea. The replication to third nodes will likely get stuck, and your /mars/ filesystem(s) will eventually run out of space. Any further secondary node (when having k > 2 replicas) will certainly get into serious problems: it simply does not know which split-brain version it should follow. Therefore, you will certainly lose the actuality of your redundancy.

marsadm secondary is strongly discouraged. It tells the whole cluster that nobody is designated as primary any more. All nodes should go into secondary mode, globally. In the current version of MARS, the secondaries will then no longer fetch any logfiles, since they don't know which version is the "right" one. Syncing is also not possible. Wherever the device /dev/mars/mydata is in use, that node will remain in actual primary mode during that time. As soon as the local /dev/mars/mydata is released, the node will actually go into secondary mode if it is no longer designated as primary.
You should avoid it in advance by always directly switching over from one primary to another, without an intermediate secondary command. This is different from DRBD.

Split brain situations are detected passively by secondaries. Whenever a secondary detects that a split brain has happened somewhere, it refuses to replay any logfiles behind the split point (and also to fetch them when possible), or anywhere where something appears suspect or ambiguous. This tries to keep its local disk state always consistent, but outdated with respect to any of the split brain versions. As a consequence, becoming primary may be impossible, because it cannot always know which logfiles are the correct ones to replay before /dev/mars/mydata can appear. The ambiguity must be resolved first. If you really need the local device /dev/mars/mydata to disappear everywhere in a split brain situation, you don't need the strongly discouraged marsadm secondary command for this. marsadm detach or marsadm down can do it as well, without destroying knowledge about the former designated primary.

marsadm primary --force is rejected in newer [13] marsadm versions if your replica is a current sync target. This is not a bug: it should prevent you from forcing an inconsistent replica into primary mode, which would certainly lead to inconsistent data. However, in extremely rare cases of severe damage of all of your replicas, you may be desperate. Only in such a rare case, and only then, you might decide to force one of your replicas (e.g. based on their last sync progress bar) into primary role although none of the re-syncs had finished before. In such a case, and only if you really know what you are doing, you may use marsadm fake-sync to first mark your inconsistent replica as UpToDate (which is a lie) and then force it to primary as explained above. Afterwards, you will certainly need an fsck or similar repair before you can restart your application. Good luck! And don't forget to check the size of lost+found afterwards. This is really your very last chance if nothing else has succeeded before.

[12] Most reasons will be displayed by marsadm when it is rejecting the planned handover.
[13] Beware: older versions before mars0.1stable52 deliberately skipped this check because, a few years ago, somebody at 1&1 placed a requirement on this. Fortunately, that requirement is now gone, so a safer behaviour could be implemented. The new behaviour is for your safety, to prevent you from doing "silly" things in case you are under pressure during an incident (trying to safeguard against human error as best as possible).

3.4.3. Split Brain Resolution

Split brain can naturally occur during a long-lasting network outage (aka network partition) when you (forcefully) switch primaries in between, or due to final loss of your old primary node (fatal node crash) when not all logfile data had been transferred immediately before the final crash. Remember that split brain is an erroneous state which must be resolved as soon as possible! Whenever split brain occurs, for whatever reason, you have two choices for resolution: either destroy one of your versions, or retain it under a different resource name. In either case, do the following steps ASAP:

1. Manually check which (surviving) version is the "right" one. Any error is up to you: destroying the wrong version is your fault, not the fault of MARS.
2. If you did not already switch your primary to the final destination determined in the previous step, do it now (see the description in section 3.4.2.2). Don't use an intermediate marsadm secondary command (as known from DRBD): directly switch to the new designated primary!

3. Unless systemd is configured properly (see section 7.2), do the following manually: on each node carrying a non-right version (which you don't want to retain) and which had been primary before, umount your /dev/mars/mydata or otherwise stop using it (e.g. stop iSCSI or other users of the device). Wait until each of them has actually left the primary state and until their local logfile(s) have been fully written back to the underlying disk.

4. Wait until the network works again. All your (surviving) cluster nodes must [14] be able to communicate with each other. If that is not possible, or if it takes too long, you may fall back to the method described in section 3.4.4, but do this only as far as necessary.

The next steps are different for different use cases:

Destroying a Wrong Split Brain Version
Continue with the following steps on each of those cluster node(s) whose split-brain version you do not want to retain. Preferably, start with the old "wrong" primaries first (see the advice at the end of this section):

5. marsadm invalidate mydata
When no split brain is reported anymore after that (via marsadm view all), you are done. You need to repeat this on other secondaries only when necessary. In very rare cases, when things are screwed up very heavily (e.g. a partly destroyed /mars/ partition), you may try an alternate method described in appendix C. Check that the state Orphan is left after a while. Notice that invalidate only restarts an existing replica; it does not wait for its completion.

[14] If you are a MARS expert and you really know what you are doing (in particular, you can anticipate the effects of the Lamport clock and of the symlink update protocol, including the "eventually consistent" behaviour and its not-yet-consistent intermediate states, see sections 4.2 and 4.3), you may deviate from this requirement.

Keeping a Split Brain Version (e.g. for inspection purposes)
On those cluster node(s) where you want to retain the version:

5. marsadm leave-resource mydata

6. After having done this on all those cluster nodes, check that the split brain is gone (e.g. by saying marsadm view mydata), as documented above. In very rare cases, you might also need a log-purge-all (see page 109).

7. Rename the underlying local disk /dev/lv-x/mydata into something like /dev/lv-x/mynewdata (see man lvrename). This is strongly recommended to avoid confusion with the old resource name!

8. Check that each underlying local disk /dev/lv-x/mynewdata is really usable afterwards, e.g. by test-mounting it (or fsck, if you can afford it). If all is OK, don't forget to umount it before proceeding with the next step.

9. Create a completely new MARS resource out of the underlying disk /dev/lv-x/mynewdata, having a different name; best is mynewdata (see the description in section 3.3 on page 59).

Generally: best practice is to always keep your LV names equal to your MARS resource names. This can avoid a lot of unnecessary confusion.

Keeping a Good Version
When you had a secondary which did not participate in the split brain, but just got confused and therefore stopped replaying logfiles immediately before the split-brain point, it may very well happen [15] that you don't need to take any action for it.
When all wrong versions have disappeared from the cluster (by invalidate or leave-resource as described before), the confusion should be over, and the secondary should automatically resume tracking the new unique version. Please check that all of your secondaries are no longer stuck. You need to execute split brain resolution only for stuck nodes.

[15] In general, such "good" behaviour cannot be guaranteed for all secondaries. Race conditions in complex networks may asynchronously transfer "wrong" logfile data to a secondary much earlier than conflicting "good" logfile data which will be marked "good" only in the future. It is impossible to predict this in advance.

Hint / advice for k > 2 replicas: it is a good idea to start split brain resolution first with those (few) nodes which had been (accidentally) primary before, but are not the new designated primary. Usually, you had 2 primaries during split brain, so this will apply to only one of them. Leave the other one intact, by not umounting /dev/mars/mydata at all and by keeping your applications running, even during emergency mode (see section 4.4.2). First resolve the problem of the "wrong" primary(s) via invalidate or leave-resource. Wait for a short while. Then check whether the rest of your secondaries are now already following the new (unique) primary, and finally check whether the split brain warning reported by marsadm view all is gone everywhere. This way, you can often skip unnecessary invalidations of replicas.

3.4.4. Final Destruction of a Damaged Node

When a node has eventually died, do the following steps ASAP:

1. Physically remove the dead node from your network. Unplug all network cables! Failing to do so might provoke a disaster in case it somehow resurrects in an uncontrolled manner, e.g. with a partly damaged /mars/ filesystem, a half-defective kernel, RAM / kernel memory corruption, disk corruption, or whatever. Don't risk any such unpredictable behaviour!

2. Manually check which of the surviving versions will be the "right" one. Any error is up to you: resurrecting an unnecessarily old / outdated version and/or destroying the newest / best version is your fault, not the fault of MARS.

3. If you did not already switch your primary to the final destination determined in the previous step, do it now (see the description in section 3.4.2.2).

4. On a surviving node, but preferably not the new designated primary, give the following commands:
a) marsadm --host=your-damaged-host down mydata
b) marsadm --host=your-damaged-host leave-resource mydata
Check for misspellings, in particular the hostname of the dead node, and check the command syntax before typing return! Otherwise, you may forcefully destroy the wrong [16] node!

5. In case any of the previous commands should fail (which is rather likely), repeat it with an additional --force option. Don't use --force in the first place; always try without it first!

6. Repeat the same for all resources which were formerly present at your-damaged-host.

7. Finally, say marsadm --host=your-damaged-host leave-cluster (optionally augmented with --force). Now your surviving nodes should believe that the old node your-damaged-host no longer exists, and that it no longer participates in any resource.

Even if your dead node comes to life again in some way: always ensure that the mars kernel module cannot run any more. Never do a modprobe mars on a node marked as dead this way! Further instructions for complicated cases are in appendices D and E.

[16] That said, MARS is rather tolerant of human error. Once, a sysadmin accidentally destroyed a cluster while it was continuously running as primary. Fortunately, the problem was detected early enough for a correction without causing any extraordinary customer downtime outside of accepted tolerances, and without any data loss at all.
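Summarizing steps 4 to 7 in one place, the removal of a dead node might look like the following sketch; your-damaged-host and mydata are the placeholders from the text, and --force is appended only when a command is refused.

    # on a surviving node, preferably not the new designated primary:
    marsadm --host=your-damaged-host down mydata
    marsadm --host=your-damaged-host leave-resource mydata
    # ...repeat the two commands above for every resource of the dead node, then:
    marsadm --host=your-damaged-host leave-cluster
    # if one of the commands is refused, retry it with an additional --force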
3.4.5. Online Resizing during Operation

You should have LVM or some other means of increasing the physical size of your disk (e.g. via the firmware of some RAID controllers). The network must be healthy. Do the following steps:

1. Increase your local disks (usually /dev/vg/mydata) everywhere in the whole cluster. In order to avoid wasting space, increase them uniformly to the same size (when possible). The lvresize tool is documented elsewhere.

2. Check that all MARS switches are on. If not, say marsadm up mydata everywhere.

3. At the primary: marsadm resize mydata

4. If you have intermediate layers such as iSCSI, you may need some iscsiadm update or other command.

5. Now you may increase your filesystem. This is specific to the filesystem type and documented elsewhere.

Hint: the secondaries will start syncing the newly added part of the underlying primary disk. In many cases, this is not really needed, because the new part contains only junk data which nobody cares about. If you are sure and if you know what you are doing, you may use marsadm fake-sync mydata to abort such unnecessary traffic.

3.5. The State of MARS

In general, MARS tries to hide any network failures from you as best as it can. After a network problem, any internal low-level socket connections are transparently re-opened ASAP, without need for sysadmin intervention. In contrast to DRBD, network failures will not automatically alter the state of MARS, such as switching to disconnected after a ko_timeout or similar. From a high-level sysadmin viewpoint, communication may just take a very long time to succeed. When the behaviour of MARS is different from DRBD, it is usually intended as a feature.

MARS is an asynchronous system not only at the block IO level, but also at the control level. This is necessary because in a widely distributed long-distance system running on slow or even temporarily failing networks, actions may take a long time, and many actions may be started in parallel. Synchronous concepts are generally not sufficient for expressing that. Because of the inherent asynchronicity and the dynamic creation / joining of resources, it is not possible to comprehensively depict a complex distributed MARS system, nor even a standalone snippet of MARS, as a finite state transition diagram [17]. Although MARS tries to approximate / emulate the synchronous control behaviour of DRBD at the interface level (marsadm) in many situations as best as it can, the internal control model is necessarily asynchronous. As an experienced sysadmin, you will be curious how it works in principle. When you know something about it, you will no longer be surprised when some detail of the behaviour is different from DRBD.

The general principle is an asynchronous 2-edge handshake protocol, which is used almost everywhere in MARS:

[Diagram: a timing diagram of the 2-edge handshake, showing a "todo switch" trace and an "actual response" trace over time, with transition points 1 to 4.]

We have a binary todo switch, which can be either in state "on" or "off". In addition, we have an actual response indicator, which is similar to an LED indicating the actual status. In our example, we imagine that both are used for controlling a big ventilator having a huge inert mass. Imagine a big machine from a power plant, which is as tall as a human.
We start in a situation where the binary switch is off, and the ventilator is stopped. At point 1, we turn on the switch. At that moment, a big contactor will sound like "zonggg", and a big motor will start to hum. At first you won't hear anything else. It will take a while, say 1 minute, until the big wheel has reached its final operating RPM, due to the huge inert mass. During that spin-up, the lights in your room will become slightly darker. When the full RPM has been reached at point 2, your workplace will be noisier, but in exchange your room lights will be back at ordinary strength, and the actual response LED will light up in order to indicate that the big fan is now operational.

Assume we want to turn the system off. When turning the todo switch to "off" at point 3, at first nothing will seem to happen at all. The big wheel will keep spinning due to its heavy inert mass, and the RPM as well as the sound will go down only slowly. During spin-down, the actual response LED will stay illuminated, in order to warn you that you should not touch the wheel; otherwise you may get injured [18]. The LED will only go off after, say, 2 minutes, when the wheel has actually stopped at point 4. After that, the cycle may potentially start over again.

As you can see, all four possible combinations of the two boolean values (their cartesian product) occur in the diagram. The same handshake protocol is used in MARS for communication between userspace and kernelspace, as well as for communication in the widely distributed system.

[17] It would probably be possible to formally model MARS as a Petri net. However, complete Petri nets tend to become very complex and to describe lots of low-level details. Expressing hierarchy in a top-down fashion is cumbersome. We see no point in trying to do so.
[18] Notice that it is only safe to access the wheel when both the switch and the LED are off. Conversely, if at least one of them is on, something is going on inside the machine. Transferred to MARS: always look at both the todo switch and the corresponding actual indicator in order not to miss something.

3.6. Inspecting the State of MARS

The main command for viewing the current state of MARS is

• marsadm view mydata

or its more specialized variant

• marsadm view-$macroname mydata

where $macroname is one of the macros described in chapter 5, or a macro which has been written by yourself. As always, you may replace the resource name mydata with the special keyword all in order to get the state of all locally joined resources, as well as a list of all those resources. When using the variant marsadm view all, the global communication status will additionally be displayed. This helps humans in diagnosing problems. Hint: use the compound command watch marsadm view all for continuous display of the current state of MARS. When starting this side-by-side in ssh terminal windows for all your cluster nodes, you can easily watch what's going on in the whole cluster.

4. Basic Working Principle

Even if you are impatient, please read this chapter. On the surface, MARS appears to be very similar to DRBD. It looks almost like a drop-in replacement for DRBD. Taking this naïvely, you could easily step into some trivial pitfalls, because the internal working principle of MARS is totally different from DRBD's. Please forget (almost) everything you already know about the internal working principles of DRBD, and look at the very different working principles of MARS.
4.1. The Transaction Logger

[Figure: MARS Data Flow Principle. On host A (primary), writes to /dev/mars/mydata pass through the Transaction Logger and a Temporary Memory Buffer; they are appended to /mars/resource-mydata/log-00001-hostA and written back to the underlying disk /dev/lv-x/mydata in the background. The logfile is transferred over the long distance by the Logfile Replicator to host B (secondary), where the Logfile Applicator replays it onto the local /dev/lv-x/mydata. (From the MARS LCA2014 presentation by Thomas Schöbel-Theuer.)]

The basic idea of MARS is to record all changes made to your block device in a so-called transaction logfile. Any write request is treated like a transaction which changes the contents of your block device. This is similar in concept to some database systems, but there exists no separate "commit" operation: any write request acts like a commit.

The picture shows the flow of write requests. Let's start with the primary node. Upon submission of a write request on /dev/mars/mydata, it is first buffered in a temporary memory buffer. The temporary memory buffer serves multiple purposes:

• It keeps track of the order of write operations.

• Additionally, it keeps track of the positions in the underlying disk /dev/lv-x/mydata. In particular, it detects when the same block is overwritten multiple times.

• During a pending write operation, any concurrent reads are served from the memory buffer.

After the write has been buffered in the temporary memory buffer, the main logger thread of the transaction logger creates a so-called log entry and starts an "append" operation on the transaction logfile. The log entry contains vital information such as the logical block number in the underlying disk, the length of the data, a timestamp, some header magic in order to detect corruption, the log entry sequence number, of course the data itself, and optional information like a checksum or compression information. Once the log entry has been written through to the /mars/ filesystem via fsync(), the application waiting for the write operation at /dev/mars/mydata is signalled that the write was successful. This may happen even before the writeback to the underlying disk /dev/lv-x/mydata has started. Even if you power off the system right now, the information is not lost: it is present in the logfile, and can be reconstructed from there.

Notice that the order of log records present in the transaction log defines a total order among the write requests which is compatible with the partial order of the write requests issued on /dev/mars/mydata. Also notice that despite its sequential nature, the transaction logfile is typically not the performance bottleneck of the system: since appending to a logfile is almost purely sequential IO, it runs much faster than random IO on typical datacenter workloads.

In order to reclaim the temporary memory buffer, its content must be written back to the underlying disk /dev/lv-x/mydata at some point. After writeback, the temporary space is freed. The writeback can do the following optimizations:

1. Writeback may be in any order; in particular, it may be sorted according to ascending sector numbers. This will reduce the average seek distances of magnetic disks in general.

2. When the same sector is overwritten multiple times, only the "last" version needs to be written back, skipping some intermediate versions.

In case the primary node crashes during writeback, it suffices to replay the log entries from some point in the past until the end of the transaction logfile.
It does no harm if you accidentally replay some log entries twice or even more often: since the replay is in the original total order, any temporary inconsistency is healed by the logfile application. In mathematics, the property that you can apply your logfile twice to your data (or even as often as you want) is called idempotence. This is a very desirable property: it ensures that nothing goes wrong when replaying “too much” / starting your replay “too early”. Idempotence is even more beneficial: in case anything should go wrong with your data on your disk (e.g. IO errors), replaying your logfile once more may1 even heal some defects. Good news for desperate sysadmins forced to work with flaky hardware!

1 Miracles cannot be guaranteed, but higher chances and improvements can be expected (e.g. better chances for fsck).

The basic idea of the asynchronous replication of MARS is rather simple: just transfer the logfiles to your secondary nodes, and replay them onto their copy of the disk data (also called mirror) in the same order as the total order defined by the primary. Therefore, a mirror of your data on any secondary may be outdated, but it always corresponds to some version which was valid in the past. This property is called anytime consistency2.

2 Your secondary nodes are always consistent in themselves. Notice that this kind of consistency is a local consistency model. There exists no global consistency in MARS. Global consistency would be practically impossible in long-distance replication, where Einstein’s law of the speed of light is limiting global consistency. The front-cover pictures showing the planets Earth and Mars try to lead your imagination away from global consistency models as used in “DRBD Think(tm)”, and try to prepare you mentally for local consistency as in “MARS Think(tm)”.

As you can see in the picture, the process of transferring the logfiles is independent from the process which replays the logfiles onto the data at some secondary site. Both processes can be switched on / off separately (see commands marsadm {dis,}connect and marsadm {pause,resume}-replay in section 6.2.2). This may be exploited: for example, you may replicate your logfiles as soon as possible (to protect against catastrophic failures), but deliberately wait one hour until they are replayed (under regular circumstances). If the data inside your filesystem /mydata/ at the primary site is accidentally destroyed by rm -rf /mydata/, you still have an old copy at the secondary site. This way, you can substitute some parts3 of conventional backup functionality by MARS. In case you need the most recent version, just replay in “fast-forward” mode (similar to old-fashioned video tapes). Future versions of MARS Full are planned to also allow “fast-backward” rewinding, of course at some cost.
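A minimal sketch of such a deliberately delayed replay, using the switching commands mentioned above (the resource name mydata and the one-hour delay are examples only; in practice you would drive this from cron or from your monitoring system rather than from sleep):

    # Keep fetching logfiles from the primary as soon as possible ...
    marsadm connect mydata
    # ... but stop applying them to the local mirror for now
    marsadm pause-replay mydata

    # One hour later (e.g. from a cron job), let the mirror catch up again
    sleep 3600
    marsadm resume-replay mydata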
4.2. The Lamport Clock

MARS always communicates asynchronously in the distributed system, on any topic, even on strategic decisions. If there were a strict global consistency model, which would be roughly equivalent to a standalone model, we would need locking in order to serialize conflicting requests. It has been known for many decades that distributed locks not only suffer from performance problems, but are also cumbersome to get working reliably in scenarios where nodes or network links may fail at any time. Therefore, MARS uses a very different consistency model: Eventually Consistent.

Notice that the network bottleneck problems described in section 2.1 demand an “eventually consistent” model. You have no chance against natural laws, like Einstein’s laws. In order to cope with the problem area, you have to invest some additional effort. Unfortunately, asynchronous communication models are more tricky to program and to debug than simple strictly consistent models. In particular, you have to cope with additional race conditions inherent to the “eventually consistent” model. In the face of the laws of the universe, motivate yourself by looking at the graphics on the cover page: the planets are a symbol for what you have to do!

Example: the asynchronous communication protocol of MARS leads to a different behaviour from DRBD in case of network partitions (temporary interruption of communication between some cluster nodes), because MARS remembers the old state of remote nodes over long periods of time, while DRBD knows absolutely nothing about its peers in disconnected state. Sysadmins familiar with DRBD might find the following behaviour unusual:

1. The network partitions. DRBD: automatic disconnect. MARS: nothing happens, but replication lags behind.
2. On A: umount $device. DRBD: works. MARS: works.
3. On A: {drbd,mars}adm secondary. DRBD: works. MARS: works.
4. On B: {drbd,mars}adm primary. DRBD: works, split brain happens. MARS: refused, because B believes that A is primary.
5. The network resumes. DRBD: automatic connect attempt fails. MARS: communication automatically resumes.

If you intentionally want to switch over (and to produce a split brain as a side effect), the following variant must be used with MARS:

1. The network partitions. DRBD: automatic disconnect. MARS: nothing happens, but replication lags behind.
2. On A: umount $device. DRBD: works. MARS: works.
3. On A: {drbd,mars}adm secondary. DRBD: works. MARS: works (but not recommended!).
4. On B: {drbd,mars}adm primary. DRBD: split brain, but nobody knows. MARS: refused, because B believes that A is primary.
5. On B: marsadm disconnect. DRBD: not applicable. MARS: works, nothing happens.
6. On B: marsadm primary --force. DRBD: not applicable. MARS: works, split brain happens on B, but A doesn’t know.
7. On B: marsadm connect. DRBD: not applicable. MARS: works, nothing happens.
8. The network resumes. DRBD: automatic connect attempt fails. MARS: communication resumes, A now detects the split brain.

3 Please note that MARS cannot fully substitute a backup system, because it can keep only physical copies, and does not create logical copies.

In order to implement the consistency model “eventually consistent”, MARS uses a so-called Lamport4 clock. MARS uses a special variant called “physical Lamport clock”. The physical Lamport clock is another almost-realtime clock which can run independently from the Linux kernel system clock. However, the Lamport clock tries to remain as near as possible to the system clock. Both clocks can be queried at any time via cat /proc/sys/mars/lamport_clock. The result will show both clocks in parallel, in units of seconds since the Unix epoch, with nanosecond resolution. When there are no network messages at all, both the system clock and the Lamport clock will show almost the same time (except for some minor differences of a few nanoseconds resulting from the finite processor clock speed).

4 Published in the late 1970s by Leslie Lamport, also known as the inventor of LATEX.

The physical Lamport clock works rather simply: any message on the network is augmented with a Lamport timestamp telling when the message was sent according to the local Lamport clock of the sender.
Whenever that message is received by some receiver, it checks whether the time ordering relation would be violated: whenever the Lamport timestamp in the message claims that the sender sent the message after it arrived at the receiver (due to drifts in their respective local clocks), something must be wrong. In this case, the local Lamport clock of the receiver is advanced to shortly after the sender’s Lamport timestamp, such that the time ordering relation is no longer violated. As a consequence, any local Lamport clock may run ahead of the corresponding local system clock. In order to avoid accumulation of deltas between the Lamport and the system clock, the Lamport clock will run slower after that, possibly until it reaches the system clock again (if no other message arrives which sets it forward again). After having reached the system clock, the Lamport clock will continue at “normal” speed.

MARS uses the local Lamport clock for anything where other systems would use the local system clock: for example, timestamp generation in the /mars/ filesystem. Even symlinks created there are timestamped according to the Lamport clock. Both the kernel module and the userspace tool marsadm always operate in the timescale of the Lamport clock. Most importantly, all timestamp comparisons are always carried out with respect to Lamport time.

Bigger differences between the Lamport and the system clock can be annoying from a human point of view: when typing ls -l /mars/resource-mydata/, many timestamps may appear as if they were created in the “future”, because the ls command compares the output formatting against the system clock (it does not even know of the existence of the MARS Lamport clock). Always use ntp (or another clock synchronization service) in order to pre-synchronize your system clocks as closely as possible. Bigger differences are not only annoying, but may lead some people to wrong conclusions and therefore even to bad human decisions! In a professional datacenter, you should use ntp anyway, and you should monitor its effectiveness anyway.

Hint: many internal logfiles produced by the MARS kernel module contain Lamport timestamps written as numerical values. In order to convert them into human-readable form, use the command marsadm cat /mars/5.total.status or similar.
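For example, you can compare the two clocks on any node as follows. This is only an informational sketch: the exact output layout of the proc file may vary between MARS versions, so do not parse it in scripts.

    # Show the system clock and the MARS Lamport clock side by side,
    # both in seconds since the Unix epoch with nanosecond resolution
    cat /proc/sys/mars/lamport_clock
    # For comparison, the plain system clock in the same unit
    date +%s.%N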
4.3. The Symlink Tree

The symlink tree as described here will be replaced by another representation in future versions of MARS. Therefore, don’t do any scripting by directly accessing symlinks! Use the primitive macros described in section 5.1.2.

The current /mars/ filesystem container format contains not only transaction logfiles, but also acts as a generic storage for (persistent) state information. Both configuration information and runtime state information are currently stored in symlinks. Symlinks are “misused5” in order to represent some key -> value pairs. It is not yet clear / decided, but there is a chance that the concept of key -> value pairs will be retained in future versions of MARS. Instead of being represented by symlinks, another representation will be used, such that hopefully the key part will remain in the form of a pathname, even if there is no longer a physical representation in an actual filesystem.

5 This means the symlink targets need not be other files or directories, but may be just any values like integers or strings.

A fundamentally different behaviour than DRBD: when your DRBD primary crashed some time ago and now comes up again, you have to set up DRBD again by a sequence of commands like modprobe drbd; drbdadm up all; drbdadm primary all or similar. In contrast, MARS needs only modprobe mars (after /mars/ has been mounted via /etc/fstab). The persistence of the symlinks residing in /mars/ will automatically restore your previous state, even if some of your resources were primary while others were secondary (mixed operation). You don’t need to take any action in order to “restore” a previous state, no matter how “complex” it was.

(Almost) all symlinks appearing in the /mars/ directory tree are automatically replicated throughout the whole cluster, provided that the cluster uuids are equal6 at all sites. Thus the /mars/ directory forms some kind of global namespace. In order to avoid name clashes, each pathname created at node A follows a convention: the node name A should be a suffix of the pathname. Typically, internal MARS names follow the scheme /mars/something/myname-A. When using the expert command marsadm {get,set}-link (which will likely be replaced by something else in future MARS releases), you should follow the best practice of systematically using pathnames like /mars/userspace/myname-A or similar. As a result, each node will automatically get informed about the state at any other node, like B, when the corresponding information is recorded on node B under the name /mars/userspace/myname-B (context-dependent names).

6 This is a protection against accidental “merging” of two unrelated clusters which had been created at different times with different uuids.

Experts only: the symlink replication works generically. You might use the /mars/userspace/ directory in order to place your own symlinks there (for whatever purpose, which need not have anything to do with MARS). However, the symlinks are likely to disappear. Use marsadm {get,set}-link instead. There is a chance that these abstract commands (or variants thereof) will be retained, by acting on the new data representation in future, even if the old symlink format vanishes some day.

Important: the convention of placing the creator host name inside your pathnames should be used wherever possible. The name part is a kind of “ownership indicator”. It is crucial that no other host writes any symlink not “belonging” to it. Other hosts may read foreign information as often as they want, but must never modify it. This way, your cluster nodes are able to communicate with each other via symlink / information updates.

Although experts might create (and change) the current symlinks with userspace tools like ln -s, you should use the following marsadm commands instead:
• marsadm set-link myvalue /mars/userspace/mykey-A
• marsadm delete-file /mars/userspace/mykey-A
There are many reasons for this: first, the marsadm set-link command will automatically use the Lamport clock for symlink creation, and therefore will avoid any errors resulting from a “wrong” system clock (as with ln -s). Second, the marsadm delete-file command (which also deletes symlinks) works on the whole cluster. And finally, there is a chance that this will work in future versions of MARS even after the symlinks have vanished.
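For example, the following sketch publishes a small piece of userspace information on node A and reads it back on node B once the symlink replication has propagated it. The key name mykey and the value are arbitrary examples, and it is assumed here (not guaranteed by this manual) that marsadm get-link simply prints the stored value.

    # On node A: publish a value under a pathname "owned" by A
    marsadm set-link "maintenance-planned" /mars/userspace/mykey-A

    # On node B, some time later, after the update has been replicated:
    marsadm get-link /mars/userspace/mykey-A

    # On node A again: remove the key cluster-wide when it is no longer needed
    marsadm delete-file /mars/userspace/mykey-A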
What’s the difference? If you try to remove your symlink locally by hand via rm -f, you will be surprised: since the symlink has been replicated to the other cluster nodes, it will be re-transferred from there and will be resurrected locally after some short time. This way, you cannot delete any object reliably, because your whole cluster (which may consist of many nodes) remembers all your state information and will “correct” it whenever “necessary”. In order to solve the deletion problem, MARS uses an internal deletion protocol using auxiliary symlinks residing in /mars/todo-global/. The deletion protocol ensures that all replicas get deleted in the whole cluster, and only thereafter the auxiliary symlinks in /mars/todo-global/ are eventually deleted as well.

You may update an already existing symlink via marsadm set-link some-other-value /mars/userspace/mykey-A. The new value will be propagated throughout the cluster according to a timestamp comparison protocol: whenever node B notices that A has a newer version of some symlink (according to the Lamport timestamp), it will replace its older version with the newer one. The opposite does not happen: if B notices that A has an older version, nothing happens at all. This way, the timestamps of symlinks can only progress in the forward direction, never backwards in time. As a consequence, symlink updates made “by hand” via ln -sf may get lost when the local system clock is much earlier than the Lamport clock.

When your cluster is fully connected by the network, the last timestamp will finally win everywhere. Only in case of network outages leading to network partitions may some information be temporarily inconsistent, but only for the duration of the network outage. The timestamp comparison protocol, in combination with the Lamport clock and with the persistence of the /mars/ filesystem, will automatically heal any temporary inconsistencies as soon as possible, even in case of temporary node shutdown.

The meaning of some internal MARS symlinks residing in /mars/ will hopefully be documented in section 9.4 some day.

4.4. Defending Overflow of /mars/

This section describes an important difference to DRBD. The metadata of DRBD is allocated statically at creation time of the resource. In contrast, the MARS transaction logfiles are allocated dynamically at runtime. This leads to a potential risk from the perspective of a sysadmin: what happens if the /mars/ filesystem runs out of space?

No risk, no fun. If you want a system which survives long-lasting network outages while keeping your replicas always consistent (anytime consistency), you need dynamic memory for that. It is impossible to solve that problem using static memory7. Therefore, DRBD and MARS have different application areas. If you just want a simple system for mirroring your data over short distances like a crossover cable, DRBD will be a suitable choice. However, if you need to replicate over longer distances, or if you need higher levels of reliability even when multiple failures may accumulate (such as network loss during a resync of DRBD), the transaction logs of MARS can solve that, but at some cost.

4.4.1. Countermeasures

4.4.1.1. Dimensioning of /mars/

The first (and most important) measure against overflow of /mars/ is simply to dimension it large enough to survive longer-lasting problems, at least one weekend. The recommended size is at least one dedicated disk, residing on a hardware RAID controller with BBU (see section 3.1).
During normal operation, only a small fraction of that size is needed, typically a few percent or even less than one percent. However, it is your safety margin. Keep it high enough!

7 The bitmaps used by DRBD don’t preserve the order of write operations. They cannot do that, because their space is O(k) for some constant k. In contrast, MARS preserves the order. Preserving the order as such (even if only facts about the order were recorded, without recording the actual data contents) requires O(n) space, where n grows infinitely over time.

4.4.1.2. Monitoring

The next (equally important) measure is monitoring in userspace. The following is a list of countermeasures both in userspace and in kernelspace, in the order of “defensive walling”:

1. Regular userspace monitoring must throw an INFO if a certain freespace limit l1 of /mars/ is undershot. Typical values for l1 are 30%. Typical actions are automated calls of marsadm cron (or marsadm log-rotate all followed by marsadm log-delete-all all). You have to implement that yourself in sysadmin space (a small example sketch follows after this list).

2. Regular userspace monitoring must throw a WARNING if a certain freespace limit l2 of /mars/ is undershot. Typical values for l2 are 20%. Typical actions are (in addition to log-rotate and log-delete-all) alarming human supervisors via SMS and/or further, stronger automated actions.
Frequently, large amounts of space are occupied by files stemming from debugging output, or from other programs or processes. A hot candidate is “forgotten” removal of debugging output to /mars/. Sometimes, an rm -rf $(find /mars/ -name "*.log") can work miracles. Another source of space hogging is a “forgotten” pause-sync or disconnect. Therefore, a simple marsadm connect-global all followed by marsadm resume-replay-global all may also work miracles (if you didn’t want to freeze some mirror deliberately). If you just wanted to freeze a mirror at an outdated state for a very long time, you simply cannot do that without causing infinite growth of space consumption in /mars/. Therefore, a marsadm leave-resource $res at exactly that(!) secondary site where the mirror is frozen can also work miracles. If you want to automate this in userspace, be careful. It is easy to get unintended effects when choosing the wrong site for leave-resource.
Hint: you can / should start some of these measures already at the INFO level (see item 1), or even earlier.

3. Regular userspace monitoring must throw an ERROR if a certain freespace limit l3 of /mars/ is undershot. Typical values for l3 are 10%. Typical actions are alarming the CEO via SMS and/or even stronger automated actions. For example, you may choose to automatically call marsadm leave-resource $res on some or all secondary nodes, such that the primary is left alone and now has a chance to really delete its logfiles, because nobody else is potentially needing them any longer.

4. First-level kernelspace action, automatically executed when /proc/sys/mars/required_free_space_4_gb + /proc/sys/mars/required_free_space_3_gb + /proc/sys/mars/required_free_space_2_gb + /proc/sys/mars/required_free_space_1_gb is undershot: a warning will be issued.

5. Second-level kernelspace action, automatically executed when /proc/sys/mars/required_free_space_3_gb + /proc/sys/mars/required_free_space_2_gb + /proc/sys/mars/required_free_space_1_gb is undershot: all locally secondary resources will delete local copies of transaction logfiles which are no longer needed locally.
This is a desperate action of the kernel module.

6. Third-level kernelspace action, automatically executed when /proc/sys/mars/required_free_space_2_gb + /proc/sys/mars/required_free_space_1_gb is undershot: all locally secondary resources will stop fetching transaction logfiles. This is a more desperate action of the kernel module. You don’t want to get there (except for testing).

7. Last desperate kernelspace action, when all else has failed and /proc/sys/mars/required_free_space_1_gb is undershot: all locally primary resources will enter emergency mode (see description below in section 4.4.2). This is the most desperate action of the kernel module. You don’t want to get there (except for testing).

In addition, the kernel module obeys a general global limit of /proc/sys/mars/required_total_space_0_gb + the sum of all of the above limits. When the total size of /mars/ undershoots that sum, the kernel module refuses to start at all, because it assumes that it is senseless to try to operate MARS on a system with such low memory resources.

The current level of emergency kernel actions may be viewed at any time via /proc/sys/mars/mars_emergency_mode.
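As announced in item 1 above, here is a minimal sketch of such a userspace monitoring job. The thresholds and the INFO/WARNING handling are examples only; replace the echo lines with whatever your monitoring infrastructure expects, and note that it relies on GNU df’s --output option.

    #!/bin/bash
    # Crude free-space check for /mars/, e.g. run every few minutes from cron.
    # Example thresholds: below 30% free -> INFO + log cleanup, below 20% free -> WARNING.
    used=$(df --output=pcent /mars/ | tail -1 | tr -dc '0-9')
    free=$((100 - used))
    if [ "$free" -lt 30 ]; then
        echo "INFO: only ${free}% free on /mars/, rotating and deleting old logfiles"
        marsadm log-rotate all
        marsadm log-delete-all all
    fi
    if [ "$free" -lt 20 ]; then
        echo "WARNING: only ${free}% free on /mars/, please alert a human"
    fi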
4.4.1.3. Throttling

The last measure for defense against overflow is throttling your performance hogs. Motivation: in rare cases, some users with ssh access can do very silly things. For example, some of them create their own backups via user cron jobs, and they do it every 5 minutes. One example guy created a zip archive (almost 1 GB) by regularly copying his old zip archive into a new one, then appending deltas to the new one, and finally deleting the old archive. Every 5 minutes. Yes, every 5 minutes, although almost never any new files were added to the archive. Essentially, he copied his archive over and over, for nothing. This led to massive bulk write requests, for ridiculous reasons.

In general, your hard disks (or even RAID systems) allow much higher write IO rates than you can ever transport over a standard TCP network from your primary site to your secondary, at least over longer distances (see use cases for MARS in chapter 2). Therefore, it is easy to create such a high write load that it is impossible to replicate it over the network, by construction. Therefore, we need some mechanism for throttling bulk writers whenever the network is weaker than your IO subsystem. Notice that DRBD will always throttle your writes whenever the network forms a bottleneck, due to its synchronous operation mode. In contrast, MARS allows for buffering of performance peaks in the transaction logfiles. Only when your buffer in /mars/ runs short (cf. subsection 4.4.1.1) will MARS start to throttle your application writes.

There are a number of tunables named /proc/sys/mars/write_throttle_* with the following meaning:

write_throttle_start_percent Whenever the used space in /mars/ is below this threshold, no throttling will occur at all. Only when this threshold is exceeded will throttling start, slowly. A typical value for this is 60%.

write_throttle_end_percent Maximum throttling occurs once this space threshold is reached, i.e. the throttling is now at its maximum effect. A typical value for this is 90%.

When the actual space usage in /mars/ lies between write_throttle_start_percent and write_throttle_end_percent, the strength of throttling is interpolated linearly between the extremes. In practice, this should lead to an equilibrium between the new input flow into /mars/ and the output flow over the network to the secondaries.

write_throttle_size_threshold_kb (readonly) This parameter shows the internal strength calculation of the throttling. Only write8 requests exceeding this size (in KB) are throttled at all. Typically, this will hurt the bulk performance hogs first, while leaving ordinary users (issuing small requests) unaffected.

8 Read requests are never throttled at all.

write_throttle_ratelimit_kb Sets the global IO rate in KB/s for those write requests which are throttled. In case of strongest9 throttling, this parameter determines the input flow into /mars/. The default value is 5000 KB/s. Please adjust this value to your application needs and to your environment.

9 In case of lighter throttling, the input flow into /mars/ may be higher, because small requests are not throttled.

write_throttle_rate_kb (readonly) Shows the current rate of exactly those requests which are actually throttled (in contrast to all requests).

write_throttle_cumul_kb (logically readonly) Same as before, but the cumulative sum of all throttled requests since startup / reset. This value can be reset from userspace in order to prevent integer overflow.

write_throttle_count_ops (logically readonly) Shows the cumulative number of throttled requests. This value can be reset from userspace in order to prevent integer overflow.

write_throttle_maxdelay_ms Each request is delayed at most for this timespan. Smaller values will improve the responsiveness of your userspace application, but at the cost of potentially not retarding the requests sufficiently.

write_throttle_minwindow_ms Sets the minimum length of the measuring window. The measuring window is the timespan over which the average (throughput) rate is computed (see write_throttle_rate_kb). Lower values can increase the responsiveness of the controller algorithm, but at the cost of accuracy.

write_throttle_maxwindow_ms This parameter must be set sufficiently much greater than write_throttle_minwindow_ms. In case the flow of throttled operations pauses for some natural reason (e.g. switched off, low load, etc.), this parameter determines when a completely new rate calculation should be started over10.

10 Motivation: if requests paused for one hour, the measuring window could also become an hour. Of course, that would lead to completely meaningless results. Two requests in one hour is “incorrect” from a human point of view: we just have to ensure that averages are computed with respect to a reasonable maximum time window in the magnitude of 10s.
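These tunables live in procfs, so they can be inspected and adjusted at runtime in the usual way. A small sketch follows; the values 50 and 10000 are arbitrary examples, and whether a particular parameter is writable follows from the (readonly) annotations above.

    # Inspect the current throttling configuration and counters
    grep . /proc/sys/mars/write_throttle_*

    # Example: start throttling earlier and allow throttled writers 10000 KB/s in total
    echo 50    > /proc/sys/mars/write_throttle_start_percent
    echo 10000 > /proc/sys/mars/write_throttle_ratelimit_kb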
4.4.2. Emergency Mode and its Resolution

When /mars/ is almost full and there is really absolutely no chance of getting rid of any local transaction logfile (or of freeing space in any other way), there is only one exit strategy: stop creating new logfile data. This means that the ability for replication gets lost. When entering emergency mode, the kernel module will execute the following steps for all resources where the affected host is acting as a primary:

1. Do a kind of “logrotate”, but create a hole in the sequence of transaction logfile numbers. The “new” logfile is left empty, i.e. no data is written to it (for now). The hole in the numbering will prevent any secondaries from replaying any logfiles behind the hole (should they ever contain some data, e.g. because the emergency mode has been left again). This works because the secondaries regularly check the logfile numbers for contiguity, and they will refuse to replay anything which is not contiguous. As a result, the secondaries will be left in a consistent, but outdated state (at least if they already were consistent before that).

2. The kernel module writes back all data present in the temporary memory buffer (see figure in section 4.1). This may lead to a (short) delay of user write requests until that has finished (typically fractions of a second or a few seconds). The reason is that the temporary memory buffer must not grow in parallel during this phase (race conditions).

3. After the temporary memory buffer is empty, all local IO requests (whether reads or writes) go directly to the underlying disk. This has the same effect as if MARS were not present anymore. Transaction logging no longer takes place.

4. Any sync from any secondary is stopped ASAP. In case they resume their sync at some later time, they will start over from the beginning (position 0).

In order to leave emergency mode, the sysadmin should do the following steps:

1. Free enough space. For example, delete any foreign files on /mars/ which have nothing to do with MARS, or resize the /mars/ filesystem, or whatever.

2. If /proc/sys/mars/mars_reset_emergency is not set, now is the time to set it. Normally, it should already be set.

3. Notice: as long as not enough space has been freed, a message containing “EMERGENCY MODE HYSTERESIS” (or similar) will be displayed by marsadm view all. As a consequence, any sync will be automatically halted. This also applies to freshly invoked syncs, for example created by invalidate or join-resource.

4. On the secondaries, use marsadm invalidate $res in order to request updating your outdated mirrors.

5. On the primary: marsadm log-delete-all all

6. As soon as enough space has been freed everywhere to leave the EMERGENCY MODE HYSTERESIS, the sync should really start. Until then, it had been halted.

7. Recommendation: check at the secondaries that state Orphan has been left after a while.

Alternatively, there is another method, roughly following the instructions from appendix C, but in a slightly different order. In this case, do leave-resource everywhere on all secondaries, but don’t start the join-resource phase for now. Then clean up all your secondaries via log-purge-all, and finally run log-delete-all all at the primary, and wait until the emergency has vanished everywhere. Only after that, re-join-resource your secondaries.

Expert advice for k = 2 replicas: this means you had only 1 mirror per resource before the overflow happened. Provided that you have enough space on your LVMs and on /mars/, and provided that transaction logging has automatically restarted after leave-resource and log-purge-all, you can recover redundancy by creating a new replica via marsadm join-resource $res on a third node. Only after the initial full sync has finished there, run join-resource at your original mirror. This way, you will always retain at least one consistent mirror somewhere. After everything is up-to-date, you can delete the superfluous mirror by marsadm leave-resource $res and reclaim the disk space from its underlying LVM disk. If you already have k > 2 replicas in total, it may be a wise idea to prefer the leave-resource; log-purge-all; join-resource method over invalidate, because it does not invalidate all your replicas at the same time (when handled properly in the right order).

5. The Macro Processor

marsadm comes with a customizable macro processor.
It can be used for high-level complex display of the state of MARS (so-called complex macros), as well as for low-level display of lots of individual state values (so-called primitive macros). From the commandline, any macro can be called via marsadm view-$macroname mydata. The short form marsadm view mydata is equivalent to marsadm view-default mydata. In general, the command marsadm view-$macroname all will first call the macro $macroname in a loop for all resources we are a member locally. Finally, a trailing macro $macroname -global will be called with an empty %{res} argument, provided that such a macro is defined. This way, you can produce per-resource output followed by global output which does not depend on a particular resource. 5.1. Predefined Macros The macro processor is a very flexible and versatile tool for customizing. You can create your own macros, but probably the rich set of predefined macros is already sufficient for your needs. 5.1.1. Predefined Complex and High-Level Macros The following predefined complex macros try to address the information needs of humans. Use them only in scripts when you are prepared about the fact that the output format may change during development of MARS. Notice: the definitions of predefined complex macros may be updated in the course of the MARS project. However, the primitive macros recursively called by the complex ones will be hopefully rather stable in future (with the exception of bugfixes). If you want to retain an old / outdated version of a complex macro, just check it out from git, follow the instructions in section 5.2, and preferably give it a different name in order to avoid confusion with the newer version. In general, it should be possible to use old macros with newer versions of marsadm1 . default This is equivalent to marsadm view mydata without -maroname suffix. It shows a one-line status summary for each resource, optionally followed by informational lines such as progress bars whenever a sync or a fetch of logfiles is currently running. The status line has the following fields: %{res} resource name. [this_count /total_count ] total number of replicas of this resource, out of total number of cluster members. %include{diskstate} see diskstate macro below. %include{replstate} see replstate macro below. %include{flags} see flags macro below. %include{role} see role macro below. %include{primarynode} see primarynode macro below. %include{commstate} see commstate macro below. After that, optional lines such as progress bars are appearing only when something unusual is happening. These lines are subject to future changes. For examples, wasted disk space due to missing resize is reported when %{threshold} is exceeded. 1 You might need to check out also old versions of further macros and adapt their names, whenever complex macros call each other. 81 5. The Macro Processor 1and1 or default-1and1 A variant of default for internal use by 1&1 Internet AG. You may call this complex macro by saying marsadm view-1and1 all. Note: the marsadm view-1and1 command has been intensely tested in Spring 2014 to produce exactly the same output than the 1&1 internal2 tool marsview3 Customization via your own macros (see section 5.2) is explicitly encouraged by the developer. It would be nice if a vibrant user community would emerge, helping each other by exchange of macros. 
Hint: in order to produce your own customized inspection / monitoring tools, you may ask the author for an official reservation of a macro sub-namespace such as *-yourcompanyname . You will be fully responsible for your own reserved namespace and can do with it whatever you want. The official MARS release will guarantee that no name clashes with your reserved subnamespace will occur in future. default-global Currently, this just calls comminfo (see below). May be extended in future. diskstate Shows the status of the underlying disk device, in the following order of precedence4 : NotJoined (cf %get-disk{}) No underlying disk device is configured. NotPresent (cf %disk-present{}) The underlying disk device (as configured, see marsadm view-get-disk) does not exist or the device node is not accessible. Therefore MARS cannot work. Check that LVM or other software is properly configured and running. Detached (cf InConsistent, NeedsReplay, %todo-attach{}, %is-attach{}) The underlying disk is willingly switched off (see marsadm detach), and it actually is no longer opened by MARS. Detaching (cf %todo-attach{} and %is-attach{}) Access to the underlying disk is switched off, but actually not yet close()d by MARS. This can happen for a long time on a primary when other secondaries are accessing the disk remotely for syncing. DefectiveLog[description-text ] (cf %replay-code{}) Typicially this indicates an md5 checksum error in a transaction logfile, or another (hardware / filesystem) defect. This occurs extremely rarely in practice, but has been observed more frequently during a massive failure of air conditioning in a datacenter, when disk temperatures raised to more than 80° Celsius. Notice that a secondary refuses to apply any knowingly defective logfile data to the disk. Although this message is not directly referring to the underlying disk, it is mentioned here because of its superior relevance for the diskstate. A damaged transaction logfile will always affect the actuality of the disk, but not its integrity (by itself). What to do in such a case? 2 In addition to allow for customization, the macro processor is also meant as an exit strategy for removing dependencies from non-free software. Please put your future macros also under GPL! 3 There are some subtle differences: numbers are displayed in a different precision, some bug fixes in the macro version (which might have occurred in the meantime ) may lead to different output as a side effect from bug fixes in predefined macros, because the original marsview command is currently not actively maintained. Documentation of marsview can be found in the corresponding manpage, see man marsview. By construction, this is also the (unmaintained) documentation of marsadm view-1and1 and other -1and1 macros. Notice that all *-1and1 macros are not officially supported by the developer of MARS, and they may disappear in a future major release. However, they could be useful for your own customization macros. 4 When an earlier list item is displayed, no combinations with following items are possible. This kind of “hiding effect” can lead to an information loss. In order to get a non-lossy picture from the state of your system, please look at the flags which are able to display cartesian combinations of more detailed internal states. 82 5.1. Predefined Macros 1. When the damage is only at one of your secondaries, you should first ensure that the primary has a good logfile after a marsadm log-rotate, then try marsadm invalidate at the damaged secondary. 
It is crucial that the primary has a fresh correct logfile behind the error position, and that it is continuing to operate correctly. 2. When all of your secondaries are reporting DefectiveLog, the primary could have produced a damaged logfile (e.g. in RAM, in a DMA channel, etc) while continuing to operate, and all of your secondaries got that defective logfile. After marsadm log-delete-all all, you can check this by comparing the md5sum of the first primary logfile (having the lowest serial number) with the versions on your replicas. The problem is that you don’t know whether the primary side has a silent corruption on any of its disks, or not. You will need to take an operational decision whether to switchover to a secondary via primary --force, or whether to continue operation at the primary and invalidate your secondaries. 3. When the original primary is affected in a very bad way, such that it crashed badly and afterwards even recovery of the primary is impossible5 due to this error (which typically occurs extremely rarely, observed two times during 7 millions of operating hours on defective hardware), you need to take an operational decision between the following alternatives: a) switchover to a former secondary via primary --force, producing a split brain, and producing some (typically small) data loss. However, integrity is more important than actuality in such an extreme case. b) deconstruction of the resource at all replicas via leave-resource --force, running fsck or similar tools by hand at the underlying disks, selecting the best replica out of them, and finally reconstructing the resource again. c) restore your backup. Orphan The secondary cannot replay data anymore, because it has been kicked out for avoidance of emergency mode. The data is not recent anymore. Typically, marsadm invalidate needs to be done. There is an execption: shortly after join-resource or invalidate, it may take some time until state Orphan may be left, and until the newest logfile has appeared at your secondary site (depending on the size of logfiles, and on your network). In case of network problems, this may take very long. This state tells you that your replica is not current, and currently not being updated at all. Don’t forget to monitor for longer occurrences of this state! Otherwise you may get a big surprise when you need a forceful emergency failover, but your replica is very old or even does not really exist at all. NoAttach (cf %is-attach{}) The underlying disk is currently not opened by MARS. Reasons may be that the kernel module is not loaded, or an exclusive open() is currently not possible because somebody else has already opened it. InConsistent (cf %is-consistent{}) A logfile replay and/or sync is known to be needed / or to complete (e.g. after invalidate has started) in order to restore local consistency (for details, look at flags). 5 In such a rare case, the original primary (but not any other host) refuses to come up during recovery with his own logfile originally produced by himself. This is not a bug, but saves you from incorrectly assuming that your original primary disk were consistent - it is known to be inconsistent, but recovery is impossible due to the damaged logfile. Thus this one replica is trapped by defective hardware. The other replicas shouldn’t. 83 5. 
The Macro Processor Hint: in the current implementation of MARS, this will never happen on secondaries during ordinary replay (but only when either sync has not yet finished, or when the initial logfile replay after the sync has not yet finished), because the ordinary logfile replay always maintains anytime consistency once a consistent state had been reached. Only in case of a primary node crash, and only after attempts have failed to become primary again (e.g. IO errors, etc), this can (but need not) mean that something went wrong. Even in such an extremely unlikely event, chances are high that fsck can fix any remaining problems (and, of course, you can also switchover to a former secondary). When this message appears, simply start MARS again (e.g. modprobe mars; marsadm up all), in whatever role you are intending. This will automatically try to replay any necessary transaction logfile(s) in order to fix the inconsistency. Only if the automatic fix fails and this message persists for a long time without progress, you might have a problem. Typically, as observed at a large installation at 1&1, this happens extremely rarely, and then typically indicates that your hardware is likely to be defective. OutDated[FR] (cf %work-reached{}) Only at secondaries. Tells whether it is currently known that the disk has any lag-behind when compared to the currently known state of the current designated primary (if there exists one). Only meaningful if a current designated primary exists. Notice that this kind of status display is subject to natural races, for example when new logfile data has been produced in parallel, or network propagation is very slow. Additional information is in brackets: [F] Fetch is known to be needed. [R] Replay is known to be needed. [FR] Both are known to be needed. WriteBack (cf %is-primary{}) Appears only at actual primaries (whether designated or not), when the writeback from the RAM buffer is active (see section 4.1) Recovery (cf %todo-primary{}) Appears only at the designated primary before it actually has become primary. Similar to database recovery, this indicates the recovery phase after a crash6 . EmergencyMode (cf %is-emergency{}) A current designated primary exists, and it is known that this host has entered emergency mode. See section 4.4.2. UpToDate Displayed when none of the above has been detected. diskstate-1and1 A variant for internal use by 1&1 Internet AG. See above note. replstate Shows the status of the replication in the following order of precedence: ModuleNotLoaded (cf %is-module-loaded{}) No kernel module is loaded, and as a consequence no /proc/sys/mars/ does exist. UnResponsive (cf %is-alive{%{host}}) The main thread mars_light did not do any noticable work for more than %{window} (default 60) seconds. Notice that this may happen when deleting extremely large logfiles (up to hundreds of gigabytes or terabytes). If this happens for a very long time, you should check whether you might need a reboot in order to fix the hang. The time window may be changed by --window=$seconds. 6 In 84 some cases, primary --force may also trigger this message. 5.1. Predefined Macros NotJoined (cf %get-disk{}) No underlying disk device is configured for this resource. NotStarted (cf %todo-attach{}) Replication has not been started. • When the current host is designated as a primary, the rest of the precedence list looks as follows: EmergencyMode (cf. %is-emergency{}) See section 4.4.2. Replicating (cf. %is-primary{}) Primary mode has been entered. 
NotYetPrimary (catchall) This means the current host should act as a primary (see marsadm primary or marsadm primary --force), but currently doesn’t (yet). This happens during logfile replay, before primary mode is actually entered. Notice that replay of very big logfiles may take a long time.

• When the current host is not designated as a primary:

PausedSync (cf. %sync-rest{} and %todo-sync{}) Some data needs to be synced, but sync is currently switched off. See marsadm {pause,resume}-sync.

Syncing (cf. %is-sync{}) Sync is currently running.

PausedFetch (cf. %todo{fetch}) Fetch is currently switched off. See marsadm {pause,resume}-fetch.

PausedReplay (cf. %todo{replay}) Replay is currently switched off. See marsadm {pause,resume}-replay.

NoPrimaryDesignated (cf. %get-primary{}) A secondary command has been given somewhere in the cluster. Thus no designated primary exists. All resource members are in state Secondary or try to approach it. Sync and other operations are not possible. This state is therefore not recommended.

PrimaryUnreachable (cf. %is-alive{}) A current designated primary has been set, but this host has not been remotely updated for more than 60 seconds (see also --window=$seconds).

Orphan The secondary cannot replay data anymore, because it has been kicked out for avoidance of emergency mode. The data is not recent anymore. Typically, marsadm invalidate needs to be done.

Replaying (catchall) None of the previous conditions have triggered.

replstate-1and1 A variant for internal use by 1&1 Internet AG. See above note.

flags For each of disk, consistency, attach, sync, fetch, and replay, show exactly one character. Each character is either a capital one, the corresponding lowercase one, or a dash. The meaning is as follows:

disk/device: D = the device /dev/mars/mydata is present, d = only the underlying disk /dev/lv-x/mydata is present, - = none present / configured.

consistency: this relates to the underlying disk, not to /dev/mars/mydata! C = locally consistent, c = maybe inconsistent (no guarantee), - = cannot determine. Notice: this does not tell anything about actuality. Notice: like the other flags, this flag is subject to races and therefore should be relied on only in detached state! See also the description of macro is-consistent below.

attach: A = attached, a = currently trying to attach/detach but not yet ready (intermediate state), - = attach is switched off.

sync: S = sync finished, s = currently syncing, - = sync is switched off.

fetch: F = according to current knowledge, fetched logfiles are up-to-date, f = currently fetching (some parts of) a logfile, - = fetch is switched off.

replay: R = all fetched logfiles are replayed, r = currently replaying, - = replay is switched off.

flags-1and1 A variant for internal use by 1&1 Internet AG.

todo-role Shows the designated state: None, Primary or Secondary.

role Shows the actual state: None, NotYetPrimary, Primary, RemainsPrimary, or Secondary. Any differences to the designated state are indicated by a prefix to the keyword Primary: NotYet means that it should become primary, but actually hasn’t. Vice versa, Remains means that it should leave primary state in order to become secondary, but actually cannot do that because the /dev/mars/mydata device is currently in use. The reported value is determined by the combination of %todo-primary{} and %is-primary{}:
%todo-primary{} == 0 and %is-primary{} == 0: None / Secondary
%todo-primary{} == 1 and %is-primary{} == 0: NotYetPrimary
%todo-primary{} == 0 and %is-primary{} == 1: RemainsPrimary
%todo-primary{} == 1 and %is-primary{} == 1: Primary

role-1and1 A variant for internal use by 1&1 Internet AG.
primarynode Display (none) or the hostname of the designated primary. primarynode-1and1 A variant for internal use by 1&1 Internet AG. commstate When the last metadata communication to the designated primary is longer ago than ${window} (see also --window=seconds option), display that age in human readable form. See also primitive macro %alive-age{}. syncinfo Shows an informational progress bar when sync is running. Intended for humans. Scripts should not rely on any details from this. Scripts may use this only as an approximate means for detecting progress (when comparing the full output text to a prior version and finding any difference, they may conclude that some progress has happened, how small whatsoever). syncinfo-1and1 A variant for internal use by 1&1 Internet AG. replinfo Shows an informational progress bar when fetch is running. This should not be used for scripting at all, because it contains realtime information in human-readable form. replinfo-1and1 A variant for internal use by 1&1 Internet AG. fetch-line Additional details, called by replinfo. Shows the amount of data to be fetched, as well as the current transfer rate and a very rough estimation of the future duration. When primitive macros %fetch-age{} or %fetch-lag{} exceed ${window}, their values are also displayed for human informational purposes. See description of these primitive macros. replay-line Additional details, called by replinfo. Shows the amount of data to be replayed, as well as the current replay rate and a very rough estimation of the future duration. When primitive macro %replay-age{} exceeds ${window}, it is also displayed for human informational purposes. comminfo When the network communication is in an unusual condition, display it. Otherwise, don’t produce any output. 5.1.2. Predefined Primitive Macros 5.1.2.1. Intended for Humans In the following, shell glob notation {a,b} is used to document similar variants of similar macros in a single place. When you actually call the macro, you must choose one of the possible variants (excluding the braces). the-err-msg Show reported errors for a resource. When the resource argument is missing or empty, show global error information. all-err-msg Like before, but show all information including those which are OK. This way, you get a list7 of all potential error information present in the system. 7 The 86 list may be extended in future versions of MARS. 5.1. Predefined Macros {all,the}-wrn-msg Show all / reported warnings in the system. {all,the}-inf-msg Show all / reported informational messages in the system. {all,the}-msg Show all / reported messages regardless of its classification. {all,the}-global-msg Show global messages not associated with any resource (the resource argument of the marsadm command is ignored in this case). {all,the}-global-{inf,wrn,err}-msg Dito, but more specific. {all,the}-pretty-{global-,}{inf-,wrn-,err-,}msg Dito, but show numerical timestamps in a human readable form. {all,the}-{global-,}{inf-,wrn-,err-,}count Instead of showing the messages, show their count (number of lines). errno-text This macro takes 1 argument, which must represent a Linux errno number, and converts it to human readable form (similar to the C strerror() function). todo-{attach,sync,fetch,replay,primary} Shows a boolean value (0 or 1) indicating the current state of the corresponding todo switch (whether on or off). The meaning of todo switches is illustrated in section 3.5. get-resource-{fat,err,wrn} Access to the internal error status files. 
This is not an official interface and may thus change at any time without notice. Use this only for human inspection, not for scripting! These macros, as well as the error status files, are likely to disappear in future versions of MARS. They should be used for debugging only. At least when merging into the upstream Linux kernel, only the *-msg macros will likely survive. get-resource-{fat,err,wrn}-count Dito, but get the number of lines instead of the text. replay-code Indicate the current state of logfile replay / recovery: (empty) Unknown. 0 No replay is currently running. 1 Replay is currently running. 2 Replay has successfully stopped. <0 See Linux errno code. Typically this indicates a damaged logfile, or another filesystem error at /mars. is-{attach,sync,fetch,replay,primary,module-loaded} Shows a boolean value (0 or 1) indicating the actual state, whether the corresponding action has been actually carried out, or not (yet). Notice that the values indicated by is-* may differ from the todo-* values when something is not (yet) working. More explanations can be found in section 3.5. is-split-brain Shows whether split brain (see section 3.4.3) has been detected, or not. is-consistent Shows whether the underlying disk is in a locally consistent state, i.e. whether it could be (potentially) detached and then used for read-only test-mounting8 . Don’t confuse this with the consistency of /dev/mars/mydata, which is by construction always locally consistent once it has appeared9 . By construction of MARS, the disk 8 Notice that the writeback at the primary side is out-of-order by default, for performance reasons. Therefore, the underlying disk is only guaranteed to be consistent when there is no data left to be written back. Notice that this condition is racy by construction. When your primary node crashes during writeback and then comes up again, you must do a modprobe mars first in order to automatically replay the transaction logfiles, which will automatically heal such temporary inconsistencies. 9 Exceptions are possible when using marsadm fake-sync. Even in split brain situations, marsadm primary --force tries to prevent any further potential exception as best as it can, by not letting /dev/mars/mydata to appear and by insisting on split brain resolution first. In future implementations, this might change if more pressure is put on the developer to sacrifice consistency in preference to not waiting for a full logfile replay. 87 5. The Macro Processor of secondaries will always remain in a locally consistent state once the initial sync has finished as well as the initial logfile replay. Notice that local consistency does not necessarily imply actuality (see high-level explanation in section 2.1.2). is-emergency Shows whether emergency mode (see section 4.4.2) has been entered for the named resource, or not. rest-space (global, no resource argument necessary) Shows the logically available space in /mars/, which may deviate from the physically available space as indicated by the df command. get-{disk,device} Show the name of the underlying disk, or of the /dev/mars/mydata device (if it is available). {disk,device}-present Show (as a boolean value) whether the underlying disk, or the /dev/mars/mydata device, is available. device-opened Show (as a number) how often /dev/mars/mydata has been actually openend, e.g. by mount or by some processes like dd, or by iSCSI, etc. 5.1.2.2. 
Intended for Scripting

While complex macros may output a whole bunch of information, the following primitive macros output exactly one value. They are intended for script use (cf. section 5.3). Of course, curious humans may also try them :)

In the following, shell glob notation {a,b} is used to document similar variants of similar macros in a single place. When you actually call the macro, you must choose one of the possible variants (excluding the braces).

Name Querying

cluster-members Show a newline-separated list of all host names participating in the cluster.

resource-members Show a newline-separated list of all host names participating in the particular resource %{res}. Notice that this may be a subset of %cluster-members{}.

{my,all}-resources Show a newline-separated list of either all resource names existing in the cluster, or only those where the current host %{host} is a member. Optionally, you may specify the hostname as a parameter, e.g. %my-resources{otherhost}.

[Figure 5.1.: overview on amounts / cursors. On a scale from 0% to 100%, the figure arranges the cursors {replay,fetch,work}-pos and the totals {replay,fetch,work}-size together with the derived {replay,fetch,work}-rest differences, and shows how deletable-size, occupied-size, replay-basenr, the {replay,fetch,work}-lognr values and the {replay,fetch,work}-logcount values fit into this picture.]

Amounts of Data Inquiry

The following macros are meaningful for both primary and secondary nodes:

deletable-size Show the total amount of locally present logfile data which could be deleted by marsadm log-delete-all mydata. This almost always differs from both replay-pos and occupied-size for granularity reasons (only whole logfiles can be deleted). Units are bytes, not kilobytes.

occupied-size Show the total amount of locally present logfile data (the sum of all file sizes). This is often roughly equal to fetch-pos, but it may differ vastly (in both directions) when logfiles are not completely transferred, when some are damaged, during split brain, after a join-resource / invalidate, or when the resource is in emergency mode (see section 4.4.2).

disk-size Show the size of the underlying local disk in bytes.

resource-size Show the logical size of the resource in bytes. When this value is lower than disk-size, you are wasting space.

device-size At a primary node, this may differ from resource-size only for a very short time during the resize operation. At secondaries, there will be no difference.

The following macros are only meaningful for secondary nodes. Due to information-theoretic limits, they can only tell what is locally known. They cannot reflect the “true (global) state10” of a cluster, in particular during network partitions.

{sync,fetch,replay,work}-size Show the total amount of data which is / was to be processed by either sync, fetch, or replay. work-size is equivalent to fetch-size. replay-size is equivalent to fetch-pos (see below). Units are bytes, not kilobytes.

{sync,fetch,replay,work}-pos Show the total amount of data which has already been processed (the current “cursor” position). work-pos is equivalent to replay-pos. The 0% point is the locally contiguous amount of data since the last create-resource, join-resource, or invalidate, or since the last emergency mode, but possibly shortened by log-deletes. Notice that the 0% point may be different on different cluster nodes, because their resource histories may be different or non-contiguous during split brain, or after a join-resource, or after invalidate, or during / after emergency mode.
{sync,fetch,replay,work}-rest Show the difference between *-size and *-pos (the amount of work to do). work-rest is therefore the difference between fetch-size and replay-pos, which is the total amount of work to do (regardless of whether it still has to be fetched and/or replayed).

{sync,fetch,replay,work}-reached Boolean value indicating whether *-rest has dropped down to zero (see footnote 11).

{fetch,replay,work}-threshold-reached Boolean value indicating whether *-rest has dropped down to %{threshold}, which is pre-settable by the --threshold=size command line option (default is 10 MiB). In asynchronous use cases of MARS, this should be preferred over *-reached for human display, because it produces less flickering by the inevitable replication delay.

Footnote 10: Notice that according to Einstein's law, and according to observations by Lamport, the concept of "true state" does not exist at all in a distributed system. Anything you can know in a distributed system is always local knowledge, which races with other (remote) knowledge, and may be outdated at any time.

Footnote 11: Recall from chapter 2 that MARS (in its current stage of development) only guarantees local consistency, but cannot guarantee actuality in all imaginable situations. Notice that a general notion of "actuality" is undefinable in a widely distributed system at all, according to Einstein's laws. Let's look at an example. In case of a node crash, and after the node is up again, a modprobe mars has to occur, in order to replay the transaction logs of MARS again. However, in the recovery phase before that, the journalling ext4 filesystem /mars/ may have rolled back some internal symlink updates which had occurred immediately before the crash. MARS relies on the fact that journalling filesystems like ext4 should do their recovery in a consistent way, possibly by sacrificing actuality a little bit. Therefore, the above macros cannot guarantee to deliver true information about what is persisted at the moment. Notice that there are further potential caveats. In case of {sync,fetch}-reached, MARS uses bio callbacks resp. fdatasync() by default, thus the underlying storage layer has told us that it believes it has committed the data in a reboot-safe way. Whether this is really true does not depend on MARS, but on the lower layers of the storage hierarchy. There exists hardware where this claim is known to be wrong under certain circumstances, such as certain hard disk drives in certain modes of operation. Please check the hardware for any violations of storage semantics under certain circumstances such as power loss, and check information sources like magazines about the problem area. Please notice that such a problem, if it exists at all, is independent from MARS. It would also exist if you weren't using MARS on the same system.

{fetch,replay,work}-almost-reached Boolean value indicating whether *-rest has almost / approximately dropped down to zero. The default is that at least 990 permille must be reached. In asynchronous use cases of MARS, this can be preferred over *-reached for human display only, because it produces less flickering by the inevitable replication delay. However, don't base any decisions on this!

{sync,fetch,replay,work}-percent The cursor position *-pos as a percentage of *-size.

{sync,fetch,replay,work}-permille The cursor position *-pos as permille of *-size.

{sync,fetch,replay,work}-rate Show the current throughput in bytes (see footnote 12) per second. work-rate is the maximum of fetch-rate and replay-rate.
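Example (a sketch only; the macro names are documented above, the exact formatting is your choice): a one-line human-readable progress report and a boolean backlog check could look like

   marsadm view mydata --macro="replay: %replay-percent{}\%  fetch backlog: %human-number{}{}{}{%fetch-rest{}}\n"
   marsadm view-work-threshold-reached mydata --threshold=1G

The second command prints 0 or 1 and shows what the *-threshold-reached macros are intended for: monitoring whether the backlog has dropped below a configurable size, without the flickering of *-reached.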
{sync,fetch,replay,work}-remain Show the estimated remaining time for completion of the respective operation. This is just a very raw guess. Units are seconds. summary-vector Show the colon-separated CSV value %replay-pos{}:%fetch-pos{}:%fetch-size{}. replay-basenr Get currently first reachable logfile number (see figure 5.1 on page 88). Only for curious humans or for debugging / monitoring - don’t base any decisions on this. Use the *-{pos,size} macros instead. {replay,fetch,work}-lognr Get current logfile number of replay or fetch position, or of the currently known last reachable number (see figure 5.1 on page 88). Only for curious humans or for debugging / monitoring - don’t base any decisions on this. Use the *-{pos,size} macros instead. {replay,fetch,work}-logcount Get current number of logfiles which are already replayed, or are already fetched, or are to be applied in total (see figure 5.1 on page 88). Only for curious humans or for debugging / monitoring - don’t base any decisions on this. Use the *-{rest} macros instead. alive-timestamp Tell the Lamport Unix timestamp (seconds since 1970) of the last metadata communication to the designated primary (or to any other host given by the first argument). Returns −1 if no such host exists. {fetch,replay,work}-timestamp Tell the Lamport Unix timestamp (seconds since 1970) when the last progress has been made. When no such action exists, −1 is returned. %work-timestamp{hostname } is the maximum of %fetch-timestamp{hostname } and %replay-timestamp{hostname }. When the parameter hostname is empty, the local host will be reported (default). Example usage: marsadm view all --macro=”%replay-timestamp{%to shows the timestamp of the last reported13 writeback action at the designated primary. {alive,fetch,replay,work}-age Tell the number of seconds since the last respective action, or −1 if none exists. {alive,fetch,replay,work}-lag Report the time difference (in seconds) between the last known action at the local host and at the designated primary (or between any other hosts when 2 parameters are given). Returns −1 if no such action exists at any of the two hosts. Attention! This need not reflect the actual state in case of networking problems. Don’t draw wrong conclusions from a high {fetch,replay}-lag value: it could also mean that simply no write operation at all has occurred at the primary side for a long time. Conversely, a low lag value does not imply that the replication is recent: it may refer to different write operations at each of the hosts; therefore it only tells that some progress has been made, but says nothing about the amount of the progress. 12 Notice that the internal granularity reported by the kernel may be coarser, such as KiB. This interfaces abstracts away from kernel internals and thus presents everything in byte units. 13 Updates of this information are occurring with lower frequency than actual writebacks, for performance reasons. The metadata network update protocol will add further delays. Therefore, the accuracy is only in the range of minutes. 90 5.2. Creating your own Macros Misc Informational Status get-primary Return the name of the current designated primary node as locally known. actual-primary (deprecated) try to determine the name of the node which appears to be the actual primary. This only a guess, because it is not generally unique in split brain situations! Don’t use this macro. Instead, use is-primary on those nodes you are interested in. 
The explanations from section 3.5 also apply to get-primary versus actual-primary analogously. is-alive Boolean value indicating whether all other nodes participating in mydata are reachable / healthy. uuid (global) Show the unique identifier created by create-cluster or by create-uuid. Hint: this is immutable, and it is firmly bound to the /mars/ filesystem. It can only be destroyed by deleting the whole filesystem (see section 6.2). tree (global) Indicate symlink tree version (see section 4.3). Experts Only The following is for hackers who know what they are doing. The following is not officially supported. wait-{is,todo}-{attach,sync,fetch,replay,primary}-{on,off} This may be used to program some useful waiting conditions in advanced macro scripts. Use at your own risk! 5.2. Creating your own Macros In order to create your own macros, you could start writing them from scratch with your favorite ASCII text editor. However, it is much easier to take an existing macro and to customize it to your needs. In addition, you can learn something about macro programming by looking at the existing macro code. Go to a new empty directory and say • marsadm dump-macros in order to get the most interesting complex macros, or say • marsadm dump-all-macros in order to additionally get some primitive macros which could be customized if needed. This will write lots of files *.tpl into your current working directory. Any modfied or new macro file should be placed either into the current working directory ./ , or into $HOME/.marsadm/ , or into /etc/marsadm/ . They will be searched in this order, and the first match will win. When no macro file is found, the built-in version will be used if it exists. This way, you may override builtin macros. Example: if you have a file ./mymacro.tpl you just need to say marsadm view-mymacro mydata in order to invoke it in the resource context mydata. 5.2.1. General Macro Syntax Macros are simple ASCII text, enriched with calls to other macros. ASCII text outside of comments are copied to the output verbatim. Comments are skipped. Comments may have one of the following well-known forms: • # skipped text until / including next newline character • // skipped text until / including next newline character • /* skipped text including any newline characters */ • denoted as Perl regex: \\\n\s* (single backslash directly followed by a newline character, and eating up any whitespace characters at the beginning of the next line) Hint: this may be fruitfully used to structure macros in a more readable form / indentation. 91 5. The Macro Processor Special characters are always initiated by a backslash. The following pre-defined special character sequences are recognized: • \n newline • \r return (useful for DOS compatibility) • \t tab • \f formfeed • \b backspace • \a alarm (bell) • \e escape (e.g. for generating ANSI escape sequences) • \ followed by anything else: assure that the next character is taken verbatim. Although possible, please don’t use this for escaping letters, because further escape sequences might be pre-defined in future. Best practice is to use this only for escaping the backslash itself, or for escaping the percent sign when you don’t want to call a macro (protect against evaluation), or to escape a brace directly after a macro call (verbatim brace not to be interpreted as a macro parameter). • All other characters stand for their own. If you like, you should be able to produce XML, HTML, JSON and other ASCII-based output formats this way. 
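To illustrate the comment and escaping rules, here is a small hypothetical template (using the example file name ./mymacro.tpl from above; the macro call syntax itself is explained in the following paragraphs):

   # everything after a hash sign is skipped up to the end of the line
   /* block comments
      may span several lines */\
   resource %{res} \
      is handled by host %{host}\n

Invoked as marsadm view-mymacro mydata, this expands to a single output line; the backslash-newline sequences only serve to keep the template source readable.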
Macro calls have the following syntax: • %macroname {arg1 }{arg2 }{argn } • Of course, arguments may be empty, denoted as {} • It is possible to supply more arguments than required. These are simply ignored. • There must be always at least 1 argument, even for parameterless macros. In such a case, it is good style to leave it empty (even if it is actually ignored). Just write %parameterlessmacro{} in such a case. • %{varname } syntax: As a special case, the macro name may be empty, but then the first argument must denote a previously defined variable (such as assigned via %let{varname}{myvalue}, or a pre-defined standard variable like %{res} for the current resource name, see later paragraph 5.2.3). • Of course, parameter calls may be (almost) arbitrarily nested. • Of course, the correctness of nesting of braces must be generally obeyed, as usual in any other macro processor language. General rule: for each opening brace, there must be exactly one closing brace somewhere afterwards. These rules are hopefully simple and intuitive. There are currently no exceptions. In particular, there is no special infix operator syntax for arithmetic expressions, and therefore no operator precedence rules are necessary. You have to write nested arithmetic expressions always in the above prefix syntax, like %*{7}{%+{2}{3}} (similar to non-inverse polish notation). When deeply nesting macros and their braces, you may easily find yourself in a feeling like in the good old days of Lisp. Use the above backslash-newline syntax to indent your macros in a readable and structured way. Fortunately, modern text editors like (x)emacs or vim have modes for dealing with the correctness of nested braces. 92 5.2. Creating your own Macros 5.2.2. Calling Builtin / Primitive Macros Primitive macros can be called in two alternate forms: • %primitive-macroname {something } • %macroname {something } When using the %primitive-*{} form, you explicitly disallow interception of the call by a *.tpl file. Otherwise, you may override the standard definition even of primitive macros by your own template files. Notice that %call{} conventions are used in such a case. The parameters are passed via %{0} . . .%{n} variables (see description below). Standard MARS State Inspection Macros These are already described in section 5.1.2. When calling one of them, the call will simply expand to the corresponding value. Example: %get-primary{} will expand to the hostname of the current designated primary node. Further MARS State Inspection Macros Variable Access Macros • %let{varname }{expression } Evaluates both varname and the expression . The expression is then assigned to varname. • %let{varname }{expression } Evaluates both varname and the expression . The expression is then appended to varname (concatenation). • %{varname } Evaluates varname , and outputs the value of the corresponding variable. When the variable does not exist, the empty string is returned. • %{++}{varname } or %{varname }{++} Has the obvious well-known side effect e.g. from C or Java. You may also use -- instead of ++. This is handy for programming loops (see below). • %dump-vars{} Writes all currently defined variables (from the currently active scope) to stderr. This is handy for debugging. CSV Array Macros • %{varname }{delimiter }{index } Evaluates all arguments. The contents of varname is interpreted as a comma-separated list, delimited by delimiter . The index ’th list element is returned. • %set{varname }{delimiter }{index }{expression } Evaluates all arguments. 
The contents of the old varname is interpreted as a comma-separated list, delimited by delimiter . The index 'th list element is then assigned to, or substituted by, expression .

Arithmetic Expression Macros

The following macros can also take more than two arguments, carrying out the corresponding arithmetic operation in sequence (it depends on the operator whether this accords with the associative law).

• %+{arg1 }{arg2 } Evaluates the arguments, interprets them as numbers, and adds them together.
• %-{arg1 }{arg2 } Subtraction.
• %*{arg1 }{arg2 } Multiplication.
• %/{arg1 }{arg2 } Division.
• %%{arg1 }{arg2 } Modulus.
• %&{arg1 }{arg2 } Bitwise Binary And.
• %|{arg1 }{arg2 } Bitwise Binary Or.
• %^{arg1 }{arg2 } Bitwise Binary Exclusive Or.
• %<<{arg1 }{arg2 } Binary Shift Left.
• %>>{arg1 }{arg2 } Binary Shift Right.
• %min{arg1 }{arg2 } Compute the arithmetic minimum of the arguments.
• %max{arg1 }{arg2 } Compute the arithmetic maximum of the arguments.

Boolean Condition Macros

• %=={arg1 }{arg2 } Numerical Equality.
• %!={arg1 }{arg2 } Numerical Inequality.
• %<{arg1 }{arg2 } Numerical Less Than.
• %<={arg1 }{arg2 } Numerical Less or Equal.
• %>{arg1 }{arg2 } Numerical Greater Than.
• %>={arg1 }{arg2 } Numerical Greater or Equal.
• %eq{arg1 }{arg2 } String Equality.
• %ne{arg1 }{arg2 } String Inequality.
• %lt{arg1 }{arg2 } String Less Than.
• %le{arg1 }{arg2 } String Less or Equal.
• %gt{arg1 }{arg2 } String Greater Than.
• %ge{arg1 }{arg2 } String Greater or Equal.
• %=~{string }{regex }{opts } or %match{string }{regex }{opts } Checks whether string matches the Perl regular expression regex . Modifiers can be given via opts .

Shortcut Evaluation Operators

The following operators evaluate their arguments only when needed (like in C).

• %&&{arg1 }{arg2 } Logical And.
• %and{arg1 }{arg2 } Alias for %&&{}.
• %||{arg1 }{arg2 } Logical Or.
• %or{arg1 }{arg2 } Alias for %||{}.

Unary Operators

• %!{arg } Logical Not.
• %not{arg } Alias for %!{}.
• %~{arg } Bitwise Negation.

String Functions

• %length{string } Return the number of ASCII characters present in string .
• %toupper{string } Return all ASCII characters converted to uppercase.
• %tolower{string } Return all ASCII characters converted to lowercase.
• %append{varname }{string } Equivalent to %let{varname }{%{varname }string }.
• %subst{string }{regex }{subst }{opts } Perl regex substitution.
• %sprintf{fmt }{arg1 }{arg2 }{argn } Perl sprintf() operator. See the Perl manual for details.
• %human-number{unit }{delim }{unit-sep }{number1 }{number2 }. . . Convert a number or a list of numbers into human-readable B, KiB, MiB, GiB, TiB, as given by unit . When unit is empty, a reasonable unit will be guessed automatically from the maximum of all given numbers. A single result string is produced, where multiple numbers are separated by delim when necessary. When delim is empty, the slash symbol / is used by default (the most obvious use case is result strings like "17/32 KiB"). The final unit text is separated from the previous number(s) by unit-sep . When unit-sep is empty, a single blank is used by default.
• %human-seconds{number } Convert the given number of seconds into hh:mm:ss format.

Complex Helper Macros

• %progress{20} Return a string containing a progress bar showing the values from %summary-vector{}. The default width is 20 characters plus two braces.
• %progress{20}{minvalue }{midvalue }{maxvalue } Instead of taking the values from %summary-vector{}, use the supplied values.
minvalue and midvalue indicate two different intermediate points, while maxvalue determines the 100% point.

Control Flow Macros

• %if{expression }{then-part } or %if{expression }{then-part }{else-part } Like in any other macro or programming language, this evaluates the expression once, not copying its outcome to the output. If the result is non-empty and is not a string denoting the number 0, the then-part is evaluated and copied to the output. Otherwise, the else-part is evaluated and copied, provided that one exists.
• %unless{expression }{then-part } or %unless{expression }{then-part }{else-part } Like %if{}, but the expression is logically negated. Essentially, this is a shorthand for %if{%not{expression}}{...} or similar.
• %elsif{expr1 }{then1 }{expr2 }{then2 }. . . or %elsif{expr1 }{then1 }{expr2 }{then2 }. . .{odd-else-part } This is for simplification of boring if-else-if chains. The classical if-syntax (as shown above) has the drawback that inner if-parts need to be nested into outer else-parts, so rather deep nestings may occur when you are programming longer chains. This is an alternate syntax for avoidance of deep nesting. When giving an odd number of arguments, the last argument is taken as the final else-part.
• %elsunless. . . Like %elsif, but all conditions are negated.
• %while{expression }{body } Evaluates the expression in a while loop, like in any other macro or programming language. The body is evaluated exactly as many times as the expression holds. Notice that endless loops can only be avoided by calling a non-pure macro inspecting external state information, or by creating (and checking) another side effect somewhere, like assigning to a variable somewhere.
• %until{expression }{body } Like %while{expression }{body }, but negate the expression.
• %for{expr1 }{expr2 }{expr3 }{body } As you will expect from the corresponding C, Perl, Java, or (add your favorite language) construct. Only the syntactic sugar is a little bit different.
• %foreach{varname }{CSV-delimited-string }{delimiter }{body } As you can expect from similar foreach constructs in other languages like Perl. Currently, the macro processor has no arrays, but can use comma-separated strings as a substitute.
• %eval{count }{body } Evaluates the body exactly as many times as indicated by the numeric argument count . This may be used to re-evaluate the output of other macros once again.
• %protect{body } Equivalent to %eval{0}{body }, which means that the body is not evaluated at all, but copied to the output verbatim (see footnote 14).
• %eval-down{body } Evaluates the body in a loop until the result does not change any more (see footnote 15).
• %tmp{body } Evaluates the body once in a temporary scope which is thrown away afterwards.
• %call{macroname }{arg1 }{arg2 }{argn } Like in many other macro languages, this evaluates the named macro in a new scope. This means that any side effects produced by the called macro, such as variable assignments, will be reverted after the call, and therefore will not influence the old scope. However, notice that the arguments arg1 to argn are evaluated in the old scope before the call actually happens (possibly producing side effects if they contain some), and their results are respectively assigned to %{1} until %{n } in the new scope, analogously to the Shell or to Perl. In addition, the new %{0} gets the macroname . Notice that the argument evaluation happens non-lazily in the old scope and therefore differs from other macro processors like TEX.
• %include{macroname }{arg1 }{arg2 }{argn } Like %call{}, but evaluates the named macro in the current scope (similar to the source command of the Bourne shell). This means that any side effects produced by the called macro, such as variable assignments, will not be reverted after the call. Even the %{0} until %{n } variables will continue to exist (and may lead to confusion if you aren't aware of that).
• %callstack{} Useful for debugging: show the current chain of macro invocations.

Time Handling Macros

• %time{} Return the current Lamport timestamp (see section 4.2), in units of seconds since the Unix epoch.
• %real-time{} Return the current system clock timestamp, in units of seconds since the Unix epoch.
• %sleep{seconds } Pause for the given number of seconds.
• %timeout{seconds } Like %sleep{seconds }, but abort the marsadm command after the total waiting time has exceeded the timeout given by the --timeout= parameter.

Misc Macros

• %warn{text } Show text as a WARNING: message.
• %die{text } Abort execution with an error message.

Footnote 14: TEX or LATEX fans usually know what this is good for ;)
Footnote 15: Mathematicians who know Banach's fixed-point theorem will know what this is good for ;)

Experts Only - Risky

The following macros are unstable and may change at any time without notice.

• %get-msg{name } Low-level access to system messages. You should not use this, since it is not extensible (you must know the name in advance).
• %readlink{path } Low-level access to symlinks. Don't misuse this for circumvention of the abstraction macros for the symlink tree!
• %setlink{value }{path } Low-level creation of symlinks. Don't misuse this for circumvention of the abstraction macros for the symlink tree!
• %fetch-info{} etc. Low-level access to internal symlink formats. Don't use this in scripts! Only for curious humans.
• %is-almost-consistent{} Whatever you guess this could mean, don't use it, at least never in place of %is-consistent{} - it is risky to base decisions on this. Mostly for historical reasons.
• %does{name } Equivalent to %is-name {} (just more handy for computing the macro name). Use with care!

5.2.3. Predefined Variables

• %{cmd} The command argument of the invoked marsadm command.
• %{res} The resource name given to the marsadm command as a command line parameter (or, possibly, expanded from all).
• %{resdir} The corresponding resource directory. The current version of MARS uses /mars/resource-%{res}/, but this may change in future. Normally, you should not need this, since everything should already be abstracted for you. In case you really need low-level access to something, please prefer this variable over %{mars}/resource-%{res} because it is a bit more abstracted.
• %{mars} Currently the fixed string /mars. This may change in future, probably with the advent of MARS Full.
• %{host} The hostname of the local node.
• %{ip} The IP address of the local node.
• %{timeout} The value given by the --timeout= option, or the corresponding default value.
• %{threshold} The value given by the --threshold= option, or the corresponding default value.
• %{window} The value given by the --window= option, or the corresponding default value (60s).
• %{force} The number of times the --force option has been given.
• %{dry-run} The number of times the --dry-run option has been given.
• %{verbose} The number of times the --verbose option has been given.
• %{callstack} Same as the %callstack{} macro. The latter gives you an opportunity for overriding, while the former is firmly built in.
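Putting the pieces together, the following is a minimal sketch of a custom template (hypothetically named fetch-report.tpl, not shipped with MARS) which combines a variable assignment, arithmetic, a condition, and predefined variables:

   # fetch-report.tpl -- illustrative only
   %let{todo}{%-{%fetch-size{}}{%fetch-pos{}}}\
   %if{%>{%{todo}}{0}}\
      {resource %{res}: %human-number{}{}{}{%{todo}} still to be fetched\n}\
      {resource %{res}: fetch has caught up\n}

It would be invoked as marsadm view-fetch-report mydata. The explicit subtraction is only for demonstration; in practice the predefined %fetch-rest{} macro already delivers the same value.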
5.3. Scripting HOWTO

Both the asynchronous communication model of MARS (cf. section 4.2), including the Lamport clock, and the state model (cf. section 3.5) are things you definitely should have in mind when you want to do some scripting. Here is some further concrete advice:

• Don't access anything on /mars/ directly, except for debugging purposes. Use marsadm.
• Avoid running scripts in parallel, other than for inspection / monitoring purposes. When you give two marsadm commands in parallel (whether on the same host, or on different hosts belonging to the same cluster), it is very likely to produce a mess. marsadm has no internal locking. There is no cluster-wide locking at all. Unfortunately, some systems like Pacemaker are violating this in many cases (depending on their configuration). It is best to have a dedicated / more or less centralized control machine which controls the masses of your georedundant working servers. This reduces the risk of running interfering actions in parallel. Of course, you need backup machines for your control machines, and in different locations. Not obeying this advice can easily lead to problems such as complex races which are very difficult to solve in long-distance distributed systems, even in general (not limited to MARS).
• marsadm wait-cluster is your friend. Whenever your (near-)central script has to switch between different hosts A and B (of the same cluster), use it in the following way: ssh A "marsadm action1"; ssh B "marsadm wait-cluster; marsadm action2" Don't ignore this advice! Otherwise interference is almost certain! As a rule of thumb, precede almost any action command with some appropriate waiting command!
• Further friends are any marsadm wait-* commands, such as wait-umount.
• In some places, busy-wait loops might be needed, e.g. for waiting until a specific resource is UpToDate or matches some other condition. Examples of waiting conditions can be found under github.com/schoebel/test-suite in subdirectory mars/modules/, specifically 02_predicates.sh or similar.
• In case of network problems, some commands may hang (forever) if you don't set the --timeout= option. Don't forget to check the return status of any failed / timed-out commands, and to take appropriate measures!
• Test your scripts in failure scenarios!

6. The Sysadmin Interface (marsadm and /proc/sys/mars/)

In general, the term "after a while" means that other cluster nodes will take notice of your actions according to the "eventually consistent" propagation protocol described in sections 4.2 and 4.3. Please be aware that this "while" may last very long in case of network outages or bad firewall rules. In the following tables, column "Cmp" means compatibility with DRBD. Please note that 100% exact compatibility is not possible, because of the asynchronous communication paradigm. The following table documents common options which work with (almost) any command:

--dry-run (Cmp: no)
Run the command without actually creating symlinks or touching files or executing rsync. This option should be used first at any dangerous command, in order to check what would happen. Don't use it in scripts! Only use it by hand! This option does not change the waiting logic. Many commands wait until the desired effect has taken place. However, with --dry-run the desired effect will never happen, so the command may wait forever (or abort with a timeout). In addition, this option can lead to additional aborts of the commands due to unmet conditions, which cannot be met because the symlinks are not actually created / altered. Thus this option can give only a rough estimate of what would happen later!

--force (Cmp: almost)
Some preconditions are skipped, i.e. the command will / should work although some (more or less) vital preconditions are violated. Instead of giving --force, you may alternatively prefix your command with force-. THIS OPTION IS DANGEROUS! Use it only when you are absolutely sure that you know what you are doing! Use it only as a last resort if the same command without --force has failed for no good reason!

--ignore-sync (Cmp: almost)
Use this for a planned handover instead of --force. Only one precondition is relaxed: some sync may be running somewhere. Be careful when using this on extremely huge LVs where the sync may take several days, or weeks. It is your sysadmin decision what you want to prefer: restarting the sync, or planned handover.

--verbose (Cmp: no)
Some (few) commands will become more verbose.

--timeout=$seconds (Cmp: no)
Some commands require a response from either the local kernel module, or from other cluster nodes. In order to prevent infinite waiting in case of network outages or other problems, the command will fail after the given timeout has been reached. When $seconds is -1, the command will wait forever. When $seconds is 0, the command will not wait in case any precondition is not met, and will abort without performing an action. The default timeout is 5s.

--window=$seconds (Cmp: no)
The time window for checking the aliveness of other nodes in the network. When no symlink updates have occurred during the last window, the node is considered dead. Default is 60s.

--threshold=$size (Cmp: no)
The macros containing the substring -threshold- or -almost- use this as a default value for approximating whether something has been (approximately) reached. Default is 10MiB. The $size argument may be a number optionally followed by one of the lowercase characters k m g t p for indicating kilo, mega, giga, tera or peta bytes as multiples of 1000. When using the corresponding uppercase character, multiples of 1024 are formed instead.

--host=$host (Cmp: no)
The command acts as if it were executed on another host $host. This option should not be used regularly, because the local information in the symlink tree may be outdated or even wrong. Additionally, some local information like remote sizes of physical devices (e.g. remote disks) is not present in the symlink tree at all, or is wrong (reflecting only the local state). THIS OPTION IS DANGEROUS! Use it only for final destruction of dead cluster nodes, see section 3.4.4.

--ip=$ip (Cmp: no)
By default, marsadm always uses the IP for $host as stored in the symlink tree (directory /mars/ips/). When such an IP entry does not (yet) exist (e.g. at create-cluster or join-cluster), all local network interfaces are automatically scanned for IPv4 addresses, and the first one is taken. This may lead to wrong decisions if you have multiple network interfaces. In order to override the automatic IP detection and to explicitly tell the IP address of your storage network, use this option. Usually you will need this only at {create,join}-cluster.

6.1.
Cluster Operations Command / Params Cmp Description create-cluster no Precondition: the /mars/ filesystem must be mounted and it must be empty (mkfs.ext4, see instructions in section 3.2.2). The kernel module must not be loaded. Postcondition: the initial symlink tree is created in /mars/. Additionally, the /mars/uuid symlink is created for later distribution in the cluster. It uniquely indentifies the cluster in the world. This must be called exactly once at the initial primary. join-cluster no Precondition: the /mars/ filesystem must be mounted and it must be empty (mkfs.ext4, see instructions in section 3.2.2). The kernel module must not be loaded. The cluster must have been already created at another node $host. A working ssh connecttion to $host as root must exist (without password). rsync must be installed at all cluster nodes. Postcondition: the initial symlink tree /mars/ is replicated from the remote host $host, and the local host has been added as another cluster member. This must be called exactly once at every initial secondary node. Hint: use the --ip= option if you have multiple interfaces. $host Hint: use the --ip= option if you have multiple interfaces. Command / Params 100 Cmp Description 6.1. Cluster Operations Command / Params leave-cluster Cmp no Description Precondition: the /mars/ filesystem must be mounted and it must contain a valid MARS symlink tree produced by the other marsadm commands. The local node must no longer be member of any resource (see marsadm leave-resource). The kernel module should be loaded and the network should be operating in order to also propogate the effect to the other nodes. Postcondition: the local node is removed from the replicated symlink tree /mars/ such that other nodes will cease to communicate with it a after a while. The converse it not true: the local node may continue passivley fetching the symlink tree. In order to really stop all communication, the kernel module should be unloaded afterwards. The local /mars/ filesystem may be manually destroyed after that (at least if you need to reuse it). In case of an eventual node loss (e.g. fire, water, ...) this command should be used on another node $helper in order to finally remove $damaged from the cluster via the command marsadm leave-cluster --host=$damaged --force. In case you cannot use leave-resource for any reason, you may do the following: just destroy the /mars/ filesystem on the host $deadhost you want to remove (e.g. by mkfs), or take other measures to ensure that it cannot be accidentally re-used in any way (e.g. physical destruction of the underlying RAID, lvremove, etc). On all other hosts, do rmmod mars, then delete the symlink /mars/ips/ip-$deadhost everywhere by hand, and finally modprobe mars again. Notice that the last leave-resource operation does not delete the cluster as such. It just creates an empty cluster which has no longer any members. In particular, the cluster ID /mars/uuid is not removed, b deliberately . Before you can re-use any left-over /mars/ filesystem for creating / joining a new / different cluster, you must obey the instructions in section 3.2.2 and use mkfs.ext4 accordingly. a Reason: leave-cluster removes only its own IP address from /mars/ips/, but does not destroy the usual symmetry of the symlink tree by leaving the other IPs intact. Therefore, the local node will continue fetching updates from all nodes present in /mars/ips/. As an effect, the local node will passively mirror the symlinks of other cluster members, but not vice versa. 
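As a hedged example of the node-loss procedure mentioned above (the hostname oldnode and the resource name mydata are placeholders): on a surviving cluster member, you would first remove the dead node's resource memberships and then its cluster membership, roughly like

   marsadm leave-resource mydata --host=oldnode --force
   marsadm leave-cluster --host=oldnode --force

Check the preconditions described here and in sections 3.4.4 and 6.2 first; the --force variants are meant only for nodes which are permanently gone.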
There is no communication from the local node to the other ones, turning the local node into a whitness according to some terminology from Distributed Systems. This is a feature, not a bug. It could be used for porstmortem analysis, or for monitoring purposes. However, deletions of symlinks are not guaranteed to take place, so your whitness may accumulate thousands of old symlinks over a long time. If you want to eventually stop all communication to the local node, just run rmmod. b This is a feature, not a bug. The uuid is created once, but never alterered anywhere. The only way to get rid of it is external deletion (not by marsadm) together(!) with all other contents of /mars/. This prevents you from accidentally merging half-dead remains which could have survived a disaster for any reason, such as snapshotting filesystems / VMs or whatever. Command / Params Cmp Description 101 6. The Sysadmin Interface (marsadm and /proc/sys/mars/) Command / Params Cmp no merge-cluster $host Description Precondition: the set of resources at the local cluster (transitively) and at the cluster of $host (transitively) must be disjoint. Create the union of both clusters, consisting of the union of all participating machines (transitively). Resource memberships are unaffected. This is useful for creating a “virtual LVM cluster” where resources can be migrated later via join-resource / leave-resource operations. Attention! The mars branch 0.1.y does not scale well in number of cluter members, because it evolved from a lab prototype with O(n2 ) behaviour at metadata exchange. Never exceed the maximum cluster members as described in appendix A on page 142. For safety, you should better stay at 1/2 of the numbers mentioned there. Use split-cluster for going back to smaller clusters again after your background data migration has completed. merge-cluster-check no Future versions of MARS, starting with branch 0.1b.y will be constructed for very big clusters in the range of thousands of nodes. Development has not yet stabilized there, and operational experiences are missing at the moment. Be careful until official announcements are appearing in the ChangeLog, reporting of operational experiences from the 1&1 big cluster at metadata level. Check in advance whether the set of resources at the local cluster and at the other cluster $host are disjoint. $host split-cluster no wait-cluster no create-uuid no This is almost the inverse operation of merge-cluster: it determines the minimum sub-cluster groups participating in some common resources. Then it splits the cluster memberships such that unnecessary connections between non-related nodes are interrupted. Use this for avoidance of too big clusters. See section 6.3.3. Deprecated. Only for compatibility with old version light0.1beta05 or earlier. Precondition: the /mars/ filesystem must be mounted. A uuid (such as automatically created by recent versions of marsadm create-cluster) must not already exist; i.e. you have a very old and outdated symlink tree. Postcondition: the /mars/uuid symlink is created for later distribution in the cluster. It uniquely indentifies the cluster in the world. This must be called at most once at the current primary. Command / Params Cmp Description 6.2. 
Resource Operations

Common precondition for all resource operations is that the /mars/ filesystem is mounted, that it contains a valid MARS symlink tree produced by other marsadm commands (including a unique uuid), that your current node is a valid member of the cluster, and that the kernel module is loaded. When communication is impossible due to network outages or bad firewall rules, most commands will succeed, but other cluster nodes may take a long time to notice your changes.

Instead of executing marsadm commands several times for each resource argument, you may give the special resource argument all. This works even when combined with --force, but be cautious when giving dangerous command combinations like marsadm delete-resource --force all. Beware when combining this with --host=somebody. In some very rare cases, like final destruction of a whole datacenter after an earthquake, you might need a combination like marsadm --host=defective delete-resource --force all. Don't use such combinations unless you really need them! You can easily shoot yourself in the head if you are not operating such commands carefully!

6.2.1. Resource Creation / Deletion / Modification

create-resource $res $disk_dev [$mars_name] [$size] (Cmp: no)
Precondition: the resource argument $res must not denote an already existing resource name in the cluster. The argument $disk_dev must denote an absolute path to a usable local block device; its size must be greater than zero. When the optional $mars_name is given, that name must not already exist on the local node; when not given, $mars_name defaults to $res. When the optional $size argument is given, it must be a number, optionally followed by a lowercase suffix k, m, g, t, or p (denoting size factors as multiples of 1000), or an uppercase suffix K, M, G, T or P (denoting size factors as multiples of 1024). The given size must not exceed the actual size of $disk_dev. It will specify the future resource size as shown by marsadm view-resource-size $res.
Postcondition: the resource $res is created; the initial role of the current node is primary. The corresponding symlink tree information is asynchronously distributed in the cluster (in the background). The device /dev/mars/$mars_name should appear after a while.
Notice: when $size is strictly smaller than the size of $disk_dev, you will unnecessarily waste some space. This must be called exactly once for any new resource.

join-resource $res $disk_dev [$mars_name] (Cmp: no)
Precondition: the resource argument $res must denote an already existing resource in the cluster (i.e. its symlink tree information must have been received). The resource must have a designated primary, and it must not be in emergency mode. There must not exist a split brain in the cluster. The local node must not already be a member of that resource. The argument $disk_dev must denote an absolute path to a usable (but currently unused) local block device; its size must be greater than or equal to the logical size of the resource. When the optional $mars_name is given, that name must not already exist on the local node; when not given, $mars_name defaults to $res.
Postcondition: the current node becomes a member of resource $res; the initial role is secondary. The initial full sync should start after a while.
Notice: when the size of $disk_dev is strictly greater than the size of the resource, you will unnecessarily waste some space. After a while, state Orphan should be left. Don't forget to regularly monitor for longer occurrences of Orphan!

leave-resource $res (Cmp: no)
Precondition: the local node must be a member of the resource $res; its current role must be secondary. Sync, fetch and replay must be paused (see commands pause-{sync,fetch,replay} or their abbreviation down). The disk must be detached (see commands detach or down). The kernel module should be loaded and the network should be operating in order to also propagate the effect to the other nodes.
Postcondition: the local node is no longer a member of $res.
Notice: as a side effect for other nodes, their log-delete may now become possible, since the current node no longer counts as a candidate for logfile application. In addition, a split brain situation may be (partly) resolved by this. Please notice that this command may lead to (but does not guarantee) split-brain resolution. The contents of the disk are not changed by this command. Before issuing this command, check whether the disk appears to be locally consistent (see view-is-consistent)! After giving this command, any internal information indicating the consistency state will be gone, and you will no longer be able to guess consistency properties. When you are sure that the disk was consistent before (or is now, by manually checking it), you may re-create a new resource out of it via create-resource. In case of an eventual node loss (e.g. fire, water, ...) this command may be used on another node $helper in order to finally remove all the resources of $damaged from the cluster via the command marsadm leave-resource $res --host=$damaged --force.
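A minimal creation sketch, assuming the resource name mydata and a local LVM disk /dev/vg00/mydata (both placeholders):

   # on the initial primary
   marsadm create-resource mydata /dev/vg00/mydata
   # on each initial secondary (after join-cluster has been done there)
   marsadm join-resource mydata /dev/vg00/mydata

After a while, /dev/mars/mydata appears on the primary, and the initial full sync towards the secondaries starts in the background.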
Notice: when the size of $disk_dev is strictly greater than the size of the resource, you will unnecessarily waste some space. After a while, state Orphan should be left. Don’t forget to regularly monitor for longer occurrences of Orphan! Precondition: the local node must be a member of the resource $res; its current role must be secondary. Sync, fetch and replay must be paused (see commands pause-{sync,fetch,replay} or their abbreviation down). The disk must be detatched (see commands detach or down). The kernel module should be loaded and the network should be operating in order to also propogate the effect to the other nodes. Postcondition: the local node is no longer a member of $res. Notice: as a side effect for other nodes, their log-delete may now become possible, since the current node does no longer count as a candidate for logfile application. In addition, a split brain situation may be (partly) resolved by this. Please notice that this command may lead to (but does not guarantee) split-brain resolution. The contents of the disk is not changed by this command. Before issuing this command, check whether the disk appears to be locally consistent (see view-is-consistent)! After giving this command, any internal information indicating the consistency state will be gone, and you will no longer be able to guess consistency properties. When you are sure.that the disk was consistent before (or is now by manually checking it), you may re-create a new resource out of it via create-resource. In case of an eventual node loss (e.g. fire, water, ...) this command may be used on another node $helper in order to finally remove all the resources $damaged from the cluster via the command marsadm leave-resource $res --host=$damaged --force. Command / Params Cmp Description 103 6. The Sysadmin Interface (marsadm and /proc/sys/mars/) Command / Params Cmp no delete-resource $res Description Precondition: the resource must be empty (i.e. all members must have left via leave-resource). This precondition is overridable by --force, increasing the danger to maximum! It is even possible to combine --force with an invalid resource argument and an invalid --host=somebodyelse argument in order to desperately try to destroy remains of incomplete or pysically damaged hardware. Postcondition: all cluster members will somewhen be forcefully removed from $res. In case of network interruptions, the forced removal may take place far in the future. THIS COMMAND IS VERY DANGEROUS! Use this only in desperate situations, and only manually. Don’t call this from scripts. You are forcefully using a sledgehammer, even without --force! The danger is that the true state of other cluster nodes need not be known in case of network problems .Even when it were known, it could be compromised by byzantine failures. It is strongly advised to try this command with --dry-run first. When combined with --force, this command will definitely murder other cluster nodes, possibly after a long while, and even when they are operating in primary mode / having split brains / etc. However, there is no guarantee that other cluster nodes will be really dead – it is (theoretically) possible that they remain only half dead. For example, a half dead node may continue to write data to /mars/ and thus lead to overflow somewhen. This command implies a forceful detach, possibly destroying consistency. It is similar in spirit to a STONITH. 
In particular, when a cluster node was operating in primary mode (/dev/mars/mydata being continuously in use), the forceful detach cannot be carried out until the device is completely unused. In the meantime, the current transaction logfile will be appended to, but the file might be already unlinked (orphan file filling up the disk). After the forceful detach, the underlying disk need not be consistent (although MARS does its best). Since this command deletes any symlinks which normally would indicate the consistency state, no guarantees about consistency can be given after this in general! Always check consistency by hand! When possible / as soon as possible, check the local state on the other nodes in order to really shutdown the resource everywhere (e.g. to really unuse the /dev/mars/mydata device, etc). After this command, you should rebuild the resource under a different name, in order to avoid any clashes caused by unexpected resurrection of “dead” or “half-dead” nodes (beware of shapshot / restores on virtual machines!!). MARS does its best to avoid problems even in case the new resource name should equal the old one, but there can be no guarantee in all possible failure scenarios / usage scenarios. no wait-resource When possible, prefer leave-resource over this! See section 6.3.3. $res {is-,}{attach, primary, device}{-off,} Command / Params Cmp Description 6.2.2. Operation of the Resource Common preconditions are the preconditions from section 6.2, plus the respective resource $res must exist, and the local node must be a member of it. With the single exception of attach itself, all other operations must be started in attached state. When $res has the special reserved value all, the following operations will work on all resources where the current node is a member (analogously to DRBD). Command / Params Cmp Description Command / Params Cmp Description 104 6.2. Resource Operations Command / Params Cmp yes attach $res Description Precondition: the local disk belonging to $res is not in use by anyone else. Its contents has not been altered in the meantime since the last detach. Mounting read-only is allowed during the detached phase. However, be careful! If you accidentally forget to give the right readonly-mount flags, if you use fsck in repair mode inbetween, or alter the disk content in any other way (beware of LVM snapshots / restores etc), you will almost certainly produce an unnoticed inconsistency (not reported by view-is-consistent)! MARS has no chance to notice suchalike! Postcondition: MARS uses the local disk and is able to work with it (e.g. replay logfiles on it). Note: the local disk is opened in exclusive read-write mode. This should protect against most common misuse, such as opening the disk in parallel to MARS. yes detach $res However, this does not necessarily protect against non-exclusive openers. Precondition: the local /dev/mars/mydata device (when present) is no longer opened by anybody. Postcondition: the local disk belonging to $res is no longer in use. In contrast to DRBD, you need not explicitly pause syncing, fetching, or replaying to (as apposed to from) the local disk. These processes are automatically paused. As another contrast to DRBD, the respective processes will usually automatically resume after re-attach, as far as possible in the respective new situation. This will usually work even over rmmod or reboot cycles, since the internal symlink tree will automatically persist all todo switches for you (c.f. section 3.5). 
Notice: only local transfer operations to the local disk are paused by a detach. When another node is remotely running a sync from your local disk, it will likely remain in use for remote reading. The reason is that the server part of MARS is operating purely passively, in order serve all remote requests as best as possible (similar to the original Unix philosophy). In order to really stop all accesses, do a pause-sync on all other resource member where a sync is currently running. You may also try pause-sync-global. WARNING! After this, and ather having paused any remote data access, you might use the underlying disk for your own purposes, such as test-mounting it in readonly mode. Don’t modifiy its contents a in any way! Not even by an fsck ! Otherwise, you will have inconsistencies guaranteed. MARS has no way for knowing of any modifications to your disk when bypassing /dev/mars/*. In case you accidentally modified the underlying disk at the primary side, you may choose to resolve the inconsistencies by marsadm invalide $res on each secondary. a Some (but not all) fsck tools for some filesystems have options to start only a test repair / verify mode / dry run, without doing actual modifications to the data. Of course, these modes can be used. But be really sure! Double-check for the right options! pause-sync partly Equivalent to pause-sync-local. partly Precondition: none additionally. Postcondition: any sync operation targeting the local disk (when not yet completed) is paused after a while (cf section 3.5). When successfully completed, this operation will remember the switch state forever and automatically become relevant if a sync is needed again (e.g. invalidate or resize). $res pause-sync-local $res Command / Params Cmp Description 105 6. The Sysadmin Interface (marsadm and /proc/sys/mars/) Command / Params pause-sync-global Cmp Description partly Like *-local, but operates on all members of the resource. partly Equivalent to resume-sync-local. partly Precondition: additionally, a primary must be designated, and it must not be in emergency mode. Postcondition: any sync operation targeting the local disk (when not yet completed) is resumed after a while. When completed, this operation will remember the switch state forever and become relevant if a sync is needed again (e.g. invalidate or resize). partly Like *-local, but operates on all members of the resource. partly Equivalent to pause-fetch-local. partly Precondition: none additionally. The resource should be in secondary role. Otherwise the switch has no immediate effect, but will come (possibly unexpectedly) into effect whenever secondary role is entered later for whatever reason. Postcondition: any transfer of (parts of) transaction logfiles which are present at another primary host to the local /mars/ storage are paused at their current stage. $res resume-sync $res resume-sync-local $res resume-sync-global $res pause-fetch $res pause-fetch-local $res pause-fetch-global partly This switch works independently from {pause,resume}-replay. Like *-local, but operates on all members of the resource. partly Equivalent to resume-fetch-local. partly Precondition: none additionally. The resource should be in secondary role. Otherwise the switch has no immediate effect, but will come (possibly unexpectedly) into effect whenever secondary role is entered later for whatever reason. 
Postcondition: any (parts of) transaction logfiles which are present at another primary host shouldl be transferred to the local /mars/ storage as far as not yet locally present. $res resume-fetch $res resume-fetch-local $res resume-fetch-global partly This works independently from {pause,resume}-replay. Like *-local, but operates on all members of the resource. partly Equivalent to pause-replay-local. partly Precondition: none additionally. The resource should be in secondary role. Otherwise the switch has no immediate effect, but will come (possibly unexpectedly) into effect whenever secondary role is entered later for whatever reason. Postcondition: any local replay operations of transaction logfiles to the local disk are paused at their current stage. $res pause-replay $res pause-replay-local $res pause-replay-global partly This works independently from {pause,resume}-fetch resp. {dis,}connect. Like *-local, but operates on all members of the resource. partly Equivalent to pause-replay-local. partly Precondition: must be in secondary role. Postcondition: any (parts of) locally existing transaction logfiles (whether replicated from other hosts or produced locally) are started for replay to the local disk, as far as they have not yet been applied. Like *-local, but operates on all members of the resource. $res resume-replay $res resume-replay-local $res resume-replay-global partly $res Command / Params 106 Cmp Description 6.2. Resource Operations Command / Params connect Cmp Description partly Equivalent to connect-local and to resume-fetch-local. partly Note: although this sounds similar to DRBD’s drbdadm connect, there are subtle differences. DRBD has exactly one connection per resource, which is associated with pairs of nodes. In contrast, MARS may create multiple connections per resource at runtime, and these are associated with the target host (not with pairs of hosts). As a consequence, the fetch may potentially occur from any other other source host which happens to be reachable (although the current implementation prefers the current designated primary, but this may change in future). In addition, marsadm disconnect does not stop all communication. It only stops fetching logfiles. The symlink update running in background is not stopped, in order to always propagate as much metadata as possible in the cluster. In case of a later incident, chances are higher for a better knowledge of the real state of the cluster. Equivalent to resume-fetch-local. partly Equivalent to resume-fetch-global. partly Equivalent to disconnect-local and to pause-fetch-local. partly See above note at connect. Equivalent to pause-fetch-local. partly Equivalent to pause-fetch-global. $res connect-local $res connect-global $res disconnect $res disconnect-local $res disconnect-global $res yes up yes down Equivalent to attach followed by resume-fetch followed by followed by resume-replay followed by resume-sync. $res $res Equivalent to pause-sync followed pause-replay followed by detach. by pause-fetch Hint: consider to prefer plain detach over this, because detach will remember the last state of all switches, while down will not. Command / Params Cmp Description 107 6. The Sysadmin Interface (marsadm and /proc/sys/mars/) Command / Params Cmp almost primary $res Description Precondition: sync must have finished at any resource member. All relevant transaction logfiles must be either already locally present, or be fetchable (see resume-fetch and resume-replay). 
When some logfile data is locally missing, there must be enough space on /mars/ to fetch it. Any replay must not have been interrupted by a replay error (see macro %replay-code{} or diskstate DefectiveLog). The current designated primary must be reachable over network. When there is no designated primary (i.e. marsadm secondary had been executed before, which is explicitly not recommended), all other members of the resource must be reachable (since we have no memory who was the old primary before), and then they must also match the same preconditions. When another host is currently primary (whether designated or not), it must match the preconditions of marsadm secondary (that means, its local /dev/mars/mydata device must not be in use any more). A split brain must not already exist. Postcondition: /dev/mars/$dev_name appears locally and is usable; the current host is in primary role. Switches the designated primary. There are three variants: 1) Handover when not giving --force: when another host is currently primary, it is first asked to leave its primary role, and it is waited until it actually has become secondary. After that, the local host is asked to become primary. Before actually becoming primary, all relevant logfiles are transferred over the network and replayed, in order to avoid a accidental creation of split brain as best as possible . Only after that, /dev/mars/$dev_name will appear. When network transfers of the symlink tree are very slow (or currently impossible), this command may take a very long time. In case a split brain is already detected at the initial situation, the local host will refuse to switch the designated primary without --force. In case of k > 2 replicas: if you want to handover between host A and B while a sync is currently running at host C, you have the following options: 1. 2. wait until the sync has finished (see macro sync-rest, or marsadm view in general). do a leave-resouce on host C, and later join-resource after the handover completed successfully. 2) Handover ignoring running syncs, by adding the option --ignore-sync. Any running syncs will restart from scratch, in order to ensure consistency. Use this only when the planned handover is more important than the sync time. 3) Forced switching: by giving –force while pause-fetch is active (but not pause-replay), most preconditions are ignored, and MARS does its best to actually become primary even if some logfiles are missing or incomplete or even defective. primary --force is a potentially harmful variant, because it will provoke a split brain in most cases, and therefore in turn will lead to data loss because one of your split brain versions must be discarded later in order to resolve the split brain (see section 3.4.3). Never call primary --force when primary without --force is sufficient! If primary without --force complains that the device is in use at the former primary side, take it seriously! Don’t override with b --force, but rather umount the device at the other side! Only use primary --force when something is already broken, such as a network outage, or a node crash, etc. During ordinary operations (network OK, nodes OK), you should never need primary --force! If you umount /dev/mars/mydata on the old primary A, and then wait until marsadm view (or another suitable macro) on the target host B shows that everything is UpToDate, you can prevent a split brain by yourself even when giving primary --force afterwards. However, checking / assuring this is your responsibility! 
108 primary --force switches the designated primary. In some extremely rare cases, when multiple faults have accumulated in a weird situation, it might be impossible becoming the / an actual primary. Typically you may be already in a split brain situation. This has not been observed for a long operations time on recent versions of MARS, but in general becoming primary via --force cannot be guaranteed always, although MARS does its best. In split brain situations, or if you ever encounter such a problem, you must resolve the split brain immediately after giving this command (see section 3.4.3). 6.2. Resource Operations Command / Params Cmp almost secondary $res Description Precondition: the local /dev/mars/$dev_name is no longer in use (e.g. umounted). Postcondition: There exists no designated primary any more. During split brain and when the network is OK (again), all actual primaries (including the local host) will leave primary ASAP (i.e. when their /dev/mars/mydata is no longer in use). Any secondary will start following (old) logfiles (even from backlogs) by replaying transaction logs if it is uniquely possible (which is often violated during split brain). On any secondary, /dev/mars/$dev_name will have disappeared. Notice: in difference to DRBD, you don’t need this command during normal operation, including handover. Any resource member which is not designated as primary will automatically go into secondary role. For example, if you have k = 4 replicas, only one of them can be designated as a primary. When the network is OK, all other 3 nodes will know this fact, and they will automatically go into secondary mode, following the transaction logs from the (new) primary. a Hint: avoid this command. It turns off any primary, globally . You cannot start a sync after that (e.g. invalidate or join-resource or resume-sync), because it is not unique wherefrom the data shall be fetched. In split brain situations (when the network is OK again), this may have further drawbacks. It is much better / easier to directly switch the designated primary from one node to another via the primary command. See also section 3.4.2.2. There is only one valid use case where you really need this command: before finally destroying a resouce via the last leave-resource (or the dangerous delete-resource), you will need this before you can do that. a A serious misconception among some people is when they believe that they can switch “a certain node to secondary”. It is not possible to switch individual nodes to secondary, without affecting other nodes! The concept of “designated primary” is global throughout a resource! wait-umount no See section 6.3.3. no Precondition: none additionally. Postcondition: all locally known logfiles and version links are removed, whenever they are not / no longer reachable by any split brain version. $res log-purge-all $res Rationale: remove hindering leave-resource leftovers. split-brain / Use this only when split brain does not go away by means of leave-resource (which could happen in very weird scenarios such as MARS running on virtual machines doing a restore of their snapshots, or otherwise unexpected resurrection of dead or half-dead nodes). THIS IS POTENTIALLY DANGEROUS! almost resize $res [$size] Command / Params Cmp This command might destroy some valuable logfiles / other information in case the local information is outdated or otherwise incorrect. MARS does its best for checking anything, but there is no guarantee. Hint: use --dry-run beforehand for checking! 
Precondition: The local host must be primary. All disks in the cluster participating in $res must be physically larger than the logical resource size (e.g, by use of lvm; can be checked by macros %disk-size{} and %resource-size{}). When the optional $size argument is present, it must be smaller than the minimum of all physical sizes, but larger than the current logical size of the resource. Postcondition: the logical size of /dev/mars/$dev_name will reflect the new size after a while. Description 6.2.3. Logfile Operations 109 6. The Sysadmin Interface (marsadm and /proc/sys/mars/) Command / Params Cmp no cron no log-rotate $res no log-delete $res no log-delete-one $res no log-delete-all Description Do all necessary housekeeping tasks. See log-rotate and log-delete-all for details. This should be regularly called by an external cron job or similar. Precondition: the local node $host must be primary at $res. Postcondition: after a while, a new transaction logfile /mars/resource-$res/log-$new_nr-$host will be used instead of /mars/resource-$res/log-$old_nr-$host where $new_nr = $old_nr + 1. Without --force, this will only carry out actions at the primary side since it makes no sense on secondaries. With --force, secondaries are trying to remotely trigger a log-rotate, but without any guarantee (likely even a split-brain may result instead, so use this only if you are really desperate). Precondition: the local node must be a member of $res. Postcondition: when there exists some old transaction logfiles /mars/resource-$res/log-*-$some_host which are no longer referenced by any of the symlinks /mars/resource-$res/replay-* , those logfiles are marked for deletion in the whole cluster. When no such logfiles exist, nothing will happen. Precondition: the local node must be a member of $res. Postcondition: when there exists an old transaction logfile /mars/resource-$res/log-$old_nr-$some_host where $old_nr is the minimum existing number and that logfile is no longer referenced by any of the symlinks /mars/resource-$res/replay-* , that logfile is marked for deletion in the whole cluster. When no such logfile exists, nothing will happen. Alias for log-delete. $res Command / Params Cmp Description 6.2.4. Consistency Operations Command / Params Cmp no invalidate $res no fake-sync $res Description Precondition: the local node must be in secondary role at $res. A designated primary must exist. When having k > 2 replicas, no split brain must exist (otherwise, or when invalidate does not work in case of k = 2, use the leave-resource ; join-resource method described in section 3.4.3). Postcondition: the local disk is marked as inconsistent, and a fast fullsync from the designated primary will start after a while. Notice that marsadm {pause,resume}-sync will influence whether the sync really starts. When the fullsync has finished successfully, the local node will be consistent again. After a while, state Orphan should be left. Don’t forget to regularly monitor for longer occurrences of Orphan! Precondition: the local node must be in secondary role at $res. Postcondition: when a fullsync is running, it will stop after a while, and the local node will be marked as consistent as if it were consistent again. ONLY USE THIS IF YOU REALLY KNOW WHAT YOU ARE DOING! See the WARNING in section 3.3 Use this only before creating a fresh filesystem inside /dev/mars/$res. set-replay no ONLY FOR ADVANCED HACKERS WHO KNOW WHAT THEY ARE DOING! This command is deliberately not documented. 
You need the competence level RTFS (“read the fucking sources”).

6.3. Further Operations

6.3.1. Inspection Commands

Command / Params | Cmp | Description
view-macroname $res | no | Display the output of a macro evaluation. See section 3.6 for a thorough description.
view $res | no | Equivalent to view-default.
role $res | no | Deprecated. Use view-role instead.
state $res | no | Deprecated. Use view-state instead.
cstate $res | no | Deprecated. Use view-cstate instead.
dstate $res | no | Deprecated. Use view-dstate instead.
status $res | no | Deprecated. Use view-status instead.
show-state $res | no | Deprecated. Don't use it. Use view-state instead, or other macros.
show-info $res | no | Deprecated. Don't use it. Use view-info instead, or other macros.
show $res | no | Deprecated. Don't use it. Use or implement some macros instead.
show-errors $res | no | Deprecated. Use view-the-err-msg, view-resource-err, or similar macros.
cat $file | no | Write the file content to stdout, but with all occurrences of numeric timestamps converted to a human-readable format. This is most useful for inspection of status and log files, e.g. marsadm cat /mars/5.total.log

6.3.2. Setting Parameters

6.3.2.1. Per-Resource Parameters

Command / Params | Cmp | Description
set-emergency-limit $res n | no | The argument n must be a percentage between 0 and 100 %. When the remaining storage space in /mars/ undershoots the given percentage, the resource will go into emergency mode earlier than by the global computation described in section 4.4. 0 means unlimited.
get-emergency-limit $res | no | Inquiry of the preceding value.

6.3.2.2. Global Parameters

Command / Params | Cmp | Description
set-sync-limit-value n | no | Limit the concurrency of sync operations to some maximum number. 0 means unlimited.
get-sync-limit-value | no | Inquiry of the preceding value.
set-sync-pref-list res1,res2,...,resn | no | Set the order of preferences for syncing. The argument must be a comma-separated list of resource names.
get-sync-pref-list | no | Inquiry of the preceding value.
set-connect-pref-list host1,host2,...,hostn | no | Set the order of preferences for connections when there are more than 2 hosts participating in a cluster. The argument must be a comma-separated list of node names.
get-connect-pref-list | no | Inquiry of the preceding value.

6.3.3. Waiting

Command / Params | Cmp | Description
wait-cluster | no | Precondition: the /mars/ filesystem must be mounted and it must contain a valid MARS symlink tree produced by the other marsadm commands. The kernel module must be loaded. Postcondition: none. Wait until all nodes in the cluster have sent a message, or until timeout. The default timeout is 30 s and may be (exceptionally) changed by --timeout=$seconds.

Precondition: the local node must be a member of the resource $res. Postcondition: none. Wait until the local node reaches a specified condition on $res, or until timeout. The default timeout of 60 s may be changed by --timeout=$seconds. The last argument denotes the condition. The condition is inverted if suffixed by -off. When preceded by is- (which is the most useful case), it is checked whether the condition is actually reached.
When the is- prefix is left off, the check is whether another marsadm command has been already given which tries to achieves the intended result (typicially, you may use this after the is- variant has failed). no wait-resource $res {is-,}{attach, primary, device}{-off,} almost wait-connect This is an alias for wait-cluster waiting until only those nodes are reachable which belong to $res (instead of waiting for the full cluster). $res no wait-umount Precondition: none additionally. Postcondition: the local /dev/mars/$dev_name is no longer in use (e.g. umounted). $res Command / Params Cmp Description 6.3.4. Low-Level Expert Commands These commands are for experts and advanced sysadmins only. The interface is not stable, i.e. the meaning may change at any time. Use at your own risk! Command / Params Cmp Description set-link no RTFS. get-link no RTFS. delete-file no RTFS. Command / Params Cmp Description The following commands are for manual setup / repair of cluster membership. Only to be used by experts who know what they are doing! In general, cluster-wide operations on IP addresses may need to be repeated at all hosts in the cluster iff the communication is not (yet) possible and/or not (yet) actually working (e.g. firewalling problems etc). Command / Params Cmp lowlevel-ls-host-ips no Description List all configured cluster members together with their cur- lowlevel-set-host-ip no rently configured IP addresses, as known locally. Change the assignment of IP addresses locally. May be $hostname used when hosts are moved to different network locations, or $ip when different network interfaces are to be used for replication (e.g. dedicated replication IPs). Notice that the names of hosts must not change at all, only their IP addresses may be changed. Check active connections with netstat & friends. Updates may need some time to proceed (socket timeouts etc). no lowlevel-delete-host Remove a host from the cluster membership locally, together with its IP address assignment. This does not re- $hostname move any further information. In particular, resource memberships are untouched. Command / Params 112 Cmp Description 6.4. The /proc/sys/mars/ and other Expert Tweaks 6.3.5. Senseless Commands (from DRBD) Command / Params Cmp syncer no new-current-uuid no create-md no dump-md no dump no get-gi no show-gi no outdate no adjust yes hidden-commands no Command / Params Cmp Description Implemented as NOP (not necessary with MARS). Description 6.3.6. Forbidden Commands (from DRBD) These commands are not implemented because they would be dangerous in MARS context: Command / Params Cmp Description invalidate-remote no This would be too dangerous in case you have multiple secondaries. A verify no similar effect can be achieved with the --host= option. This would cause unintended side effects due to races between logfile transfer / application and block-wise comparison of the underlying disks. However, marsadm join-resource or invalidate will do the same as DRBD verify followed by DRBD resync, i.e. this will automatically correct any found errors;. Note that the fast-fullsync algorithm Command / Params Cmp of MARS will minimize network traffic. Description 6.4. The /proc/sys/mars/ and other Expert Tweaks In general, you shouldn’t need to deal with any tweaks in /proc/sys/mars/ because everything should already default to reasonable predefined values. This interface allows access to some internal kernel variables of the mars.ko kernel module at runtime. Thus it is not a stable interface. 
It is not only specific to MARS, but may also change between releases without notice. This section describes only those tweaks intended for sysadmins, not those for developers / very deep internals.

6.4.1. Syslogging

All internal messages produced by the kernel module belong to one of the following classes:
0 debug messages
1 info messages
2 warnings
3 error messages
4 fatal error messages
5 any message (summary of 0 to 4)

6.4.1.1. Logging to Files

These classes are used to produce status files $class.*.status in the /mars/ and/or in the /mars/resource-mydata/ directory / directories. When you create a file $class.*.log in parallel to any $class.*.status, the *.log file will be appended forever with the same messages as in *.status. The difference is that *.status is regenerated anew from an empty starting point, while *.log can (potentially) grow indefinitely unless you remove it, or rename it to something else. Beware: any permanently present *.log file can easily fill up your /mars/ partition until the problems described in section 4.4 appear. Use *.log only for a limited time, and only for debugging!

6.4.1.2. Logging to Syslog

The classes also play a role in the following /proc/sys/mars/ tweaks:
syslog_min_class (rw) The minimum class number for permanent syslogging. By default, this is set to -1 in order to switch off permanent logging completely. Permanent logging can easily flood your syslog with such huge amounts of messages (in particular when class=0) that your system as a whole may become unusable (because vital kernel threads may be blocked too long or too often by the userspace syslog daemon). Instead, please use the flood-protected syslogging described below!
syslog_max_class (rw) The maximum class number for permanent syslogging. Please use the flood-protected version instead.
syslog_flood_class (rw) The minimum class of flood-protected syslogging. The maximum class is always 4.
syslog_flood_limit (rw) The maximum number of messages after which the flood protection will start. This is a hard limit for the number of messages written to the syslog.
syslog_flood_recovery_s (rw) The number of seconds after which the internal flood counter is reset (after the flood protection state has been reached). When no new messages appear after this time, the flood protection will start over at count 0.
The rationale behind flood-protected syslogging: sysadmins are usually only interested in the point in time where some problems / incidents / etc. have started. They are usually not interested in capturing each and every single error message (in particular when they are flooding the system logs). If you really need complete error information, use the *.log files described above, compress them, and save them somewhere else regularly by a cron job. This bears much less overhead than filtering via the syslog daemon, or even remote syslogging in real time, which will almost surely screw up your system in case of network problems coinciding with flood messages, such as caused in turn by those problems. Don't rely on real-time concepts, just do it the old-fashioned batch job way.

6.4.1.3. Tuning Verbosity of Logging

show_debug_messages Boolean switch, 0 or 1. Mostly useful only for developers. This can easily flood your logs if you are not careful.
show_log_messages Boolean switch, 0 or 1.
show_connections Boolean switch, 0 or 1. Show detailed internal statistics on sockets.
show_statistics_local / show_statistics_global Only useful for kernel developers. Shows some internal information on internal brick instances, memory usage, etc.

6.4.2. Tuning the Sync

sync_flip_interval_sec (rw) The sync process must not run in parallel to logfile replay, in order to easily guarantee consistency of your disk. If logfile replay were paused for the full duration of very large or long-lasting syncs (which could take some days over very slow networks), your /mars/ filesystem could overflow because no replay would be possible in the meantime. Therefore, MARS regularly flips between actually syncing and actually replaying, if both are enabled. You can set the time interval for flipping here.
sync_limit (rw) When > 0, this limits the maximum number of sync processes actually running in parallel. This is useful if you have a large number of resources, and you don't want to overload the network with sync processes.
sync_nr (ro) Passive indicator for the number of sync processes currently running.
sync_want (ro) Passive indicator for the number of sync processes which demand to run.

7. Tips and Tricks

7.1. Avoiding Inappropriate Clustermanager Types for Medium and Long-Distance Replication

This section addresses some widespread misconceptions. Its main target audience is developers, but sysadmins will profit from detailed explanations of problems and pitfalls. When the problems described in this section are solved at some point in the future, this section will be shortened and some relevant parts moved to the appendix.
Doing High Availability (HA) wrong at the concept level may easily get you into trouble, and may cost you several millions of € or $ in larger installations, or even knock you out of business when disasters are badly dealt with at higher levels such as clustermanagers.

7.1.1. General Cluster Models

The most commonly known cluster model is called shared-disk, and is typically controlled by clustermanagers like PaceMaker:
[Diagram: a clustermanager (e.g. PaceMaker) controls App Cluster Side A (currently active) and App Cluster Side B (currently passive); both sides access a single Shared Disk, e.g. via iSCSI.]
The most important property of shared-disk is that there exists only a single disk instance. Nowadays, this disk often has some internal redundancy such as RAID. At the system architecture layer / network level, there exists no redundant disk at all. Only the application cluster is built redundantly.
It should be immediately clear that shared-disk clusters are only suitable for short-distance operations in the same datacenter. Although running one of the data access lines over short distances between very nearby datacenters (e.g. 1 km) would be theoretically possible, there would be no sufficient protection against failure of a whole datacenter.
Both DRBD and MARS belong to a different architectural model called shared-nothing:
[Diagram: App Cluster Side A (currently active) and App Cluster Side B (currently passive), each accessing its own disk (Disk A resp. Disk B), e.g. via iSCSI; the disks are coupled at network level, e.g. by DRBD or MARS.]
The characteristic feature of a shared-nothing model is (additional) redundancy at the network level.
Shared-nothing “clusters1” could theoretically be built for any distances, from short to medium to long distances. However, concrete technologies of disk coupling such as synchronous operation may pose practical limits on the distances (see chapter 2). In general, clustermanagers must fit to the model.
Some clustermanager can be configured to fit to multiple models. If so, this must be done properly, or you may get into serious trouble. Some people don’t know, or they don’t believe, that different architectural models like shareddisk or shared-nothing will require an appropriate type of clustermanager and/or a different configuration. Failing to do so, by selection of an inappropriate clustermanager type and/or an inappropriate configuration may be hazardous. Selection of the right model alone is not sufficient. Some, if not many, clustermanagers have not been designed for long distances. As explained in section 7.1.5, long distances have further hard requirements. Disregarding them may be also hazardous! 7.1.2. Handover / Failover Reasons and Scenarios From a sysadmin perspective, there exist a number of different reasons why the application workload must be switched from the currently active side A to the currently passive side B: 1. Some defect has occurred at cluster side A or at some corresponding part of the network. 2. Some maintenance has to be done at side A which would cause a longer downtime (e.g. security kernel update or replacement of core network equipment or maintainance of UPS or of the BBU cache etc - hardware isn’t 24/7/365 in practice, although some vendors claim it - it is either not really true, or it becomes extremely expensive). Both reasons are valid and must be automatically handled in larger installations. In order to deal with all of these reasons, the following basic mechanisms can be used in either model: 1. Failover (triggered either manually or automatically) 2. Handover (triggered manually2 ) It is important to not confuse handover with failover at concept level. Not only the reasons / preconditions are very different, but also the requirements. Example: precondition for handover is that both cluster sides are healthy, while precondition for failover is that some relevant(!) failure has been detected somewhere (whether this is really true is another matter). Typically, failover must be able to run in masses, while planned handover often has lower scaling requirements. Not all existing clustermanagers are dealing with all of these cases (or their variants) equally well, and some are not even dealing with some of these cases / variants at all. Some clustermanagers cannot easily express the concept of “automatic triggering” versus “manual triggering” of an action. There exists simply no cluster-global switch which selects either “manual mode” or “automatic mode” (except when you start to hack the code and/or write new plugins; then you might notice that there is almost no architectural layering / sufficient separation between mechanism and strategy). Being forced to permanently use an automatic mode for several hundreds or even thousands of clusters is not only boring, but bears a considerable risk when automatics do a wrong decision at hundreds of instances in parallel. 7.1.3. Granularity and Layering Hierarchy for Long Distances Many existing clustermanager solutions are dealing with a single cluster instance, as the term “cluster manager” suggests. However, when running several hundreds or thousands of cluster instances, you likely will not want to manage each of them individually. In addition, failover 1 Notice that the term “cluster computing” usually refers to short-distance only. Long-distance coupling should be called “grid computing” in preference. 
As known from the scientific literature, grid computing requires different concepts and methods in general. Only for the sake of simplicity, we use “cluster” and “grid” interchangeably. 2 Automatic triggering could be feasible for prophylactic treatments. 117 7. Tips and Tricks should not only be triggered (not to be confused with executed ) individually at cluster level, but likely also at a higher granularity such as a room, or a whole datacenter. Otherwise, some chaos is likely to happen. Here is what you probably will need, possibly in difference to what you may find on the market (whether OpenSource or not). For simplicity, the following diagram shows only two levels of granularity, but can be easily extended to multiple layers of granularity, or to some concept of various subsets of clusters: Automatics Layer: Failover of {Datacenters...Clusters} Mechanics Layer: Handover+Failover of whole Datacenter Mechanics Layer: Handover+Failover of single Cluster (one globally distributed instance) (about a dozen of instances) (several hundreds / thousands of instances) Notice that many existing clustermanager solutions are not addressing the datacenter granularity at all. Typically, they use concepts like quorums for determining failures at cluster level solely, and then immediately executing failover of the cluster, sometimes without clean architectural distinction between trigger and execution (similar to the “separation of concerns” between mechanism and strategy in Operating Systems). Sometimes there is even no internal software layering / modularization according to this separation of concerns at all. When there is no distinction between different levels of granularity, you are hopelessly bound to a non-extensible and thus non-adaptable system when you need to operate masses of clusters. A lacking distinction between automatic mode and manual mode, and/or lack of corresponding architectural software layers is not only a blatant ignoration of well-established best practices of software engineering, but will bind you even more firmly to an inflexible system. Terminology: for practical reasons, we use the general term “clustermanager” also for speaking about layers dealing with higher granularity, such as datacenter layers, and also for long-distance replication scenarios, although some terminology from grid computing would be more appropriate in a scientific background. Please consider the following: when it comes to long-distance HA, the above layering architecture is also motivated by vastly different numbers of instances for each layer. Ideally, the topmost automatics layer should be able to overview several datacenters in parallel, in order to cope with (almost) global network problems such as network partitions. Additionally, it should also detect single cluster failures, or intermediate problems like “rack failure” or “room failure”, as well as various types of (partial / intermediate) (replication) network failures. Incompatible decisions at each of the different granularities would be a no-go in practice. Somewhere and somehow, you need one single3 top-most logical problem detection / ranking instance, which should be internally distributed of course, typically using some distributed consensus protocol; but in difference to many published distributed consensus algorithms it should be able to work with multiple granularities at the same time. 7.1.4. Methods and their Appropriateness 7.1.4.1. Failover Methods Failover methods are only needed in case of an incident. 
They should not be used for regular handover. 3 If you have logical pairs of datacenters which are firmly bound together, you could also have several topmost automatics instances, e.g. for each pair of datacenters. However, that would be very inflexible, because then you cannot easily mix locations or migrate your servers between datacenters. Using k > 2 replicas with MARS would also become a nightmare. In your own interest, please don’t create any concepts where masses of hardware are firmly bound to fixed constants at some software layers. 118 7.1. Avoiding Inappropriate Clustermanager Types for Medium and Long-Distance Replication STONITH-like Methods STONITH = Shoot The Other Node In The Head These methods are widely known, although they have several serious drawbacks. Some people even believe that any clustermanager must always have some STONITH-like functionality. This is wrong. There exist alternatives, as shown in the next paragraph. The most obvious drawback is that STONITH will always create a damage, by definition. Example: a typical contemporary STONITH implementation uses IPMI for automatically powering off your servers, or at least pushes the (virtual) reset button. This will always create a certain type of damage: the affected systems will definitely not be available, at least for some time until they have (manually) rebooted. This is a conceptual contradiction: the reason for starting failover is that you want to restore availability as soon as possible, but in order to do so you will first destroy the availability of a particular component. This may be counter-productive. Example: when your hot standby node B does not work as expected, or if it works even worse than A before, you will loose some time until you can become operational again at the old side A. Here is an example method for handling a failure scenario. The old active side A is assumed to be no longer healthy anymore. The method uses a sequential state transition chain with a STONITH-like step: Phase1 Check whether the hot standby B is currently usable. If this is violated (which may happen during certain types of disasters), abort the failover for any affected resources. Phase2 Try to shutdown the damaged side A (in the hope that there is no serious damage). Phase3 In case phase2 did not work during a grace period / after a timeout, assume that A is badly damaged and therefore STONITH it. Phase4 Start the application at the hot standby B. Notice: any cleanup actions, such as repair of defective hard- or software etc, are outside the scope of failover processes. Typically, they are executed much later when restoring redundancy. Also notice: this method is a heavily distributed one, in the sense that sequential actions are alternated multiple times on different hosts. This is known to be cumbersome in distributed systems, in particular in presence of network problems. Phase4 in more detail for DRBD, augmented with some pseudo code for application control: 1. at side B: drbdadm disconnect all 2. at side B: drbdadm primary --force all 3. at side B: applicationmanager start all The same phase4 using MARS: 1. at side B: marsadm pause-fetch all 2. at side B: marsadm primary --force all 3. at side B: applicationmanager start all This sequential 4-phase method is far from optimal, for the following reasons: • The method tries to handle both failover and handover scenarios with one single sequential receipe. 
In case of a true failover scenario where it is already known for sure that side A is badly damaged, this method will unnecessarily waste time for phase 2. This could be fixed by introduction of a conceptual distinction between handover and failover, but it would not fix the following problems. • Before phase4 is started (which will re-establish the service from a user’s perspective), a lot of time is wasted by both phases 2 and 3. Even if phase 2 would be skipped, phase 3 would unnecessarily cost some time. In the next paragraph, an alternative method is explained which eliminates any unnecessary waiting time at all. • The above method is adapted to the shared-disk model. It does not take advantage of the shared-nothing model, where further possibilities for better solutions exist. 119 7. Tips and Tricks • In case of long-distance network partitions and/or sysadmin / system management subnetwork outages, you may not even be able to (remotely) start STONITH at at. Thus the above method misses an important failure scenario. Some people seem to have a binary view at the healthiness of a system: in their view, a system is either operational, or it is damaged. This kind of view is ignoring the fact that some systems may be half-alive, showing only minor problems, or occurring only from time to time. It is obvious that damaging a healthy system is a bad idea by itself. Even generally damaging a half-alive system in order to “fix” problems is not generally a good idea, because it may increase the damage when you don’t know the real reason4 . Even worse: in a distributed system5 you sometimes cannot(!) know whether a system is healthy, or to what degree it is healthy. Typical STONITH methods as used in some contemporary clustermanagers are assuming a worst case, even if that worst case is currently not for real. Therefore, avoid the following fundamental flaws in failover concepts and healthiness models, which apply to implementors / configurators of clustermanagers: • Don’t mix up knowledge with conclusions about a (sub)system, and also don’t mix this up with the real state of that (sub)system. In reality, you don’t have any knowledge about a complex distributed system. You only may have some knowledge about some parts of the system, but you cannot “see” a complex distributed system as a whole. What you think is your knowledge, isn’t knowledge in reality: in many cases, it is conclusion, not knowledge. Don’t mix this up! • Some systems are more complex than your model of it. Don’t neglect important parts (such as networks, routers, switches, cables, plugs) which may lead you to wrong conclusions! • Don’t restrict your mind to boolean models of healthyness. Doing so can easily create unnecessary damage by construction, and even at concept level. You should know from software engineering that defects in concepts or models are much more serious than simple bugs in implementations. Choosing the wrong model cannot be fixed as easily as a typical bug or a typo. • Try to deduce the state of a system as reliably as possible. If you don’t know something for sure, don’t generally assume that it has gone wrong. Don’t confuse missing knowledge with the conclusion that something is bad. Boolean algebra restricts your mind to either “good” or “bad”. Use at least tri-state algebra which has a means for expressing “unknown”. Even better: attach a probability to anything you (believe to) know. Errare humanum est: nothing is absolutely sure. 
• Oversimplification: don’t report an “unknown” or even a “broken” state for a complex system whenever a smaller subsystem exists for which you have some knowledge (or you can conclude something about it with reasonable evidence). Otherwise, your users / sysadmins may draw wrong conclusions, and assume that the whole system is broken, while in reality only some minor part has some minor problem. Users could then likely make wrong decisions, which may then easily lead to bigger damages. • Murphy’s law: never assume that something can’t go wrong! Doing so is a blatant misconception at topmost level: the purpose of a clustermanager is creating High Availablity (HA) out of more or less “unreliable” components. It is the damn duty of both a clustermanager and its configurator to try to compensate any failures, regardless of their probability 6 , as best as possible. 4 Example, occurring in masses: an incorrectly installed bootloader, or a wrong BIOS boot priority order which unexpectedly lead to hangs or infinite reboot cycles once the DHCP or BOOTP servers are not longer available / reachable. 5 Notice: the STONITH concept is more or less associated with short-distance scenarios where crossover cables or similare equipment are used. The assumption is that crossover cables can’t go defective, or at least it would be an extremely unlikely scenario. For long-distance replication, this assumption is simply not true. 6 Never claim that something has only low probability (and therefore it were not relevant). In the HA area, you simply cannot know that, because you typically have sporadic incidents. In extreme cases, the purpose of your HA solution is protection against 1 failure per 10 years. You simply don’t have the time to wait for creating an incident statistics about that! 120 7.1. Avoiding Inappropriate Clustermanager Types for Medium and Long-Distance Replication • Never confuse probability with expectancy value! If you don’t know the mathematical term “expectancy value”, or if you don’t know what this means in practice, don’t take responsibility for millions of € or $. • When operating masses of hard- and software: never assume that a particular failure can occur only at a low number of instances. There are unknown(!) systematic errors which may pop up at the wrong time and in huge masses when you don’t expect them. • Multiple layers of fallback: any action can fail. Be prepared to have a plan B, and even a plan C, and even better a plan D, wherever possible. • Never increase any damage anywhere, unnecessarily! Always try to miminize any damage! It can be mathematically proven that in deterministic probabilistic systems having finite state, increases of a damage level at the wrong place will introduce an additional risk of getting into an endless loop. This is also true for nondeterministic systems, as known from formal language theory7 . • Use the best effort principle. You should be aware of the following fact: in general, it is impossible to create an absolutely reliable system out of unreliable components. You can lower the risk of failures to any > 0 by investing a lot of resources and of money, but whatever you do: = 0 is impossible. Therefore, be careful with boolean algebra. Prefer approximation methods / optimizing methods instead. Always do your best, instead of trying to reach a global optimum which likely does not exist at all (because the can only converge to an optimum, but will never actually reach it). 
The best effort principle means the following: if you discover a method for improving your operating state by reduction of a (potential) damage in a reasonable time and with reasonable effort, then simply do it. Don’t argue that a particular step is no 100% solution for all of your problems. Any improvement is valuable. Don’t miss any valuable step having reasonable costs with respect to your budget. Missing valuable measures which have low costs are certainly a violation of the best effort principle, because you are not doing your best. Keep that in mind. If you have understood this (e.g. deeply think at least one day about it), you will no longer advocate STONITH methods in general, when there are alternatives. STONITH methods are only valuable when you know in advance that the final outcome (after reboot) will most likely be better, and that waiting for reboot will most likely pay off. In general, this condition is not true if you have a healthy hot standby system. This should be easy to see. But there exist well-known clustermanager solutions / configurations blatantly ignoring8 this. Only when the former standby system does not work as expected (this means that all of your redundant systems are not healthy enough for your application), only then 9 STONITH is unevitable as a last resort option. In short: blindly using STONITH without true need during failover is a violation of the best effort principle. You are simply not doing your best. • When your budget is limited, carefully select those improvements which make your system as reliable as possible, given your fixed budget. 7 Finite automatons are known to be transformable to deterministic ones, usually by an exponential increase in the number of states. 8 For some special(!) cases of the shared-disk model, there exist some justifications for doing STONITH before starting the application at the hot standby. Under certain circumstances, it can happen that system A running amok could destroy the data on your single shared disk (example: a filesystem doubly mounted in parallel, which will certainly destroy your data, except you are using ocfs2 or suchalike). This argument is only valid for passive disks which are directly attached to both systems A and B, such that there is no external means for fencing the disk. In case of iSCSI running over ordinary network equipment such as routers or switches, the argument “fencing the disk is otherwise not possible” does not apply. You can interrupt iSCSI connection at the network gear, or you can often do it at cluster A or at the iSCSI target. Even commercial storage appliances speaking iSCSI can be remotely controlled for forcefully aborting iSCSI sessions. In modern times, the STONITH method has no longer such a justification. The justification stems from ancient times when a disk was a purely passive mechanical device, and its disk controller was part of the server system. 9 Notice that STONITH may be needed for (manual or partially automatic) repair in some cases, e.g. when you know that a system has a kernel crash. Don’t mix up the repair phase with failover or handover phases. Typically, they are executed at different times. The repair phase is outside the scope of this section. 121 7. Tips and Tricks • Create statistics on the duration of your actions. Based on this, try to get a balanced optimum between time and costs. • Whatever actions you can start in parallel for saving time, do it. Otherwise you are disregarding the best effort principle, and your solution will be sub-optimal. 
You will require deep knowledge of parallel systems, as well as experience with dealing with problems like (distributed) races. Notice that any distributed system is inherently parallel. Don’t believe that sequential methods can deliver an optimum solution in such a difficult area. • If you don’t have the necessary skills for (a) recognizing already existing parallelism, (b) dealing with parallelism at concept level, (c) programming and/or configuring parallelism race-free and deadlock-free (or if you even don’t know what a race condition is and where it may occur in practice), then don’t take responsibility for millions of € or $. • Avoid hard timeouts wherever possible. Use adaptive timeouts instead. Reason: depending on hardware or workload, the same action A may take a very short time on cluster 1, but take a very long time on cluster 2. If you need to guard action A from hanging (which is almost always the case because of Murphy’s law), don’t configure any fixed timeout for it. When having several hundreds of clusters, you would need to use the worst case value, which is the longest time occurring somewhere at the very slow clusters / slow parts of the network. This wastes a lot of time in case one of the fast clusters is hanging. Adaptive timeouts work differently: they use a kind of “progress bar” to monitor the progress of an action. They will abort only if there is no progress for a certain amount of time. Hint: among others, marsadm view-*-rest commands or macros are your friend. ITON = Ignore The Other Node This means fencing from application traffic, and can be used as an alternative to STONITH when done properly. General Fencing Methods Fencing from Storage Fencing from Application Traffic ... STONITH Fencing from application traffic is best suited for the shared-nothing model, but can also be adapted to the shared-disk model with some quirks. The idea is simple: always route your application network traffic to the current (logically) active side, whether it is currently A or B. Just don’t route any application requests to the current (logically) passive side at all. For failover (and only for that), you should not care about any split brain occurring at the low-level generic block device: common part of history A B Although having a split brain at the generic low-level block device, you now define the “logically active” and “logically passive” side by yourself by logically ignoring the “wrong” side as defined by yourself: common part of history A B This is possible because the generic block devices provided by DRBD or MARS are completely agnostic of the “meaning” of either version A or B. Higher levels such as clustermanagers (or humans like sysadmins) can assign them a meaning like “relevant” or “not relevant”, or “logically active” or “logically passive”. 122 7.1. Avoiding Inappropriate Clustermanager Types for Medium and Long-Distance Replication As a result of fencing from application traffic, the “logically passive” side will logically cease any actions such as updating user data, even if it is “physically active” during split-brain (when two primaries exist in DRBD or MARS sense10 ). If you already have some load balancing, or BGP, or another mechanism for dynamic routing, you already have an important part for the ITON method. 
Additionally, ensure by an appropriate strategy that your balancer status / BGP announcement etc does always coincide with the “logically active” side (recall that even during split-brain you must define “logically active” uniquely11 by yourself). Example: Phase1 Check whether the hot standby B is currently usable. If this is violated (which may happen during certain types of disasters), abort the failover for any affected resources. Phase2 Do the following in parallel 12 : • Start all affected applications at the hot standby B. This can be done with the same DRBD or MARS procedure as described on page 119. • Fence A by fixedly routing all affected application traffic to B. That’s all which has to be done for a shared-nothing model. Of course, this will likely produce a split-brain (even when using DRBD in place of MARS), but that will not matter from a user’s perspective, because the users will no longer “see” the “logically passive” side A through their network. Only during the relatively small time period where application traffic was going to the old side A while not replicated to B due to the incident, a very small number of updates could have gone lost. In fields like webhosting, this is taken into account. Users will usually not complain when some (smaller amount of) data is lost due to split-brain. They will complain when the service is unavailable. This method is the fastest for restoring availability, because it doesn’t try to execute any (remote) action at side A. Only from a sysadmin’s perspective, there remain some cleanup tasks to be done during the following repair phase, such as split-brain resolution, which are outside the scope of this treatment. By running the application fencing step sequentially (including wait for its partial successfulness such that the old side A can no longer be reached by any users) in front of the failover step, you may minimize the amount of lost data, but at the cost of total duration. Your service will take longer to be available again, while the amount of lost data is typically somewhat smaller. A few people might clamour when some data is lost. In long-distance replication scenarios with high update traffic, there is simply no way at all for guaranteeing that no data can be lost ever. According to the laws of Einstein and the laws of Distributed Systems like the famous CAP theorem, this isn’t the fault of DRBD+proxy or MARS, but simply the consequence of having long distances. If you want to protect against data loss as best as possible, then don’t use k = 2 replicas. Use k ≥ 4, and spread them over different distances, such as mixed small + medium + long distances. Future versions of MARS will support adaptive pseudo-synchronous modes, which will allow individual adaptation to network latencies / distances. 10 Hint: some clustermanagers and/or some people seem to define the term “split-brain” differently from DRBD or MARS. In the context of generic block devices, split brain means that the history of both versions has been split to a Y-like fork (for whatever reason), such that re-joining them incrementally by ordinary write operations is no longer guaranteed to be possible. As a slightly simplified definition, you might alternatively use the definition “two incompatible primaries are existing in parallel”, which means almost the same in practice. Details of formal semantics are not the scope of this treatment. 
11 A possible strategy is to use a Lamport clock for route changes: the change with the most recent Lamport timestamp will always win over previous changes. 12 For database applications where no transactions should get lost, you should slightly modify the order of operations: first fence the old side A, then start the application at standby side B. However, be warned that even this cannot guarantee that no transaction is lost. When the network between A and B is interrupted before the incident happens, DRBD will automatically disconnect, and MARS will show a lagbehind. In order to fully eliminate this possibility, you can either use DRBD and configure it to hang forever during network outages (such that users will be unable to commit any transactions at all), or you can use the shared-disk model instead. But in the latter case, you are introducing a SPOF at the single shared disk. The former case is logically almost equivalent to shared-disk, but avoiding some parts of the physical SPOF. In a truly distributed system, the famous CAP theorem is limiting your possibilities. Therefore, no general solution exists fulfilling all requirements at the same time. 123 7. Tips and Tricks The ITON method can be adapted to shared-disk by additionally fencing the common disk from the (presumably) failed cluster node A. 7.1.4.2. Handover Methods Planned handover is conceptually simpler, because both sides must be (almost) healthy as a precondition. There are simply no pre-existing failures to deal with. Here is an example using DRBD, some application commands denoted as pseudo code: 1. at side A: applicationmanager stop all 2. at side A: drbdadm secondary all 3. at side B: drbdadm primary all 4. at side B: applicationmanager start all MARS already has a conceptual distinction between handover and failover. With MARS, it becomes even simpler, because a generic handover procedure is already built in: 1. at side A: applicationmanager stop all 2. at side B: marsadm primary all 3. at side B: applicationmanager start all 7.1.4.3. Hybrid Methods In general, a planned handover may fail at any stage. Notice that such a failure is also a failure, but (partially) caused by the planned handover. You have the following alternatives for automatically dealing with such cases: 1. In case of a failure, switch back to the old side A. 2. Instead, forcefully switch to the new side A, similar to the methods described in section 7.1.4.1. Similar options exist for a failed failover (at least in theory), but chances are lower for actually recovering if you have only k = 2 replicas in total. Whatever you decide to do in what case in whatever priority order, whether you decide it in advance or during the course of a failing action: it simply means that according to the best effort principle, you should never leave your system in a broken state when there exists a chance to recover availability with any method. Therefore, you should implement neither handover nor failover in their pure forms. Always implement hybrid forms following the best effort principle. 7.1.5. Special Requirements for Long Distances Most contemporary clustermanagers have been constructed for short distance shared-nothing clusters, or even for local shared-nothing clusters (c.f. DRBD over crossover cables), or even for shared-disk clusters (originally, when their concepts were developed). Blindly using them for long-distance replication without modification / adaptation bears some additional risks. 
• Notice that long-distance replication always requires a shared-nothing model. • As a consequence, split brain can appear regularly during failover. There is no way for preventing it! This is an inherent property of distributed systems, not limited to MARS (e.g. also ocurring with DRBD if you try to use it over long distances). Therefore, you must deal with occurences of split-brain as a requirement. • The probability of network partitions is much higher: although you should have been required by Murphy’s law to deal with network partitions already in short-distance scenarios, it now becomes mandatory. • Be prepared that in case of certain types of (more or less global) internet partitions, you may not be able to trigger STONITH actions at all. Therefore, fencing of application traffic is mandatory. 124 7.2. systemd Templates 7.2. systemd Templates Starting with mars0.1stable57 (resp. mars0.1abeta9), you may use systemd as a cluster manager at the Mechanics Layer as explained in section 7.1.3 on page 117. MARS will replicate some systemd-relevant state information across the (big) cluster, so there is some limited remote operation support. In particular, automated handover via marsadm primary $resource is supported. More features will be added to future releases. 7.2.1. Why systemd? All major Linux distributions are now systemd based. It is the new quasi standard. Although there have been some discussions in the community about its merits and shortcomings, it appears to be accepted now in large parts of the Linux world. Systemd has a few advantages: 1. It is running as init process under the reserved pid=1. If it would ever die, then your system would die. There is no need for adding a new MARS clustermanager daemon or similar, which could fail independently from other parts of the system. 2. Although systemd has been criticised as being “monolithic” (referring to its internal software architecture), its usage by sysadmins is easily decomposable into many plugins called units. 3. Local LXC containers, local VMs, iSCSI exports, nfs exports and many other parts of the system are often already controlled by systemd. Together with udev and other parts, it already controls devices, LVM, mountpoints, etc. Since MARS is only a particular component in a bigger complicated stack, it is an advantage to use the same (more or less standardized and well-integrated) tools for managing the whole stack. Systemd has also a few disadvantages: 1. It is not accepted everywhere. Therefore the systemd template extensions of marsadm are not mandatory for MARS operations. You can implement your own alternatives when necessary. 2. It can be messy to deal with. In particular, it can sometimes believe that the system were in a particular state, although in reality it isn’t. Compensation is hairy. 3. Usablility / reporting: it is less usable for getting an overview over a bigger local system, and is practically unusable (out-of-the-box) for managing a bigger cluster at cluster level. Monitoring needs to be done separately. 7.2.2. Working Principle of the systemd Template Engine Systemd already has some very basic templating capabilities. It is possible to create unit names containing the @ symbol, which can then be expanded under certain circumstances, e.g. to tty names etc. However, automatic expansion is only done when somebody knows the instance name already in advance. The author has not found any way for creating instance names out of “thin air”, such as from dynamically created MARS resource names. 
Essentially, an inference machine for systemd templates does not yet exist. This missing functionality is supplied by the following macro processing capabilities of marsadm:
Some ordinary or templated systemd unit files (see man systemd.unit) can be installed into one of the following directories: ./systemd-templates, $HOME/.marsadm/systemd-templates/, /etc/marsadm/systemd-templates/, /usr/lib/marsadm/systemd-templates/, /usr/local/lib/marsadm/systemd-templates/. Further places can be defined by overriding the $MARS_PATH environment variable. From these directories, ordinary systemd unit files will just be copied into /run/systemd/system/ (configurable via $SYSTEMD_TARGET_DIR) and then picked up by systemd as ordinary unit files.
Template unit files are nothing but unit files containing @{varname} parts or other macro definitions in their filename, and possibly also in their bodies, at arbitrary places. These @{...} parts are substituted by a marsadm macro processing engine. The following macro capabilities are currently defined:
@{varname} Expands to the value of the variable. Predefined are the following variables:
  @{res} The MARS resource name.
  @{resdir} The MARS resource directory /mars/resource-$res/.
  @{host} The local host name as determined by marsadm, or as overridden by the --host= parameter.
  @{cmd} The marsadm command as given on the command line (only reasonable for debugging or for error messages).
  @{varname} Further variables as defined by the macro processor, see section 5.2.3 on page 97, and as definable by %let{varname}{...} statements, see also sections 5.1.1 on page 81 and 5.1.2 on page 86.
@eval{text} Calls the MARS macro processor as explained in chapter 5 on page 81, and substitutes its output.
@esc{text} Calls the systemd-escape tool for conversion of pathnames following the systemd naming conventions (see man systemd-escape). For example, a dash is converted to \x2d. Omitting this can lead to problems when your resource names contain special characters like dashes or other special symbols (in the sense of systemd). Bugs of this kind are hard to find and to debug. Either forbid special characters in your installation, or don't forget to test everything with some crude resource names!
Example snippet from a .path unit. Please notice where escaping is needed and where it must not be used (also notice that a dash is sometimes a legal part of the .mount unit name, except for the resource name part):

  [Path]
  PathExists=/dev/mars/@{res}
  Unit=vol-@escvar{res}.mount

Another source of crude bugs is the backslash character in the systemd-escape substitution, such as from \x2d. When passed to a shell, such as in certain ExecStart= statements, the backslash will be removed. Therefore, don't forget to either replace any single backslash with two backslashes, or to put the whole pathname in single quotes, or similar. Always check the result of your substitutions! It depends on the target (such as bash, as opposed to systemd) whether further escaping of the escapes is needed, or whether it must not be applied. Become a master of the escaping hell by inserting debug code into your scripts (reporting to /dev/stderr or to log files) and doing thorough testing like a devil.
@escvar{varname} Equivalent to @esc{@{varname}}.
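For illustration, a minimal templated mount unit could look like the following sketch. This is not one of the units shipped in the MARS repo; the mountpoint, filesystem type, and dependencies are placeholders you would adapt, and resource names containing dashes or other special characters would additionally need the @esc / @escvar escaping discussed above, because systemd derives the expected mountpoint from the escaped unit file name:

  # file: vol-@{res}.mount (template; instantiated once per resource by marsadm)
  [Unit]
  Description=Filesystem for MARS resource @{res}
  # only try to mount after the MARS kernel module has been started
  After=mars.service

  [Mount]
  # the replicated device of this resource (only present on the current primary)
  What=/dev/mars/@{res}
  # must correspond to the (escaped) unit file name
  Where=/vol/@{res}
  Type=ext4
  Options=defaults

After installing such a template into one of the directories listed above, a run of marsadm systemd-trigger (see below) would instantiate it for each resource.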
When creating a new resource via marsadm create-resource, or when adding a new replica via marsadm join-resource or similar, the template system will automatically create new instances for the new resource or its replicas. Conversely, marsadm leave-resource and its friends like delete-resource etc. will automatically remove the corresponding template instances from /run/systemd/system/.

7.2.3. Example systemd Templates

These can be found in the MARS repo in the systemd/ subdirectory. At the moment, the following are available (subject to further extension and improvements):

mars.path This ensures that the mountpoint /mars/ is already mounted before mars.service is started.

mars.service This starts and stops the MARS kernel module, provided that /mars is (somehow) mounted. The latter can be ensured by classical /etc/fstab methods, or by .mount units such as your own hand-crafted mars.mount unit.

mars-trigger.path This is used for remote triggering of the marsadm template engine from another MARS cluster member, e.g. when initiating a handover. Local triggering is also possible via touch /mars/userspace/systemd-trigger. When triggered, the command marsadm systemd-trigger is executed. In turn, this will re-compute all systemd templates and start those units where the local host is in primary role.

dev-mars-@{res}.path This is used for generic triggering of any systemd unit as set by marsadm set-systemd-unit $res $unit (see below in section 7.2.4).

vol-@{res}.mount This is one of the possible sub-ordinate targets which depend on dev-mars-@{res}.path. For fully automatic activation of this target, use something like marsadm set-systemd-unit mydata vol-mydata.mount or similar.

7.2.4. Handover involving systemd

First, you need to install your systemd templates into one of the template directories mentioned in section 7.2.2. In case you have never used the template engine before, you can create the first instantiation via marsadm systemd-trigger. Afterwards, inspect /run/systemd/system/ for newly created template instances and check them.

For each resource $res, you should set (potentially different) systemd targets via marsadm set-systemd-unit $res "$start_unit" "$stop_unit". Notice that $start_unit and $stop_unit typically denote different targets (with few exceptions), for the following reason:

Example: assume your stack consists of vol-@{res}.mount and nfs-export-@{res}.service. Before the filesystem can be exported via nfs, it first needs to be mounted. At startup, systemd can do this easily for you: just add a Requires= dependency between both targets, or similar. However, the situation can become tricky upon shutdown. Theoretically, systemctl stop nfs-export-@{res}.service could work in some cases, but in general it is not reliable. Reason: there might be other sister units which also depend on the mount. In some cases you will not necessarily notice those sisters, because systemd can add further (internal) targets automatically. The problem is easily solvable by systemctl stop vol-@{res}.mount, which will automatically tear down all dependencies in reverse order. For maximum safety, $start_unit should always point at the tip of your stack, while $stop_unit should point at the bottom (but one level higher than /dev/mars/$res).

Removing any systemd targets is also possible via marsadm set-systemd-unit $res "".

When everything is set up properly, the following should work:

1. Issue marsadm primary $res on another node which is currently in secondary role.
2. As a consequence, systemctl stop "$stop_unit" should be automatically executed at the old primary side.

3. After a while, the MARS kernel module will notice that /dev/mars/$res is no longer opened. You can check this manually via marsadm view-device-opened $res, which will tell you a boolean result. In case the device is not closed, ordinary handover cannot proceed, because somebody could (at least potentially) write some data into it, even after the handover, which would lead to a split brain. Therefore MARS must insist that the device is closed before ordinary handover will proceed. In case it is not closed, you can (a) use primary --force, which will likely provoke a split brain, or (b) check your systemd configuration or other sources of error to find out why the device is not closed. Possible reasons could be hanging processes or hanging sessions which might need a kill or a kill -9 or similar. Notice that lsof does not catch all possible sources like (recursive or bind-) mounts.

4. Once /dev/mars/$res has disappeared, the ordinary MARS handover from the old primary to the new site should proceed as usual.

5. After /dev/mars/$res has appeared at the new site, systemctl start "$start_unit" should be automatically executed.

The rest depends on your systemd and its configuration. For example, you can configure systemd targets for activation of VMs, or for LXC containers, or for iSCSI exports, or for nfs exports, or for glusterfs exports, or for whatever you need. For true geo-redundancy, you will likely have to include some quagga or bird or other BGP configuration into your stack.

7.3. Creating Backups via Pseudo Snapshots

When all your secondaries are homogeneously located in a standby datacenter, they will be almost idle all the time. This is a waste of computing resources. Since MARS is no substitute for a full-fledged backup system, and since backups may put high system load onto your active side, you may want to utilize your passive hardware resources in a better way.

MARS supports this thanks to its ability to switch pause-replay independently from pause-fetch. The basic idea is simple: just use pause-replay at your secondary site, but leave the replication of transaction logfiles intact by deliberately not saying pause-fetch. This way, your secondary replica (block device) will stay frozen for a limited time, without losing your redundancy: since the transaction logs continue to replicate in the meantime, you can start resume-replay at any time, in particular when a primary-side incident should happen unexpectedly. The former secondary will just catch up by replaying the outstanding parts of the transaction logs in order to become recent.

However, some details have to be obeyed. In particular, the current version of MARS needs an additional detach operation in order to release exclusive access to the underlying disk /dev/lv/$res. Future versions of MARS are planned to support this more directly, without need for an intermediate detach operation.

Beware: mount -o ro /dev/vg/$res can lead to unnoticed write operations if you are not careful! Some journalling filesystems like xfs or ext4 may replay their journals onto the disk, leading to binary differences and thus destroying your consistency later when you re-enable resume-replay! Therefore, you may use small LVM snapshots (only in such cases). Typically, xfs journal replay will require only a few megabytes, so you typically don't need much temporary space for this.
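The individual steps are listed below. For regular, unattended use, the whole sequence might be wrapped in a small script along the following lines. This is only a sketch built from the commands of this section; the volume group name vg, the scratch mountpoint /mnt/tmp, the 100 MiB snapshot size and the use of tar as the backup tool are assumptions to adapt to your environment:

#!/bin/bash
# Sketch: pseudo-snapshot backup of one MARS resource on a secondary.
set -e
res="$1"

marsadm pause-replay "$res"                 # freeze the local replica
marsadm detach "$res"                       # release exclusive access to the disk
lvcreate --size 100m --snapshot --name "ro-$res" "/dev/vg/$res"
mount -o ro "/dev/vg/ro-$res" /mnt/tmp
tar -C /mnt/tmp -czf "/backup/$res.tar.gz" .   # draw the backup from the snapshot
umount /mnt/tmp
lvremove -f "/dev/vg/ro-$res"
marsadm up "$res"                           # re-attach and resume fetch + replay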
Here is a more detailed description of the steps:

1. marsadm pause-replay $res
2. marsadm detach $res
3. lvcreate --size 100m --snapshot --name ro-$res /dev/vg/$res
4. mount -o ro /dev/vg/ro-$res /mnt/tmp
5. Now draw your backup from /mnt/tmp/
6. umount /mnt/tmp
7. lvremove -f /dev/vg/ro-$res
8. marsadm up $res

Hint: during the backup, the transaction logs will accumulate on /mars/. In order to avoid overflow of /mars/ (c.f. section 4.4), don't unnecessarily prolong the backup duration.

8. LV Football / VM Football / Container Football

The Football scripts can be obtained in two different ways:

1. git clone --recurse-submodules https://github.com/schoebel/mars then cd mars/football/
2. git clone https://github.com/schoebel/football

The --recurse-submodules method is the preferred way for non-developers, because the main repo contains a link to the right version of Football. When switching branches, you should use git submodule update for synchronizing the Football submodule with the MARS main checkout.

The recommended MARS branch for playing Football is mars0.1a.y. Although the old stable branch mars0.1.y has been updated for the most important marsadm features merge-cluster and split-cluster, it does not scale well for Football and can cause operational problems when merging too many hosts together, showing some O(n²) metadata update behaviour where n is the number of machines in a MARS cluster. The future branch mars0.1b.y will contain more scalability improvements; in particular the split-cluster operation should no longer be needed at all because it is planned to scale with O(k) where k is the number of resources at a single host. This should allow creation of a virtual(!) BigCluster pool at metadata level (where metadata transfer rates are typically measured in KiB/s), consisting of thousands of machines, while at the same time creating a LocalSharding or FlexibleSharding model at the realtime IO paths (where some petabytes are pumped through thick pipelines).

Please check the other branches regularly at the github repo to see whether some newer branches have been marked “stable”, or at least “beta”. At the moment (spring 2018), mars0.1a.y is marked “beta” although it has been in production on several thousands of machines for several months.

Low-level documentation is available by calling any of the scripts with the --help parameter (see also appendix H.2 ff):

• ./football.sh --help
• ./screener.sh --help

By adding --verbose, you can get a list of parameters for configuring and tweaking.

8.1. Football Overview

Topmost architectural level (not yet implemented): [figure: the generic pool-optimizer connected to input drivers, output drivers, an ssh status cache, the machine pool, parameters / settings, action CSVs and action plugins]

The planned heart of the Football system is the generic pool optimizer, which aims to provide functionality similar to Kubernetes, but working on a sharding architecture. Instead of controlling stateless Docker containers, its designated goal is to control masses of LVs on thousands of machines, creating a “Virtually Distributed LVM pool” (petabytes of total storage), and doing things similar to Software Defined Storage (SDS) on the virtual pool. In addition to load balancing of storage space (and its special cases like hardware lifecycle), there will be designated plugins for dealing with the CPU and RAM dimensions. Further dimensions and a variety of goal functions could be added via future plugins.
The optimizer itself aims to be as generic as possible, while functionality and interfaces can be added via plugins and/or drivers. Future versions might even support DRBD in addition to MARS. The first version may use a simple greedy algorithm for solving the underlying NP-complete problem, but could be augmented with more sophisticated problem solvers in the future. The automatic operations generated by the pool-optimizer will be customizable by dozens of parameters, and also extendable by action plugins. At the moment, the following football.sh actions are planned:

migrate This will move an LV (together with its VM / LXC container / etc) to a different machine in the machine pool. This is the classical Football “kick” operation.

shrink This decreases the occupied LV space of a filesystem (currently only xfs is implemented, but this is easily extendable) via creation of a smaller temporary LV at the hypervisor, then transferring all data during operations via local rsync, then shutting down the VM for a short period, doing a final incremental rsync, renaming the copied temporary LV to its original name, restarting the VM on the new version (which contains the same data as before but wastes less space), and finally re-establishing the MARS replicas (but of course with the smaller LV size).

extend This is much easier than shrinking: it first increases the underlying LV size dynamically on all replicas, then runs marsadm resize, and finally calls xfs_growfs while the filesystem remains mounted and while the VM / container is running.

migrate+shrink Similar to migrate immediately followed by shrink, but produces less network traffic and runs faster.

migrate+shrink+back Use this when there is not enough local temporary space for shrinking. The LV is first migrated to a temporary host, then shrunk, and finally migrated back to its original position.

By running the overall system in an endless loop, a control loop for permanent optimization can be established. Typical periods are every few days, or once a week. In addition, manual triggering is also possible.

The result of an (incremental) pool-optimizer run is a CSV file, which may be automatically forwarded to the execution engine football.sh for manual execution, or to screener.sh for mass execution on a common control machine. Alternatively, intermediate steps like manual checking, filtering, etc. may be inserted into the processing pipeline. The execution engine football.sh, resp. its 1&1-internal variant tetris.sh, is already in production at 1&1, and has already reached more than 300 migrations per week.

Architecture of the execution engine: [figure: humans and screener.sh with its Screener plugins drive football.sh with its Football plugins; both take parameters / settings and act on the machine pool via ssh and output drivers]

The so-called Screener is simply a generic program allowing mass execution of arbitrary scripts in background screen sessions. This allows masses (several hundreds, possibly thousands) of long-lasting processes (hours or days) to run unattended in the background, while allowing a (larger) group of sysadmins to attach / detach to screen sessions at any time for corrective by-hand actions, e.g. in case of failures or other problems, or for supervision, etc.

When Screener is combined with the Football execution engine football.sh, more specialized functionality is available (via a variety of plugins):

• Optional waiting for sysadmin confirmation before some customer downtime is initiated.
• Automatic generation of motd status reports to other sysadmins.

• Automatic sending of email alerts or status reports, e.g. on errors or critical errors, etc. By sending email to SMS gateways, real-time alerting can be configured (e.g. over the weekend).

• Generic interfacing to external scripts with configurable parameters, e.g. for triggering monitoring systems, feeding external databases, etc.

Screener can detect and will automatically manage the following states (in this example, all state lists are empty):

$common_user> ./screener.sh list
List of waiting :
List of delayed :
List of condition :
List of running :
List of critical :
List of serious :
List of interrupted :
List of illegal :
List of failed :
List of timeouted :
List of done :

Screener can discriminate the severity of errors as follows:

failed An error occurred outside of critical sections, e.g. during preparation of LV space etc. During ordinary operations, VMs / containers are usually running continuously, and no customer impact is to be expected. Typically, ./screener.sh restart $resource should fix the problem if it is only a temporary one. However, for maximum safety, manual inspection via ./screener.sh attach $resource or inspection of the logfile via ./screener.sh show $resource is recommended before trying an automatic restart.

serious An error occurred while a VM / container was temporarily stopped, which would normally lead to customer downtime, but Football was able to compensate the problem for now by automatically restarting the VM. Thus no long-lasting customer impact has likely occurred. However, manual inspection and repair by sysadmins is likely necessary.

critical An uncompensated error occurred during customer downtime. The VM / container is likely down. This will need manual sysadmin actions ASAP, such as hardware replacement, networking fixes, etc.

timeouted This means that the script is assumed to hang because it did not produce any output for more than $session_timeout seconds (default 3600 * 3 = 3 hours).

illegal This means that a precondition is not met. For example, there is not enough space at the target LVM.

interrupted Somebody has pressed Ctrl-c in a screen session, or has otherwise sent a signal to the running script. As a result, a signal trap has been executed.

Ordinary Screener states during execution:

running This means that a (background) process is currently running. You can attach to the screen session either manually via screen -x $pid.$resource, or more comfortably via ./screener.sh attach $resource. Then you can use screen as documented in man screen. The most important operation is detaching via the keystrokes Ctrl-a d. Notice: don't press Ctrl-c unless you know what you are doing. In most cases, this will terminate the running process, and consequently lead to interrupted or failed or even critical state (depending on the moment of the keypress). Depending on the parameter drop_shell, the Screener session will also terminate, or you will get an interactive shell for manual repair.

waiting When the plugins football-waiting and screener-waiting are configured properly (which is not the default), the script execution will pause immediately before a customer downtime action would be started. Now any sysadmin from the larger group has a chance to ./screener.sh attach $resource and to press RETURN to continue the waiting script and to personally watch the course of the critical section.
There are some more comfortable variants like ./screener.sh continue $resource for background continuation of a single session, or ./screener.sh continue 100 which can be used for continuing masses of waiting sessions. There are further variants which automatically attach to sessions, see Appendix H.4.

delayed This state is only entered before lvremove $resource is executed (which will destroy your old internal backup copy), and when configured appropriately. Typically, you also need to configure the $wait_before_cleanup variable in order to avoid endless waiting. Notice that old LV data becomes outdated after a while, so please don't unnecessarily prolong the running time of your scripts by choosing overly long $wait_before_cleanup values.

condition Special case of delay: some condition is currently not met, such as the $business_hours feature, where you can configure when customer downtimes are allowed, and when not.

done This means that the script reported successful execution by exit status 0. The background screen session terminated automatically. You can inspect the logfile manually via ./screener.sh show $resource, or by looking into the directory $screener_logdir/done/. Logfiles of other states can also be inspected (or monitored by standard tools like grep) by looking into sister directories, such as $screener_logdir/running/.

When running Screener for several months or years, old logfiles will accumulate in these directories over time. Call ./screener.sh purge or ./screener.sh cron regularly via a cron job, or archive your old logfiles from time to time via another method.

8.2. HOWTO instantiate / customize Football

In order to install and operate Football, the recommended deployment strategy is bottom-up, layer by layer. Top-down strategies should be used for planning only. An Egyptian pyramid can never be built, even with some billions of workers, by starting at the tip and creating the foundations as the very last step. Such an attempt would end up in a disaster.

Testing each layer separately is very important. Before proceeding to the next higher layer, first ensure that every lower layer is working correctly. Otherwise debugging can become tricky.

8.2.1. Block Device Layer

Step-by-step instructions can be found in chapter 3 on page 56. Please ensure that your hardware (including RAID controllers and LVM and so on), your operating system, your network / setup, and MARS are working correctly before proceeding to the next layer.

8.2.2. Mechanics Layer of Cluster Operations

In the following example, it is assumed that systemd is used, as explained in section 7.2 on page 125, and now applied to vm4711 supposed to run on hypervisors hyper1234a (primary role) and hyper1234b (secondary role), which is assumed to be controllable via the following systemd start and stop units:

• marsadm set-systemd-unit vm4711 lxc-vm4711.target vol-vm4711.mount

Test the cluster mechanics layer like in the following example (see also the sketch below):

• On host hyper1234b, the following must work: marsadm primary vm4711
This must result in an automatic handover of vm4711 from the current primary site hyper1234a to the new primary hyper1234b, as explained in section 7.2 on page 125. Please check that vm4711 is running correctly at the new location. It must be reachable via the network.
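A minimal test sequence might look like the following sketch. It only uses commands documented in this manual (marsadm set-systemd-unit, marsadm primary, marsadm view); the hostnames, the resource name and the final reachability check are the example values from above, resp. assumptions:

# One-time setup (the relevant state is replicated by MARS): declare start/stop units.
marsadm set-systemd-unit vm4711 lxc-vm4711.target vol-vm4711.mount

# On hyper1234b, which is currently secondary: request the primary role.
marsadm primary vm4711

# Give systemd and MARS a moment, then verify the handover.
sleep 30
marsadm view vm4711          # should now show hyper1234b in primary role
ping -c 3 vm4711             # the container / VM must be reachable again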
In case you are using BGP because hyper1234a and hyper1234b are located in different datacenters, ensure that BGP is also controlled by your systemd unit dependencies, and test it. 134 8.2. HOWTO instantiate / customize Football 8.2.3. Mechanics Layer of Football Operations At the moment, there are two alternative plugins already implemented in the Football subproject (see subdirectory football/plugins/). Of course, you can implement some further plugins. Please put them under GPL, and share them. Please contact the author of MARS for inclusion into the official MARS release. football-cm3.sh This plugin can be only used at Shared Hosting Linux (ShaHoLin) at 1&1, since it is bound to a specific proprietary instance. However, the sourcecode of the plugin itself (not the code called by the plugin, e.g. over REST interfaces) is under GPL. You can (and should ) inspect the plugin code, and learn how a real-world system (which has grown over some decades and bears a lot of history) is actually working at certain points. This plugin is automatically activated when called via the symlink tetris.sh instead of directly calling football.sh. This has historic reasons. football-basic.sh This plugin uses the new systemd interface of marsadm for controlling the mechanics. See section 7.2 on page 125. You should be familiar with commands like marsadm set-systemd-unit. Manual handover via marsadm primary $resource must be already working (with high reliability ; check that any umount works everywhere without hangups) before you can start using this plugin for football.sh. This plugin is automatically activated when calling football.sh. It can be deactivated by overriding variable enable_basic=0. 8.2.3.1. Configuring and Overriding Variables A detailed list of all available customization options can be obtained via ./football.sh --help --verbose. Each option is documented by some help text, and you can always see the default settings. See also section H.3 on page 170. If you create any new plugin for Football, or if you modify an existing one, please follow these standards. Try to describe any option as concisely as possible. Configuring is possible in the following ways, in order of precedence: • at the command line via ./football.sh --$variable_name=$value $arguments. • via environment variables, e.g. globally via export $variable_name=$value && ./football.sh $arguments, or locally via $variable_name=$value ./football.sh $arguments. • by adding some small football-*.conf files into one of the directories /usr/lib/mars/plugins /etc/mars/plugins $script_dir/plugins $HOME/.mars/plugins ./plugins, in this order of precedence. This list of directories can be modifed externally over the environment variable football_includes (but not during already running inclusions of football-*.conf files). 8.2.3.2. football-basic.sh Customization Here is a brief summary of the most important configuration tasks and options: initial_hostname_file Somehow, the football-basic.sh plugin must know the hostnames of your pool. Once Football is working, the hostname will be automatically maintained whenever marsadm join-cluster or marsadm merge-cluster is executed somewhere. For your hardware deployment strategy, this means the following: just deploy any new hardware, or remove your old one (after Football has emptied all of your former LV resources). It does not matter how you are doing this, e.g. via OpenStack, or via the proprietary Schlunix methods used at ShaHoLin, or whatever. 
Then you have the following options for adding the new machines to the Football hostname cache (see variable hostname_cache): 1. Write the pure hostname(s) into the file as configured with initial_hostname_file (by default: ./hostnames.input). Each hostname must be on its own ASCII line. Not only these new hosts will be picked up automatically, but also... 135 8. LV Football / VM Football / Container Football 2. ...any further hosts reported anywhere (at the already known hosts) by marsadm view-cluster-members, transitively. Consequence: if you are running the new mars0.1b.y (or newer) branch of MARS, you don’t need marsadm split-cluster anymore. Then you can operate several thousands of machines as a big virtual cluster, even if their storage is local (see LocalSharding model described in section 1.6.1 on page 20). Previous versions of MARS, like mars0.1.y and mars0.1a.y, are not yet scalable at their metadata exchange level. Trying to join-cluster or merge-cluster several tens or even hundreds of machines with those versions will surely lead to a disaster. Always use marsadm split-cluster at those versions, regularly. First upgrade to the future mars0.1b.y (or later versions) before creating big clusters at metadata level! 3. Use ./football.sh basic_add_host $hostname for adding a single new host manually. Afterwards, the transitive closure of all reachable hosts is computed as usual. This may also be used for the very first initialization of a fresh Football installation, provided you already have a big cluster at metadata level. Test the Football mechanics like one of the following example command sequences, where it is assumed that hyper4321a and hyper4321b are already newly deployed hypervisors having enough local LVM storage, and have been already added to the MARS cluster via marsadm join-cluster, or have been at least added to hostname_cache as explained above: • ssh-add; ./football.sh migrate vm4711 hyper4321a hyper4321b • ssh-add; ./football.sh migrate vm4711 hyper4321a hyper4321b --screener; ./screener.sh attach vm4711 Check the automatically produced logfile (via ./screener.sh show vm4711) that Football has automatically determined the old hypervisor where vm4711 was running before, that it has automatically executed marsadm merge-cluster when necessary, and has created the LV replicas at the new hypervisors, and has executed some marsadm join-resource commands, has automatically waited for MARS fast fullsync to finish, then successfully executed an automatic handover to the new primary hypervisor, and finally has destructed the old MARS replicas including their old LVs. Check that vm4711 is running correctly at the new hypervisor pair, and that handover between the new hypervisor sites *a and *b is working correctly. A larger group of sysadmins can co-work over a central common control machine via ssh agent forwarding (which must be enabled in /etc/ssh/sshd_config) in the following way: • At the workstation: ssh-add; ssh -A football@common-control.mycompany.org Then cd $script_dir and run your ./football.sh or ./screener.sh commands as usual. The automatically generated logfiles will be tagged with the real usernames from your original workstation login, as reported by ssh-add -l, even transitively when using ssh agent forwarding. Thus you may use a common username like football on the common1 control machine. 
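Putting the hostname-cache options from this section together, initializing a fresh Football installation might look like the following sketch. The hostnames are the example values from above; ./hostnames.input is the documented default of initial_hostname_file:

# Option 1: seed the hostname cache with one hostname per ASCII line.
cat > ./hostnames.input <<EOF
hyper1234a
hyper1234b
EOF

# Option 3: add single hosts by hand; the transitive closure of all hosts
# reported by marsadm view-cluster-members is computed automatically afterwards.
./football.sh basic_add_host hyper4321a
./football.sh basic_add_host hyper4321b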
Hint: use ./screener.sh list (or one of its more specific variants like ./screener.sh list-running) for determining what’s currently going on in a larger group of sysadmins. 1 Of course, it is also possible to maintain individual accounts for the same Unix group, and set umask and common directory permissions accordingly, such that the classical group-wise working concept from the 1970s will do the rest. This is much more work, but can establish more fine-grained access control. Even more sophisticated methods could involve ACLs, but suchalike is probably only necessary at extremely high-sensitive installations. 136 9. MARS for Developers This chapter is organized strictly top-down. If you are a sysadmin and want to inform yourself about internals (useful for debugging), the relevant information is at the beginning, and you don’t need to dive into all technical details at the end. If you are a kernel developer and want to contribute code to the emerging MARS community, please read it (almost) all. Due to the top-down organization, sometimes you will need to follow some forward references in order to understand details. Therefore I recommend reading this chapter twice in two different reading modes: in the first reading pass, you just get a raw network of principles and structures in your brain (you don’t want to grasp details, therefore don’t strive for a full understanding). In the second pass, you will exploit your knowlegde from the first pass for a deeper understanding of the details. Alternatively, you may first read the sections about general architecture, and then start a bottom-up scan by first reading the last section about generic objects and aspects, and working in reverse section order (but read subsections in-order) until you finally reach the kernel interfaces / symlink trees. 9.1. Motivation / Politics MARS is not yet upstream in the Linux kernel. This section tries to clear up some potential doubts. Some people have asked why MARS uses its own internal framework instead of directly 1 being based on some already existing Linux kernel infrastructures like the device mapper. Here is a list of technical reasons: 1. The existing device mapper infrastructure is based on struct bio. In contrast, the new XIO personality of the generic brick infrastructure is based on the concept of AIO (Asynchronous IO), which is a true superset of block IO. 2. In particular, struct bio is firmly referencing to struct page (via intermediate struct bio_vec), using types like sector_t in the field bi_sector. Basic transfer units are blocks, or sectors, or pages, or the like. In contrast, struct aio_object used by the XIO personality can address arbitrary granularity memory with byte resolution even at odd2 positions in (virtual) files / devices, similar to classical Unix file IO, but asynchronously. Practical experience shows that even non-functional properties like performance of many datacenter workloads are profiting from that3 . The AIO/XIO abstraction contains no fixed link to kernel abstractions and should be easily portable to other environments. In summary, the new personality provides a uniform abstraction which abstracts away from multiple different kernel interfaces; it is designed to be useful even in userspace. 3. Kernel infrastructures for the concept of direct IO are different from those for buffered IO. The XIO personality used by MARS subsumes both concepts as use case variants. 
1 Notice that indirect use of pre-existing Linux infrastructure is not only possible, but actually implemented, by usinig it internally in brick implementations (black-box principle). However, such bricks are not portable to other environments like userspace. 2 Some brick implementations (as opposed to the capabilities of the interface) may be (and, in fact, are) restricted to PAGE_SIZE operations or the like. This is no general problem, because IOP can automatically insert some translator bricks extending the capabilities to universal granularity (of course at some performance costs). 3 The current transaction logger uses variable-sized headers at “odd” addresses. Although this increases memcpy() load due to “misalignment”, the overall performance was provably better than in variants where sector / page alignment was strictly obeyed, but space was wasted for alignments. Such functionality is only possible if the XIO infrastructure allows for (but doesn’t force) “mis-aligned” IO operations. In future, many different transaction logfile formats showing different runtime behaviour (e.g. optimized for high-throughput SSD loads) may co-exist in parallel. Note that properly aligned XIO operations bear no noticeable overhead compared to classical block IO, at least in typical datacenter RAID scenarios. 137 9. MARS for Developers Buffering is an optional internal property of XIO bricks (almost non-functional property with support for consistency guarantees). 4. The AIO/XIO personality is generically designed for remote operations over networks, at arbitrary places in the IO stack, with (almost4 ) no semantic differences to local operations (built-in network transparency). There are universal provisions for mixed operation of different versions (rolling software updates in clusters / grids). 5. The generic brick infrastructure (as well as its personalities like XIO or any other future personality) supports dynamic re-wiring / re-configuration during operation (even while parallel IO requests are flying, some of them taking different paths in the IO stack in parallel). This is absolutely needed for MARS logfile rotation. In the long term, this would be useful for many advanced new features and products, not limited to multipathing. 6. The generic brick infrastructure (and in turn all personalities) provide additional comfort to the programmer while enabling increased functionality: by use of a generalization of aspect orientation5 , the programmer need no longer worry about dynamic memory allocations for local state in a brick instance. MARS is automating local state even when dynamically instantiating new bricks (possibly having the same brick type) at runtime. Specifially, XIO is automating request stacking at the completion path this way, even while dynamically reconfiguring the IO stack6 . A similar automation7 does not exist in the rest of the Linux kernel. 7. The generic brick infrastructure, together with personalities like XIO, enables new longterm functional and non-functional opportunities by use of concepts from instanceoriented programming (IOP8 ). The application area is not limited to device drivers. For example, a new personality for stackable filesystems could be developed in future. In summary, anyone who would insist that MARS should be directly 9 based on pre-existing kernel structures / frameworks instead of contributing a new framework would cause a massive regression of functionality. 
• On one hand, all code contributed by the MARS project is non-intrusive into the rest of the Linux kernel. From the viewpoint of other parts of the kernel, the whole addition behaves like a driver (although its infrastructure is much more than a driver). • On the other hand, if people are interested, the contributed infrastructure may be used to add to the power of the Linux kernel. It is designed to be open for contributions. 4 By default, automatic network connection re-establishment and infinite network retries are already implemented in the xio_client and xio_server bricks to provide fully transparent semantics. However, this may be undesirable in case of fatal crashes. Therefore, abort operations are also configurable, as well as network timeouts which are then mapped to classical IO errors. 5 Similar to AOP, insertion of IOP bricks for checking / debugging etc is one of the key advantages of the generic brick infrastructure. In contrast to AOP where debugging is usually {en,dis}abled statically at compile time, IOP allows for dynamic (re-)configuration of debugging bricks, automatic repair, and many more features promoted by organic computing. 6 The generic aspect orientation approach leads to better separation of concerns: local state needed by brick implementations is not visible from outside by default. In other words, local state is also private state. Accidental hampering of internal operations is impeded. Example from the kernel: in include/linux/blkdev.h the definition of struct request contains the following comment: /* the following two fields are internal, NEVER access directly */. It appears that struct request contains not only fields relevant for the caller, but also internal fields needed only in some specific callees. For example, rb_node is documented to be used only in IO schedulers. XIO goes one step further: there need not exist exactly one IO scheduler instance in the IO stack for a single device. Future xio_scheduler_{deadline,cfq,...} brick types could be each instantiated many times, and in arbitrary places, even for the same (logical) device. The equivalent of rb_node would then be automatically instantiated multiple times for the same IO request, by automatically instantiating the right local aspect instances. 7 DM can achieve stacking and dynamic routing by a workaround called request cloning, potentially leading to mass creation of temporary / intermediate object instances. 8 See http://athomux.net/papers/paper_inst2.pdf 9 Notice that kernel-specific structures like struct bio are of course used by MARS, but only inside the blackbox implementation of bricks like mars_bio or mars_if which act as adaptors to/from that structure. It is possible to write further adaptors, e.g. for direct interfacing to the device mapper infrastructure. 138 9.2. Architecture Overview • A possible (but not the only possible) way to do this is giving the generic brick framework / the XIO personality as well as future personalities / the MARS application the status of a subsystem inside the kernel (in the long term), similar to the SCSI subsystem or the network subsystem. Noone is forced to use it, but anybody may use it if he/she likes. • Politically, the author is a FOSS advocate willing to collaborate and to support anyone interested in contributions. 
The author's personal interest is long-term and is open for both in-tree and out-of-tree extensions of both the framework and MARS by any other party obeying the GPL and not hazarding FOSS by patents (instead supporting organizations like the Open Invention Network). The author is open to closer relationships with the Linux Foundation and other parts of the Linux ecosystem.

9.2. Architecture Overview

[figure: framework architecture for MARS and future projects, from the MARS LCA2014 presentation by Thomas Schöbel-Theuer. External software and cluster managers sit on top of the userspace interface marsadm; below that the framework application layer (MARS Light, MARS Full, etc), the framework personalities (XIO = eXtended IO ≈ AIO with the MARS Light XIO bricks, future Strategy bricks for MARS Full, and other future personalities and their bricks), and the generic brick layer (IOP = Instance Oriented Programming + AOP = Aspect Oriented Programming) consisting of generic bricks, generic objects and generic aspects]

9.3. Some Architectural Details

The following pictures show some “zones of responsibility”, not necessarily a strict hierarchy (although Dijkstra's famous layering rules from THE are respected as much as possible). The construction principle follows the concept of Instance Oriented Programming (IOP) described in http://athomux.net/papers/paper_inst2.pdf. Please note that MARS is only instance-based10, while MARS Full is planned to be fully instance-oriented.

10 Similar to OOP, where “object-based” means a weaker form of “object-oriented”, the term “instance-based” means that the strategy brick layer need not be fully modularized according to the IOP principles, but the worker brick layer already is.

9.3.1. MARS Architecture

[figure: higher-level tools (cluster managers, HA managers, etc) use the marsadm interface (parameters, exit code, stdout; similar to drbdadm); the userspace tool marsadm (perl) with helper library functions talks to the /mars/ symlink tree interface; below that the MARS ad-hoc strategy layer, the XIO bricks ({a,b,s}io, if, trans_logger, ...), the XIO personality, and the generic brick / aspect / object infrastructure]

9.3.2. MARS Full Architecture (planned)

[figure: as above, but with the /mars/full/ symlink tree interface, XIO worker bricks ({a,b,s}io, if, trans_logger, ...), MARS Full strategy bricks, and both the XIO brick personality and the StrategY brick personality on top of the generic brick / aspect / object infrastructure]

9.4. Documentation of the Symlink Trees

The /mars/ symlink tree serves the following purposes, all at the same time:

1. Communication between cluster nodes, see sections 4.2 and 4.3. This is even the only communication between cluster nodes (apart from the contents of transaction logfiles and sync data).

2. Internal interface between the kernel module and the userspace tool marsadm.

3. Internal persistent repository which keeps state information between reboots (also in case of node crashes). It is even the only place where state information is kept. There is no other place like /etc/drbd.conf.

Because of its internal character, its representation and semantics may change at any time without notice (e.g. via an internal upgrade procedure between major releases). It is not an external interface to the outer world. Don't build anything on it. However, knowledge of the symlink tree is useful for advanced sysadmins, for human inspection and for debugging.
And, of course, for developers. As an “official” interface from outside, only the marsadm command should be used.

9.4.1. Documentation of the MARS Symlink Tree

9.5. XIO Worker Bricks

9.6. StrategY Worker Bricks

NYI

9.7. The XIO Brick Personality

9.8. The Generic Brick Infrastructure Layer

9.9. The Generic Object and Aspect Infrastructure

A. Technical Data MARS

MARS has some built-in limitations which should be overcome1 by the future MARS Full. Please don't exceed the following limits:

• maximum 10 nodes per cluster
• maximum 10 resources per cluster
• maximum 100 logfiles per resource

1 Some internal algorithms are quadratic. The reason is that MARS evolved from a lab prototype which wasn't originally intended for enterprise-grade usage, but should have been succeeded by the fully instance-oriented MARS Full much earlier.

B. Handout for Midnight Problem Solving

Here are generic instructions for the generic marsadm and command-line level. Other levels (e.g. different types of cluster managers, PaceMaker, control scripts / rc scripts / upstart scripts, etc.) should be described elsewhere.

B.1. Inspecting the State of MARS

For manual inspection, please prefer the new marsadm view all over the old marsadm view-1and1 all. It shows more appropriate / detailed information. Hint: this might change in the future when somebody programs better macros for the view-1and1 variant, or creates even better other macros.

# watch marsadm view all

Checking the low-level network connections at runtime:

# watch "netstat --tcp | grep 777"

Meaning of the port numbers (as currently configured into the kernel module, may change in future):

• 7777 = metadata / symlink propagation
• 7778 = transfer of transaction logfiles
• 7779 = transfer of sync traffic

7777 must always be active on a healthy cluster. 7778 and 7779 will appear only on demand, when some data is transferred. Hint: when one of the columns Send-Q or Recv-Q is constantly at high values, you might have a network bottleneck.

B.2. Replication is Stuck

Indications for a stuck replication:

• One of the flags shown by marsadm view all or marsadm view-flags all contains a "-" symbol (dash). This means that some switch is currently switched off (deliberately). Please check whether there is a valid reason why somebody else switched it off. If the switch-off is just by accident, use the following command to fix the stuck replication:

# marsadm up all

(or replace all by a particular resource name if you want to start only a specific one). Note: up is equivalent to the sequence attach; resume-fetch; resume-replay; resume-sync. Instead of switching each individual knob, use up as a shortcut for switching on anything which is currently off.

• netstat --tcp | grep 7777 does not show anything. Please check the following:

– Is the kernel module loaded? Check lsmod | grep mars. When necessary, run modprobe mars.
– Is the network interface down? Check ifconfig, and/or ethtool and friends, and fix it when necessary.
– Is a ping possible? If not, fix the network / routing / firewall / etc. When fixed, the MARS connections should automatically appear after about 1 minute.
– When ping is possible, but a MARS connection to port 7777 does not appear after a few minutes, try to connect to the remote port 7777 by hand via telnet. But don't type anything, just abort the connection immediately when it works!
Typing anything will almost certainly throw a harsh error message at the other server, which could unnecessarily alarm other people.

• Check whether marsadm view all shows some progress bars somewhere. Example:

istore-test-bap1:~# marsadm view all
--------- resource lv-0
lv-0 OutDated[F] PausedReplay dCAS-R Secondary istore-test-bs1
replaying: [>...................] 1.21% (12/1020) MiB logs: [2..3]
> fetch: 1008.198 MiB rate: 0 B/sec remaining: --:--:-- hrs
> replay: 0 B rate: 0 B/sec remaining: 00:00:00 hrs

At least one of the rate: values should be greater than 0. When none of the rate: values indicate any progress for a longer time, try marsadm up all again. If it doesn't help, check and repair the network. If even this does not help, check the hardware for any IO hangups, or kernel hangups. First, check the RAID controllers. Often (but not always), a stuck kernel can be recognized when many processes are permanently in state "D" for a long time: ps ax | grep " D" | grep -v grep or similar. Please check whether there is just an overload, or really a true kernel problem. Discrimination is not easy, and requires experience (as with any other system; not limited to MARS). A truly stuck kernel can only be resurrected by rebooting. The same holds for any hardware problems.

• Check whether marsadm view all reports any lines like WARNING: SPLIT BRAIN ... detected. In such a case, check that there really is a split brain before obeying the instructions in section B.4. Notice that network outages, or a missing marsadm log-delete-all all or cron job, may cause an old split brain to be reported which has gone away in the meantime.

• Check whether /mars/ is too full. For a rough impression, df /mars/ may be used. For getting the authoritative values as internally used by the MARS emergency-mode computations, use marsadm view-rest-space (the unit is GiB). In practice, the differences are only marginal, at least on bigger /mars/ partitions. When there is only little rest space (or none at all), please obey the instructions in section B.3.

B.3. Resolution of Emergency Mode

Emergency mode occurs when /mars/ runs out of space, such that no new logfile data can be written anymore. In emergency mode, the primary will write any write requests directly to the underlying disk, as if MARS were not present at all. Thus, your application will continue to run. Only the replication as such is stopped.

Notice: emergency mode means that your secondary nodes are usually in a consistent, but outdated state (exception: when a sync was running in parallel to the emergency mode, the sync will be automatically started over again). You can check consistency via marsadm view-flags all. Only when a local disk shows a lower-case letter "d" instead of an uppercase "D" is it known to be inconsistent (e.g. during a sync). When there is a dash instead, it usually means that the disk is detached or misconfigured, or the kernel module is not started. Please fix these problems first before believing that your local disk is unusable. Even if it is really inconsistent (which is very unlikely, typically occurring only as a consequence of hardware failures, or of the above-mentioned exception), you have a big chance to recover most of the data via fsck and friends.
A currently existing emergency mode can be detected by:

primary:~# marsadm view-is-emergency all
secondary:~# marsadm view-is-emergency all

Notice: this delivers the current state, telling nothing about the past. Currently, emergency mode will also show something like WARNING: SPLIT BRAIN ... detected. This ambiguity will be resolved in a future MARS release. It is however not crucial: the resolution methods for both cases are very similar. If in doubt, start emergency resolution first, and only proceed to split brain resolution if it did not help.

Preconditions:

• Only for the current version of MARS: the space at the primary side should already have been released, and the emergency mode should already have been left. Otherwise, you might need the split-brain resolution method from section B.4.

• The network must be working. Check that the following gives an entry for each secondary:

primary:~# netstat --tcp | grep 7777

When necessary, fix the network first (see instructions above).

Emergency mode should now be resolved via the following instructions:

primary:~# marsadm view-is-emergency all
primary:~# du -s /mars/resource-* | sort -n

Remember the affected resources. Best practice is to do the following, starting with the biggest resource as shown by the du | sort output in reverse order, but only for the affected resources in the first place:

secondary1:~# marsadm invalidate
secondary1:~# marsadm log-delete-all all
... dito with all resources showing emergency mode
... dito on all other secondaries
primary:~# marsadm log-delete-all all

Hint: during the resolution process, some other resources might have gone into emergency mode concurrently. In addition, it is possible that some secondaries are stuck at particular resources while the corresponding primary has not yet entered emergency mode. Please repeat the steps in such a case, and additionally look for emergency modes at the secondaries. When necessary, extend your list of affected resources.

Hint: be patient. Deleting large bulks of logfile data may take a long time, at least on highly loaded systems. You should give the cleanup processes at least 5 minutes before concluding that an invalidate followed by log-delete-all had no effect! Don't forget to issue the log-delete-all at all cluster nodes, even when they seem unaffected. In very complex scenarios, when the primary roles of different resources are spread over different hosts (aka mixed operation), you may need to repeat the whole cycle iteratively a few times until the jam is resolved. If it does not go away, you have another chance with the following split-brain resolution process, which will also clean up emergency mode as a side effect.

B.4. Resolution of Split Brain and of Emergency Mode

Hint: in many cases (but not guaranteed), the previous recipe for resolution of emergency mode will also clean up split brain. Chances are good in case of k = 2 total replicas. Please collect your own experience as to which method works better for you!

Precondition: the network must be working. Check that the following gives an entry for each secondary:

primary:~# netstat --tcp | grep 7777

When necessary, fix the network first (see instructions above).
Inspect the split brain situation:

primary:~# marsadm view all
primary:~# du -s /mars/resource-* | sort -n

Remember those resources where a message like WARNING: SPLIT BRAIN ... detected appears. Do the following only for affected resources, starting with the biggest one. Work on only one resource at a time (before proceeding to the next one), and repeat the actions on that resource at every secondary (if there are multiple secondaries):

secondary1:~# marsadm leave-resource $res1
secondary1:~# marsadm log-delete-all all

Check whether the split brain has vanished everywhere. Start over with other resources at their secondaries when necessary. Finally, when no split brain is reported at any (former) secondary, do the following on the primary:

primary:~# marsadm log-delete-all all
primary:~# sleep 30
primary:~# marsadm view all

Now the split brain should be gone even at the primary. If not, repeat this step. In case even this should fail on some $res (which is very unlikely), read the PDF manual before using marsadm log-purge-all $res.

Finally, when the split brain is gone everywhere, rebuild the redundancy at every secondary via

secondary1:~# marsadm join-resource $res1 /dev/ /$res1

If even this method does not help, set up the whole cluster afresh by rmmod mars everywhere and creating a fresh /mars/ filesystem everywhere, followed by the same procedure as installing MARS for the first time (which is outside the scope of this handout).

B.5. Handover of Primary Role

When there exists a method for primary handover in higher layers such as cluster managers, please prefer that method (e.g. cm3 or other tools). If that doesn't work, or if you need to hand over some resource $res1 by hand, do the following:

• Stop the load / application corresponding to $res1 on the old primary side.
• umount /dev/mars/$res1, or otherwise close any openers such as iSCSI.
• At the new primary: marsadm primary $res1
• Restart the application at the new site (in reverse order to the above).

In case you want to switch all resources which are not yet at the new side, you may use marsadm primary all.

B.6. Emergency Switching of Primary Role

Emergency switching is necessary when your primary is no longer reachable over the network for a longer time, or when the hardware is defective. Emergency switching will very often lead to a split brain, which requires lots of manual actions to resolve (see above). Therefore, try to avoid emergency switching whenever possible!

Hint: MARS can automatically recover after a primary crash / reboot, as well as after secondary crashes, just by executing modprobe mars after /mars/ has been mounted. Please consider waiting until your system comes up again, instead of risking a split brain.

The decision between emergency switching and continuing operation at the same primary side is an operational one. MARS can support your decision with the following information at the potentially new primary side (which was in secondary mode before):

istore-test-bap1:~# marsadm view all
--------- resource lv-0
lv-0 InConsistent Syncing dcAsFr Secondary istore-test-bs1
syncing: [====>..............] 27.84% (567/2048) MiB rate: 72583.00 KiB/sec remainin
> sync: 567.293/2048 MiB rate: 72583 KiB/sec remaining: 00:00:20 hrs
replaying: [>:::::::::::::::::::] 0.00% (0/12902) KiB logs: [1..1]
> fetch: 0 B rate: 38 KiB/s remaining: 00:00:00
> replay: 12902.047 KiB rate: 0 B/s remaining: --:--:--
When your target is syncing (like in this example), you cannot switch to it (same as with DRBD). When you had an emergency mode before, you should first resolve that (whenever possible). When a split brain is reported, try to resolve it first (same as with DRBD). Only in case you know that the primary is really damaged, or that it is really impossible to run the application there for some reason, is emergency switching desirable.

Hint: in case the secondary is inconsistent for some reason, e.g. because of an incremental fast full-sync, you have a last chance to recover most data after forceful switching by using a filesystem check or the like. This might even be faster than restoring data from the backup. But use it only if you are really desperate!

The amount of data which is known to be missing at your secondary is shown after the > fetch: indicator in human-readable form. However, in case of networking problems this information may be outdated. You always need to consider further facts which cannot be known by MARS.

When there exists a method for emergency switching of the primary in higher layers such as cluster managers, please prefer that method over the following one. If it doesn't work, or when a handover attempt has failed several times, or if you really need forceful switching of some resource $res1 by hand, you can do the following:

• When possible, stop the load / application corresponding to $res1 on the old primary side.
• When possible, umount /dev/mars/$res1, or otherwise close any openers such as iSCSI.
• When possible (if you have some time), wait until as much data as possible has been propagated to the new primary (watch the fetch: indicator).
• At the new primary: marsadm disconnect $res1; marsadm primary --force $res1
• Restart the application at the new site (in reverse order to the above).
• After the application is known to run reliably, check for split brains and clean them up when necessary.

C. Alternative Methods for Split Brain Resolution

Instead of marsadm invalidate, the following steps may be used. Preferably, start with the old “wrong” primaries first:

1. marsadm leave-resource mydata

2. After having done this on one cluster node, check whether the split brain is already gone (e.g. by saying marsadm view mydata). There are chances that you don't need this on all of your nodes. Only in very rare1 cases might it happen that the preceding leave-resource operations were not able to clean up all logfiles produced in parallel by the split brain situation.

3. Read the documentation about log-purge-all (see page 109) and use it.

4. If you want to restore redundancy, you can follow up with a join-resource phase to the old resource name (using the correct device name; double-check it!). This will restore your redundancy by overwriting your bad split brain version with the correct one. It is important to resolve the split brain before you can start the join-resource reconstruction phase! In order to keep as many “good” versions as possible (e.g. for emergency cases), don't re-join them all in parallel, but rather start with the oldest / most outdated / worst / inconsistent version first.
D. Alternative De- and Reconstruction of a Damaged Resource

In case leave-resource --host= does not work, you may use the following fallback. On the surviving new designated primary, give the following commands:

1. marsadm disconnect-all mydata
2. marsadm down mydata
3. Check by hand whether your local disk is consistent, e.g. by test-mounting it readonly, fsck, etc.
4. marsadm delete-resource mydata
5. Check whether the other vital cluster nodes no longer report the dead resource, e.g. via marsadm view all at each of them. In case the resource has not disappeared everywhere (which may happen during network problems), do the down ; delete-resource steps there as well (optionally again with --force).
6. Be sure that the resource has disappeared everywhere. When necessary, repeat the delete-resource with --force.
7. marsadm create-resource newmydata ... at the correct node using the correct disk device containing the correct version, and further steps to set up your resource from scratch, preferably under a different name to minimize any risk.

In any case, manually check whether a split brain is reported for any resource on any of your surviving cluster nodes. If you find one there (and only then), please (re-)execute the split brain resolution steps on the affected node(s).
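A condensed sketch of this fallback on the surviving designated primary follows. The node name, the resource name mydata, the disk device /dev/lv/mydata and the new name newmydata are placeholders; only the marsadm commands correspond to the numbered steps above:

survivor:~# marsadm disconnect-all mydata
survivor:~# marsadm down mydata
survivor:~# mount -o ro /dev/lv/mydata /mnt/check && umount /mnt/check   # or fsck, to check the disk by hand
survivor:~# marsadm delete-resource mydata
survivor:~# marsadm view all                                             # verify on every vital node that the resource is gone
survivor:~# marsadm create-resource newmydata /dev/lv/mydata             # rebuild from scratch, preferably under a new name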
E. Cleanup in case of Complicated Cascading Failures

MARS does its best to recover even from multiple failures (e.g. rolling disasters). Chances are high that the instructions from sections 3.4.3 / 3.4.4 or appendices C / D will work even in case of multiple failures, such as a network failure plus a local node failure at only 1 node (even if that node is the former primary node). However, in general (e.g. when more than 1 node is damaged and/or when the filesystem /mars/ is badly damaged) there is no guarantee that recovery will always succeed under any (weird) circumstances.

That said, your chances for recovery are very high when some disk remains usable at least at one of your surviving secondaries. It should be very hard to finally trash a secondary, because the transaction logfiles contain md5 checksums for all data records. Any attempt to replay corrupted logfiles is refused by MARS. In addition, the sequence numbers of log-rotated logfiles are checked for contiguity. Finally, the sequence path of logfile applications (consisting of logfile names plus their respective lengths) is additionally secured by a git-like incremental checksum over the whole path history (so-called "version links"). This should detect split brains even if logfiles are appended / modified after a (forceful) switchover has already taken place.

That said, your risk of final data loss is very high if you remove the BBU from your hardware RAID controller before all hot data has been flushed to the physical disks. Therefore, never try to "repair" a seemingly dead node before your replication is up again somewhere else! Only unplug the network cables when advised, but never try to repair the hardware instantly!

In case of desperate situations where none of the previous instructions have succeeded, your last chance is rebuilding all your resources from intact disks as follows:

1. Do rmmod mars on all your cluster nodes and/or reboot them. Note: if you are less desperate, chances are high that the following will also work when the kernel module remains active and a marsadm down is given everywhere instead, but for an ultimate instruction you should eliminate potential kernel problems by rmmod / reboot, at least if you can afford the downtime on concurrently operating resources.
2. For safety, physically remove the storage network cables on all your cluster nodes. Note: the same disclaimer holds. MARS really does its best, even when delete-resource is given while the network is fully active and multiple split-brain primaries are actively using their local device in parallel (approved by some testcases from the automatic test suite, but note that it is impossible to catch all possible failure scenarios). Don't challenge your fate, even when you are desperate! Don't rely on this! Nothing is absolutely fail-safe!
3. Manually check which surviving disk is usable, and which is the "best" one for your purpose.
4. Do modprobe mars only on that node. If that fails, rmmod and/or reboot again, start over with a completely fresh /mars/ partition (mkfs.ext4 on the /mars/ partition, or similar) everywhere on all cluster nodes, and continue with step 7.
5. If your old /mars/ works, and you did not already (forcefully) switch your designated primary to the final destination, do it now (see the description in section 3.4.2.2). Wait until any old logfile data has been replayed.
6. Say marsadm delete-resource mydata --force. This will clean up all internal symlink tree information for the resource, but will leave your disk data intact.
7. Locally build up the new resource(s) as usual, out of the underlying disks.
8. Check whether the new resource(s) work in standalone mode.
9. When necessary, repeat these steps with other resources. Now you can choose how to rebuild your cluster. If you rebuilt /mars/ anywhere, you must rebuild it on all new cluster nodes and start over with a fresh join-cluster on each of them, from scratch. It is not possible to mix the old cluster with the new one.
10. Finally, do all the necessary join-resources on the respective cluster nodes, according to your new redundancy scenario after the failures (e.g. after activating spare nodes, etc). If you have k > 2 replicas, start join-resource on the worst / most damaged version first, and start the next one preferably only after the previous sync has completed successfully. This way, you will permanently retain some (old and outdated, but hopefully usable) replicas while a sync is running. Don't start too many syncs in parallel.

Never use delete-resource twice on the same resource name after you already have a working standalone primary1. You might accidentally destroy your again-working copy! You can issue delete-resource multiple times on different nodes, e.g. when the network has problems, but doing so after re-establishment of the initial primary bears some risk. Therefore, the safest way is first deleting the resources everywhere, and then starting over afresh.

Before re-connecting any network cable on any non-primary (new secondaries), ensure that all /dev/mars/mydata devices are no longer in use (e.g. from an old primary role before the incident happened), and that each local disk is detached.
Only after that, you should be able to safely re-connect the network. The delete-resource given at the new primary should now propagate to each of your secondaries, and your local disk should be usable for a re-join-resource.

When you did not rebuild your cluster from scratch with fresh /mars/ filesystems, and one of the old cluster nodes is supposed to be removed permanently, use leave-resource (optionally with --host= and/or --force) and finally leave-cluster.

1 Of course, when you have not created the same resource anew, you may repeat delete-resource on other cluster nodes in order to get rid of local files / symlinks which had not been propagated to other nodes before.

F. Experts only: Special Trick Switching and Rebuild

The following is a further alternative for experts who really know what they are doing. The method is very simple and therefore well-suited for coping with mass failures, e.g. power blackout of whole datacenters. In case a primary datacenter fails as a whole for whatever reason and you have a backup datacenter, do the following steps in the backup datacenter:

1. Fencing step: by means of firewalling, ensure that the (virtually) damaged datacenter nodes cannot be reached over the network. For example, you may place REJECT rules into all of your local iptables firewalls at the backup datacenter. Alternatively / additionally, you may block the routes at the appropriate central router(s) in your network.
2. Run the sequence marsadm disconnect all; marsadm primary --force all on all nodes in the backup datacenter.
3. Restart your services in the backup datacenter (as far as necessary). Depending on your network setup, further steps like switching BGP routes etc. may be necessary.
4. Check that all your services are really up and running before you try to repair anything! Failing to do so may result in data loss when you execute the following restore method for experts.

Now your backup datacenter should continue servicing your clients.
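A minimal sketch of the fencing and switchover steps on one node of the backup datacenter follows. The address range 192.0.2.0/24, standing for the replication interfaces of the damaged datacenter, is a placeholder; adapt it (and the firewalling method) to your own network before use:

backup1:~# iptables -I INPUT  -s 192.0.2.0/24 -j REJECT    # fencing: refuse any traffic from the damaged site
backup1:~# iptables -I OUTPUT -d 192.0.2.0/24 -j REJECT
backup1:~# marsadm disconnect all
backup1:~# marsadm primary --force all
backup1:~# marsadm view all                                # check that all resources are now primary here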
The final reconstruction of the originally primary datacenter works as follows:

1. At the damaged primary datacenter, ensure that the MARS kernel module is not running anywhere. In case of a power blackout, you shouldn't have executed an automatic modprobe mars anywhere during reboot, so you should already be done when all your nodes are up again. In case some nodes did not reboot, execute rmmod mars everywhere. If rmmod refuses to run, you may need to umount the /dev/mars/mydata device first. When nothing else helps, you may just mass-reboot your hanging nodes.
2. At the failed side, do rm -rf /mars/resource-$mydata/ for all those resources which had been primary before the blackout. Do this only for those cases, otherwise you will need unnecessary leave-resources or invalidates later (e.g. when half of your nodes were already running at the surviving side). In order to avoid unnecessary traffic, please do this only as far as really necessary. Don't remove any other directories. In particular, /mars/ips/ must remain intact. In case you accidentally deleted them, or you had to re-create /mars/ from scratch, try rsync with the correct options. Caution! Before doing this, check that the corresponding directory exists at the backup datacenter, and that it is really healthy!
3. Un-fencing: restore your network firewall / routes and check that they work (ping etc).
4. Do modprobe mars everywhere. All missing directories and their missing symlinks should be automatically fetched from the backup datacenter.
5. Run marsadm join-resource $res, but only at those places where the directory was removed previously, while using the same disk devices as before. This will minimize actual traffic thanks to the fast full sync algorithm.

It is crucial that the fencing step is executed before any primary --force! This way, no split brain will be visible at the backup datacenter side, because there is simply no chance for transferring different versions over the network. It is also crucial to remove any (potentially diverging) resource directories before the modprobe! This way, the backup datacenter never runs into a split brain. This saves you a lot of detail work for split brain resolution when you have to restore bulks of nodes in a short time.

In case the repair of a full datacenter should take so extremely long that some /mars/ partitions are about to run out of space at the surviving side, you may use the leave-resource --host=failed-node trick described earlier, followed by log-delete-all. Best if you have prepared a fully automatic script long before the incident, which executes such steps only as far as necessary in each individual case. Even better: train such scenarios in advance, and prepare scripts for mass automation. Look into section 5.3.

G. Mathematical Model of Architectural Reliability

The assumptions used in the model are explained in detail in section 1.8.1.2 on page 26. Here is a quick recap of the main parameters:

• n is the number of basic storage units. It is also used for the number of application units, assumed to be the same.
• k is the replication degree, or number of replicas. In general, you will have to deploy N = k * n storage servers for getting n basic storage units. This applies to any of the competing architectures.
• s is the architecture-dependent spread exponent: it tells whether a storage incident will spread to the application units. Examples: s = 0 means that there is no spread between storage unit failures and application unit failures, other than a local 1:1 one. s = 1 means that an uncompensated storage node incident will cause n application incidents.
• p is the probability of a storage server incident. In the examples in section 1.8 on page 25, a fixed p = 0.0001 was used for easy understanding, but the following formulae also hold for any other p ∈ (0, 1).
• T is the observational period, introduced for convenience of understanding. The following can also be computed independently from any T, as long as the probability p does not change over time, which is assumed. Because T is only here for convenience of understanding, we set it to T = 1/p. In the examples from section 1.8.1.2 on page 26, a fixed T = 10,000 hours was used.

G.1. Formula for DRBD / MARS

We need not discriminate between a storage failure probability S and an application failure probability A, because applications are run locally at the storage servers 1:1. The probability for failure of a single shard consisting of k nodes is

A_p(k) = p^k

because all k shard members have to be down at the same time. In section 1.8.1.2 on page 26 we assumed that there is no cross-communication between shards. Therefore they are completely independent from each other, and the total downtime of n shards during the observational period T is

A_{p,T}(k, n) = T * n * p^k

When introducing the spread exponent s, the formula turns into

A_{s,p,T}(k, n) = T * n^(s+1) * p^k
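As a purely illustrative calculation (not one of the examples from section 1.8): with p = 0.0001, T = 1/p = 10,000 hours, n = 100 shards and s = 0, the expected accumulated application downtime is A_{s,p,T}(k, n) = 10,000 * 100 * 0.0001^k, i.e. about 0.01 hours (roughly 36 seconds) over the whole period for k = 2 replicas, and only about 10^-6 hours for k = 3. Each additional replica reduces the expected downtime by another factor of p.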
G.2. Formula for Unweighted BigCluster

This is based on the Bernoulli formula. The probability that exactly k̄ storage nodes out of N = k * n total storage nodes are down is

S̄_p(k̄, N) = binom(N, k̄) * p^k̄ * (1 - p)^(N - k̄)

where binom(N, k̄) denotes the binomial coefficient. Similarly, the probability for getting k or more storage node failures (up to N) at the same time is

S_p(k, N) = sum_{k̄=k}^{N} S̄_p(k̄, N) = sum_{k̄=k}^{N} binom(N, k̄) * p^k̄ * (1 - p)^(N - k̄)

By replacing N with k * n (for conversion of the x axis into basic storage units) and by introducing T we get

S_{p,T}(k, n) = T * sum_{k̄=k}^{k*n} binom(k*n, k̄) * p^k̄ * (1 - p)^(k*n - k̄)

For comparability with DRBD / MARS, we have to compute the application downtime A instead of the storage downtime S, which depends on the spread exponent s as follows:

A_{s,p,T}(k, n) = n^(s+1) * S_{p,T}(k, n) = n^(s+1) * T * sum_{k̄=k}^{k*n} binom(k*n, k̄) * p^k̄ * (1 - p)^(k*n - k̄)

Notice that at s = 0 we have introduced a factor of n, which corresponds to the hashing effect (teardown of n application instances by a single uncompensated storage incident) as described in section 1.8.1.2 on page 26.

G.3. Formula for SizeWeighted BigCluster

In contrast to the above, we need to introduce a correction factor given by the fraction of affected objects, relative to basic storage units. Otherwise the y axis would not stay comparable due to different units. For the special case of k = 1, there is no difference to the above. For the special case of k = 2 replicas, the correction factor is 1/(N - 1), because we assume that all the replicas of the affected first node are uniformly spread over all other nodes, which are N - 1. The probability for hitting the intersection of the first node with the second node is thus 1/(N - 1). For higher values of k, and with a similar argument (never put another replica of the same object onto the same storage node), we get the correction factor as

C(k, N) = prod_{l=1}^{k-1} 1/(N - l)

Hint: there are at most k physical replicas on the disks. For higher values of k̄ >= k, there are binom(k̄, k) combinations of object intersections (when assuming that the number of objects on a node is very large, such that no further object repetition can occur except for the k-fold replica placement). Thus the generalization to k̄ >= k is

C(k, k̄, N) = binom(k̄, k) * prod_{l=1}^{k-1} 1/(N - l)

By inserting this into the above formula, we get

A_{s,p,T}(k, n) = n^(s+1) * T * sum_{k̄=k}^{k*n} C(k, k̄, k*n) * binom(k*n, k̄) * p^k̄ * (1 - p)^(k*n - k̄)

H. Command Documentation for Userspace Tools

H.1. marsadm --help

Thorough documentation is in mars-manual.pdf. Please use the PDF manual as the authoritative reference! Here is only a short summary of the most important sub-commands / options: marsadm [ ] [ | all | ] marsadm [ ] view[- ] [ | all ] = --force Skip safety checks. Use this only when you really know what you are doing! Warning! This is dangerous! First try --dry-run. Not combinable with ’all’. --ignore-sync Allow primary handover even when some sync is running somewhere. This is less rude than --force because it checks all other preconditions. --dry-run Don’t modify the symlink tree, but tell what would be done. Use this before starting potentially harmful actions such as ’delete-resource’. --verbose Increase speakiness of some commands. --logger=/path/to/usr/bin/logger Use an alternative syslog messenger. When empty, disable syslogging. --max-deletions= When your network or your firewall rules are defective over a longer time, too many deletion links may accumulate at /mars/todo-global/delete-* and sibling locations. This limit prevents overflow of the filesystem as well as overloading of the worker threads.
--thresh-logfiles= --thresh-logsize= Prevention of too many small logfiles when secondaries are not catching up. When more than thresh-logfiles are already present, the next one is only created when the last one has at least size thresh-logsize (in units of GB). --timeout= Abort safety checks after timeout with an error. When giving ’all’ as resource agument, this works for each resource independently. --window= Treat other cluster nodes as healthy when some communcation has occured during the given time window. --threshold= 156 H.1. marsadm --help Some macros like ’fetch-threshold-reached’ use this for determining their sloppyness. --host= Act as if the command was running on cluster node . Warning! This is dangerous! First try --dry-run --backup-dir= Only for experts. Used by several special commands like merge-cluster, split-cluster etc for creating backups of important data. --ip= Override the IP address stored in the symlink tree, as well as the default IP determined from the list of network interfaces. Usually you will need this only at ’create-cluster’ or ’join-cluster’ for resolving ambiguities. --ssh-port= Override the default ssh port (22) for ssh and rsync. Useful for running {join,merge}-cluster on non-standard ssh ports. --ssh-opts=" " Override the default ssh commandline options. Also used for rsync. --macro= Handy for testing short macro evaluations at the command line. = attach usage: attach Attaches the local disk (backing block device) to the resource. The disk must have been previously configured at {create,join}-resource. When designated as a primary, /dev/mars/$res will also appear. This does not change the state of {fetch,replay}. For a complete local startup of the resource, use ’marsadm up’. cat usage: cat Print internal debug output in human readable form. Numerical timestamps and numerical error codes are replaced by more readable means. Example: marsadm cat /mars/5.total.status connect usage: connect See resume-fetch-local. connect-global usage: connect-global Like resume-fetch-local, but affects all resource members in the cluster (remotely). connect-local usage: connect-local See resume-fetch-local. create-cluster usage: create-cluster (no parameters) This must be called exactly once when creating a new cluster. Don’t call this again! Use join-cluster on the secondary nodes. Please read the PDF manual for details. 157 H. Command Documentation for Userspace Tools create-resource usage: create-resource (further syntax variants are described in the PDF manual). Create a new resource out of a pre-existing disk (backing block device) /dev/lv/mydata (or similar). The current node will start in primary role, thus /dev/mars/ will appear after a short time, initially showing the same contents as the underlying disk /dev/lv/mydata. It is good practice to name the resource and the disk name identical. cron usage: cron (no parameters) Do all necessary regular housekeeping tasks. This is equivalent to log-rotate all; sleep 5; log-delete-all all. delete-resource usage: delete-resource CAUTION! This is dangerous when the network is somehow interrupted, or when damaged nodes are later re-surrected in any way. Precondition: the resource must no longer have any members (see leave-resource). This is only needed when you _insist_ on re-using a damaged resource for re-creating a new one with exactly the same old . HINT: best practice is to not use this, but just create a _new_ resource with a new out of your local disks. Please read the PDF manual on potential consequences. 
detach usage: detach Detaches the local disk (backing block device) from the MARS resource. Caution! you may read data from the local disk afterwards, but ensure that no data is written to it! Otherwise, you are likely to produce harmful inconsistencies. When running in primary role, /dev/mars/$res will also disappear. This does not change the state of {fetch,replay}. For a complete local shutdown of the resource, use ’marsadm down’. disconnect usage: disconnect See pause-fetch-local. disconnect-global usage: disconnect-global Like pause-fetch-local, but affects all resource members in the cluster (remotely). disconnect-local usage: disconnect-local See pause-fetch-local. down usage: down Shortcut for detach + pause-sync + pause-fetch + pause-replay. 158 H.1. marsadm --help get-emergency-limit usage: get-emergency-limit Counterpart of set-emergency-limit (per-resource emergency limit) get-sync-limit-value usage: get-sync-limit-value (no parameters) For retrieval of the value set by set-sync-limit-value. get-systemd-unit usage: get-systemd-unit Show the system units (for start and stop), or empty when unset. invalidate usage: invalidate Only useful on a secondary node. Forces MARS to consider the local replica disk as being inconsistent, and therefore starting a fast full-sync from the currently designated primary node (which must exist; therefore avoid the ’secondary’ command). This is usually needed for resolving emergency mode. When having k=2 replicas, this can be also used for quick-and-simple split-brain resolution. In other cases, or when the split-brain is not resolved by this command, please use the ’leave-resource’ / ’join-resource’ method as described in the PDF manual (in the right order as described there). join-cluster usage: join-cluster Establishes a new cluster membership. This must be called once on any new cluster member. This is a prerequisite for join-resource. join-resource usage: join-resource (further syntax variants are described in the PDF manual). The resource must have been already created on another cluster node, and the network must be healthy. The contents of the local replica disk /dev/lv/mydata will be overwritten by the initial fast full sync from the currently designated primary node. After the initial full sync has finished, the current host will act in secondary role. For details on size constraints etc, refer to the PDF manual. leave-cluster usage: leave-cluster (no parameters) This can be used for final deconstruction of a cluster member. Prior to this, all resources must have been left via leave-resource. Notice: this will never destroy the cluster UID on the /mars/ filesystem. Please read the PDF manual for details. leave-resource usage: leave-resource Precondition: the local host must be in secondary role. 159 H. Command Documentation for Userspace Tools Stop being a member of the resource, and thus stop all replication activities. The status of the underlying disk will remain in its current state (whatever it is). log-delete usage: log-delete When possible, globally delete all old transaction logfiles which are known to be superflous, i.e. all secondaries no longer need to replay them. This must be regularly called by a cron job or similar, in order to prevent overflow of the /mars/ directory. For regular maintainance cron jobs, please prefer ’marsadm cron’. For details and best practices, please refer to the PDF manual. 
log-delete-all usage: log-delete-all Alias for log-delete log-delete-one usage: log-delete-one When possible, globally delete at most one old transaction logfile which is known to be superfluous, i.e. all secondaries no longer need to replay it. Hint: use this only for testing and manual inspection. For regular maintenance cron jobs, please prefer cron or log-delete-all. log-purge-all usage: log-purge-all This is potentially dangerous. Use this only if you are really desperate in trying to resolve a split brain. Use this only after reading the PDF manual! log-rotate usage: log-rotate Only useful at the primary side. Start writing transaction logs into a new transaction logfile. This should be regularly called by a cron job or similar. For regular maintenance cron jobs, please prefer ’marsadm cron’. For details and best practices, please refer to the PDF manual. lowlevel-delete-host usage: lowlevel-delete-host Delete cluster member. lowlevel-ls-host-ips usage: lowlevel-ls-host-ips List cluster member names and IP addresses. lowlevel-set-host-ip usage: lowlevel-set-host-ip Set IP for host. merge-cluster usage: merge-cluster Precondition: the resource names of both clusters must be disjoint. Create the union of two clusters, consisting of the union of all machines, and the union of all resources. The members of each resource are _not_ changed by this. This is useful for creating a big "virtual LVM cluster" where resources can be almost arbitrarily migrated between machines via later join-resource / leave-resource operations. merge-cluster-check usage: merge-cluster-check Check whether the resources of both clusters are disjoint. Useful for checking in advance whether merge-cluster would be possible. merge-cluster-list usage: merge-cluster-list Determine the local list of resources. Useful for checking or analysis of merge-cluster disjointness by hand. pause-fetch usage: pause-fetch See pause-fetch-local. pause-fetch-global usage: pause-fetch-global Like pause-fetch-local, but affects all resource members in the cluster (remotely). pause-fetch-local usage: pause-fetch-local Stop fetching transaction logfiles from the current designated primary. This is independent from any {pause,resume}-replay operations. Only useful on a secondary node. pause-replay usage: pause-replay See pause-replay-local. pause-replay-global usage: pause-replay-global Like pause-replay-local, but affects all resource members in the cluster (remotely). pause-replay-local usage: pause-replay-local Stop replaying transaction logfiles for now. This is independent from any {pause,resume}-fetch operations. This may be used for freezing the state of your replica for some time, if you have enough space on /mars/. Only useful on a secondary node. pause-sync usage: pause-sync See pause-sync-local. pause-sync-global usage: pause-sync-global Like pause-sync-local, but affects all resource members in the cluster (remotely). pause-sync-local usage: pause-sync-local Pause the initial data sync at the current stage. This has an effect only if a sync is actually running (i.e. there is something to be actually synced). Don’t pause too long, because the local replica will remain inconsistent during the pause. Use this only for limited reduction of system load. Only useful on a secondary node. primary usage: primary Promote the resource into primary role. This is necessary for /dev/mars/$res to appear on the local host.
Notice: by concept there can be only _one_ designated primary in a cluster at the same time. The role change is automatically distributed to the other nodes in the cluster, provided that the network is healthy. The old primary node will _automatically_ go into secondary role first. This is different from DRBD! With MARS, you don’t need an intermediate ’secondary’ command for switching roles. It is usually better to _directly_ switch the primary roles between both hosts. When --force is not given, a planned handover is started: the local host will only become actually primary _after_ the old primary is gone, and all old transaction logs have been fetched and replayed at the new designated priamry. When --force is given, no handover is attempted. A a consequence, a split brain situation is likely to emerge. Thus, use --force only after an ordinary handover attempt has failed, and when you don’t care about the split brain. For more details, please refer to the PDF manual. resize usage: resize Prerequisite: all underlying disks (usually /dev/vg/$res) must have been already increased, e.g. at the LVM layer (cf. lvresize). Causes MARS to re-examine all sizing constraints on all members of the resource, and increase the global logical size of the resource accordingly. Shrinking is currently not yet implemented. When successful, /dev/mars/$res at the primary will be increased in size. In addition, all secondaries will start an incremental fast full-sync to get the enlarged parts from the primary. resume-fetch usage: resume-fetch See resume-fetch-local. resume-fetch-global usage: resume-fetch-global Like resume-fetch-local, but affects all resource members in the cluster (remotely). resume-fetch-local usage: resume-fetch-local Start fetching transaction logfiles from the current 162 H.1. marsadm --help designated primary node, if there is one. This is independent from any {pause,resume}-replay operations. Only useful on a secondary node. resume-replay usage: resume-replay See resume-replay-local. resume-replay-global usage: resume-replay-global Like resume-replay-local, but affects all resource members in the cluster (remotely). resume-replay-local usage: resume-replay-local Restart replaying transaction logfiles, when there is some data left. This is independent from any {pause,resume}-fetch operations. This should be used for unfreezing the state of your local replica. Only useful on a secondary node. resume-sync usage: resume-sync See resume-sync-local. resume-sync-global usage: resume-sync-global Like resume-sync-local, but affects all resource members in the cluster (remotely). resume-sync-local usage: resume-sync-local Resume any initial / incremental data sync at the stage where it had been interrupted by pause-sync. Only useful on a secondary node. secondary usage: secondary Promote all cluster members into secondary role, globally. In contrast to DRBD, this is not needed as an intermediate step for planned handover between an old and a new primary node. The only reasonable usage is before the last leave-resource of the last cluster member, immediately before leave-cluster is executed for final deconstruction of the cluster. In all other cases, please prefer ’primary’ for direct handover between cluster nodes. Notice: ’secondary’ sets the global designated primary node to ’(none)’ which in turn prevents the execution of ’invalidate’ or ’join-resource’ or ’resize’ anywhere in the cluster. Therefore, don’t unnecessarily give ’secondary’! 
set-emergency-limit usage: set-emergency-limit Set a per-resource emergency limit for disk space in /mars. See PDF manual for details. set-sync-limit-value usage: set-sync-limit-value 163 H. Command Documentation for Userspace Tools Set the maximum number of resources which should by syncing concurrently. set-systemd-unit usage: set-systemd-unit [ ] This activates the systemd template engine of marsadm. Please read mars-manual.pdf on this. When is omitted, it will be treated equal to . split-cluster usage: split-cluster (no parameters) NOT OFFICIALLY SUPPORTED - ONLY FOR EXPERTS. RTFS = Read The Fucking Sourcecode. Use this only if you know what you are doing. up usage: up Shortcut for attach + resume-sync + resume-fetch + resume-replay. wait-cluster usage: wait-resource [ ] Waits until a ping-pong communication has succeeded in the whole cluster (or only the members of ). NOTICE: this is extremely useful for avoiding races when scripting in a cluster. wait-connect usage: wait-connect [ ] See wait-cluster. wait-resource usage: wait-resource [[attach|fetch|replay|sync][-on|-off]] Wait until the given condition is met on the resource, locally. wait-umount usage: wait-umount Wait until /dev/mars/ has disappeared in the cluster (even remotely). Useful on both primary and secondary nodes. = name of resource or "all" for all resources = | = 1and1 comminfo commstate cstate default default-global diskstate diskstate-1and1 dstate fetch-line 164 H.1. marsadm --help fetch-line-1and1 flags flags-1and1 outdated-flags outdated-flags-1and1 primarynode primarynode-1and1 replay-line replay-line-1and1 replinfo replinfo-1and1 replstate replstate-1and1 resource-errors resource-errors-1and1 role role-1and1 state status sync-line sync-line-1and1 syncinfo syncinfo-1and1 todo-role = deletable-size device-opened errno-text Convert errno numbers (positive or negative) into human readable text. get-log-status get-resource-{fat,err,wrn}{,-count} get-{disk,device} is-{alive} is-{split-brain,consistent,emergency,orphan} occupied-size present-{disk,device} (deprecated, use *-present instead) replay-basenr replay-code When negative, this indidates that a replay/recovery error has occurred. rest-space summary-vector systemd-unit tree uuid wait-{is,todo}-{attach,sync,fetch,replay,primary}-{on,off} {alive,fetch,replay,work}-{timestamp,age,lag} {all,the}-{pretty-,}{global-,}{{err,wrn,inf}-,}msg {cluster,resource}-members {disk,device}-present {disk,resource,device}-size {fetch,replay,work}-{lognr,logcount} {get,actual}-primary {is,todo}-{attach,sync,fetch,replay,primary} {my,all}-resources {sync,fetch,replay,work,syncpos}-{size,pos} 165 H. Command Documentation for Userspace Tools {sync,fetch,replay,work}-{rest,{almost-,threshold-,}reached,percent,permille,vector} {sync,fetch,replay}-{rate,remain} {time,real-time} H.2. football.sh --help Usage: ./football.sh --help [--verbose] Show help ./football.sh --variable= Override any shell variable Actions for resource migration: ./football.sh migrate [ ] Run the sequence migrate_prepare ; migrate_wait ; migrate_finish; migrate_cleanup. Dto for testing (do not rely on it): ./football.sh migrate_prepare [ ] Allocate LVM space at the targets and start MARS replication. ./football.sh migrate_wait [ ] Wait until MARS replication reports UpToDate. ./football.sh migrate_finish [ ] Call hooks for handover to the targets. ./football.sh migrate_cleanup Remove old / currently unused LV replicas from MARS and deallocate from LVM. 
Actions for inplace FS shrinking: ./football.sh shrink Run the sequence shrink_prepare ; shrink_finish ; shrink_cleanup. Dto for testing (do not rely on it): ./football.sh shrink_prepare [ ] Allocate temporary LVM space (when possible) and create initial raw FS copy. Default percent value(when left out) is 85. ./football.sh shrink_finish Incrementally update the FS copy, swap old <=> new copy with small downtime. ./football.sh shrink_cleanup Remove old FS copy from LVM. Actions for inplace FS extension: ./football.sh expand ./football.sh extend Increase mounted filesystem size during operations. 166 H.2. football.sh --help Combined actions: ./football.sh migrate+shrink [ ] [ ] Similar to migrate ; shrink but produces less network traffic. Default percent value (when left out) is 85. ./football.sh migrate+shrink+back [ ] Migrate temporarily to , then shrink there, finally migrate back to old primary and secondaries. Default percent value (when left out) is 85. Actions for (manual) repair in emergency situations: ./football.sh manual_handover This is useful in place of going to the machines and starting handover on their command line. You dont need to log in. All hooks (e.g. for downtime / reporting / etc) are automatically called. Notice: it will only work when there is already a replica at , and when further constraints such as clustermanager constraints will allow it. For a full Football game between different clusters, use "migrate" instead. ./football.sh manual_migrate_config [ ] Transfer only the cluster config, without changing the MARS replicas. This does no resource stopping / restarting. Useful for reverting a failed migration. ./football.sh manual_config_update Only update the cluster config, without changing anything else. Useful for manual repair of failed migration. ./football.sh manual_merge_cluster Run "marsadm merge-cluster" for the given hosts. Hostnames must be from different (former) clusters. ./football.sh manual_split_cluster Run "marsadm split-cluster" at the given hosts. Useful for fixing failed / asymmetric splits. Hint: provide _all_ hostnames which have formerly participated in the cluster. ./football.sh repair_vm Try to restart the VM on one of the given machines. Useful during unexpected customer downtime. ./football.sh repair_mars Before restarting the VM like in repair_vm, try to find a local LV where a stand-alone MARS resource can be found and built up. Use this only when the MARS resources are gone, and when you are desperate. Problem: this will likely create a MARS setup which is not usable for production, and therefore must be corrected later by hand. Use this only during an emergency situation in order to get the customers online again, while buying the downsides of this command. ./football.sh manual_lock 167 H. Command Documentation for Userspace Tools ./football.sh manual_unlock Manually lock or unlock an item at all of the given hosts, in an atomic fashion. In most cases, use "ALL" for the item. Only for testing / development (no stable interfaces): ./football.sh manual_call_hook Global maintenance: ./football.sh lv_cleanup General features: - Instead of , an absolute amount of storage with suffix ’k’ or ’m’ or ’g’ can be given. - When is currently stopped, login to the container is not possible, and in turn the hypervisor node and primary storage node cannot be automatically determined. 
In such a case, the missing nodes can be specified via the syntax : : - The following LV suffixes are used (naming convention): -tmp = currently emerging version for shrinking -preshrink = old version before shrinking took place - By adding the option --screener, you can handover football execution to ./screener.sh . When some --enable_*_waiting is also added, then the critical sections involving customer downtime are temporarily halted until some sysadmins says "screener.sh continue $resource" or attaches to the sessions and presses the RETURN key. PLUGIN football-1and1config 1&1 specfic plugin for dealing with the cm3 clusters and its concrete configuration. PLUGIN football-cm3 1&1 specfic plugin for dealing with the cm3 cluster manager and its concrete operating enviroment (singleton instance). Current maximum cluster size limit: Maximum #syncs running before migration can start: Following marsadm --version must be installed: Following mars kernel modules must be loaded: Specific actions for plugin football-cm3: ./football.sh clustertool {GET|PUT} Call through to the clustertool via REST. 168 H.2. football.sh --help Useful for manual inspection and repair. Specific features with plugin football-cm3: - Parameter syntax "cluster123" instead of "icpu456 icpu457" This is an alternate specification syntax, which is automatically replaced with the real machine names. It tries to minimize datacenter cross-traffic by taking the new $target_primary at the same datacenter location where the container is currenty running. PLUGIN football-ticket Generic plugin for creating and updating tickets, e.g. Jira tickets. You will need to hook in some external scripts which are then creating / updating the tickets. Comment texts may be provided with following conventions: comment.$ticket_state.txt comment.$ticket_phase.$ticket_state.txt Directories where comments may reside: football_creds=/usr/lib/mars/creds /etc/mars/creds /home/schoebel/mars/football-master.gi football_confs=/usr/lib/mars/confs /etc/mars/confs /home/schoebel/mars/football-master.gi football_includes=/usr/lib/mars/plugins /etc/mars/plugins /home/schoebel/mars/football-ma PLUGIN football-basic Generic driver for systemd-controlled MARS pools. The current version supports only a flat model: (1) There is a single "big cluster" at metadata level. All cluster members are joined via merge-cluster. All occurring names need to be globally unique. (2) The network uses BGP or other means, thus any hypervisor can (potentially) start any VM at any time. (3) iSCSI or remote devices are not supported for now (LocalSharding model). This may be extended in a future release. This plugin is exclusive-or with cm3. Plugin specific actions: ./football.sh basic_add_host Manually add another host to the hostname cache. PLUGIN football-downtime Generic plugin for communication of customer downtime. PLUGIN football-motd 169 H. Command Documentation for Userspace Tools Generic plugin for motd. Communicate that Football is running at login via motd. PLUGIN football-report Generic plugin for communication of reports. PLUGIN football-waiting Generic plugig, interfacing with screener: when this is used by your script and enabled, then you will be able to wait for "screener.sh continue" operations at certain points in your script. H.3. 
football.sh --help --verbose verbose=1 Usage: ./football.sh --help [--verbose] Show help ./football.sh --variable= Override any shell variable Actions for resource migration: ./football.sh migrate [ ] Run the sequence migrate_prepare ; migrate_wait ; migrate_finish; migrate_cleanup. Dto for testing (do not rely on it): ./football.sh migrate_prepare [ ] Allocate LVM space at the targets and start MARS replication. ./football.sh migrate_wait [ ] Wait until MARS replication reports UpToDate. ./football.sh migrate_finish [ ] Call hooks for handover to the targets. ./football.sh migrate_cleanup Remove old / currently unused LV replicas from MARS and deallocate from LVM. Actions for inplace FS shrinking: ./football.sh shrink Run the sequence shrink_prepare ; shrink_finish ; shrink_cleanup. Dto for testing (do not rely on it): ./football.sh shrink_prepare [ ] Allocate temporary LVM space (when possible) and create initial 170 H.3. football.sh --help --verbose raw FS copy. Default percent value(when left out) is 85. ./football.sh shrink_finish Incrementally update the FS copy, swap old <=> new copy with small downtime. ./football.sh shrink_cleanup Remove old FS copy from LVM. Actions for inplace FS extension: ./football.sh expand ./football.sh extend Increase mounted filesystem size during operations. Combined actions: ./football.sh migrate+shrink [ ] [ ] Similar to migrate ; shrink but produces less network traffic. Default percent value (when left out) is 85. ./football.sh migrate+shrink+back [ ] Migrate temporarily to , then shrink there, finally migrate back to old primary and secondaries. Default percent value (when left out) is 85. Actions for (manual) repair in emergency situations: ./football.sh manual_handover This is useful in place of going to the machines and starting handover on their command line. You dont need to log in. All hooks (e.g. for downtime / reporting / etc) are automatically called. Notice: it will only work when there is already a replica at , and when further constraints such as clustermanager constraints will allow it. For a full Football game between different clusters, use "migrate" instead. ./football.sh manual_migrate_config [ ] Transfer only the cluster config, without changing the MARS replicas. This does no resource stopping / restarting. Useful for reverting a failed migration. ./football.sh manual_config_update Only update the cluster config, without changing anything else. Useful for manual repair of failed migration. ./football.sh manual_merge_cluster Run "marsadm merge-cluster" for the given hosts. Hostnames must be from different (former) clusters. ./football.sh manual_split_cluster Run "marsadm split-cluster" at the given hosts. Useful for fixing failed / asymmetric splits. Hint: provide _all_ hostnames which have formerly participated in the cluster. 171 H. Command Documentation for Userspace Tools ./football.sh repair_vm Try to restart the VM on one of the given machines. Useful during unexpected customer downtime. ./football.sh repair_mars Before restarting the VM like in repair_vm, try to find a local LV where a stand-alone MARS resource can be found and built up. Use this only when the MARS resources are gone, and when you are desperate. Problem: this will likely create a MARS setup which is not usable for production, and therefore must be corrected later by hand. Use this only during an emergency situation in order to get the customers online again, while buying the downsides of this command. 
./football.sh manual_lock ./football.sh manual_unlock Manually lock or unlock an item at all of the given hosts, in an atomic fashion. In most cases, use "ALL" for the item. Only for testing / development (no stable interfaces): ./football.sh manual_call_hook Global maintenance: ./football.sh lv_cleanup General features: - Instead of , an absolute amount of storage with suffix ’k’ or ’m’ or ’g’ can be given. - When is currently stopped, login to the container is not possible, and in turn the hypervisor node and primary storage node cannot be automatically determined. In such a case, the missing nodes can be specified via the syntax : : - The following LV suffixes are used (naming convention): -tmp = currently emerging version for shrinking -preshrink = old version before shrinking took place - By adding the option --screener, you can handover football execution to ./screener.sh . When some --enable_*_waiting is also added, then the critical sections involving customer downtime are temporarily halted until some sysadmins says "screener.sh continue $resource" or attaches to the sessions and presses the RETURN key. Configuration: You can place shell variable definitions for overriding any tunables into the following locations: football_includes=/usr/lib/mars/plugins /etc/mars/plugins /home/schoebel/mars/football-master.g football_confs=/usr/lib/mars/confs /etc/mars/confs /home/schoebel/mars/football-master.git/conf 172 H.3. football.sh --help --verbose football_creds=/usr/lib/mars/creds /etc/mars/creds /home/schoebel/mars/football-master.git Filenames should match the following patterns: football-*.preconf football-*.conf football-*.postconf football-*.reconf Here you may change paths and enable_* variables. Inteded for main parameters. For late overrides after sourcing modules. Modify runtime parameters during waits. ## football_includes # List of directories where football-*.sh and football-*.conf # files can be found. football_includes="${football_includes:-/usr/lib/mars/plugins /etc/mars/plugins $script_dir/ ## football_confs # Another list of directories where football-*.conf files can be found. # These are sourced in a second pass after $football_includes. # Thus you can change this during the first pass. football_confs="${football_confs:-/usr/lib/mars/confs /etc/mars/confs $script_dir/confs $HOM ## football_creds # List of directories where various credential files can be found. football_creds="${football_creds:-/usr/lib/mars/creds /etc/mars/creds $script_dir/creds $scr ## trap_signals # List of signal names which should be trapped. # Traps are importnatn for housekeeping, e.g. automatic # removal of locks. trap_signals="${trap_signals:-SIGINT}" ## dry_run # When set, actions are only simulated. dry_run=${dry_run:-0} ## verbose # increase speakiness. verbose=${verbose:-0} ## confirm # Only for debugging: manually started operations can be # manually checked and confirmed before actually starting opersions. confirm=${confirm:-1} ## force # Normally, shrinking and extending will only be started if there # is something to do. # Enable this for debugging and testing: the check is then skipped. force=${force:-0} ## debug_injection_point # RTFS don’t set this unless you are a developer knowing what you are doing. debug_injection_point="${debug_injection_point:-0}" ## football_logdir # Where the logfiles should be created. # HINT: after playing Football in masses for a whiile, your $logdir will # be easily populated with hundreds or thousands of logfiles. 173 H. 
Command Documentation for Userspace Tools # Set this to your convenience. football_logdir="${football_logdir:-${logdir:-$HOME/football-logs}}" ## football_backup_dir # In this directory, various backups are created. # Intended for manual repair. football_backup_dir="${football_backup_dir:-$football_logdir/backups}" ## screener # When enabled, delegate execution to the screener. # Very useful for running Football in masses. screener="${screener:-1}" ## min_space # When testing / debugging with extremely small LVs, it may happen # that mkfs refuses to create extemely small filesystems. # Use this to ensure a minimum size. min_space="${min_space:-20000000}" ## cache_repeat_lapse # When using the waiting capabilities of screener, and when waits # are lasting very long, your dentry cache may become cold. # Use this for repeated refreshes of the dentry cache after some time. cache_repeat_lapse="${cache_repeat_lapse:-120}" # Minutes ## remote_ping # Before using ssh, ping the target. # This is only useful in special cases. remote_ping="${remote_ping:-0}" ## ping_opts # Options for ping checks. ping_opts="${ping_opts:--W 1 -c 1}" ## ssh_opt # Useful for customization to your ssh environment. ssh_opt="${ssh_opt:--4 -A -o StrictHostKeyChecking=no -o ForwardX11=no -o KbdInteractiveAuthentic ## ssh_auth # Useful for extra -i options. ssh_auth="${ssh_auth:-}" ## rsync_opt # The rsync options in general. # IMPORTANT: some intermediate progress report is absolutely needed, # because otherwise a false-positive TIMEOUT may be assumed when # no output is generated for several hours. rsync_opt="${rsync_opt:- -aH --inplace --info=progress2,STATS}" ## rsync_opt_prepare # Additional rsync options for preparation and updating # of the temporary shrink mirror filesystem. rsync_opt_prepare="${rsync_opt_prepare:---exclude=’.filemon2’ --delete}" ## rsync_opt_hot # This is only used at the final rsync, immediately before going # online again. rsync_opt_hot="${rsync_opt_hot:---delete}" 174 H.3. football.sh --help --verbose ## rsync_nice # Typically, the preparation steps are run with background priority. rsync_nice="${rsync_nice:-nice -19}" ## rsync_repeat_prepare and rsync_repeat_hot # Tuning: increases the reliability of rsync and ensures that the dentry cache # remains hot. rsync_repeat_prepare="${rsync_repeat_prepare:-5}" rsync_repeat_hot="${rsync_repeat_hot:-3}" ## rsync_skip_lines # Number of rsync lines to skip in output (avoid overflow of logfiles). rsync_skip_lines="${rsync_skip_lines:-1000}" ## use_tar # Use incremental Gnu tar in place of rsync: # 0 = don’t use tar # 1 = only use for the first (full) data transfer, then use rsync # 2 = always use tar # Experience: tar has better performance on local data than rsync, but # it tends to produce false-positive failure return codes on online # filesystems which are altered during tar. # The combined mode 1 tries to find a good compromise between both # alternatives. use_tar="${use_tar:-1}" ## tar_exe # Use this for activation of patched tar versions, such as the # 1&1-internal patched spacetools-tar. tar_exe="${tar_exe:-/bin/tar}" ## tar_options_src and tar_options_dst # Here you may give different options for both sides of tar invocations # (source and destination), such as verbosity options etc. tar_options_src="${tar_options_src:-}" tar_options_dst="${tar_options_dst:-}" ## tar_is_fixed # Tell whether your tar version reports false-positive transfer errors, # or not. 
tar_is_fixed="${tar_is_fixed:-0}" ## tar_state_dir # This directory is used for keeping incremental tar state information. tar_state_dir="${tar_state_dir:-/var/tmp}" ## buffer_cmd # Speed up tar by intermediate buffering. buffer_cmd="${buffer_cmd:-buffer -m 16m -S 1024m || cat}" ## wait_timeout # Avoid infinite loops upon waiting. wait_timeout="${wait_timeout:-$(( 24 * 60 ))}" # Minutes ## lvremove_opt # Some LVM versions are requiring this for unattended batch operations. lvremove_opt="${lvremove_opt:--f}" 175 H. Command Documentation for Userspace Tools ## automatic recovery options: enable_failure_* enable_failure_restart_vm="${enable_failure_restart_vm:-1}" enable_failure_recreate_cluster="${enable_failure_recreate_cluster:-0}" enable_failure_rebuild_mars="${enable_failure_rebuild_mars:-1}" ## critical_status # This is the "magic" exit code indicating _criticality_ # of a failed command. critical_status="${critical_status:-199}" ## serious_status # This is the "magic" exit code indicating _seriosity_ # of a failed command. serious_status="${serious_status:-198}" ## interrupted_status # This is the "magic" exit code indicating a manual interruption # (e.g. keypress Ctl-c) interrupted_status="${interrupted_status:-190}" ## illegal_status # This is the "magic" exit code indicating an illegal command # (e.g. syntax error, illegal arguments, etc) illegal_status="${illegal_status:-191}" ## pre_hand or --pre-hand= # Set this to do an ordinary handover to a new start position # (in the source cluster) before doing anything else. # This may be used for handover to a different datacenter, # in order to minimize cross traffic between datacenters. pre_hand="${pre_hand:-}" ## post_hand or --post-hand= # Set this to do an ordinary handover to a final position # (in the target cluster) after everything has successfully finished. # This may be used to establish a uniform default running location. post_hand="${post_hand:-}" ## tmp_suffix # Only for experts. tmp_suffix="${tmp_suffix:--tmp}" ## shrink_suffix_old # Suffix for backup LVs. These are kept for wome time until # *_cleanup operations will remove them. shrink_suffix_old="${shrink_suffix_old:--preshrink}" ## start_regex # At which $operation the hook football_start # shoule be called start_regex="${start_regex:-^(migrate_prepare|migrate|migrate+|shrink_prepare|shrink)}" ## finished_regex # At which $operation the hook football_finished # shoule be called finished_regex="${finished_regex:-^(migrate_finish|migrate|migrate+|shrink_finish|shrink)}" 176 H.3. football.sh --help --verbose ## call_finished # Whether to call the hook football_failed at failures. call_finished="${call_finished:-1}" ## lock_break_timeout # When remote ssh commands are failing, remote locks may sustain forever. # Avoid deadlocks by breaking remote locks after this timeout has elapsed. # NOTICE: these type of locks are only intended for short-term locking. lock_break_timeout="${lock_break_timeout:-3600}" # seconds ## startup_when_locked # When == 0: # Don’t abort and don’t wait when a lock is detected at startup. # When == 1 and when enable_startup_waiting=1: # Wait until the lock is gone. # When == 2: # Abort start of script execution when a lock is detected. # Later, when a locks are set _during_ execution, they will # be obeyed when enable_*_waiting is set (instead), and will # lead to waits instead of aborts. startup_when_locked="${startup_when_locked:-1}" ## resource_pre_check # Useful for debugging of container problems. # Normally not needed. 
resource_pre_check="${resource_pre_check:-0}" ## enable_background_reporting # Progress reporting to screener. # Runs in the background, in parallel to forground processes # like rsync or tar. enable_background_reporting="${enable_background_reporting:-1}" ## condition_check_interval # How often conditions should be re-evaluated. condition_check_interval="${condition_check_interval:-180}" # Seconds ## lease_time # Any intents (e.g. for creation of new resources) are recorded. # This is needed for race avoidance, when multiple resources # are migrated in _parallel_ to the _same_ target. # This might lead to livelocks when there would be no lease time # after which the intents are regarded as "invalid". lease_time="${lease_time:-3600}" # seconds ## limit_syncs # Limit the number of actually running syncs by waiting # until less than this number of syncs are running at any # target host. limit_syncs="${limit_syncs:-4}" ## limit_shrinks # Limit the number of actually running shrinks by waiting # until less than this number of shrinks are running at any # target host. limit_shrinks="${limit_shrinks:-1}" ## count_shrinks_by_tmp_mount 177 H. Command Documentation for Userspace Tools # Only count the temporary mounts. # Otherwise, LVs are counted. The latter may yield false positives # because LVs may be created in advance (e.g. at another cluster member) count_shrinks_by_tmp_mount="${count_shrinks_by_tmp_mount:-1}" ## limit_mars_logfile # Dont handover when too much logfile data is missing at the # new primary site. limit_mars_logfile="${limit_mars_logfile:-1024}" # MiB ## shrink_min_ram_gb # When set, check that the target machines for shrinking # have enough RAM. # Rationale: even incremental rsync needs the Dentry cache of the # kernel. When there is not enough RAM, and when there are some millions # of inodes, the customer downtime may rise to some hours or even some days # instead of some minutes (only when the detnry+inode cache does not # fit into kernel memory <<<=== this is the cruscial point) shrink_min_ram_gb="${shrink_min_ram_gb:-0}" # GiB ## optimize_dentry_cache # Don’t umount the temporary shrink space unnecessarily. # Try to shutdown the VM / container without umounting. # Important for high speed. optimize_dentry_cache="${optimize_dentry_cache:-1}" ## mkfs_cmd # Tunable for creation of new filesystems. mkfs_cmd="${mkfs_cmd:-mkfs.xfs -s size=4096 -d agcount=1024}" ## mount_opts # Options for temporary mounts. # Not used for ordinary clustermanager operations. mount_opts="${mount_opts:--o rw,nosuid,noatime,attr2,inode64,usrquota}" ## reuse_mount # Assume that already existing temporary mounts are the correct ones. # This will speed up interrupted and repeated runs by factors. reuse_mount="${reuse_mount:-1}" ## reuse_lv # Assume that temporary LVs are reusable. reuse_lv="${reuse_lv:-1}" ## reuse_lv_check # When set, this command is executed for checking whether # the LV can be reused. reuse_lv_check="${reuse_lv_check:-xfs_db -c sb -c print -r}" ## do_quota # Transfer xfs quota information. # 0 = off # 1 = global xfs quota transfer # 2 = additionally local one do_quota="${do_quota:-2}" ## xfs_dump_dir # Temporary space for keeping xfs quota dumps. 178 H.3. football.sh --help --verbose xfs_dump_dir="${xfs_dump_dir:-$football_backup_dir/xfs-quota-$start_stamp}" ## xfs_quota_enable # Command for re-enabling the quota system after shrink. xfs_quota_enable="${xfs_quota_enable:-xfs_quota -x -c enable}" ## xfs_dump and xfs_restore # Commands for transfer of xfs quota information. 
## do_quota
# Transfer xfs quota information.
# 0 = off
# 1 = global xfs quota transfer
# 2 = additionally local one
do_quota="${do_quota:-2}"

## xfs_dump_dir
# Temporary space for keeping xfs quota dumps.
xfs_dump_dir="${xfs_dump_dir:-$football_backup_dir/xfs-quota-$start_stamp}"

## xfs_quota_enable
# Command for re-enabling the quota system after shrink.
xfs_quota_enable="${xfs_quota_enable:-xfs_quota -x -c enable}"

## xfs_dump and xfs_restore
# Commands for transfer of xfs quota information.
xfs_dump="${xfs_dump:-xfs_quota -x -c dump}"
xfs_restore="${xfs_restore:-xfs_quota -x -c restore}"

## shortcut_tar_percent
# Percentage when a shrink space should no longer be considered
# as "initial" (or empty).
shortcut_tar_percent="${shortcut_tar_percent:-5}"

## max_rsync_downtime
# When set, check the _expected_ duration of customer downtime.
# If it would take longer than this limit, abort without causing
# customer downtime.
# Afterwards, sysadmins need to decide what to do:
# for example, move the resource to faster hardware with more RAM, or similar.
max_rsync_downtime="${max_rsync_downtime:-0}" # seconds

## merge_shrink_secondaries
# This is only needed when targets are not yet pre-merged.
merge_shrink_secondaries="${merge_shrink_secondaries:-0}"

## fs_resize_cmd
# Command for online filesystem expansion.
fs_resize_cmd="${fs_resize_cmd:-xfs_growfs -d}"

## migrate_two_phase
# This is useful when the new hardware has a better replication network,
# e.g. a 10GBit uplink instead of 1GBit.
# Instead of starting two or more syncs in parallel on the old hardware,
# run the syncs in two phases:
#  1. migrate data to the new primary only.
#  1b. handover to the new primary.
#  2. now start migration of data to the new secondaries, over the better
#     network attachment of the new hardware.
migrate_two_phase="${migrate_two_phase:-0}"

## migrate_always_all
# By default, migrate+shrink creates only 1 replica during the initial
# migration.
# When setting this, all replicas are created, which improves resilience,
# but worsens network performance.
migrate_always_all="${migrate_always_all:-0}"

## migrate_early_cleanup
# Early cleanup of old replicas when using migrate_always_all or
# migrate_two_phase.
# Only reasonable when combined with migrate+shrink.
# This is slightly less safe, but saves time when you want to
# decommission old hardware as fast as possible.
# Early cleanup of the old replicas will only be done when
# at least 2 replicas are available at the new (target) side.
# These two new replicas can be created either by
#  a) migrate_always_all=1 or
#  b) migrate_two_phase=1 or automatically selected (or not) via
#  c) auto_two_phase=1
migrate_early_cleanup="${migrate_early_cleanup:-1}"

## user_name
# Normally automatically derived from the ssh agent or from $LOGNAME.
# Please override this only when really necessary.
export user_name="${user_name:-$(get_real_ssh_user)}"
export user_name="${user_name:-$LOGNAME}"

## replace_ssh_id_file
# When set, replace the current ssh user with this one.
# The new user should not have a passphrase.
# Useful for logging out the original user (interrupting the original
# ssh agent chain).
replace_ssh_id_file="${replace_ssh_id_file:-}"
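Tunables like the ones above are typically preset in *.conf or *.preconf files rather than edited in the script itself. The following snippet is purely illustrative: the file path (one of the football_confs directories listed further below) and the chosen values are assumptions, and it presumes that such config files are sourced before the defaults above are applied, so that pre-set values win over the ${var:-default} fallbacks:

  # Hypothetical site-local config file (path and values are examples).
  cat > /etc/mars/confs/local-site.conf <<'EOF'
  # prefer the two-phase migration scheme on this pool
  migrate_two_phase=1
  # keep all replicas during the initial migration for extra resilience
  migrate_always_all=1
  EOF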
PLUGIN football-1and1config

  1&1-specific plugin for dealing with the cm3 clusters
  and their concrete configuration.

## enable_1and1config
# ShaHoLin-specific plugin for working with the infong platform
# (istore, icpu, infong) via the 1&1-specific clustermanager cm3
# and related toolsets. Much of it is bound to a singleton database
# instance (clustermw & siblings).
enable_1and1config="${enable_1and1config:-$(if [[ "$0" =~ tetris ]]; then echo 1; else echo 0; fi)}"

## runstack_host
# To be provided in a *.conf or *.preconf file.
runstack_host="${runstack_host:-}"

## runstack_cmd
# Command to be provided in a *.conf file.
runstack_cmd="${runstack_cmd:-}"

## runstack_ping
# Only call runstack when the container is pingable.
runstack_ping="${runstack_ping:-1}"

## dastool_host
# To be provided in a *.conf or *.preconf file.
dastool_host="${dastool_host:-}"

## dastool_cmd
# Command to be provided in a *.conf file.
dastool_cmd="${dastool_cmd:-}"

## update_host
# To be provided in a *.conf or *.preconf file.
update_host="${update_host:-}"

## update_cmd
# Command to be provided in a *.conf file.
update_cmd="${update_cmd:-}"

PLUGIN football-cm3

  1&1-specific plugin for dealing with the cm3 cluster manager
  and its concrete operating environment (singleton instance).

  Current maximum cluster size limit:
  Maximum #syncs running before migration can start:
  Following marsadm --version must be installed:
  Following mars kernel modules must be loaded:

  Specific actions for plugin football-cm3:

    ./football.sh clustertool {GET|PUT}
      Call through to the clustertool via REST.
      Useful for manual inspection and repair.

  Specific features with plugin football-cm3:

    - Parameter syntax "cluster123" instead of "icpu456 icpu457".
      This is an alternate specification syntax, which is automatically
      replaced with the real machine names.
      It tries to minimize datacenter cross-traffic by taking the new
      $target_primary at the same datacenter location where the container
      is currently running.

## enable_cm3
# ShaHoLin-specific plugin for working with the infong platform
# (istore, icpu, infong) via the 1&1-specific clustermanager cm3
# and related toolsets. Much of it is bound to a singleton database
# instance (clustermw & siblings).
enable_cm3="${enable_cm3:-$(if [[ "$0" =~ tetris ]]; then echo 1; else echo 0; fi)}"

## skip_resource_ping
# Enable this only for testing. Normally, a resource name denotes a
# container name == machine name which must be running as a precondition,
# and thus must be pingable over the network.
skip_resource_ping="${skip_resource_ping:-0}"

## business_hours
# When set, critical sections are only entered during certain
# days of the week, and/or during certain hours.
# This is a regex matching against "date +%u_%H".
# Example regex: [1-5]_(0[8-9]|1[0-8])
# This means Monday to Friday from 8 to 18 o'clock.
business_hours="${business_hours:-}"
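The business_hours convention can be tested in isolation. The following standalone snippet (not part of the tooling) just replays the example regex from the help text against the current time:

  # Standalone illustration of the business_hours matching rule: the regex
  # is applied to "date +%u_%H" (day-of-week _ hour). The example regex is
  # the one from the help text: Monday to Friday, 08 to 18 o'clock.
  business_hours="[1-5]_(0[8-9]|1[0-8])"
  now="$(date +%u_%H)"
  if [[ "$now" =~ $business_hours ]]; then
      echo "$now: inside business hours, critical sections may be entered"
  else
      echo "$now: outside business hours, critical sections are postponed"
  fi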
## cm3_stop_safeguard_cmd
# Workaround for a bug.
# Sometimes a systemd unit does not go away.
cm3_stop_safeguard_cmd="${cm3_stop_safeguard_cmd:-{ sleep 2; try=0; while (( try++ < 10 ))

## check_ping_rounds
# Number of pings to try before a container is assumed to not respond.
check_ping_rounds="${check_ping_rounds:-5}"

## additional_runstack
# Do an additional runstack after startup of the new container.
# In turn, this will only do something when source and target are different.
additional_runstack="${additional_runstack:-1}"

## workaround_firewall
# Documentation of technical debt for later generations:
# This is needed since July 2017. In the many years before, no firewalling
# was effective at the replication network, because it is a physically
# separate network from the rest of the networking infrastructure.
# An attacker would first need to gain root access to the _hypervisor_
# (not only to the LXC container and/or to KVM) before gaining access to
# those physical replication network interfaces.
# Since about that time, which is about the same time when the requirements
# for Container Football had been communicated, somebody introduced some
# unnecessary firewall rules, based on "security arguments".
# These arguments were however explicitly _not_ required by the _real_
# security responsible person, and explicitly _not_ recommended by him.
# Now the problem is that it is almost politically impossible to get
# rid of such a "security feature".
# Until the problem is resolved, Container Football requires
# the _entire_ local firewall to be _temporarily_ shut down in order to
# allow marsadm commands over ssh to work.
# Notice: this is _not_ increasing the general security in any way.
# LONGTERM solution / TODO: future versions of mars should no longer
# depend on ssh.
# Then this "feature" can be turned off.
workaround_firewall="${workaround_firewall:-1}"

## ip_magic
# Similarly to workaround_firewall, this is needed since somebody
# introduced additional firewall rules also disabling sysadmin ssh
# connections at the _ordinary_ sysadmin network.
ip_magic="${ip_magic:-1}"

## do_split_cluster
# The current MARS branch 0.1a.y is not yet constructed for forming
# a BigCluster consisting of several thousands of machines.
# When a future version of mars0.1b.y (or 0.2.y) will allow this,
# this can be disabled.
# do_split_cluster >= 2 means that the resulting MARS clusters should
# not exceed this number of members, when possible.
do_split_cluster="${do_split_cluster:-2}"

## forbidden_hosts
# Regex for excluding hostnames from any Football actions.
# The script will fail when one of these is encountered.
forbidden_hosts="${forbidden_hosts:-}"

## forbidden_flavours
# Regex for excluding flavours from any Football actions.
# The script will fail when one of these is encountered.
forbidden_flavours="${forbidden_flavours:-}"

## forbidden_bz_ids
# PROVISIONAL regex for excluding certain bz_ids from any Football actions.
# NOTICE: bz_ids are deprecated and should not be used in future
# (technical debts).
# The script will fail when one of these is encountered.
forbidden_bz_ids="${forbidden_bz_ids:-}"

## auto_two_phase
# When this is set, override the global migrate_two_phase parameter
# at runtime by ShaHoLin-specific checks.
auto_two_phase="${auto_two_phase:-1}"

## clustertool_host
# URL prefix of the internal configuration database REST interface.
# Set this via *.preconf config files.
clustertool_host="${clustertool_host:-}"

## clustertool_user
# Username for clustertool access.
# By default, scans for a *.password file (see next option).
clustertool_user="${clustertool_user:-$(get_cred_file "*.password" | head -1 | sed 's:.*/::g

## clustertool_passwd_file
# Here you can supply the encrypted password.
# By default, a file $clustertool_user.password is used
# containing the encrypted password.
clustertool_passwd_file="${clustertool_passwd_file:-$(get_cred_file "$clustertool_user.passw

## clustertool_passwd
# Here you may override the password via config file.
# For security reasons, don't provide this at the command line.
clustertool_passwd="${clustertool_passwd:-$(< $clustertool_passwd_file)}" ||
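As a purely hypothetical illustration of the clustertool settings: the REST endpoint would be supplied via a *.preconf file, and the password via a *.password file placed in one of the credential directories scanned by get_cred_file (the directories are taken from the football_confs / football_creds defaults listed further below; hostname, username and file names are invented):

  # Hypothetical setup (hostname, user and paths are examples only).
  # 1) REST endpoint via a *.preconf file in a football_confs directory:
  cat > /etc/mars/confs/clustertool.preconf <<'EOF'
  clustertool_host="https://clustermw.internal.example:8080"
  EOF
  # 2) Encrypted password in a credentials directory; by default the
  #    username is derived from the *.password file name:
  echo "ENCRYPTED-PASSWORD-HERE" > /etc/mars/creds/football-robot.password
  chmod 600 /etc/mars/creds/football-robot.password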
## do_migrate
# Keep this enabled. Only disable for testing.
do_migrate="${do_migrate:-1}" # must be enabled; disable only for dry-run testing

## always_migrate
# Only use for testing, or for special situations.
# This skips the test whether the resource has already been migrated.
always_migrate="${always_migrate:-0}" # only enable for testing

## check_segments
# 0 = disabled
# 1 = only display the segment names
# 2 = check for equality
# WORKAROUND, potentially harmful when used inadequately.
# The historical physical segment borders need to be removed for
# Container Football.
# Unfortunately, the subproject aiming to accomplish this has not
# proceeded for one year now. In the meantime, Container Football can
# only be played within the ancient segment borders.
# After this big impediment is eventually resolved, this option
# should be switched off.
check_segments="${check_segments:-1}"

## enable_mod_deflate
# Internal, for support.
enable_mod_deflate="${enable_mod_deflate:-1}"

## enable_segment_move
# Seems to be needed by some other tooling.
enable_segment_move="${enable_segment_move:-1}"

## override_hwclass_id
# When necessary, override this from $include_dir/plugins/*.conf
override_hwclass_id="${override_hwclass_id:-}" # typically 25007

## override_hvt_id
# When necessary, override this from $include_dir/plugins/*.conf
override_hvt_id="${override_hvt_id:-}" # typically 8057 or 8059

## override_overrides
# When this is set and other override_* variables are not set,
# then try to _guess_ some values.
# No guarantees for correctness either.
override_overrides=${override_overrides:-1}

## iqn_base and iet_type and iscsi_eth and iscsi_tid
# Workaround: this is needed for _dynamic_ generation of iSCSI sessions,
# bypassing the ordinary ones as automatically generated by the
# cm3 cluster manager (only at the old istore architecture).
# Notice: not needed for regular operations, only for testing.
# Normally, you don't want to shrink over a _shared_ 1MBit iSCSI line.
iqn_base="${iqn_base:-iqn.2000-01.info.test:test}"
iet_type="${iet_type:-blockio}"
iscsi_eth="${iscsi_eth:-eth1}"
iscsi_tid="${iscsi_tid:-4711}"

## monitis_downtime_script
# ShaHoLin-internal
monitis_downtime_script="${monitis_downtime_script:-}"

## monitis_downtime_duration
# ShaHoLin-internal
monitis_downtime_duration="${monitis_downtime_duration:-60}" # Minutes

## orwell_downtime_script
# ShaHoLin-internal
orwell_downtime_script="${orwell_downtime_script:-}"

## orwell_tz
# Deal with differences in clock timezones.
orwell_tz="${orwell_tz:-Europe/Berlin}"

## orwell_downtime_duration
# ShaHoLin-internal
orwell_downtime_duration="${orwell_downtime_duration:-20}" # Minutes

## orwell_workaround_sleep
# Workaround for a race condition in Orwell.
# Try to ensure that another check has been executed before
# the downtime is removed.
# 0 = don't remove the downtime at all.
orwell_workaround_sleep="${orwell_workaround_sleep:-300}" # Seconds

## shaholin_customer_report_cmd
# Action script when the hardware has improved.
shaholin_customer_report_cmd="${shaholin_customer_report_cmd:-}"

## shaholin_src_cpus and shaholin_dst_cpus
shaholin_src_cpus="${shaholin_src_cpus:-4}"
shaholin_dst_cpus="${shaholin_dst_cpus:-32}"

## ip_renumber_cmd
# Cross-call with another independent project.
ip_renumber_cmd="${ip_renumber_cmd:-}"

## shaholin_finished_log
# ShaHoLin-specific logfile, reporting _only_ successful completion
# of an action.
shaholin_finished_log="${shaholin_finished_log:-$football_logdir/shaholin-finished.log}"

## update_cmd
# OPTIONAL: specific action script with parameters.
update_cmd="${update_cmd:-}"

## update_host
# To be provided in a *.conf or *.preconf file.
update_host="${update_host:-}"

## parse_ticket
# Regex for identifying tickets from script outputs or arguments.
parse_ticket="${parse_ticket:-TECCM-[0-9]+}"
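The parse_ticket regex is an ordinary extended regular expression. How a ticket ID would be picked out of arbitrary script output can be reproduced by hand; the sample text below is invented:

  # Standalone illustration: extract a ticket ID from script output with
  # the default parse_ticket regex.
  parse_ticket="TECCM-[0-9]+"
  echo "handover finished, see ticket TECCM-4711 for details" \
      | grep -Eo "$parse_ticket"
  # prints: TECCM-4711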
## prefer_parsed_ticket
# Workaround bugs from getting inconsistent ticket IDs from different sources.
prefer_parsed_ticket="${prefer_parsed_ticket:-0}"

## translate_db_state
# Whether to use the following mapping definitions.
translate_db_state="${translate_db_state:-0}"

## db_state_*
# Map logical names to the ones in the database.
db_state_init="${db_state_init:-}"
db_state_prepare="${db_state_prepare:-}"
db_state_finish="${db_state_finish:-}"
db_state_cleanup="${db_state_cleanup:-}"
db_state_done="${db_state_done:-}"

## use_type_for_ticket
# Internal ticketing convention.
use_type_for_ticket="${use_type_for_ticket:-1}"

## auto_handover
# Load balancing across locations.
# Works only together with the new syntax "cluster123".
# Depending on the number of syncs currently running, this
# will internally add --pre-hand and --post-hand options
# dynamically at runtime. This will spread much of the sync
# traffic to per-datacenter local behaviour.
# Notice: this may produce more total customer downtime when
# running at a high parallelism degree.
# Thus it tries to reduce unnecessary handovers to other locations.
auto_handover="${auto_handover:-1}"

## preferred_location
# When set, override any other pre-handover to this location.
# Useful for maintenance of a whole datacenter.
preferred_location="${preferred_location:-}"

PLUGIN football-ticket

  Generic plugin for creating and updating tickets, e.g. Jira tickets.

  You will need to hook in some external scripts which then create /
  update the tickets.

  Comment texts may be provided with the following conventions:

    comment.$ticket_state.txt
    comment.$ticket_phase.$ticket_state.txt

  Directories where comments may reside:

    football_creds=/usr/lib/mars/creds /etc/mars/creds /home/schoebel/mars/football-master.git/cre
    football_confs=/usr/lib/mars/confs /etc/mars/confs /home/schoebel/mars/football-master.git/con
    football_includes=/usr/lib/mars/plugins /etc/mars/plugins /home/schoebel/mars/football-master.

## enable_ticket
enable_ticket="${enable_ticket:-$(if [[ "$0" =~ tetris ]]; then echo 1; else echo 0; fi)}"

## ticket
# OPTIONAL: the meaning is installation specific.
# This can be used for identifying JIRA tickets.
# Can be set on the command line like "./tetris.sh $args --ticket=TECCM-4711"
ticket="${ticket:-}"

## ticket_get_cmd
# Optional: when set, this script can be used for retrieving ticket IDs
# in place of the command line option --ticket=
# Retrieval should be unique by resource names.
# You may use any defined bash variable by escaping it like
# $res .
# Example: ticket_get_cmd="my-ticket-getter-script.pl "$res""
ticket_get_cmd="${ticket_get_cmd:-}"

## ticket_create_cmd
# Optional: when set, this script can be used for creating new tickets.
# It will be called when $ticket_get_cmd does not retrieve anything.
# Example: ticket_create_cmd="my-ticket-create-script.pl "$res" "$target_primary""
# Afterwards, the new ticket needs to be retrievable via $ticket_get_cmd.
ticket_create_cmd="${ticket_create_cmd:-}"

## ticket_update_cmd
# This can be used for calling an external command which updates
# the ticket(s) given by the $ticket parameter.
# Example: ticket_update_cmd="my-script.pl "$ticket" "$res" "$ticket_phase" "$ticket_state""
ticket_update_cmd="${ticket_update_cmd:-}"

## ticket_require_comment
# Only update a ticket when a comment file exists in one of the
# directories $football_creds $football_confs $football_includes.
ticket_require_comment="${ticket_require_comment:-1}"
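As a concrete, purely hypothetical illustration of the comment convention above: a comment file for an invented phase "migrate" and state "done", placed into one of the football_includes directories listed above:

  # Hypothetical comment file (phase and state names are invented).
  # With ticket_require_comment=1, a ticket is only updated when such a
  # file exists in $football_creds, $football_confs or $football_includes.
  mkdir -p /etc/mars/plugins
  cat > /etc/mars/plugins/comment.migrate.done.txt <<'EOF'
  Football migration has completed for this resource.
  EOF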
## ticket_for_migrate
# Optional 1&1-specific: separate ticket for migrate.
# Useful when migrate+shrink need to post into separate tickets.
ticket_for_migrate="${ticket_for_migrate:-}"

## ticket_for_shrink
# Optional 1&1-specific: separate ticket for shrink.
# Useful when migrate+shrink need to post into separate tickets.
ticket_for_shrink="${ticket_for_shrink:-}"

## ticket_prefer_cached
# Workaround a bug in ticket ID retrieval:
# trust my own cached values more than the "inconsistent read".
ticket_prefer_cached="${ticket_prefer_cached:-1}"

## ticket_code
# List of operation:res:shard
ticket_code="${ticket_code:-}"

## get_ticket_code
get_ticket_code="${get_ticket_code:-}"

## max_start_ticket
# Maximum number of instances to start per call.
max_start_ticket="${max_start_ticket:-1}"

PLUGIN football-basic

  Generic driver for systemd-controlled MARS pools.

  The current version supports only a flat model:

  (1) There is a single "big cluster" at metadata level.
      All cluster members are joined via merge-cluster.
      All occurring names need to be globally unique.

  (2) The network uses BGP or other means, thus any hypervisor
      can (potentially) start any VM at any time.

  (3) iSCSI or remote devices are not supported for now
      (LocalSharding model). This may be extended in a future release.

  This plugin is exclusive-or with cm3.

  Plugin specific actions:

    ./football.sh basic_add_host
      Manually add another host to the hostname cache.

## pool_cache_dir
# Directory for caching the pool status.
pool_cache_dir="${pool_cache_dir:-$script_dir/pool-cache}"

## initial_hostname_file
# This file must contain a list of storage and/or hypervisor hostnames
# where a /mars directory must exist.
# These hosts are then scanned for further cluster members,
# and the transitive closure of all host names is computed.
initial_hostname_file="${initial_hostname_file:-./hostnames.input}"

## hostname_cache
# This file contains the transitive closure of all host names.
hostname_cache="${hostname_cache:-$pool_cache_dir/hostnames.cache}"

## resources_cache
# This file contains the transitive closure of all resource names.
resources_cache="${resources_cache:-$pool_cache_dir/resources.cache}"

## res2hyper_cache
# This file contains the association between resources and hypervisors.
res2hyper_cache="${res2hyper_cache:-$pool_cache_dir/res2hyper.assoc}"

## enable_basic
# This plugin is exclusive-or with cm3.
enable_basic="${enable_basic:-$(if [[ "$0" =~ football ]]; then echo 1; else echo 0; fi)}"

## ssh_port
# Set this for separating sysadmin access from customer access.
ssh_port="${ssh_port:-}"

## basic_mnt_dir
# Names the mountpoint directory at hypervisors.
# This must coincide with the systemd mountpoints.
basic_mnt_dir="${basic_mnt_dir:-/mnt}"
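A minimal bootstrap of the football-basic plugin could look as follows. This is a sketch only: the hostnames are placeholders, and the basic_add_host action is assumed to take a hostname argument (the argument placeholder was lost in this rendering of the help text):

  # Hypothetical bootstrap of the football-basic hostname cache.
  # List a few storage / hypervisor hosts which already have a /mars
  # directory; these are then scanned for further cluster members and
  # the transitive closure is cached.
  cat > ./hostnames.input <<'EOF'
  storage01
  storage02
  hyper01
  EOF
  # A single host can also be added to the hostname cache manually:
  ./football.sh basic_add_host hyper02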
PLUGIN football-downtime

  Generic plugin for communication of customer downtime.

## downtime_cmd_{set,unset}
# External command for setting / unsetting (or communicating) a downtime.
# Empty = don't do anything.
downtime_cmd_set="${downtime_cmd_set:-}"
downtime_cmd_unset="${downtime_cmd_unset:-}"

PLUGIN football-motd

  Generic plugin for motd. Communicate that Football is running
  at login via motd.

## enable_motd
# Whether to use the motd plugin.
enable_motd="${enable_motd:-0}"

## update_motd_cmd
# Distro-specific command for generating motd from several sources.
# Only tested for Debian Jessie at the moment.
update_motd_cmd="${update_motd_cmd:-update-motd}"

## download_motd_script and motd_script_dir
# When no script has been installed into /etc/update-motd.d/
# you can do it dynamically here, bypassing any "official" deployment
# methods. Use this only for testing!
# An example script (which should be deployed via your ordinary methods)
# can be found under $script_dir/update-motd.d/67-football-running
download_motd_script="${download_motd_script:-}"
motd_script_dir="${motd_script_dir:-/etc/update-motd.d}"

## motd_file
# This will contain the reported motd message.
# It is created by this plugin.
motd_file="${motd_file:-/var/motd/football.txt}"

## motd_color_on and motd_color_off
# ANSI escape sequences for coloring the generated motd message.
motd_color_on="${motd_color_on:-\\033[31m}"
motd_color_off="${motd_color_off:-\\033[0m}"

PLUGIN football-report

  Generic plugin for communication of reports.

## report_cmd_{start,warning,failed,finished}
# External command which is called at start / failure / finish
# of Football.
# The following variables can be used (e.g. as parameters) when
# escaped with a backslash:
#  $res = name of the resource (LV, container, etc)
#  $primary = the current (old) primary
#  $secondary_list = list of current (old) secondaries
#  $target_primary = the target primary name
#  $target_secondary = list of target secondaries
#  $operation = the operation name
#  $target_percent = the value used for shrinking
#  $txt = some informative text from Football
# Further variables are possible by looking at the source code, or by
# defining your own variables or functions externally or via plugins.
# Empty = don't do anything.
report_cmd_start="${report_cmd_start:-}"
report_cmd_warning="${report_cmd_warning:-$script_dir/screener.sh notify "$res" warning "$tx
report_cmd_failed="${report_cmd_failed:-}"
report_cmd_finished="${report_cmd_finished:-}"

PLUGIN football-waiting

  Generic plugin, interfacing with screener: when this is used by your
  script and enabled, then you will be able to wait for "screener.sh
  continue" operations at certain points in your script.

## enable_*_waiting
#
# When this is enabled, and when Football had been started by screener,
# then Football will delay the start of several operations until a sysadmin
# does one of the following manually:
#
#  a) ./screener.sh continue $session
#  b) ./screener.sh resume $session
#  c) ./screener.sh attach $session and press the RETURN key
#  d) doing nothing, until $wait_timeout has been exceeded
#
# CONVENTION: Football resource names are used as screener session ids.
# This ensures that only 1 operation can be started for the same resource,
# and it simplifies the handling for junior sysadmins.
#
enable_startup_waiting="${enable_startup_waiting:-0}"
enable_handover_waiting="${enable_handover_waiting:-0}"
enable_migrate_waiting="${enable_migrate_waiting:-0}"
enable_shrink_waiting="${enable_shrink_waiting:-0}"
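A typical interactive sequence with the waiting plugin, assuming Football was started under screener with enable_migrate_waiting=1 and that the resource is called "lv-example" (a placeholder name):

  # List sessions that are currently waiting for manual continuation:
  ./screener.sh list-waiting
  # Let exactly this resource proceed (resource name == session id):
  ./screener.sh continue lv-example
  # Alternatively, attach to the session and press the RETURN key:
  ./screener.sh attach lv-example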
## enable_cleanup_delayed and wait_before_cleanup
# By setting this, you can delay the cleanup operations for some time.
# This way, you are keeping the old LV contents as a kind of "backup"
# for some limited time.
#
# HINT1: don't set wait_before_cleanup to very large values, because it can
# seriously slow down Football.
#
# HINT2: the waiting time starts when the last MARS replica was created.
# Only when the syncing times are _smaller_ than this value,
# an _additional_ delay will be produced.
enable_cleanup_delayed="${enable_cleanup_delayed:-0}"
wait_before_cleanup="${wait_before_cleanup:-180}" # Minutes

## reduce_wait_msg
# Instead of reporting the waiting status once per minute,
# decrease the frequency of reporting.
# Warning: don't increase this too much. Do not exceed
# session_timeout/2 from screener. Because of the Nyquist criterion,
# stay on the safe side by setting session_timeout to at least _twice_
# the value given here.
reduce_wait_msg="${reduce_wait_msg:-60}" # Minutes

H.4. screener.sh --help

./screener.sh: Run _unattended_ processes in screen sessions.
Useful for MASS automation, running hundreds of unattended commands in parallel.

HINT: for running more than ~500 sessions in parallel, you might need some
system tuning (e.g. rlimits, kernel patches, etc) for creating a huge number
of file descriptors / sockets / etc.

ADVANTAGE: You may attach to individual screens, kill them, or continue
some waiting commands.

Synopsis:
  ./screener.sh --help [--verbose]
  ./screener.sh list-running
  ./screener.sh list-waiting
  ./screener.sh list-interrupted
  ./screener.sh list-illegal
  ./screener.sh list-timeouted
  ./screener.sh list-failed
  ./screener.sh list-critical
  ./screener.sh list-serious
  ./screener.sh list-done
  ./screener.sh list
  ./screener.sh list-archive
  ./screener.sh list-screens
  ./screener.sh run [ ]
  ./screener.sh start [ ]

Inquiry operations:

  ./screener.sh list-screens
    Equivalent to screen -ls

  ./screener.sh list-*
    Show a list of currently running, waiting (for continuation), failed,
    and done/completed screen sessions.

  ./screener.sh list
    First show a list of currently running screens, then for each of these
    a list of (old) failed / completed sessions (and so on).

  ./screener.sh status
    Like list-*, but filter and don't report timestamps.

  ./screener.sh show <id>
    Show the last logfile of <id> at standard output.

  ./screener.sh less <id>
    Show the last logfile of <id> using "less -r".

MASS starting of screen sessions:

  ./screener.sh run
    Commands are launched in screen sessions via "./screener.sh start"
    commands, unless the same session is already running, or is in some
    error state, or is already done (see below).
    The commands are given by a column with CSV header name containing
    "command", or by the first column.
    The screen session id needs to be given by a column with CSV header
    name matching "screen_id|resource".
    The number and type of commands to launch can be reduced via any
    combination of the following filter conditions:

    --max=<number>
      Limit the number of _new_ sessions additionally started this time.

    --<column>==<value>
      Only select lines where an arbitrary CSV column (given by its
      CSV header name in C identifier syntax) has the given value.

    --<column>!=<value>
      Only select lines where the column does _not_ have the given value.

    --<column>=~<regex>
      Only select lines where the bash regular expression matches
      at the given column.

    --max-per=<number>
      Limit the number per _distinct_ value of the column denoted by
      the _next_ filter condition.
      Example: ./screener.sh run test.csv --dry-run --max-per=2 --dst_network=~.
      would launch only 2 Football processes per destination network.

    Hint: filter conditions can be easily checked by giving --dry-run.

Start / restart / kill / continue screen sessions:

  ./screener.sh start