SMM-1012 C: CRAY Y-MP, CRAY X-MP EA, CRAY X-MP, and CRAY-1 Computer Systems UNICOS On-line Diagnostic Maintenance Manual, March 1989 (OCR text)
User Manual: SMM-1012C-CRAY_YMP_XMP_EA_XMP_CRAY_1_Computer_Systems-UNICOS_Online_Diagnostic_Maintenance_Manual-March_1989.OCR
Page Count: 392
CRAY Y-MP™, CRAY X-MP EA™, CRAY X-MP™, and CRAY-1® Computer Systems
UNICOS® On-line Diagnostic Maintenance Manual

SMM-1012 C

Cray Research, Inc.

CRAY PROPRIETARY

Dissemination of this documentation to non-CRI personnel requires approval of the appropriate vice president and that the recipient sign a nondisclosure agreement. Export of technical information in this category may require an export license.

CRAY PROPRIETARY

Dissemination of this documentation to non-CRI personnel requires approval from the appropriate vice president and a nondisclosure agreement. Export of technical information in this category may require a Letter of Assurance.

Restricted Rights Legend

Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at 52.227-7013. (May 1987)

Cray Research, Inc.
608 2nd Avenue South
Minneapolis, MN 55402

Cray Research, Inc. Unpublished Proprietary Information - All Rights Reserved under the copyright laws of the United States and the U.C.C.

CRAY, CRAY-1, HSX, SSD, and UNICOS are registered trademarks and CFT, CFT77, CFT2, COS, Cray Ada, CRAY-2, CRAY X-MP, CRAY X-MP EA, CRAY Y-MP, CSIM, Delivering the power . . ., IDS, SEGLDR, and SUPERLINK are trademarks of Cray Research, Inc.

HYPERchannel and NSC are registered trademarks of Network Systems Corporation. IBM is a registered trademark of International Business Machines Corporation. Motorola is a registered trademark of Motorola, Inc. Sun Workstation is a registered trademark and Sun is a trademark of Sun Microsystems, Inc. UNIX is a registered trademark of AT&T. VMEbus is a trademark of Motorola, Inc.

The UNICOS operating system is derived from the AT&T UNIX System V operating system. UNICOS is also based in part on the Fourth Berkeley Software Distribution under license from The Regents of the University of California.
Due to space restrictions, the following abbreviations are used in place of the specific system names:

CX/1      Includes all models of the CRAY X-MP and CRAY-1 computer systems
CEA       Includes all models of the Extended Architecture (EA) series, including the CRAY Y-MP and CRAY X-MP EA computer systems
CRAY-2    Includes all models of the CRAY-2 computer system
CX/CEA    Includes all models of the CRAY X-MP computer systems plus all models of the CRAY Y-MP and CRAY X-MP EA computer systems. It does not include the CRAY-1 computer systems.

Requests for copies of Cray Research, Inc. publications should be sent to the following address:

Cray Research, Inc.
Distribution Center
2360 Pilot Knob Road
Mendota Heights, MN 55120

NEW AND ENHANCED FEATURES

This UNICOS release 5.0 overview describes the new and enhanced features contained in the CRAY Y-MP, CRAY X-MP EA, CRAY X-MP, and CRAY-1 Computer Systems UNICOS On-line Diagnostic Maintenance Manual, CRI publication SMM-1012.

With UNICOS 5.0, there is support for diagnostics that run on CRAY Y-MP and CRAY X-MP EA computer systems, as follows:

• Y-mode (32-bit addressing), available only as indicated in appendix A, On-line Diagnostic Programs
• X-mode (24-bit addressing), unless otherwise indicated

Specific new and enhanced features are as follows:

Feature      Status    Section  Description
cleario      Enhanced  6        Adds support for the Operator Workstation (OWS) and the CRAY Y-MP and CRAY X-MP EA computer systems.
dsdiag       Enhanced  6        Adds support for the OWS and the CRAY Y-MP and CRAY X-MP EA computer systems.
donut        New       5        On-line disk maintenance program
offmon       New       2        Off-line confidence monitor
olcfpt       New       3        Comprehensive floating-point instructions and data test
olcm         New       3        Common memory test
olcrit       Enhanced  3        Adds cluster selection.
oldmon       New       5        Down CPU monitor
olhpa        Enhanced  7        Adds support for DD-40 disk drives, SSD errors, and the CRAY Y-MP and CRAY X-MP EA computer systems.
olibuf       New       3        Instruction buffer test
olsbt        New       3        On-line semaphore, shared B and shared T register test
runsequence  Enhanced  7        Adds examples of sequence files used for testing and file cleanup. Invokes one less shell.
unitap       New       5        On-line magnetic tape test

CRAY RESEARCH, INC.
RECORD OF REVISION
PUBLICATION NUMBER SMM-1012

Each time this manual is revised and reprinted, all changes issued against the previous version are incorporated into the new version and the new version is assigned an alphabetic level.

Every page changed by a reprint with revision has the revision level in the lower right-hand corner. Changes to part of a page are noted by a change bar in the margin directly opposite the change. A change bar in the margin opposite the page number indicates that the entire page is new. If the manual is rewritten, the revision level changes but the manual does not contain change bars.

Requests for copies of Cray Research, Inc. publications should be directed to the Distribution Center and comments about these publications should be directed to:

CRAY RESEARCH, INC.
1345 Northland Drive
Mendota Heights, Minnesota 55120

Revision  Description

          September 1986 - Original printing. This printing supports the on-line diagnostic tests that run under the Cray operating system UNICOS, release 2.0, on the CRAY X-MP and CRAY-1 computer systems. The on-line diagnostic tests for CRAY-1 computer systems are not available for UNICOS release 2.0. All trademarks are listed in the record of revision.

A         June 1987 - Rewrite.
          This printing supports the on-line diagnostic tests that run under the Cray operating system UNICOS, release 3.0, on CRAY X-MP and CRAY-1 computer systems.

B         July 1988 - Rewrite. This printing supports the on-line diagnostic tests that run under the Cray operating system UNICOS, release 4.0, on CRAY Y-MP, CRAY X-MP EA, CRAY X-MP, and CRAY-1 computer systems.

C         March 1989 - Rewrite. This printing supports the on-line diagnostic tests that run under the Cray operating system UNICOS, release 5.0, on CRAY Y-MP, CRAY X-MP EA, CRAY X-MP, and CRAY-1 computer systems.

PREFACE

This manual describes the on-line environment for diagnostic tests that run under the Cray operating system UNICOS, release 5.0, on CRAY Y-MP, CRAY X-MP EA, CRAY X-MP, and CRAY-1 computer systems. It is intended for Cray Research, Inc. (CRI) field engineers and analysts. A working knowledge of UNICOS is assumed.

CONVENTIONS

To aid in identifying the various groups of Cray mainframes, this manual uses the naming conventions shown in the Hardware Product Line sheet, which is located at the end of the preface. The Hardware Product Line sheet shows both the chronological evolution of Cray mainframes and the characteristics of each group. The reverse side contains definitions of the terms used on the sheet and throughout this manual.

The conventions for entering the diagnostic commands are as follows:

Convention       Description
bold             Bold indicates one of the following:
                   Diagnostic program
                   Command option
                   Man page entry
                   File name
italic           Italic indicates variable or user-supplied information.
O'x              The prefix O' indicates that x is an octal value.
RETURN           This indicates the RETURN key. You must press RETURN after entering each keyboard command.
[ ]              Square brackets indicate optional items.
+option          A plus sign (+) preceding a command option indicates that the option is enabled.
-option          A minus sign (-) preceding a command option indicates that the option is disabled.
command(1)       This refers to an entry in the UNICOS User Commands Reference Manual, CRI publication SR-2011.
command(1M)      This refers to an entry in the UNICOS Administrator Commands Reference Manual, CRI publication SR-2022.
system call(2)   This refers to an entry in the UNICOS System Calls Reference Manual, CRI publication SR-2012.
entry(4X)        This refers to an entry in the UNICOS File Formats and Special Files Reference Manual, CRI publication SR-2014. The X indicates the section of the manual that contains the entry.

OTHER PUBLICATIONS

CRI off-line diagnostic publications that may be of interest are as follows:

HO-OI004   CRAY-1 Computer Systems Diagnostic Ready Reference Guide
HO-OI005   CRAY X-MP Computer Systems Diagnostic Ready Reference Guide
HO-OI007   I/O Subsystem (IOS) Diagnostic Ready Reference Guide
HM-OIOIO   CRAY X-MP Computer Systems IOS-based Diagnostic Reference Manual

CRI software publications that may be of interest are as follows:

SO-0083    CRAY Y-MP, CRAY X-MP EA, CRAY X-MP and CRAY-1 CAL Assembler Version 2 Ready Reference
SD-0235    Software Problem Report (SPR) User's Guide
SG-0307    I/O Subsystem (IOS) Administrator's Guide
SG-2005    I/O Subsystem (IOS) Operator's Guide for UNICOS
SR-2011    UNICOS User Commands Reference Manual
SR-2012    Volume 4: UNICOS System Calls Reference Manual
SR-2014    UNICOS File Formats and Special Files Reference Manual
SR-2022    UNICOS Administrator Commands Reference Manual
SN-3030    Operator Workstation (OWS) Guide

CRI hardware publications that may be of interest are as follows:

HR-0030        I/O Subsystem Model B Hardware Reference Manual
HR-0081        I/O Subsystem Model C/D Hardware Reference Manual
CSM-0110-000   CRAY X-MP/2 System Programmer Reference Manual
CSM-0111-000   CRAY X-MP/1 System Programmer Reference Manual
CSM-0112-000   CRAY X-MP/4 System Programmer Reference Manual
CSM-0400-000   CRAY Y-MP System Programmer Reference Manual

For additional information, refer to
the on-line diagnostic listings.

UNICOS SYSTEM INSTALLATION BULLETIN

Refer to the UNICOS System Installation Bulletin for the following information:

• Build and installation procedures
• Configuration guidelines

Each site receives this bulletin with the UNICOS release package. You can order additional copies from the CRI Distribution Center.

Note that appendix G, Installation Information, describes the procedure for on-line diagnostic re-installation subsequent to system installation.

READER COMMENTS

If you have any comments about the technical accuracy, content, or organization of this manual, please tell us. You can contact us in any of the following ways:

• Call our Technical Publications department at (612) 681-5729 during the hours of 7:30 A.M. to 6:00 P.M. (Central Time).
• Send us electronic mail from a UNICOS or UNIX system, using the following UUCP addresses:
    uunet!cray!publications
    sun!tundra!hall!publications
• Send us electronic mail from a UNICOS or UNIX system, using the following ARPAnet address:
    publications@cray.com
• Send a facsimile of your comments to the attention of "Publications" at FAX number (612) 681-5602.
• Use the postage-paid Reader's Comment form at the back of this manual.
• Write to us at the following address:
    Cray Research, Inc.
    Technical Publications Department
    1345 Northland Drive
    Mendota Heights, Minnesota 55120

We value your comments and will respond to them promptly.

Hardware Product Line

[Figure: Hardware Product Line sheet. The chart lists each group of Cray mainframes with its clock cycle, maximum memory size, and notable features such as the introduction of the I/O Subsystem (IOS); the chart itself is not legible in this copy.]

The following list defines architecture terms:

Term               Definition

CX/1 systems       This group includes all models of the CRAY X-MP and CRAY-1 computer systems.
                   It is characterized by 24-bit addressing capabilities.

CEA systems        This group includes all models of the Extended Architecture (EA) series, which are the CRAY Y-MP and CRAY X-MP EA computer systems. It is characterized by 32-bit addressing capabilities.

CRAY-2 systems     This group includes all models of the CRAY-2 computer systems. It is characterized by 32-bit addressing capabilities, large common memories, and immersion cooling.

CX/CEA systems     This group designates all models of CRAY X-MP computer systems plus all models of the CRAY Y-MP and CRAY X-MP EA computer systems. It does not include CRAY-1 computer systems.

EAM bit (hardware) In CX/1 systems, the EAM bit is the Enhanced Addressing Mode bit in the Flag register. When set, it sign-extends certain instructions for memory addressing in 8- and 16-Mword systems. In CEA systems, the EAM bit is the Extended Addressing Mode bit in the Flag register. It is set by the operating system to select either 24- or 32-bit addressing.

EMA feature (software) In CX/1 systems, EMA is the Extended Memory Addressing feature for 8- or 16-Mword systems.

X-mode             This term refers to the 24-bit addressing mode in CEA systems. The operating systems select this mode with the EAM bit in the Exchange Package.

Y-mode             This term refers to the 32-bit addressing mode in CEA systems. The operating systems select this mode with the EAM bit in the Exchange Package.

CONTENTS

PREFACE
  CONVENTIONS
  OTHER PUBLICATIONS
  UNICOS SYSTEM INSTALLATION BULLETIN
  READER COMMENTS

1. ON-LINE DIAGNOSTIC SYSTEM
  1.1 ON-LINE DIAGNOSTIC ENVIRONMENT
  1.2 ON-LINE DIAGNOSTIC PROGRAMS

2. CONFIDENCE TEST AND MONITOR OVERVIEW
  2.1 ON-LINE CONFIDENCE MONITOR (olcmon)
  2.2 PROGRAM SYNOPSIS
  2.3 TEST EXECUTION
  2.4 TEST TERMINATION
  2.5 TEST EXAMPLES
  2.6 TEST MESSAGES
    2.6.1 Informative messages
    2.6.2 Error messages
  2.7 OFF-LINE CONFIDENCE MONITOR (offmon)

3. CONFIDENCE TEST DESCRIPTIONS
  3.1 olcfdt
    3.1.1 Test synopsis
    3.1.2 Test examples
    3.1.3 Test messages
      3.1.3.1 Informative messages
      3.1.3.2 Error messages
  3.2 olcfpt
    3.2.1 Test synopsis
    3.2.2 Test execution
      3.2.2.1 Test initialization
      3.2.2.2 Random floating-point instruction and data generation
      3.2.2.3 Random floating-point instruction buffer simulation
      3.2.2.4 Random floating-point instruction buffer execution
      3.2.2.5 Comparison of simulation and execution results
      3.2.2.6 Error isolation
    3.2.3 Test termination
    3.2.4 Test examples
    3.2.5 Test messages
      3.2.5.1 Informative messages
      3.2.5.2 Error messages
  3.3 olcm
    3.3.1 Test synopsis
    3.3.2 Test execution
      3.3.2.1 Test initialization
      3.3.2.2 Test section execution
        Test section 1
        Test sections 2 and 3
        Test section 4
        Test section 5
        Test section 6
        Test section 7
      3.3.2.3 Comparison of expected and actual data
      3.3.2.4 Error report
    3.3.3 Test termination
    3.3.4 Test examples
    3.3.5 Test messages
      3.3.5.1 Informative messages
      3.3.5.2 Error messages
      3.3.5.3 Error output definitions
  3.4 olcrit
    3.4.1 Test synopsis
    3.4.2 Test execution
      3.4.2.1 Test initialization and hardware configuration detection
      3.4.2.2 Random instruction and data generation
      3.4.2.3 Random instruction buffer simulation
      3.4.2.4 Random instruction buffer execution
      3.4.2.5 Comparison of simulation and execution results
      3.4.2.6 Error isolation
    3.4.3 Test termination
    3.4.4 Test examples
    3.4.5 Test messages
      3.4.5.1 Test mode messages
      3.4.5.2 Informative messages
      3.4.5.3 Error messages
  3.5 olcsvc
    3.5.1 Test synopsis
    3.5.2 Test execution
      3.5.2.1 Test initialization and hardware configuration detection
      3.5.2.2 Random instruction and data generation
      3.5.2.3 Instruction buffer execution
      3.5.2.4 Comparison of execution results
      3.5.2.5 Error isolation
    3.5.3 Test termination
    3.5.4 Test examples
    3.5.5 Test messages
      3.5.5.1 Test mode messages
      3.5.5.2 Informative messages
  3.6 olibuf
    3.6.1 Test synopsis
    3.6.2 Test execution
      3.6.2.1 Test initialization
      3.6.2.2 CRAY X-MP computer system test buffer generation
      3.6.2.3 CRAY Y-MP computer system test buffer generation
      3.6.2.4 Test buffer execution
      3.6.2.5 Comparison of expected and actual data
      3.6.2.6 Error report
    3.6.3 Error isolation to the failing bit
      3.6.3.1 CX/1 system error isolation
      3.6.3.2 CRAY Y-MP computer system error isolation
    3.6.4 Test termination
    3.6.5 Test examples
    3.6.6 Test messages
      3.6.6.1 Informative messages
      3.6.6.2 Error messages
  3.7 olsbt
    3.7.1 Test synopsis
    3.7.2 Test execution
      3.7.2.1 Test initialization and hardware configuration detection
      3.7.2.2 Random instruction and data generation
      3.7.2.3 Random instruction buffer simulation
      3.7.2.4 Random instruction buffer execution
      3.7.2.5 Comparison of simulation and execution results
      3.7.2.6 Error isolation
    3.7.3 Test termination
    3.7.4 Test examples
    3.7.5 Test messages
      3.7.5.1 Test mode messages
      3.7.5.2 Informative messages
      3.7.5.3 Error messages

4. MAINTENANCE TEST AND MONITOR OVERVIEW
  4.1 MAINTENANCE MONITOR (olmon)
  4.2 PROGRAM SYNOPSIS
  4.3 TEST EXECUTION
  4.4 TEST-SPECIFIC REQUIREMENTS
    4.4.1 olaht
    4.4.2 olCDm:
    4.4.3 olibz
  4.5 TEST TERMINATION
  4.6 TEST EXAMPLES
  4.7 TEST MESSAGES
  4.8 DIAGNOSTIC MEMORY IMAGE FOR MAINTENANCE TESTS

5. DOWN-DEVICE PROGRAMS
  5.1 donut
    5.1.1 Disk selection
    5.1.2 Disk mode
      5.1.2.1 System mode
      5.1.2.2 Maintenance mode
    5.1.3 Warnings and messages
    5.1.4 Menu displays
    5.1.5 Program execution
    5.1.6 Main menu
      5.1.6.1 Commands to display submenus
      5.1.6.2 Commands to select display format
      5.1.6.3 Commands to set arguments
      5.1.6.4 Commands to display the data buffer
      5.1.6.5 Commands to display flaw table menus
      5.1.6.6 Commands to change the data buffer
      5.1.6.7 Commands to change the type of write command used
      5.1.6.8 Commands to display commands list
    5.1.7 Buffer Utility menu
    5.1.8 Error Utility menu
      5.1.8.1 Error Table menu
      5.1.8.2 Error Log menu
    5.1.9 Formatting menu
      5.1.9.1 Logical address of the sector ID
      5.1.9.2 Position field of the sector ID (DD-10s and DD-40s only)
      5.1.9.3 Examine Data Buffer menu
      5.1.9.4 ID Analysis menu (DD-10s, DD-39s, DD-40s, and DD-49s only)
        ID analysis (DD-39s/49s)
        ID analysis (DD-40s)
        ID Analysis menu commands
      5.1.9.5 Parameter menu
    5.1.10 Surface Tests menu
      5.1.10.1 Write Data, Read Data and Compare, and Surface Analysis menus
      5.1.10.2 Examine Data Buffer menu
      5.1.10.3 Parameter menu
    5.1.11 Flaw Table Utility menus
    5.1.12 Error correction code test
    5.1.13 Parameter menu
    5.1.14 Exiting donut
    5.1.15 Program examples
  5.2 oldmon
    5.2.1 Down CPU tests
        Off-line down CPU tests
        Modifications to the diagnostic test base
    5.2.2 Program synopsis
    5.2.3 Program execution
      5.2.3.1 Default configuration files
      5.2.3.2 Test loop code
      5.2.3.3 Environment variables
    5.2.4 Display modes
      5.2.4.1 Scroll mode display
      5.2.4.2 Screen mode display
    5.2.5 Program commands
      5.2.5.1 Common arguments
      5.2.5.2 Append (a) and Dump (d) commands
      5.2.5.3 CPU command (c)
      5.2.5.4 Enter command (e)
      5.2.5.5 Execute command (x)
      5.2.5.6 Fill command (f)
      5.2.5.7 Go command (g)
      5.2.5.8 Halt command (h)
      5.2.5.9 Load command (l)
      5.2.5.10 Options command (o)
      5.2.5.11 Quit command (q)
      5.2.5.12 Redraw command (r)
      5.2.5.13 Shell escape command (!)
      5.2.5.14 Status command (s)
      5.2.5.15 Up command (u)
      5.2.5.16 View command (v)
      5.2.5.17 Write command (w)
    5.2.6 Program example
    5.2.7 Program messages
  5.3 unitap
    5.3.1 Program synopsis
    5.3.2 Interactive program execution
    5.3.3 Program menus
      5.3.3.1 Main Menu
      5.3.3.2 Variable Menu
      5.3.3.3 Test Menu
      5.3.3.4 Canned Test Menu
      5.3.3.5 Debug Menu
      5.3.3.6 Global Options Menu
      5.3.3.7 Hardware Layout Menu
    5.3.4 Debug tools
      5.3.4.1 Breakpoint Tool
      5.3.4.2 Channel Commands Tool
      5.3.4.3 Display Data Buffer Tool
      5.3.4.4 Compare Data Tool
      5.3.4.5 System Call History Tool
      5.3.4.6 Programming Tool
      5.3.4.7 Packet Status Tool
    5.3.5 Trace file
    5.3.6 Learn mode
    5.3.7 Program examples
    5.3.8 Program messages
      5.3.8.1 Messages with menu displays
      5.3.8.2 Messages without menu displays

6. I/O SUBSYSTEM DEADSTART PROGRAMS
  6.1 SYSTEM CONFIGURATION
  6.2 cleario
    6.2.1 Program execution
    6.2.2 Program messages
      6.2.2.1 Informative messages
      6.2.2.2 Error messages
  6.3 dsdiag
    6.3.1 Program execution
      6.3.1.1 IOP-0 tests
      6.3.1.2 I/O Subsystem tests
        dsmos16k
        dsiom
        dsiop
        dsmos
        dshsp
        dslsp
    6.3.2 Program messages
      6.3.2.1 Informative messages
      6.3.2.2 Error messages
        Messages applicable to all tests
        IOP-0 messages
        dsmos16k messages
        dsiom messages
        dsiop messages
        dsmos messages
        dshsp messages
        dslsp messages

7. UTILITY PROGRAMS
  7.1 olhpa
    7.1.1 Program synopsis
    7.1.2 Help menus
    7.1.3 Program examples
    7.1.4 Shell script generation and execution
    7.1.5 Program messages
  7.2 runsequence
    7.2.1 crontab input file
    7.2.2 Sequence files
    7.2.3 runsequence shell script

APPENDIX SECTION

A. ON-LINE DIAGNOSTIC PROGRAMS
  A.1 CONFIDENCE TESTS
  A.2 MAINTENANCE TESTS
  A.3 DOWN-DEVICE PROGRAMS
  A.4 ON-LINE NETWORK COMMUNICATIONS PROGRAM
  A.5 I/O SUBSYSTEM DEADSTART PROGRAMS
  A.6 UTILITY PROGRAMS
  A.7 offmon TESTS

B. TEST EXECUTION TIMES
  B.1 EXECUTION TIMES FOR CONFIDENCE TESTS
  B.2 EXECUTION TIMES FOR MAINTENANCE TESTS

C. ON-LINE DIAGNOSTIC PROGRAM LIBRARIES
  C.1 DIAGPL
  C.2 XMPPL
  C.3 CRAY1PL

D. SOFTWARE PROBLEM REPORTING

E. SYSTEM UTILITIES

F. SITE COMMUNICATIONS

G. INSTALLATION INFORMATION
  G.1 ON-LINE DIAGNOSTIC DIRECTORIES
  G.2 GENERATING ON-LINE DIAGNOSTIC BINARIES
  G.3 GENERATING ON-LINE DIAGNOSTIC LISTINGS
  G.4 SAVING OFF-LINE VERSIONS OF ON-LINE CONFIDENCE TESTS
    G.4.1 MVS-based systems running CMS
    G.4.2 Expander-based systems running DDS
  G.5 SAVING I/O SUBSYSTEM (IOS) DEADSTART PROGRAMS
    G.5.1 OWS UNICOS
    G.5.2 Expander tape UNICOS
    G.5.3 Expander disk UNICOS
  G.6 GENERATING olnet
    G.6.1 IBM front-end
    G.6.2 Sun Workstation front-end (NSC)
    G.6.3 Sun Workstation front-end (VME)
    G.6.4 Motorola Workstation, OWS, or MWS front-end (VME)
  G.7 DELETING PROPRIETARY SOURCE CODE

FIGURES

4-1   Sample Diagnostic Memory Image
5-1   Main Menu for donut
5-2   Buffer Utility Menu
5-3   Write Buffer Menu
5-4   Read Buffer Menu
5-5   Error Utility Menu
5-6   Error Table Menu
5-7   Error Log Menu
5-8   Formatting Menu
5-9   Examine Data Buffer Menu
5-10  ID Analysis Menu for DD-39 and DD-49 Disk Drives
5-11  ID Analysis Menu for DD-40 Disk Drives
5-12  Surface Tests Menu
5-13  Write Data Menu
5-14  Read Data and Compare Menu
5-15  Surface Analysis Menu
5-16  Flaw Table Utility Menu
5-17  Factory Flaw Table Menu
5-18  User Flaw Table Menu for DD-39 and DD-49 Disk Drives
5-19  User Flaw Table Menu for DD-10 and DD-40 Disk Drives
5-20  System Flaw Table Menu
5-21  Found Flaw Table Menu for DD-19/29/39/49 Disk Drives
5-22  Found Flaw Table Menu for DD-10 and DD-40 Disk Drives
5-23  Parameter Menu
5-24  Main Menu for oldmon
5-25  Scroll Mode Display
5-26  Screen Mode Display
5-27  Main Menu for unitap
5-28  Variable Menu
5-29  Test Menu
5-30  Canned Test Menu
5-31  Debug Menu
5-32  Global Options Menu
5-33  Hardware Layout Menu
5-34  Block Multiplexer Layout Menu (BMC-5)
5-35  Breakpoint Tool
5-36  Channel Commands Tool
5-37  Display Data Buffer Tool
5-38  Compare Data Tool
5-39  System Call History Tool
5-40  Programming Tool
5-41  Packet Status Tool
7-1   Disk Help Menu
7-2   Memory Help Menu
7-3   Tape Help Menu
7-4   SSD Help Menu
D-1   SPR Form

TABLES

5-1   Main Menu Commands
5-2   Commands to Set Arguments
5-3   Buffer Utility Menu Commands
5-4   Commands for the Write Buffer and Read Buffer Menus
5-5   Error Utility Menu Commands
5-6   Error Table Menu Commands
5-7   Error Log Menu Commands
5-8   Formatting Menu Commands
5-9   Examine Data Buffer Menu Commands
5-10  ID Analysis Menu Commands
5-11  Surface Tests Menu Commands
5-12  Commands for the Write Data, Read Data and Compare, and Surface Analysis Menus
5-13  Flaw Table Utility Menu Commands
5-14  Commands for the Flaw Table Menus
5-15  Parameter Menu Commands
5-16  oldmon Commands
A-1   Confidence Tests
A-2   CPU Maintenance Tests
A-3   Down-Device Programs
A-4   Down CPU Confidence Tests
A-5   Down CPU Maintenance Tests
A-6   On-line Network Communications Program
A-7   I/O Subsystem Deadstart Programs
A-8   Utility Programs
A-9   offmon Tests
B-1   Execution Times for Confidence Tests
B-2   Execution Times for Maintenance Tests

INDEX

1. ON-LINE DIAGNOSTIC SYSTEM

This manual describes the on-line test environment for diagnostics that run under the Cray operating system UNICOS on the following computer systems:

• CEA systems
    Y-mode (32-bit addressing)
    X-mode (24-bit addressing)
• CX/1 systems

The on-line diagnostic system performs error detection and isolation concurrent with system operation. This type of on-line maintenance provides the following benefits:

• Ensures an enhanced level of continuous system operation
• Prevents possible system software failures and identifies data integrity problems in system output
• Provides the capability for concurrent maintenance
• Reduces mean time to repair (MTTR) by isolating the failing hardware while the system is running
• Reduces off-line preventive maintenance (PM) time required for failure detection, isolation, and repair

1.1 ON-LINE DIAGNOSTIC ENVIRONMENT

The on-line diagnostic system consists of programs that reside in Cray central memory or in Cray mass storage.
To run the on-line diagnostic programs in a Cray computer system configuration, UNICOS must be running in at least one Central Processing Unit (CPU).

Throughout this document, the term operator's station refers to one of the following devices, as appropriate to your site:

• Peripheral expander
• Operator workstation

1.2 ON-LINE DIAGNOSTIC PROGRAMS

To ensure maximum system reliability, the on-line diagnostic programs do the following:

• Detect, isolate, and report hardware faults
• Gather and analyze system performance data

The on-line diagnostic programs are grouped as follows:

Diagnostic Group        Description

Confidence tests        These tests provide error detection and isolation. To verify system integrity, it is recommended that these tests be run at system startup and at intervals thereafter.

Maintenance tests       These tests provide error detection and isolation. These tests are variants of off-line diagnostic tests.

Down-device programs    The down-device programs provide on-line CPU and peripheral testing while the hardware is removed from normal system operations.

Network test (olnet)†   This test detects and isolates faults in the communications link between a Cray mainframe and a front-end computer system.

I/O Subsystem (IOS) deadstart programs
                        These programs can be run prior to system deadstart to verify the integrity of the IOS hardware. They isolate failures to the functional area, at which point a CRI field engineer must interpret the results.

Utility programs        These are on-line diagnostic tools.

† The olnet test is described in the On-line Diagnostic Network Communications Program (OLNET) Maintenance Manual, CRI publication SMM-1016.

2. CONFIDENCE TEST AND MONITOR OVERVIEW

On-line diagnostic confidence tests provide a comprehensive performance check of the system hardware.
This test level consists of the following:

• High-level language diagnostic programs
• A set of CAL Version 2 diagnostic programs that direct hardware testing to specific logic areas

This section provides an overview of the following:

• On-line confidence monitor (olcmon)
• Program synopsis
• Test execution
• Test termination
• Test examples
• Test messages
• Off-line confidence monitor (offmon)

For a brief description of each confidence test, refer to appendix A, On-line Diagnostic Programs. For a list of test execution times, refer to appendix B, Test Execution Times. For additional information on specific confidence tests and their command options, refer to section 3, Confidence Test Descriptions.

2.1 ON-LINE CONFIDENCE MONITOR (olcmon)

The on-line confidence monitor program, olcmon, does the following:

• Accepts and interprets command options and arguments
• Sends test results to stdout (standard output device) by default or to a file when UNICOS output redirection is indicated on the command line

2.2 PROGRAM SYNOPSIS

The olcmon command options are entered with the test command options of each confidence test to be executed. The test-specific command options are described in section 3, Confidence Test Descriptions.

The olcmon command options can be entered in any order. If an option is omitted, the program uses the default value.

The following command options provide different methods of specifying the starting seed value (specify only one for each test executed):

• +/-getseed
• getseed file
• seed n (a test-specific command option described in section 3, Confidence Test Descriptions)

Synopsis:

    test [chkpnt mode] [cpu clist] [cputime h:m:s] [+/-getseed] [getseed file] [help] [maxerr n] [maxp n] [+/-parcel] [time h:m:s] [+/-verbose] [+xmp] [+cray1] [test options]†

chkpnt mode
    Indicates whether restart files are to be generated.
    mode is one of the following arguments:

    Argument    Description

    first       Generates a restart file for the first failure detected (default)
    all         Generates a restart file for each failure detected, including failures detected during error isolation
    none        Does not generate restart files

    The default generates a restart file for the first failure detected. For additional information, refer to the following: chkpnt(1), restart(1), chkpnt(2), and restart(2).

† For additional information on confidence tests and their test-specific command options, refer to section 3, Confidence Test Descriptions.

cpu clist
    Selects the CPUs to be tested. Enter clist in the following format:

        x,x,...,x

    x can be a, b, c, d, e, f, g, or h. The first CPU selected is the master CPU. The default is cpu a. If you enter an invalid CPU value in clist or a value for a CPU that is currently down, you will receive an error message.

cputime h:m:s
    Sets the test execution time in CPU time. The time is specified in hours (h), minutes (m), and seconds (s); minutes and seconds; or just seconds. Use colons as delimiters, as follows: h:m:s. Generally, actual execution time is within one second of the specified CPU time. If cputime is allowed to default, or is set to 0, the test uses the maxp value. However, if set to a value other than 0, cputime overrides maxp.

+/-getseed
    Enables (+getseed) or disables (-getseed) the option that reads the file test.seed to obtain a starting seed. If the test terminates because the maximum pass or error limit is reached, the seed from the last pass is saved in the file test.seed. If there are any problems with reading the seed from this file, the program uses the default seed (0'33). If you select +getseed, do not select seed n (test-specific command option). The default is -getseed.

getseed file
    Gets a starting seed from file. file can contain a dump from a previous failure or a single seed value.
    If allowed to default, the program uses the seed value specified by +getseed or seed n (test-specific command option).

help
    Generates an on-line help display containing a synopsis and a brief description of the command options and arguments. If help is entered with a test name, help information is written to stdout, and the test terminates.

maxerr n
    Sets the maximum number of errors. n is an octal value. The default for n is 1.

maxp n
    Sets the maximum number of passes. n is an octal value. The default for n is 0'1000. If cputime or time is set to a value other than 0, the specified option overrides maxp.

+/-parcel
    Enables (+parcel) or disables (-parcel) the option that forces dumped data to parcel format. +parcel forces data that would otherwise be in word format (64 bits in octal, with leading 0's) to parcel format (four groups of 16 bits in octal). Parcel format displays two words (8 parcels) per line. Word format displays four words per line. The default is -parcel.

time h:m:s
    Sets the test execution time in elapsed (wall-clock) time. The time is specified in hours (h), minutes (m), and seconds (s); minutes and seconds; or just seconds. Use colons as delimiters, as follows: h:m:s. Generally, actual execution time is within one second of the specified elapsed time. If time is allowed to default (or is set to 0), the test uses the maxp value. However, if set to a value other than 0, time overrides maxp.

+/-verbose
    Enables (+verbose) or disables (-verbose) the generation of informational messages. The +verbose option causes a line of output to be generated after each pass of the diagnostic. The default is -verbose.

+xmp
+cray1
    Indicates the test mode for the following computer systems:

    Command     Computer System

    +xmp        CRAY X-MP
    +cray1      CRAY-1

    If allowed to default, the monitor determines the machine type during test execution and selects the appropriate test mode. This option can be used to override the default selection.
    These command options are not applicable to a CEA system.

2.3 TEST EXECUTION

To start a single diagnostic test, enter the following on the command line:

• test
• Monitor command options
• Test-specific command options

To run a sequence of diagnostics, use the runsequence utility described in section 7, Utility Programs.

Before a test can be started, UNICOS must be running in the CPUs to be tested.

The master CPU (the first CPU selected) does the following:

• Generates instructions and data
• Generates expected results
• Compares the test execution buffers of the selected CPUs to the expected results
• Generates and formats error reports
• Controls error isolation

Each CPU, including the master, does the following:

• Loads registers and buffers
• Executes test instructions
• Saves results

2.4 TEST TERMINATION

A test stops under the following conditions:

• The test successfully completes the maximum number of passes (maxp n).
• The test reaches the specified CPU time (cputime h:m:s) or elapsed (wall-clock) time (time h:m:s).
• The test detects and isolates the maximum number of errors (maxerr n). Error reports are automatically sent to stdout (standard output device), but they can be redirected to an error file.
• The help option is entered with a test name, help information is written to stdout, and the test terminates.
• The monitor or test detects an error in a command line entry and writes a message to stderr (standard error device). Only the first error detected is reported.

2.5 TEST EXAMPLES

The following example executes olcsvc in CPUs c, a, and b, with c as the master.

Example:

    olcsvc cpu c,a,b

The following example executes olcsvc in CPUs a and b, with a as the master. The seed x option provides an octal seed value to start random number generation.

Example:

    olcsvc seed x cpu a,b

In the following example, the nohup(1) command allows olcsvc to continue executing after you log off the system.
The ampersand (&) causes the entire command to execute in the background, so that another prompt is immediately displayed and you can continue to use the system.

Example:

    nohup olcsvc &

The following example shows the test-specific help information that is displayed if help is entered with a test name.

Example:

    olcsvc help

Help display:

    olcsvc help
    olcsvc [chkpnt mode] [cpu clist] [+/-getseed] [getseed file] [help]
           [maxerr n] [maxp n] [+/-parcel] [+/-verbose] [+cray1] [+xmp]
           [cputime h:m:s] [time h:m:s] [disable ilist] [enable ilist]
           [+/-isolate] [isop n] [numpar n] [+/-repeat] [seed n] [+/-sgci]
           [vl n] [+/-cm] [+/-fpadd] [+/-fpmult] [+/-fprecip] [+/-int]
           [+/-logical] [+/-pop] [+/-shift] [+/-onezero] [+/-random]
           [+/-slide]
    chkpnt mode - Checkpoint mode: none, first, or all. (Default: first)
    cpu clist - Run in selected CPUs. (Default: a)
    +/-getseed - Get/don't get seed from test.seed. (Default: -getseed)
    getseed file - Search file for starting seed
    help - Provides a help display.
    +/-verbose - Enable/disable info. messages to stdout. (Default: -verbose)
    maxp n - Set maximum pass limit to n. (Default: 0'1000)
    maxerr n - Set maximum error limit to n. (Default: 1)
    +/-parcel - Force/don't force dump to parcel format. (Default: -parcel)
    +cray1/+xmp - Selects CRAY-1/CRAY X-MP test mode. (Default: host machine)
    cputime h:m:s - Set amount of CPU time to execute.
    time h:m:s - Set amount of wall clock time to execute.
    disable ilist - Do not run specific instructions. Ignored if invalid.
    enable ilist - Run specific instructions. Ignored if invalid.
    +/-isolate - Enable/disable isolation. (Default: +isolate)
    isop n - Loop during isolation n times to find error. (Default: 0'1000)
    numpar n - Number of parcels to run in vector buffer. (Default: 0'100)
    +/-repeat - Repeat/do not repeat first pass. (Default: -repeat)
    seed n - Set seed for random number generator to n. (Default: 0'33)
    +/-sgci - Enable/disable scatter/gather/compressed index testing.
    vl n - Set VL. 0 <= n <= 100. If n = 0, VL is random. (Default: 0)
    +/-cm, +/-fpadd, +/-fpmult, +/-fprecip, +/-int, +/-logical, +/-pop,
    +/-shift - Enable/disable specific instruction groups.
               (Default: all instructions)
    +/-onezero, +/-random, +/-slide - Enable/disable specific data patterns.
               (Default: all data patterns)

The following example shows the output that is displayed when olcsvc is run with all default values.

Example:

    olcsvc

Output:

    olcsvc
    olcsvc: started in cpu A on Thu Jan 8 08:55:46 1987
    CRAY X-MP MODE
    olcsvc reached maximum pass limit with 1000 passes and 0 errors on Thu Jan 8 08:56:08 1987

The following example shows the output that is displayed if +verbose is specified and maxp is set to 10 (octal).

Example:

    olcsvc +verbose maxp 10

Output:

    olcsvc +verbose maxp 10
    olcsvc: started in cpu A on Thu Jan 8 08:56:43 1987
    CRAY X-MP MODE
    olcsvc: pass = 1, error = 0 Thu Jan 8 08:56:43 1987
    olcsvc: pass = 2, error = 0 Thu Jan 8 08:56:43 1987
    olcsvc: pass = 3, error = 0 Thu Jan 8 08:56:43 1987
    olcsvc: pass = 4, error = 0 Thu Jan 8 08:56:43 1987
    olcsvc: pass = 5, error = 0 Thu Jan 8 08:56:43 1987
    olcsvc: pass = 6, error = 0 Thu Jan 8 08:56:43 1987
    olcsvc: pass = 7, error = 0 Thu Jan 8 08:56:43 1987
    olcsvc: pass = 10, error = 0 Thu Jan 8 08:56:43 1987
    olcsvc reached maximum pass limit with 10 passes and 0 errors on Thu Jan 8 08:56:43 1987

2.6 TEST MESSAGES

Each test generates the following types of messages:

• Informative
• Error

These messages are listed in the subsections that follow.

2.6.1 INFORMATIVE MESSAGES

This subsection lists the informative messages, which are sent to stdout (standard output device).

test: Cannot open test.seed. Seed cannot be saved.
    The test cannot write test.seed. Therefore, the ending seed cannot be saved. Check write permissions of the current directory.

test: Cannot write restart file. errno = n.
    The test cannot write a restart file. Contact your CRI representative.

2.6.2 ERROR MESSAGES

This subsection lists the error messages, which are sent to stderr (standard error device).

test: Illegal option x.
    Option x is invalid. Correct and rerun.

test: Illegal argument x.
    Argument x is invalid. Correct and rerun.

test: Illegal CPU selection x.
    CPU x is invalid. Correct and rerun.

test: Maximum of O'x items in option list.
    Too many items are in the argument list for option. The maximum number of items allowed in the argument list is O'x. Correct and rerun.

test: An error occurred when selecting CPU x.
    CPU x is unavailable. Contact your CRI representative.

test: Cannot allocate memory. Cannot save buffers.
    The test cannot allocate memory or save buffers. Regenerate the diagnostic and rerun. If the problem persists, contact your CRI representative.

test: Too many buffers. Cannot save buffers.
    The test cannot save buffers. Regenerate the diagnostic and rerun. If the problem persists, contact your CRI representative.

test: Cannot open file.
    The test cannot open the file name specified by the getseed option. Correct and rerun.

test: Cannot find seed in file.
    The test cannot find the seed in file. Ensure that file is valid and rerun.

test: Error selecting cluster x.
    Cluster x is unavailable. Contact your CRI representative.

2.7 OFF-LINE CONFIDENCE MONITOR (offmon)

The offmon† monitor allows the following on-line confidence tests to be executed either in an off-line environment or in a down CPU under the down CPU monitor, oldmon:††

• olcfpt
• olcm
• olcrit
• olcsvc
• olibuf

To execute in these environments, each on-line confidence test is concatenated to offmon and assembled (instead of being linked to olcmon). To ensure compatibility between the on-line and off-line test environments, the on-line and off-line confidence tests are built from the same source code.

The equivalent off-line confidence test names start with the prefix off instead of ol.
For example, the off-line equivalent of olcrit is offcrit.

To generate the same test conditions in both the on-line and off-line test environments, use the same seed value. Set the seed value for the on-line confidence test (refer to subsection 2.2, Program Synopsis), and use the same value for the off-line test.

For information on executing offmon, refer to the diagnostic listing.

† The offmon monitor is supported on CX/CEA systems only.
†† The oldmon monitor is supported on multiple-CPU Cray computer systems only.

3. CONFIDENCE TEST DESCRIPTIONS

This section describes the following on-line confidence tests:

Test        Description

olcfdt      Mass storage device test
olcfpt      Comprehensive floating-point test
olcm        Central memory test
olcrit      Comprehensive random instruction test
olcsvc      Comprehensive scalar and vector comparison test
olibuf      Instruction buffer test
olsbt       Semaphore, shared B and shared T register test

For general information on confidence tests, refer to section 2, Confidence Test and Monitor Overview. For a list of test execution times, refer to appendix B, Test Execution Times.

3.1 olcfdt

The olcfdt test is an on-line confidence test for mass storage devices. It creates a user-specified file that is used for all input and output operations during test execution.

To test a specific device, specify the absolute path name to the device. If an absolute path name is not specified, olcfdt creates a file in the user's current working directory and tests the device associated with the working directory. Your system file configuration determines which directories and files reside on each device.

The created file is permanent. To delete the file, use the rm(1) command.

The test uses the values specified by the record size (rsz) and file size (sz) options to determine the following:

• Data record size
• Size of the device file to be created
• Number of data records required to fill the file

The default values for the tests and patterns to be run (specified by the test and pat options, respectively) are designed for optimum functionality. Selecting other arguments for these options may reduce test coverage.

If a failure occurs, messages are output to stdout, provided the program is in control after the failure. However, you can redirect output from stdout to a specified file.

3.1.1 TEST SYNOPSIS

The olcfdt command options can be entered in any order. If an option is omitted, the program uses the default value.

Synopsis:

    olcfdt [disp display] dt type [fn file] [help] [maxp n] [ntks] [pat patterns] [rsz n] [seed n] [sz n] [test tests] [upat n]

disp display
    Selects the error information and history display. The default is err (all error information is displayed). display is one of the following:

    Value       Description

    hst         Displays a history of the current iteration (test pattern and test sections executed)
    err         Displays all error information
    none        Does not display error information or a history of the current iteration
    all         Displays all error information and a history of the current iteration

dt type
    Device type (required). If the specified device type is not associated with the specified file name, the program overrides the dt command option and tests the device type associated with file.
    type is one of the following (only one device type can be selected at a time):

    Device Type     Description

    dd10            DD-10 disk drive
    dd19            DD-19 disk drive
    dd29            DD-29 disk drive
    dd39            DD-39 disk drive
    dd40            DD-40 disk drive
    dd49            DD-49 disk drive
    bmr             Buffer memory resident storage
    ssd             SSD solid-state storage device

fn file
    File name. file is the absolute path name to a file. The created file is permanent. When assigning a file, you must know which directory is associated with the selected device type. Consult your CRI analyst to determine the directory associated with a specific device. The default is workfil under the current working directory.

help
    Produces an on-line help display containing a synopsis and brief description of the command options and arguments. If the help option is entered with a test name, help information is written to stdout, and the test terminates.

maxp n
    Pass count (decimal). On each pass, all selected test patterns and test sections are run. The default for n is 512.

ntks
    File size is in number of tracks. This command option indicates that the argument associated with the sz command option is the file size in number of tracks (decimal). If allowed to default, the file size is in data sectors (decimal).

pat patterns
    Patterns to be run. The default is all (all test patterns are run). If the upat option is specified, you must either set pat to all or include user in the list of arguments.

    patterns is a comma-separated list of up to nine test pattern arguments. Duplicate entries are allowed. For example:

        pat zeros,ones

    patterns can be one of the following:

    Argument    Pattern

    zeros       All 0's
    ones        All 1's
    chkbrd      Checkerboard (1252525252525252525252B, 0525252525252525252525B, ...)
    chkbrdc     Complement of the chkbrd pattern
    rvi         Record/word index.
                The record number in the upper 31 bits of the data word, followed by the data word number within the record in the lower 33 bits (hardware numbered bits).
    rvic        Complement of the rvi pattern
    fpn         Random floating-point numbers
    rdm         Random numbers
    user        User pattern. This is the pattern specified by the upat option (upat must be specified if this argument is entered).
    all         All patterns are run (default). The patterns are processed in the following order:

                    zeros,ones,chkbrd,chkbrdc,rvi,rvic,fpn,rdm,user

                The user argument is processed only if the upat option is entered. all is a stand-alone argument.

rsz n
    Record size in data words. n is a decimal record size of 512, or a multiple thereof, up to a maximum value of 4096. The default is 512 words.

seed n
    Random number seed. n is an octal value that is less than or equal to 48 bits. The default for n is rdm, which selects the nearest integer of the product of a random number and the real-time clock.

sz n
    File size (decimal). If sz n is specified without the ntks command option, the file size is in data sectors; if ntks is specified, the file size is in number of tracks. The minimum value for n is 1. The maximum value for n is as follows:

        (Track size * number of tracks) - 1
        or
        Maximum file size allowed by the system

    The default for n is the track size of the device specified by the command option dt.

test tests
    Test sections to be run. The test does a sequential write before executing the selected test sections. The default for tests is all (all test sections are run).

    tests is a comma-separated list of up to three test section entries. The test sections are processed in the order in which they are entered on the command line. Duplicate entries are allowed. For example:

        rw,rw,rr

    tests can be one of the following:

    Test Section    Description

    rr              Random read; performs random reads on the work file. A data compare is performed on each record read.
                    On a miscompare, a message is displayed and the program is aborted.
    rw              Random write; performs random writes on the work file. This section automatically performs a sequential read (sr) if sr is not selected after a random write (rw). For example, the following entry runs test sections rr, rw, and sr, respectively:

                        test rr,rw

    sr              Sequential read; reads the work file sequentially. A data compare is performed on each record read. On a miscompare, a message is displayed and the program is aborted.
    all             Runs all test sections (default). This is a stand-alone argument. The tests are run in the following order: rr,rw,sr.

upat n
    User pattern. n is an octal value that is less than or equal to 64 bits. An error occurs if the upat option is not specified when user is entered in the argument list for the pat option. The default is no user pattern.

3.1.2 TEST EXAMPLES

This subsection contains olcfdt execution examples.

The following example runs olcfdt using default command options to test a DD-29 disk drive. It is assumed that the current user directory is associated with the DD-29 disk drive to be tested.

Example:

    olcfdt dt dd29 rsz 512

Output:

    olcfdt dt dd29 rsz 512
    olcfdt submitted on Wed Mar 11 15:38:30 1987
    odt06 - Test completed.

The following example runs olcfdt using user-specified command options to test a DD-29 disk drive. It is assumed that the specified file name, /w/xxx/yyy, is associated with the DD-29 disk drive to be tested.

Example:

    olcfdt fn /w/xxx/yyy dt dd29 sz 36 rsz 512 test all pat all upat 707070707070707070707 seed 7070707070707070 maxp 10 disp none

Output:

    olcfdt fn /w/xxx/yyy dt dd29 sz 36 rsz 512 test all pat all upat 707070707070707070707 seed 7070707070707070 maxp 10 disp none
    olcfdt submitted on Wed Mar 11 16:26:20 1987
    odt06 - Test completed.

The following example runs olcfdt using default options and the checkerboard data pattern to test a DD-29 disk drive. The test displays the data compare error output by default.
The test output indicates that a data compare error was detected at word 99 of record 9. The test displays expected, actual, and difference data for the following words:

• Ten words on either side of the failing word
• Last word of the preceding record
• First word of the next record

If there are fewer than 10 words preceding or following the word that failed, more words are displayed from one side than the other to make up the difference.

In the following example, data information is displayed for words 89 through 109 of record 9, word 1024 of record 8, and word 1 of record 10.

Example:

    olcfdt dt dd29 pat chkbrd rsz 1024

Output:

    olcfdt dt dd29 pat chkbrd rsz 1024
    olcfdt submitted on Wed Mar 11 13:14:19 1987
    odt14 - Data compare error.

    ***** DATA COMPARE ERROR *****

    FILENAME                workfil
    FILE SIZE               18
    DEVICE TYPE             dd29
    CURRENT DATA PATTERN    chkbrd
    CURRENT TEST            sr
    ITERATION COUNT         512
    NUMBER OF PASSES        100
    RECORD SIZE             1024
    NUMBER OF RECORDS       13
    FAILING RECORD NUMBER   9
    FAILING WORD NUMBER     99
    USER PATTERN            0000000000000000000000
    RANDOM NUMBER SEED      0000003427130120254365

    WORD   EXPECTED                 ACTUAL                   DIFFERENCE
      89   1252525252525252525252   1252525252525252525252   0000000000000000000000
      90   0525252525252525252525   0525252525252525252525   0000000000000000000000
      91   1252525252525252525252   1252525252525252525252   0000000000000000000000
      92   0525252525252525252525   0525252525252525252525   0000000000000000000000
      93   1252525252525252525252   1252525252525252525252   0000000000000000000000
      94   0525252525252525252525   0525252525252525252525   0000000000000000000000
      95   1252525252525252525252   1252525252525252525252   0000000000000000000000
      96   0525252525252525252525   0525252525252525252525   0000000000000000000000
      97   1252525252525252525252   1252525252525252525252   0000000000000000000000
      98   0525252525252525252525   0525252525252525252525   0000000000000000000000
      99   1252525252525252525252   1777777777777777777777   0525252525252525252525
     100   0525252525252525252525   0525252525252525252525   0000000000000000000000
     101   1252525252525252525252   1252525252525252525252   0000000000000000000000
     102   0525252525252525252525   0525252525252525252525   0000000000000000000000
     103   1252525252525252525252   1252525252525252525252   0000000000000000000000
     104   1252525252525252525252   1252525252525252525252   0000000000000000000000
     105   0525252525252525252525   0525252525252525252525   0000000000000000000000
     106   1252525252525252525252   1252525252525252525252   0000000000000000000000
     107   0525252525252525252525   0525252525252525252525   0000000000000000000000
     108   1252525252525252525252   1252525252525252525252   0000000000000000000000
     109   0525252525252525252525   0525252525252525252525   0000000000000000000000

    ***** LAST WORDS OF PREVIOUS RECORD *****

    WORD   EXPECTED                 ACTUAL
    1024   0525252525252525252525   0525252525252525252525

    ***** FIRST WORDS OF NEXT RECORD *****

    WORD   EXPECTED                 ACTUAL
       1   1252525252525252525252   1252525252525252525252

The following example runs olcfdt with user-specified command options to test a DD-29 disk drive. Test output is sent to /a/b/ccc.

Example:

    olcfdt fn /w/xxx/yyy dt dd29 sz 36 rsz 4096 test all pat rdm seed 7070707070707070 > /a/b/ccc

3.1.3 TEST MESSAGES

The olcfdt test produces the following types of messages:

• Informative
• Error

These messages are listed in the subsections that follow.

3.1.3.1 Informative messages

This subsection lists the informative messages, which are sent to stdout (standard output device).

odt06 - Test completed.

odt16 - iteration pattern tests
odt16 - iteration pattern tests
    This message is generated if the disp command option is set to display the history of the current iteration. On each iteration through the test, the selected device is tested with one of the selected patterns in all of the selected test sections.
    The following information is displayed:

    iteration   Current iteration
    pattern     Current test pattern (64-bit octal word)
    tests       Test sections being run

3.1.3.2 Error messages

This subsection lists the error messages, which are sent to stderr (standard error device).

odt01 - Option x is invalid.
    Enter a valid option and rerun.

odt02 - Argument x is invalid.
    Enter a valid argument and rerun.

odt03 - Too many items in value list 1.
    Reenter argument list and rerun.

odt04 - Required option x is not present.
    Enter option x and rerun.

odt15 - Argument is missing.
    An option requiring an argument was entered alone. Reenter the option with an argument and rerun the test.

The following error messages are sent to stdout:

odt05 - Specified record size exceeds
odt05 - the maximum limit of 4096.
    Reenter the rsz option and rerun.

odt07 - Cannot open file.
    Contact your CRI representative.

odt08 - Cannot close file.
    Contact your CRI representative.

odt09 - Cannot seek file.
    Contact your CRI representative.

odt10 - Cannot read file.
    Contact your CRI representative.

odt11 - Cannot write file.
    Contact your CRI representative.

odt12 - User pattern option (upat) must be specified
odt12 - when pattern option (pat) is 'user'.
    Enter the upat option and rerun.

odt13 - Pattern option (pat) must be 'user' or 'all'
odt13 - when the user pattern option (upat) is specified.
    Enter the pat option and rerun.

odt14 - Data compare error.
    Examine the error output to identify the point at which the failure occurred.

3.2 olcfpt

The olcfpt test is an on-line comprehensive floating-point test. It generates floating-point instructions and data to detect data-sensitive failures in the floating-point functional units. The generated instructions are simulated and then executed. The simulation and execution results are compared, and any differences are reported. This process continues until the maximum pass, error, or time limit is reached.
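The generate, simulate, execute, and compare cycle described above can be sketched in Python as follows. This is an illustrative sketch only and not the olcfpt implementation: the names run_passes, generate, simulate, and execute are hypothetical, the CPU-time and wall-clock limits are omitted, and the default values for maxp, maxerr, and seed are the defaults listed in this manual (0'1000, 1, and 0'33).

```python
import random


def run_passes(generate, simulate, execute, maxp=0o1000, maxerr=1, seed=0o33):
    """Repeat generate/simulate/execute/compare until a pass or error limit is hit.

    All names are hypothetical; time limits (cputime, time) are not modeled.
    """
    rng = random.Random(seed)          # same seed -> same instruction/data sequence
    passes = errors = 0
    while passes < maxp and errors < maxerr:
        program = generate(rng)        # random instructions and data
        expected = simulate(program)   # results predicted by simulation
        actual = execute(program)      # results from the unit under test
        passes += 1
        if expected != actual:         # any difference counts as an error
            errors += 1
            print(f"pass {passes:o}: expected {expected!r}, actual {actual!r}")
    return passes, errors


# Toy stand-ins so the sketch runs: the "program" is a pair of numbers to add.
def toy_generate(rng):
    return rng.random(), rng.random()


def toy_simulate(program):
    a, b = program
    return a + b


def toy_execute(program):
    a, b = program
    return a + b


if __name__ == "__main__":
    # maxp is given in octal, as on the command line: 0o10 is eight passes.
    print(run_passes(toy_generate, toy_simulate, toy_execute, maxp=0o10))
```

With the toy functions, the run completes eight passes with no errors and prints (8, 0). Reusing the same seed reproduces the same sequence, which is the reason the manual recommends the seed options for recreating a failure.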
If an error is detected, the diagnostic attempts to isolate the failing data.

3.2.1 TEST SYNOPSIS

The olcfpt command options can be entered in any order. If an option is omitted, the program uses the default value. The test synopsis lists the olcfpt command options and arguments in the following order:

1. Monitor options
2. Test-specific options
3. Data pattern options
4. Instruction options

Synopsis:

    olcfpt [chkpnt mode] [cpu clist] [cputime h:m:s] [+/-getseed] [getseed file] [help] [maxerr n] [maxp n] [+/-parcel] [time h:m:s] [+/-verbose] [+xmp] [+cray1]†
           [disable ilist] [enable ilist] [+/-isolate] [isop n] [numins n] [+/-repeat] [seed n] [vl n] [+/-vload]
           [+/-fpbits] [+/-fprand] [+/-random]
           [+/-fpadd] [+/-fpmult] [+/-fprecip] [+/-scalar] [+/-vector]

† The monitor command options are described in section 2, Confidence Test and Monitor Overview.

disable ilist
    Deselects specific instructions. Enter ilist in the following format:

        n,n,...,n

    n is the octal value in the gh field of the specific instruction. The disable ilist option overrides the enable ilist option and any selected (+) or deselected (-) instruction options.

enable ilist
    Selects specific instructions. Enter ilist in the following format:

        n,n,...,n

    n is the octal value in the gh field of the specific instruction. The enable ilist option overrides any selected (+) or deselected (-) instruction options. When the test is run with default values for the +/- instruction options, and the enable ilist option is selected, only the instructions specified by the enable ilist option are run.

+/-isolate
    Enables (+isolate) or disables (-isolate) the error isolation option. The default is +isolate.

isop n
    Sets the isolation pass limit to n (octal). During isolation, the diagnostic repeatedly executes the suspected failing sequence. If the sequence fails, the loop terminates and the diagnostic attempts to isolate the sequence further.
    If the sequence does not fail, the loop terminates after n passes, and olcfpt assumes that the error is not in the tested sequence. The default for n is 0'1000.

numins n
    Sets the number of instructions to be generated. n can be any octal value within the range 1 through 0'20. The default for n is 0'20.

+/-repeat
    Enables (+repeat) or disables (-repeat) the option that repeats the first pass until the diagnostic terminates. +repeat is useful for recreating an error. It is normally used with one of the following options: seed n, +getseed, or getseed file. The default is -repeat (the program generates new instructions and data after each pass).

seed n
    Sets the random seed to n. n can be any 64-bit octal value. If n is 0, the test reads the real-time clock and uses the value for the initial seed. The default for n is 0'33. If seed n is selected, do not select +getseed or getseed file.

vl n
    Sets the vector length to n. n can be any octal value in the range 0 through 0'100. If vl is set to 0, a random vl value is used to initialize the test. The default for n is 100.

+/-vload
    Selects (+vload) or deselects (-vload) vector instructions for the instruction buffer and, in the case of -vload, does not allow you to load (write) or save (read) the vector registers. -vload overrides vector instructions selected by +vector and enable ilist. The default is +vload.

+/-fpbits, +/-fprand, +/-random
    Selects (+) or deselects (-) specific data patterns. If allowed to default, all of the data patterns are run. If the vl option is 0 or not specified, the vector length register is initialized with 6 bits of random data. The data patterns are as follows:

    Option      Data Pattern

    fpbits      Random number of consecutive 1-bits in the coefficient. Exponent data depends on the floating-point instruction. For example:

                    0370000000000007740000
                    1574777740003777777777
                    0217600000000000030000
                    0237740000000000100000

    fprand      Random bit generation in the coefficient.
Exponent data depends on the floating-point instruction. For example: 0224055214537525453301 1327217472141363076211 random Random bit generation in a word. For example: 1023122123232122777127 0003423100233344322177 1640034356453221213532 1123235467543221322120 1304322300332105534311 SMM-1012 C CRAY PROPRIETARY 3-13 +I-fpadd, +I-f~ult, +I-fprecip, +I-scalar, +I-vector Selects (+) or deselects (-) specific instruction groups for the following options: Option Instruction Type fpadd fpaault fprecip scalar vector Floating-point addition Floating-point multiply Floating-point reciprocal Scalar instruction (destination) Vector instruction (destination) If allowed to default, all instruction groups are run. groups are as follows: 3.2.2 Option Instruction Group fpadd 062, 063 170 through 173 fpmult 064 through 067 160 through 167 fprecip 070, 174 scalar 062, 063 064 through 067 070 vector 160 through 167 170 through 174 The TEST EXECUTION The olcfpt execution sequence is as follows: 1. 2. 3. 4. 5. 6. Test initialization Random floating-point instruction and data generation Random floating-point instruction buffer simulation Random floating-point instruction buffer execution Comparison of simulation and execution results Error isolation Steps 2 through 5 occur on each pass through the test loop. occurs only on error. 3-14 CRAY PROPRIETARY Step 6 SMM-1012 C 3.2.2.1 Test initialization At test initialization, the selected instructions are processed in the following order: 1. All instructions are initially enabled unless either of the following occurs (in which case no instructions are initially enabled): • An instruction group is selected (+option) • An enable option is entered and there are no deselected (-option) instruction group entries 2. Selected groups are processed, enabling instructions in the selected groups. 3. Deselected groups are processed, disabling instructions in the deselected groups. 4. 
Individually selected instructions are processed (all instructions specified by the enable option). 5. Individually deselected instructions are processed (all instructions specified by the disable option). 6. Vector instructions disabled by -vload are processed. 7. If no instructions are selected, an error message is displayed and the test is terminated. 3.2.2.2 Random floating-point instruction and data generation These routines build and generate the floating-point instruction buffer and initial data. Instructions for the buffer are randomly selected from a list of enabled floating-point instructions. If the i, j, or k field is represented by an x in the Cray Assembly Language (CAL), a 0 is used for the field (for additional information, refer to the CRAY Y-MP, CRAY X-MP EA, CRAY X-MP and CRAY-1 CAL Assembler Version 2 Ready Reference, CRI publication SQ-0083). 3.2.2.3 Random floating-point instruction buffer simulation After the instructions and data are generated, the floating-point instruction buffer is simulated by the master CPU only. The save monitor routine saves the results. SMM-1012 C CRAY PROPRIETARY 3-15 Each instruction type has a unique simulation routine. The simulation routines use machine resources differently from the instruction being simulated. For example, the scalar add, pop, leading zero, and logical functional units are used to simulate the floating-point add functional unit. 3.2.2.4 Random floating-point instruction buffer execution After the instructions are simulated, all of the selected CPUs execute the floating-point instruction buffer. Before the instructions can be executed, the program loads the following: • • • Scalar registers Vector "registers Vector length register Then an unconditional jump to the floating-point instruction buffer is executed. At the end of the floating-point instruction buffer is an unconditional jump to a routine that unloads the contents of all the registers. The save monitor routine saves the results. 
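
The selection order given in 3.2.2.1, Test initialization, can be illustrated with a short Python sketch. This is not the olcfpt source; the function name, argument names, and the modeling of -vload as removal of the vector group are assumptions made for illustration. Only the group contents and the processing order are taken from this section.

```python
# Illustrative sketch only: models the documented selection order, not the
# actual olcfpt code. All names here are hypothetical.

# Instruction groups by gh field (octal), as listed in 3.2.1.
GROUPS = {
    "fpadd": {0o062, 0o063, *range(0o170, 0o174)},
    "fpmult": {*range(0o064, 0o070), *range(0o160, 0o170)},
    "fprecip": {0o070, 0o174},
    "scalar": {*range(0o062, 0o071)},
    "vector": {*range(0o160, 0o175)},
}
ALL_INSTRUCTIONS = set().union(*GROUPS.values())


def select_instructions(plus=(), minus=(), enable=(), disable=(), vload=True):
    """Return the set of enabled gh values after steps 1 through 7."""
    # Step 1: start empty if a +group is given, or if enable is given
    # with no -group entries; otherwise start with everything enabled.
    if plus or (enable and not minus):
        enabled = set()
    else:
        enabled = set(ALL_INSTRUCTIONS)
    # Step 2: selected groups.
    for group in plus:
        enabled |= GROUPS[group]
    # Step 3: deselected groups.
    for group in minus:
        enabled -= GROUPS[group]
    # Step 4: individually selected instructions (enable ilist).
    enabled |= set(enable)
    # Step 5: individually deselected instructions (disable ilist).
    enabled -= set(disable)
    # Step 6: -vload removes vector instructions (assumed here to mean
    # the whole vector group).
    if not vload:
        enabled -= GROUPS["vector"]
    # Step 7: nothing left to run is an error.
    if not enabled:
        raise ValueError("No executable instructions selected.")
    return enabled
```

Under these assumptions, select_instructions(plus=["fpmult"], enable=[0o070, 0o174]) corresponds to the command line entry +fpmult enable 70,174, and select_instructions(minus=["fpmult"]) corresponds to -fpmult.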
3.2.2.5 Comparison of simulation and execution results

After the instructions are executed in all of the selected CPUs, the compare monitor routine compares the results, and one of the following actions occurs:

• If the results match, the test proceeds with the next data pattern. After all of the selected data patterns are run, the pass count is incremented.

• If the results do not match, the test dumps all of the data related to the suspected failure and, if the isolation option is enabled (+isolate), attempts to isolate the failure.

3.2.2.6 Error isolation

If an error is detected and the isolation option is enabled (+isolate), the test attempts to identify and isolate the failing instruction by executing the instructions in the floating-point instruction buffer, one at a time.

For scalar instructions, error isolation occurs as follows:

1. The j operand is set to 0. If no error is detected, the operand is restored.

2. The k operand is set to 0. If no error is detected, the operand is restored.

3. Each bit of the j operand is set to 0 (one at a time). If no error is detected, the bit is restored.

4. Each bit of the k operand is set to 0 (one at a time). If no error is detected, the bit is restored.

For vector instructions, error isolation occurs as follows:

1. Each element of the j operand is set to 0 (one at a time). If no error is detected, the element is restored.

2. Each element of the k operand is set to 0 (one at a time). If no error is detected, the element is restored.

3. Each bit of the j operand is set to 0 (one at a time, for all elements). If no error is detected, the bit is restored.

4. Each bit of the k operand is set to 0 (one at a time, for all elements). If no error is detected, the bit is restored.
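
The scalar isolation steps above amount to reducing the operands to the smallest data that still fails. The following Python sketch shows that reduction under stated assumptions: the function name is hypothetical, and fails(j, k) is a stand-in for "execute the instruction with these operands and compare against the simulation" (in the real test, that check would be repeated up to the isop n pass limit). It is not the olcfpt implementation.

```python
# Illustrative sketch only: a hypothetical model of the scalar isolation
# steps, not the olcfpt code.

WORD_BITS = 64  # Operands are treated as 64-bit words.


def isolate_scalar(j, k, fails):
    """Reduce operands j and k while fails(j, k) still reports an error."""
    # Steps 1 and 2: zero a whole operand; keep the zero only if the
    # error is still detected, otherwise the operand is restored.
    if fails(0, k):
        j = 0
    if fails(j, 0):
        k = 0
    # Step 3: clear each set bit of j, one at a time; restore the bit
    # if the error is no longer detected.
    for bit in range(WORD_BITS):
        trial = j & ~(1 << bit)
        if trial != j and fails(trial, k):
            j = trial
    # Step 4: the same for k.
    for bit in range(WORD_BITS):
        trial = k & ~(1 << bit)
        if trial != k and fails(j, trial):
            k = trial
    return j, k
```

For example, if the failure needs bit 2 of the j operand and bit 0 of the k operand to be set, then isolate_scalar(0o777, 0o17, lambda j, k: bool(j & 4) and bool(k & 1)) reduces the operands to (4, 1). The vector case would follow the same pattern, with an outer pass over elements before the pass over bits.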
When the isolation process terminates, the output dump contains the following:

• Floating-point instruction buffer
• Data used when the failure occurred
• Simulated execution results
• Actual execution results (if different from the simulated results)
• An exclusive OR of the simulated and actual execution results

If the failure is very intermittent, the isolation process may terminate without detecting an error, and then the output dump does not contain any actual execution results (differences). In this case, increase the value of isop n, enable the +repeat option, select the failing CPU, and use the failing seed to rerun the test.

The program may report an error resulting from a failure in either the simulated or actual execution. To determine if the error is the result of an actual execution failure, start olcfpt in a different CPU and select the suspected failing CPU. For example, the following entry starts olcfpt in CPU c:

    olcfpt cpu c

If olcfpt fails, and the simulated execution is suspect, rerun olcfpt using a different master CPU and the failing seed, as follows:

    olcfpt cpu a,c +repeat seed n

If olcfpt fails in CPU c, the failure is in the actual execution of the floating-point instruction buffer. If olcfpt does not fail, the error is either in the simulated execution results from CPU c or it is very intermittent.

3.2.3 TEST TERMINATION

For information on test termination, refer to section 2, Confidence Test and Monitor Overview.

3.2.4 TEST EXAMPLES

This subsection contains olcfpt execution examples.

The following example runs olcfpt for 0'10000000 passes. Output is redirected to olcfpt.log. The nohup(1) command allows the program to continue executing after you log off the system. You can later log on to check the test's progress. The ampersand (&) causes the entire command to execute in the background, so that another prompt is immediately displayed and you can continue to use the system.

    nohup olcfpt maxp 10000000 >olcfpt.log &

The following example runs olcfpt with selected command options and shell facilities. The test runs for 0'1000000 passes in CPU a with all default instructions. The job runs as a background process, and the output is sent to olcfpt.log.

    olcfpt maxp 1000000 cpu a >olcfpt.log &

The following example shows a procedure for determining how frequently an error occurs. The test is rerun with the +repeat option, so that the first pass is run repeatedly until the test terminates. The test uses the seed value from the output at the time of the initial error. Error isolation is disabled. The output is filtered to olcfpt.log.

    olcfpt +repeat -isolate maxerr 100 maxp 100 cpu d seed 1436651016713554002511 | tail >olcfpt.log &

The following example runs olcfpt with floating-point multiply instructions, and instructions 70 and 174.

    olcfpt +fpmult enable 70,174 >olcfpt.log &

The following example runs olcfpt with all of the floating-point vector instructions except instructions 166 and 167.

    olcfpt +vector disable 166,167 >olcfpt.log &

The following example runs olcfpt with all of the instructions except floating-point multiply.

    olcfpt -fpmult >olcfpt.log &

The following example shows the output displayed when olcfpt is run with all default values.

    olcfpt

Output:

    olcfpt
    olcfpt started in cpu A on Tue Aug 25 15:32:16 1987
    olcfpt reached maximum pass limit with 1000 passes and 0 errors
    on Tue Aug 25 15:32:32 1987

The following example runs olcfpt with the +verbose option enabled so that a line of output is generated after each pass.
    olcfpt +verbose

Output:

    olcfpt +verbose
    olcfpt started in cpu A on Tue Aug 25 11:42:47 1987
    olcfpt: pass = 1, error = 0 Tue Aug 25 11:42:47 1987
    olcfpt: pass = 2, error = 0 Tue Aug 25 11:42:47 1987
    olcfpt: pass = 3, error = 0 Tue Aug 25 11:42:47 1987
      . . .
    olcfpt: pass = 1000, error = 0 Tue Aug 25 11:43:03 1987
    olcfpt reached maximum pass limit with 1000 passes and 0 errors
    on Tue Aug 25 11:43:03 1987

The following example runs olcfpt in CPU c only.

    olcfpt cpu c

Output:

    olcfpt cpu c
    olcfpt started in cpu C on Tue Aug 25 11:44:51 1987
    olcfpt reached maximum pass limit with 1000 passes and 0 errors
    on Tue Aug 25 11:45:07 1987

The following example runs olcfpt in CPUs a and b, with a as the master. On each pass, olcfpt tests a sequence of instructions, using fpbits data for the initial register values.

    olcfpt +fpbits cpu a,b

Output on an error:

    olcfpt +fpbits cpu a,b
    olcfpt started in cpus A, B with master cpu A on Wed Oct 26 10:38:22 1988
    CRAY X-MP mode
    olcfpt: restart file written to A34408-olcfpt

    name      <  1640> = 'olcfpt
    rev       <  1641> = '5.0
    date      <  1642> = '10/21/88'
    pass      <  1643> = 11
    error     <  1644> = 1
    seed      <  1645> = 1260350316637024772740
    failpat   <  3254> = 'fpbits
    isop      <  1656> = 1000

    random floating-point instruction buffer
    ibuff (the floating-point instruction buffer is displayed)

    6040a   165431          V4 = V3*RV1
    6040b   063556          S5 = S5-FS6
    6040c   062607          S6 = S0+FS7
    6040d   062031          S0 = S3+FS1
    6041a   066742          S7 = S4*RS2
    6041b   163360          V3 = V6*HV0
    6041c   163125          V1 = V2*HV5
    6041d   174670          V6 = /HV7
    6042a   006000 016400   J    3500a

    initial scalar register data
    inits0    < 12740> = 0200777600017777777777
    inits1    < 12741> = 1200174777777777777777
    inits2    < 12742> = 0201747777740037777777
    inits3    < 12743> = 1200070000000000000100
    inits4    < 12744> = 0201767777400000000007
    inits5    < 12745> = 0277760000000000037777
    inits6    < 12746> = 0277607777777777777617
    inits7    < 12747> = 1200750077777777777776

    initial vector length register
    initvl    < 12750> = 0000000000000000000100

    initial vector register data
    (vector register data is displayed)

Output (continued):

    simulated floating-point instruction buffer results

    The expected data shown below has the following format:

    name + index <offset> = data

    name:    The name of the data dumped on this line.
    index:   The index into the data starting at name. Optional, default: 0.
    offset:  The offset into the data buffer.
    data:    The actual data dumped.

    *** Expected Results ***   cpu A (master)
    Source data buffer at 13640 in Memory
    Memory address in source data buffer = <offset> + 13640 (source data buffer)

    simulated scalar register data results
    s0        <  1100> = 1200174777777777777777
    s1        <  1101> = 1200174777777777777777
    s2        <  1102> = 0201747777740037777777
    s3        <  1103> = 1200070000000000000100
    s4        <  1104> = 0201767777400000000007
    s5        <  1105> = 1277607777777777777617
    s6        <  1106> = 1200677777777777777600
    s7        <  1107> = 0000000000000000000000

    simulated vector length register data results
    vl        <  1110> = 0000000000000000000100

    simulated vector register data results
    (vector register data is displayed)

    Differences are the results from actual execution of the floating-point instruction buffer that differ from the master (simulated or actual) execution.

    s0-s7 = scalar register data results
    vl    = vector length register data result
    v0-v7 = vector register data results

    The difference data shown below has the following format:

    name + index <offset> = data   data differences

Output (continued):

    name:    The name of the data dumped on this line.
    index:   The index into the data starting at name. Optional, default: 0.
    offset:  The offset into the data buffer.
    data:    The actual data dumped. The differences are marked with an asterisk (*) preceding the data word.
    data differences:  The bits in difference between the actual results and the expected results.

    *** Differences ***   cpu A (master)
    Source data buffer at 15640 in Memory copied to save buffer at 112573 in Memory
    Memory address in source data buffer = <offset> + 15640 (source data buffer)
    Memory address in save data buffer = <offset> + 112573 (save data buffer)

    actual floating-point buffer execution results

    *** Differences ***   cpu B
    Source data buffer at 15640 in Memory copied to save buffer at 113705 in Memory
    Memory address in source data buffer = <offset> + 15640 (source data buffer)
    Memory address in save data buffer = <offset> + 113705 (save data buffer)

    actual floating-point buffer execution results
    s5        <  1105> = *1277607777776000000000   0000000000001777777617

    Beginning error isolation
    Error isolation complete

    name      <  1640> = 'olcfpt
    rev       <  1641> = '5.0
    date      <  1642> = '10/21/88'
    pass      <  1643> = 11
    error     <  1644> = 1
    seed      <  1645> = 1260350316637024772740
    failpat   <  3254> = 'fpbits
    isop      <  1656> = 1000

    isolation: random floating-point instruction buffer
    ibuff

    6040b   063556          S5 = S5-FS6
    6040c   006000 016400   J    3500a

Output (continued):

    isolation: initial scalar register data
    inits0    < 12740> = 0200777600017777777777
    inits1    < 12741> = 1200174777777777777777
    inits2    < 12742> = 0201747777740037777777
    inits3    < 12743> = 1200070000000000000100
    inits4    < 12744> = 0201767777400000000007
    inits5    < 12745> = 0200000000000000002000
    inits6    < 12746> = 0000000000000000000000
    inits7    < 12747> = 1200750077777777777776

    (From this point on, the dump is similar to the previously listed portion of the dump that displayed the unisolated error information.)

    The first address (FADD) of the diagnostic is 1640a
    olcfpt reached maximum error limit with 11 passes and 1 errors
    on Wed Oct 26 10:40:37 1988

3.2.5 TEST MESSAGES

The olcfpt test produces the following types of messages:

• Informative
• Error

These messages are described in the subsections that follow.

3.2.5.1 Informative messages

If no error occurs, olcfpt produces two messages, one at start-up time and another at test termination.
If the +verbose option is enabled, a message is sent to stdout (standard output device) after each pass through the test loop.

On an error, the test provides information such as the following:

• Pass and error counts
• Seed at the beginning of the pass on which the error occurred
• Contents of the instruction buffer
• Initial data
• Data results from the simulated instruction execution in the master CPU
• Differences between the simulated execution results from the master CPU and the actual execution results from all of the selected CPUs

3.2.5.2 Error messages

One of the following error messages is sent to stderr (standard error device) if an invalid command option is entered:

olcfpt: selins: No executable instructions selected.
    Correct and rerun.

olcfpt: selins: Vector length must be in the range of 0 through 100.
    Correct the vl option and rerun.

olcfpt: No data patterns(s) selected.
    All data patterns are deselected. Correct and rerun.

One of the following error messages is sent to stderr if olcfpt detects an unexpected error. Select a different master CPU and rerun the test. If the problem persists, contact your CRI representative.

olcfpt: simulate: (software error) The gh field is greater than 177.

olcfpt: simulate: (software error) The instruction does not have a simxxx routine.

olcfpt: generate: (software error) The instruction does not have a genxxx routine.

3.3 olcm

The olcm test is an on-line central memory test. It tests central memory and the paths for the S, T, B, and V registers by using unique algorithms that perform an ascending and descending READ/TEST/WRITE loop of central memory, one word at a time with scalars and one block (0'100 words) at a time with the T, B, and V registers. olcm also has a random-data section and a section to create memory conflicts. olcm runs on CX/CEA and CX/1 systems.

3.3.1 TEST SYNOPSIS

The olcm command options can be entered in any order. If an option is omitted, the program uses the default value. The test synopsis lists the olcm command options and arguments in the following order:

1. Monitor options
2. Test-specific options

Synopsis:

olcm [chkpnt mode] [cpu clist] [cputime h:m:s] [getseed file] [+|-getseed]
     [help] [maxerr n] [maxp n] [+|-parcel] [time h:m:s] [+|-verbose]
     [+xmp] [+cray1]†
     [section slist] [seed n] [words n]

† The monitor command options are described in section 2, Confidence Test and Monitor Overview.

section slist
    Selects the test sections to be executed. slist is entered in the following format:

        n,n,...,n

    n can be any of the following test sections, entered in any order (if allowed to default, all test sections are executed):

    Section   Description

    1         Central memory storage and scalar path test
    2         Central memory storage and T register path test
    3         Central memory storage and B register path test
    4         Central memory storage and vector register path test using only the first vector logical unit
    5         Central memory storage and vector register path test using both vector logical units
    6         Central memory random data test
    7         Central memory conflict test

seed n
    Sets the random seed to n. n can be any 64-bit octal value. If n is 0, the test reads the real-time clock and uses the value for the initial seed. The default for n is 0'33. If seed n is selected, do not select +getseed or getseed file.

words n
    Indicates the number of words to be tested in central memory. n is a value in the range 0'100 through 0'4,000,000. All values are rounded down to the nearest 0'100 words. For example, 0'150 is rounded down to 0'100; 0'1000 remains unchanged. The default for n is 0'3,000.

3.3.2 TEST EXECUTION

The olcm execution sequence is as follows:

1. Test initialization
2. Test section execution
3. Comparison of expected and actual data within each test section
4. Error report

Steps 2 and 3 occur on each pass through the test loop. Step 4 occurs only on error.

3.3.2.1 Test initialization

At test initialization, the test information is processed as follows:

1. The number of words to be tested in central memory is validated (words n).

2. Selected test sections are validated (section slist).

3. The random seed is validated (seed n).

3.3.2.2 Test section execution

The subsections that follow describe the olcm test sections.

Test section 1 - This section tests central memory storage and the scalar paths. The following algorithm is used to perform an ascending and descending read/test/write loop of central memory (one word at a time):

1. Write a 64-bit address pattern to all memory locations in the test buffer.

2. Load the scalar register with the pattern from the address register.

3. Verify data integrity by comparing the memory location written with the 64-bit address pattern to the scalar register. Generate a dump on a data miscompare.

4. Write the 64-bit address pattern to the previously tested memory location.

5. Increment location if ascending, or decrement if descending.

6. Repeat steps 2 through 5 until all locations are written.

Test sections 2 and 3 - These sections test the T and B register paths, respectively, and central memory storage.

The algorithm used in test section 1 is used in these test sections to perform an ascending and descending read/test/write loop of central memory. However, in test sections 2 and 3, the algorithm differs as follows:

• Data transfers are done in 64-word blocks (rather than one word at a time).

• Data transfers use ascending memory addresses only (the descending loops contain descending data blocks with ascending addresses).

Test section 4 - This section tests central memory storage and the vector register paths, using only the first vector logical unit.
The algorithm used in test section 1 is used in this test section to perform an ascending and descending test of central memory storage and the vector register paths. However, in test section 4, the algorithm differs as follows:

• Data transfers are done in 64-word blocks (rather than one word at a time).

• Data transfers use negative indexing in the descending test subsections, so that the 64-bit pattern is stored in the vector registers in reverse order of the way the pattern is stored in test sections 2 and 3.

Test section 5 - This section tests central memory storage and the vector register paths, using both vector logical units. In section 5, the following occurs:

• Vector loads are doubled to force the use of more than one central memory port.

• Vector comparisons are doubled to force the use of both logical units.

• The 64-bit pattern is generated with vector recursion. (In a vector instruction, vector recursion results when Vi and Vj or Vi and Vk refer to the same vector register.)

The algorithm used in test section 1 is used in this test section to perform an ascending and descending test of central memory storage and the vector register paths. However, in test section 5, the algorithm differs as follows:

• Data transfers are done in 64-word blocks (rather than one word at a time).

• Data transfers use negative indexing in the descending test subsections, so that the 64-bit pattern is stored in the vector registers in reverse order of the way the pattern is stored in test sections 2 and 3.

Test section 6 - This section tests central memory by generating random data in the subroutine RANCOM.

The test does the following in subsection 1:

1. Loads random data (64 bits) into V1 (all 100 elements).

2. Writes V1 to the central memory area under test (the same block of 100 random words is written consecutively, so that each 100th word is the same).

3. The central memory area under test is read into V2.

4. V1 and V2 are compared in V0.

The test does the following in subsection 2:

1. Loads random data (64 bits) into T00 through T77.

2. Writes T00 through T77 to the central memory area under test (the same block of 100 random words is written consecutively, so that each 100th word is the same).

3. The central memory area under test is read into S2.

4. The T registers are loaded into S1, one word at a time.

5. S1 and S2 are compared in S0.

The test does the following in subsection 3:

1. Loads random data (32 bits) into B02 through B77. (B00 and B01 are skipped because they are used for return jumps.)

2. Writes B02 through B77 to the central memory area under test (the same block of 100 random words is written consecutively, so that each 100th word is the same).

3. The central memory area under test is read into A2.

4. The B registers are loaded into A1, one word at a time.

5. A1 and A2 are compared in A0.

Test section 7 - This section tests central memory by generating conflicts in the vector reads. The conflicts are generated as follows:

1. Do a vector read from the first memory buffer location to V2, using an increment of 0.

2. Increment the memory location by 0'40.

3. Initiate a fetch.

4. Do a vector read from the memory location (from step 2) to V3, using an increment of 0.

5. Compare V2 and V3 to V4.

6. Increment the memory location (from step 1) by 0'1000, and write V4 to the new memory location, using an increment of 1.

7. Check for error.

8. Increment the memory location (from step 1) by 1.

9. Repeat steps 1 through 8 until all memory locations are read.

The two vector reads to locations 0'40 words apart generate section and subsection conflicts. A fetch issued between the two reads generates conflicts in port D.

3.3.2.3 Comparison of expected and actual data

After each test section is executed, the actual results are compared to the expected results.
If the results match, the test continues. If the results do not match, the test dumps all of the data related to the suspected failure. After all of the selected sections are run, the pass count is incremented.

3.3.2.4 Error report

If an error is detected, the test dumps all of the data related to the suspected failure. The output dump contains the following:

• Diagnostic Information Blocks (DIBs)
• Section and subsection under test
• Number of central memory words being tested
• Expected results
• Actual results
• Differences
• Address of the code at the time the error was detected
• Buffer address of the data at the time the error was detected

3.3.3 TEST TERMINATION

There are several monitor options that can cause a test to terminate. Refer to the information on test termination in section 2, Confidence Test and Monitor Overview.

3.3.4 TEST EXAMPLES

This subsection contains olcm execution examples.

The following example executes olcm for a maximum of 0'500 passes, testing 0'100,000 words of central memory.

    olcm maxp 500 words 100000

The following example executes olcm for a maximum of 0'1500 passes, with test sections 1 and 5 enabled.

    olcm maxp 1500 section 1,5

The following example executes olcm for a maximum of 0'1000 passes (default), using an initial seed value of 12345, with test sections 1, 2, 3, 6, and 7 enabled.

    olcm seed 12345 section 6,3,2,1,7

The following example runs olcm for 0'1000 passes (default), with test sections 1, 2, 3, and 4 enabled. Output is redirected to olcm.log. The nohup(1) command allows the program to continue executing after you log off the system. You can later log on to check the test's progress. The ampersand (&) causes the entire command to execute in the background, so that another prompt is immediately displayed and you can continue to use the system.

    nohup olcm section 1,2,3,4 >olcm.log &

The following example shows the output displayed when olcm is run with all default values.
    olcm

Output:

    olcm
    olcm started in cpu A on Mon Jul 18 11:14:10 1988
    CRAY Y-MP MODE
    olcm reached maximum pass limit with 1000 passes and 0 errors
    on Mon Jul 18 11:14:42 1988

The following example executes olcm for a maximum of 0'1000 passes (default), testing 0'150 words of central memory.

    olcm words 150

Output:

    olcm words 150
    olcm started in cpu A on Fri Jul 15 15:30:12 1988
    CRAY Y-MP MODE
    The value for words was rounded down to the nearest 100 octal words
    olcm reached maximum pass limit with 1000 passes and 0 errors
    on Fri Jul 15 15:30:47 1988

The following example executes olcm for 0 passes (terminated on error), testing 0'1234 words of central memory.

    olcm words 1234

Output (on error):

    error     = 1
    seed      = 33
    failsec   = 1
    words     = 1200
    subs      = 2
    lower     = 13270
    upper     = 14470
    $dif      = 0000000000000000000004
    $exp      = 0000000000000000004467
    $act      = 0000000000000000004463
    $elem     = 0000000000000000000000
    $vm       = 0000000000000000000000

    Error Address of the executing code
    errcode          <  2760> = 0000000000000000004577

    Error Address of the data area
    errdata          <  2761> = 0000000000000000014467

    A registers at the time of error
    savea            <  4340> = 0000000000000000000000
    savea  + 0004    <  4344> = 0000000000000001100333

    S registers at the time of error
    saves            <  4350> = 0000000000000000001234
    saves  + 0004    <  4354> = 0000000000000000000000

    B registers (sections 3 and 6 only)
    $actb            <  3640> = 0000000000000000000000
    $actb  + 0004    <  3644> = 0000000000000000000000
    $actb  + 0010    <  3650> = 0000000000000000000000
      . . .
    $actb  + 0074    <  3734> = 0000000000000000000000

Output (continued):

    T registers (sections 2 and 6 only)
    $actt            <  3740> = 0000000000000006720344
    $actt  + 0004    <  3744> = 0000000015033227672440
    $actt  + 0010    <  3750> = 0000356647785921190300
      . . .
    $actt  + 0074    <  4034> = 3987564008722334539870

    V0 - Difference (section 6 only)
    $difv0           <  4040> = 0000000000000000000000
    $difv0 + 0004    <  4044> = 0000000000000000000000
    $difv0 + 0010    <  4050> = 0000000000000000000000
      . . .
    $difv0 + 0074    <  4034> = 0000000000000000000000

    V1 - Expected (section 6 only)
    $expv1           <  4140> = 0000000000000000000000
    $expv1 + 0004    <  4144> = 0000000000000000000000
    $expv1 + 0010    <  4150> = 0000000000000000000000
      . . .
    $expv1 + 0074    <  4234> = 0000000000000000000000

    V2 - Actual (section 6 only)
    $actv2           <  4240> = 0000000000000000000000
    $actv2 + 0004    <  4244> = 0000000000000000000000
    $actv2 + 0010    <  4250> = 0000000000000000000000
      . . .
    $actv2 + 0074    <  4334> = 0000000000000000000000

    The first address (FADD) of the diagnostic is 550a
    olcm reached maximum pass limit with 0 passes and 1 errors
    on Mon Jul 18 14:58:37 1988

3.3.5 TEST MESSAGES

The olcm test produces the following types of messages:

• Informative
• Error

These messages are described in the subsections that follow.

3.3.5.1 Informative messages

If no error occurs, olcm produces two messages, one at start-up time and another at test termination. If the +verbose option is enabled, a message is sent to stdout (standard output device) after each pass through the test loop.

If the value for words n is rounded down to the nearest 0'100 words, the following informative message is displayed:

    The value for words was rounded down to the nearest 100 octal words.

If the value for seed n is set to 0, the following informative message is displayed:

    Seed selected was 0, so the test read RTC to initial seed.

3.3.5.2 Error messages

One of the following error messages is sent to stderr (standard error device) if an invalid command option is entered:

Invalid section selected. Valid sections are: 1, 2, 3, 4, 5, 6, and 7.
    Rerun olcm using a valid value for section slist.

Number of words selected is too small (minimum is 100 octal).
    Rerun olcm using a valid value for words n.

Number of words selected is too large (maximum is 4,000,000 octal).
    Rerun olcm using a valid value for words n.

System could not allocate words; words selected may be too large.
    Rerun olcm using a smaller value for words n.
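
The handling of words n (range limits, rounding down to a multiple of 0'100, and the related messages) can be summarized in a short Python sketch. The function name, the return convention, and the choice to check the range before rounding are assumptions made for illustration; only the limits, the default, and the message text come from this section.

```python
# Illustrative sketch only: a hypothetical model of how a words n value
# is checked and rounded, not the olcm code.

BLOCK = 0o100            # Rounding granularity, in words.
MIN_WORDS = 0o100        # Documented minimum.
MAX_WORDS = 0o4000000    # Documented maximum.
DEFAULT_WORDS = 0o3000   # Documented default.


def validate_words(n=DEFAULT_WORDS):
    """Return (rounded_value, was_rounded) or raise ValueError."""
    # Assumption: the range check is applied to the value as entered.
    if n < MIN_WORDS:
        raise ValueError(
            "Number of words selected is too small (minimum is 100 octal).")
    if n > MAX_WORDS:
        raise ValueError(
            "Number of words selected is too large (maximum is 4,000,000 octal).")
    # Round down to the nearest multiple of 0'100 words.
    rounded = n - (n % BLOCK)
    return rounded, rounded != n
```

With these assumptions, validate_words(0o150) gives (0o100, True), matching the documented rounding example, and validate_words(0o1234) gives (0o1200, True), which agrees with the words value of 1200 shown in the error dump in 3.3.4. A True second element corresponds to the informative rounding message.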
3.3.5.3 Error output definitions

The following are definitions of the output that is dumped on error. Refer to section 3.3.4, Test Examples, for an example of error output.

Output     Description
failsec    Test section that was executing when the error occurred
words      Size of the central memory buffer being tested
subs       Subsection of the test section
lower      Address of the beginning of the buffer defined by words
upper      Address of the end of the buffer defined by words
errcode    Address where the test code was executing
errdata    Address within the central memory buffer that was being tested at the time the error occurred

3.4 olcrit

The olcrit test is an on-line comprehensive random instruction test. It randomly generates instructions and data to detect instruction-sensitive and data-sensitive sequence failures. The generated instructions are simulated and then executed. The simulation and execution results are compared, and any differences are reported. If an error is detected, the diagnostic attempts to isolate the failing instruction sequence. The test generates, simulates, executes, and compares new instructions and data until the maximum pass, error, or time limit is reached.

The olcrit test runs under the confidence monitor program, olcmon. The olcmon monitor compares the test simulation and execution results. For additional information on olcmon, refer to section 2, Confidence Test and Monitor Overview.

3.4.1 TEST SYNOPSIS

The olcrit command options can be entered in any order. If an option is omitted, the program uses the default value. The test synopsis lists the olcrit command options and arguments in the following order:

1. Monitor options
2. Test-specific options
3. Data pattern options
4. Instruction options

Synopsis:

olcrit [chkpnt mode] [cpu clist] [cputime h:m:s] [+|-getseed] [getseed file] [help] [maxerr n] [maxp n] [+|-parcel] [time h:m:s] [+|-verbose] [+Emp] [+cray1]†
       [+|-cluster] [cluster n] [disable ilist] [enable ilist] [+|-isolate] [isop n] [numins n] [+|-repeat] [seed n] [vl n] [+|-vload]
       [+|-bits] [+|-onezero] [+|-random]
       [+|-address] [+|-ci] [+|-cm] [+|-ema] [+|-fpadd] [+|-fpmult] [+|-fprecip] [+|-int] [+|-jump] [+|-logical] [+|-pop] [+|-scalar] [+|-shift] [+|-shr] [+|-vector]

† The monitor command options are described in section 2, Confidence Test and Monitor Overview.

+|-cluster
Enables (+cluster) or disables (-cluster) cluster selection. This option is recommended only for sites that run multitasking jobs. If a site runs multitasking jobs and olcrit detects a failure in the shared registers, the only way to determine which cluster was used is to enable the +cluster option. However, selecting a specific cluster with the cluster n option does not ensure that olcrit will be able to access that cluster immediately. The UNICOS scheduler must wait for that cluster to become available. The default is -cluster.

cluster n
Selects a specific cluster. n can be any one of the following cluster numbers associated with the indicated mainframe (cluster number 1 is reserved for the operating system):

Mainframe      Cluster Numbers
CRAY Y-MP/8    2, 3, 4, 5, 6, 7, 10, 11
CRAY Y-MP/4    2, 3, 4, 5
CRAY X-MP/4    2, 3, 4, 5
CRAY X-MP/2    2, 3
CRAY X-MP/1    2, 3

If cluster n is selected, the +cluster option must also be selected. The default for n is a random cluster number.

disable ilist
Deselects specific instructions. Enter ilist in the following format:

    n,n,...,n

n is the octal value in the gh or ghijk field of the specific instruction.
If the gh field does not specify a unique instruction, the ijk field can be used to deselect a specific instruction. For example, the following instructions all have the same gh field:

    0030j0, 0036jk, 0037jk

To deselect the preceding instructions, you must specify the ghijk field, as follows:

    disable 003000,003600,003700

The disable ilist option overrides the enable ilist option and any selected (+) or deselected (-) instruction options.

enable ilist
Selects specific instructions. Enter ilist in the following format:

    n,n,...,n

n is the octal value in the gh or ghijk field of the specific instruction. If the gh field does not specify a unique instruction, the ijk field can be used to select a specific instruction. For example, the following instructions all have the same gh field:

    0030j0, 0036jk, 0037jk

To select the preceding instructions, you must specify the ghijk field, as follows:

    enable 003000,003600,003700

The enable ilist option overrides any selected (+) or deselected (-) instruction options. When the test is run with default values for the +|- instruction options, and the enable ilist option is selected, only the instructions specified by the enable ilist option are run.

When using the enable option to select any of the following instructions, numins n should be greater than 1 or the selected instructions will not be placed in the instruction buffer:

    34 through 37
    56, 57, 76, 77
    100 through 130
    150 through 153
    176, 177

All of these instructions use an A register for operations such as an index or a shift count. Before each of the selected instructions is executed, the test executes an A register load instruction. As a result, if numins is set to 1, there is no buffer space remaining for the instruction using the A register.

+|-isolate
Enables (+isolate) or disables (-isolate) the error isolation option. The default is +isolate.

isop n
Sets the isolation pass limit to n (octal).
During isolation, the diagnostic repeatedly executes the suspected failing sequence. If the sequence fails, the loop terminates and the diagnostic attempts to isolate the sequence further. If the sequence does not fail, the loop terminates after n passes, and olcrit assumes that the error is not in the tested sequence. The default for n is 0'1000.

numins n
Sets the number of instructions to be generated. n can be any octal value within the range 1 through 0'2000. The default for n is 0'200.

+|-repeat
Enables (+repeat) or disables (-repeat) the option that repeats the first pass until the diagnostic terminates. +repeat is useful for recreating an error. It is normally used with one of the following options: seed n, +getseed, getseed file, or +cluster together with cluster n. The default is -repeat (the program generates new instructions and data after each pass).

seed n
Sets the random seed to n. n can be any 64-bit octal value. If n is 0, the test reads the real-time clock and uses the value for the initial seed. The default for n is 0'33. If seed n is selected, do not select +getseed or getseed file.

vl n
Sets the vector length to n. n can be any octal value within the range 0 through 0'100. The default for n is 0.

If vl is set to 0, a random vl value is used to initialize the test and the value may change during the execution of the random instruction buffer.

If the vl value is within the range 1 through 0'100, instruction 00200k is disabled. The vl value is initialized to n and remains set to n during the execution of the random instruction buffer. However, if instruction 00200k is selected by the enable option, the vl value is initialized to n and may change each time a 00200k instruction is executed in the random instruction buffer.

+|-vload
Selects (+vload) or deselects (-vload) vector instructions for the instruction buffer and, in the case of -vload, does not allow you to load (write) or save (read) the vector registers.
-vload overrides vector instructions selected by +vector and enable ilist. The default is +vload.

+|-bits, +|-onezero, +|-random
Selects (+) or deselects (-) specific data patterns. If allowed to default, all of the data patterns are run. The selected data patterns are used for the initial register and memory values. However, the vector length (VL) register is always initialized with 6 bits of random data. The data patterns are as follows:

Option     Data Pattern
bits       Random number of consecutive 1-bits in a word. For example:
               0000017777777776000000
               1777000000000000000377
               1777777777777777777777
               0000000000000000000000
               0000000000100000000000
onezero    Random selection of all 1's or all 0's in a word. For example:
               1777777777777777777777
               0000000000000000000000
random     Random bit generation in a word. For example:
               1023122123232122777127
               0003423100233344322177
               1640034356453221213532
               1123235467543221322120
               1304322300332105534311

+|-address, +|-ci, +|-cm, +|-ema, +|-fpadd, +|-fpmult, +|-fprecip, +|-int, +|-jump, +|-logical, +|-pop, +|-scalar, +|-shift, +|-shr, +|-vector
Selects (+) or deselects (-) specific instruction groups for the following options:

Option     Instruction Type
address    Address register
ci         Compressed index
cm         Central memory
ema        Extended memory addressing
fpadd      Floating-point addition
fpmult     Floating-point multiply
fprecip    Floating-point reciprocal
int        Integer
jump       Jump
logical    Logical
pop        Population/parity count
scalar     Scalar register
shift      Shift
shr        Shared register
vector     Vector register

The instruction groups are as follows:

Option CX/CEA Instructions address 001000, 00200k 002200, 002300 002500 through 002700 01hijkm 010ijkm through 013ijkm CRAY-1 Instructions 001000, 00200k 010ijkm through 013ijkm 020
through 022 023i01 023i01 024, 025 030 through 032 034, 035 024, 025 10hijkm, 11hijkm 020 through 022 026ij7, 027ij7 030 through 032 034, 035 10hijkm, 11hijkm ci t 175ij4, 175ij5 175ij6, 175ij7 None 10h through 13h 34 through 37 176iOO, 1770jO 10h through 13h 34 through 37 176iOO, 1770jO e.at 01hijkm None fpadd 062, 063 170 through 173 062, 063 170 through 173 fpmult 064 through 067 160 through 167 064 through 067 160 through 167 fprecip 070, 174ijO 070, 174ijO int 030 through 032 060 through 061 154 through 157 030 through 032 060 through 061 154 through 157 Extended memory instructions are not available on CEA systems in Y-mode. 3-42 CRAY PROPRIETARY SMM-1012 C +I-address, +I-ci, +1-0., +I-e.a, +I-fpadd, +I-fpmult, +I-fprecip, +I-int, +I-jump, +I-logical, +I-pop, +I-scalar, +I-shift, +I-shr, +I-vector (continued) CRAY-1 Option CX/CEA Instructions Instructions jump 005, 006, 007 010 through 017 005, 006, 007 010 through 017 logical 042 through 051 140 through 147 175 042 through 051 140 through 147 175 pop 026ijO, 026ij1 027ijO 174ij1, 174ij2 026ijO, 026ij1 027ijx 174ij1, 174ij2 scalar 0036jk, 0037jk 014jkm through 017jkm 023ijO 026ijO, 026ij1 027ijO 036 through 071 072i02, 072ij3 073i02, 073ij3 074, 075 12hijkm, 13hijkm 014jkm through 017jkm 023ijO 026ijO, 026ij1 027ijO 036 through 071 074, 075 12hijkm, 13hijkm shift 052 through 057 150 through 153 052 through 057 150 through 153 shr 0036jk, 026ij7, 072i02, 073i02, None vector 0030jO, 073iOO 076, 077 140 through 177 0037jk 027ij7 072ij3 073ij3 003, 073, 076, 077 140 through 177 The diagnostic does not currently execute the following instructions in the random instruction buffer: 0, 002400, 0034jk, 4, 33, 072iOO, 073ij1, 176iOk, 176i1k, 1770jk, 1771jk. 
If allowed to default on a CEA system in Y-mode, all instruction groups are selected with the following exceptions:

• If the cluster number assigned to the job is 0, the shared register (shr) instruction group is deselected.
• The extended memory addressing (ema) instruction group is deselected.

If allowed to default on a CRAY X-MP computer system, all instruction groups are selected with the following exception: if extended memory addressing (ema) or compressed index (ci) hardware is not present in the system, the ema and ci instruction groups are deselected, respectively.

If allowed to default on a CRAY-1 computer system, all instruction groups are selected except ema, ci, and shr. However, the vector population count and parity (pop) instruction group is selected only if pop hardware is present in the system.

3.4.2 TEST EXECUTION

The olcrit execution sequence is as follows:

1. Test initialization and hardware configuration detection
2. Random instruction and data generation
3. Random instruction buffer simulation
4. Random instruction buffer execution
5. Comparison of simulation and execution results
6. Error isolation

Hardware configuration detection occurs only at test initiation. Steps 2 through 5 occur on each pass through the test loop. Step 6 occurs only on error.

3.4.2.1 Test initialization and hardware configuration detection

At test initialization, instructions are processed in the following order:

1. All instructions are initially enabled unless either of the following occurs (in which case no instructions are initially enabled):
   • An instruction group is selected (+option)
   • An enable option is entered and there are no deselected (-option) instruction group entries
2. Selected groups are processed, enabling instructions in the selected groups.
3. Deselected groups are processed, disabling instructions in the deselected groups.
4. If the vl option is set to a value within the range 1 through 0'100, instruction 00200k is deselected.
5. Individually selected instructions are processed (all instructions specified by the enable option).
6. Individually deselected instructions are processed (all instructions specified by the disable option).
7. Any vector instructions disabled by -vload are processed.
8. If no instructions are selected, an error message is displayed and the test is terminated.

The hardware configuration detection routine determines which of the following computer systems is configured:

• CRAY X-MP computer system
• CRAY-1 computer system

Then the hardware configuration detection routine adjusts testing accordingly, by determining the following:

Mainframe       Hardware Configuration Detection Routine
CEA (Y-mode)    Determines whether cluster 0 is in use
CRAY X-MP       Determines whether the system contains extended memory addressing and/or compressed indexing hardware, and whether cluster 0 is in use
CRAY-1          Determines whether the system contains a vector population count functional unit

After determining the hardware characteristics, the routine writes a message to stdout to indicate the type of system detected, and disables all instructions that are not available because of hardware constraints.
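The eight-step processing order in 3.4.2.1 can be sketched in Python as a sequence of set operations. This is only an illustration of the documented ordering, not the diagnostic's code: the function name select_instructions, the use of strings such as "00200k" as instruction identifiers, and the toy group table are assumptions, and the hardware-constraint pass performed by the detection routine is left out.

```python
# Hypothetical sketch of the documented enable/disable ordering (steps 1-8).
def select_instructions(universe, groups, plus=(), minus=(), enable=(),
                        disable=(), vl=0, vload=True, vector=frozenset()):
    # Step 1: start from everything, unless a +group was given, or an enable
    # list was given with no -group entries (then start from nothing).
    if plus or (enable and not minus):
        enabled = set()
    else:
        enabled = set(universe)
    for name in plus:                 # step 2: selected groups
        enabled |= groups[name]
    for name in minus:                # step 3: deselected groups
        enabled -= groups[name]
    if 1 <= vl <= 0o100:              # step 4: fixed vl deselects 00200k
        enabled.discard("00200k")
    enabled |= set(enable)            # step 5: enable ilist
    enabled -= set(disable)           # step 6: disable ilist
    if not vload:                     # step 7: -vload removes vector instructions
        enabled -= set(vector)
    if not enabled:                   # step 8: nothing left is an error
        raise ValueError("No executable instructions selected.")
    return enabled


if __name__ == "__main__":
    # Toy data for illustration only; not the real instruction groups.
    groups = {"fpadd": {"062", "063"},
              "vector": {"140", "146", "147"},
              "address": {"00200k", "030"}}
    universe = set().union(*groups.values())
    print(sorted(select_instructions(universe, groups,
                                     plus=["vector"], disable=["146", "147"])))
```

Because the enable list is applied after the vl check, an explicitly enabled 00200k survives a fixed vl value, and because -vload is applied last it overrides both +vector and enable ilist, which matches the option descriptions in 3.4.1.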
Instruction generation is dependent on the hardware configuration detected, as follows (you can use +|-ci, +|-ema, +|-pop, or +|-shr to override this default instruction generation process):

Mainframe       Instructions Generated
CEA (Y-mode)    All instructions except extended memory addressing instructions are generated.
CRAY X-MP       All instructions are generated with the following exception: compressed indexing and extended memory instructions are generated only if present in the hardware.
CRAY-1          All instructions are generated except the following:
                • A load VL instruction (00200k)
                • Scatter/gather/compressed indexing instructions
                • Extended memory instructions
                • Shared register instructions
                Vector pop/parity instructions are generated only if the hardware contains a vector population count functional unit.

3.4.2.2 Random instruction and data generation

These routines build and generate the random instruction buffer and initial data. Instructions for the buffer are randomly selected from a list of enabled instructions. The values of the i, j, and k fields are randomly selected when appropriate.

3.4.2.3 Random instruction buffer simulation

After the instructions and data are generated, the random instruction buffer is simulated by the master CPU only. The save monitor routine saves the results.

Each instruction type has a unique simulation routine. The simulation routines use machine resources differently from the instruction being simulated. For example, the address multiply functional unit may be simulated with the floating-point multiply functional unit.

3.4.2.4 Random instruction buffer execution

After the instructions are simulated, all of the selected CPUs execute the random instruction buffer code.
Before the instructions can be executed, the program loads the following:

• Vector registers
• Vector length register
• Vector mask register
• Address registers
• B registers
• T registers
• Semaphore registers
• Shared T registers
• Shared B registers
• Scalar registers
• Central memory

Then an unconditional jump to the random instruction buffer is executed. At the end of the random instruction buffer is an unconditional jump to a routine that unloads the contents of the registers and central memory. The save monitor routine saves the results.

3.4.2.5 Comparison of simulation and execution results

After the instructions are executed in all of the selected CPUs, the compare monitor routine compares the results, and one of the following actions occurs:

• If the results match, the test proceeds with the next data pattern. After all of the selected data patterns are run, the pass count is incremented.
• If the results do not match, the test dumps all of the data related to the suspected failure and, if the isolation option is enabled (+isolate), attempts to isolate the failure.

3.4.2.6 Error isolation

If an error is detected and the isolation option is enabled (+isolate), the test attempts to reduce the random instruction buffer to the minimum number of failing instructions. The isolation process consists of two parts.

In the first part of the isolation process, the instruction buffer is shortened from the end, one instruction at a time. The isolation routine initially tests the number of instructions to be generated minus one (numins n-1). The routine executes until the specified number of passes is reached (isop n) or an error is detected. If an error is detected, the number of instructions tested is decremented by one, and testing continues for isop n passes. This process continues until no errors are detected or there are no remaining instructions to be tested.
If there are no remaining instructions to be tested and the test detects an error resulting from loading and unloading the registers, the test generates an output dump and the isolation process terminates.

In the second part of the isolation process, the last instruction removed is tested by itself for isop n passes. If an error is not detected, the last instruction removed and the instruction preceding it in the random instruction buffer are tested for isop n passes. Until the program detects an error or reaches the beginning of the instruction buffer, one more preceding instruction is added to the test sequence on each iteration of the isolation process.

When the isolation process terminates, the output dump contains the following:

• Isolated instruction buffer
• Data used when the failure occurred
• Simulated execution results
• Actual execution results (if different from the simulated results)
• An exclusive OR of the simulated and actual execution results

If the failure is very intermittent, the second part of the isolation process may terminate without detecting an error, and then the output dump will not contain any actual execution results (differences). In this case, increase the value of isop n, enable the +repeat option, select the failing CPU, and use the failing seed to rerun the test.

The program may report an error resulting from a failure in either the simulated or actual execution. To determine if the error is the result of an actual execution failure, start olcrit in a different CPU and select the suspected failing CPU. For example, the following entry starts olcrit in CPU c:

    olcrit cpu c

If olcrit fails, and the simulated execution is suspect, rerun olcrit using a different master CPU and the failing seed, as follows:

    olcrit cpu a,c +repeat seed n

If olcrit fails in CPU c, the failure is in the actual execution of the random instruction buffer.
If olcrit does not fail, the error is either in the simulated execution results from CPU c or it is very intermittent.

3.4.3 TEST TERMINATION

For information on test termination, refer to section 2, Confidence Test and Monitor Overview.

3.4.4 TEST EXAMPLES

This subsection contains olcrit execution examples.

The following example runs olcrit for 0'10000000 passes. Output is redirected to crit.log. The nohup(1) command allows the program to continue executing after you log off the system. You can later log on to check the test's progress. The ampersand (&) causes the entire command to execute in the background, so that another prompt is immediately displayed and you can continue to use the system.

    nohup olcrit maxp 10000000 >crit.log &

The following example runs olcrit with selected command options and shell facilities. The test runs for 0'1000000 passes in CPU b with all default instructions. The job runs as a background process, and output is sent to crit.log.

    olcrit maxp 1000000 cpu b >crit.log &

The following example shows a procedure for determining how frequently an error occurs. The test is rerun with the +repeat option, so that the first pass is run repeatedly until the test terminates. The test uses the seed value from the output at the time of the initial error. Error isolation is disabled. The output is filtered to crit.log.

    olcrit +repeat -isolate maxerr 100 maxp 100 cpu d seed 1436651016713554002511 | tail >crit.log &

The following example runs olcrit with floating-point and vector instructions.

    olcrit +fpadd +fpmult +fprecip +vector >crit.log &

The following example runs olcrit with all of the vector instructions except instructions 146 and 147.

    olcrit +vector disable 146,147 >crit.log &

The following example runs olcrit with instructions 026ij0, 026ij1, 026ij7, 031, and 072i02.

    olcrit enable 26,31,072002 &

The following example runs olcrit with all of the default instructions except floating-point add and multiply.
    olcrit -fpadd -fpmult >crit.log &

The following example shows the output displayed when olcrit is run with all default values.

    olcrit

Output:

    olcrit
    olcrit started in cpu A on Tue Aug 25 11:32:08 1987
    CRAY X-MP MODE
    olcrit reached maximum pass limit with 1000 passes and 0 errors on Tue Aug 25 11:32:18 1987

The following example runs olcrit with the +verbose option enabled so that a line of output is generated after each pass.

    olcrit +verbose

Output:

    olcrit +verbose
    olcrit started in cpu A on Tue Aug 25 11:42:47 1987
    CRAY X-MP MODE
    olcrit: pass = 1, error = 0 Tue Aug 25 11:42:47 1987
    olcrit: pass = 2, error = 0 Tue Aug 25 11:42:47 1987
    olcrit: pass = 3, error = 0 Tue Aug 25 11:42:47 1987
    olcrit: pass = 1000, error = 0 Tue Aug 25 11:42:57 1987
    olcrit reached maximum pass limit with 1000 passes and 0 errors on Tue Aug 25 11:42:57 1987

The following example runs olcrit for 10 seconds (wall-clock time) in CPU c only.

    olcrit cpu c time 10

Output:

    olcrit cpu c time 10
    olcrit started in cpu C on Tue Aug 25 11:44:51 1987
    CRAY X-MP MODE
    olcrit reached maximum time limit with 1016 passes and 0 errors on Tue Aug 25 11:45:01 1987

The following example runs olcrit in CPUs a and b, with b as the master. On each pass, olcrit tests a sequence of 15 instructions, using random data for the initial register and memory values.
olcrit numins 15 +random cpu b,a Output on an error: olcrit numins 15 +random cpu b,a olcrit started in cpus A, B with olcrit: restart file written to CRAY X-MP MODE ( 2100> = name ( 2101> = rev ( 2102> = date ( 2103> = pass ( 2104> = error ( 2105> = seed ( 4027> = failpat ( 2116> = isop ( 2107> = numins SMM-1012 C master cpu B on Tue Mar 1 12:40:37 1988 B67350-olcrit 'olcrit '4.0 '03/01/88' 31 1 1114623621420641250446 'random 1000 15 CRAY PROPRIETARY 3-51 Output (continued): random instruction buffer ibuff 10100a 10100b 10100e 10101a 10101e 10101d 10102a 10102b 10102e 10102d 10103a 10103c 10104a 10104c 144744 061032 012000 020000 077406 144107 030367 002700 037705 067045 020600 007000 021033 006000 V7 SO JAP AO V4,A6 VI A3 CMR O,AO SO A6 R AO J 042400 026211 000172 042410 031327 021200 54 V4 53-52 10500a 00026211 SO SO V7 A6+A7 T05,A7 54*155 00000172 10502a #06631327 4240a jump buffer (used by the random instruction buffer) jbuff 10500a 10500b 10500d 10501a 10501e 10501d 10502a 10502b 10502d 10503a 10503b 10503c 10503d 001000 110000 026400 001000 006000 040404 000000 000000 024100 110100 026401 001000 005000 000000 000000 000000 initial address register data 21600> initaO < 21601> inita1 < inita2 21602> < 21603> inita3 < 21604> inita4 < inita5 21605> < 21606> inita6 < 21607> inita7 < 3-52 PASS 26400,0 PASS J ERR ERR Al 26401,0 PASS J ERR ERR ERR AO 10101a BOO Al BOO = 0000000000000016317572 = 0000000000000017662707 = 0000000000000066352041 = 0000000000000066313277 = 0000000000000014173556 = 0000000000000027243236 = 0000000000000055114565 = 0000000000000006421710 CRAY PROPRIETARY SMM-1012 C Output (continued): initial scalar register data 21610> initsO < inits1 21611> < inits2 21612> < inits3 21613> < inits4 21614> < 21615> inits5 < 21616> inits6 < 21617> inits7 < = 0570435766134171410070 = 0657045641432164307775 = 0362774051154520352750 = 1427136526115123426026 = 1510553624661224560223 = 1734474576202245120017 = 1460472150234237442222 = 1214375337067423156017 
initial vector length and mask register data (vector length and mask register data is displayed) initial central memory data (central memory data is displayed) initial jump data (octal ones pattern) (jump data is displayed) initial vector register data (vector register data is displayed) initial shared B register data (shared B register data is displayed) initial shared T register data (shared T register data is displayed) initial semaphore register data (semaphore register data is displayed) initial B register data (B register data is displayed) initial T register data (T register data is displayed) simulated random instruction buffer results The expected data shown below has the following format: The expected data shown below has the following format: + index name name: index: offset: data: The The The The SMM-I012 C (offset> = data ••• name of the data dumped on this line. index into the data starting at name. offset into the data buffer. actual data dumped. CRAY PROPRIETARY Optional, default: O. 
3-53 Output (continued): *** Expected Results *** cpu B (master) Source data buffer at 22100 in Memory Memory address in source data buffer = + 22100 (source data buffer) simulated address register data < 2500> = aO a1 < 2501> = a2 < 2502> = < 2503> = a3 a4 < 2504> = < 2505> = a5 a6 < 2506> = a7 < 2507> = results 0000000000000071146450 0000000000000000040420 0000000000000066352041 0000000000000055114565 0000000000000014173556 0000000000000027243236 0000000000000000000172 0000000000000006421710 simulated scalar register data ( 2510> = sO s1 < 2511> = ( 2512> = s2 ( 2513> = s3 s4 < 2514> = s5 < 2515> = s6 < 2516> = s7 < 2517> = results 0600005600346143005524 0657045641432164307775 0362774051154520352750 1427136526115123426026 1510553624661224560223 1734474576202245120017 1460472150234237442222 1214375337067423156017 simulated vector length and mask register data results (vector length and mask register data is displayed) simulated central memory data results (central memory data is displayed) simulated jump data results (jump data is displayed) simulated vector register data results (vector register data is displayed) simulated shared B register data results (shared B register data is displayed) simulated shared T register data results (shared T register data is displayed) simulated semaphore register data results (semaphore register data is displayed) simulated B register data results (B register data is displayed) 3-54 CRAY PROPRIETARY SMM-1012 C Output (continued): simulated T register data results (T register data is displayed) Differences are the results from actual execution of the random instruction buffer that differ from the master (simulated or actual) execution. 
aO-a7 sO-s7 vI vm cm jmp vO-v7 sb st sm br tr = address register data results scalar register data results = vector length register data results vector mask register data results = central memory data results = jump buffer data results = vector register data results = sbO-sb7 register data results = stO-st7 register data results = semaphore register data result = bOO-b77 register data results = tOO-t77 register data results The difference data shown below has the following format: name + index (offset> = data data differences The name of the data dumped on this line. The index into the data starting at name. Optional, default: O. The offset into the data buffer. The actual data dumped. The differences are marked with an asterisk (*) preceding the data word. data differences: The bits that differ between the actual results and the expected results. name: index: offset: data: *** Differences *** cpu B (master) Source data buffer at 25100 in Memory copied to save buffer at 106362 in Memory Memory address in source data buffer = (offset> + 25100 (source data buffer) Memory address in save data buffer = (offset> + 106362 (save data buffer) actual random buffer execution results a3 ( 2503> = *0000000000000063536475 0000000000000036422110 *** Differences *** SMM-1012 C cpu A CRAY PROPRIETARY 3-55 Output (continued): Source data buffer at 25100 in Memory copied to save buffer at 106362 in Memory Memory address in source data buffer = (offset> + 25100 (source data buffer) Memory address in save data buffer = (offset> + 106362 (save data buffer) actual random buffer execution results a3 ( = *0000000000000063536475 2503> 0000000000000036422110 Beginning error isolation Error isolation complete name rev date pass error seed failpat isop numins ( ( ( ( ( ( ( ( ( 2100> 2101> 2102> 2103> 2104> 2105> 4027> 2116> 2107> = = = = = = = = = 'olcrit '4.0 '03/01/88' 31 1 1114623621420641250446 'random 1000 15 isolation: random instruction buffer ibuff 10102a 10102b 030367 006000 021200 
A3 J A6+A7 4240a jump buffer (may be used by the isolated random instruction buffer) jbuff 10500a 10500b 10500d 10501a 10501c 10501d 10502a 10502b 10502d 10503a 10503b 10503c 10503d 3-56 001000 110000 026400 001000 006000 040404 000000 000000 024100 110100 026401 001000 005000 000000 000000 000000 PASS 26400,0 PASS AO J 10101a ERR ERR A1 26401,0 PASS J BOO A1 BOO ERR ERR ERR CRAY PROPRIETARY SMM-1012 C Output (continued): isolation: initial address register data ( initaO 21600> = 0000000000000000026211 inita1 21601> = 0000000000000017662707 < inita2 21602> = 0000000000000066352041 < inita3 21603> = 0000000000000066313277 < inita4 21604> = 0000000000000014173556 < inita5 21605> = 0000000000000027243236 < inita6 21606> = 0000000000000055114565 < 21607> = 0000000000000006421710 inita7 < isolation: initial scalar register data initsO 21610> = 1044142454740403053056 < inits1 21611> = 0657045641432164307775 < 21612> inits2 < = 0362774051154520352750 21613> inits3 < = 1427136526115123426026 inits4 21614> < = 1510553624661224560223 inits5 21615> < = 1734474576202245120017 inits6 21616> < = 1460472150234237442222 21617> inits7 < = 1214375337067423156017 (From this point on, the dump is similar to the previously listed portion o£ the dump that displayed the unisolated error in£ormation.) The first address (FADD) of the diagnostic is 2100a olcrit reached maximum error limit with 31 passes and 1 errors at Tue Mar 1 12:40:59 1988 3.4.5 TEST MESSAGES The olcrit test produces the following types of messages: • • • Test mode Informative Error These messages are listed in the subsections that follow. 3.4.5.1 Test mode messages During test execution, one of the following informational messages is displayed to indicate the test mode: CRAY Y-MP MODE Indicates that the mainframe is a CEA system in Y-mode. 
CRAY Y-MP MODE, shared register testing disabled
Indicates that the mainframe is a CEA system in Y-mode, and that shared register instruction testing is disabled because cluster 0 is in use. If this message is inconsistent with your hardware configuration, it normally indicates an instruction failure. To determine where the failure occurred, rerun olcrit with the +shr command option. Contact your CRI representative for additional assistance.

CRAY X-MP MODE
Indicates that the mainframe is a CRAY X-MP computer system.

CRAY X-MP MODE, shared register testing disabled
Indicates that the mainframe is a CRAY X-MP computer system, and that shared register instruction testing is disabled because cluster 0 is in use. If this message is inconsistent with your hardware configuration, it normally indicates an instruction failure. To determine where the failure occurred, rerun olcrit with the +shr command option. Contact your CRI representative for additional assistance.

CRAY X-MP MODE, compressed index testing disabled
Indicates that the mainframe is a CRAY X-MP computer system without compressed indexing hardware. If this message is inconsistent with your hardware configuration, it normally indicates an instruction failure. To determine where the failure occurred, rerun olcrit with the +ci command option. Contact your CRI representative for additional assistance.

CRAY X-MP MODE, extended memory testing disabled
Indicates that the mainframe is a CRAY X-MP computer system without extended memory instruction hardware. If this message is inconsistent with your hardware configuration, it normally indicates an instruction failure. To determine where the failure occurred, rerun olcrit with the +ema command option. Contact your CRI representative for additional assistance.

CRAY-1 MODE
Indicates that the mainframe is a CRAY-1 computer system.
CRAY-1 MODE, vector pop/parity testing disabled
    Indicates that the mainframe is a CRAY-1 computer system without vector population count/parity instruction hardware. If this message is inconsistent with your hardware configuration, it normally indicates an instruction failure. To determine where the failure occurred, rerun olcrit with the +pop command option. Contact your CRI representative for additional assistance.

3.4.5.2 Informative messages

If the +verbose option is enabled, a message is sent to stdout (standard output device) after each pass through the test loop. On an error, the test provides information such as the following:

• Pass and error counts
• Seed at the beginning of the pass on which the error occurred
• Contents of the instruction buffer
• Initial data
• Data results from the simulated instruction execution in the master CPU
• Differences between the simulated execution results from the master CPU and the actual execution results from all of the selected CPUs

In addition, the following informative messages may be displayed:

The ijk field is invalid; the instruction was not selected/deselected.
    The ijk field specified with the gh field for enable ilist or disable ilist is invalid. Correct and rerun.

The ijk field is not needed to select/deselect the instruction.
    The ijk field specified with the gh field for enable ilist or disable ilist is not required. However, the specified instruction was selected or deselected.

3.4.5.3 Error messages

One of the following error messages is sent to stderr (standard error device) if an invalid command option is entered:

olcrit: pattern: No data pattern(s) selected.
    All data patterns are deselected. Correct and rerun.

olcrit: selins: No executable instructions selected.
    All instructions are deselected. Correct and rerun.

olcrit: selins: Vector length must be in the range 0 through 100.
    Vector length is not in the range 0 through 100. Correct the vl option and rerun.
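The vector length check behind the last message can be illustrated with a short sketch. It is not taken from the olcrit source: the function name parse_vl is hypothetical, and the sketch assumes that the vl argument is read as an octal value (as it is for olcsvc), so that 100 means 64 elements.

```python
MAX_VL = 0o100  # octal 100 = 64 elements (assumed upper limit for vl)


def parse_vl(text):
    """Parse a vl argument as octal and reject out-of-range values.

    Hypothetical helper: mirrors the range rule in the error message,
    not the actual olcrit option parser.
    """
    value = int(text, 8)  # raises ValueError for digits 8 and 9
    if not 0 <= value <= MAX_VL:
        raise ValueError("Vector length must be in the range 0 through 100.")
    return value


if __name__ == "__main__":
    for arg in ("0", "77", "100", "101"):
        try:
            print(arg, "->", parse_vl(arg))
        except ValueError as err:
            print(arg, "-> rejected:", err)
```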
One of the following error messages is sent to stderr if olcrit detects an unexpected error. Select a different master CPU and rerun the test. If the problem persists, contact your CRI representative.

olcrit: simulate: (software error) The instruction does not have a simxxx routine.

olcrit: generate: (software error) The instruction does not have a genxxx routine.

olcrit: simulate: (software error) The gh field is greater than 177.

3.5 olcsvc

The olcsvc test provides comprehensive testing of the vector registers, functional units, and paths, and limited testing of the scalar registers, functional units, and paths. All address registers, address functional units, and related paths are assumed to be operating correctly.

The olcsvc test generates a random sequence of vector instructions, followed by a sequence of scalar instructions. The scalar and vector instructions perform identical functions. The two sets of instructions are executed with random data, and the results are compared. Any differences are reported, and the test attempts to isolate the error. If no differences are detected, the test generates new instructions and data, and repeats the process.

The olcsvc test runs under the confidence monitor program, olcmon. The olcmon monitor compares the scalar and vector execution results. For additional information on olcmon, refer to section 2, Confidence Test and Monitor Overview.

3.5.1 TEST SYNOPSIS

The olcsvc command options can be entered in any order. If an option is omitted, the program uses the default value. The test synopsis lists the olcsvc command options and arguments in the following order:

1. Monitor options
2. Test-specific options
3. Data pattern options
4. Instruction options

Synopsis:

    olcsvc [chkpnt mode] [cpu clist] [cputime h:m:s] [+/-getseed]
           [getseed file] [help] [maxerr n] [maxp n] [+/-parcel]
           [time h:m:s] [+/-verbose] [+xmp] [+cray1]†
           [disable ilist] [enable ilist] [+/-isolate] [isop n]
           [numpar n] [+/-repeat] [seed n] [+/-sgci] [vl n]
           [+/-onezero] [+/-random] [+/-slide]
           [+/-cm] [+/-fpadd] [+/-fpmult] [+/-fprecip] [+/-int]
           [+/-logical] [+/-pop] [+/-shift]

† The monitor command options are described in section 2, Confidence Test and Monitor Overview.

disable ilist
    Deselects specific instructions. Enter ilist in the following format:

        n,n,...,n

    n is the octal value in the gh field of the specific vector instructions. Only vector instructions are valid; all other instructions are ignored. The disable ilist option overrides the enable ilist option and any selected (+) or deselected (-) instruction options.

enable ilist
    Selects specific instructions. Enter ilist in the following format:

        n,n,...,n

    n is the octal value in the gh field of the specific vector instructions. Only vector instructions are valid; all other instructions are ignored. If you do not enter enable ilist, all vector instructions are run. The enable ilist option overrides any selected (+) or deselected (-) instruction options. When the test is run with default values for the +/- instruction options, and the enable ilist option is selected, only the instructions specified by the enable ilist option are run.

+/-isolate
    Enables (+isolate) or disables (-isolate) the error isolation option. The default is +isolate.

isop n
    Sets the isolation pass limit to n (octal). During isolation, the diagnostic repeatedly executes the suspected failing sequence. If the sequence fails, the loop terminates and the diagnostic attempts to isolate the shortened sequence further.
    If the sequence does not fail, the loop terminates after n passes, and olcsvc assumes that the error is not in the tested sequence. The default for n is 0'1000.

numpar n
    Sets the minimum number of parcels of vector instructions to be generated on each pass. The actual number of parcels generated can be greater than n on any given pass. n can be any octal value in the range 1 through 0'200. The default for n is 0'100.

+/-repeat
    Enables (+repeat) or disables (-repeat) the option that repeats the first pass until the diagnostic terminates. +repeat is useful for recreating an error. It is normally used with one of the following options: seed n, +getseed, or getseed file. The default is -repeat (the program generates new instructions and data after each pass).

seed n
    Sets the random seed to n. n can be any 64-bit octal value. If n is 0, the test reads the real-time clock and uses the value for the initial seed. The default for n is 0'33. If seed n is selected, do not select +getseed or getseed file.

+/-sgci
    Enables (+sgci) or disables (-sgci) testing of the scatter/gather/compressed index hardware. When enabled, testing occurs even if the hardware configuration detection routine indicates that the hardware is not present in the system. However, if this option is enabled and the hardware is not present in the system, you will receive a dump indicating that the hardware has failed. When allowed to default, the test determines the type of hardware configuration and sets the default value accordingly.

vl n
    Sets the vector length to n. n can be any octal value within the range 0 through 0'100. The default for n is 0.

    If vl is set to 0, a random vl value is used to initialize the test and the value may change during the execution of the random instruction buffer.

    If the vl value is within the range 1 through 0'100, instruction 00200k is disabled.
    The vl value is initialized to n and remains set to n during the execution of the random instruction buffer. However, if instruction 00200k is selected by the enable option, the vl value is initialized to n and may change each time a 00200k instruction is executed in the random instruction buffer.

+/-onezero, +/-random, +/-slide
    Selects (+) or deselects (-) specific data patterns. Except when the vl value is initialized to a value within the range 1 through 0'100, random data is used for the vector length register. The default is +onezero +random +slide. The data patterns are as follows:

    Option      Data Pattern

    onezero     Random selection of all 1's or all 0's in a word. For example:

                1777777777777777777777
                0000000000000000000000

    random      Random bit generation in a word. For example:

                1023122123232122777127
                0003423100233344322177
                1640034356453221213532
                1123235467543221322120
                1304322300332105534311

    slide       Random number of consecutive 1's (0's) that slide in either direction through a field of 0's (1's). Consecutive words contain the sliding pattern.
                For example:

                0777777777777777777777
                0377777777777777777777
                0177777777777777777777
                1077777777777777777777
                1437777777777777777777
                1777777777777777777770
                1777777777777777777774
                1777777777777777777776
                1777777777777777777777

                0000000000000000000001
                0000000000000000000003
                0000000000000000000007
                0000000000000000000017
                0000000000000000000036
                0740000000000000000000
                1700000000000000000000
                1600000000000000000000
                1400000000000000000000
                1000000000000000000000
                0000000000000000000000

+/-cm, +/-fpadd, +/-fpmult, +/-fprecip, +/-int, +/-logical, +/-pop, +/-shift
    Selects (+) or deselects (-) specific instruction groups for the following options:

    Option      Instruction Type

    cm          Central memory
    fpadd       Floating-point addition
    fpmult      Floating-point multiply
    fprecip     Floating-point reciprocal
    int         Integer
    logical     Logical
    pop         Population/parity count
    shift       Shift

    If allowed to default, all instruction groups are run. The groups are as follows:

    Option      Instruction Group

    cm          176, 177
    fpadd       170 through 173
    fpmult      160 through 167†
    fprecip     174ij0
    int         154 through 157
    logical     003, 073, 140 through 147, 175
    pop         174ij1, 174ij2
    shift       150 through 153

    † Instruction 166 is not generated on a CEA system.

3.5.2 TEST EXECUTION

The olcsvc execution sequence is as follows:

1. Test initialization and hardware configuration detection
2. Random instruction and data generation
3. Instruction buffer execution
4. Comparison of execution results
5. Error isolation

Hardware configuration detection occurs only at test initiation. Steps 2 through 4 occur on each pass through the test loop. Step 5 occurs only on error.

3.5.2.1 Test initialization and hardware configuration detection

At test initialization, instructions are processed in the following order:

1. All instructions are initially enabled unless either of the following occurs (in which case no instructions are initially enabled):

   • An instruction group is selected (+option)
   • An enable option is entered and there are no deselected (-option) instruction group entries

2. Selected groups are processed, enabling instructions in the selected groups.

3. Deselected groups are processed, disabling instructions in the deselected groups.

4. If the vl option is set to a value within the range 1 through 0'100, instruction 00200k is deselected.

5. Individually selected instructions are processed (all instructions specified by the enable option).

6. Individually deselected instructions are processed (all instructions specified by the disable option).

7. If no instructions are selected, an error message is displayed and the test is terminated.

The hardware configuration detection routine determines which of the following computer systems is configured:

• CRAY X-MP computer system
• CRAY-1 computer system

Then the hardware configuration detection routine adjusts testing accordingly, by determining the following:

    Mainframe     Hardware Configuration Detection Routine

    CRAY X-MP     Determines whether the system contains scatter/gather/compressed indexing hardware
    CRAY-1        Determines whether the system contains a vector population count functional unit

After determining the hardware characteristics, the routine writes a message to stdout to indicate the type of system detected.

Instruction generation is dependent on the hardware configuration detected, as follows (you can use the +/-sgci option to override this default instruction generation process):

    Mainframe     Instructions Generated

    CEA           All instructions are generated except instruction 166, which is the 32-bit vector integer multiply instruction.

    CRAY X-MP     All instructions are generated with one condition: scatter/gather/compressed indexing instructions are generated only if present in the hardware.
    CRAY-1        All instructions are generated except the following:

                  • A load VL instruction (00200k)
                  • Scatter/gather/compressed indexing instructions
                  • Any instructions that would cause vector recursion. (In a vector instruction, vector recursion results when Vi and Vj or Vi and Vk refer to the same vector register.)

                  Vector pop/parity instructions are generated only if the hardware contains a vector population count functional unit.

3.5.2.2 Random instruction and data generation

These routines build the random vector instruction buffer. As each vector instruction is generated, the sequence of scalar instructions that simulates the vector instructions is generated in the scalar instruction buffer.

The following information applies to the sequence of scalar instructions that is generated for each vector instruction:

• Sa, Sb, Sc, and Sd are randomly selected S registers. Am, An, Ap, and Aq are randomly selected A registers. The test uses unique A registers and S registers for each sequence, but not A0 or S0. The registers are not selected based on the ijk fields of the vector instruction. Therefore, the same vector instruction does not always generate the same sequence of scalar instructions. The registers used in the scalar sequence will vary.

• The labels vireg, vjreg, vkreg, sireg, sjreg, skreg, and vmreg are central memory locations containing the simulated vector registers, scalar registers, and vector mask register, respectively. The actual address depends on the i, j, and k fields of the actual vector instruction.

• For vector instructions that require A registers to contain certain values (memory and shift instructions), constant loads of the A registers are generated immediately preceding the actual vector instruction in the vector instruction buffer.

• These sequences are altered for certain vector instructions if the i, j, and k fields of the vector instruction refer to the same vector register.
  For instructions 141, 143, 145, 155, 157, 161, 163, 165, 167, 171, and 173, if the j field is equal to the k field of the instruction, the read from vkreg in the scalar instruction sequence is not generated because it is the same as the read from vjreg; this results in faster execution of the scalar instruction sequence.

  The following applies only to CRAY-1 computer systems:

    For instructions 141, 143, 145, 147 through 153, 155, 157, 161, 163, 165, 167, 171, and 173, the i field never equals the j field.

    For instructions 140 through 147, and 154 through 174, the i field never equals the k field.

• The shift instructions normally produce a shift value in the range 0 through 0'77 for a single shift and 0 through 0'177 for a double shift, and only occasionally use a random value for the shift amount.

• For instructions 176i0k and 1770jk (read/write vector to central memory), the central memory address is a random address within the first 0'400 words of cmbuff. The stride is a random value with its upper limit based on the random address and the current vector length. Therefore, a large stride can be used if the vector length is small.

• For instructions 176i1k and 1771jk (gather and scatter), the program sets up a vector register containing a specific range of values by forcing a sequence of instructions before instruction 176i1k or 1771jk is generated. The forced instructions consist of a load of an S register with a 9-bit mask from the right (042i67), followed by a 140 instruction (the logical product of a scalar register with a vector register to a vector register). The resulting vector register is then used as the Vk register in a 176i1k or 1771jk instruction. This forces the values into the range 0 through 0'777, and it reduces the randomness of the instruction sequence generated. The test tracks the vector registers that can be used for a gather/scatter instruction.
  If the Vk register is within the range 0 through 0'777 when a 176i1k or 1771jk instruction is generated, the set-up sequence is not generated. The following conditions indicate that a vector register is within the range 0 through 0'777:

    • The register was set up for a previous gather/scatter instruction.

    • The register received the results from a 174ij1 or 174ij2 instruction (pop/parity).

    • The register received the results from a 140 instruction, and the Vk field of the instruction was set up for scatter/gather.

    • The register received the results from a 141 instruction, and either the Vj or Vk field of the instruction was set up for scatter/gather.

    • The register received the results from a 143, 145, or 147 instruction, and the Vj and Vk fields of the instruction were set up for scatter/gather.

    • The register received the results from a 151 instruction (single shift right), and the shift value was greater than 55 (decimal).

    • The register received the results from a 153 instruction (double shift right), and the shift value was greater than 119 (decimal).

The scalar instruction sequence that is generated for each vector instruction follows.

Scalar instructions are not generated for vector instruction 00200k. However, during the vector instruction sequence, the VL value to be used in scalar instruction sequences is loaded into an A register and, subsequently, the VL register is loaded from the A register.

The scalar instruction sequence for vector instruction 0030j0 is as follows:

            Sb        sjreg,        ; Read Sj value
            vmreg,    Sb            ; Store resulting VM

The scalar instruction sequence for vector instruction 073i00 is as follows:

            Sa        vmreg,        ; Read simulated VM reg.
            sireg,    Sa
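The gather/scatter range conditions listed earlier in this subsection come down to simple bit arithmetic on 64-bit words. The following Python sketch is an illustration only; the function names are hypothetical and nothing here is taken from the olcsvc source. It shows that a logical product with a 9-bit mask from the right always yields a value in the range 0 through 0'777, and that a single right shift of more than 55 (decimal) leaves too few bits to exceed that range.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1
INDEX_MAX = 0o777  # upper limit of the gather/scatter index range


def right_mask(nbits):
    """Return a word with the low-order nbits bits set (mask from the right)."""
    return (1 << nbits) - 1


def force_index_range(vk):
    """Model the forced 140 instruction: AND every element with a 9-bit mask."""
    mask = right_mask(9)
    return [element & mask for element in vk]


def single_shift_right(word, count):
    """Model a 64-bit logical right shift of one vector element."""
    return (word & WORD_MASK) >> count


if __name__ == "__main__":
    vk = [0o1777777777777777777777, 0o1023122123232122777127, 0o12345]
    for before, after in zip(vk, force_index_range(vk)):
        print(format(before, "022o"), "->", format(after, "o"))
    # A shift greater than 55 leaves at most 8 significant bits.
    print(format(single_shift_right(WORD_MASK, 56), "o"))
```

The sketch models only the single shift; the double shift condition (greater than 119 decimal) follows the same reasoning applied to a 128-bit operand.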
The scalar instruction sequence for vector instruction 076 is as follows: Ap Sa sireg, element vjreg,Ap Sa · Read , Random element number element from vj Store into Si The scalar instruction sequence for vector instruction 077 is as follows: Ap Sa vireg,Ap element sjreg, Sa Random element number ; Read Sj Store into vi The scalar instruction sequence for vector instructions 140, 142, 144, 154, 156, 160, 162, 164, 166,t 170, and 172 is as follows: Am An loop Sb Sc vI 0 sjreg, vkreg,An Sa vireg,An An AO jan Sa An+1 Am-An loop SbopSc Current simulated VL Index Get S register value Get next vector element Perform operation , Store result , Update index , Test for end Loop until index = VL · · · op can be one of the following: &,!, t , +, -, *f, *h, *r, *i, +f, -f Instruction 166 is not generated on a CEA system in Y-mode. 3-70 CRAY PROPRIETARY SMM-1012 C The scalar instruction sequence for vector instructions 141, 143, 145, 155, 157, 161, 163, 165, 167, 171, and 173 is as follows: loop Am An vI Sb Sc vjreg,An vkreg,An Sa vireg,An An AO jan Sa An+1 Am-An loop 0 SbopSc Current simulated VL Index Get next vector elements Perform operation Store result Update index Test for end Loop until index = VL op can be one of the following: &, I, , , +, -, *f, *h, *r, *'1., +f, -f The scalar instruction sequence for vector instruction 146 is as follows: loop Am An Sd SO jsp Sa j skip1 skip2 Sa vireg,An Sd An AO jan vI o vrnreg, Sd skip1 sjreg, skip2 vkreg,An Sa Sd<1 An+1 Am-An loop Current simulated VL Index Get simulated VM reg. VM to SO for testing Decide on result Read Sj register Skip vector read Read vector element Write result element Shift VM value Update index Test for end Loop until index = VL The scalar instruction sequence for vector instruction 147 is as follows: loop Am An Sd SO jsp Sa j skip1 skip2 SMM-1012 C Sa vireg,An Sd An AO jan vI o vrnreg, Sd skip1 vjreg,An skip2 vkreg,An Sa Sd<1 An+1 Am-An loop Current simulated VL Index Get simulated VM reg. 
VM to SO for testing Decide on result Read vj element Skip vector read Read vk element Write result element Shift VM value Update index Test for end Loop until index = VL CRAY PROPRIETARY 3-71 The scalar instruction sequence for vector instructions 150 and 151 is as follows: loop · Ap shift , Amount to shift Am vI 0 vjreg,An SaopAp Sa An+1 Am-An loop , Current simulated VL , Index ; Get Vj element Do the shift Store result Update index , Test for end , Loop until index = VL AS Sa Sa vireg,An An AO jan · · · · op can be < (left shift) or > (right shift). The scalar instruction sequence for vector instruction 152 is as follows: loop skip Ap Am An Sa An AO shift vI 0 vjreg,An An+1 Am-An Sb 0 jaz skip vjreg,An Sb Sa,Sb Ap Sa Sd An+1 Am-An loop simulated VL Index Zero fill the shift , Get Vj element ; Copy Sa into Sd , Do the shift , Store the result Copy Sd into Sb Update index Test for end Loop until index = VL · · · CRAY PROPRIETARY SMM-1012 C The scalar instruction sequence for vector instruction 174ijO is as follows: loop Am An vI sb vjreg,An IhSl Sa An+l Am-An loop Current simulated VL Index Get Vj element Perform operation Store result Update index Test for end Loop until index = VL 0 Sa vireg,An An AO jan The scalar instruction sequence for vector instructions 174ijl and 174ij2 is as follows: loop Am An vI Sb vjreg,An opSl Ap An+l Am-An loop Ap vireg,An An AO jan op can be P or Current simulated VL Index Get Vj element Perform operation Store result Update index Test for end Loop until index = VL 0 Q The scalar instruction sequence for vector instructions 175ijO through 175ij3 is as follows: Am An Sc Sa SO loop jump Sa Sc An AO jan vrnreg, skip vI ; 0 SB 0 vjreg,A5 skip Sa!Sc Sc>l An+l Am-An loop Sa Current simulated VL Index Mask of current element Build VM in this register Get next element Set VM bit? 
Yes ~ Set bit in VM Shift for next element Update index Test for end Loop until index = VL Store resulting VM The jump value is determined by the vector instruction, as follows: Vector Instruction Value 175ijO 175ijl 175ij2 175ij3 jsn jsz jsm jsp SMM-l012 C Jump CRAY PROPRIETARY 3-73 The scalar instruction sequence for vector instructions 175ij4 through 175ij7 is as follows: loop skip Am vI An Sc Sa Ap SO jump Sa vireg,Ap Ap Sc 'An AO jan vmreg, o SB o o vjreg,An skip Sa!Sc An Ap+l Sc>l An+l Am-An loop Sa ; Current simulated VL ; Index Mask of current element ; Build VM in this register ; Compressed index pointer Get next element Set VM bit? ; Yes - set bit in VM Store index in vi ; Update compressed index Shift for next element Update index ; Test for end ; Loop until index = VL Store resulting VM The jump value is determined by the vector instruction, as follows: Jump Value Vector Instruction jsn jsz jsm jsp 175ij4 175ij5 175ij6 175ij7 The scalar instruction sequence for vector instruction 176iok is as follows: loop 3-74 Ap Aq Am An Sa vireg,An Ap An AO jan cmaddress stride vI 0 ,Ap Sa Ap+Aq An+l Am-An loop CM address in cmbuff · Current , Random stride value simulated VL Index , Read from cmbuff ; Store element of vector Increment address by stride Update index , Test for end Loop until index = VL · · CRAY PROPRIETARY SMM-I012 C The scalar instruction sequence for vector instruction 176ilk is as follows: Ap Am An Aq Aq Sa vireg,An An AO jan loop cmbuff vI 0 vkreg,An Aq+Ap ,Aq Sa An+l Am-An loop Address of cmbuff Current simulated VL Index Get element of vector Calculate address Get word from memory Store vector element : Update index Test for end Loop until index = VL The scalar instruction sequence for vector instruction 177ijO is as follows: Ap Aq Am An loop cmaddress stride vI 0 Sb vjreg,An ,Ap Ap An AO jan Ap+Aq An+l Am-An loop Sb CM address in cmbuff Random stride value Current simulated VL Index Get element of vector Write to cmbuff Increment address by 
stride Update index : Test for end Loop until index = VL

The scalar instruction sequence for vector instruction 177ij1 is as follows:

            Ap        cmbuff        ; Address of cmbuff
            Am        vl            ; Current simulated VL
            An        0             ; Index
    loop    Aq        vkreg,An      ; Get element of vector
            Aq        Aq+Ap         ; Calculate address
            Sb        vjreg,An      ; Get vector element
            ,Aq       Sb            ; Write word to memory
            An        An+1          ; Update index
            A0        Am-An         ; Test for end
            jan       loop          ; Loop until index = VL

3.5.2.3 Instruction buffer execution

After the instructions and data are generated, the scalar and vector instruction buffers are executed first in the master CPU, and then in each of the other selected CPUs. Immediately following the execution of an instruction buffer, the save monitor routine is called to save the execution results.

3.5.2.4 Comparison of execution results

After the scalar and vector instruction buffers are executed in all of the selected CPUs, the compare monitor routine compares the results, and one of the following actions occurs:

• If the results match, the test proceeds with the next pass.

• If the results do not match, the test dumps all of the data related to the suspected failure and, if the isolation option is enabled (+isolate), attempts to isolate the failure by reducing the number of instructions in the execution buffers in which the failure is occurring. Refer to the test output to determine which CPU has failed.

3.5.2.5 Error isolation

If an error is detected and the isolation option is enabled (+isolate), the test attempts to reduce the random vector instruction buffer to the minimum number of failing instructions. If an instruction sequence is removed from the vector instruction buffer, the corresponding scalar instruction sequence is removed from the scalar instruction buffer. If a vector instruction requires that a set of registers be used together to perform a specific function, such as the address registers for memory references, the set of instructions is considered to be a single instruction sequence.
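The comparison in subsection 3.5.2.4, which is what triggers isolation, can be pictured as a word-by-word compare in which the differing bits are the exclusive OR of the expected and actual words. The Python sketch below is an illustration under stated assumptions: the function names and the simplified report line are hypothetical, not the olcmon compare routine or its exact dump format. The actual word is the v0reg value from the sample error output in subsection 3.5.4; the expected word is not printed in the manual and is inferred here by applying the printed difference bits to it.

```python
def format_word(word):
    """Format a 64-bit word as 22 octal digits, as in the dumps."""
    return format(word, "022o")


def compare(expected, actual):
    """Compare two result sets (dict: name -> list of 64-bit words).

    Returns one report line per differing word. The asterisk marks the
    actual data word; the second value holds the bits that differ.
    """
    report = []
    for name, expected_words in expected.items():
        for index, (exp, act) in enumerate(zip(expected_words, actual[name])):
            if exp != act:
                report.append(
                    f"{name} + {index:o} = *{format_word(act)} {format_word(exp ^ act)}"
                )
    return report


if __name__ == "__main__":
    expected = {"v0reg": [0o1777777777777777777000]}  # inferred, see lead-in
    actual = {"v0reg": [0o1773777777777777777000]}    # from the sample dump
    for line in compare(expected, actual):
        print(line)
```

An empty report corresponds to the case in which the results match and the test proceeds with the next pass.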
The isolation process consists of two parts. During the first part, the vector instruction buffer is shortened from the end, one instruction sequence at a time. The isolation routine initially tests the number of instruction sequences generated minus one. The routine executes until the specified number of passes is reached (isop n) or an error is detected. If an error is detected, the number of instruction sequences tested is decremented by one, and testing continues for isop n passes. This process continues until no errors are detected or until there are no remaining instructions to be tested. If there are no remaining instructions to be tested and the test detects an error resulting from loading and unloading the registers, the test generates an output dump and the isolation process terminates.

During the second part of the isolation process, the last instruction sequence removed is tested by itself for isop n passes. If no error is detected, the preceding instruction sequence is loaded into the random vector instruction buffer and tested for isop n passes. Until the program detects an error or reaches the beginning of the instruction buffer, one more preceding instruction is added to the test sequence on each iteration of the isolation process.

When the isolation process terminates, the output dump contains the following:

• Isolated vector and scalar instruction buffers
• Data used when the failure occurred
• Scalar execution results from the master CPU
• Vector execution differences from the master CPU
• Scalar and vector execution differences from other CPUs

If the failure occurs intermittently, the second part of the isolation process may terminate without detecting an error, and execution difference results do not appear in the output dump. In this case, increase the value of isop n, enable the +repeat option, select the failing CPU, and use the failing seed to rerun the test.
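The two-part isolation process described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the olcsvc implementation: instruction sequences are represented as list items, run_pass is a hypothetical callback that executes one pass and returns True on a miscompare, and make_tester wraps it so that a sequence is tried for up to isop passes.

```python
def make_tester(run_pass, isop=0o1000):
    """Return fails(seqs): True if any of up to isop passes reports an error."""
    def fails(seqs):
        return any(run_pass(seqs) for _ in range(isop))
    return fails


def isolate(sequences, fails):
    """Reduce a failing list of instruction sequences.

    Returns the isolated sub-list, [] if even an empty buffer fails
    (register load/unload suspected), or None if part 2 never fails
    (for example, an intermittent error).
    """
    if not sequences:
        raise ValueError("nothing to isolate")

    # Part 1: shorten the buffer from the end until the remainder passes.
    count = len(sequences) - 1
    while fails(sequences[:count]):
        if count == 0:
            return []
        count -= 1
    last_removed = count  # index of the last sequence that was removed

    # Part 2: test the last removed sequence alone, then add one
    # preceding sequence at a time until an error is detected.
    for start in range(last_removed, -1, -1):
        candidate = sequences[start:last_removed + 1]
        if fails(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical fault: a miscompare occurs only when B and D both run.
    fails = make_tester(lambda seqs: "B" in seqs and "D" in seqs, isop=4)
    print(isolate(list("ABCDE"), fails))
```

In this model a None result corresponds to the intermittent case, for which the text recommends a larger isop n together with +repeat and the failing seed.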
All of the selected CPUs execute the scalar and vector instruction buffers. Therefore, if the program reports an error resulting from a failure in either the scalar or vector execution, the difference results should indicate where the failure occurred. For example, if the scalar and vector results indicate differences in all of the selected CPUs, the scalar instruction buffer in the master CPU is suspect. In this case, use the failing seed to rerun olcsvc in a different master CPU.

3.5.3 TEST TERMINATION

For information on test termination, refer to section 2, Confidence Test and Monitor Overview.

3.5.4 TEST EXAMPLES

This subsection contains olcsvc execution examples.

The following example runs olcsvc for 0'10000000 passes in CPU b. Output is redirected to olcsvc.log. The nohup(1) command allows the program to continue executing after you log off the system. You can later log on to check the test's progress. The ampersand (&) causes the entire command to execute in the background, so that another prompt is immediately displayed and you can continue to use the system.

    nohup olcsvc maxp 10000000 cpu b >olcsvc.log &

The following example shows a procedure for determining how frequently an error occurs. The test is rerun with the +repeat option, so that the first pass is run repeatedly until the test terminates. The test uses the seed value from the output sent to fail.log at the time of the initial error. Error isolation is disabled. The output is filtered to olcsvc.log.

    olcsvc +repeat -isolate maxerr 100 maxp 100 cpu d getseed fail.log | tail >olcsvc.log &

The following example runs olcsvc with floating-point multiply and central memory instructions, and instructions 140 through 143. The test uses a constant vector length of 0'100.

    olcsvc +fpmult +cm enable 140,141,142,143 vl 100 >olcsvc.log &

The following example runs olcsvc with all of the vector logical instructions except instructions 146 and 147.
    olcsvc +logical disable 146,147 >olcsvc.log &

The following example runs olcsvc with all of the instructions except floating-point multiply.

    olcsvc -fpmult >olcsvc.log &

The following example shows the output displayed when olcsvc is run with all default values.

    olcsvc

Output:

    olcsvc
    olcsvc started in cpu A on Tue Aug 25 13:42:07 1987
    CRAY X-MP MODE
    olcsvc reached maximum pass limit with 1000 passes and 0 errors on Tue Aug 25 13:42:15 1987

The following example runs olcsvc with the +verbose option enabled so that a line of output is generated after each pass.

    olcsvc +verbose

Output:

    olcsvc +verbose
    olcsvc started in cpu A on Tue Aug 25 11:42:47 1987
    CRAY X-MP MODE
    olcsvc: pass = 1, error = 0 Tue Aug 25 11:42:47 1987
    olcsvc: pass = 2, error = 0 Tue Aug 25 11:42:47 1987
    olcsvc: pass = 3, error = 0 Tue Aug 25 11:42:47 1987

    olcsvc: pass = 1000, error = 0 Tue Aug 25 11:42:55 1987
    olcsvc reached maximum pass limit with 1000 passes and 0 errors on Tue Aug 25 11:42:55 1987

The following example runs olcsvc for 10 seconds (CPU time) in CPU c only.

    olcsvc cpu c cputime 10

Output:

    olcsvc cpu c cputime 10
    olcsvc started in cpu C on Tue Aug 25 11:44:51 1987
    CRAY X-MP MODE
    olcsvc reached maximum cputime limit with 1510 passes and 0 errors on Tue Aug 25 11:45:06 1987

The following example runs olcsvc in CPUs a and c, with a as the master. On each pass, the test generates 20 parcels of vector instructions.
    olcsvc cpu a,c numpar 20

Output on an error:

    olcsvc cpu a,c numpar 20
    olcsvc started in cpus A, C with master cpu A on Mon Feb 9 11:19:19 1987
    CRAY X-MP MODE
    olcsvc: restart file written to Al1524-olcsvc

    name    <11760> = 'olcsvc  '
    rev     <11761> = '4.0     '
    date    <11762> = '02/09/87'
    pass    <11763> = 4
    error   <11764> = 1
    seed    <11765> = 37507312636362015466
    vl      <11770> = 0
    numpar  <12016> = 20
    isop    <14527> = 1000
    failpat <12475> = 'slide   '

    random vector instruction buffer

    15456a 15456b 15456c 15456d 15457a 15457c 15457d 15460a 15460b 15460c 15460d 15461a 15461b 15461c 15461d 15462a
    175073 160464 143005 153060 020600 000072 150156 163334 162334 147015 163607 165604 141752 162716 141227 172347 006000 057120
    VM V4 V0 V0 A6 V1 V3 V3 V0 V6 V6 V7 V7 V2 V3 J vbuff V7,M S6*FV4 V0!V5 V6,V6>A0 00000072 V5 00 S6 24457,A2 S0 15523c JSP S6!S2 S6 S2>100-77 S2 A2+A0 A2 A0 A6-A2 15522b JAN 23546,0 S6 00000002 A6 00 A5 23555,0 S2 24157,A5 S3 S2*FS3 S4 24157,A5 S4 A5+A0 A5 A6-A5 A0 15526c JAN

    (scalar instructions simulating all of the vector instructions are displayed)

    initial vector length and mask register data

    initvl  <21533> = 0000000000000000000002
    initvm  <21534> = 1600000000000000000000

    initial scalar register data

    inits0  <21535> = 1700000000000000000000
    inits1  <21536> = 1740000000000000000000
    inits2  <21537> = 1760000000000000000000
    inits3  <21540> = 1770000000000000000000
    inits4  <21541> = 1774000000000000000000
    inits5  <21542> = 1776000000000000000000
    inits6  <21543> = 1777000000000000000000
    inits7  <21544> = 1777400000000000000000

    initial vector register data

    (vector register data is displayed)

    initial Central Memory data

    (central memory data is displayed)

    scalar instruction buffer execution results

    The expected data shown below has the following format:

        name + index <offset> = data ...

    name:    The name of the data dumped on this line.
    index:   The index into the data starting at name. Optional, default: 0.
    offset:  The offset into the data buffer.
    data:    The actual data dumped.

    *** Expected Results ***
    cpu A (master)
    Source data buffer at 16300 in Memory copied to save buffer at 73613 in Memory
    Memory address in source data buffer = <offset> + 16300 (source data buffer)
    Memory address in save data buffer = <offset> + 73613 (save data buffer)
    Scalar Buffer Execution Results

    scalar buffer execution: vector length and mask register data results

    vlreg   <2010> = 0000000000000000000002
    vmreg   <2011> = 0000000000000000000000

    scalar buffer execution: scalar register data results

    s0reg   <2000> = 1700000000000000000000
    s1reg   <2001> = 1740000000000000000000
    s2reg   <2002> = 1760000000000000000000
    s3reg   <2003> = 1770000000000000000000
    s4reg   <2004> = 1774000000000000000000
    s5reg   <2005> = 1776000000000000000000
    s6reg   <2006> = 1777000000000000000000
    s7reg   <2007> = 1777400000000000000000

    scalar buffer execution: vector register data results

    (vector register data is displayed)

    scalar buffer execution: central memory data results

    (central memory data is displayed)

    The following data shows the differences between executing the scalar buffer in the master CPU and executing the vector buffer and scalar buffer in any remaining CPUs.

    vlreg        = vector length register results
    vmreg        = vector mask register results
    s0reg-s7reg  = scalar register data results
    v0reg-v7reg  = vector register data results
    cmbuff       = central memory data results

    The difference data shown below has the following format:

        name + index <offset> = data  data differences

    name:              The name of the data dumped on this line.
    index:             The index into the data starting at name. Optional, default: 0.
    offset:            The offset into the data buffer.
    data:              The actual data dumped. The differences are marked with an asterisk (*) preceding the data word.
    data differences:  The bits that differ between the actual results and the expected results.

    *** Differences ***
    cpu A (master)
    Source data buffer at 16300 in Memory copied to save buffer at 75626 in Memory
    Memory address in source data buffer = <offset> + 16300 (source data buffer)
    Memory address in save data buffer = <offset> + 75626 (save data buffer)
    Vector Buffer Execution Results

    *** Differences ***
    cpu C
    Source data buffer at 16300 in Memory copied to save buffer at 77641 in Memory
    Memory address in source data buffer = <offset> + 16300 (source data buffer)
    Memory address in save data buffer = <offset> + 77641 (save data buffer)
    Scalar Buffer Execution Results

    *** Differences ***
    cpu A (master)
    Source data buffer at 16300 in Memory copied to save buffer at 101654 in Memory
    Memory address in source data buffer = <offset> + 16300 (source data buffer)
    Memory address in save data buffer = <offset> + 101654 (save data buffer)
    Vector Buffer Execution Results

    v0reg   <23557> = *1773777777777777777000  0004000000000000000000

    Beginning error isolation
    Error isolation complete

    name    <11760> = 'olcsvc  '
    rev     <11761> = '4.0     '
    date    <11762> = '02/09/87'
    pass    <11763> = 4
    error   <11764> = 1
    seed    <11765> = 37507312636362015466
    vl      <11770> = 0
    numpar  <12016> = 20
    isop    <14527> = 1000
    failpat <12475> = 'slide   '

    isolated random vector instruction buffer

    vbuff   15460a  162334          V3   S3*HV4
            15460b  147015          V0   V1!V5&VM
            15460c  006000 057120   J    13624a

    (From this point on, the dump is similar to the previously listed portion of the dump that displayed the unisolated error information.)

    The first address (FADD) of the diagnostic is 11760a
    olcsvc reached maximum error limit with 4 passes and 1 errors on Mon Feb 9 17:23:52 1987

3.5.5 TEST MESSAGES

The olcsvc test produces the following types of messages:

• Test mode
• Informative

These messages are listed in the subsections that follow.
3.5.5.1 Test mode messages

During test execution, one of the following messages is displayed to indicate the test mode:

CRAY Y-MP MODE
    Indicates that the mainframe is a CEA system.

CRAY X-MP MODE
    Indicates that the mainframe is a CRAY X-MP computer system.

CRAY X-MP MODE: scatter/gather/compressed index testing disabled
    Indicates that the mainframe is a CRAY X-MP computer system without scatter/gather/compressed indexing hardware. If this message is inconsistent with your hardware configuration, it normally indicates an instruction failure. To determine where the failure occurred, rerun olcsvc with the +sgci command option. Contact your CRI representative for additional assistance.

CRAY-1 MODE
    Indicates that the mainframe is a CRAY-1 computer system.

CRAY-1 MODE: vector pop/parity testing disabled
    Indicates that the mainframe is a CRAY-1 computer system without vector population count/parity hardware. If this message is inconsistent with your hardware configuration, it normally indicates an instruction failure. To determine where the failure occurred, rerun olcsvc with the +pop command option. Contact your CRI representative for additional assistance.

3.5.5.2 Informative messages

If the +verbose option is enabled, a message is sent to stdout (standard output device) after each pass through the test loop. On an error, the test provides information such as the following:

• Pass and error counts
• Seed at the beginning of the pass on which the error occurred
• Contents of the vector instruction buffer
• Contents of the scalar instruction buffer
• Initial data
• Data results from the scalar instruction execution in the master CPU
• Differences in the scalar execution results from the master CPU, the scalar execution results from the remaining selected CPUs, and the vector execution results from all of the selected CPUs

3.6 olibuf

The olibuf test is an on-line instruction buffer test.
To detect data-sensitive failures, the program generates test buffers and runs data patterns through the instruction buffer. To detect branching failures, the program generates test buffers containing in-stack and out-of-stack jumps, compares expected jump addresses to actual jump addresses, and reports any differences. The test continues until the maximum pass, error, or time limit is reached.

3.6.1 TEST SYNOPSIS

The olibuf command options can be entered in any order. If an option is omitted, the program uses the default value. The test synopsis lists the olibuf command options and arguments in the following order:

1. Monitor options
2. Test-specific options
3. Data pattern options

Synopsis:

    olibuf [chkpnt mode] [cpu clist] [cputime h:m:s] [+|-getseed]
           [getseed file] [help] [maxerr n] [maxp n] [+|-parcel]
           [time h:m:s] [+|-verbose] [+xmp] [+cray1]†
           [+|-repeat] [seed n] [section slist]
           [+|-onezero] [+|-random] [+|-solid]

† The monitor command options are described in section 2, Confidence Test and Monitor Overview.

+|-repeat
    Enables (+repeat) or disables (-repeat) the option that repeats the first pass until the diagnostic terminates. +repeat is useful for recreating an error. It is normally used with one of the following options: seed n, +getseed, or getseed file. The default is -repeat (the program generates new instructions and data after each pass).

seed n
    Sets the random seed to n. n can be any 64-bit octal value. If n is 0, the test reads the real-time clock and uses the value for the initial seed. The default for n is 0'33. If seed n is selected, do not select +getseed or getseed file.

section slist
    Selects the test sections to be executed.
    slist is entered in the following format:

        n, n, ..., n

    n can be one of the following test sections (if allowed to default, all test sections are executed):

    Section   Description
    1         Executes a 16-bit pattern through parcel 0 of all words in the instruction buffer
    2         Executes a 16-bit pattern through parcel 1 of all words in the instruction buffer
    3         Executes a 16-bit pattern through parcel 2 of all words in the instruction buffer
    4         Executes a 16-bit pattern through parcel 3 of all words in the instruction buffer
    5         Executes random in-stack and out-of-stack jumps in the instruction buffer

+|-onezero, +|-random, +|-solid
    Selects (+) or deselects (-) specific data patterns. If allowed to default, all of the data patterns are run. The data patterns are as follows:

    Option    Data Pattern

    onezero   On each pass, random patterns of all 1's or all 0's are run through the test area. For example:

                  177777
                  000000

    random    On each pass, random bit patterns are run through the test area. For example:

                  102314
                  000347
                  164002
                  112323
                  130431

    solid     On each pass, a random pattern of either all 1's or all 0's is run through the test area with one complement pattern. The location of the complement pattern is randomly selected. For example:

              Pass 1
                  177777
                  177777
                  000000 (complement)
                  177777
                  177777

              Pass 2
                  000000
                  177777 (complement)
                  000000

              Pass 3
                  000000
                  177777 (complement)
                  000000

              Pass 4
                  177777
                  000000 (complement)
                  177777

3.6.2 TEST EXECUTION

The olibuf execution sequence is as follows:

1. Test initialization
2. Test buffer generation
3. Test buffer execution
4. Comparison of expected and actual data
5. Error report

Steps 2 through 4 occur on each pass through the test loop. Step 5 occurs only on error.

3.6.2.1 Test initialization

At test initialization, the selected sections and patterns are processed in the following order:

1. All sections and patterns are initially enabled.
2. Selected sections are processed.
3. Deselected patterns are processed. If all patterns are deselected, an error message is displayed and the test is terminated.

3.6.2.2 CRAY X-MP computer system test buffer generation

The generation routine builds and generates the test buffers. A test buffer is generated for each section selected.

Test sections 1 through 4 use the following instructions to execute a pattern through the instruction buffer:

    001000     PASS            Pass
    020ijkm    Ai   exp        Transmit exp=jkm to Ai
    11hi000    ,Ah  Ai         Store (Ai) to (Ah)
    030ijk     Ai   Aj+Ak      Integer sum of (Aj) and (Ak) to Ai
    0050jk     J    Bjk        Jump to (Bjk)

Test section 5 uses the following instructions to execute random in-stack and out-of-stack jumps in the instruction buffer:

    020ijkm    Ai   exp        Transmit exp=jkm to Ai
    11hi000    ,Ah  Ai         Store (Ai) to (Ah)
    030ijk     Ai   Aj+Ak      Integer sum of (Aj) and (Ak) to Ai
    006ijkm    J    exp        Jump to exp
    0050jk     J    Bjk        Jump to (Bjk)

The following example shows a sample test buffer for section 1. The parcel 0 instructions and data patterns are used to test first the odd and then the even words. When the test buffer is executed, each data pattern (nnnnnn) is loaded into parcel 0 of each instruction buffer word.
Example: Address Opcode CAL Mnemonics 5340a 5340b 5340c 5340d 5341b 5341d 5342a 5342b 5342c 5342d 5343b 5343d 5344a 5344b 5344c 001000 001000 001000 020100 112100 030223 001000 001000 001000 020100 112100 030223 001000 001000 001000 PASS PASS PASS A1 0,A2 A2 PASS PASS PASS A1 0,A2 A2 PASS PASS PASS 5536d 020100 nnnnnn SMM-1012 C nnnnnn 000000 nnnnnn 000000 A1 OOnnnnnn Instruction Buffer Word 001 A1 A2+A3 OOnnnnnn 003 A1 A2+A3 OOnnnnnn CRAY PROPRIETARY 177 3-89 Example (continued): Address Opcode CAL Mnemonics 5537b 5537d 5540a 5540b 5540c 5540d 5541a 5541b 5541c 5541d 5542b 5542d 5543a 5543b 5543c 112100 000000 030223 001000 001000 001000 001000 001000 001000 001000 020100 nnnnnn 112100 000000 030223 001000 001000 001000 0,A2 A2 PASS PASS PASS PASS PASS PASS PASS A1 0,A2 A2 PASS PASS PASS 5735d 5736b 5736d 5737a 5737b 5737c 5737d 5740b 5740d 5741a 020100 112100 030223 001000 001000 001000 020100 112100 030223 005000 nnnnnn Al 0,A2 A2 PASS PASS PASS Al 0,A2 A2 000000 nnnnnn 000000 J Instruction Buffer Word Al A2+A3 OOnnnnnn 002 Al A2+A3 OOnnnnnn 176 Al A2+A3 OOnnnnnn 000 A1 A2+A3 BOO The following example shows a sample test buffer for section 5. 
Example: 3-90 Absolute Address CAL Mnemonics testbuff testbuff+02: testbuff+06: testbuff+12: testbuff+14: testbuff+20: testbuff+22: testbuff+24: testbuff+26: testbuff+30: testbuff+32: ERR A1 0,A2 A2 J ERR ERR ERR ERR ERR ERR 000 0000000001 Al A2+A3 00000026660 000 000 000 000 000 000 CRAY PROPRIETARY Jump Address testbuff+214a SMM-1012 C Example (continued): Absolute Address CAL Mnemonics testbuff+34: testbuff+36: testbuff+40: testbuff+44: testbuff+SO: testbuff+52: testbuff+S6: testbuff+60: testbuff+62: testbuff+64: testbuff+66: testbuff+72: testbuff+76: testbuff+l00: ERR ERR Al 0,A2 A2 J ERR ERR ERR ERR Al 0,A2 A2 J 000 000 0000000020 Al A2+A3 00000026201 000 000 000 000 0000000033 Al A2+A3 00000026507 testbuff+2340: testbuff+2342: testbuff+2344: testbuff+2350: testbuff+2354: testbuff+2356: testbuff+2360: ERR ERR Al 0,A2 A2 ERR 000 000 0000001162 Al A2+A3 BOO 000 testbuff+2370: testbuff+2372: testbuff+2374: testbuff+2400: testbuff+2404: testbuff+2406: testbuff+2412: testbuff+2414: testbuff+2416: ERR ERR A1 0,A2 A2 J ERR ERR ERR 000 000 0000001176 Al A2+A3 00000026634 000 000 000 SMM-I012 C J CRAY PROPRIETARY Jump Address testbuff+l00b testbuff+161d Return jump testbuff+207a 3-91 3.6.2.3 CRAY Y-MP computer system test buffer generation The generation routine builds and generates the test buffers. A test buffer is generated for each section selected. 
Test sections 1 through 4 use the following instructions to execute a pattern through the instruction buffer: 0010000 020iOOmn 11hiOO 00 030ijk 0050jk PASS Ai ,Ah Ai J Pass Transmit nm to Ai Store (Ai) to (Ah) Integer sum of (Aj) and (Ak) to Ai Jump to (Bjk) exp Ai Aj+Ak Bjk Test section 5 uses the following instructions to execute random in-stack and out-of-stack jumps in the instruction buffer: 0010000 020iOOmn 11hiOO 00 030ijk 006ijkm 0050jk Pass Transmit nm to Ai Store (Ai) to (Ah) Integer sum of (Aj) and (Ak) to Ai Jump to exp Jump to (Bjk) PASS Ai exp ,Ah Ai Ai Aj+Ak exp J Bjk J The following example shows a sample test buffer for section 1. The parcel 0 instructions and data patterns are used to test first the odd and then the even words. When the test buffer is executed, each data pattern (nnnnnn) is loaded into parcel 0 of each instruction buffer word. Example: Address Opcode CAL Mnemonics 15740a 15740b 15740c 15740d 15741c 15742b 15742c 15742d 15743c 15744b 15744c 15744d 001000 001000 001000 020100 112100 030223 001000 020100 112100 030223 001000 020100 PASS PASS PASS A1 0,A2 A2 PASS A1 0,A2 A2 PASS A1 3-92 nnnnnn 000000 000000 000000 nnnnnn 000000 000000 000000 nnnnnn 000000 CRAY PROPRIETARY Instruction Buffer Word OOOOOnnnnnn A1 A2+A3 001 OOOOOnnnnnn A1 A2+A3 003 OOOOOnnnnnn 005 SMM-1012 C Example (continued): Instruction Buffer Word Address Dpcode CAL Mnemonics 15745c 15746b 112100 000000 000000 030223 0,A2 A2 A1 A2+A3 16136d 16137c 16140b 16140C 16140d 16141a 16141b 16141c 16141d 16142c 16143b 16143c 16143d 16144c 16145b 020100 112100 030223 001000 001000 001000 001000 001000 020100 112100 030223 001000 020100 112100 030223 nnnnnn 000000 000000 000000 A1 0,A2 A2 PASS PASS PASS PASS PASS A1 0,A2 A2 PASS A1 0,A2 A2 OOOOOnnnnnn A1 A2+A3 177 OOOOOnnnnnn A1 A2+A3 002 OOOOOnnnnnn A1 A2+A3 004 16335d 16336c 16337b 16337c 16337d 16340c 16341b 16341c 020100 112100 030223 001000 020100 112100 030223 005000 nnnnnn 000000 000000 000000 A1 0,A2 A2 PASS A1 0,A2 A2 
OOOOOnnnnnn A1 A2+A3 176 OOOOOnnnnnn A1 A2+A3 BOO 000 SMM-1012 C nnnnnn 000000 000000 000000 nnnnnn 000000 000000 000000 nnnnnn 000000 000000 000000 J CRAY PROPRIETARY 3-93 The following example shows a sample test buffer for section 5. Example: 3-94 Absolute Address CAL Mnemonics I5740a: I5740d: I574Ic: I574Id: I5742b: I5742c: I5742d: I5743a: I5743b: 15743c: I5743d: 15744a: 15744d: I5745c: I5745d: 15746b: I5746c: I5746d: I5747c: I5750b: I5750c: I575Ia: 15751b: I575Ic: I575Id: I5752a: I5752b: I5752c: 15752d: I5753a: I5753b: 15753c: I5753d: I5754a: I5754b: I5754c: I5754d: I5755a: 15755b: I5755c: I5756b: I5757a: I5757b: Al 0,A2 A2 J ERR ERR ERR ERR ERR ERR ERR Al 0,A2 A2 J ERR ERR Al 0,A2 A2 J ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR Al 0,A2 A2 J 00000000000 Al A2+A3 001606Ib 00000000020 Al A2+A3 0016040b 00000000033 Al A2+A3 0016152d 00000000066 Al A2+A3 00160I2c CRAY PROPRIETARY ------------------- SMM-I012 C Example (continued): Absolute Address CAL Mnemonics 16034b: 16034c: 16034d: 16035a: 16035b: 16036a: 16036d: 16037a: 16037b: 16037c: I6037d: 16040a: I6040b: 16041a: I604Id: 16042a: 16042c: I6042d: ERR ERR ERR ERR Al 0,A2 A2 J ERR ERR ERR ERR Al 0,A2 A2 J ERR ERR 16166c: 16166d: 16167c: I6I70b: 16170c: 16I71a: 16171b: 16171c: I6I71d: 16172a: 16172b: 16172c: I6I72d: 16173a: I6I73b: 16173c: 16173d: 16174a: I6174b: 16174c 16I75b: 16176a: 16I76b: 16176d: 16I77a: ERR Al 0,A2 A2 J ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR Al 0,A2 A2 J ERR ERR SMM-I012 C 00000000365 Al A2+A3 BOO (Return Jump) 00000000401 Al A2+A3 0015746d 00000001133 Al A2+A3 0015744a 00000001162 Al A2+A3 0015775c CRAY PROPRIETARY 3-95 3.6.2.4 Test buffer execution After the test buffers are generated, the execution routine jumps to the buffer and executes the test buffer code in all of the selected CPUs. The save monitor routine saves the results. If a jump fails and an error exit occurs (section 5 only), no results are saved. 
3.6.2.5 Comparison of expected and actual data

After the instructions are executed in all of the selected CPUs, the compare monitor routine compares the actual results to the expected results. If the results match, the test continues. After all of the selected sections and data patterns are run, the pass count is incremented. If the results do not match, the test dumps all of the data related to the suspected failure.

3.6.2.6 Error report

If an error is detected, the test dumps all of the data related to the suspected failure. The output dump contains the following:

• Diagnostic Information Block
• Test buffer data at the time of the failure
• Expected results
• Differences

3.6.3 ERROR ISOLATION TO THE FAILING BIT

An error report is generated for each section in which an error occurs. By examining a dump for any one of the test sections 1 through 4, you can isolate the error to the failing bit.

3.6.3.1 CX/1 system error isolation

Use the following procedure to isolate an error to the failing bit (perform all arithmetic operations in octal):

1. For a CRAY X-MP computer system, use the index to determine the failing word as follows:

       Index             Failing Word
       0'177             0
       index < 0'100     (index x 2) + 1
       index >= 0'100    (index - 0'77) x 2

   For a CRAY-1 computer system, use the index to determine the failing word as follows:

       Index             Failing Word
       0'77              0
       index < 0'40      (index x 2) + 1
       index >= 0'40     (index - 0'37) x 2

2. Examine the failing word to isolate the error to the failing bit.

The following example for a CRAY X-MP computer system shows a dump that was generated after test section 1 detected an error. By examining the dump, you can isolate the error to the failing bit, as follows (perform all arithmetic operations in octal):

1. Use the index (0'100) to determine the failing word as follows:

       (index - 0'77) x 2 = failing word
       (0'100 - 0'77) x 2 = 2

2. By examining the failing word, you can see that bit 2^5 is dropped.
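The index-to-word tables in step 1 can be sketched in Python as a quick cross-check when reading a dump. This is an illustrative sketch only and is not part of olibuf; the function name failing_word and the system argument values "xmp" and "cray1" are names assumed here for illustration. It encodes nothing beyond the two tables above.

```python
def failing_word(index, system="xmp"):
    """Map a dump index to the failing instruction buffer word.

    `system` is "xmp" or "cray1" (hypothetical labels for the two
    tables in step 1 of the procedure).
    """
    # (last index, first index of the even-word half) for each table
    last, half = {"xmp": (0o177, 0o100), "cray1": (0o77, 0o40)}[system]
    if not 0 <= index <= last:
        raise ValueError("index out of range for this system")
    if index == last:
        return 0                       # the final index maps to word 0
    if index < half:
        return index * 2 + 1           # odd words come first
    return (index - (half - 1)) * 2    # even words follow


# The worked example from this subsection, printed in octal.
print(oct(failing_word(0o100, "xmp")))  # 0o2
```

Working in Python's octal literals (0o...) keeps the arithmetic in the same base that the procedure requires.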
Example:

    olibuf started in cpu A on Mon May 23 15:53:40 1988
    olibuf: running
    olibuf: restart file written to A33641-olibuf
    name     < 1340> = 'olibuf
    rev      < 1341> = '1.0
    date     < 1342> = '05/17/88'
    pass     < 1343> = 0
    error    < 1344> = 1
    seed     < 1345> = 33
    failsec  < 1422> = 1
    failpat  < 2156> = 'random

    Section 1 - test buffer tests parcel 0
    buff
    5340a   001000            PASS
    5340b   001000            PASS
    5340c   001000            PASS
    5340d   020100 000033     A1      00000033
    5341b   112100 000000     0,A2    A1
    5341d   030223            A2      A2+A3
    5342a   001000            PASS
    5342b   001000            PASS
    5342c   001000            PASS
       .
       .
       .
    5541a   001000            PASS
    5541b   001000            PASS
    5541c   001000            PASS
    5541d   020100 120304     A1      00120304
    5542b   112100 000000     0,A2    A1
    5542d   030223            A2      A2+A3
    5543a   001000            PASS
    5543b   001000            PASS
    5543c   001000            PASS
    5543d   020100 164114     A1      00164114
    5544b   112100 000000     0,A2    A1
    5544d   030223            A2      A2+A3
    5545a   001000            PASS
    5545b   001000            PASS
    5545c   001000            PASS

    Expected results
    data        <  0> = 000000 000000 000000 000033  000000 000000 000000 000505
    data + 0002 <  2> = 000000 000000 000000 016667  000000 000000 000000 010021
    data + 0004 <  4> = 000000 000000 000000 130653  000000 000000 000000 042425
       .
       .
       .
    data + 0174 <174> = 000000 000000 000000 147000  000000 000000 000000 141014
    data + 0176 <176> = 000000 000000 000000 073260  000000 000000 000000 042520

    Difference(s) between exp and act results
    data + 0100 <200> = *000000 000000 000000 120304* 000000 000000 000000 164114
                         000000 000000 000000 000040  000000 000000 000000 000000

3.6.3.2 CRAY Y-MP computer system error isolation

Use the following procedure to isolate an error to the failing bit (perform all arithmetic operations in octal):

1. Use the index to determine the failing word as follows:

       Index             Failing Word
       0'177             0
       index < 0'100     (index x 2) + 1
       index >= 0'100    (index - 0'77) x 2

2. Examine the failing word to isolate the error to the failing bit.
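Step 2 amounts to reading the set bits out of the "data differences" parcel that the dump prints beneath the failing word. A minimal Python sketch follows; the function name differing_bits is hypothetical, and the bit numbering (2^0 as the low-order bit of the parcel) is an assumption that matches the CRAY X-MP example above, where a difference parcel of 000040 is reported as bit 2^5.

```python
def differing_bits(difference_parcel):
    """Return the powers of two that are set in a difference parcel.

    Assumes bit 2**0 is the low-order bit, as in the worked example
    (difference 0o000040 corresponds to bit 2**5).
    """
    bits = []
    position = 0
    while difference_parcel:
        if difference_parcel & 1:
            bits.append(position)
        difference_parcel >>= 1
        position += 1
    return bits


# Difference parcel from the CRAY X-MP example dump.
print(differing_bits(0o000040))  # [5]
```

A difference parcel with more than one bit set returns every differing position, lowest first.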
The following example for a CRAY Y-MP computer system shows a dump that was generated after test section 1 detected an error. By examining the dump, you can isolate the error to the failing bit, as follows (perform all arithmetic operations in octal):

1. Use the index (0'132) to determine the failing word as follows:

       (index - 0'77) x 2 = failing word
       (0'132 - 0'77) x 2 = 66

2. By examining the failing word, you can see that bit 2^3 is dropped.

Example:

    olibuf started in cpu A on Thu Aug 25 15:14:33 1988
    olibuf: restart file written to A62851-olibuf
    name     < 10740> = 'olibuf
    rev      < 10741> = '1.0
    date     < 10742> = '08/19/88'
    pass     < 10743> = 0
    error    < 10744> = 1
    seed     < 10745> = 33
    failsec  < 11022> = 1
    failpat  < 11616> = 'random

    Section 1 - test buffer tests parcel 0
    buff
    15740a  001000                   PASS
    15740b  001000                   PASS
    15740c  001000                   PASS
    15740d  020100 000033 000000     A1      00000000033
    15741c  112100 000000 000000     0,A2    A1
    15742b  030223                   A2      A2+A3
    15742c  001000                   PASS
    15742d  020100 000505 000000     A1      00000000505
    15743c  112100 000000 000000     0,A2    A1
    15744b  030223                   A2      A2+A3
    15744c  001000                   PASS
    15744d  020100 016667 000000     A1      00000016667
    15745c  112100 000000 000000     0,A2    A1
    15746b  030223                   A2      A2+A3
       .
       .
       .
    16223d  020100 063732 000000     A1      00000063732
    16224c  112100 000000 000000     0,A2    A1
    16225b  030223                   A2      A2+A3
    16225c  001000                   PASS
    16225d  020100 165420 000000     A1      00000165420
    16226c  112100 000000 000000     0,A2    A1
    16227b  030223                   A2      A2+A3
    16227c  001000                   PASS
    16227d  020100 152151 000000     A1      00000152151
    16230c  112100 000000 000000     0,A2    A1
    16231b  030223                   A2      A2+A3

    Expected results
    data        <  0> = 000000 000000 000000 000033  000000 000000 000000 000505
    data + 0002 <  2> = 000000 000000 000000 016667  000000 000000 000000 010021
    data + 0004 <  4> = 000000 000000 000000 130653  000000 000000 000000 042425
       .
       .
       .
    data + 0174 <174> = 000000 000000 000000 147000  000000 000000 000000 141014
    data + 0176 <176> = 000000 000000 000000 073260  000000 000000 000000 042520

    The difference data shown below has the following format:

    name + index <offset> = data
                            data differences

    name:    The name of the data dumped on this line.
    index:   The index into the data starting at name. Optional, default: 0.
    offset:  The offset into the data buffer.
    data:    The actual data dumped. The differences are marked with an asterisk (*) preceding the data word.
    data differences: The bits that differ between the actual results and the expected results.

    *** Differences ***

    Source data buffer at 14740 in Memory copied to save buffer at 103362 in Memory
    Memory address in source data buffer = <offset> + 14740 (source data buffer)
    Memory address in save data buffer = <offset> + 103362 (save data buffer)

    Difference(s) between exp and act results
    data + 0132 <264> = *000000 000000 000000 165420* 000000 000000 000000 152151
                         000000 000000 000000 000010  000000 000000 000000 000000

3.6.4 TEST TERMINATION

If a jump fails in section 5, an error exit occurs. There are several monitor options that can cause a test to terminate. Refer to the information on test termination in section 2, Confidence Test and Monitor Overview.

3.6.5 TEST EXAMPLES

This subsection contains olibuf execution examples.

The following example runs olibuf with selected command options and shell facilities. The test runs for 0'1000000 passes in CPU b with all default instructions. The job runs as a background process, and the output is sent to olibuf.log.

    olibuf maxp 1000000 cpu b >olibuf.log &

The following example runs olibuf with section 1 selected.

    olibuf section 1

The following example runs olibuf for 0'10000000 passes. Output is redirected to olibuf.log. The nohup(1) command allows the program to continue executing after you log off the system. You can later log on to check the test's progress. The ampersand (&) causes the entire command to execute in the background, so that another prompt is immediately displayed and you can continue to use the system.
    nohup olibuf maxp 10000000 >olibuf.log &

The following example shows the output displayed when olibuf is run with all default values.

    olibuf

Output:

    olibuf
    olibuf started in cpu A on Fri Aug 28 11:14:10 1987
    olibuf reached maximum pass limit with 1000 passes and 0 errors
    on Fri Aug 28 11:14:14 1987

The following example runs olibuf with the +verbose option enabled so that a line of output is generated after each pass.

    olibuf +verbose

Output:

    olibuf +verbose
    olibuf started in cpu A on Fri Aug 28 11:14:14 1987
    olibuf: pass = 1, error = 0 Fri Aug 28 11:14:14 1987
    olibuf: pass = 2, error = 0 Fri Aug 28 11:14:14 1987
    olibuf: pass = 3, error = 0 Fri Aug 28 11:14:14 1987
       .
       .
       .
    olibuf: pass = 1000, error = 0 Fri Aug 28 11:14:14 1987
    olibuf reached maximum pass limit with 1000 passes and 0 errors
    on Fri Aug 28 11:14:14 1987

The following example runs olibuf in CPU c only.

    olibuf cpu c

Output:

    olibuf cpu c
    olibuf started in cpu C on Fri Aug 28 11:14:14 1987
    olibuf reached maximum pass limit with 1000 passes and 0 errors
    on Fri Aug 28 11:14:14 1987

The following example runs olibuf in CPUs a and b, with a as the master.

    olibuf cpu a,b

Output:

    olibuf cpu a,b
    olibuf started in cpus A, B with master cpu A on Fri Aug 28 11:14:14 1987
    olibuf reached maximum pass limit with 1000 passes and 0 errors
    on Fri Aug 28 11:14:14 1987

The following example runs olibuf with the +verbose option enabled. The output is generated after an error is detected.
    olibuf +verbose

Output:

    olibuf +verbose
    olibuf started in cpu A on Fri Aug 28 11:14:14 1987
    olibuf: restart file written to A7465-olibuf
    name     < 14420> = 'olibuf
    rev      < 14421> = '1.0
    date     < 14422> = '08/27/87'
    pass     < 14423> = 0
    error    < 14424> = 1
    seed     < 14425> = 52301500217376
    failsec  < 23221> = 1
    failpat  < 15174> = 'solid

    Generated test buffer tests parcel 0
    buff

    (the test buffer that was executing when the error was detected
    is dumped in parcel and ASCII format)

    Section 1 - parcel 0 test

    The expected data shown below has the following format:

    name + index <offset> = data ...

    name:    The name of the data dumped on this line.
    index:   The index into the data starting at name. Optional, default: 0.
    offset:  The offset into the data buffer.
    data:    The actual data dumped.

    *** Expected Results ***     cpu A (master)

    Source data buffer at 6427 in Memory copied to save buffer at 70201 in Memory
    Memory address in source data buffer = <offset> + 6427 (source data buffer)
    Memory address in save data buffer = <offset> + 70201 (save data buffer)

    *** Expected Results ***

    (the expected data is dumped in parcel format)

    The difference data shown below has the following format:

    name + index <offset> = data
                            data differences

    name:    The name of the data dumped on this line.
    index:   The index into the data starting at name. Optional, default: 0.
    offset:  The offset into the data buffer.
    data:    The actual data dumped. The differences are marked with an asterisk (*) preceding the data word.
    data differences: The bits that differ between the actual results and the expected results.

    *** Differences ***     cpu A (master)

    Source data buffer at 6427 in Memory copied to save buffer at 71204 in Memory
    Memory address in source data buffer = <offset> + 6427 (source data buffer)
    Memory address in save data buffer = <offset> + 71204 (save data buffer)

    *** Differences ***

    (The differences are displayed. Differences are the results of the
    actual execution of the test buffer that differ from the expected results.)

    The first address (FADD) of the diagnostic is 14420a
    olibuf reached maximum error limit with 0 passes and 1 errors
    on Fri Aug 28 11:14:23 1987

3.6.6 TEST MESSAGES

The olibuf test produces the following types of messages:

• Informative
• Error

These messages are described in the subsections that follow.

3.6.6.1 Informative messages

If no error occurs, olibuf produces two messages, one at start-up time and another at test termination. If the +verbose option is enabled, a message is sent to stdout (standard output device) after each pass through the test loop. On an error, the test provides information such as the following:

• Pass and error counts
• Seed at the beginning of the pass on which the error occurred
• Failing word and parcel
• Test buffer data used when the error occurred
• Expected results
• Actual results
• Differences between the expected results from the master CPU and the actual execution results from all of the selected CPUs

3.6.6.2 Error messages

One of the following error messages is sent to stderr (standard error device) if an invalid command option is entered:

olibuf: initpat: No data patterns selected
    Select one or more data patterns and rerun.

olibuf: bldtbl: Invalid section selected. Valid sections are: 1-5.
    Select one or more valid test sections and rerun.

3.7 olsbt

The olsbt test is an on-line semaphore, shared B and shared T register test for CX/CEA systems. It tests the following components:

• Shared B registers
• Shared T registers
• Semaphores
• Clusters

The olsbt test generates a random sequence of shared register instructions and data to detect inter-CPU communication failures. The generated instructions are simulated and then executed.
If no differences are detected, the test generates new instructions and data, and repeats the process until the maximum pass, error, or time limit is reached for the selected cluster number.

The olsbt test runs under the confidence monitor program, oleman. The oleman monitor compares the actual and simulated results. For additional information on oleman, refer to section 2 of this manual, Confidence Test and Monitor Overview.

For additional information on inter-CPU communication, refer to the following manuals (as appropriate to your system configuration):

    Publication     Title
    CSM-0110-000    CRAY X-MP/2 System Programmer Reference Manual
    CSM-0111-000    CRAY X-MP/1 System Programmer Reference Manual
    CSM-0112-000    CRAY X-MP/4 System Programmer Reference Manual
    CSM-0400-000    CRAY Y-MP System Programmer Hardware Reference Manual

3.7.1 TEST SYNOPSIS

The olsbt command options can be entered in any order. If an option is omitted, the program uses the default value. The test synopsis lists the olsbt command options and arguments in the following order:

1. Monitor options
2. Test-specific options
3. Data pattern options
4. Instruction options

Synopsis:

    olsbt [chkpnt mode] [cpu clist] [cputime h:m:s] [+|-getseed]
          [getseed file] [help] [maxerr n] [maxp n] [+|-parcel]
          [time h:m:s] [+|-verbose]†
          [cluster n] [numins n] [+|-repeat] [seed n]
          [+|-bits] [+|-onezero] [+|-random]

cluster n
    Selects a specific cluster. n can be any one of the following cluster numbers associated with the indicated mainframe (cluster number 1 is reserved for the operating system):

        Mainframe       Cluster Numbers
        CRAY Y-MP/8     2, 3, 4, 5, 6, 7, 10, 11
        CRAY Y-MP/4     2, 3, 4, 5
        CRAY X-MP/4     2, 3, 4, 5
        CRAY X-MP/2     2, 3
        CRAY X-MP/1     2, 3

    The default for n is a random cluster number. The cluster number does not change during test execution. cluster n must be used to recreate a failure.

numins n
    Sets the number of instructions to be generated. n can be any value within the range 1 through 0'20.
    The default for n is 0'20.

+|-repeat
    Enables (+repeat) or disables (-repeat) the option that repeats the first pass until the diagnostic terminates. +repeat is useful for recreating an error. It is normally used with cluster n and one of the following options: seed n, +getseed, or getseed file. The default is -repeat (the program generates new instructions and data after each pass).

† The monitor command options are described in section 2, Confidence Test and Monitor Overview.

seed n
    Sets the random seed to n. n can be any 64-bit octal value. If n is 0, the test reads the real-time clock and uses the value for the initial seed. The default for n is 0'33. If seed n is selected, do not select +getseed or getseed file.

+|-bits, +|-onezero, +|-random
    Selects (+) or deselects (-) specific data patterns. The default selects all of the patterns. The data patterns are as follows:

    Option     Data Pattern

    bits       Random number of consecutive 1-bits in a word. For example:

                   0000017777777776000000
                   1777000000000000000377
                   1777777777777777777777
                   0000000000000000000000
                   0000000000100000000000

    onezero    Random selection of all 1's or all 0's in a word. For example:

                   1777777777777777777777
                   0000000000000000000000

    random     Random bit generation in a word. For example:

                   1023122123232122777127
                   0003423100233344322177
                   1640034356453221213532
                   1123235467543221344120
                   1304322300332105534311

3.7.2 TEST EXECUTION

The olsbt test should be executed with the maximum number of CPUs available on the system. This allows the requested cluster number to become available more quickly, since one process will be started in each CPU.

The olsbt test execution sequence is as follows:

1. Test initialization and hardware configuration detection
2. Random instruction and data generation
3. Random instruction buffer simulation
4. Random instruction buffer execution
5. Comparison of simulation and execution results
6. Error isolation

Steps 2 through 5 occur on each pass through the test loop. Step 6 occurs only on error.

3.7.2.1 Test initialization and hardware configuration detection

At test initialization, all instructions are enabled. The hardware configuration detection routine identifies the number of available clusters. If the cluster specified by the command option cluster n is not available, the program overrides cluster n and uses a random cluster.

3.7.2.2 Random instruction and data generation

These routines build and generate the random instruction buffers and initial data. Instructions for the buffers are randomly selected from a list of instructions. The values of the i, j, and k fields are randomly selected when appropriate.

If four CPUs are selected, four random instruction buffers are created, one for each CPU. If only one CPU is selected, two random instruction buffers are created and both are executed in the selected CPU.

Each instruction buffer contains instructions that enable it to write to the shared registers. Only one buffer can write to the shared registers at a time. The buffer that can write to the shared registers is rotated through the selected CPUs, starting with the selected master CPU. The other buffers can read from the shared registers if the master is not writing to that particular shared register. Before another buffer can begin writing to the shared registers, all buffers must be synchronized.
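The three data patterns selected by the bits, onezero, and random options can be illustrated with a short Python sketch. It is a reading aid only and does not reproduce the olsbt generator: the function names, the use of Python's random module, and the assumption that a run of 1-bits may wrap around the word boundary (suggested by the example word 1777000000000000000377) are all illustrative.

```python
import random

WORD_BITS = 64                       # one Cray word
WORD_MASK = (1 << WORD_BITS) - 1


def to_octal(word: int) -> str:
    """Format a 64-bit word as the 22-digit octal string used in the manual."""
    return format(word & WORD_MASK, "022o")


def bits_pattern(rng: random.Random) -> int:
    """Random-length run of consecutive 1-bits at a random position.

    Assumption: the run may wrap around the end of the word.
    """
    length = rng.randint(0, WORD_BITS)
    start = rng.randrange(WORD_BITS)
    run = (1 << length) - 1
    # Rotate the run left by `start` bit positions within the word.
    return ((run << start) | (run >> (WORD_BITS - start))) & WORD_MASK


def onezero_pattern(rng: random.Random) -> int:
    """Either all 1's or all 0's, chosen at random."""
    return WORD_MASK if rng.getrandbits(1) else 0


def random_pattern(rng: random.Random) -> int:
    """Every bit chosen at random."""
    return rng.getrandbits(WORD_BITS)


PATTERNS = {
    "bits": bits_pattern,
    "onezero": onezero_pattern,
    "random": random_pattern,
}


def generate(selected, count, seed):
    """Return `count` octal words for each selected pattern name."""
    if not selected:
        raise ValueError("no data pattern(s) selected")
    rng = random.Random(seed)        # hypothetical stand-in for the test's seed
    return [to_octal(PATTERNS[name](rng)) for name in selected for _ in range(count)]


if __name__ == "__main__":
    for word in generate(["bits", "onezero", "random"], 2, 0o33):
        print(word)
```

Seeding the generator makes a run repeatable, which mirrors the reason the manual asks for the failing seed when recreating an error; the words produced here will not match those produced by olsbt for the same seed.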
A sample of the instruction buffers for four CPUs is as follows:

    ibuff0
    003416    SM16    1,TS
    003404    SM04    1,TS
    003401    SM01    1,TS
    003603    SM03    0
    003627    SM27    0
    003434    SM34    1,TS
    003730    SM30    1
    003726    SM26    1
    003702    SM02    1
    026227    A2      SB2
    003634    SM34    0
    003635    SM35    0
    003405    SM05    1,TS
    003605    SM05    0
    003617    SM17    0
    003413    SM13    1,TS
    003613    SM13    0
    003410    SM10    1,TS
    003415    SM15    1,TS
    003406    SM06    1,TS
    003636    SM36    0
    005000    J       B00

    ibuff1
    003616    SM16    0
    003403    SM03    1,TS
    072473    S4      ST7
    072333    S3      ST3
    026607    A6      SB0
    003603    SM03    0
    003431    SM31    1,TS
    003425    SM25    1,TS
    003427    SM27    1,TS
    003634    SM34    0
    003620    SM20    0
    026427    A4      SB2
    003623    SM23    0
    003405    SM05    1,TS
    003605    SM05    0
    003600    SM00    0
    003413    SM13    1,TS
    003613    SM13    0
    003610    SM10    0
    003436    SM36    1,TS
    003636    SM36    0
    005000    J       B00

    ibuff2
    003604    SM04    0
    003403    SM03    1,TS
    003603    SM03    0
    003631    SM31    0
    003434    SM34    1,TS
    003634    SM34    0
    003433    SM33    1,TS
    003435    SM35    1,TS
    003423    SM23    1,TS
    026267    A2      SB6
    072663    S6      ST6
    073343    ST4     S3
    003605    SM05    0
    072213    S2      ST1
    026647    A6      SB4
    003621    SM21    0
    003413    SM13    1,TS
    003613    SM13    0
    003615    SM15    0
    003436    SM36    1,TS
    003636    SM36    0
    005000    J       B00

    ibuff3
    003601    SM01    0
    003403    SM03    1,TS
    003603    SM03    0
    026067    A0      SB6
    026367    A3      SB6
    026767    A7      SB6
    003614    SM14    0
    003625    SM25    0
    003434    SM34    1,TS
    003634    SM34    0
    003633    SM33    0
    003405    SM05    1,TS
    003605    SM05    0
    003417    SM17    1,TS
    003400    SM00    1,TS
    003421    SM21    1,TS
    003613    SM13    0
    003606    SM06    0
    003436    SM36    1,TS
    003636    SM36    0
    005000    J       B00

3.7.2.3 Random instruction buffer simulation

After the instructions and data are generated, the master CPU simulates the random instruction buffers. The save monitor routine saves the results. Each instruction type has a unique simulation routine. The simulation routines do not use any of the shared register hardware.
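Because each instruction type has its own simulation routine, simulating a buffer presumably begins by identifying the instruction held in each parcel. The following Python sketch shows one way to read the sample buffers in subsection 3.7.2.2: it splits a six-digit octal parcel into gh, i, j, and k fields and recognizes only the patterns that appear in that sample listing. The pattern table is inferred from the listing, the function name decode_parcel is hypothetical, and the sketch is not a complete instruction decoder.

```python
def decode_parcel(parcel: str) -> str:
    """Decode one 16-bit parcel, given as six octal digits.

    Only the instruction forms that appear in the sample buffers are
    handled; anything else raises ValueError.
    """
    if len(parcel) != 6 or any(c not in "01234567" for c in parcel):
        raise ValueError(f"not a six-digit octal parcel: {parcel!r}")

    # The first three octal digits hold the g and h fields (7 bits);
    # the remaining digits are the 3-bit i, j, and k fields.
    gh, i, j, k = parcel[:3], parcel[3], parcel[4], parcel[5]

    if gh == "003" and i in "467":
        # Semaphore jk: test and set, clear, or set.
        action = {"4": "1,TS", "6": "0", "7": "1"}[i]
        return f"SM{j}{k} {action}"
    if gh == "026" and k == "7":
        return f"A{i} SB{j}"      # shared B register j to address register i
    if gh == "072" and k == "3":
        return f"S{i} ST{j}"      # shared T register j to scalar register i
    if gh == "073" and k == "3":
        return f"ST{j} S{i}"      # scalar register i to shared T register j
    if gh == "005" and i == "0":
        return f"J B{j}{k}"       # jump to the address held in B register jk
    raise ValueError(f"parcel {parcel} is not covered by this sketch")


if __name__ == "__main__":
    for parcel in ("003416", "003603", "003730", "026227", "073343", "005000"):
        print(parcel, decode_parcel(parcel))
```

Running the sketch over any of the four sample buffers reproduces the mnemonic and operand columns of the listing, which is a quick way to check a transcribed parcel value.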
3.7.2.4 Random instruction buffer execution

After the instructions are simulated, all of the selected CPUs execute their own instruction buffer in the selected cluster. The master CPU uses the system call cpu(4D) to select the cluster.

The olsbt test allows you to test inter-CPU control and communication by synchronizing code execution among selected CPUs. The first CPU selected is the master CPU, which generates and simulates all instruction buffers for all selected CPUs.

The following characteristics apply to instruction buffer execution:

• The master CPU creates and schedules processes using the following system calls:

    System Call    Description
    tfork(2)       Creates a multitasking process for each selected CPU
    cpselect(2)    Schedules the processes in the CPUs

• Only one buffer can write to the shared B and shared T registers in the specified cluster at a time.

• The master CPU loads the shared registers with the generated data before starting the other CPUs. The master CPU then waits for all CPUs to execute their buffers before unloading the shared registers.

• All semaphores used in the test and set instructions in the instruction buffers are initially set.

Before the instructions can be executed, the master CPU loads the following:

• Shared B registers
• Shared T registers
• Semaphore register
• Address registers for the master CPU
• Scalar registers for the master CPU

The other CPUs load the following:

• Address registers
• Scalar registers

Then an unconditional jump to the random instruction buffer is executed in each CPU. At the end of the random instruction buffer is a jump to B00. Each CPU unloads the contents of its address and scalar registers. The master CPU waits until all CPUs have executed and then unloads the contents of the shared registers. The save monitor routine saves the results.
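Once the results are saved, the comparison that follows reduces to a word-by-word exclusive OR of the simulated (expected) and actual results, which is also what the Differences section of the output dump reports. The Python sketch below illustrates this under stated assumptions: the dictionary layout and the names diff_words and format_diff are hypothetical, and the data values are taken from the semaphore failure example in subsection 3.7.4 (expected 1106721617240000000000, actual 1000000000000000000000).

```python
def diff_words(expected: dict, actual: dict) -> dict:
    """Return {name: (actual_word, differing_bits)} for every mismatch.

    differing_bits is the exclusive OR of the expected and actual words,
    so each 1-bit marks a bit position where the two results disagree.
    """
    diffs = {}
    for name, exp in expected.items():
        act = actual[name]
        if act != exp:
            diffs[name] = (act, act ^ exp)
    return diffs


def format_diff(name: str, act: int, xor: int) -> str:
    """Format one difference in the style of the dump: an asterisk marks
    the actual word, followed by the differing bits, both as 22 octal digits."""
    return f"{name} = *{act:022o} {xor:022o}"


if __name__ == "__main__":
    # Values from the semaphore register failure example in the manual.
    expected = {"actsm": int("110672161724" + "0" * 10, 8)}
    actual = {"actsm": 1 << 63}
    for name, (act, xor) in diff_words(expected, actual).items():
        print(format_diff(name, act, xor))
```

An empty result from diff_words corresponds to the case in which the results match and the test proceeds with the next data pattern.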
3.7.2.5 Comparison of simulation and execution results

After the instructions execute in all of the selected CPUs, the compare monitor routine compares the results, and one of the following actions occurs:

• If the results match, the test proceeds with the next data pattern. After all of the selected data patterns are run, the pass count is incremented.

• If the results do not match, the test dumps all of the data related to the suspected failure. If a deadlock interrupt was received, a core dump is produced and the test terminates.

3.7.2.6 Error isolation

The output dump contains the following:

• Data used when the failure occurred
• Simulated execution results
• Actual execution results (if different from the simulated results)
• Exclusive OR of the simulated and actual execution results

The program may report an error resulting from a failure in either the simulated or actual execution. To determine if the error is the result of an actual execution failure, start olsbt in a different CPU and select the suspected failing CPU. For example, the following entry starts olsbt in CPU c:

    olsbt cpu c

If olsbt fails, and the simulated execution is suspect, rerun olsbt using a different master CPU, the failing seed, and the failing cluster, as follows:

    olsbt cpu a,c +repeat seed n cluster n

If olsbt fails in CPU c, the failure is in the actual execution of the random instruction buffer. If olsbt does not fail, the error is either in the simulated execution results from CPU c or it is very intermittent.

3.7.3 TEST TERMINATION

For information on test termination, refer to section 2.4, Test Termination.

3.7.4 TEST EXAMPLES

This subsection contains olsbt execution examples.

The following example runs olsbt with all defaults. olsbt executes in CPU a. The output is displayed at the operator console.

    olsbt

The following example runs olsbt in CPUs a, b, c, and d. The output is displayed at the operator console.

    olsbt cpu a,b,c,d

The following example runs olsbt for 0'10000000 passes. By default, olsbt executes in CPU a. Output is redirected to sbt.log. The nohup(1) command allows the program to continue executing after you log off the system. You can later log on to check the test's progress. The ampersand (&) causes the entire command to execute in the background, so that another prompt is immediately displayed and you can continue to use the system.

    nohup olsbt maxp 10000000 >sbt.log &

The following example runs olsbt with selected command options and shell facilities. olsbt runs for 0'1000000 passes in CPUs a and b. The job runs as a background process, and output is sent to sbt.log.

    olsbt maxp 1000000 cpu a,b >sbt.log &

The following example shows a procedure for determining how frequently an error occurs. olsbt is rerun with the +repeat option, so that the first pass is run repeatedly until the test terminates. The test uses the seed value and the failing cluster number from the output at the time of the initial error. Error isolation is disabled and olsbt executes in CPUs a, b, c, and d. The job runs as a background process, and output is sent to sbt.log.

    olsbt +repeat -isolate maxerr 100 maxp 100 cpu a,b,c,d seed 1436651016713554002511 cluster 4 >sbt.log &

The following example shows the output displayed when olsbt is run with all default values.

    olsbt

Output:

    olsbt
    olsbt started in cpu A on Wed Dec 14 15:18:56 1988
    CRAY Y-MP MODE
    olsbt reached maximum pass limit with 1000 passes and 0 errors on Wed Dec 14 15:20:23 1988

The following example runs olsbt in four CPUs with the +verbose option enabled so that a line of output is generated after each pass.
    olsbt cpu a,b,c,d +verbose

Output:

    olsbt cpu a,b,c,d +verbose
    olsbt started in cpus A, B, C, D with master cpu A on Wed Dec 14 15:19:08 1988
    CRAY Y-MP MODE
    olsbt: pass = 1, error = 0 Wed Dec 14 15:19:26 1988
    olsbt: pass = 2, error = 0 Wed Dec 14 15:19:26 1988
    olsbt: pass = 3, error = 0 Wed Dec 14 15:19:26 1988
    ...
    olsbt: pass = 1000, error = 0 Wed Dec 14 15:21:23 1988
    olsbt reached maximum pass limit with 1000 passes and 0 errors on Wed Dec 14 15:21:23 1988

The following example runs olsbt in CPUs a, b, c, and d with CPU a as the master.

    olsbt cpu a,b,c,d

Output on an error:

    olsbt cpu a,b,c,d
    olsbt started in cpus A, B, C, D with master cpu A on Wed Dec 7 14:27:00 1988
    CRAY Y-MP MODE
    olsbt: restart file written to A35411-olsbt
    name    <  200> = 'olsbt
    rev     <  201> = '5.0
    date    <  202> = '12/07/88'
    pass    <  203> = 4
    error   <  204> = 1
    seed    <  205> = 103336000000000000000
    failpat < 1774> = 'bits
    failcln <  220> = 2
    numins  <  206> = 20

    TASK 0 random instruction buffer executed in CPU A
    ibuff0
    4200a    003416    SM16    1,TS
    4200b    003404    SM04    1,TS
    4200c    003401    SM01    1,TS
    4200d    003603    SM03    0
    4201a    003627    SM27    0
    4201b    003434    SM34    1,TS
    4201c    003730    SM30    1
    4201d    003726    SM26    1
    4202a    003702    SM02    1
    4202b    026227    A2      SB2
    4202c    003634    SM34    0
    4202d    003635    SM35    0
    4203a    003405    SM05    1,TS
    4203b    003605    SM05    0
    4203c    003617    SM17    0
    4203d    003413    SM13    1,TS
    4204a    003613    SM13    0
    4204b    003410    SM10    1,TS
    4204c    003415    SM15    1,TS
    4204d    003406    SM06    1,TS
    4205a    003636    SM36    0
    4205b    005000    J       B00

    TASK 1 random instruction buffer executed in CPU B
    ibuff1
    4240a    003616    SM16    0
    4240b    003403    SM03    1,TS
    4240c    072473    S4      ST7
    4240d    072333    S3      ST3
    4241a    026607    A6      SB0
    4241b    003603    SM03    0
    4241c    003431    SM31    1,TS
    4241d    003425    SM25    1,TS
    4242a    003427    SM27    1,TS
    4242b    003634    SM34    0
    4242c    003620    SM20    0
    4242d    026427    A4      SB2
    4243a    003623    SM23    0
    4243b    003405    SM05    1,TS
    4243c    003605    SM05    0
    4243d    003600    SM00    0
    4244a    003413    SM13    1,TS
    4244b    003613    SM13    0
    4244c    003610    SM10    0
    4244d    003436    SM36    1,TS
    4245a    003636    SM36    0
    4245b    005000    J       B00

    TASK 2 random instruction buffer executed in CPU C
    ibuff2
    4300a    003604    SM04    0
    4300b    003403    SM03    1,TS
    4300c    003603    SM03    0
    4300d    003631    SM31    0
    4301a    003434    SM34    1,TS
    4301b    003634    SM34    0
    4301c    003433    SM33    1,TS
    4301d    003435    SM35    1,TS
    4302a    003423    SM23    1,TS
    4302b    026267    A2      SB6
    4302c    072663    S6      ST6
    4302d    073343    ST4     S3
    4303a    003605    SM05    0
    4303b    072213    S2      ST1
    4303c    026647    A6      SB4
    4303d    003621    SM21    0
    4304a    003413    SM13    1,TS
    4304b    003613    SM13    0
    4304c    003615    SM15    0
    4304d    003436    SM36    1,TS
    4305a    003636    SM36    0
    4305b    005000    J       B00

    TASK 3 random instruction buffer executed in CPU D
    ibuff3
    4340a    003601    SM01    0
    4340b    003403    SM03    1,TS
    4340c    003603    SM03    0
    4340d    026067    A0      SB6
    4341a    026367    A3      SB6
    4341b    026767    A7      SB6
    4341c    003614    SM14    0
    4341d    003625    SM25    0
    4342a    003434    SM34    1,TS
    4342b    003634    SM34    0
    4342c    003633    SM33    0
    4342d    003405    SM05    1,TS
    4343a    003605    SM05    0
    4343b    003417    SM17    1,TS
    4343c    003400    SM00    1,TS
    4343d    003421    SM21    1,TS
    4344a    003613    SM13    0
    4344b    003606    SM06    0
    4344c    003436    SM36    1,TS
    4344d    003636    SM36    0
    4345a    005000    J       B00

    initial address register data for TASK 0
    initar0        < 5210> = 0000000000020000000000
    initar0 + 0004 < 5214> = 0000000000000000000000

    initial scalar register data for TASK 0
    initsr0        < 5200> = 0377777777776000000000
    initsr0 + 0004 < 5204> = 0000000000000000000000

    initial address register data for TASK 1
    (address register data is displayed for task 1)

    initial scalar register data for TASK 1
    (scalar register data is displayed for task 1)

    initial address register data for TASK 2
    (address register data is displayed for task 2)

    initial scalar register data for TASK 2
    (scalar register data is displayed for task 2)

    initial address register data for TASK 3
    (address register data is displayed for task 3)

    initial scalar register data for TASK 3
    (scalar register data is displayed for task 3)

    initial shared B register data
    initsb        < 5300> = 0000000000000000000000
    initsb + 0004 < 5304> = 0000000000000177777777

    initial shared T register data
    initst        < 5310> = 0000000000000777760000
    initst + 0004 < 5314> = 1777740000000001777777

    initial semaphore register data
    initsm        < 5320> = 1577777777700000000000

    simulated random instruction buffer results

    The expected data shown below has the following format:

    name + index <offset> = data ...

    name:      The name of the data dumped on this line.
    index:     The index into the data starting at name. Optional, default: 0.
    offset:    The offset into the data buffer.
    data:      The actual data dumped.

    *** Expected Results ***

    cpu A (master)
    Source data buffer at 6200 in Memory
    Memory address in source data buffer = <offset> + 6200 (source data buffer)

    simulated address register data results for TASK 0
    actar0        <   10> = 0000000000020000000000
    actar0 + 0004 <   14> = 0000000000000000000000

    simulated scalar register data results for TASK 0
    actsr0        <    0> = 0377777777776000000000
    actsr0 + 0004 <    4> = 0000000000000000000000

    simulated address register data results for TASK 1
    (address register data is displayed for task 1)

    simulated scalar register data results for TASK 1
    (scalar register data is displayed for task 1)

    simulated address register data results for TASK 2
    (address register data is displayed for task 2)

    simulated scalar register data results for TASK 2
    (scalar register data is displayed for task 2)

    simulated address register data results for TASK 3
    (address register data is displayed for task 3)

    simulated scalar register data results for TASK 3
    (scalar register data is displayed for task 3)

    simulated shared B register data results
    actsb        <  100> = 0000000000000000000000
    actsb + 0004 <  104> = 0000000000000177777777

    simulated shared T register data results
    actst        <  110> = 0000000000000777760000
    actst + 0004 <  114> = 1777777777777777777777

    simulated semaphore register data results
    actsm        <  120> = 1657473777200000000000

    Differences are the results from actual execution of the random instruction buffer that differ from the master (simulated or actual) execution.

    actar = address register data results
    actsr = scalar register data results
    actsb = sb0-sb7 register data results
    actst = st0-st7 register data results
    actsm = semaphore register data result

    The difference data shown below has the following format:

    name + index <offset> = data data differences

    name:      The name of the data dumped on this line.
    index:     The index into the data starting at name. Optional, default: 0.
    offset:    The offset into the data buffer.
    data:      The actual data dumped. The differences are marked with an asterisk (*) preceding the data word.
    data differences:
               The bits in difference between the actual results and the expected results.

    *** Differences ***

    cpu A (master)
    Source data buffer at 7200 in Memory copied to save buffer at 113755 in Memory
    Memory address in source data buffer = <offset> + 7200 (source data buffer)
    Memory address in save data buffer = <offset> + 113755 (save data buffer)

    actual random buffer execution results
    actst + 0004 <  114> = *0000000000000000000000 1777777777777777777777

    The first address (FADD) of the diagnostic is 200a
    olsbt reached maximum error limit with 4 passes and 1 errors at Wed Dec 7 14:27:00 1988

If olsbt determines that the initial load of the semaphores failed, the test produces a dump and terminates.
Output on an error:

    olsbt cpu a,b,c,d
    olsbt started in cpus A, B, C, D with master cpu A on Wed Dec 7 15:12:29 1988
    CRAY Y-MP MODE
    execute: an error was detected in the initial load of the semaphore register
    olsbt: restart file written to A60249-olsbt
    name    <  200> = 'olsbt
    rev     <  201> = '5.0
    date    <  202> = '12/07/88'
    pass    <  203> = 0
    error   <  204> = 1
    seed    <  205> = 33
    failpat < 1774> = 'bits
    failcln <  220> = 2
    numins  <  206> = 20

    TASK 0 random instruction buffer executed in CPU A
    2175a    073102    SM    S1
    2175b    072202    S2    SM
    2175c    046012    S0    S1\S2

    initial address register data for TASK 0
    initar0        < 5210> = 0000000000000000000000
    initar0 + 0004 < 5214> = 0000000000000000000000

    initial scalar register data for TASK 0
    initsr0        < 5200> = 0000000000000000000760
    initsr0 + 0004 < 5204> = 0000777777777777777777

    initial shared B register data
    initsb        < 5300> = 0000000000000000000000
    initsb + 0004 < 5304> = 0000000000000000000000

    initial shared T register data
    initst        < 5310> = 0000000000000000000020
    initst + 0004 < 5314> = 1777776000000000000007

    initial semaphore register data
    initsm        < 5320> = 1106721617240000000000

    simulated random instruction buffer results

    The expected data shown below has the following format:

    name + index <offset> = data ...

    name:      The name of the data dumped on this line.
    index:     The index into the data starting at name. Optional, default: 0.
    offset:    The offset into the data buffer.
    data:      The actual data dumped.

    *** Expected Results ***
    cpu A (master)
    Source data buffer at 6200 in Memory
    Memory address in source data buffer = <offset> + 6200 (source data buffer)

    simulated address register data results for TASK 0
    actar0        <   10> = 0000000000000000000000
    actar0 + 0004 <   14> = 0000000000000000000000

    simulated scalar register data results for TASK 0
    actsr0        <    0> = 0000000000000000000000
    actsr0 + 0004 <    4> = 0000000000000000000000

    simulated shared B register data results
    actsb        <  100> = 0000000000000000000000
    actsb + 0004 <  104> = 0000000000000000000000

    simulated shared T register data results
    actst        <  110> = 0000000000000000000000
    actst + 0004 <  114> = 0000000000000000000000

    simulated semaphore register data results
    actsm        <  120> = 1106721617240000000000

    Differences are the results from actual execution of the random instruction buffer that differ from the master (simulated or actual) execution.

    actar = address register data results
    actsr = scalar register data results
    actsb = sb0-sb7 register data results
    actst = st0-st7 register data results
    actsm = semaphore register data result

    The difference data shown below has the following format:

    name + index <offset> = data data differences

    name:      The name of the data dumped on this line.
    index:     The index into the data starting at name. Optional, default: 0.
    offset:    The offset into the data buffer.
    data:      The actual data dumped. The differences are marked with an asterisk (*) preceding the data word.
    data differences:
               The bits in difference between the actual results and the expected results.
    *** Differences ***

    cpu A (master)
    Source data buffer at 6200 in Memory
    Memory address in source data buffer = <offset> + 6200 (source data buffer)

    *** Differences ***

    cpu A (master)
    Source data buffer at 7200 in Memory copied to save buffer at 113755 in Memory
    Memory address in source data buffer = <offset> + 7200 (source data buffer)
    Memory address in save data buffer = <offset> + 113755 (save data buffer)

    actsm        <  120> = *1000000000000000000000 0106721617240000000000

    The first address (FADD) of the diagnostic is 200a
    olsbt reached maximum error limit with 0 passes and 1 errors at Wed Dec 7 15:12:30 1988

3.7.5 TEST MESSAGES

The olsbt test produces the following types of messages:

• Test mode
• Informative
• Error

These messages are listed in the subsections that follow.

3.7.5.1 Test mode messages

During test execution, one of the following messages is displayed to indicate the test mode:

    CRAY Y-MP MODE    Indicates that the mainframe is a CEA system (Y-mode).

    CRAY X-MP MODE    Indicates that the mainframe is a CRAY X-MP computer system.

3.7.5.2 Informative messages

If no error occurs, the test generates two messages, one at start-up time and the other at test termination. If the +verbose option is enabled, a message is sent to stdout (standard output device) after each pass through the test loop.
On an error, the test provides information such as the following:

• Pass and error counts
• Seed at the beginning of the pass on which the error occurred
• Cluster number for the error that occurred
• Contents of the instruction buffers and in which CPU each instruction buffer was executed
• Initial data
• Resulting data from the simulated instruction execution in the master CPU
• Differences between the simulation execution results from the master CPU and the actual execution results from all of the selected CPUs

3.7.5.3 Error messages

The following error message is sent to stderr (standard error device) if an invalid command option is entered:

    olsbt: no data pattern(s) selected
        All data patterns were deselected (-bits -onezero -random). Correct and rerun.

The following messages are sent to stderr if olsbt detects an unexpected error. Select a different master CPU and rerun the test. If the problem persists, contact your CRI representative.

    olsbt: generate: (software error) The instruction does not have a generation routine.

    olsbt: simulate: (software error) a deadlock was encountered during simulation.

    olsbt: simulate: (software error) gh field is not valid.

    olsbt: simulate: (software error) ijk field is not valid.

    olsbt: simulate: (software error) The instruction does not have a simulation routine.

The following error message is sent to stderr if olsbt detects an error in the initial load of the semaphore register. Contact your CRI representative.

    execute: an error was detected in the initial load of the semaphore register.

4. MAINTENANCE TEST AND MONITOR OVERVIEW

The on-line maintenance tests provide error detection and isolation. These on-line tests are variants of the off-line diagnostic tests.
This section provides an overview of the following information:

• Maintenance monitor (olmon)†
• Program synopsis
• Test execution
• Test-specific requirements
• Test termination
• Test examples
• Test messages
• Diagnostic memory image

For a brief description of each maintenance test, refer to appendix A, On-line Diagnostic Programs. For a list of test execution times, refer to appendix B, Test Execution Times. For additional information on the maintenance tests, refer to the on-line diagnostic listings.

† CEA (X-mode) and CX/1 systems only

4.1 MAINTENANCE MONITOR (olmon)

The olmon monitor is a C program monitor for the on-line maintenance tests. The loader program attaches olmon to a slightly modified version of an off-line diagnostic test to create an on-line maintenance program.

The olmon monitor provides the interface to the on-line maintenance tests. By accepting and interpreting command options and arguments, olmon allows you to do the following:

• Set the diagnostic information block (DIB) locations in the diagnostic
• Set limits on the maximum number of passes and errors allowed (maxerr n and maxp n)
• Set limits on test execution time, in CPU time (cputime h:m:s) or elapsed (wall-clock) time (time h:m:s)
• Allocate memory for memory tests
• Select the CPU to be tested
• Send test results to stdout (standard output device) by default or to a file by indicating output redirection on the command line

4.2 PROGRAM SYNOPSIS

Before a test can be started, UNICOS must be running in the CPU to be tested. The olmon command options can be entered in any order. If an option is omitted, the program uses the default value.

Synopsis:

    test [chkpnt mode] [cpu x] [cputime h:m:s] [data x:y] [dib x] [help] [maxerr n] [maxp n] [time h:m:s] [+|-verbose] [words n]

chkpnt mode
    Indicates whether restart files are to be generated. Restart files cannot be created unless output is directed to a disk file.
    mode is one of the following arguments:

    Argument    Description
    first       Generates a restart file for the first failure detected (default)
    all         Generates a restart file for each failure detected, including failures detected during error isolation
    none        Does not generate restart files

    The default generates a restart file for the first failure detected. For additional information, refer to the following: chkpnt(1), restart(1), chkpnt(2), and restart(2).

cpu x
    Selects CPU x. x can be a, b, c, d, e, f, g, or h. The default is cpu a.

cputime h:m:s
    Sets the test execution time in CPU time. The time is specified in hours (h), minutes (m), and seconds (s); minutes and seconds; or just seconds. Use colons as delimiters, as follows: h:m:s.

    Generally, actual execution time is within one second of the specified CPU time. If cputime is allowed to default (or is set to 0), the test uses the maxp value. However, if set to a value other than 0, cputime overrides maxp.

data x:y
    Stores data y (octal) at location x (octal) before the diagnostic is started; no length check is performed on x.

dib x
    Allows you to set the following diagnostic information block (DIB) options in the diagnostic:

    Option      Description
    modes x     Test mode
    secs x      Section select
    stop x      Stop condition bits
    option x    Refer to the on-line listings for additional DIB descriptions.

    In addition to the previously listed options, you can set the following options for olcmx only (refer to subsection 4.4.2, olcmx):

    Option      Description
    param x     Test control bits
    rep x       Repeat current pass
    r~ix        Number of parcels requested
    rislp x     Repeat isolation loop
    rnum x      Initial random number
    rpass x     Starting pass count (maxp n must be greater than rpass x)

    To determine the dib x settings, refer to the on-line diagnostic listings.

help
    Generates an on-line help display containing a synopsis and brief description of the command options and arguments.
    If help is entered with a test name, help information is written to stdout, and the test terminates.

maxerr n
    Sets the maximum number of errors. n is an octal value. The default for n is 1.

maxp n
    Sets the maximum number of passes. n is an octal value. The default for n is 0'1000. If cputime or time is set to a value other than 0, the specified option overrides maxp.

time h:m:s
    Sets the test execution time in elapsed (wall-clock) time. The time is specified in hours (h), minutes (m), and seconds (s); minutes and seconds; or just seconds. Use colons as delimiters, as follows: h:m:s.

    Generally, actual execution time is within one second of the specified elapsed time. If time is allowed to default (or is set to 0), the test uses the maxp value. However, if set to a value other than 0, time overrides maxp.

+|-verbose
    Enables (+verbose) or disables (-verbose) the generation of informational messages. The +verbose option causes a line of output to be generated after each pass of the diagnostic. The default is -verbose.

words n
    Allocates words for memory testing, and sets the DIB locations mfrst and mlast (the first and last memory addresses to be tested). n is an octal value. If words n is not entered, the diagnostic sets the test limits by default. Default values are test-dependent (refer to the on-line diagnostic listings).

4.3 TEST EXECUTION

To start a single diagnostic test, enter the following:

• test
• Monitor command options

To run a sequence of diagnostics, use the runsequence utility described in section 7, Utility Programs.

4.4 TEST-SPECIFIC REQUIREMENTS

This subsection provides information on test-specific requirements and command line entries. You must observe these requirements to ensure that the indicated test executes properly.
4.4.1 olaht

To run olaht† (on-line A register indexing test), you must set cput n (the DIB option to set the CPU type), as follows:

    Value           CPU Type
    10              CRAY X-MP/1
    20              CRAY X-MP/2
    40 (default)    CRAY Y-MP
                    CRAY X-MP EA (X-mode)
                    CRAY X-MP/4

To execute olaht on a CRAY X-MP/2 or CRAY X-MP/1 computer system, you must set cput as previously indicated (rather than allow it to default) or the test will generate invalid results.

To ensure that the test automatically selects the appropriate cput value, do the following:

1. Rename olaht to olaht1 or olaht2.

2. Create a shell script called olaht.

3. Enter the following information into the olaht shell script:

       olaht1 cput 10 $*

   or

       olaht2 cput 20 $*

4.4.2 olcmx

To run olcmx† (on-line random instruction and operand test) on a Cray computer system without compressed indexing capabilities, you must set param n (the DIB option to set the test control bits) so that the vector compressed indexing instructions are disabled. To disable these instructions, set param as follows:

    olcmx param 400000001

The default value for param is 1 (stop on isolated error). If you allow param to default, and the Cray computer system does not have compressed indexing capabilities, the test does not run properly.

† CRAY X-MP EA (X-mode) and CRAY X-MP computer systems only.

To ensure that the test automatically disables the vector compressed indexing instructions, do the following:

1. Rename olcmx to olcmxa.

2. Create a shell script called olcmx.

3. Enter the following information into the olcmx shell script:

       olcmxa param 400000001 $*

4.4.3 olibz

To run olibz† (on-line instruction buffer test), you must set cput (the DIB option to set the CPU type), as follows:

    Value           CPU Type
    10 (default)    CRAY X-MP/1
    20              CRAY X-MP/2
    40              CRAY X-MP EA (X-mode)
                    CRAY X-MP/4

The default value for cput is 10, indicating a CRAY X-MP/1 computer system.
If you allow cput to default, and you attempt to run olibz on a mainframe other than the CRAY X-MP/1, the test executes but generates invalid error information. Therefore, ensure that the appropriate cput value is set.

To ensure that the test automatically selects the appropriate cput value, do the following:

1.  Rename olibz to olibz4 or olibz2.

2.  Create a shell script called olibz.

3.  Enter the following information into the olibz shell script:

          olibz4 cput 40 $*

     or

          olibz2 cput 20 $*

4.5  TEST TERMINATION

A test stops under the following conditions:

•  The test successfully completes the maximum number of passes (maxp n).

•  The test reaches the specified CPU time (cputime h:m:s) or elapsed (wall-clock) time (time h:m:s).

•  The test detects the maximum number of errors (maxerr n). If maxerr is set to a value greater than 1, stop (the DIB option to set stop condition bits) must be set to 0 (continue on error). Error reports are automatically sent to stdout (standard output device), but they can be redirected to an error file.

•  The test detects an error and stop is set to 1 (stop on error).

•  The help option is entered with a test name, help information is written to stdout, and the test terminates.

•  The monitor or test detects an error in a command line entry and writes a message to stderr (standard error device). Only the first error detected is reported.

4.6  TEST EXAMPLES

The following example executes olvrx with two DIB options set: secs 3 executes test sections 0 and 1; stop 0 directs the program to continue on error. To stop a test that is running with continue on error, enter the kill(1) command to terminate test execution.

Example:

     olvrx secs 3 stop 0

The following example executes olvrx with two DIB options set: secs 3 executes test sections 0 and 1; data 205:77 stores the value O'77 at location O'205.
Example:

     olvrx secs 3 data 205:77

The following example executes olvrx with one DIB option: secs 1 executes test section 0.

Example:

     olvrx secs 1

The following example executes test in CPU c, sets the maximum error limit to 3, and redirects the output to test.logc.

Example:

     test cpu c maxerr 3 > test.logc

The following example displays test results from test.logc one page at a time (press the RETURN key to display the next page).

Example:

     pg test.logc

The following example executes olcmx in CPU b for 500,000 passes, starting at pass 500,000. Output is redirected to olcmx.log. The nohup(1) command allows the program to continue executing after you log off the system. You can later log on to check the test's progress. The ampersand (&) causes the entire command to execute in the background, so that another prompt is immediately displayed and you can continue to use the system.

Example:

     nohup olcmx cpu b maxp 1000000 rpass 500000 > olcmx.log &

The following example shows the help information that is displayed if help is entered with a test name.

Example:

     olaht help

Help display:

     olaht help
     olaht [help] [chkpnt mode] [cpu x] [cputime h:m:s] [data x:y]
           [maxerr n] [maxp n] [time h:m:s] [+|-verbose] [words n] [dib x]
      chkpnt mode   - Checkpoint mode: none, first, or all. (Default: first)
      cpu x         - Selects CPU x. (Default: a)
      cputime h:m:s - Set amount of CPU time to execute.
      data x:y      - Stores data y at diagnostic location x before the
                      diagnostic is started.
      maxerr n      - Sets maximum number of errors. (Default: 1)
      maxp n        - Sets maximum number of passes. (Default: O'1000)
      time h:m:s    - Set amount of wall clock time to execute.
      +|-verbose    - Send (+verbose)/do not send (-verbose) informational
                      messages to output. (Default: -verbose)
      words n       - Allocates x words for Central Memory testing.
                      MFRST (sta) and MLAST (lim) are set with the
                      appropriate values.
      dib x         - Sets the DIB location to x.
                      Refer to the individual test to determine which
                      DIBs are available for the test.
      NOTE: Actual results of setting a DIB location are test-dependent.

The following example shows the output that is displayed when the test is run with all default values.

Example:

     olsr3

Output:

     olsr3
     olsr3: started running in cpu A on Thu Dec 17 09:10:05 1987
     olsr3 reached maximum pass limit with 1000 passes and 0 errors on
     Thu Dec 17 09:10:05 1987

The following example shows the output that is displayed if +verbose is specified and maxp reaches 10. The pass count is displayed in octal, so pass 10 follows pass 7.

Example:

     olsr3 +verbose maxp 10

Output:

     olsr3 +verbose maxp 10
     olsr3: started running in cpu A on Thu Dec 17 09:10:48 1987
     olsr3: pass = 1, error = 0 Thu Dec 17 09:10:48 1987
     olsr3: pass = 2, error = 0 Thu Dec 17 09:10:48 1987
     olsr3: pass = 3, error = 0 Thu Dec 17 09:10:48 1987
     olsr3: pass = 4, error = 0 Thu Dec 17 09:10:48 1987
     olsr3: pass = 5, error = 0 Thu Dec 17 09:10:48 1987
     olsr3: pass = 6, error = 0 Thu Dec 17 09:10:48 1987
     olsr3: pass = 7, error = 0 Thu Dec 17 09:10:48 1987
     olsr3: pass = 10, error = 0 Thu Dec 17 09:10:48 1987
     olsr3 reached maximum pass limit with 10 passes and 0 errors on
     Thu Dec 17 09:10:48 1987

The following example shows the output that is displayed if olsr3 is run for 2 minutes (CPU time) in CPU c only.

Example:

     olsr3 cpu c cputime 2:00

Output:

     olsr3 cpu c cputime 2:00
     olsr3: started running in cpu C on Fri Dec 4 09:11:45 1987
     olsr3 reached maximum cputime limit with 1114656 passes and 0 errors on
     Fri Dec 4 09:13:49 1987

The following example shows the output that is displayed if maxerr reaches 1 (default).
Example:

     oltrb

Output:

     oltrb
     oltrb started running in cpu A at Wed Jan 6 15:30:34 1988
     oltrb: pass = 0, error = 1 Wed Jan 6 15:30:34 1988
     oltrb: restart file written to A55663-oltrb

     [DIB dump: one line per entry, giving the DIB name, its absolute address, the word value in octal, and the four parcel values. Entries: NAME, REV, DATE, MODES, MTRT, SECS, PASS, STOP, ERROR, ERA, ACT, EXP, DIF, CF, IBUF through IBUF + 0077, OBUF, SAVAO through SAVAO + 0007, SAVBR, SAVSO through SAVSO + 0007, SAVVL, SAVVM, and SAVTR through SAVTR + 0077.]

     The first address (FADD) of the diagnostic is 40a
     oltrb reached maximum error limit with 0 passes and 1 errors on
     Wed Jan 6 15:30:35 1988

4.7  TEST MESSAGES

Each test sends messages to stdout (standard output device) by default or to a file when UNICOS output redirection is indicated on the command line.

When a test detects an error, the following information is displayed:

•  DIBs
•  Absolute addresses of the DIBs
•  DIB values in word and parcel formats

The following error messages are sent to stderr (standard error device):

test: Illegal argument x.

     Argument x is invalid. Correct and rerun.

test: Error selecting cpu x.

     CPU x is unavailable. Contact your CRI representative.

test: Error allocating memory: number of words = n, error = 0.

     The test cannot allocate memory.
     Decrease the amount of memory requested by the words n option, or regenerate the diagnostic, and rerun. If the problem persists, contact your CRI representative.

test: Cannot write restart file. errno = n.

     The test cannot write a restart file. Contact your CRI representative.

4.8  DIAGNOSTIC MEMORY IMAGE FOR MAINTENANCE TESTS

Figure 4-1 shows a sample memory image of a diagnostic that is executing. The diagnostic test is relocated to start at the first address (FADD) of the test. If the diagnostic fails, FADD must be subtracted from the error address to obtain the address relative to the start of the test. After an error occurs, FADD is displayed in the following format:

     The first address (FADD) of the diagnostic is xa

The value x is determined by the length of the on-line monitor program.

The on-line maintenance tests call the following monitor routines:

     Routine     Description

     UERROR()    The test calls the UERROR() routine when an error is detected. The monitor dumps the DIB and examines a DIB macro at the end of the diagnostic for memory areas to be dumped.

     UPASS()     The test calls the UPASS() routine on each successful pass.

If an error is detected, the following occurs:

1.  The test does the following:

     •  Creates a restart file
     •  Saves the CPU registers using the SAVEREG macro, defined in the common deck OLMAC
     •  Calls the monitor error function routine, UERROR()
     •  Restores the CPU registers using the RESTORE macro, defined in the common deck OLMAC

     For additional information on the restart file, refer to the following system calls: chkpnt(2) and restart(2).

     The SAVEREG and RESTORE code is assembled into the on-line maintenance test, but the memory required to save the registers is allocated to the following monitor arrays: SAVAO, SAVBR, SAVSO, SAVTR, SAVVO, SAVVL, and SAVVM.

2.  The system produces a core dump of the diagnostic test area.
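The FADD subtraction described in this subsection is plain octal arithmetic, and a minimal shell sketch of it follows. The function name rel_addr is hypothetical, the sample values are illustrative only, and treating the trailing letter of the FADD display (the "a" in "40a") as a parcel designator that can be dropped is an assumption.

```shell
# rel_addr ERROR_ADDR FADD
# Hypothetical helper: prints ERROR_ADDR minus FADD; both values are octal.
rel_addr() {
    err=$1
    fadd=${2%[a-d]}    # assumption: a trailing a-d is a parcel letter; drop it
    # A leading 0 makes the shell read each operand as octal.
    printf '%o\n' "$(( 0$err - 0$fadd ))"
}

rel_addr 1576 40a      # prints 1536
```

The result is the address relative to the start of the diagnostic, which is the form to compare against the on-line diagnostic listing.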
     Location Names            Memory Image

     Base Address
     UERROR()                  Monitor program (olmon)
     UPASS()
     SAVAO, SAVBR, SAVSO,      Data area for storing register data
     SAVTR, SAVVO, SAVVL,
     SAVVM
     FADD, START()             Diagnostic program
     DIB
     SAVEREG
     RESTORE
     mfrst                     Memory allocated for a memory test
     mlast
                               C library routines
                               Unused area
                               System stack
     Limit Address

     Figure 4-1.  Sample Diagnostic Memory Image

5.  DOWN-DEVICE PROGRAMS

The down-device programs provide on-line CPU and peripheral testing. The hardware is removed from normal system operations and can be accessed and exercised only by the down-device programs. This section describes the following programs:

     Program     Description

     donut       On-line disk maintenance program
     oldmon†     Down CPU monitor
     unitap      On-line magnetic tape test

5.1  donut

The donut program is an interactive, menu-driven diagnostic program for testing and maintaining DD-10, DD-19, DD-29, DD-39, DD-40, and DD-49 disk drives. The donut program cannot be run off-line.

The donut program can be used to perform the following functions:

•  Buffer testing†
•  Error correction code (ECC) testing††
•  Flaw table maintenance
•  Formatting
•  ID verification††
•  Surface analysis

The subsections that follow describe the following topics:

•  Disk selection
•  Disk mode (system mode and maintenance mode)
•  Warnings and messages
•  Menu displays
•  Program execution
•  Menus
•  Program execution examples

†   Multiple-CPU Cray computer systems only
††  Not available for DD-19 or DD-29 disk drives

5.1.1  DISK SELECTION

The donut program can test only one disk at a time. However, multiple copies of donut can be executed simultaneously to test different disk drives. To access a disk, donut uses the same logical device name as that assigned during system configuration.
To select the disk to be exercised, define the logical device name by doing one of the following:

•  Enter dev from the Main menu (refer to subsection 5.1.6.3, Commands to Set Arguments)

•  Enter a from the Parameter menu (refer to subsection 5.1.13, Parameter Menu)

The donut program attempts to open the specified device and retrieve its iobuf information to determine whether the specified logical device name is valid.

If the logical device name is valid, donut determines the device type and adjusts the other arguments accordingly. As a precaution, donut sets the initial cylinder argument to point to a scratch cylinder. donut reads and verifies the disk flaw tables for the device, and displays an appropriate message if any abnormalities are detected.

If the logical device name is invalid, donut does not accept disk requests and the device argument is set as follows:

     * none *

Reenter a valid logical device name and continue.

5.1.2  DISK MODE

A disk in the system configuration can be in one of the following modes:

     Mode           Description

     System         UNICOS system routines and all user jobs can access the disk

     Maintenance    Only UNICOS system routines and donut can access the disk

The current mode is displayed under the MODE heading in the argument banner of various menus (refer to subsection 5.1.4, Menu Displays).

To change the mode, do the following:

1.  Select the mode by doing one of the following:

     •  Enter mode from the Main menu (refer to subsection 5.1.6.3, Commands to Set Arguments)

     •  Enter t from the Parameter menu (refer to subsection 5.1.13, Parameter Menu)

///////////////////////////////////////////////////////

                        WARNING

The donut program can write to any of the cylinders on a disk. Therefore, device labels and flaw tables are vulnerable to accidental destruction.
It is recommended that writes and surface analysis not be performed on the CE cylinders that contain the flaw tables (typically, cylinder 0 and the second-to-last cylinder on a device) unless absolutely necessary, and then only if backup procedures are used. Before writing to a disk, donut displays a message that flaw table information will be destroyed on those cylinders that contain information.

///////////////////////////////////////////////////////

5.1.2.1  System mode

In system mode, donut and other user jobs have equal access to the disk. The following operations are supported:

•  donut can read from and write to CE cylinders only
•  donut can perform ID verification (except on DD-19s and DD-29s)
•  Flaw tables can be updated

5.1.2.2  Maintenance mode

In maintenance mode, only UNICOS and donut requests can access the disk. All donut functions are valid.

If a maintenance mode function is requested while the disk is in system mode, the function aborts and donut displays the following message:

     *** DIAGNOSTIC TASK ERROR CODE 1 - Device not in Maintenance mode ***

5.1.3  WARNINGS AND MESSAGES

The donut program displays various warnings and messages. For example, the following warning is displayed if you are about to overwrite the User Flaw Table in donut's area of central memory:

     WARNING
     USER flaw table in memory will be altered.
     Enter go to continue or enter anything else to abort.

If an invalid command is entered, an error message is displayed and the menu from which the command was entered is redisplayed.

If an invalid argument is entered, an informative message is displayed. After some of the informative messages, the following prompt is displayed:

     ---> Enter anything to continue <---

Some of the donut messages require a response. For example, the following message requires a response to ensure that read, write, and surface analysis operations are performed on only selected sectors.
     L I M I T S   C H E C K
     Check CYLINDER, HEAD and SECTOR limits.
     Enter go to initiate.
     Enter any other character to abort.

5.1.4  MENU DISPLAYS

At the top of various menus is the argument banner displaying the arguments used in the program. A sample argument banner is as follows:

     ==========================================================================
     = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    09:50:10 =
     = --------   ---------   -----   -------   ----   ----   ------           =
     = * none *   0 - 0       0 - 0   0 - 0     0      none   ------           =
     ==========================================================================

By default, arguments are displayed in decimal. The cylinder, head, and sector values must be entered in decimal unless otherwise indicated. To generate an octal display, enter oct from any of the menus (enter dec to return to a decimal display). If you generate an octal display, the following applies:

•  The argument banner displays the heading (OCTAL) to the left of the arguments

•  The cylinder, head, and sector information is entered and displayed in octal

5.1.5  PROGRAM EXECUTION

The donut program resides in the /ce/bin directory. To execute donut, enter the following:

     /ce/bin/donut

The initial donut screen display is as follows:

     W e l c o m e   t o   X - M P   U N I C O S   D O N U T

                      V e r s i o n   2.0

     ---> Enter anything to continue <---

To continue, press any key. The program displays the Main menu. From the Main menu, you can get to various other menus.

Menu commands are not case sensitive. They can be entered in uppercase or lowercase. In this document, the menus show commands in uppercase; however, the descriptions show them in lowercase and bold, according to UNICOS conventions.
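The relationship between the decimal and octal displays described in subsection 5.1.4 can be checked outside donut with the standard printf utility. The function name to_octal_range is hypothetical; the sketch only converts a decimal range to the octal form and does not reproduce any donut output.

```shell
# to_octal_range LOW HIGH
# Hypothetical helper: prints a decimal range (for example, cylinder
# limits) in octal.
to_octal_range() {
    printf '%o - %o\n' "$1" "$2"
}

to_octal_range 10 20     # prints 12 - 24
```

Arguments should be written without leading zeros, because printf treats a leading 0 as marking an octal value.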
The menu structure is as follows:

Main Menu
Command   Description

a         Displays disk information

b         Displays Buffer Utility menu

          Command   Description

          a         Displays Write Buffer menu
          b         Displays Read Buffer menu

e         Displays Error Utility menu

          Command   Description

          a         Displays Error Table menu

                    a   Adds the displayed error to the Found Flaw Table
                    b   Adds all errors to the Found Flaw Table
                    d   Deletes the displayed error record from the Error Table
                    e   Prints the error record to a file

          b         Displays Error Log menu

                    a   Adds top entry to the Found Flaw Table
                    b   Adds all entries to the Found Flaw Table
                    c   Prints the entire error log
                    e   Deletes all error log entries

f         Displays Formatting menu

          Command   Description

          b         Displays argument banner with warning. Enter go to format IDs with flaw handling.
          c         Displays argument banner with warning. Enter go to format IDs with no flaw handling.
          e         Displays Examine Data Buffer menu
          f         Verifies track IDs using the User Flaw Table
          g         Verifies track IDs without using the User Flaw Table
          z         Displays Parameter menu

s         Displays Surface Tests menu

          Command   Description

          a         Displays Write Data menu
          b         Displays Read Data and Compare menu
          c         Displays argument banner with warning. Enter go to perform a read exercise.
          d         Displays Surface Analysis menu
          e         Displays Examine Data Buffer menu
          f         Displays argument banner with warning. Enter go to execute a read absolute operation.
          g         Displays argument banner with warning. Enter go to execute a write current data buffer operation.
          z         Displays Parameter menu

Main Menu
Command   Description

t         Displays Flaw Table Utility menu

          Command   Description

          a         Displays Factory Flaw Table menu
          b         Displays User Flaw Table menu
          c         Displays System Flaw Table menu
          d         Displays Found Flaw Table menu

w         Executes the Error Correction Code test

z         Displays Parameter menu

          Command   Description

          a         Sets logical device name
          b         Sets cylinder limits
          c         Sets head limits
          d         Sets sector limits
          e         Sets diagnostic flags
          t         Toggles disk mode

q         Exits donut

In addition, there are various commands that can be entered from the Main menu or various other menus. These commands are described in the following subsections:

•  Subsection 5.1.6.1, Commands to Display Submenus
•  Subsection 5.1.6.2, Commands to Select Display Format
•  Subsection 5.1.6.3, Commands to Set Arguments
•  Subsection 5.1.6.4, Commands to Display the Data Buffer
•  Subsection 5.1.6.5, Commands to Display Flaw Table Menus
•  Subsection 5.1.6.6, Commands to Change the Data Buffer
•  Subsection 5.1.6.7, Commands to Change the Type of Write Command Used
•  Subsection 5.1.6.8, Commands to Display Commands List

5.1.6  MAIN MENU

Figure 5-1 shows donut's Main menu.

     ==========================================================================
     = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    09:50:10 =
     = --------   ---------   -----   -------   ----   ----   ------           =
     = * none *   0 - 0       0 - 0   0 - 0     0      none   ------           =
     ==========================================================================

          D I S K   O N L I N E   U T I L I T Y   (DONUT)

          A   Disk Information
          B   Buffer tests
          E   Review Errors
          F   Formatting and ID analysis
          S   Surface tests
          T   Flaw Table Utility
          W   Error Correction Test
          Z   Reset Parameters
          Q   Exit DONUT - (Quit)

          Enter command ==>

     Figure 5-1.  Main Menu for donut

5.1.6.1  Commands to display submenus

Table 5-1 lists the Main menu commands, which are used to do the following:

•  Display disk information (enter a from the Main menu or enter info from any menu)
•  Display various submenus
•  Execute the Error Correction Code test

     Table 5-1.  Main Menu Commands

     Command   Description

     a         Displays disk information
     b         Displays the Buffer Utility menu
     e         Displays the Error Utility menu
     f         Displays the Formatting menu
     s         Displays the Surface Tests menu
     t         Displays the Flaw Table Utility menu
     w         Executes the Error Correction Code test
     z         Displays the Parameter menu
     q         Quit; exits donut

5.1.6.2  Commands to select display format

The following commands for selecting the display format can be entered from any menu:

     Command   Description

     oct       Displays the cylinder, head, and sector information in octal
     dec       Displays the cylinder, head, and sector information in decimal (default)

5.1.6.3  Commands to set arguments

Table 5-2 lists the commands to set arguments from the Main menu or any of the subsequent menus (except the data pattern menus). Alternatively, you can set arguments by entering z (reset parameters) from the Main menu or various other menus.

     Table 5-2.  Commands to Set Arguments

     Command   Description

     cyl       Sets the cylinder range
     dev       Sets the logical device name
     flags     Sets diagnostic flags
     hed       Sets the head range
     mode      Sets the disk mode to system or maintenance
     sec       Sets the sector range

5.1.6.4  Commands to display the data buffer

The donut program keeps a record of the 1-track buffer used during the last disk operation. When donut writes data or IDs, the buffer contains data for the last track written. When donut reads data or IDs or performs surface analysis, the buffer contains data for the last track read. The buffer is reused during the next disk operation.
To display the data buffer from any menu, enter the following command:

     data

The data buffer can also be displayed by entering e from the Formatting menu (subsection 5.1.9) or the Surface Tests menu (subsection 5.1.10).

5.1.6.5  Commands to display flaw table menus

To display a flaw table without going through the Flaw Table Utility menu, enter one of the following commands from the Main menu or any of the flaw table menus, as appropriate:

     Command   Description

     fac       Factory Flaw Table menu
     fnd       Found Flaw Table menu
     sys       System Flaw Table menu
     usr       User Flaw Table menu

For additional information on flaw tables, refer to subsection 5.1.11, Flaw Table Utility Menus.

5.1.6.6  Commands to change the data buffer

To change the contents of the donut data buffer, the following commands can be used:

     Command   Description

     clr       Fills all sectors of the data buffer selected in the sectors section of the argument banner with 0's

     fill      Fills all sectors of the data buffer selected in the sectors section of the argument banner with 1's

5.1.6.7  Commands to change the type of write command used

To change the type of write command used during write operations to the disk, the following commands can be used. These commands need only be used for DD-40 type disks.

     Command   Description

     wrt       Sets the write command to perform a write (function code 4) during write operations. The write function is the default.

     wrim      Sets the write command to issue a write immediate (function code 22 octal) during write operations. This function code is valid only for DD-40 disks. It may be used when a controller releases control after all data is received but before the data is written to the disk, and an error occurs when the remaining data is finally written.
5.1.6.8  Commands to display commands list

Entering the help command displays a list of global commands that can be entered from any menu:

     Parameters changes:
          DEV   - Change DEVICE Parameter
          CYL   - Change CYLINDER Parameter Limits
          HED   - Change HEAD Parameter Limits
          SEC   - Change SECTOR Parameter Limits
          MODE  - Toggle Disk MODE (System/Maint.)

     Flaw tables:
          FAC   - Factory Flaw Table
          FND   - Found Flaw Table
          SYS   - System Flaw Table
          USR   - User Flaw Table

     Miscellaneous:
          CLR   - Clear Data Buffer To Zeros
          DATA  - Display Data Buffer
          FILL  - Fill Data Buffer With Ones
          HELP  - Display This Help Information
          INFO  - Display Disk Information
          MAIN  - Main Menu
          WRT   - Select Write Function (WRT=4)
          WRIM  - Select Write Immediate Function (WRTIM=22 oct)

5.1.7  BUFFER UTILITY MENU

Figure 5-2 shows the Buffer Utility menu (not applicable to DD-19 or DD-29 disk drives). Table 5-3 lists the Buffer Utility menu commands. These commands display the following submenus:

•  Write Buffer menu
•  Read Buffer menu

From the submenus, you can execute a write or read function in the controller's 16-parcel buffer. To exercise the basic Cray-to-disk communication path, put the disk in maintenance mode and execute a write followed by a read and compare (if the disk is in system mode, other jobs may be using the buffer and the test may not be effective).

     ==========================================================================
     = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    09:54:01 =
     = --------   ---------   -----   -------   ----   ----   ------           =
     = 49-2-24A   10 - 20     0 - 7   0 - 41    2      DD49   Maint.           =
     ==========================================================================

          B U F F E R   U T I L I T Y

          A   Write Buffer
          B   Read Buffer and compare
          R   Return

     Figure 5-2.  Buffer Utility Menu

     Table 5-3.  Buffer Utility Menu Commands

     Command   Description

     a         Displays the Write Buffer menu, from which you can select a data pattern to perform a 16-parcel write to the buffer

     b         Displays the Read Buffer menu, from which you can compare actual data to the selected data pattern

     r         Returns to previous menu

Figure 5-3 shows the Write Buffer menu. Figure 5-4 shows the Read Buffer menu. Table 5-4 lists the commands for the Write Buffer and Read Buffer menus.

     ==========================================================================
     = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    09:53:52 =
     = --------   ---------   -----   -------   ----   ----   ------           =
     = 49-2-24A   10 - 20     0 - 7   0 - 41    2      DD49   Maint.           =
     ==========================================================================

          W R I T E   B U F F E R

          0   All zeros                   1   All ones
          A   Addressing pattern          B   Bump
          C   Alternating 0,1 pattern     F   Fixed data
          E   Hole                        T   Peak shift
          S   Sequential data             R   Return
          Z   Reset Parameters

          Input the data pattern ==>

     Figure 5-3.  Write Buffer Menu

     ==========================================================================
     = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    09:54:07 =
     = --------   ---------   -----   -------   ----   ----   ------           =
     = 49-2-24A   10 - 20     0 - 7   0 - 41    2      DD49   Maint.           =
     ==========================================================================

          R E A D   B U F F E R

          0   All zeros                   1   All ones
          A   Addressing pattern          B   Bump
          C   Alternating 0,1 pattern     F   Fixed data
          E   Hole                        T   Peak shift
          S   Sequential data             R   Return
          Z   Reset Parameters

          Input the data pattern ==>

     Figure 5-4.  Read Buffer Menu

     Table 5-4.  Commands for the Write Buffer and Read Buffer Menus

     Command   Description

     0         All 0's

     1         All 1's

     a         Addressing pattern in a Cray word:

               Parcel   Value
               0        Cylinder number
               1        Head number
               2        Sector number
               3        Word number

     b         Bump. This is a repeating 4-word pattern:

               Word   Octal                     Hexadecimal
               0      0525252525242104252525    5555 5555 1111 5555
               1      0525250421052525252525    5555 1111 5555 5555
               2      0104212525252525252525    1111 5555 5555 5555
               3      0525252525252525210421    5555 5555 5555 1111

     c         Alternating 0's and 1's. This is a repeating 2-word pattern:

               Word   Octal                     Hexadecimal
               0      1252525252525252525252    AAAA AAAA AAAA AAAA
               1      0525252525252525252525    5555 5555 5555 5555

     e         Hole. This is a repeating 4-word pattern:

               Word   Octal                     Hexadecimal
               0      0525252525256735652525    5555 5555 7777 5555
               1      0525356735252525252525    5555 7777 5555 5555
               2      0735672525252525252525    7777 5555 5555 5555
               3      0525252525252525273567    5555 5555 5555 7777

     f         Fixed data. This is a 1-word, user-input pattern.

     s         Sequential data pattern:

               Word   Description
               0      Random number
               n      Word 0 + n

     t         Peak shift. This is a repeating 3-word pattern:

               Word   Octal                     Hexadecimal
               0      0631466735667356663146    6666 DDDD BBBB 6666
               1      1567355673554631556735    DDDD BBBB 6666 DDDD
               2      1356733146333567335673    BBBB 6666 DDDD BBBB

     z         Displays the Parameter menu

     r         Returns to previous menu

5.1.8  ERROR UTILITY MENU

Figure 5-5 shows the Error Utility menu. Table 5-5 lists the Error Utility menu commands. These commands display the following submenus:

•  Error Table menu
•  Error Log menu

     ==========================================================================
     = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    09:54:21 =
     = --------   ---------   -----   -------   ----   ----   ------           =
     = 49-2-24A   10 - 20     0 - 7   0 - 41    2      DD49   Maint.           =
     ==========================================================================

          E R R O R   U T I L I T Y

          A   Review details of the latest Error Table
          B   Review Error Log
          R   Return

     Figure 5-5.  Error Utility Menu

     Table 5-5.  Error Utility Menu Commands

     Command   Description

     a         Displays an error record and the Error Table menu
     b         Displays the error log and the Error Log menu
     r         Returns to previous menu

5.1.8.1  Error Table menu

When a disk request generates an error (such as a seek, read, or write error), the IOS sends donut an error record containing information such as function, address, status, and syndromes.
The donut program interprets these records and stores them in the Error Table. If no error is detected in the disk function, no error record is returned. The Error Table is valid only for the latest call in error and is overwritten during each disk function call.

Figure 5-6 shows an error record for a DD-39 read time-out error, and the Error Table menu. Table 5-6 lists the Error Table menu commands.

     ------------ E R R O R   R E C O R D   1 of 1   (octal data) ------------

     Major Err    Read          Sel Stat 4   000200      Expect LMA   000000
     Dev Type     000004        Err Correc   Is off      Actual LMA   000000
     IOP number   0001          Unit numbr   000000      Fin ctl st   007611
     Channel #    000032        Offset dir   None        Fin gen st   041600
     Expect CYL   001511        C3 Cor Msk   000000      Fin Dsk Fn   Unknown
     Fin Err St   Unrecov       C3 Cor Off   000000      Orig Recov   ON -set
     Expect HED   000001        C2 Cor Msk   000000      Finl Recov   Unknown
     Expect SEC   000017        C2 Cor Off   000000      C3 Syn upr   000000
     Disk Funct   LMA Rg1       C1 Cor Msk   000000      C3 Syn low   000000
     Retry Cnt    000000        C1 Cor Off   000000      C2 Syn upr   000000
     Orig Cntlr   007611        C0 Cor Msk   000000      C2 Syn low   000000
     Orig GenSt   041600        C0 Cor Off   000000      C1 Syn upr   000000
     Sel Stat 0   001600                                 C1 Syn low   000000
     Sel Stat 1   103200                                 C0 Syn upr   000000
     Sel Stat 2   000200                                 C0 Syn low   000000
     Sel Stat 3   070200

          A   Add THIS error to FOUND Flaw Table
          B   Add ALL errors to FOUND Flaw Table
          D   Delete THIS error record
          E   Erase ALL error records
          R   Return

          Enter Command or Error Number ==>

     Figure 5-6.  Error Table Menu

     Table 5-6.  Error Table Menu Commands

     Command   Description

     a         Adds the displayed error record to the Found Flaw Table

     b         Adds all error records in the Error Table to the Found Flaw Table

     d         Deletes the displayed error record from the Error Table

     e         Creates a file called ERRECRD in the current directory. The error record is saved in this file.

     r         Returns to previous menu

5.1.8.2  Error Log menu

The donut program maintains a log of all disk errors detected during a session.
For each error, the log contains an error summary with the time, device, address, function, and pattern. The Error Log is deleted if you exit or abort donut, or if you enter e from the Error Table menu.

Figure 5-7 shows a typical Error Log display and the Error Log menu. Table 5-7 lists the Error Log menu commands.

                         E R R O R   L O G                      LAST= 17
   -------------------------------------------------------------------------
   NUM  TIME      LOG DEV   CHANNEL  CYL  HEAD  SEC  DISK FUNC / ERROR / TEST
   -------------------------------------------------------------------------
    1   09:58:56  49-2-24A  B1       10   0     0    Read LMA Reg 1 Compare
    2   09:58:58  49-2-24A  B1       11   0     0    Read LMA Reg 1 Compare
    3   09:59:02  49-2-24A  B1       12   0     0    Read LMA Reg 1 Compare
    4   09:59:04  49-2-24A  B1       13   0     0    Read LMA Reg 1 Compare
    5   09:59:24  49-2-24A  B1       10   0     0    Read LMA Reg 1 Compare
    6   09:59:25  49-2-24A  B1       11   0     0    Read LMA Reg 1 Compare
    7   09:59:28  49-2-24A  B1       12   0     0    Read LMA Reg 1 Compare
    8   09:59:31  49-2-24A  B1       13   0     0    Read LMA Reg 1 Compare
    9   10:08:19  49-2-24A  B1       10   0     0    Read LMA Reg 1 Compare
   10   10:08:20  49-2-24A  B1       11   0     0    Read LMA Reg 1 Compare

      A   Add TOP entry to FOUND Flaw Table
      B   Add ALL entries to FOUND Flaw Table
      C   Print out entire log
      E   Erase ALL log entries
      R   Return

      Enter Command or Entry Number ==>

   Figure 5-7.  Error Log Menu

Table 5-7.  Error Log Menu Commands

   Command   Description

   a         Adds the top entry in the Error Log to the Found Flaw Table.
             Duplicate entries are skipped.
   b         Adds all entries in the Error Log to the Found Flaw Table.
             Duplicate entries are skipped.
   c         Creates a file called DOHULOG in the current directory. The
             Error Log is saved in this file.
   e         Deletes all Error Log entries.
   r         Returns to previous menu.


5.1.9  FORMATTING MENU

Figure 5-8 shows the Formatting menu. Table 5-8 lists the Formatting menu commands.
These commands display the following submenus:

   •  Examine Data Buffer menu
   •  ID Analysis Results menu
   •  Parameter menu

   ==========================================================================
   = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    09:57:18 =
   = 49-2-24A   10 - 20     0 - 7   0 - 41    2      DD49   Maint.           =
   ==========================================================================
                        F O R M A T T I N G

      B   Format with USER Flaw Table
      C   Format with NO flaw handling
      E   Examine Buffer
      F   Verify IDs with USER flaw table
      G   Verify IDs with NO flaw handling
      Z   Reset Parameters
      R   Return

      Enter Command ==>

   Figure 5-8.  Formatting Menu

Table 5-8.  Formatting Menu Commands

   Command   Description

   b         Uses the User Flaw Table to format IDs
   c         Formats IDs without using the User Flaw Table (donut assumes
             there are no flaws)
   e         Displays the Examine Data Buffer menu
   f         Reads track IDs and does ID verification based on the
             assumption that IDs were formatted with the User Flaw Table
             (DD-10, DD-39, DD-40, and DD-49 disk drives only)
   g         Reads track IDs and does ID verification based on the
             assumption that IDs were formatted without the User Flaw Table
             (DD-39, DD-40, and DD-49 disk drives only)
   z         Displays the Parameter menu, from which you can set the
             arguments in the argument banner
   r         Returns to previous menu

5.1.9.1  Logical address of the sector ID

Formatting is performed on a track basis, using spare sectors if applicable and the User Flaw Table if specified. Only DD-10, DD-39, DD-40, and DD-49 disks have a User Flaw Table, and only DD-39 and DD-49 disks have spare sectors. During formatting, the logical address is written into the sector ID field. For flawed sectors, a flawed ID is written into this field.
The formatting routine does the following:

   •  Uses the slip argument to calculate the logical address
   •  Determines whether the User Flaw Table is to be used

When the logical address is written into the sector ID field, the type of disk drive determines how the data is affected, as follows:

   Disk               Logical Address is Written to Sector ID

   DD-10/39/40/49s    Data in the sector is not affected when the logical
                      address is written to the ID field.

   DD-19/29s          The entire data area is corrupted when the logical
                      address is written to the ID field. donut does not
                      automatically write to the newly formatted sectors.
                      If a read is attempted following a formatting
                      operation, an unrecoverable error occurs. Therefore,
                      after completing a formatting operation, write data
                      before performing a read.

5.1.9.2  Position field of the sector ID (DD-10s and DD-40s only)

DD-40 disk drives can contain the following types of defects:

   Defect Type   Description

   Hideable      Contains a defect that resides in a 16-byte field called
                 the defect address, which is skipped during all disk
                 operations. The defect address is written to the position
                 field (POS) of the sector ID.

   Unhideable    Contains either a defect that spans more than one address
                 or multiple defects. These defects are not hidden because
                 only one defect address is skipped during all disk
                 operations. The sector ID is set to all 1's to indicate
                 that the sector is unavailable.

If a sector has no defects, the sector ID is formatted with the position field set to D'511 (all 1's).

5.1.9.3  Examine Data Buffer menu

Figure 5-9 shows the Examine Data Buffer menu. Table 5-9 lists the Examine Data Buffer menu commands.

               E X A M I N E   D A T A   B U F F E R

      nn[,WPH]   Display sector nn (Word(8), Parcel(8) or Hex)
      A          Print out ALL sectors
      B,nn       Print out sector nn
      R          Return

      Input sector number or option ==>

   Figure 5-9.  Examine Data Buffer Menu
Table 5-9.  Examine Data Buffer Menu Commands

   Command    Description

   nn[,WPH]   Displays sector nn in octal words (W) or parcels (P), or in
              hexadecimal (H)
   a          Prints all sectors to file BUFFER in the current directory
   b,nn       Prints sector nn to file BUFFER in the current directory
   r          Returns to previous menu

5.1.9.4  ID Analysis menu (DD-10s, DD-39s, DD-40s, and DD-49s only)

ID analysis can be performed with or without the User Flaw Table (see commands f and g, respectively, in table 5-8). The ID analysis report contains the following field headings for both the expected and actual IDs:

   Heading   Description

   NUM       Entry number
   CYL       Cylinder number
   HED       Head number
   SEC       Sector number

The ID analysis report for DD-10s and DD-40s contains the following additional headings:

   Heading   Description

   POS       Position field (POS) of the sector ID (contains the defect
             address)
   SPIN      Spindle associated with the sector ID. Each DD-40 contains
             four spindles, each of which is associated with 12 sectors.
             For DD-10s, SPIN is always 0.

ID analysis (DD-39s/49s) - Figure 5-10 shows the ID Analysis menu for DD-39 and DD-49 disk drives. Table 5-10 shows the ID Analysis menu commands. The following example describes the results of an ID analysis that was performed using the User Flaw Table (enter f from the Formatting menu). To verify that IDs are being written correctly, the User Flaw Table is used to read the IDs of a track containing a flawed ID.
If all IDs match, a display such as the following is generated:

                    VERIFYING IDs
      On Cylinder = 10 at 09:58:55
      On Cylinder = 11 at 09:58:56
      On Cylinder = 12 at 09:59:01
      On Cylinder = 13 at 09:59:03
      On Cylinder = 14 at 09:59:05
      On Cylinder = 15 at 09:59:06
      On Cylinder = 16 at 09:59:06
      On Cylinder = 17 at 09:59:08
      On Cylinder = 18 at 09:59:08
      On Cylinder = 19 at 09:59:09
      On Cylinder = 20 at 09:59:09
      --------------------------------------
           All IDs checked were correct
      --------------------------------------
      ---> Enter anything to continue <---

If there are any unexpected IDs, such as a flawed ID or an invalid sector ID, the routine generates an ID analysis report and displays the report with the ID Analysis menu (refer to figure 5-10, ID Analysis Menu for DD-39 and DD-49 disk drives).

If an ID matches the expected value, MATCH is displayed in the RESULTS column; otherwise, MISMATCH is displayed. If a mismatch occurs, refer to the MISMATCH column to determine whether the error is in the ID's cylinder (C), head (H), or sector (S). An ID of -1 (O'77) represents a flawed ID. Generally, when one ID is in error, all subsequent IDs for the track are in error. To view specific IDs in the report, enter the desired entry number (NUM).

        V E R I F Y   I D   A N A L Y S I S   F O R   39-1-32A
                                                  15:21:55  06/23/87
            EXPECTED ID      ACTUAL ID
      NUM   CYL HED SEC      CYL HED SEC    RESULTS                  MISMATCH
      ---------------------------------------------------------------------
       1    841  0   0       841  0   0     M A T C H
       2    841  0   1       841  0   1     M A T C H
       3    841  0   2        -1 -1  -1     Uncharted flaw found     C H S
       4    841  0   3       841  0   2     M I S M A T C H --->         S
       5    841  0   4       841  0   3     M I S M A T C H --->         S
       6    841  0   5       841  0   4     M I S M A T C H --->         S
       7    841  0   6       841  0   5     M I S M A T C H --->         S
       8    841  0   7       841  0   6     M I S M A T C H --->         S
       9    841  0   8       841  0   7     M I S M A T C H --->         S
      10    841  0   9       841  0   8     M I S M A T C H --->         S

      A   Show all entry types
      B   Show mismatched entries: First= 1  Last= 72
      C   Print out all entries
      D   Print only mismatched entries
      R   Return

      Enter Command or Entry Number ==>

   Figure 5-10.  ID Analysis Menu for DD-39 and DD-49 Disk Drives
ID analysis (DD-40s) - Figure 5-11 shows the ID Analysis menu for DD-40 disk drives. Table 5-10 shows the ID Analysis menu commands. The following example describes the results of an ID analysis that was performed without using the User Flaw Table (enter g from the Formatting menu). The ID analysis report preceding the ID Analysis menu (figure 5-11) is for logical device 40-2-30A (command b, 'Show mismatched entries,' was entered). The results show that three mismatched entries were detected in the position (POS) field of the sector ID.

The SEC column in the ID analysis report shows the physical sector number. To calculate the logical sector number, do the following:

   1.  Multiply the spindle number (SPIN) by 12 (the number of sectors in each spindle).

   2.  Add the result from step 1 to the physical sector number.

For example, the ID analysis report in figure 5-11 shows physical sector 5 is associated with spindle 1. Calculate the logical sector number as follows:

   1.   1 * 12 = 12   (spindle number * number of sectors in the spindle)
   2.  12 +  5 = 17   (result from step 1 + physical sector number)

Logical sector 17 is the equivalent of physical sector 5 on spindle 1.

        V E R I F Y   I D   A N A L Y S I S   F O R   40-2-30A
                                                  15:16:44  05/04/88
            EXPECTED ID          ACTUAL ID
      NUM   CYL HED SEC POS      CYL HED SEC POS SPIN   RESULTS   MISMATCH
      ---------------------------------------------------------------------
      [three M I S M A T C H entries for cylinder 1063, each marked P in
       the MISMATCH column, followed by END OF DATA]

      A   Show all entry types
      B   Show mismatched entries: First= 4  Last= 170
      C   Print out all entries
      D   Print only mismatched entries
      R   Return

      Enter Command or Entry Number ==>

   Figure 5-11.  ID Analysis Menu for DD-40 Disk Drives

ID Analysis menu commands - Table 5-10 shows the ID Analysis menu commands.
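The logical sector calculation described earlier in this subsection for DD-40 ID analysis reports can be sketched in Python as follows. This is a minimal illustration only: the function name `logical_sector`, the constant name, and the assumption that spindles are numbered 0 through 3 are hypothetical and are not taken from donut.

```python
# Illustrative sketch only; these names are hypothetical and not part of donut.
SECTORS_PER_SPINDLE = 12  # each DD-40 spindle is associated with 12 sectors


def logical_sector(spin: int, physical_sector: int) -> int:
    """Return the logical sector for a physical sector on a given spindle."""
    # Assumption: the four DD-40 spindles are numbered 0 through 3.
    if not 0 <= spin <= 3:
        raise ValueError("spindle number must be in the range 0-3")
    if not 0 <= physical_sector < SECTORS_PER_SPINDLE:
        raise ValueError("physical sector must be in the range 0-11")
    # Step 1: spindle number * sectors per spindle.
    # Step 2: add the physical sector number.
    return spin * SECTORS_PER_SPINDLE + physical_sector


# Worked example from the text: physical sector 5 on spindle 1.
print(logical_sector(1, 5))  # prints 17
```

For a DD-10, where SPIN is always 0, the same function simply returns the physical sector number.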
Table 5-10.  ID Analysis Menu Commands

   Command   Description

   a         Displays all entries in the report
   b         Displays only the mismatched entries (which are not
             necessarily contiguous). The first and last mismatched entry
             numbers are displayed on the command line.
   c         Enters the entire report in a file called PRINTIDS, which is
             located in the current directory
   d         Enters only the mismatched entries in a file called PRINTIDS,
             which is located in the current directory
   r         Returns to previous menu

5.1.9.5  Parameter menu

Figure 5-23 shows the Parameter menu. Table 5-15 lists the Parameter menu commands (refer to subsection 5.1.13, Parameter Menu).


5.1.10  SURFACE TESTS MENU

Figure 5-12 shows the Surface Tests menu. Table 5-11 lists the Surface Tests menu commands. These commands display the following submenus:

   •  Write Data menu
   •  Read Data and Compare menu
   •  Surface Analysis menu
   •  Examine Data Buffer menu
   •  Parameter menu

Surface tests consist of the following operations: reads, writes, read absolute, and surface analysis. These operations are all performed within the cylinder, head, and sector ranges specified in the argument banner. The read absolute operation reads only from the lowest track specified.

   ==========================================================================
   = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    10:11:47 =
   = 49-2-24A   10 - 20     0 - 7   0 - 41    2      DD49   Maint.           =
   ==========================================================================
               S U R F A C E   T E S T   C H O I C E S

      A   Write data
      B   Read data and compare
      C   Read exercise
      D   Surface Analysis
      E   Examine Buffer
      F   Read Absolute (one track only)
      G   Write Current Data Buffer
      Z   Reset parameters
      R   Return

      Enter read/write option ==>

   Figure 5-12.  Surface Tests Menu
Table 5-11.  Surface Tests Menu Commands

   Command   Description

   a         Displays the Write Data menu, from which you can select a data
             pattern to perform a write operation
   b         Displays the Read Data and Compare menu, from which you can
             read the sectors listed in the argument banner and compare the
             data to the selected data pattern.
   c         Reads the sectors listed in the argument banner. This command
             can be used to verify the readability of a sector or group of
             sectors.
   d         Displays the Surface Analysis menu, from which you can do a
             write-read-compare on the sectors listed in the argument
             banner, using the selected surface analysis pattern.
   e         Displays the Examine Data Buffer menu
   f         Executes a read absolute operation, reading the specified
             sectors of the track with the lowest cylinder and head numbers
             in the argument banner. The read is performed without checking
             the sector's ID field. Therefore, the program reads the
             physical, rather than the logical, sector addresses.
   g         Writes the contents of the data buffer to the specified
             cylinder, head, and sector locations
   t         Reads the track headers of all the tracks in the cylinder with
             the lowest number in the argument banner. The information is
             stored in the data buffer. This menu command is displayed for
             DD-39s only.
   z         Displays the Parameter menu, from which you can set the
             arguments in the argument banner.
   r         Returns to previous menu

5.1.10.1  Write Data, Read Data and Compare, and Surface Analysis menus

Figure 5-13 shows the Write Data menu. Figure 5-14 shows the Read Data and Compare menu. Figure 5-15 shows the Surface Analysis menu. Table 5-12 lists the commands for these menus. Use the commands to select patterns to be used for various operations. For a write or a read and compare operation, select only one pattern. For a surface analysis operation, select one or more patterns.
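Table 5-12 lists each pattern word in both octal and hexadecimal. Because a Cray word is 64 bits (four 16-bit parcels), the octal form has 22 digits and the hexadecimal form has four 4-digit groups. The following Python sketch shows one way to cross-check the two notations when entering or verifying a pattern; the helper names are hypothetical and are not part of donut.

```python
# Illustrative sketch only; the helper names are hypothetical, not part of donut.

def hex_parcels_to_octal(parcels: str) -> str:
    """Convert four 16-bit hex parcels, such as '5555 5555 1111 5555',
    to the 22-digit octal form of the 64-bit word."""
    groups = parcels.split()
    if len(groups) != 4 or any(len(g) != 4 for g in groups):
        raise ValueError("expected four 4-digit hexadecimal parcels")
    word = int("".join(groups), 16)
    return format(word, "022o")  # zero-padded to 22 octal digits


def octal_to_hex_parcels(octal: str) -> str:
    """Convert an octal word to four space-separated hex parcels."""
    word = int(octal, 8)
    if word >= 1 << 64:
        raise ValueError("value does not fit in a 64-bit word")
    digits = format(word, "016X")  # zero-padded to 16 hex digits
    return " ".join(digits[i:i + 4] for i in range(0, 16, 4))


# Word 0 of the bump pattern and word 0 of the alternating pattern.
print(hex_parcels_to_octal("5555 5555 1111 5555"))     # 0525252525242104252525
print(octal_to_hex_parcels("1252525252525252525252"))  # AAAA AAAA AAAA AAAA
```

The conversion works on the whole 64-bit word rather than parcel by parcel, because 16-bit parcel boundaries do not fall on 3-bit octal digit boundaries.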
   ==========================================================================
   = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    15:21:55 =
   = 39-1-32A   841 - 841   0 - 4   0 - 23    1      DD39   System           =
   ==========================================================================
                        W R I T E   D A T A

      0   All zeros                      1   All ones
      A   Addressing pattern             B   Bump
      C   Alternating 0, 1 pattern       F   Fixed data
      E   Hole                           T   Peak shift
      G   Random                         R   Return
      S   Sequential data
      Z   Reset Parameters

      Input the data pattern ==>

   Figure 5-13.  Write Data Menu

   ==========================================================================
   = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    15:21:55 =
   = 39-1-32A   841 - 841   0 - 4   0 - 23    1      DD39   System           =
   ==========================================================================
               R E A D   B U F F E R   &   C O M P A R E

      0   All zeros                      1   All ones
      A   Addressing pattern             B   Bump
      C   Alternating 0, 1 pattern       F   Fixed data
      E   Hole                           T   Peak shift
      S   Sequential data                R   Return
      Z   Reset Parameters

      Input the data pattern ==>

   Figure 5-14.  Read Data and Compare Menu

   ==========================================================================
   = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    15:21:55 =
   = 39-1-32A   841 - 841   0 - 4   0 - 23    1      DD39   System           =
   ==========================================================================
                  S U R F A C E   A N A L Y S I S

      0   All zeros                      1   All ones
      A   Addressing pattern             B   Bump
      C   Alternating 0, 1 pattern       F   Fixed data
      D   All patterns but F             T   Peak shift
      E   Hole                           R   Return
      G   Random
      S   Sequential data
      Z   Reset Parameters

      Input the data pattern ==>

   Figure 5-15.  Surface Analysis Menu

Table 5-12.  Commands for the Write Data, Read Data and Compare, and Surface Analysis Menus

   Command   Description

   0         All 0's

   1         All 1's

   a         Addressing pattern in a Cray word:

                Parcel   Value
                0        Cylinder number
                1        Head number
                2        Sector number
                3        Word number

   b         Bump.
             This is a repeating 4-word pattern:

                Word   Octal                     Hexadecimal
                0      0525252525242104252525    5555 5555 1111 5555
                1      0525250421052525252525    5555 1111 5555 5555
                2      0104212525252525252525    1111 5555 5555 5555
                3      0525252525252525210421    5555 5555 5555 1111

   c         Alternating 0's and 1's.  This is a repeating 2-word pattern:

                Word   Octal                     Hexadecimal
                0      1252525252525252525252    AAAA AAAA AAAA AAAA
                1      0525252525252525252525    5555 5555 5555 5555

   d         All patterns except the fixed data pattern (F)

   e         Hole.  This is a repeating 4-word pattern:

                Word   Octal                     Hexadecimal
                0      0525252525256735652525    5555 5555 7777 5555
                1      0525253567352525252525    5555 7777 5555 5555
                2      0735672525252525252525    7777 5555 5555 5555
                3      0525252525252525273567    5555 5555 5555 7777

   f         Fixed data.  This is a 1-word, user-input pattern.

   g         Random data

   s         Sequential data pattern:

                Word   Description
                0      Random number
                n      Word 0 + n

   t         Peak shift.  This is a repeating 3-word pattern:

                Word   Octal                     Hexadecimal
                0      0631466735667356663146    6666 DDDD BBBB 6666
                1      1567355673554631556735    DDDD BBBB 6666 DDDD
                2      1356733146333567335673    BBBB 6666 DDDD BBBB

   z         Displays the Parameter menu

   r         Returns to previous menu

5.1.10.2  Examine Data Buffer menu

Figure 5-9 shows the Examine Data Buffer menu. Table 5-9 lists the Examine Data Buffer menu commands (refer to subsection 5.1.9.3, Examine Data Buffer Menu).

5.1.10.3  Parameter menu

Figure 5-23 shows the Parameter menu. Table 5-15 lists the Parameter menu commands (refer to subsection 5.1.13, Parameter Menu).


5.1.11  FLAW TABLE UTILITY MENUS

Figure 5-16 shows the Flaw Table Utility menu. Table 5-13 lists the Flaw Table Utility menu commands.
These commands display the following submenus:

   •  Factory Flaw Table
   •  User Flaw Table
   •  System Flaw Table
   •  Found Flaw Table

   ==========================================================================
   = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    10:11:53 =
   = 49-2-24A   10 - 20     0 - 7   0 - 41    2      DD49   Maint.           =
   ==========================================================================
               F L A W   T A B L E   U T I L I T Y

      A   FACTORY Flaw Table
      B   USER Flaw Table
      C   SYSTEM Flaw Table
      D   FOUND Flaw Table
      R   Return

      Choose a flaw table ==>

   Figure 5-16.  Flaw Table Utility Menu

Table 5-13.  Flaw Table Utility Menu Commands

   Command   Description

   a         Displays the Factory Flaw Table (not used for DD-19/29 disks).
             This table contains the factory flaws originally found on the
             disk.
   b         Displays the User Flaw Table (not used for DD-19/29 disks).
             This table contains the physical addresses of the flawed
             sectors.
   c         Displays the System Flaw Table (sometimes called the System
             EFT). This table contains the flaws used by UNICOS when
             creating the UNICOS Flaw Map.
   d         Displays the Found Flaw Table, which resides in donut. This
             table contains flaws detected during surface analysis.
   r         Returns to the previous menu

The flaw table utilities allow you to read, edit, write, or print the disk flaw tables. Flaw tables are edited only in donut's area of central memory, and donut does not automatically write the edited tables to disk; you must enter f (Write flaw table to disk) from either the User or System flaw table, as appropriate. Any function that requires flaw tables (such as formatting) uses the tables currently in donut's area of central memory (the tables must be read into donut before they can be referenced).
To display a flaw table without going through the Flaw Table Utility menu, enter one of the following commands from the Main menu or any of the flaw table menus, as appropriate:

   Command   Description

   FAC       Displays the Factory Flaw Table menu
   USR       Displays the User Flaw Table menu
   SYS       Displays the System Flaw Table menu
   FND       Displays the Found Flaw Table menu

For example, if your current screen display shows the User Flaw Table menu, you can enter sys to display the System Flaw Table menu. To return to the User Flaw Table menu, enter r.

The main heading in each flaw table menu contains the following information:

   •  Logical device name
   •  Flaw table name
   •  Number of entries

Below the main heading are the following field headings:

   Heading    Description

   NUM        Entry number
   CHANNEL    Channel number
   CYL        Cylinder number
   HEAD       Head number
   SEC        Sector number
   USER       User-input-flaw bit

The User and Found Flaw Tables for DD-10s and DD-40s contain the following additional headings (and no channel number heading):

   Heading    Description

   U/H        Hideable/unhideable defects. For additional information,
              refer to subsection 5.1.9.2, Position Field of the Sector ID
              (DD-10s and DD-40s only).
   POSITION   Position field (POS) of the sector ID. The POS field contains
              the defect address.

In the System Flaw Table, the field heading contains a contiguous (CONTIG) number, which is always a value of 1, instead of a channel number and no USER bit heading; however, this field is not used under UNICOS.

Each flaw table display lists up to 18 flaws, two per line. From any of the flaw tables, you can do the following:

   •  Enter a menu command to perform a specific function

   •  Enter the number of the first flaw that you want to appear in a display of any contiguous group of flaws

   •  Enter + (plus) or - (minus) to scroll forward or backward, respectively

For additional flaw information, refer to the Disk Systems Hardware Reference Manual, CRI publication HR-0077.
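The display rules above (up to 18 flaws per screen, two per line, with + and - for scrolling) can be modeled with the following Python sketch. It is an illustration under stated assumptions, not donut's implementation: the function names are hypothetical, the column layout (nine entries down the left, the next nine down the right) follows the sample screen in figure 5-17, and the behavior at the first and last screens is assumed.

```python
# Illustrative sketch only; names and end-of-table behavior are assumptions,
# not taken from donut itself.
FLAWS_PER_SCREEN = 18                     # from the text
ROWS_PER_SCREEN = FLAWS_PER_SCREEN // 2   # two flaws per line


def screen_rows(first, last):
    """Return (left, right) entry-number pairs for the screen that starts
    at entry `first`; `last` is the number of the final table entry.
    A missing right-hand entry is reported as None."""
    rows = []
    for i in range(ROWS_PER_SCREEN):
        left = first + i
        if left > last:
            break
        right = left + ROWS_PER_SCREEN
        rows.append((left, right if right <= last else None))
    return rows


def scroll(first, last, key):
    """Return the first entry number shown after '+' or '-' is entered."""
    if key == "+":
        nxt = first + FLAWS_PER_SCREEN
        return nxt if nxt <= last else first   # assumed: stay on last screen
    if key == "-":
        return max(1, first - FLAWS_PER_SCREEN)
    raise ValueError("expected '+' or '-'")


# First screen of a table whose last entry is number 249.
for left, right in screen_rows(1, 249):
    print(left, right)
```

Entering a flaw number corresponds to calling `screen_rows` with that number as `first`.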
The flaw tables are shown in the following figures:

   Figure   Title

   5-17     Factory Flaw Table Menu
   5-18     User Flaw Table Menu for DD-39 and DD-49 Disk Drives
   5-19     User Flaw Table Menu for DD-10 and DD-40 Disk Drives
   5-20     System Flaw Table Menu
   5-21     Found Flaw Table Menu for DD-19/29/39/49 Disk Drives
   5-22     Found Flaw Table Menu for DD-10 and DD-40 Disk Drives

Table 5-14 shows the commands for the flaw table menus. These commands apply to all of the flaw tables unless otherwise indicated.

      49-2-24A       F A C T O R Y   F L A W   T A B L E       LAST= 249

      NUM  CHANNEL  CYL  HEAD  SEC  USER     NUM  CHANNEL  CYL  HEAD  SEC  USER
      --------------------------------------------------------------------------
      [flaw entries 1 through 18, two per line]

      B   - Read flaw table from disk
      C   - Check flaw table validity
      E   - Erase flaw table from memory
      V   - Print out flaw table
      X n - Start display at cylinder n
      R   - Return

      Enter Command or Flaw Number ==>

   Figure 5-17.  Factory Flaw Table Menu
      49-2-24A          U S E R   F L A W   T A B L E          LAST= 249

      NUM  CHANNEL  CYL  HEAD  SEC  USER     NUM  CHANNEL  CYL  HEAD  SEC  USER
      --------------------------------------------------------------------------
      [flaw entries 1 through 18, two per line]

      A   - Add a flaw                       B   - Read flaw table from disk
      C   - Check flaw table validity        D   - Delete a flaw
      E   - Erase flaw table from memory     F   - Write flaw table to disk
      G   - Merge FACTORY flaws into USER    V   - Print out flaw table
      X n - Start display at cylinder n      R   - Return

      Enter Command or Flaw Number ==>

   Figure 5-18.  User Flaw Table Menu for DD-39 and DD-49 Disk Drives

      40-1-36A     U S E R   F L A W   T A B L E     HIDEABLE = 425  LAST=1165

      NUM  CYL HEAD SEC USER U/H POSITION    NUM  CYL HEAD SEC USER U/H POSITION
      --------------------------------------------------------------------------
      [flaw entries 418 through 435, two per line]

      A   - Add a flaw                       B   - Read flaw table from disk
      C   - Check flaw table validity        D   - Delete a flaw
      E   - Erase flaw table from memory     F   - Write flaw table to disk
      V   - Print out flaw table             X n - Display unhideables at CYL n
      Y n - Display hideables at CYL n       R   - Return

      Enter Command or Flaw Number ==>

   Figure 5-19.  User Flaw Table Menu for DD-10 and DD-40 Disk Drives

      49-2-24A        S Y S T E M   F L A W   T A B L E        LAST= 1

      NUM  CONTIG  CYL  HEAD  SEC            NUM  CONTIG  CYL  HEAD  SEC
      --------------------------------------------------------------------------
       1   1       495  7     41

      A   - Add a flaw                       B   - Read flaw table from disk
      C   - Check flaw table validity        D   - Delete a flaw
      E   - Erase flaw table from memory     F   - Write flaw table to disk
      G   - Make SYSTEM table from FACTORY   H   - Make SYSTEM table from USER
      V   - Print out flaw table             X n - Start display at cylinder n
      R   - Return

      Enter Command or Flaw Number ==>

   Figure 5-20.  System Flaw Table Menu
      49-2-24A         F O U N D   F L A W   T A B L E

      NUM  CHANNEL  CYL  HEAD  SEC  USER     NUM  CHANNEL  CYL  HEAD  SEC  USER
      --------------------------------------------------------------------------
      [found flaw entries, two per line]

      A   - Add a flaw                       C   - Check flaw table validity
      D   - Delete a flaw                    E   - Erase flaw table from memory
      V   - Print out flaw table             G   - Merge FOUND flaws into USER
      X n - Start display at cylinder n            flaw table
                                             R   - Return

      Enter Command or Flaw Number ==>

   Figure 5-21.  Found Flaw Table Menu for DD-19/29/39/49 Disk Drives

      40-1-36A         F O U N D   F L A W   T A B L E

      NUM  CYL HEAD SEC USER U/H POSITION    NUM  CYL HEAD SEC USER U/H POSITION
      --------------------------------------------------------------------------
      [found flaw entries, two per line]

      A   - Add a flaw                       C   - Check flaw table validity
      D   - Delete a flaw                    E   - Erase flaw table from memory
      V   - Print out flaw table             G   - Merge FOUND flaws into USER
      X n - Start display at cylinder n            flaw table
                                             R   - Return

      Enter Command or Flaw Number ==>

   Figure 5-22.  Found Flaw Table Menu for DD-10 and DD-40 Disk Drives

Table 5-14.  Commands for the Flaw Table Menus

   Command   Description

   a         Adds a flaw; issues prompts for the flaw arguments and inserts
             valid flaws in their proper order in the flaw table. Flaws
             cannot be added to the Factory Flaw Table.

   b         Reads the flaw table from disk to central memory, after first
             deleting the table currently in central memory.

             •  System Flaw Table menu

                When the System Flaw Table is read from disk, the table is
                compared to the UNICOS Flaw Map and any mismatches are
                displayed on the screen.

   c         Verifies that the flaw table is in order, that no duplicate
             entries exist, that values are within a valid range, and that
             the table is terminated correctly. If a problem exists in any
             of these areas, a message is displayed indicating the first
             entry in error.

   d         Deletes a flaw; issues prompts for the entry number of the flaw
             to be deleted. The flaw is only removed from the table
             currently in central memory (does not affect the disk-resident
             table). Factory flaws cannot be deleted.
   e         Deletes flaw table from central memory (does not affect the
             disk-resident table)

   f         Writes flaw table from central memory to disk, overwriting the
             disk-resident table. Factory and Found flaw tables cannot be
             written to disk.

             •  System Flaw Table menu

                In addition to writing the table from central memory to
                disk, the UNICOS Flaw Map (used by UNICOS to define
                alternate sectors for flawed sectors) will be updated to
                reflect the new System Flaw Table.

   g         Merges flaws from one flaw table into another. The menu from
             which the g command is entered and (in some cases) the device
             type being exercised determine which flaw tables are merged.
             You can enter g from the following menus:

             •  Found Flaw Table menu

                For DD-39, DD-40, and DD-49 disk drives: Copies the Found
                Flaw Table entries into the User Flaw Table. Duplicate
                entries are skipped. Entries are added in their proper
                order.

                For DD-19 and DD-29 disk drives: Copies the Found Flaw
                Table entries into the System Flaw Table (this does not
                overwrite the current System Flaw Table)

             •  User Flaw Table menu

                Copies the Factory Flaw Table entries into the User Flaw
                Table. Duplicate entries are skipped. Entries are added in
                their proper order.

             •  System Flaw Table menu

                Creates a System Flaw Table from the Factory Flaw Table
                entries. The SLIP argument determines which entries are
                made in the System Flaw Table.

   h         Creates a System Flaw Table from the User Flaw Table entries
             (h is entered from the System Flaw Table menu only). The SLIP
             argument determines which entries are made in the System Flaw
             Table.

   v         Creates a file with the name of the flaw table (FACTORY, USER,
             SYSTEM, or FOUND) in the current directory.
   x n       Displays flaws starting at cylinder n. For DD-40s, the flaws
             displayed are unhideable defects.

   y n       Displays hideable defects starting at cylinder n (DD-40s only)

   +         Displays the next screen of flaws

   -         Displays the previous screen of flaws

   r         Returns to previous menu


5.1.12  ERROR CORRECTION CODE TEST

The Error Correction Code (ECC) test does the following:†

   1.  Writes a 512-word buffer of random data with 0's for ECC.

   2.  Reads the data, expecting an ECC error.

   3.  Writes the same data with standard ECC.

   4.  Reads the data, expecting no errors.

   5.  Compares the data read with that written in step 3.

   6.  Displays a message indicating whether the ECC test passed or failed. If the test failed, the message also indicates the word-in-error.

   †  The ECC test cannot be performed on DD-19 or DD-29 disk drives.

The ECC test uses the DISK and DEVICE arguments (displayed in the argument banner) and the following software CE cylinder numbers (instead of the numbers in the argument banner):

   Cylinder = n     Scratch cylinder; typically the last cylinder on the
                    device.
   Head     = 0
   Sector   = 0


5.1.13  PARAMETER MENU

Figure 5-23 shows the Parameter menu, from which you can define the logical device name and set the arguments (parameters) in the argument banner. Table 5-15 lists the Parameter menu commands.

   ==========================================================================
   = DEVICE     CYLINDERS   HEADS   SECTORS   SLIP   DISK   MODE    09:50:28 =
   = * none *   0 - 0       0 - 0   0 - 0     0      none   ------           =
   ==========================================================================
                        P A R A M E T E R S

      A   Logical Device
      B   Cylinder limits
      C   Head limits
      D   Sector limits
      E   Diagnostic flags (not displayed)
      T   Toggle disk mode (system/maintenance)
      R   Return

      Enter Command ==>

   Figure 5-23.  Parameter Menu

Table 5-15.  Parameter Menu Commands

   Command   Parameter   Description

   a         DEVICE      Sets the logical device name. You must respond to
                         the prompts.

   b         CYLINDERS   Sets the cylinder range. You must respond to the
                         prompts.
   c         HEADS       Sets the head range. You must respond to the
                         prompts.

   d         SECTORS     Sets the sector range. You must respond to the
                         prompts.

   e         FLAGS       Sets diagnostic flags related to IOS error handling
                         and read-ahead/write-behind operations. You can set
                         any combination of the following flags:

                         Flag   Description

                         a      Returns the error record to the diagnostic
                                error logger, diagerr
                         b      Disables error recovery. The IOS does not
                                attempt a retry.
                         c      Disables error reporting. The IOS does not
                                log errors in the error logger.
                         d      Disables read-ahead/write-behind operations

                         If no flags are set, all flags are enabled.

   t         MODE        Sets the disk mode to system or maintenance

   r                     Returns to previous menu


5.1.14  EXITING donut

To exit donut, enter q from the Main menu. The exit process does not change the disk mode or write any edited flaw tables to disk. (It is assumed that these operations are performed prior to exiting.) The final donut screen display is as follows:

      G o o d b y e   f r o m   D O N U T


5.1.15  PROGRAM EXAMPLES

This subsection contains various donut execution examples, all of which originate from the Main menu.

Example 1 shows how to enable maintenance mode for a DD-39 disk with a logical device name of 39-2-27A.

Example 1:

   1.  Enter z (reset parameters) from the Main menu.

   2.  Enter a (logical device) from the Parameter menu.

       Enter 39-2-27A for the logical device name.

   3.  Enter t (toggle disk mode) from the Parameter menu.

       Enter go to acknowledge the warning. The following message is displayed and remains on the screen until the disk is offloaded and in maintenance mode:

          Please wait while 39-2-27A is entering MAINTENANCE mode

   4.  Enter r to return to the Main menu.
Example 2 shows the procedure to do the following:

• Read the User Flaw Table from disk
• Add the following flaw to the table: CYLINDER=25, HEAD=2, SECTOR=19, all surfaces
• Write the modified User Flaw Table to the disk
• Print the User Flaw Table in octal format

Example 2:

1. Enter t (Flaw Table Utility) from the Main menu.
2. Enter b (USER Flaw Table) from the Flaw Table Utility menu.
3. Enter b (Read flaw table from disk) from the User Flaw Table menu.
   Enter go to acknowledge the warning.
4. Enter a (Add a flaw) from the User Flaw Table menu.
   Enter 25 for the cylinder number.
   Enter 2 for the head number.
   Enter 19 for the sector number.
   Enter a for all surfaces.
5. Enter f (Write flaw table to disk) from the User Flaw Table menu.
   Enter go to acknowledge the warning.
6. Enter v (Print out flaw table) from the User Flaw Table menu.
   Enter c for octal format.
   Enter r to return to the User Flaw Table menu.
7. Enter r to return to the Main menu.

Example 3 shows the procedure to do the following:

• Format the track of CYLINDER=25, HEAD=2 (using the User Flaw Table)
• Verify that the IDs were written correctly

Example 3:

1. Enter f (formatting and ID analysis) from the Main menu.
2. Enter z (reset parameters) from the Formatting menu.
3. Enter b (cylinder limits) from the Parameter menu.
   Enter 25 for the lower cylinder number.
   Enter 25 for the upper cylinder number.
4. Enter c (head limits) from the Parameter menu.
   Enter 2 for the lower head number.
   Enter 2 for the upper head number.
   Enter r to return to the Formatting menu.
5. Enter b (Format with USER Flaw Table) from the Formatting menu.
   Enter go after checking the formatting limits.
   After formatting, the IDs are checked. If all IDs match their expected values, a message to that effect is displayed with the following prompt:

      ---> Enter anything to continue <---

   If an ID error occurs, the ID Analysis Results menu is displayed.
   Check the results and/or obtain a printout.
   Enter r to return to the Formatting menu.
6. Enter r to return to the Main menu.

Example 4 shows how to perform surface analysis on cylinder 25, using the default patterns and executing the random pattern 50 times with a seed value of 6065.

Example 4:

1. Enter z (reset parameters) from the Main menu.
2. Enter b (cylinder limits) from the Parameter menu.
   Enter 25 for the lower cylinder number.
   Enter 25 for the upper cylinder number.
3. Enter c (head limits) from the Parameter menu.
   Enter a for all heads.
4. Enter d (sector limits) from the Parameter menu.
   Enter a for all sectors.
5. Enter r to return to the Main menu.
6. Enter s (surface tests) from the Main menu.
7. Enter d (surface analysis) from the Surface Tests menu.
   Enter d to execute all patterns except the fixed data pattern.
   Enter 50 for the number of random passes.
   Enter 6065 for the seed value.
   Enter go after checking the arguments.
   The display changes as the program analyzes each track. After all tracks are analyzed, the program displays a message indicating the number of flaws added to the Found Flaw Table. This signals the end of the surface analysis operation. Respond to the following prompt:

      ---> Enter anything to continue <---

   Enter r to return to the Surface Tests menu.
8. Enter r to return to the Main menu.

Example 5 shows the procedure to do the following:

• Read the User Flaw Table for the DD-49 disk with a logical device name of 49-1-24A.
• Add the following flaw to the User Flaw Table:

     Cylinder = 1507 (octal)
     Head     = 3
     Sector   = 17 (octal)
     Channel  = A2

• Generate a printout of the User Flaw Table (in octal).
• Write the User Flaw Table to disk.
• Generate the System Flaw Table from the User Flaw Table.
• Generate a printout of the System Flaw Table (in octal).
• Write the System Flaw Table to disk.
• Reformat Cylinder = 1507, Head = 3, using the User Flaw Table in central memory.

Example 5:

1. Enter oct (octal display) from the Main menu.
2. Enter dev from the Main menu to change the logical device name.
3. Enter 49-1-24A.
4. Enter usr (display the User Flaw Table) from the Main menu.
5. Enter b (read flaw table from disk) from the User Flaw Table menu.
   Enter go to acknowledge the warning.
6. Enter a (Add a flaw) from the User Flaw Table menu.
   Enter 1507 for the cylinder number.
   Enter 3 for the head number.
   Enter 17 for the sector number.
   Enter a2 for the channel number.
7. Enter v (Print out flaw table) from the User Flaw Table menu.
   Enter c for octal printout.
8. Enter f (Write flaw table to disk) from the User Flaw Table menu.
   Enter go to acknowledge the warning.
9. Enter sys (display the System Flaw Table) from the User Flaw Table menu.
10. Enter h (Make SYSTEM table from USER) from the System Flaw Table menu.
11. Enter v (Print out flaw table) from the System Flaw Table menu.
    Enter c for an octal printout.
12. Enter f (Write flaw table to disk) from the System Flaw Table menu.
    Enter go to acknowledge the warning.
13. Enter r to return to the Main menu.
14. Enter cyl (set cylinder range) from the Main menu.
    Enter 1507 for the lower cylinder number.
    Enter 1507 for the upper cylinder number.
15. Enter hed (set head range) from the Main menu.
    Enter 3 for the lower head number.
    Enter 3 for the upper head number.
16. Enter f (Formatting and ID analysis) from the Main menu.
17. Enter b (Format with USER Flaw Table) from the Formatting menu.
    Enter go after checking the argument limits.
    After formatting, the IDs are checked. If all IDs match their expected values, a message to that effect is displayed with the following prompt:

       ---> Enter anything to continue <---

    If an ID error occurs, the ID Analysis Results menu is displayed. Check the results and/or obtain a printout.
    Enter r to return to the Formatting menu.
18. Enter r to return to the Main menu.
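Example 5 works in octal display mode. When an octal flaw location has to be cross-checked against a decimal listing, the conversion is plain positional arithmetic. The short Python sketch below illustrates it; it is not part of donut, the helper name is hypothetical, and it assumes that the cylinder and sector values of the example are the ones given in octal.

```python
def octal_to_decimal(text):
    """Return the decimal value of an octal digit string such as '1507'."""
    return int(text, 8)


# Flaw location taken from Example 5 (assumed here to be octal values).
flaw = {"cylinder": "1507", "sector": "17"}

for field, value in flaw.items():
    print(f"{field}: {value} (octal) = {octal_to_decimal(value)} (decimal)")
```

For instance, 1507 in octal is 1*512 + 5*64 + 0*8 + 7 = 839 in decimal.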
Example 6 shows how to return the disk to system mode before exiting donut.

Example 6:

1. Enter z (reset parameters) from the Main menu.
2. Enter t (toggle disk mode) from the Parameter menu.
   (Alternatively, you can enter mode from the Main menu instead of steps 1 and 2 and proceed with step 3.)
3. Enter go to acknowledge the request.
   The following message is displayed and remains on the screen until the disk is in system mode:

      Please wait while 39-2-27A is entering SYSTEM mode

4. Enter r to return to the Main menu.
5. Enter q to exit donut.

5.2  oldmon

The oldmon† monitor is the down CPU monitor, which initiates, controls, and monitors the down CPU tests. These tests execute under oldmon in multiple-CPU environments only. For a list of the down CPU tests, refer to appendix A, On-line Diagnostic Programs. For information on the down CPU interface to UNICOS, refer to cpu(4D).

5.2.1  DOWN CPU TESTS

The down CPU tests are executed in a down CPU from an operational CPU. Down CPU tests cannot be executed in monitor mode; consequently, they cannot perform I/O operations. A CPU other than the down CPU initiates I/O activity, and all CPUs other than the down CPU are favored for external interrupts. If the down CPU receives interrupts, it redirects them to another CPU.
For additional information on interrupts and monitor mode, refer to the following manuals, as appropriate:

   CSM-0111-000   CRAY X-MP/1 System Programmer Reference Manual
   CSM-0110-000   CRAY X-MP/2 System Programmer Reference Manual
   CSM-0112-000   CRAY X-MP/4 System Programmer Reference Manual
   CSM-0400-000   CRAY Y-MP System Programmer Reference Manual

To execute in a down CPU, a program must meet the following requirements:

• Must be an absolute binary
• Must not require any operating system support (the program cannot allow screen output, keyboard input, disk reading, or disk writing)

The oldmon monitor does the following:†

• Downs the CPU
• Loads a down CPU test from a file into central memory
• Monitors and controls the execution of a down CPU test
• Loads central memory areas from files
• Allows an operator to modify the central memory image of a down CPU test
• Displays central memory areas in various data formats
• Writes central memory areas to files
• Dumps central memory areas in a variety of formats to files or to the expander printer
• Executes user-defined program loops

† Multiple-CPU Cray computer systems only

5.2.2  PROGRAM SYNOPSIS

The oldmon monitor resides in /ce/bin. Log on interactively at the system console or any other supported front-end station (refer to the appropriate front-end station reference manual).

Synopsis:

   oldmon [-d cpulist] [-q] [-u cpulist]

-d cpulist   Down CPUs immediately. cpulist is entered in the following format:

                n,n,...,n

             n is a value in one of the following ranges:

                0,1,2,...,n   or   a,b,c,...,x

             If allowed to default, no CPUs are downed.

-q           Exit oldmon after processing the command line entry. This command option should be entered with other options.

-u cpulist   Return CPUs to normal system operations. cpulist is entered in the following format:

                n,n,...,n

             n is a value in one of the following ranges:

                0,1,2,...,n   or   a,b,c,...,x

Table 5-16 lists the oldmon commands.
For additional information on these commands, refer to subsections 5.2.5.2 through 5.2.5.17.

Table 5-16. oldmon Commands

Command   Description
a         Appends a formatted central memory dump to a file
c         Specifies a new default CPU
d         Dumps a formatted central memory dump to a file
e         Enters a value at a specific address
f         Fills consecutive central memory locations
g         Starts a test in a CPU
h         Halts test execution in a down CPU
l         Loads a test into a CPU's central memory buffer
o         Sets test options
q         Exits oldmon
r         Redraws the display
s         Updates the current Exchange Package of the current CPU
u         Returns a down CPU to normal system operations
v         Views a formatted area of central memory
w         Writes an area of central memory to a binary file
z         Executes a command buffer containing oldmon commands

5.2.3  PROGRAM EXECUTION

When oldmon is started, it does the following:

1. Allocates an area of central memory to each CPU
2. Loads the test loop code into each CPU's memory area
3. Executes $HOME/.oldmonrc (a profile file containing any oldmon commands)
4. Displays the Main menu for oldmon (refer to figure 5-24)

   A/Dump Cpu Enter Fill Go Halt Load Opts Quit Redraw Stat Up View Write Xecute

Figure 5-24. Main Menu for oldmon

The following subsections describe program execution under oldmon:

• Down CPU tests (listed in appendix A, On-line Diagnostic Programs)
• Test loop code
• Environment variables

5.2.3.1  Down CPU tests

The down CPU tests reside in /ce/oldmon. Two types of down CPU tests run under oldmon: confidence tests and maintenance tests. The down CPU confidence tests are on-line confidence tests that have been converted to run under oldmon (off-line). The down CPU maintenance tests are taken from the off-line diagnostic release.†

The initial Exchange Package starts each test. The current Exchange Package allows a test to continue from the point at which it is interrupted.
For a list of the off-line diagnostics (down CPU tests) that run under oldmon, refer to appendix A, On-line Diagnostic Programs.

† The down CPU maintenance tests are deferred for CEA systems.

Modifications to the off-line diagnostic test base - The down CPU tests are derived from the off-line diagnostic release X3.0. Some of the off-line diagnostics require modifications before they can be executed in a down CPU test environment. A configuration file containing a list of oldmon commands is used to make the necessary modifications.

When oldmon is executed, it attempts to access the configuration file oldmon.cf. If oldmon.cf is found, oldmon uses the information in it to automatically configure a loaded diagnostic to execute in a down CPU environment; if oldmon.cf is not found, oldmon uses the default configuration file.

If oldmon.cf is not found, you can initialize it by entering y (yes) in response to the following prompt:

   Cannot find configuration file oldmon.cf, should I initialize it?
   Enter Yes or No (y/n)

If you enter n (no), oldmon does not initialize oldmon.cf.

Default configuration files - The default configuration files are used to make the necessary modifications to the off-line diagnostic tests, so that they can execute in a down CPU test environment.

The following is the default configuration file for a CRAY X-MP computer system.

# OLDMON configuration file for X-MP off-line diagnostics.
#
aht:    e k cput 40                    # Set CPU type, 20 for X-MP/2, 40 for X-MP/4
        e k mlast 7777                 # Set last address to be tested
        o l 7777                       # Set limit address
arb:    e 40a 005000                   # Change MTA I/O routine to return
        e 140 100000000000             # Set P in SEXP
        e 143 160000000000000          # Set mode bits in SEXP
        e 144 1000000000000000000000   # Set EMA bit in SEXP
arm:                                   # Nothing to configure
brb:    o l 1577                       # Set limit address
cmp:                                   # Nothing to configure
cmx:    e 26c 1000                     # Run CMX with cluster 0
        o l 44777                      # Set limit address
        e 1152c 001000                 # Change monitor req. exch to pass
gth:    e k mlast 33777                # Set last address to be tested
        o l 33777                      # Set limit address
ibz:    e k cput 40                    # Set CPU type, 20 for X-MP/2, 40 for X-MP/4
        e k secs 62                    # Can only run sections 1, 4 and 5
        o l 400777                     # Set limit address
        e k cpun 1                     # Set number of CPUs to 1
mit:    e k cput 40                    # Set CPU type, 20 for X-MP/2, 40 for X-MP/4
        e k mlast 7777                 # Set last address to be tested
        o l 7777                       # Set limit address
sfa:    e 205c 177777                  # Disable timing portion of test
sfm:    e 205e 177777                  # Disable timing portion of test
sfr:    e k secs 65432                 # Disable section 1 of test
sis:                                   # Nothing to configure
sr3:    o l 6277                       # Set limit address
sra:                                   # Nothing to configure
srb:                                   # Nothing to configure
srl: srs: stan: svc: trb: vpp: vra: vrl: vrn: vrr: vrs: vrx:
        e 40a 005000                   # Change MTA I/O routine to return
        e 140 100000000000             # Set P in SEXP
        e 143 160000000000000          # Set mode bits in SEXP
        e 144 1000000000000000000000   # Set EMA bit in SEXP
        e 40a 005000                   # Change MTA I/O routine to return
        e 140 100000000000             # Set P in SEXP
        e 143 160000000000000          # Set mode bits in SEXP
        e 144 1000000000000000000000   # Set EMA bit in SEXP
                                       # Nothing to configure
        o l 1577                       # Set limit address
        o l 1577                       # Set limit address
        e 205c 177777                  # Disable timing portion of test
        o l 2077                       # Set limit address
        e 205c 177777                  # Disable timing portion of test
        e 40a 005000                   # Change MTA I/O routine to return
        e 140 100000000000             # Set P in SEXP
        e 143 160000000000000          # Set mode bits in SEXP
        e 144 1000000000000000000000   # Set EMA bit in SEXP
        e 205b 177777                  # Disable timing portion of test
        o l 2777                       # Set limit address
        o l 4777                       # Set limit address
        o l 2077                       # Set limit address
        e 205d 177777                  # Disable timing portion of test
        o l 23777                      # Set limit address
olcrit: o l 60000                      # Set limit address
olcsvc: o l 50000                      # Set limit address
olcfpt: o l 40000                      # Set limit address
olibuf: o l 30000                      # Set limit address
olcm:   o l 40000                      # Set limit address

The following is the default configuration file for a CEA system.

# OLDMON configuration file for Y-MP off-line diagnostics.
#
olcrit: o l 60000                      # Set limit address
olcsvc: o l 50000                      # Set limit address
olcfpt: o l 40000                      # Set limit address
olibuf: o l 30000                      # Set limit address
olcm:   o l 40000                      # Set limit address

5.2.3.2  Test loop code

The test loop code can be used to build a failing loop. The initial Exchange Package resides at address O'140. Use either the Enter or Fill command to overwrite the PASS instructions (instruction 001000 at address O'500a) with the suspected failing code. The suspected failing code (at address O'500a) is executed with the test loop. The program then jumps to a check routine. The check routine does the following:

1. Compares the actual results in S1 to the expected results in S2
2. Increments the PASS and ERROR counts
3. Jumps to the suspected failing code sequence (at address O'500a) to loop

The current Exchange Package resides at address O'120. It allows the loop to continue from the point at which it is interrupted.

The test loop code is as follows:

START     =        *
          S0       0              ; Initialize values.
          PASS,    S0
          ERROR,   S0
          ACT,     S0
          EXP,     S0
          DIF,     S0
MAINLOOP  =        *
          J        TESTCODE       ; Jump to testcode provided by user.
*
*         Test code provided by user should return here.
*         The test code can use all registers.
*         It should return with s1 containing the actual value,
*         and s2 containing the expected value.
*
TESTRTN   =        *
          S0       S1\S2          ; Compare actual and expected.
          JSZ      CONTIN         ; No failure, increment pass count.
          ACT,     S1             ; Save actual result
          EXP,     S2             ; Save expected result
          DIF,     S0             ; Save difference
          S6       ERROR,
          S7       1
          S6       S6+S7          ; Increment error count
          ERROR,   S6
          S6       STOP,          ; Check stop flag
          S0       S6\S7
          JSN      CONTIN
          ERR                     ; Stop on error
CONTIN    =        *
          S6       PASS,
          S7       1
          S6       S6+S7          ; Increment pass count
          PASS,    S6
          J        MAINLOOP

The following gives the locations of items within the test code.

Location   CRAY X-MP Computer System   CEA System
START      200                         2000
TESTCODE   500                         2100
PASS       24                          1104
ERROR      23                          1103
ACT        21                          1101
EXP        22                          1102
DIF        20                          1100
STOP       26                          1010

Location TESTCODE contains a series of PASS instructions, followed by an unconditional jump to TESTRTN. You can create a test loop by overwriting the PASS instructions at TESTCODE with the suspected failing instructions. Before the jump to TESTRTN, the actual value should be in S1, and the expected value in S2.

5.2.3.3  Environment variables

The oldmon environment can be modified by setting certain environment variables. These variables are as follows:

Variable         Description
DMONPATH         List of directories to search when opening a file for reading. Separate directories with a colon. When oldmon tries to read a file, it first checks the current directory for that file. If the file is not found, oldmon checks $HOME/oldmon. If the file is not found, the program searches the directories specified by the DMONPATH environment variable. If the file is not found in any of those directories, the program searches the directory /ce/oldmon. If the file is still not found, oldmon issues an error message.
OLDMON_PRINTER   Command used to print output. The data to be printed is sent to stdin (the command's standard input). If this variable is not defined, ezlp(1) is used.
TERM             Terminal type being used. The terminal specified must be defined in the terminfo(4F) database.
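The file search order described for DMONPATH can be sketched as follows. This is an illustrative Python model only, not oldmon code: the function names are hypothetical, the default directory is read here as /ce/oldmon and the home subdirectory as $HOME/oldmon, and file existence is supplied by the caller so that the sketch does not touch a real file system.

```python
import posixpath


def search_directories(cwd, home, dmonpath, default_dir="/ce/oldmon"):
    """Return the directories in the order the text says they are searched."""
    dirs = [cwd, posixpath.join(home, "oldmon")]
    # DMONPATH is a colon-separated list; empty entries are skipped here.
    dirs.extend(d for d in dmonpath.split(":") if d)
    dirs.append(default_dir)
    return dirs


def find_file(name, dirs, exists):
    """Return the first candidate path that exists, or None if none does."""
    for directory in dirs:
        candidate = posixpath.join(directory, name)
        if exists(candidate):
            return candidate
    return None  # the monitor would issue an error message at this point


# Hypothetical setup: the test file is present only in the second DMONPATH entry.
dirs = search_directories("/tmp/work", "/u/ce", "/diag/a:/diag/b")
present = {"/diag/b/olcrit"}
print(dirs)
print(find_file("olcrit", dirs, present.__contains__))
```

Passing the existence check in as a function keeps the ordering logic separate from the file system, which makes the precedence (current directory, home subdirectory, DMONPATH entries, default directory) easy to verify.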
Set the environment variables before entering oldmon. If you are running under the Bourne shell, sh(1), enter the following:

   VAR=value
   export VAR

If you are running under the C shell, csh(1), enter the following:

   setenv VAR value

Examples:

To specify a VT100 terminal type while running under csh(1), enter the following at the csh(1) prompt:

   % setenv TERM vt100

To specify an oldmon search path while running under sh(1), enter the following at the sh(1) prompt:

   $ DMONPATH=search-path-one:search-path-two
   $ export DMONPATH

To specify a different print command while running under sh(1), enter the following at the sh(1) prompt:

   $ OLDMON_PRINTER='remsh remsys /usr/ucb/lpr'
   $ export OLDMON_PRINTER

In the preceding example, the single quotes are necessary because the command contains spaces. When oldmon prints output, it executes this command and sends the data to be printed to the standard input (stdin) of the command. In this example, the remsh command initiates a remote shell on the remsys system and executes the /usr/ucb/lpr command on the remote system. This allows oldmon output to be sent to a printer attached to a remote system. See remsh(1) for more information.

5.2.4  DISPLAY MODES

The following subsections describe the oldmon display modes:

• Scroll mode display
• Screen mode display

The oldmon display contains the following information:

Information      Description
Command menu     Lists input values
Command prompt   Prompts user for information
Error messages   Identifies error condition
CPU status       Displays the following information for the current CPU:

                 • State of the CPU:
                      Up
                      Down
                      Down, idle
                      Down, running
                 • Name of the diagnostic in the current CPU
                 • Program register (P) of the current Exchange Package of the current CPU
                 • Status bits (S) of the current Exchange Package of the current CPU

                 For CRAY X-MP computer systems:

                    ffff mmmm c

                 ffff indicates flags.
                 mmmm indicates mode bits.
                 c indicates the cluster number.

                 For CEA systems:

                    ffff mmmm cc

                 ffff indicates flags.
                 mmmm indicates modes.
                 cc indicates the cluster number.
Down CPU list    List of the down CPUs
Time             Current date and time
Display area     Display area for the portion of central memory associated with the current CPU. The display area can be divided into separate displays, showing different areas of central memory. In addition, each central memory display can be formatted differently. For additional information, refer to subsection 5.2.5.16, View command (v).

5.2.4.1  Scroll mode display

Figure 5-25 shows a scroll mode display.

CPU B: Down, running  Name: offcrit  P = 0a  B 0  S 0000 0000 00  L 0     Wed Oct 19 14:13:14 1988
                                                                          Downed CPUs: B

DIB display for olcrit                   00,000,004,000
name    ='olcrit  '                      00  0675543067115135020040  olcrit
rev     ='5.0     '                      01  0324561402004010020040  5.0
date    ='10/12/88'                      02  0304601363046213634070  10/12/88
pass    = 252                            03  0000000000000000000252
error   = 0                              04  0000000000000000000000
seed    = 1206302764022300543002         05  1206302764022300543002
failpat ='onezero '                      06  0000000000000000000000
failcln = 0                              07  0000000000000000000200
isop    = 1000
numins  = 200                            00,000,003,600
                                         00  running ........................
ibuff 12000a   S5   S7+S5                20  single cpu mode .................

A/Dump Cpu Enter Fill Go Halt Load Opts Quit Redraw Stat Up View Write Xecute

Figure 5-25. Scroll Mode Display

The following information is displayed (in the order listed):

1. Current CPU status; time; down CPU list
2. Central memory display area
3. Error messages
4. Command menu
5. Command prompt

The following information applies to command line entries:

• Enter commands after the command prompt.
• If a command string is executed, the display scrolls upward and a new display appears.
• If a command is entered without a required argument, the argument menu is displayed with a command prompt. Enter an argument after the prompt. After all commands are executed, the display scrolls upward and a new display appears.

5.2.4.2  Screen mode display

Figure 5-26 shows a screen mode display.

A/Dump Cpu Enter Fill Go Halt Load Opts Quit Redraw Stat Up View Write Xecute

CPU B: Down, running  Name: offcrit  P = 0a  B 0  S 0000 0000 00  L 0     Wed Oct 19 14:13:14 1988
                                                                          Downed CPUs: B

DIB display for olcrit                   00,000,004,000
name    ='olcrit  '                      00  0675543067115135020040  olcrit
rev     ='5.0     '                      01  0324561402004010020040  5.0
date    ='10/12/88'                      02  0304601363046213634070  10/12/88
pass    = 252                            03  0000000000000000000252
error   = 0                              04  0000000000000000000000
seed    = 1206302764022300543002         05  1206302764022300543002
failpat ='onezero '                      06  0000000000000000000000
failcln = 0                              07  0000000000000000000200
isop    = 1000
numins  = 200                            00,000,003,600
                                         00  running ........................
ibuff 12000a   S5   S7+S5                20  single cpu mode .................

Figure 5-26. Screen Mode Display

To execute in screen mode, your terminal type must be defined in the terminfo(4F) database. See terminfo(4F) and curses(3X) for more information. The TERM environment variable sets the default terminal type.
If TERM is set to a valid terminal type, oldmon executes in screen mode; if not, oldmon executes in scroll mode. For information on the TERM environment variable, refer to sh(1).

If your terminal type is not defined or is invalid, oldmon does not enter screen mode; instead, an error message is displayed.

In screen mode, the display is updated (overwritten) rather than scrolled. The following information is displayed (in the order listed):

1. Command menu
2. Command prompt
3. Error messages
4. Current CPU status; time; down CPU list
5. Central memory display area

The following information applies to command line entries:

• Enter commands after the command prompt.
• If a command string is executed, the entire display is updated.
• If a command is entered without a required argument, the argument menu is displayed with a command prompt. Enter an argument after the prompt. After all commands are executed, the entire display is updated.

5.2.5  PROGRAM COMMANDS

The oldmon commands are entered from a front-end terminal or an IOS station console. Figure 5-24 shows the Main menu for oldmon.

Unless a complete command string is entered from the Main menu (with all of the required arguments), the program displays various menus with prompts for additional entries. If you enter an invalid argument, the program displays a menu listing the valid arguments. Reenter a valid argument and continue. Between argument entries, the menu, prompt, and message lines are updated. After a command is executed with all of the required arguments, the entire display is redrawn.

The following guidelines apply to all command entries:

• Select commands from the command menu by entering the first letter of the command. Depending on the command, the program displays various menus with prompts for arguments.
• Enter all inputs in uppercase, lowercase, or a combination of both.
• Press the Return key to receive a prompt for the next required argument or to execute the command if all of the required arguments are entered.
• Enter the less-than key (<) to return to the preceding menu. This allows you to reenter an argument.
• Enter the greater-than key (>) to abort the current command and return to the Main menu.
• Use a semicolon (;) to combine commands. The following applies to a combined command entry:

     If any of the command entries are incomplete, the program issues a prompt for additional arguments for the first incomplete command.

     If an error is detected in the command list, the program displays the menu for the first incorrect command. This allows you to reenter the menu commands and any subsequent commands.

     If you have not yet pressed the Return key to execute the command list, you can abort the last command in the list by pressing the greater-than key (>). All commands in the list are executed except the last entry, and the program returns to the Main menu.

• Use white space (blank spaces, tabs, and newline characters) to indicate the end of an address or file name.
• Enter a pound sign (#) to start a comment in a command buffer.

5.2.5.1  Common arguments

Several of the oldmon commands accept the following arguments:

Argument   Description
address    Enter an octal address, or press K (Key) followed by a diagnostic information block (DIB) entry (refer to the off-line diagnostic listings for a list of DIB entries). All addresses are relative to the central memory image of the current CPU. The related menus indicate whether a parcel or word address is expected. If a parcel address is required, enter the word address followed by a parcel designator (do not leave a space between them). The parcel designator can be a, b, c, or d; the default is a. If a parcel address is not required and no parcel designator is specified, the address is assumed to be a word address.
cpu        CPU number.
           cpu is a value in one of the following ranges:

              0, 1, 2, 3, 4, 5, 6, 7   or   a, b, c, d, e, f, g, h

           The default is the current CPU.
file       Enter a valid file name. Full and relative path names are valid file names. If a relative path name is specified, the program searches for the file in the current directory. If the file is not found, the program uses the DMONPATH environment variable to search. For information on the DMONPATH environment variable, refer to sh(1).
format     Enter one of the following arguments to select the display format for the Dump (d) and View (v) commands:

           Argument   Format
           d          DIB format (View command only); displays the DIB of the diagnostic in the current CPU.
           i          Instruction format; displays central memory in disassembled instructions. The program issues a prompt for a word or parcel address.
           p          Parcel format; displays central memory in 6-digit octal parcels. The program issues a prompt for a word address.
           r          Register format (View command only); displays the registers of the current CPU when the CPU is down and idle.
           t          Text format; displays central memory in ASCII. The program issues a prompt for a word address.
           w          Word format; displays central memory in 22-digit octal words. The program issues a prompt for a word address.
           x          Exchange Package format; displays central memory as an Exchange Package (View command only). The program issues a prompt for a word address or an Exchange Package value. The Exchange Package arguments are as follows:

                      Argument   Exchange Package
                      c          Current (default)
                      s          Starting

5.2.5.2  Append (a) and Dump (d) commands

To append or dump a formatted central memory dump to a file (commands a and d, respectively), use the following command synopses.

Synopsis (Append command):

   a start-address end-address format file

Synopsis (Dump command):

   d start-address end-address format file

You must have permission to write to the specified file.
The file is created if it does not already exist. Before writing the dump to the file, the program issues a prompt for comments to precede the dump. To print the dump, enter an asterisk (*) for file. See subsection 5.2.3.3, Environment variables, for more information.

To set append or dump arguments, use the following command synopses.

Synopsis (Append command):

   a argument file

Synopsis (Dump command):

   d argument file

argument   Enter one of the following values for argument:

           Argument   Description
           d          Appends or dumps the DIB of the diagnostic in the current CPU to file
           r          Appends or dumps the registers of the current CPU to file (the CPU must be down and idle)
           s          Appends or dumps the current screen to file

5.2.5.3  CPU command (c)

To specify a new default CPU, use the following command synopsis.

Synopsis:

   c cpu

The default CPU's memory area can be displayed in the memory display area. The Status command is valid for the default CPU only. The Go, Halt, and Load commands assume the default CPU if a different CPU is not specified. The initial default CPU is the first CPU downed from the command line or CPU a if no CPU was downed.

5.2.5.4  Enter command (e)

To enter a value at a specific address, use the following command synopsis.

Synopsis:

   e address value

If address is a parcel address and value exceeds O'177777, the program displays an error message. Reenter and continue.

5.2.5.5  Execute command (z)

To execute a command buffer containing oldmon commands, use the following command synopsis.

Synopsis:

   z file

5.2.5.6  Fill command (f)

To fill consecutive central memory locations, use either of the following command synopses.

Synopsis:

   f address value ... value

address    Indicates the first central memory location to be filled with the first value specified. Each consecutive value is placed in the next consecutive central memory location. Depending on the address specified, the program fills the memory location with words or parcels.
Press the Return key after address and after each value. If you press the Return key without first entering a value, the current central memory location remains unchanged and the next value specified is placed in the next consecutive memory location.

To return to the preceding word or parcel location, press the less-than key (<). You can modify the word or parcel value before proceeding to the next location.

To signal the completion of the consecutive entries, enter a period (.) or press the greater-than key (>).

To fill memory in a specified range with a specific data pattern, use the following command synopsis:

Synopsis:  fp start-address end-address value

If parcel addresses are specified, each parcel in the given range is filled with the given data value. If word addresses are given, each word in the given range is filled with the given data value.

5.2.5.7  Go command (g)

To start a test in a CPU, use the following command synopsis.

Synopsis:  g [cpu] [exchange-package]

exchange-package   Enter one of the following arguments for exchange-package:

                   Argument   Exchange Package     CX/CEA Location   CEA Location
                   c          Current              120               1200
                   s          Starting (default)   140               740
                   address

If the CPU is not down, the program issues a prompt for you to verify the request to down the CPU. Enter y (yes) to down the CPU and start the test. Enter n (no) to cancel the Go command.

5.2.5.8  Halt command (h)

To halt test execution in a down CPU, use the following command synopsis.

Synopsis:  h [cpu]

The CPU idles until the Go or Up command is executed.

5.2.5.9  Load command (l)

To load a test into a CPU's central memory buffer, use the following command synopsis.

Synopsis:  l [cpu] address file

file        Enter one of the following arguments for file:

            Argument   Description
            file       File containing the test to be executed
            *          Test loop

5.2.5.10  Options command (o)

To set test options, use the following command synopsis.
Synopsis:  o option argument

option      The values for option are as follows (the argument value is dependent on option):

            Option   Description
            c        Generates a display that is continuously refreshed at a specified interval (in seconds). Use the following command synopsis:

                     o c seconds

                     seconds is the number of seconds; a value in the range 1 through 9. To return to the Main menu, an interrupt must be sent to oldmon. Typically, pressing the Control-C keys sends an interrupt to oldmon. See the appropriate front-end station guide and stty(1).

            d        Downs a specified CPU. Use the following command synopsis:

                     o d cpu

                     cpu defaults to the current CPU. The CPU is downed and left idle. (The Go command also downs the CPU.)

            l        Sets a new limit address for the current CPU. Use the following command synopsis:

                     o l address

                     The new limit address is rounded up to the next 0'1000 word boundary.

            t        Specifies the terminal type (required for screen mode; refer to subsection 5.2.4.2, Screen mode display). Use the following command synopsis:

                     o t type

                     type is one of the terminal types defined in the terminfo(4F) database. The TERM environment variable sets the default terminal type. For information on the TERM environment variable, refer to sh(1).

5.2.5.11  Quit command (q)

To exit oldmon, enter one of the following commands:

            Command   Description
            eof       End-of-file (typically, press the Control-d keys). Enter from any menu. A prompt is displayed before the request is processed. To verify or cancel the request, enter y (yes) or n (no), respectively.
            q         Quit. Enter from the Main menu only. A prompt is displayed before the request is processed. To verify or cancel the request, enter y (yes) or n (no), respectively.

5.2.5.12  Redraw command (r)

To redraw the display, enter r.

5.2.5.13  Shell escape command (!)

To execute a shell command, use the following command synopsis.

Synopsis:  !
[shell-command]

The oldmon monitor executes shell-command in a subshell. If shell-command is omitted, oldmon executes /bin/sh. You must exit this shell to continue oldmon. See sh(1) for more information.

5.2.5.14  Status command (s)

To update the current Exchange Package of the current CPU, enter s. If the current CPU is not down, an error message is displayed.

5.2.5.15  Up command (u)

To return a down CPU to normal system operations, use the following command synopsis.

Synopsis:  u [cpu]

5.2.5.16  View command (v)

To view a formatted area of central memory on all or part of the display area, use the following command synopsis.

Synopsis:  v display format address

display     Enter one of the following arguments for display:

            Argument   Description
            f          Full display
            l          Left half of the display
            r          Right half of the display
            tl         Top left quadrant
            tr         Top right quadrant
            bl         Bottom left quadrant
            br         Bottom right quadrant

To display the DIB of the current diagnostic, use the following synopsis.

Synopsis:  v display d argument

argument    Enter one of the following arguments:

            Argument   Description
            RETURN     Displays the DIB starting at the beginning
            d          Displays the differences section of the DIB (confidence tests only)
            k key      Displays the DIB starting with DIB key

To display the current values of the CPU's registers, use the following synopsis.

Synopsis:  v display r

To scroll the display areas forward or backward, use the plus (+) or minus (-) parameters, respectively. The command synopses are as follows.

Synopsis:  v [display] +[n]  or  v [display] -[n]

display     Enter the display to be scrolled. If omitted, all display areas are scrolled.

n           Number of lines to scroll. The default for n is 8 if display is tl, tr, bl, or br. Otherwise, the default is 16 (the number of lines in the display area).

5.2.5.17  Write command (w)

To write an area of central memory to a binary file, use the following command synopsis.
Synopsis:  w start-address end-address file

5.2.6  PROGRAM EXAMPLE

This subsection contains a commented oldmon execution example.

Example:

$ oldmon -d b
Do you really want to down CPU b?
Type y or n> y

**************************************************************
The -d b command line option requests that oldmon down CPU B immediately. Enter y to confirm the request.
**************************************************************

Cannot find configuration file oldmon.cf, should I initialize it?
Enter Yes or No (y/n)> y

**************************************************************
The oldmon monitor cannot locate the configuration file oldmon.cf. Enter y to initialize oldmon.cf.
**************************************************************

A/Dump Cpu Enter Fill Go Halt Load Opts Quit Redraw Stat Up View Write Xecute
v
CPU B: Down, idle      Name: ** none **      Wed Oct 19 13:21:18 1988
P 0a   B 0   L 0   S 0000 0000 00            Downed CPUs: B

OLDMON Version 1.0 - Online Down CPU Monitor
CRAY Y-MP Down CPU Monitor for the UNICOS Operating System.
Copyright (c) Cray Research, Inc. Unpublished - All rights reserved under the copyright laws of the United States.
CRAY PROPRIETARY

**************************************************************
The Main menu for oldmon is displayed. CPU B is the default CPU. It is displayed as down and idle. Enter v to select the View command.
**************************************************************

Display: Full, Top, Bottom, Left, Right; Scroll: +
View l

**************************************************************
The choice of input values is displayed. Enter l to select the left half of the screen as the display area.
**************************************************************

Format: Dib, Instr, Parcel, Word, Register, Text, eXchange pkg; Scroll: +
View Left in d

**************************************************************
The choice of input values is displayed. Enter d to select the DIB format.
**************************************************************

RETURN for DIB; Differences; Key
View Left in DIB format RETURN

**************************************************************
The choice of input values is displayed. Press RETURN to display the beginning of the DIB.
**************************************************************

A/Dump Cpu Enter Fill Go Halt Load Opts Quit Redraw Stat Up View Write Xecute
l
CPU B: Down, idle      Name: ** none **      Wed Oct 19 13:22:49 1988
P 0a   B 0   L 0   S 0000 0000 00            Downed CPUs: B
DIB display unavailable

**************************************************************
The Main menu for oldmon is redisplayed. Enter l to load a diagnostic into the common memory buffer for CPU B.
**************************************************************

Enter word address
Load cpu B at 0 RETURN

**************************************************************
Enter the address within the buffer where the diagnostic is to be loaded. Pressing RETURN without entering an address defaults the address to zero.
**************************************************************

Enter file name, * for testloop
Load cpu B at 0 from offcrit

**************************************************************
Enter a file name. In this example, offcrit (off-line version of olcrit) is specified.
**************************************************************

A/Dump Cpu Enter Fill Go Halt Load Opts Quit Redraw Stat Up View Write Xecute
v r w 4000
CPU B: Down, idle      Name: offcrit         Wed Oct 19 13:24:39 1988
P 0a   B 0   L 0   S 0000 0000 00            Downed CPUs: B
DIB display for olcrit
name    ='olcrit  '
rev     ='5.0     '
date    ='10/12/88'
pass    = 0
error   = 0
seed    = 33
lmstart = 0
failpat =
isop    = 1000
numins  = 200
ibuff 17000a EXIT 00
jbuff 17400a EXIT 00
inita0  = 0000000000000000000000
inita1  = 0000000000000000000000

**************************************************************
The command string to set the right half of the display is entered. The blank space between each entry is optional.
1. Enter v to select the View command.
2. Enter r to select the right half of the display.
3. Enter w to select word format.
4. Enter 4000 to specify the display address.
**************************************************************

A/Dump Cpu Enter Fill Go Halt Load Opts Quit Redraw Stat Up View Write Xecute
e
CPU B: Down, idle      Name: offcrit         Wed Oct 19 14:10:33 1988
P 0a   B 0   L 0   S 0000 0000 00            Downed CPUs: B
DIB display for olcrit              00,000,004,000
name    ='olcrit  '                 00 0675543067115135020040 olcrit
rev     ='5.0     '                 01 0324561402004010020040 5.0
date    ='10/12/88'                 02 0304601363046213634070 10/12/88
pass    = 0                         03 0000000000000000000000 ........
error   = 0                         04 0000000000000000000000 ........
seed    = 33                        05 0000000000000000000033 ........
failpat =                           06 0000000000000000000000 ........
failcln = 0                         07 0000000000000000000200 ........
isop    = 1000                      10 1000000000000000037777 ......?.
numins  = 200                       11 0000000000000000000000 ........
ibuff 12000a ERR                    12 0000000000000000000007 ........
jbuff 12400a ERR                    13 0000000000000000000000 ........
inita0  = 0000000000000000000000    14 0000000000000000000000 ........
inita1  = 0000000000000000000000    15 0000000000000000000000 ........
                                    16 0000000000000000001000 ........
                                    17 0000000000000000000000 ........

**************************************************************
The new display is shown. Use the Enter command to set a location within the memory buffer.
**************************************************************

Key : Press RETURN when complete
Enter at Key seed

**************************************************************
Enter seed to specify that the seed DIB entry is to be used.
**************************************************************

The current value at Key seed is 0000000000000000000033
Enter at Key seed the value of 1206302764022300543002

**************************************************************
Enter the value 1206302764022300543002. Presumably, this is the seed from an on-line failure of olcrit.
**************************************************************

A/Dump Cpu Enter Fill Go Halt Load Opts Quit Redraw Stat Up View Write Xecute
e 4017 1
CPU B: Down, idle      Name: offcrit         Wed Oct 19 14:12:59 1988
P 0a   B 0   L 0   S 0000 0000 00            Downed CPUs: B
DIB display for olcrit              00,000,004,000
name    ='olcrit  '                 00 0675543067115135020040 olcrit
rev     ='5.0     '                 01 0324561402004010020040 5.0
date    ='10/12/88'                 02 0304601363046213634070 10/12/88
pass    = 0                         03 0000000000000000000000 ........
error   = 0                         04 0000000000000000000000 ........
seed    = 1206302764022300543002    05 1206302764022300543002 .._@....
failpat =                           06 0000000000000000000000 ........
failcln = 0                         07 0000000000000000000200 ........
isop    = 1000                      10 1000000000000000037777 ......?.
numins  = 200                       11 0000000000000000000000 ........
ibuff 12000a ERR                    12 0000000000000000000007 ........
jbuff 12400a ERR                    13 0000000000000000000000 ........
inita0  = 0000000000000000000000    14 0000000000000000000000 ........
inita1  = 0000000000000000000000    15 0000000000000000000000 ........
                                    16 0000000000000000001000 ........
                                    17 0000000000000000000000 ........

**************************************************************
The Enter command is used again to enter a 1 at location 4017. This sets the repeat flag for offcrit.
(Refer to the offcrit listing for more information.)
**************************************************************

A/Dump Cpu Enter Fill Go Halt Load Opts Quit Redraw Stat Up View Write Xecute
y

**************************************************************
Enter a y to confirm the quit. Note that CPU B will be left down since it was not explicitly returned to UNICOS with the Up command.
**************************************************************

5.2.7  PROGRAM MESSAGES

This subsection lists the oldmon messages in alphabetical order.

Address addr exceeds limit address
    This message is associated with the Enter (e) command. Reenter a valid address to continue.

Cannot access printer
    If the OLDMON_PRINTER environment variable is set, its value is not a valid command. If OLDMON_PRINTER is not set, the command ezlp cannot be executed.

Cannot allocate memory
    This message is associated with the Load (l) or Options (o) command.

Cannot dump DIB of the loaded diagnostic
    This message is associated with the Append (a) or Dump (d) command.

Cannot fill memory outside of buffer
    This message is associated with the Fill (f) command. Reenter the Fill command.

Cannot find DIB entry x
    This message is associated with the Enter (e) or Fill (f) command.

CPU n interrupts: list
    This message lists all the interrupts for CPU n.

CPU n is already down
    The oldmon monitor tried to down a CPU that it has downed already. Indicates an internal oldmon error. Contact your CRI representative.

CPU n is not down
    This message is associated with the Status (s) or Up (u) command.

CPU n registers are unavailable and cannot be dumped
    Registers cannot be dumped unless the current CPU is down and idle. This message is associated with the Append (a) or Dump (d) command.

Exception condition: caught signal
    Refer to signal(2).

Exchange Package is not in the CPU's memory
    This message is associated with the Go (g) command.
File file is empty
    An empty file was specified when loading a diagnostic.

File file: system error message
    The oldmon monitor had an error while accessing, reading, or writing file.

Invalid input input
    The oldmon monitor received unexpected input.

The ioctl-request ioctl failed for cpu-device: errno n: system error message
    The oldmon monitor made the specified request to UNICOS and the request failed.

plock: errno n: system error message
    The oldmon monitor made a request to be locked in memory and the request failed.

Second address must be greater than first address
    This message is associated with the Append (a), Dump (d), Fill (f), or Write (w) command.

Single CPU system; cannot down a CPU.
    The oldmon monitor does not allow downing a CPU on a single CPU system.

Terminal type not set, cannot use screen mode
    The TERM environment variable was not set when oldmon was started.

Unable to configure loaded diagnostic
    This message is associated with the Load (l) command.

Unknown terminal terminal; cannot use screen mode
    terminal is not defined in the terminfo(4F) database.

Value exceeds parcel size
    This message is associated with the Enter (e) or Fill (f) command. value must not exceed 0'177777.

5.3  unitap

The unitap† test is an on-line magnetic tape test that allows you to test up to 8 tape paths in parallel. It is supported in a standard configuration.

You can execute unitap interactively or from a UNICOS shell script.†† Interactive execution is menu-driven, with a 240-character command buffer. From each menu, you can access all of the other menus. All user input and output is saved in a trace file for later evaluation.

To simulate passing and failing test execution examples without removing the tape device from normal system operations, you can execute unitap in Learn mode.
The unitap testing options are as follows:

    Testing Option                 Description
    All tape tests                 All of the tape tests (test sections) are executed (run time: approximately 3 minutes).
    Two-channel conflict tests     A selection of tape tests is executed in parallel to exercise 2 tape paths (run time: approximately 10 minutes). The tests verify whether the channels can withstand conflict.
    Three-channel conflict tests   A selection of tape tests is executed in parallel to exercise 3 tape paths (run time: approximately 10 minutes). The tests verify whether the channels can withstand conflict.
    Canned test                    A user-selected test is executed (for example, a byte counter test).
    Test loop                      A user-defined test is executed (refer to subsection 5.3.4.6, Programming Tool).

For additional information, refer to subsection 5.3.3.3, Test Menu.

†   CX/CEA systems only.
††  Execution from a shell script is deferred.

In addition to providing error detection capabilities, unitap provides the following troubleshooting tools:

    Troubleshooting Tool    Description
    Breakpoint              Sets breakpoints in the tape tests
    Channel Commands†       Issues channel commands
    Compare Data Buffer     Displays data miscomparisons for the write and read data buffers
    Display Memory          Displays the write and read central memory data buffers, and allows you to modify the write buffer
    System Call History     Displays a history of the last 15 system calls and the last 10 events that preceded the current event. An event is defined as any of the following actions:
                              A failure occurs
                              A breakpoint is reached
    Programming             Builds test loops
    Packet Status           Displays the status of the last packet sent for each channel at the time of the last 10 events that preceded the current event

For additional information, refer to subsection 5.3.4, Debug Tools.

5.3.1  PROGRAM SYNOPSIS

You can execute unitap interactively or from a UNICOS shell script.†† This subsection describes how to execute unitap from a shell script.
For a description of interactive execution, refer to subsection 5.3.2, Interactive Program Execution.

†   Deferred implementation.
††  Execution from a shell script is deferred.

5.3.2  INTERACTIVE PROGRAM EXECUTION

Interactive execution is menu-driven, with a 240-character command buffer. From each menu, you can access all of the other menus. Menu options can be entered in uppercase or lowercase.

5.3.3  PROGRAM MENUS

This subsection provides a summary of the unitap menu system. The following menus are described.

•  Main menu
•  Variable menu
•  Test menu
•  Canned Test menu
•  Debug menu
•  Global Options menu
•  Hardware Layout menu

5.3.3.1  Main Menu

The Main Menu is displayed when unitap is initialized or when you enter MN from any menu (refer to figure 5-27).

    unitap                        Main Menu

    Option         Description
    D              Debug Menu
    T              Test Menu
    V              Variable Menu

    G              Global Options Menu
    W              Program notes
    EXIT           Exit the diagnostic
    HELP option    Information on option

    Note: these menu options are global (valid from all menus).

    Figure 5-27.  Main Menu for unitap

The menu options are as follows:

    Option         Description
    D              Debug Menu (refer to subsection 5.3.3.5)
    T              Test Menu (refer to subsection 5.3.3.3)
    V              Variable Menu (refer to subsection 5.3.3.2)
    G              Global Options Menu (refer to subsection 5.3.3.6)
    W              Program notes
    EXIT           Exit the diagnostic; channels dedicated to on-line diagnostic testing are released.
    HELP option    Information on option

5.3.3.2  Variable Menu

The Variable Menu is displayed when you enter V from any menu (refer to figure 5-28).

    unitap                        Variable Menu
    Path 1   CH=20, CO=0, DV=dv, DN=6250, PC=1
    Option    Description
    CH n      Channel number (20-33 octal)
    CO n      Controller number (0-F hexadecimal)
    DN n      Density value (800, 1600, or 6250, CART)
    DV dv     Device number (0-FFF ASCII)
    Pn        Path (1-8)
    PC n      Pass count (decimal)
    RL        Release the dedicated (reserved) path for the tape unit

    G         Global Options Menu
    R         Previous menu

    Note: these menu options are global (valid from all menus).

    Figure 5-28.  Variable Menu

Each option is briefly described in the Variable Menu. However, the following descriptions provide further clarification:

    Option    Description
    CH n      Channel number. n is a value in the range 0'20 through 0'33. The default for n is 0'20 through 0'27, for paths 1 through 8, respectively.
    CO n      Controller number. n is a value in the range 0 through F (hexadecimal). The default for n is 0.
    DN n      Density value. n is one of the following values: 800, 1600, 6250 (default), or CART.
    DV dv     Device number (required). dv is a site-defined ASCII value.
    Pn        Path under test (channel, controller, and device). n is a value in the range 1 through 8. The default for n is 1.
    PC n      Pass count. The default for n is 1.
    RL        Release the dedicated path for the tape unit.

5.3.3.3  Test Menu

The Test Menu is displayed when you enter T from any menu (refer to figure 5-29).

    unitap                        Test Menu
    Path 1   CH=20, CO=0, DV=dv, DN=6250, PC=1

    Option    Description
    A         Execute all the tape tests
    C         Display the Canned Test Menu
    2         Execute the two-channel conflict tests
    3         Execute the three-channel conflict tests

    G         Global Options Menu
    R         Previous menu

    Note: these menu options are global (valid from all menus).

    Figure 5-29.  Test Menu

The menu options are as follows:

    Option    Description
    A         All tape tests. All of the tape tests are executed (run time: approximately 3 minutes).
    2         Two-channel conflict tests. A selection of tape tests is executed in parallel to exercise 2 tape paths (run time: approximately 10 minutes).
              The tests verify whether the channels can withstand conflict.
    3         Three-channel conflict tests. A selection of tape tests is executed in parallel to exercise 3 tape paths (run time: approximately 10 minutes). The tests verify whether the channels can withstand conflict.
    C         Canned test. A user-selected test is executed (for example, a byte counter test).

5.3.3.4  Canned Test Menu

The Canned Test Menu is displayed when you enter C from any menu (refer to figure 5-30).

    unitap                        Canned Test Menu
    Path 1   CH=20, CO=0, DV=dv, DN=6250, PC=1

    Option    Description
    AC        All basic commands tests (except Read)
    BC        Byte counter test (transfers up to 4 kbytes)
    BF        Buffer tests (R/W 64 bits)
    BN        Next byte counter test (transfers 4 to 8 kbytes)
    BS        Bus test (R/W 8 bits)
    LA        Ladder tests
    RB        Random buffer tests (R/W 64 random bits)
    ST        Stress test
    TP        Tape position commands tests

    G         Global Options Menu
    R         Previous menu

    Figure 5-30.  Canned Test Menu

The menu options are as follows:

    Option    Description
    AC        All basic commands tests. Tests the rewind, write, write tape mark, forward block, backward block, forward tape mark, and backward tape mark tape movement commands.
    BC        Byte counter test. Writes and reads 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, and 4096 bytes to the tape.
    BF        Buffer tests. Writes and reads 64-bit patterns to the tape.
    BN        Next byte counter test. Writes and reads 1 sector (4096 bytes) plus 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, and 4096 bytes to the tape.
    BS        Bus test. Writes and reads 8-bit patterns to the tape.
    LA        Ladder tests. Writes and reads 1, 2, 3, 4, 5, 6, 7, and 8 sectors to the tape.
    RB        Random buffer tests. Writes and reads random data patterns to the tape.
    ST        Stress test
    TP        Tape position commands tests. Writes patterns to the tape, issues tape positioning commands, and then reads the patterns to verify that the positioning commands work.
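
The transfer sizes quoted for the BC, BN, and LA canned tests follow a simple pattern, which the short Python sketch below enumerates. It is an illustration only, not part of unitap: the function names are hypothetical, and the 4096-byte sector size is taken from the BN description above.

```python
SECTOR_BYTES = 4096  # one sector, per the BN description above


def bc_sizes():
    """Byte counter test (BC): powers of two from 1 through 4096 bytes."""
    return [2 ** k for k in range(13)]


def bn_sizes():
    """Next byte counter test (BN): one sector plus each BC size."""
    return [SECTOR_BYTES + n for n in bc_sizes()]


def la_sizes():
    """Ladder tests (LA): 1 through 8 whole sectors, expressed in bytes."""
    return [sectors * SECTOR_BYTES for sectors in range(1, 9)]


if __name__ == "__main__":
    for name, sizes in (("BC", bc_sizes()), ("BN", bn_sizes()), ("LA", la_sizes())):
        print(name, sizes)
```

The BN sizes run from 4097 through 8192 bytes, which agrees with the "transfers 4 to 8 kbytes" note in the Canned Test Menu.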
5.3.3.5  Debug Menu

The Debug Menu is displayed when you enter D from any menu (refer to figure 5-31).

    unitap                        Debug Menu

    Option    Description
    B         Breakpoint Tool
    CC†       Channel Commands Tool
    CD        Compare Data Buffer Tool
    E         Fail execution (Learn mode)
    H         System Call History Tool
    L         Learn mode/System mode (toggle)
    LO        Hardware Layout Menu
    M         Memory Tool (Central Memory)
    PG        Programming Tool
    S         Packet Status Tool

    G         Global Options Menu
    R         Previous menu

    Note: these menu options are global (valid from all menus).

    †  Deferred implementation

    Figure 5-31.  Debug Menu

For additional information, refer to subsection 5.3.4, Debug Tools.

5.3.3.6  Global Options Menu

The Global Options Menu is displayed when you enter G from any menu (refer to figure 5-32).

    unitap                        Global Options Menu

    Option    Description                      Option    Description
    A         All confidence tests             M         Memory Tool
    B         Breakpoint Tool                  MN        Main menu
    C         Canned Test Menu                 PG        Programming Tool
    CB n      Command buffer pass count        PC n      Pass count (decimal)
    CC†       Channel Commands Tool            Pn        Path (1-8)
    CD        Compare Data Buffer Tool         PT        Print screen
    CH n      Channel number                   RL        Release path
    CO n      Controller number                RT        Return from breakpoint
    D         Debug Menu                       S         Packet Status Tool
    DN n      Density value                    T         Test Menu
    DV n      Device number                    V         Variable Menu
    E         Error mode (Learn mode)          W         Program notes
    H         System Call History Tool         2         Two-channel conflict test
    L         Learn mode/System mode           3         Three-channel conflict test
    LO        Display layout

    EXIT      Exit diagnostic                  HELP option    Information on option
    R         Previous menu

    †  Deferred implementation

    Figure 5-32.  Global Options Menu

5.3.3.7  Hardware Layout Menu

The Hardware Layout Menu is displayed when you enter LO from any menu (refer to figure 5-33).

    unitap                        Hardware Layout Menu

    Option    Description
    D         Debug Menu
    BM        Block Multiplexer layout

    Figure 5-33.
    Hardware Layout Menu

The Block Multiplexer Layout Menu for a BMC-5 is displayed when you enter BM from the Hardware Layout Menu (refer to figure 5-34).

    unitap                        Block Multiplexer Layout Menu (BMC-5)

    Option    Description
    D         Debug Menu
    BM        Block Multiplexer layout

    Figure 5-34.  Block Multiplexer Layout Menu (BMC-5)

5.3.4  DEBUG TOOLS

The unitap debug tools can be selected from any menu. These tools are as follows:

    Tool                    Menu Option
    Breakpoint              B
    Channel Commands†       CC
    Memory Buffer           M
    Compare Data Buffer     CD
    System Call History     H
    Programming             PG
    Packet Status           S

These tools are described in the subsections that follow.

†  Deferred implementation

5.3.4.1  Breakpoint Tool

The Breakpoint Tool is displayed when you enter B from any menu (refer to figure 5-35). This tool allows you to set a breakpoint immediately preceding or following a system call in a test. When the breakpoint is reached, the user's keyboard input is executed. If an error is detected, information relating to the event is displayed. An event is defined as any of the following actions: a failure occurs or a breakpoint is reached. Use the System Call History and Packet Status tools to display additional information regarding an event.

    unitap                        Breakpoint Tool

    Breakpoint = 0
    Breakpoint pass count = 1
    When breakpoint is reached, the user's keyboard input is executed.

    message displayed on error
    Event n occurred after y system calls.

    Option    Description
    BP n      Execute a breakpoint on pass n
    BR n      Set or clear a breakpoint. n is one of the following breakpoint numbers:
                0 - Clear the breakpoint
                1 - Set breakpoint prior to the system call
                2 - Set breakpoint after the system call
    RT        Return to test after a breakpoint (global option)

    D         Debug Menu
    G         Global Options Menu
    R         Previous menu

    Figure 5-35.
    Breakpoint Tool

5.3.4.2  Channel Commands Tool

The Channel Commands Tool† is displayed when you enter CC from any menu (refer to figure 5-36). This tool allows you to issue channel commands to the tape device, and to display channel status. For additional information on the channel commands, refer to the APML Reference Card for COS and UNICOS, CRI publication SQ-0059.

    unitap                        Channel Commands Tool
    Path 1   CH=20, CO=0, DV=dv, DN=6250, PC=1

    LMAR0 = 123456           Bus in  = 123001
    LMAR1 = 123457           Tags in = 377
    Byte counter = 1000      Flags   = IDLE

    Command   Description             Command   Description
    00        Clear chan control      11        Read byte counter register
    01        Reset channel           12        Read bus and status
    02        Send command            13        Read input tags
    03        Read address            14 n      Write LMAR  (n: accumulator value)
    04        Single byte I/O         15 n      Write BC    (n: accumulator value)
    05        Run diagnostics         16 n      Enter Addr  (n: accumulator value)
    10        Read LMAR               17 n      Write tags  (n: accumulator value)

    R         Previous menu
    G         Global Options Menu

    Figure 5-36.  Channel Commands Tool

†  Deferred implementation

5.3.4.3  Display Data Buffer Tool

The Display Data Buffer Tool is displayed when you enter M from any menu (refer to figure 5-37). This tool allows you to display the read and write data buffers, and to modify the write data buffer. Each data buffer is 16 Kwords.
    unitap                        Display Data Buffer Tool

    message displayed on error

    Write Address = 0                        Read Address = 0
    0  000000 000000 020124 044145           0  000000 000000 000000 000000
    1  000000 000000 070565 064553           1  000000 000000 000022 153207
    2  000000 000000 041162 067556           2  000000 000000 000045 126416
    3  000000 000000 021106 067570           3  000000 000000 000070 101625
    4  000000 000000 045155 070144           4  000000 000000 000113 055034
    5  000000 000000 000000 170435           5  000000 000000 000136 030243
    6  000000 000000 000001 020526           6  000000 000000 000161 003452
    7  000000 000000 000001 050617           7  000000 000000 000203 156661

    Option         Description
    DA n           Display address
    DF DB DP DW    Display Forward or Back in Parcel or Word format
    DI DO DD DX    Display in Ascii, Octal, Decimal, Hex
    ST SS SP SK    Store adr data, Store Seeded random, Store Pattern, Store Skip
    CP LP LN       Copy a block of data, Locate Pattern, Locate a non-pattern

    Figure 5-37.  Display Data Buffer Tool (1 of 2)

    unitap                        Display Data Buffer Tool

    Command               Description
    CP addr1 addr2 n      Copy n words from addr1 to addr2
    LP addr pattern       Search for pattern starting at addr
    SK addr data n y      Store data in n words (skip y words between stores), starting at addr
    SP addr data n        Store data consecutively in n words, starting at addr
    SS addr seed n        Store random data consecutively in n words, starting at addr, using seed to start the random number generator
    ST addr data          Store data in addr
    D/DR/DL addr          Display full/right/left screen starting at addr
    Dx/DRx/DLx            Display x: F (forward), B (backward), A (ASCII), O (octal), D (decimal), X (hexadecimal), P (parcel), W (word)

    Figure 5-37.  Display Data Buffer Tool (2 of 2)

5.3.4.4  Compare Data Tool

The Compare Data Tool is displayed when you enter CD from any menu (refer to figure 5-38). This tool allows you to display the read and write data buffers, and exclusive ORs (logical differences) for the Write and Read address comparisons. Each data buffer is 16 Kwords.
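
The exclusive OR comparison can be illustrated with a minimal Python sketch. It is not unitap code: the function name is hypothetical, and the data are the third and fourth parcels of rows 0 through 7 as transcribed from the Write and Read grids of figure 5-37 (the first two parcels of each row are zero in both grids, so they are omitted here).

```python
# Parcels 3 and 4 of rows 0-7, in octal, transcribed from figure 5-37.
WRITE = [
    (0o020124, 0o044145), (0o070565, 0o064553),
    (0o041162, 0o067556), (0o021106, 0o067570),
    (0o045155, 0o070144), (0o000000, 0o170435),
    (0o000001, 0o020526), (0o000001, 0o050617),
]
READ = [
    (0o000000, 0o000000), (0o000022, 0o153207),
    (0o000045, 0o126416), (0o000070, 0o101625),
    (0o000113, 0o055034), (0o000136, 0o030243),
    (0o000161, 0o003452), (0o000203, 0o156661),
]


def compare(write_rows, read_rows):
    """Return the parcel-by-parcel exclusive OR of two buffers.

    A zero parcel means the written and read data agree; each 1 bit
    marks a bit position where they differ.
    """
    return [
        tuple(w ^ r for w, r in zip(w_row, r_row))
        for w_row, r_row in zip(write_rows, read_rows)
    ]


if __name__ == "__main__":
    for address, row in enumerate(compare(WRITE, READ)):
        print(f"{address:o}  " + " ".join(f"{parcel:06o}" for parcel in row))
```

With these inputs, row 1 of the result is 070547 137754, which matches the corresponding row of the Read compare grid shown for the Compare Data Tool.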
    unitap                        Compare Data Tool

    The Read compare grid is the Exclusive OR (or logical difference) of the data at the Write grid address and the data at the Read grid address.

    Write Address = 0                        READ COMPARE Address = 0
    0  000000 000000 020124 044145           0   20124 044145
    1  000000 000000 070565 064553           1   70547 137754
    2  000000 000000 041162 067556           2   41127 141140
    3  000000 000000 021106 067570           3   21176 166355
    4  000000 000000 045155 070144           4   45046 025170
    5  000000 000000 000000 170435           5     136 140676
    6  000000 000000 000001 020526           6     160 023174
    7  000000 000000 000001 050617           7     202 106076

    -------------------------------------------------------------------------
    Enter                     To
    DF DB DO DD DX DP DW      Display Forw, Back, Oct, Dec, Hex, Parc, Word
    DA                        Display Address
    LE                        Locate Error

    Figure 5-38.  Compare Data Tool

5.3.4.5  System Call History Tool

The System Call History Tool is displayed when you enter H from any menu (refer to figure 5-39). This tool allows you to display a history of the last 15 system calls (commands) and the last 10 events that preceded the current event. An event is defined as any of the following actions: a failure occurs or a breakpoint is reached.
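
A bounded history of this kind is commonly kept in fixed-length queues. The Python sketch below shows the idea only; it makes no claim about how unitap stores its history. The class name, method names, and record fields are hypothetical, and the limits of 15 calls and 10 events come from the description above.

```python
from collections import deque


class History:
    """Keep only the most recent system calls and events."""

    def __init__(self, max_calls=15, max_events=10):
        # A deque with maxlen silently discards its oldest entry
        # once the limit is reached.
        self.calls = deque(maxlen=max_calls)
        self.events = deque(maxlen=max_events)
        self.total_calls = 0

    def log_call(self, path, command, label):
        """Record one system call issued on a tape path."""
        self.total_calls += 1
        self.calls.append((self.total_calls, path, command, label))

    def log_event(self, kind):
        """Record an event (a failure or a breakpoint) with a snapshot
        of the calls that led up to it."""
        self.events.append((kind, self.total_calls, list(self.calls)))


if __name__ == "__main__":
    history = History()
    for i in range(20):
        history.log_call(path=1 + i % 3, command="W BUS", label=15000 + i)
    history.log_event("failure")
    kind, after, snapshot = history.events[-1]
    print(f"Event {len(history.events)} ({kind}) occurred after {after} system calls.")
    print("calls retained:", len(snapshot))
```

After 20 logged calls only the last 15 remain, so the oldest retained entry is call number 6.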
unitap                     System Call History Tool

Event # 1 was on PATH 1 in the LMAR Test at label L11002 pattern=40
The diagnostic wrote 40 to the LMAR and read back 44445

Seq   Path  Chan  Cont  Dev  CMD     Sec  Blk  B Adr  Flg  ACC    Label  Pattern
14    1     20    0     0    RLMAR   0    0    0      0    10     11001  10
13    3     22    2     0    F BK    0    0    0      0    0      27008  0
12    2     21    1     0    W BUS   0    0    0      0    2      15000  2
11    1     20    0     0    WLMAR   0    0    0      0    20     11000  20
10    3     22    2     0    BK BK   0    0    0      0    0      27009  0
9     2     21    1     0    W TAG   0    0    0      0    2000   15001  2
8     1     20    0     0    RLMAR   0    0    0      0    20     11001  20
7     3     22    2     0    F BK    0    0    0      0    0      27010  0
6     2     21    1     0    W TAG   0    0    0      0    2000   15002  2
5     1     20    0     0    WLMAR   0    0    0      0    40     11000  40
4     3     22    2     0    BK BK   0    0    0      0    0      27011  0
3     2     21    1     0    R BUS   0    0    0      0    2000   15003  2
2     1     20    0     0    RLMAR   0    0    0      0    40     11001  40
1     3     22    2     0    W TAG   0    0    0      0    0      21000  0
LAST  2     21    1     0    W BUS   0    0    0      0    3      15000  3
------------------------------------------------------------------------
Option  Description              Option  Description
D       Debug Menu               N or P  Previous or next event
G       Global Options Menu      S       Status tool

Figure 5-39.  System Call History Tool

5.3.4.6  Programming Tool

The Programming Tool is displayed when you enter PG from any menu (refer to figure 5-40). This tool allows you to define a test loop with up to 32 steps and up to 8 channels performing read, write, rewind, and compare operations.

unitap                     Programming Tool

Path 1   CH=20, CO=0, DV=dv, DN=6250, PC=1

STEP  PATH  DEV  COMMAND  SECT  BLOCKS  BYTES  BUF ADR  FLAGS  JUMP TO STEP
1     1     20   WRITE    5     1       0      1234     1357   0
2     2     21   REWIND   0     0       0      0        0      0
3     1     20   REWIND   0     0       0      0        0      0
4     2     21   READ     3     2       0      7010     0      0
5     1     20   READ     0     0       0      11000    0      0
6     2     21   FORW TM  2     0       0      0        0      2
7     0     0             0     0       0      0        0      0
8     0     0             0     0       0      0        0      0

Option       Description               Option  Description
BA n         Buffer address            JP n    Jump to step n
BK n         Number of blocks          P n     Path (1-8)
BY n         Number of bytes           SC n    Number of sectors
CM n         Tape/channel command      ST n    Step (1-32)
FG n         Flag settings
DF/DB        Scroll forward/backward   G       Display global options
HELP option  Information on option     RUN     Run test for n passes (PC n)

Now loading step number n

Figure 5-40.  Programming Tool
5.3.4.7  Packet Status Tool

The Packet Status Tool is displayed when you enter S from any menu (refer to figure 5-41). This tool allows you to display the status of the last packet sent for each channel at the time of the last 10 events that preceded the current event. An event may be either of the following actions: a failure occurs or a breakpoint is reached.

unitap                     Packet Status Tool

Path 1   CH=20, CO=0, DV=dv, DN=6250, PC=1

Path 1 was in the LMAR Test at label L11002 pattern=40.
Event # 1 was on PATH 1 in the LMAR Test at label L11002 pattern=40.
The diagnostic wrote 40 to the LMAR and read back 44445

                              Last DFT    Last DFT Reply
Requested Sector Count   =    0           0
Requested Block Count    =    0           0
Data buffer address      =    0           0
Accumulator              =    40          44445
Function                 =    RLMAR       RLMAR
Diagnostic Flags         =    0           0
DFT packet Status flag   =                DONE
DFT packet Status code   =                0

Option  Description
G       Global Options Menu
H       System Call History Tool
P or N  Previous or next event, respectively
Pn      Status for path (1-8)
R       Previous menu

Figure 5-41.  Packet Status Tool

5.3.5  TRACE FILE

All user input and output is saved in a trace file for later evaluation.

5.3.6  LEARN MODE

To simulate passing test execution examples without removing the tape device from normal system operations, you can execute unitap in Learn mode. To enter Learn mode, enter L from any menu; to return to normal system operations (system mode), enter L again. When you execute in Learn mode, the mode is indicated at the top of all the menus.

5.3.7  PROGRAM EXAMPLES

This subsection contains unitap execution examples.

The following example runs all of the unitap tests on device 00 and then exits the program.

unitap dv 00 a exit

The following example runs the two-channel conflict tests on devices 00 and 01, and then exits the program.
unitap dv 00 p2 dv 01 2 exit

5.3.8  PROGRAM MESSAGES

The following subsections contain the unitap messages:

•  Messages with menu displays
•  Messages without menu displays

The messages are listed alphabetically in each subsection.

5.3.8.1  Messages with menu displays

The messages are listed alphabetically in this subsection.

BREAKPOINT PROCESSED

Option  Description                          Option  Description
C       Canned Test Menu                     O       Rerun test
D       Debug Menu                           PG      Programming Tool
G       Global Options Menu                  R       Previous menu
H       System Call History Tool             S       Packet Status Tool
MN      Main Menu                            T       Test Menu
N       Continue testing with next pattern   V       Variable Menu

TEST FAILED

Path 1   CH=20, CO=0, DV=dv, DN=6250
3-channel conflict tests were executing on pass 1 at label L4
Event # 1 was flagged in the diagnostic at label DL11002
Path 2 was in the Bus test at label L15004 variable=2
Path 3 was in the Tag-Loopback test at label L21001 variable=0
The error was on Path 1 in the LMAR Test at label L11002 variable=40
The diagnostic wrote 40 to the LMAR and read 44445

Option  Description                Option  Description
C       Canned Test Menu           N       Continue testing with next pattern
D       Debug Menu                 O       Rerun test
G       Global Options Menu        S       Packet Status Tool
H       System Call History Tool   T       Test Menu
F       Loop on failing pattern until next error or pass count is reached
X       Loop on failing pattern until abort (press the ESC-A keys)

5.3.8.2  Messages without menu displays

The messages are listed alphabetically in this subsection.

Invalid entry: n Range: n through n (radix)
Enter a valid value to continue or an asterisk (*) to abort.

     The value entered is invalid. Enter a valid value.

Test passed: test

     The test completed successfully.

6.  I/O SUBSYSTEM DEADSTART PROGRAMS

This section describes the following I/O Subsystem (IOS) deadstart programs:

Program    Description

cleario    IOS deadstart utility. The cleario utility attempts to clear the IOS if the deadstart procedure fails.
dsdiag     IOS deadstart diagnostic control program. The dsdiag program allows the system operator to run deadstart diagnostics from tape or disk.

6.1  SYSTEM CONFIGURATION

The file $APTEXT contains the system text, including the configuration information for the IOS deadstart programs. The following system components are defined during system configuration:

•  Optional I/O processors (IOP-2 and IOP-3)
•  IOS type (model A, B, C, or D)
•  High-speed channel connections to central memory and the SSD solid-state storage device
•  Low-speed channel connection from IOP-0 to the CPU
•  Console channels
•  Central memory size
•  Buffer memory size
•  SSD memory size

For information on the IOS installation parameters, refer to the I/O Subsystem (IOS) Administrator's Guide, CRI publication SG-0307.

6.2  cleario

If the IOS deadstart procedure fails, the system operator can execute cleario from tape or disk in an attempt to clear the IOS. For information on the IOS deadstart procedure, refer to one of the following CRI publications, as appropriate to your configuration:

SG-2005    I/O Subsystem (IOS) Operator's Guide for UNICOS
SN-3030    Operator Workstation (OWS) Guide

IOP-0 must be minimally operational to execute the tape, disk, or OWS bootstrap routine (TAPELOAD, DISKLOAD, or VMELOAD, respectively) and cleario.

6.2.1  PROGRAM EXECUTION

The cleario program does the following:

•  Disables all interrupts
•  Clears all of the IOS channels
•  Zeros the following:
      The exit stack, the operand registers, and local memory in each IOP
      Buffer memory
      The last 64 words of central memory

Use the following procedure to execute cleario:

1.  Mount the deadstart tape or disk at the operator's station.

2.  Set the IOS maintenance panel toggle switches, as follows:

    Tape/Disk Unit    Switch Setting
                      Octal    Binary
    Tape              22       010 010
    Ampex disk        60       110 000
    CDC disk          27       010 111

    NOTE
    If the IOS maintenance panel has a 'maintenance mode' switch, set the switch to the 'on' position.
    When cleario is completed (successfully or unsuccessfully), return the switch to the 'off' position.

3.  Press the IOP-0 MC button (or the MASTER CLEAR button on a CRAY-1 A computer system) and the DEADSTART button on the Power Distribution Unit or IOS chassis maintenance panel (as appropriate for your site).

4.  Respond to one of the following prompts (for tape or disk, respectively) at the IOP-0 Kernel console:

    FILE @MT0:
    or
    FILE @DK0:

    NOTE
    The FILE @MT0 prompt is not displayed unless a tape is mounted at the operator's station.

    In response to the tape prompt, enter the number of the tape file containing cleario and press RETURN. If a tape is written using standard Cray generation procedures, file 7 contains cleario.

    In response to the disk prompt, enter the name of the directory and file containing cleario (dir/cleario) and press RETURN.

5.  If cleario completes successfully, the following message is displayed at the IOP-0 Kernel console:

    CLEARIO COMPLETE

    The operating system bootstrap program is reloaded and one of the following prompts (for tape or disk, respectively) is displayed:

    FILE @MT0:
    or
    FILE @DK0:

    Proceed with the IOS deadstart procedure. For information on the IOS deadstart procedure, refer to one of the following CRI publications, as appropriate to your configuration:

    SG-2005    I/O Subsystem (IOS) Operator's Guide for UNICOS
    SN-3030    Operator Workstation (OWS) Guide

6.  If either of the following conditions occurs, run the IOS deadstart tests to determine if an IOS hardware malfunction exists:

    •  cleario does not complete successfully (the message 'CLEARIO TERMINATED' is displayed or there is no response within one minute).
    •  The IOS deadstart procedure continues to fail after cleario completes execution.
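The octal and binary columns of the toggle-switch table in step 2 express the same six-bit value. As a quick cross-check when setting the switches, the conversion can be sketched in Python as follows. The function name switch_positions is hypothetical, and the assumption that each octal digit corresponds to one group of three switches is inferred from the table layout rather than stated in this manual.

```python
def switch_positions(octal_setting):
    """Convert a two-digit octal switch setting to its six-bit binary form."""
    value = int(octal_setting, 8)      # parse the setting as base 8
    bits = format(value, "06b")        # six binary digits, zero padded
    return bits[:3] + " " + bits[3:]   # group as two sets of three


# Settings taken from the toggle-switch table in step 2.
for unit, setting in [("Tape", "22"), ("Ampex disk", "60"), ("CDC disk", "27")]:
    print(f"{unit:<12}{setting}   {switch_positions(setting)}")
```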
6.2.2  PROGRAM MESSAGES

The cleario program generates the following types of messages:

•  Informative
•  Error

6.2.2.1  Informative messages

The following informative messages are displayed at the IOP-0 Kernel console:

CLEARIO COMPLETE

     cleario completed successfully.

TAPE NOT READY

     This message is displayed until the tape is ready for use.

6.2.2.2  Error messages

The following error messages are displayed at the IOP-0 Kernel console. Unless otherwise indicated, use the IOS deadstart tests to do further error isolation.

CLEARIO TERMINATED

     An error in one of the IOPs prevented cleario from executing successfully. Check the error logger for errors and run the dsdiag program for more information on the failure.

BUFFER MEMORY TIMEOUT

     A Done flag is not set on the buffer memory channel. Check the error logger for errors and run the dsdiag program for more information on the failure.

BUFFER MEMORY ERROR

     A Busy flag is set on the buffer memory channel. Check the error logger for errors and run the dsdiag program for more information on the failure.

device ERROR, STATUS=status

     A device error occurred while the overlay was being loaded. device can be TAPE or DISK. status is the controller status for the deadstart device. Select a different device and deadstart the IOS. If no other device is available or the failure continues, use off-line diagnostics to isolate the error.

TAPE ERROR, STATUS=status AFTER REWIND

     A tape error occurred after the overlay was loaded. status is the controller status for the tape device. Use a disk device and deadstart the IOS. If a disk device is unavailable or the failure continues, use off-line diagnostics to isolate the error.

6.3  dsdiag

The dsdiag program is the deadstart diagnostic control program that allows the system operator to run deadstart tests from tape or disk. The dsdiag program does the following:

1.  Executes a series of basic IOP-0 tests
2.
Loads and executes subsequent IOS tests from a diagnostic overlay file

6.3.1  PROGRAM EXECUTION

Prior to loading the IOS Kernel, the system operator can run deadstart diagnostics from tape or disk by loading and executing the deadstart diagnostic control program, dsdiag.

IOP-0 must be minimally operational to execute the tape, disk, or OWS bootstrap routine (TAPELOAD, DISKLOAD, or VMELOAD, respectively) and dsdiag.

Use the following procedure to execute the IOS deadstart diagnostics:

1.  Mount the deadstart tape or disk at the operator's station.

2.  Set the IOS maintenance panel toggle switches, as follows:

    Tape/Disk Unit    Switch Setting
                      Octal    Binary
    Tape              22       010 010
    Ampex disk        60       110 000
    CDC disk          27       010 111

3.  Press the IOP-0 MC button (or the MASTER CLEAR button on a CRAY-1 A computer system) and the DEADSTART button on the Power Distribution Unit or IOS chassis maintenance panel (as appropriate for your site).

4.  Respond to one of the following prompts (for tape or disk, respectively) at the IOP-0 Kernel console:

    FILE @MT0:
    or
    FILE @DK0:

    NOTE
    The FILE @MT0 prompt is not displayed unless a tape is mounted at the operator's station.

    In response to the tape prompt, enter the number of the tape file containing dsdiag and press RETURN. If a tape is written using standard Cray generation procedures, file 8 contains dsdiag.

    In response to the disk prompt, enter the name of the directory and file containing dsdiag (dir/dsdiag) and press RETURN.

    Pass/fail status messages are displayed at the IOP-0 Kernel console during test execution.

5.  If the diagnostic tests complete successfully, the following message is displayed:

    DIAGNOSTICS COMPLETE

    The operating system bootstrap program is reloaded and one of the following prompts (for tape or disk, respectively) is redisplayed at the IOP-0 Kernel console:

    FILE @MT0:
    or
    FILE @DK0:

    Proceed with the IOS deadstart procedure.
    For information on the IOS deadstart procedure, refer to one of the following CRI publications, as appropriate to your configuration:

    SG-2005    I/O Subsystem (IOS) Operator's Guide for UNICOS
    SN-3030    Operator Workstation (OWS) Guide

6.  If a diagnostic test detects a failure, the message 'DIAGNOSTICS TERMINATED' is displayed at the IOP-0 Kernel console or there is no response within one minute. The system operator should report failures to a CRI field engineer.

6.3.1.1  IOP-0 tests

Although IOP-0 must be minimally operational to perform deadstart operations, it can still contain faults. Therefore, dsdiag tests IOP-0 before loading the deadstart tests from an overlay file. If the IOP-0 diagnostics do not execute successfully, use off-line diagnostics to do further testing.

The IOP-0 tests exercise the following areas, in the order shown:

1.  Instruction buffers
2.  Exit stack
3.  Operand registers
4.  Local memory
5.  Real-time clock

The test procedure is as follows:

Logic Tested           Test Procedure

Instruction buffers    Forces 1's and 0's through each buffer location to detect dropped and picked bits, and adder faults. If a failure is detected, the test does not issue an error message; instead, it loops at the point of failure. Use off-line diagnostics to do further testing.

                       Instruction buffer addressing is not tested. However, a fault in this area is likely to prevent dsdiag from loading. If no messages are displayed at the IOP-0 Kernel console within a few seconds of loading, a failure exists. You can scope the IOP-0 P register before using off-line diagnostics to do further testing.

Exit stack†            Checks for basic addressing and data faults in each stack location. Using I/O instructions for access, the test detects all single-stuck addressing and data faults, and all coupled-data bit faults. It also tests return jumps and exits at all stack depths.
Operand registers†     Checks for basic faults in all of the registers except 0 and 1, which are used to run the test algorithm. The test detects all single-stuck addressing and data faults, and all coupled-data bit faults.

Local memory           Tests the area of local memory between the end of dsdiag and the highest local memory address. The test uses an algorithm with a parcel-oriented, ascending and descending, marching 1's and 0's pattern to detect all single-stuck addressing and data faults, and all coupled-data bit faults.

Real-time clock        Tests the real-time clock to ensure that an interrupt occurs approximately once every millisecond.

When all of the IOP-0 tests complete successfully, the following message is displayed at the IOP-0 Kernel console (it is not required that the real-time clock test complete successfully):

IOP-0 KERNEL PASSED

The dsdiag program then loads and executes the deadstart tests contained in an overlay file.

If any one of the IOP-0 tests does not complete successfully (excluding the real-time clock test), dsdiag does not execute any subsequent diagnostics. An error message is displayed if a test fails (with the exception of the instruction buffer test, which loops at the point of failure instead of issuing an error message). The dsdiag program automatically attempts to reload the deadstart bootstrap program, TAPELOAD, DISKLOAD, or VMELOAD. If the attempt is unsuccessful, dsdiag halts and you can use off-line diagnostics to isolate the fault.

For a list of messages, refer to subsection 6.3.2, Program Messages.

†  The test uses a variant of the Milner fast memory test algorithm (EDN, 28, 21; Oct 13, 1983). The Milner algorithm detects dropped and picked bits in address data, and coupled-data bit faults. The algorithm uses a rotating single-bit pattern to ensure that only one bit is changed in each memory chip at each step.
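The ascending and descending marching 1's and 0's pattern used by the local memory test can be illustrated with a short Python sketch. This is a generic march test written for illustration; it is not the dsdiag code and it is not the Milner variant used by the exit stack and operand register tests. The names march_test and StuckBitMemory, the 16-bit parcel width, and the fault model (one bit of one parcel stuck at 0) are all assumptions.

```python
def march_test(memory, width=16):
    """Run a simple ascending/descending march over a memory object.

    Returns a list of (address, expected, actual) tuples, one per miscompare.
    """
    ones = (1 << width) - 1
    errors = []

    def check(addr, expected):
        actual = memory[addr]
        if actual != expected:
            errors.append((addr, expected, actual))

    size = len(memory)
    for addr in range(size):                 # write a background of 0's
        memory[addr] = 0
    for addr in range(size):                 # ascending: read 0's, write 1's
        check(addr, 0)
        memory[addr] = ones
    for addr in reversed(range(size)):       # descending: read 1's, write 0's
        check(addr, ones)
        memory[addr] = 0
    for addr in range(size):                 # final read of the 0's background
        check(addr, 0)
    return errors


class StuckBitMemory:
    """Hypothetical memory model in which one bit of one parcel is stuck at 0."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = [0] * size
        self.bad_addr = bad_addr
        self.mask = ~(1 << bad_bit)

    def __len__(self):
        return len(self.cells)

    def __getitem__(self, addr):
        return self.cells[addr]

    def __setitem__(self, addr, value):
        if addr == self.bad_addr:
            value &= self.mask               # the faulty bit never stores a 1
        self.cells[addr] = value


good = march_test([0] * 64)
bad = march_test(StuckBitMemory(64, bad_addr=5, bad_bit=3))
print(good)
print([(addr, oct(exp), oct(act)) for addr, exp, act in bad])
```

Reading each location before overwriting it, in both address directions, is what lets a march pattern expose addressing faults as well as data faults: a write that lands on the wrong location disturbs a value that a later read expects to be intact.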
6.3.1.2  I/O Subsystem tests

If all of the IOP-0 tests complete successfully (excluding the real-time clock test), dsdiag loads and executes subsequent IOS tests from a diagnostic overlay file. The tests are executed in the following order:

Test        Description

dsmos16k    Test of the lower 16 Kwords of buffer memory from IOP-0 only
dsiom       Local memory addressing and data test for each IOP except IOP-0
dsiop       Instruction test for each IOP
dsmos       Buffer memory addressing and data path test for each IOP
dshsp       High-speed channel test from an IOP to central memory or to an SSD solid-state storage device
dslsp       Low-speed channel test from IOP-0 to central memory

dsmos16k - This program tests addressing and data in the first 16384 words of buffer memory from IOP-0 only. This area of buffer memory is used to load an IOP. Therefore, dsmos16k must complete successfully before tests can be executed in IOP-1, IOP-2, or IOP-3.

The dsmos16k program consists of the following test sections:

1.  Address and data test
2.  Block length test

The dsmos16k test sections are as follows:

Section    Description

1          Address and data test. This section uses an algorithm with a word-oriented, ascending and descending, marching 1's and 0's pattern to test the lower 16 Kwords of buffer memory. The block length is 1.

2          Block length test. This section tests block length bits 1 through 13 (that is, block lengths 2^1 through 2^13).

If dsmos16k completes successfully, the following message is displayed:

MOS-16K PASSED

     The test completed successfully.

For a list of messages, refer to subsection 6.3.2, Program Messages.

dsiom - This program tests local memory addressing and data for each IOP except IOP-0. The test detects basic faults that would inhibit the proper loading of diagnostics into an IOP.

The dsiom program consists of the following test sections:

1.  All 0's test
2.  All 1's test
3.  Address pattern test
4.  All 0's test
The test uses deadstart and dead dump procedures to load and dump data patterns. In the IOP being tested, no code is executed except a jump to P + 0 at address 0. The jump is required to prevent the IOP from executing after a deadstart. (In each of the dsiom test sections, address 0 contains O'7000.)

The dsiom test sections are as follows:

Section    Description

1          All 0's test. The test data is all 0's. The background data is all 1's.

2          All 1's test. The test data is all 1's. The background data is all 0's.

3          Address pattern test. The test data for each parcel (except parcel 0) is the parcel address. The background data is all 0's.

4          All 0's test. This section is the same as section 1. Section 4 is run so that local memory is reset to all 0's at the end of the test.

Each section uses the upper half of IOP-0 and the lower 16 Kwords of buffer memory as data buffers.

If dsiom completes successfully, the following message is displayed at the IOP-0 Kernel console:

IOP-n IOM PASSED

     The test completed successfully in IOP-n.

For a list of messages, refer to subsection 6.3.2, Program Messages.

dsiop - This program tests instructions and registers in IOP-1, IOP-2, and IOP-3. Part of test section 1, basic instructions and registers test, executes in all of the IOPs, including IOP-0.

The dsiop program consists of the following test sections:

1.  Basic instructions and registers test
2.  Jump instructions test
3.  Operand registers test

The dsiop test sections are as follows:

Section    Description

1          Basic instructions and registers test. Testing starts with the simplest instructions and data paths and becomes increasingly complex. The following IOP components are tested:

           1.  Registers A, B, and C
           2.  Instructions in the range 4 through 67 (octal)
           3.  Add and shift networks
           4.  Operand registers 0 through 20, 40, 100, 200, 400, and 777 (octal)
           5.  Local memory addressing
           6.  I/O instructions on channels 0 through 5
           7.
E register and exit stack location 0
           8.  Interprocessor channels to IOP-0

           In IOP-0, only areas 1, 2, and 3 are tested; testing in the other areas would conflict with resident code. IOP-0 must be minimally operational to execute dsdiag. Therefore, this test is run in IOP-0 only to ensure that the basic instructions and the add/shift network are tested completely.

           There are no jumps in this test except a jump to P + 0, which is executed when a fault is detected, causing the test to loop at the point of failure.

2          Jump instructions test. This section is not run in IOP-0. The following areas of the IOP are tested:

           1.  Jump instructions 070 through 137
           2.  Exit instruction 001
           3.  Operand registers 0 and 1
           4.  Exit stack data and addressing

3          Operand registers test. This section is not run in IOP-0. This test section contains two subsections, as follows:

           Subsection         Description

           Systematic data    Performs a comprehensive test of operand register addressing and data.† The test detects all single-stuck faults in addressing or data, and all coupled data-bit faults.

           Random data        Uses random data patterns to test registers 20 through 777 (octal). The test detects pattern-sensitive faults, which normally cannot be detected by systematic data. New data patterns are used each time the test is run.
If test section 1 (basic instructions and registers test) completes successfully, the following message is displayed at the IOP-0 Kernel console:

IOP-n BASIC PASSED

If test section 2 (jump instructions test) completes successfully, the exit stack is reset to all 0's and the following message is displayed at the Kernel consoles of IOP-0 and the IOP being tested:

IOP-n JUMPS PASSED

If test section 3 (operand registers test) completes successfully, the operand registers are reset to all 0's and the following message is displayed at the Kernel consoles of IOP-0 and the IOP being tested:

IOP-n OPREG PASSED

The dsiop program is run in all of the IOPs, regardless of whether a fault is detected in any single IOP. However, if a fault is detected in any of the IOPs, subsequent diagnostics cannot be executed until the fault is corrected. Use off-line diagnostics to isolate the failure.

For a list of messages, refer to subsection 6.3.2, Program Messages.

†  The test uses a variant of the Milner fast memory test algorithm (EDN, 28, 21; Oct 13, 1983).

dsmos - This program tests the address and data paths from each IOP to buffer memory. It does not test the buffer memory data chips.

The dsmos program consists of the following test sections:

1.  Data path test
2.  Local memory addressing test
3.  Buffer memory addressing test

The dsmos test sections are as follows:

Section    Description

1          Data path test. This section tests for dropped or picked data bits by transferring a single word between address 0 of local memory and address 0 of buffer memory. Dropped address bits do not affect this test.

2          Local memory addressing test. This section transfers data between address 0 of buffer memory and selected local memory addresses, using an algorithm with an ascending and descending, marching 1's and 0's pattern. The block length is always 1.
           The following local memory addresses (in octal) are used for test data: 0, 100000, 100000 + 2^n (includes all values for which n is an integer in the range 2 through 14), and 177774.

3          Buffer memory addressing test. This section transfers data between local memory and selected buffer memory addresses. The block length is always 1. The test algorithm is identical to that used in section 2 (local memory addressing) except that the local memory address is fixed and the buffer memory address varies. The following buffer memory word addresses are used for test data: 0, 2^n (includes all values for which n is an integer value in the interval [0, log2(MOS@SIZ)]).

If dsmos completes successfully, the following message is displayed at the Kernel consoles of IOP-0 and the IOP being tested:

IOP-n MOS PASSED

     The test completed successfully in IOP-n.

The dsmos program is run in all of the IOPs, regardless of whether a fault is detected in any single IOP. However, if a fault is detected in any of the IOPs, subsequent diagnostics cannot be executed until the fault is corrected. Use off-line diagnostics to isolate the failure.

For a list of messages, refer to subsection 6.3.2, Program Messages.

dshsp - This program is a high-speed channel test from IOP-n to central memory or to an SSD solid-state storage device. Although it does not test memory, dshsp uses part of central memory or SSD memory to test the channel. The contents in the portion of memory used for testing are saved at the start of test execution and are restored only if the test completes successfully.

The dshsp program consists of the following test sections:

1.  Buffer addressing and data test
2.  Local memory addressing test
3.  Central memory or SSD addressing test

The dshsp test sections are as follows:

Section    Description

1          Buffer addressing and data test. This section detects all single-stuck faults and coupled-data bit faults in the high-speed channel data buffers.
           The test writes to and reads from a block of memory beginning at absolute address 0 in either central memory or an SSD. For central memory, the block length is fixed at 32 words (the size of the data buffers). For an SSD, the block length is fixed at 64 words (minimum block size).

           This test section uses an algorithm† to move a block of sliding 1's and 0's through memory in an ascending and descending pattern. The block is addressed in ascending order due to hardware constraints.

2          Local memory addressing test. This test uses an algorithm with an ascending and descending marching 1's and 0's pattern. The transfer length is always one word for central memory and 64 words for an SSD. The central memory or SSD address is always 0.

           The following local memory addresses are tested if the test is from IOP-n to central memory: 77774, 100000, 100000 + 2^n (includes all values for which n is an integer in the range 2 through 14), and 177774.

           The following local memory addresses are tested if the test is from IOP-n to an SSD: 77400, 100000, 100000 + 2^n (includes all values for which n is an integer in the range 8 through 14), and 177400.

†  The test uses a variant of the Milner fast memory test algorithm (EDN, 28, 21; Oct 13, 1983).

3          Central memory or SSD addressing test. This section uses an algorithm with an ascending and descending marching 1's and 0's pattern. The transfer length is always one word for central memory and 64 words for an SSD. The local memory address is arbitrary because it is assumed that section 2 (local memory addressing test) passed successfully.

           The following central memory addresses are tested if the test is from IOP-n to central memory: 0, 2^n (includes all values for which n is an integer in the interval [0, log2(central memory size)-1]).
           The following SSD addresses are tested if the test is from IOP-n to an SSD: 0, 2^n (includes all values for which n is an integer in the interval [0, log2(SSD size)-1]).

If dshsp completes successfully, the following message is displayed at the Kernel consoles of IOP-0 and the IOP being tested:

IOP-n HSP CH=ch/ch PASSED

     The test completed successfully in the high-speed channel pair ch/ch in IOP-n. The contents of central memory or the SSD are restored.

The dshsp program is run in all of the IOPs for which a high-speed channel is defined in $APTEXT, regardless of whether a fault is detected in any single IOP. However, if a fault is detected in any of the IOPs, subsequent diagnostics cannot be executed until the fault is corrected. Use off-line diagnostics to isolate the failure.

For a list of messages, refer to subsection 6.3.2, Program Messages.

dslsp - This program tests the low-speed deadstart channel from IOP-0 to the Cray mainframe.

The dslsp program consists of the following test sections:

1.  Deadstart data test
2.  Central memory addressing test

The dslsp test sections are as follows:

Section    Description

1          Deadstart data test. This section uses an algorithm with a marching 1's and 0's pattern to test the lower 64 words of central memory. Each data transfer begins at address 0 of central memory for a dead load or a dead dump.

2          Central memory addressing test. This section uses a CPU driver for the CPU end of the low-speed channel to test all address bits. The CPU driver occupies the first 64 words of central memory. The driver manages the channel protocol; it does not check for errors. All transfers are one word in length. The test uses the following central memory addresses: 2^n (includes all values for which n is an integer value in the interval [5, log2(CM@SIZE/2)]). The first five address bits are tested in section 1, deadstart data test.
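The addressing tests above all select addresses of the form 2^n so that each address bit is exercised on its own. The following Python sketch lists the addresses that the dslsp central memory addressing test would use for a given memory size, following the interval [5, log2(CM@SIZE/2)] stated in section 2. The function names, the 1-Mword example size, and the assumption that the memory size is a power of two are illustrative and are not taken from the dsdiag source.

```python
def power_of_two_addresses(first_bit, last_bit):
    """Return the addresses 2^n for n from first_bit through last_bit."""
    return [1 << n for n in range(first_bit, last_bit + 1)]


def lsp_test_addresses(cm_size_words):
    """Addresses 2^n for n in [5, log2(cm_size_words / 2)].

    Assumes cm_size_words is a power of two, so that the base-2 logarithm
    of half the memory size is an exact integer.
    """
    last_bit = (cm_size_words // 2).bit_length() - 1
    return power_of_two_addresses(5, last_bit)


# Hypothetical configuration: 1 Mword (2^20 words) of central memory.
addresses = lsp_test_addresses(1 << 20)
for address in addresses:
    print(f"{address:o}")   # octal, the radix used elsewhere in this section
```

Because each test address has exactly one bit set, a stuck or shorted address line shows up as a transfer that lands on address 0 or on another power-of-two address.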
If dslsp completes successfully, the following message is displayed at the IOP-0 Kernel console:

IOP-0 LSP CH=ch/ch PASSED

     The test completed successfully in the low-speed channel pair ch/ch in IOP-0. The contents of central memory are restored.

If a fault is detected, subsequent diagnostics cannot be executed until the fault is corrected. Use off-line diagnostics to isolate the failure.

For a list of messages, refer to subsection 6.3.2, Program Messages.

6.3.2  PROGRAM MESSAGES

The dsdiag program generates the following types of messages:

•  Informative
•  Error

6.3.2.1  Informative messages

The following informative messages are displayed at the IOP-0 Kernel console unless otherwise indicated.

DIAGNOSTICS COMPLETE

     The dsdiag program completed successfully.

test PASSED

     test completed successfully. This message is displayed at the Kernel consoles of IOP-0 and the IOP being tested.

TAPE NOT READY

     This message is displayed until the tape is ready for use.

6.3.2.2  Error messages

This subsection lists the dsdiag error messages, which are grouped as follows:

•  Messages applicable to all tests
•  IOP-0 messages
•  dsmos16k messages
•  dsiom messages
•  dsiop messages
•  dsmos messages
•  dshsp messages
•  dslsp messages

Messages applicable to all tests - The following error messages are displayed at the IOP-0 Kernel console. Use off-line diagnostics to do further error isolation.

DIAGNOSTICS TERMINATED

     An error in one of the tests prevented dsdiag from executing successfully. An error message from the failing test is displayed at one or more of the Kernel consoles. Use off-line diagnostics to do further error isolation.

device ERROR, STATUS:status

     A device error occurred while the overlay was being loaded. device can be TAPE or DISK. status is the controller status for the deadstart device. Select a different device and deadstart the IOS. If no other device is available or the failure continues, use off-line diagnostics to isolate the error.
TAPE ERROR, STATUS:status AFTER REWIND
    A tape error occurred after the overlay was loaded. status is the controller status for the tape device. Use a disk device and deadstart the IOS. If a disk device is unavailable or the failure continues, use off-line diagnostics to isolate the error.

OVERLAY HEADER ERROR
    The dsdiag program detected an error in the overlay header. Select a different device and deadstart the IOS. If no other device is available or the failure continues, use off-line diagnostics to isolate the error.

ATTEMPTED TO READ PAST ADDRESS 77777
    The dsdiag program attempted to read beyond address 77777 in the overlay. Select a different device and deadstart the IOS. If no other device is available or the failure continues, use off-line diagnostics to isolate the error.

END-OF-FILE ENCOUNTERED
    While reading the overlay, dsdiag detected an unexpected end-of-file. Select a different device and deadstart the IOS. If no other device is available or the failure continues, use off-line diagnostics to isolate the error.

INVALID OVERLAY DIRECTORY
    While reading the overlay, dsdiag detected an invalid overlay directory. Select a different device and deadstart the IOS. If no other device is available or the failure continues, use off-line diagnostics to isolate the error.

NO OVERLAY FILE FOUND
    The dsdiag program did not find an overlay file. Select a different device and deadstart the IOS. If no other device is available or the failure continues, use off-line diagnostics to isolate the error.

IOP-0 messages - The following error messages are displayed at the IOP-0 Kernel console. Use off-line diagnostics to do further error isolation.

IOP-0 FAILED EXIT STACK
    The test terminated after detecting a fault in the IOP-0 exit stack. The bootstrap program is not reloaded. An IOS deadstart is required.

IOP-0 FAILED OPERAND REGISTER
    The test terminated after detecting a fault in an IOP-0 operand register.
IOP-0 FAILED MEMORY, P=address, LMA=lma
EXP=exp ACT=act
    The test terminated after detecting a data compare error in IOP-0 local memory. The following information is displayed:

    P=address    Parcel address relative to the start of the test module in which the fault was detected
    LMA=lma      Absolute parcel address in IOP-0 local memory
    EXP=exp      Expected data
    ACT=act      Actual data

IOP-0 FAILED REAL-TIME CLOCK
    The test detected a fault in the real-time clock. Although the test continues, subsequent tests can fail as a result of an inaccurate clock. A clock failure can occur if the IOP model is not defined correctly when the deadstart tests are generated. Check the I@IOPMOD installation parameter and regenerate. If the failure continues, use off-line diagnostics to isolate the fault. For a brief description of the IOS installation parameters, refer to the I/O Subsystem (IOS) Administrator's Guide, CRI publication SG-0307.

dsmos16t messages - The following error messages are displayed at the IOP-0 Kernel console. Use off-line diagnostics to do further error isolation.

MOS-16K FAILED, P=address, BMA=bma
    The test detected a hardware failure in buffer memory. The following information is displayed:

    P=address    Parcel address relative to the start of dsmos16t in IOP-0
    BMA=bma      Absolute word address in buffer memory

MOS-16K FAILED, P=address, BMA=bma
EXP=exp ACT=act
    The test detected a data compare error in buffer memory. The following information is displayed:

    P=address    Parcel address relative to the start of dsmos16t in IOP-0
    BMA=bma      Absolute word address in buffer memory
    EXP=exp      Expected data
    ACT=act      Actual data

dsiom messages - The following error messages are displayed at the IOP-0 Kernel console. Use off-line diagnostics to do further error isolation.

IOP-n IOM FAILED, P=address, LMA=lma
    The test detected a hardware failure in IOP-n local memory.
    The following information is displayed:

    P=address    Parcel address relative to the start of dsiom in IOP-0
    LMA=lma      Absolute parcel address in IOP-n local memory

IOP-n IOM FAILED, P=address, LMA=lma
EXP=exp ACT=act
    The test detected a data compare error in IOP-n local memory. The following information is displayed:

    P=address    Parcel address relative to the start of dsiom in IOP-0
    LMA=lma      Absolute parcel address in IOP-n local memory
    EXP=exp      Expected data
    ACT=act      Actual data

dsiop messages - The following error messages are displayed at the IOP-0 Kernel console unless otherwise indicated. Use off-line diagnostics to do further error isolation.

IOP-n section FAILED, NO RESPONSE
    An input-channel-done signal was not received from IOP-n within the required time limit. section is one of the following test sections: BASIC, JUMPS, or OPREG.

    This message precedes the following message (described in this subsection):

    PRESS ANY KEY TO CONTINUE WITH REGISTER DUMP

IOP-n section FAILED, P=address, CH=ipc
    The test detected a time-out or a protocol error in ipc, the interprocessor channel from IOP-0 to IOP-n. section is one of the following test sections: BASIC, JUMPS, or OPREG. The following information is displayed:

    P=address    Parcel address relative to the start of dsiop in IOP-0
    CH=ipc       Interprocessor channel number associated with IOP-0

    This message precedes the following message (described in this subsection):

    PRESS ANY KEY TO CONTINUE WITH REGISTER DUMP

IOP-n section FAILED, P=address, MOS ERROR, BMA=bma
    The test detected a failure in a data transfer between local memory in one of the configured IOPs and buffer memory. section is one of the following test sections: BASIC, JUMPS, or OPREG. The following information is displayed:

    P=address    Parcel address relative to the start of dsiop in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    BMA=bma      Absolute word address in buffer memory

IOP-n BASIC FAILED, P=address, CH=ipc
EXP=exp, ACT=act
    The BASIC test section detected a data compare error in ipc, the interprocessor channel from IOP-0 to IOP-n. The following information is displayed:

    P=address    Parcel address relative to the start of dsiop in IOP-0
    CH=ipc       Interprocessor channel number associated with IOP-0
    EXP=exp      Expected data
    ACT=act      Actual data

    This message precedes the following message (described in this subsection):

    PRESS ANY KEY TO CONTINUE WITH REGISTER DUMP

IOP-n JUMPS FAILED, CODE=code
    The JUMPS test section detected a jump instruction error in IOP-n. code is the error code returned from the accumulator of the IOP being tested.

    This message precedes the following message (described in this subsection):

    PRESS ANY KEY TO CONTINUE WITH REGISTER DUMP

IOP-n OPREG FAILED, P=address, B=register
EXP=exp, ACT=act
    The OPREG test section detected a data compare error in the operand register in IOP-n. The following information is displayed:

    P=address     Parcel address relative to the start of dsiop in IOP-0
    B=register    B register in which the error was detected
    EXP=exp       Expected data
    ACT=act       Actual data

    The message is displayed at the Kernel consoles of IOP-0 and the IOP being tested. This message precedes the following message (described in this subsection), which is displayed at the IOP-0 Kernel console only:

    PRESS ANY KEY TO CONTINUE WITH REGISTER DUMP

PRESS ANY KEY TO CONTINUE WITH REGISTER DUMP
    The dsiop program detected an error and issued the error message that preceded this message. If you press any key, dsiop dumps the IOP being tested to the IOP-0 Kernel console. The following information is displayed:

    A=a, C=c, B=b, (B)=r, E=e, (E)=s1, (E-1)=s2, (E-2)=s3

    A=a          Accumulator of the IOP being tested
    C=c          Carry flag
    B=b          B register
    (B)=r        B register contents
    E=e          Exit stack pointer
    (E)=s1
    (E-1)=s2
    (E-2)=s3     Contents of the top three exit stack locations.
                 One of the stack locations normally represents the address at which a fault was detected in the IOP being tested.

    Examine the dump values to isolate the fault. Depending on the fault, some or all of the dump values can be unreliable. Therefore, check the values for consistency. Prior to taking the dump (by pressing any key), a field engineer can scope the P register of the IOP being tested to ensure reliable values. Use off-line diagnostics to isolate the fault.

dsmos messages - The following error messages are displayed at the IOP-0 Kernel console unless otherwise indicated. Use off-line diagnostics to do further error isolation.

IOP-n MOS FAILED, P=address
    The test detected a failure in the path between IOP-n and buffer memory. The following information is displayed:

    P=address    Parcel address relative to the start of dsmos in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.

IOP-n MOS FAILED, P=address, NO RESPONSE
    IOP-0 did not receive a response from IOP-n following the buffer memory test. The following information is displayed:

    P=address    Parcel address relative to the start of dsmos in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.

IOP-n MOS FAILED, P=address, MOS ERROR
    The test detected a failure in the path between IOP-n and buffer memory. The following information is displayed:

    P=address    Parcel address relative to the start of dsmos in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.

    This message is displayed at the Kernel consoles of IOP-0 and the IOP being tested.

IOP-n MOS FAILED, P=address
LMA=lma, BMA=bma
EXP=exp ACT=act
    The test detected a data compare error in the path between IOP-n and buffer memory.
    The following information is displayed:

    P=address    Parcel address relative to the start of dsmos in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    LMA=lma      Absolute parcel address in local memory
    BMA=bma      Absolute word address in buffer memory
    EXP=exp      Expected data
    ACT=act      Actual data

    This message is displayed at the Kernel consoles of IOP-0 and the IOP being tested.

dshsp messages - The following error messages are displayed at the IOP-0 Kernel console unless otherwise indicated. Check the error logger for double-bit errors. Use off-line diagnostics to do further isolation.

IOP-0 HSP CH=ch|ch FAILED, P=address, MOS ERROR
    IOP-0 tried to write the diagnostic overlay to MOS. Upon completion, both the Busy and Done flags were found to be set. The probable error is in the channel from IOP-0 to MOS memory. Run off-line diagnostics to further isolate the problem.

    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.

    The contents of CM or SSD remain unchanged. This message is displayed on the IOP-0 console.

IOP-n HSP CH=ch|ch FAILED, P=address, NO RESPONSE
    IOP-0 sent an overlay package to MOS, deadstarted IOP-n, and waited for a response. The Done flag was never set (indicating that IOP-n did not respond by sending a return code). The probable error is in the deadstarting of IOP-n, in the ability of IOP-n to read from MOS, or in test code that was corrupted (due to a hardware memory problem). Check for further test messages or run off-line diagnostics.
    IOP-n        The IOP that would not deadstart
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.

    The contents of CM or SSD remain unchanged. This message is displayed on the IOP-0 console.

IOP-n HSP CH=ch|ch FAILED, P=address, BAD RETURN STATUS, S=address
    IOP-0 sent a test to IOP-n. IOP-n executed the tests and returned a bad status. This indicates that the test found an error in IOP-n. Check the IOP-n console for further messages.

    IOP-n        The IOP that sent the message to IOP-0
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    S=address    The address of the problem in IOP-n is returned. The address is relative to the start of the overlay sent to IOP-n.

    It is unknown whether the contents of CM or SSD have been corrupted. This message is displayed on the IOP-0 console.

IOP-n HSP CH=ch|ch PASSED
    IOP-0 sent a test to IOP-n. IOP-n executed the tests and returned a zero status indicating that no errors were discovered.

    IOP-n        The IOP that sent the message to IOP-0
    CH=ch|ch     High-speed channel pair

    The contents of CM or SSD were restored to their original state. This message is displayed on the IOP-0 console.

The following messages are displayed on the IOP-n console.

IOP-n HSP CH=ch|ch FAILED, P=address, NO CONFIGURED MEMORY SIZE
    IOP-n found a high-speed channel configured, but the configured memory size for CM or SSD attached to that channel is zero. This is not a hardware error. Correct the channel and memory size configured in $APTEXT or $IOSDEF. The test in IOP-n for this channel was bypassed.
    IOP-n        The IOP being tested
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.

    The contents of CM or SSD remain unchanged. This message is displayed on the IOP-n console.

IOP-n HSP CH=ch|ch FAILED, P=address, CH=ch, routine, TIMEOUT SAVEMEM
    IOP-n tried to read from CM or SSD to save the contents of the memory to be tested before beginning the test. After the read was started, the program waited for the Done flag to be set. The Done flag was never set, so the program timed out. The probable error is in the channel from IOP-n to CM or SSD memory. Run off-line diagnostics to further isolate the problem.

    IOP-n        The IOP being tested
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    CH=ch        Channel on which the error was detected
    routine      The test routine executing in IOP-n when the error was encountered. The test routines in order are HSPBUFF, HSPLMCM, HSPLMSSD, HSPCMA, and HSPSSDA. HSPBUFF is the first routine to use the HSP channel.

    The contents of CM or SSD remain unchanged. This message is displayed on the IOP-n console.

IOP-n HSP CH=ch|ch FAILED, P=address, CH=ch, routine, BZ & DN SAVEMEM
LMA=address, CMA or SSDA=address
EXP=exp ACT=act
    IOP-n tried to read from CM or SSD to save the contents of the memory to be tested before beginning the test. Upon completion of the read (when the Done flag was set), both the Busy and Done flags were found to be set. The probable error is in the channel from IOP-n to CM or SSD memory. Check the error logger for double-bit errors. Run off-line diagnostics to further isolate the problem.
    This error can also occur if the test tries to read or write past the end of CM or SSD. Check the configured memory size of CM or SSD in $APTEXT.

    IOP-n        The IOP being tested
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    CH=ch        Channel on which the error was detected
    routine      The test routine executing in IOP-n when the error was encountered. The test routines in order are HSPBUFF, HSPLMCM, HSPLMSSD, HSPCMA, and HSPSSDA. HSPBUFF is the first routine to use the HSP channel.
    LMA=address  Absolute parcel address in local memory of the data
    CMA or SSDA=address
                 Absolute word address in central memory or SSD of the data
    EXP=exp      Expected data
    ACT=act      Actual data

    The contents of CM or SSD remain unchanged. This message is displayed on the IOP-n console.

IOP-n HSP CH=ch|ch FAILED, P=address, CH=ch, routine, TIMEOUT
    IOP-n tried to read a test pattern from, or write a test pattern to, CM or SSD. Check the channel number to determine whether the error was on a read or a write. After the read or write was started, the program waited for the Done flag to be set. The Done flag was never set, so the program timed out. The probable error is in the channel CH=ch from IOP-n to CM or SSD memory. Run off-line diagnostics to further isolate the problem.

    IOP-n        The IOP being tested
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    CH=ch        Channel on which the error was detected
    routine      The test routine executing in IOP-n when the error was encountered. The test routines in order are HSPBUFF, HSPLMCM, HSPLMSSD, HSPCMA, and HSPSSDA.

    The contents of CM or SSD may have been corrupted. This message is displayed on the IOP-n console.

IOP-n HSP CH=ch|ch FAILED, P=address, CH=ch, routine, ERROR FLAG
    IOP-n tried to write a test pattern to CM or SSD. Upon completion of the write (when the Done flag was set), both the Busy and Done flags were found to be set. The probable error is in the channel CH=ch from IOP-n to CM or SSD memory. Check the error logger for double-bit errors. Run off-line diagnostics to further isolate the problem.

    IOP-n        The IOP being tested
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    CH=ch        Channel on which the error was detected
    routine      The test routine executing in IOP-n when the error was encountered. The test routines in order are HSPBUFF, HSPLMCM, HSPLMSSD, HSPCMA, and HSPSSDA.

    The contents of CM or SSD may have been corrupted. This message is displayed on the IOP-n console.

IOP-n HSP CH=ch|ch FAILED, P=address, CH=ch, routine, ERROR FLAG
LMA=address, CMA or SSDA=address
EXP=exp ACT=act
    IOP-n tried to read a test pattern from CM or SSD. Upon completion of the read (when the Done flag was set), both the Busy and Done flags were found to be set. The probable error is in the channel CH=ch from IOP-n to CM or SSD memory. Check the error logger for double-bit errors. Run off-line diagnostics to further isolate the problem.

    IOP-n        The IOP being tested
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    CH=ch        Channel on which the error was detected
    routine      The test routine executing in IOP-n when the error was encountered. The test routines in order are HSPBUFF, HSPLMCM, HSPLMSSD, HSPCMA, and HSPSSDA.
    LMA=address  Absolute parcel address in local memory of the data
    CMA or SSDA=address
                 Absolute word address in central memory or SSD of the data
    EXP=exp      Expected data
    ACT=act      Actual data

    The contents of CM or SSD may have been corrupted. This message is displayed on the IOP-n console.

IOP-n HSP CH=ch|ch FAILED, P=address, routine, CH=ch, DATA COMPARE
LMA=address, CMA or SSDA=address
EXP=exp ACT=act
    IOP-n wrote a test pattern to CM or SSD and then read it back. The data read from memory (ACT) did not match the original data (EXP) written to memory. The probable error is in the channel from IOP-n to CM or SSD memory. Run off-line diagnostics to further isolate the problem.

    IOP-n        The IOP being tested
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    CH=ch        Channel on which the error was detected
    routine      The test routine executing in IOP-n when the error was encountered. The test routines in order are HSPBUFF, HSPLMCM, HSPLMSSD, HSPCMA, and HSPSSDA.
    LMA=address  Absolute parcel address in local memory of the data
    CMA or SSDA=address
                 Absolute word address in central memory or SSD of the data
    EXP=exp      Expected data
    ACT=act      Actual data

    The contents of CM or SSD may have been corrupted. This message is displayed on the IOP-n console.

IOP-n HSP CH=ch|ch FAILED, P=address, CH=ch, routine, TIMEOUT RESTMEM
    After testing, IOP-n tried to write to CM or SSD to restore the original contents of memory. After the write was started, the program waited for the Done flag to be set. The Done flag was never set, so the program timed out. The probable error is in the channel from IOP-n to CM or SSD memory. Run off-line diagnostics to further isolate the problem.
    IOP-n        The IOP being tested
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    CH=ch        Channel on which the error was detected
    routine      The test routine executing in IOP-n when the error was encountered. The test routines in order are HSPBUFF, HSPLMCM, HSPLMSSD, HSPCMA, and HSPSSDA.

    The contents of CM or SSD may have been corrupted. This message is displayed on the IOP-n console.

IOP-n HSP CH=ch|ch FAILED, P=address, CH=ch, routine, BZ & DN RESTMEM
    After testing, IOP-n tried to write to CM or SSD to restore the original contents of memory. Upon completion of the write (when the Done flag was set), both the Busy and Done flags were found to be set. The probable error is in the channel from IOP-n to CM or SSD memory. Check the error logger for double-bit errors. Run off-line diagnostics to further isolate the problem.

    IOP-n        The IOP being tested
    CH=ch|ch     High-speed channel pair
    P=address    Parcel address relative to the start of dshsp in IOP-0; or, if IOP-0 is being tested, the parcel address relative to the start of the test module in which the fault was detected.
    CH=ch        Channel on which the error was detected
    routine      The test routine executing in IOP-n when the error was encountered. The test routines in order are HSPBUFF, HSPLMCM, HSPLMSSD, HSPCMA, and HSPSSDA.

    The contents of CM or SSD may have been corrupted. This message is displayed on the IOP-n console.

dslsp messages - The error messages are displayed at the IOP-0 Kernel console. Use off-line diagnostics to do further error isolation.
In this subsection, the messages are grouped as follows:

• Time-out messages
• Channel interface status flag messages
• Data compare error messages
• Overlay messages

For information on the channel interface status flags (FLAGS=flags), refer to the following CRI publications, as appropriate:

    HR-0030    I/O Subsystem Model B Hardware Reference Manual
    HR-0081    I/O Subsystem Model C/D Hardware Reference Manual

The time-out messages follow.

IOP-n LSP CH=ch|ch FAILED, P=address, LMA=lma, CH=ch, TIMEOUT
LSPCPUA, READ FROM CM
    While attempting to read one word from central memory addresses in multiples of 10, starting at address 100 and continuing to the end of central memory, the program detected a time-out in the low-speed channel pair ch|ch in IOP-n. Central memory may have been corrupted. The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    LMA=lma         Absolute parcel address in local memory
    CH=ch           Low-speed channel pair
    LSPCPUA         Read one word from central memory addresses in multiples of 10, starting at address 100 and continuing to the end of central memory
    READ FROM CM    Read from central memory

IOP-n LSP CH=ch|ch FAILED, P=address, LMA=lma, CH=ch, TIMEOUT
LSPCPUA, WRITE TO CM
    While attempting to write one word to central memory addresses in multiples of 10, starting at address 100 and continuing to the end of central memory, the program detected a time-out in the low-speed channel pair ch|ch in IOP-n. Central memory may have been corrupted.
    The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    LMA=lma         Absolute parcel address in local memory
    CH=ch           Low-speed channel pair
    LSPCPUA         Write one word to central memory addresses in multiples of 10, starting at address 100 and continuing to the end of central memory
    WRITE TO CM     Write to central memory

IOP-n LSP CH=ch|ch FAILED, P=address, LMA=lma, CH=ch, TIMEOUT
LSPDSDD, READ FROM CM
    While attempting to read blocks of various lengths from central memory address 0, the program detected a time-out in the low-speed channel pair ch|ch in IOP-n. Central memory may have been corrupted. The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    LMA=lma         Absolute parcel address in local memory
    CH=ch           Low-speed channel pair
    LSPDSDD         Read blocks of various lengths from central memory address 0
    READ FROM CM    Read from central memory

IOP-n LSP CH=ch|ch FAILED, P=address, LMA=lma, CH=ch, TIMEOUT
LSPDSDD, WRITE TO CM
    While attempting to write blocks of various lengths to central memory address 0, the program detected a time-out in the low-speed channel pair ch|ch in IOP-n. Central memory may have been corrupted. The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    LMA=lma         Absolute parcel address in local memory
    CH=ch           Channel on which the error was detected
    LSPDSDD         Write blocks of various lengths to central memory address 0
    WRITE TO CM     Write to central memory

IOP-n LSP CH=ch|ch FAILED, P=address, LMA=lma, CH=ch, TIMEOUT
RESTMEM, WRITE TO CM
    While attempting to restore the central memory locations used in the test, the program detected a time-out in the low-speed channel pair ch|ch in IOP-n.
    Central memory may have been corrupted. The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    LMA=lma         Absolute parcel address in local memory
    CH=ch           Channel on which the error was detected
    RESTMEM         Final write to central memory
    WRITE TO CM     Write to central memory

IOP-n LSP CH=ch|ch FAILED, P=address, LMA=lma, CH=ch, TIMEOUT
SAVEMEM, READ FROM CM
    While attempting to save the central memory locations used in the test, the program detected a time-out in the low-speed channel pair ch|ch in IOP-n. Central memory is not corrupted. The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    LMA=lma         Absolute parcel address in local memory
    CH=ch           Low-speed channel pair
    SAVEMEM         Initial read from central memory
    READ FROM CM    Read from central memory

The status flag messages follow.

IOP-n LSP CH=ch|ch FAILED, P=address, FLAGS=flags, CH=ch
LSPCPUA, READ FROM CM
    While attempting to read one word from central memory addresses in multiples of 10, starting at address 100 and continuing to the end of central memory, the program detected a hardware error in the low-speed channel pair ch|ch in IOP-0. Central memory may have been corrupted.
    The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    FLAGS=flags     An octal value representing one or more channel interface status flags
    CH=ch           Channel on which the error was detected
    LSPCPUA         Read one word from central memory addresses in multiples of 10, starting at address 100 and continuing to the end of central memory
    READ FROM CM    Read from central memory

IOP-n LSP CH=ch|ch FAILED, P=address, FLAGS=flags, CH=ch
LSPCPUA, WRITE TO CM
    While attempting to write one word to central memory addresses in multiples of 10, starting at address 100 and continuing to the end of central memory, the program detected a hardware error in the low-speed channel pair ch|ch in IOP-0. Central memory may have been corrupted. The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    FLAGS=flags     An octal value representing one or more channel interface status flags
    CH=ch           Channel on which the error was detected
    LSPCPUA         Write one word to central memory addresses in multiples of 10, starting at address 100 and continuing to the end of central memory
    WRITE TO CM     Write to central memory

IOP-n LSP CH=ch|ch FAILED, P=address, FLAGS=flags, CH=ch
LSPDSDD, READ FROM CM
    While attempting to read blocks of various lengths from central memory address 0, the program detected a hardware error in the low-speed channel pair ch|ch in IOP-0. Central memory may have been corrupted.
    The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    FLAGS=flags     An octal value representing one or more channel interface status flags
    CH=ch           Channel on which the error was detected
    LSPDSDD         Read blocks of various lengths from central memory address 0
    READ FROM CM    Read from central memory

IOP-n LSP CH=ch|ch FAILED, P=address, FLAGS=flags, CH=ch
LSPDSDD, WRITE TO CM
    While attempting to write blocks of various lengths to central memory address 0, the program detected a hardware error in the low-speed channel pair ch|ch in IOP-0. Central memory may have been corrupted. The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    FLAGS=flags     An octal value representing one or more channel interface status flags
    CH=ch           Channel on which the error was detected
    LSPDSDD         Write blocks of various lengths to central memory address 0
    WRITE TO CM     Write to central memory

IOP-n LSP CH=ch|ch FAILED, P=address, FLAGS=flags, CH=ch
RESTMEM, WRITE TO CM
    While attempting to restore the central memory locations used in the test, the program detected a hardware error in the low-speed channel pair ch|ch in IOP-0. Central memory may have been corrupted.
    The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    FLAGS=flags     An octal value representing one or more channel interface status flags
    CH=ch           Channel on which the error was detected
    RESTMEM         Final write to central memory
    WRITE TO CM     Write to central memory

IOP-n LSP CH=ch|ch FAILED, P=address, FLAGS=flags, CH=ch
SAVEMEM, READ FROM CM
    While attempting to save the central memory locations used in the test, the program detected a hardware error in the low-speed channel pair ch|ch in IOP-0. Central memory is not corrupted. The following information is displayed:

    IOP-n           IOP in which the test was executing
    CH=ch|ch        Low-speed channel pair
    P=address       Parcel address relative to the start of dslsp
    FLAGS=flags     An octal value representing one or more channel interface status flags
    CH=ch           Channel on which the error was detected
    SAVEMEM         Initial read from central memory
    READ FROM CM    Read from central memory

The data compare error messages follow.

IOP-n LSP CH=ch|ch FAILED, P=address, CMA=cma
LSPCPUA EXP=exp ACT=act
    While writing and reading one word to and from central memory addresses in multiples of 10, starting at address 100 and continuing to the end of central memory, the program detected a data compare error in the low-speed channel pair ch|ch in IOP-n. The expected data did not match the actual data. Central memory may have been corrupted.
     The following information is displayed:

     IOP-n            IOP in which the test was executing
     CH=chlch         Low-speed channel pair
     P=address        Parcel address relative to the start of dslsp
     CMA=cma          Absolute word address in central memory
     LSPCPUA          Write and read one word to and from central
                      memory addresses in multiples of 10, starting
                      at address 100 and continuing to the end of
                      central memory
     EXP=exp          Expected data
     ACT=act          Actual data

IOP-n LSP CH=chlch FAILED, P=address, CMA=cma LSPDSDD EXP=exp ACT=act

     While writing and reading blocks of various lengths to and from
     central memory address 0, the program detected a data compare
     error in the low-speed channel pair chlch in IOP-n. The expected
     data did not match the actual data. Central memory may have been
     corrupted. The following information is displayed:

     IOP-n            IOP in which the test was executing
     CH=chlch         Low-speed channel pair
     P=address        Parcel address relative to the start of dslsp
     CMA=cma          Absolute word address in central memory
     LSPDSDD          Write and read blocks of various lengths to and
                      from central memory address 0
     EXP=exp          Expected data
     ACT=act          Actual data

The overlay messages follow.

IOP-n LSP CH=chlch FAILED - OVERLAY NOT DSLSPCP

     The overlay that the test read was not DSLSPCP. Central memory
     may have been corrupted. The following information is displayed:

     IOP-n            IOP in which the test was executing
     CH=chlch         Low-speed channel pair

IOP-n LSP CH=chlch FAILED - OVERLAYS NOT FOUND

     The test could not find an overlay file. Central memory may have
     been corrupted. The following information is displayed:

     IOP-n            IOP in which the test was executing
     CH=chlch         Low-speed channel pair

IOP-n LSP CH=chlch FAILED - OVERLAY WRONG TYPE

     The test found the overlay file DSLSPCP, but it has the wrong
     overlay type. Central memory may have been corrupted. The
     following information is displayed:

     IOP-n            IOP in which the test was executing
     CH=chlch         Low-speed channel pair


7. UTILITY PROGRAMS

Utility programs are on-line diagnostic tools rather than tests. This section describes the following utilities:

     •  olhpa (hardware performance analyzer)
     •  runsequence (automatic test sequencer)

7.1 olhpa

The olhpa program is a hardware performance analyzer that analyzes and reports the hardware errors and statuses recorded in the system error log. The olhpa program displays the following types of reports:

     •  A report listing one line of error information for each
        hardware error. The error information is displayed in fields
        and is sorted from left to right (refer to sort(1)).
     •  A comprehensive error report similar to the errpt(1M) report
        (-l command option)
     •  A summary of total errors (-q command option)
     •  A bar graph showing total errors for the specified time
        interval (-g [d]n command option)

7.1.1 PROGRAM SYNOPSIS

This subsection contains the olhpa program synopsis. All of the command options except errfiles can be entered in any order. If errfiles is specified, it must be the last entry on the command line.

The olhpa program displays disk, memory, tape, and SSD error reports in fields. If olhpa is entered without command options and arguments, it is equivalent to entering the following:

     olhpa -dmtv

The start time is the current time and date minus 30 days. The end time is the current time and date. The olhpa program reads from the error file /usr/adm/errfile.

Synopsis:

     olhpa [-l] [-q] [-g [d]n] [-d] [-m] [-t] [-v] [-D argument]
           [-M argument] [-T argument] [-V argument] [-s start]
           [-e end] [errfiles]

     -l          Displays a long version of the selected error report.
                 If you select -l, do not select -q or -g [d]n. A -l
                 report contains the same information as the errpt(1M)
                 report. For example, enter the following to display a
                 long version of a memory error report:

                      olhpa -m -l

                 Long reports are not sorted.

     -q          Displays only the summary information of an error
                 report. If you select -q, do not select -l or
                 -g [d]n.
     -g [d]n     Displays a bar graph showing the total errors for the
                 specified time interval. If you select -g [d]n, do
                 not select -l or -q. A single mnemonic value
                 represents each error, as follows:

                 Mnemonic   Description
                 R          Represents one recovered/corrected error
                 U          Represents one unrecovered/uncorrected
                            error

                 The required argument n indicates the time interval
                 that each bar in the graph represents. If the
                 interval (n) is in days, precede n with d; otherwise
                 it is assumed that n is in hours.

                 n can be any integer value. However, n should be
                 within the limits set by the start/end times and
                 dates (-s start and -e end, respectively). For
                 example, if the start time is 7:00, the end time is
                 11:00, and n is 8, the interval is adjusted so that
                 the program generates a report for one 4-hour
                 interval.

     -d          Displays a report of all disk errors. The default
                 display contains the following information in the
                 order listed:

                 Field             Field Mnemonic
                 Date              dte
                 Time              tme
                 Error type        et
                 Device type       dt
                 IOP               iop
                 Channel           cha
                 Head              hd
                 Sector            sct
                 Cylinder          cyl
                 General status    gs
                 Status            st

     -m          Displays a report of all memory errors. The default
                 display contains the following information in the
                 order listed:

                 Field             Field Mnemonic
                 Date              dte
                 Time              tme
                 Syndrome          syn
                 Bank              bnk
                 Failing bit       bit
                 Chip select       chp
                 Failing module    loc
                 CPU               cpu
                 Current command   cmd
                 Count             cnt
                 Status            st

     -t          Displays a report of all tape errors. The default
                 display contains the following information in the
                 order listed:

                 Field                 Field Mnemonic
                 Date                  dte
                 Time                  tme
                 Error type            et
                 Initial channel       ich
                 Initial device path   idp
                 Final device path     fdp
                 Block                 blk
                 Retry                 ret
                 Sense byte #00        s0
                 Status                st

     -v          Displays a report of all SSD errors.
                 The default display contains the following
                 information in the order listed:

                 Field                    Field Mnemonic
                 Date                     dte
                 Time                     tme
                 Channel                  cha
                 Status                   st
                 SSD address              sad
                 Central memory address   mad
                 Transfer length          len
                 Read/write flag          rwf

     -D argument, -M argument, -T argument, -V argument

                 Displays a report of disk, memory, tape, or SSD
                 errors (-D, -M, -T, or -V option, respectively). The
                 required argument can be one of the following:

                 Argument   Description

                 P[,+],field[,field]
                            Replaces or adds to the default display.
                            If entered with the plus (+) option, the
                            specified fields are displayed in addition
                            to the default display. If entered without
                            the plus (+) option, the specified fields
                            are displayed instead of the default
                            fields (and the specified fields become
                            the default display for the test run).

                            field can be any mnemonic listed in the
                            help menu. The fields are displayed in the
                            order in which they are entered. The error
                            information is sorted from left to right.
                            Refer to sort(1).

                 S,field=value[,field=value]
                            Displays only the records in which the
                            fields meet all of the associated value
                            restrictions. field can be any mnemonic
                            listed in the help menu. value is the
                            field assignment.

                 H          Displays an associated help menu. The
                            mnemonics in the menu are used to select
                            fields for the field portion of the
                            preceding arguments.

     -s start    Sets the start time and date of the report. Enter the
                 -s option with one of the following required
                 arguments:

                 Argument         Description
                 n                End time and date of the report
                                  (-e end) minus n days
                 hh:mm,MM/DD/YY   Time (hours:minutes) and date
                                  (month/day/year)
                 hh:mm            Time (hours:minutes). The date is
                                  set to the current date.
                 MM/DD/YY         Date (month/day/year). The time is
                                  set to 00:00.

                 The default for start is the current time and date
                 minus 30 days.

     -e end      Sets the end time and date of the report.
                 The required argument must be in one of the following
                 formats:

                 Format           Description
                 hh:mm,MM/DD/YY   Time (hours:minutes) and date
                                  (month/day/year)
                 hh:mm            Time (hours:minutes). The date is
                                  set to the current date.
                 MM/DD/YY         Date (month/day/year). The time is
                                  set to 23:59.

                 The default for end is the current time and date.

     errfiles    Specifies the errfiles to be read. errfiles can be
                 one or more files created by errdemon(1M). The
                 default errfile is /usr/adm/errfile.

7.1.2 HELP MENUS

This subsection contains the menus to use in selecting the fields for the field portion of the arguments associated with the -D, -M, -T, and -V options. Figures 7-1, 7-2, 7-3, and 7-4 show the Disk, Memory, Tape, and SSD Help Menus, respectively.

     dte) Date                      tme) Time
     dtc) Dt-IOP-channel-unit       dt ) Device type
     iop) IOP                       ios) IOS
     cha) Channel                   et ) Error type
     hd ) Head                      sct) Sector
     sst) Spiralled sector          cyl) Cylinder
     st ) Status                    ret) Retry
     blk) Block                     sbk) Spiralled block
     cs ) Control status            gs ) General status
     df ) Disk function             s0 ) Status 00
     s1 ) Status 01                 s2 ) Status 02

     s21) Status 21                 s22) Status 22
     s23) Status 23                 ies) Initial error status
     ecs) End controller status     eds) End drive status
     edf) End disk function         fes) Final error status

     DD49 only
     a1 ) A1 - bit 5 of G.S.        a2 ) A2 - bit 6 of G.S.
     b1 ) B1 - bit 7 of G.S.        b2 ) B2 - bit 8 of G.S.
     aof) A-offset                  bof) B-offset
     b2c) B2 correction mask        b1c) B1 correction mask
     a2c) A2 correction mask        a1c) A1 correction mask
     b2o) B2 offset                 b1o) B1 offset
     a2o) A2 offset                 a1o) A1 offset
     elm) Expected LMA              alm) Actual LMA

     DD39 only
     c0 ) C0 - bit 3 of G.S.        c1 ) C1 - bit 4 of G.S.
     c2 ) C2 - bit 5 of G.S.        c3 ) C3 - bit 6 of G.S.
     ofs) Offset
     sy1) Chan. 1 syndrome          sy0) Chan. 0 syndrome
     sy3) Chan. 3 syndrome          sy2) Chan. 2 syndrome
     c2c) C2 correction mask        c3c) C3 correction mask
     c0c) C0 correction mask        c1c) C1 correction mask
     c2o) C2 offset                 c3o) C3 offset
     c0o) C0 offset                 c1o) C1 offset

               Figure 7-1. Disk Help Menu (1 of 2)

     DD40 only
     ibs) Initial buffer stat.      ids) Initial drive status
     fbs) Final buffer status       fds) Final drive status
     msk) DD40 correction mask      off) DD40 offset
     dfa) Defect address            if0) Initial fault stat 0
     if1) Initial fault stat 1      if2) Initial fault stat 2
     if3) Initial fault stat 3      io0) Initial oper. stat 0
     io1) Initial oper. stat 1      io2) Initial oper. stat 2
     io3) Initial oper. stat 3      ic0) Initial FRU code 0
     ic1) Initial FRU code 1        ic2) Initial FRU code 2
     ic3) Initial FRU code 3        ef0) Ending fault stat 0
     ef1) Ending fault stat 1       ef2) Ending fault stat 2
     ef3) Ending fault stat 3       ec0) Ending FRU code 0
     ec1) Ending FRU code 1         ec2) Ending FRU code 2
     ec3) Ending FRU code 3         syn) Channel syndrome

     DD29/DD19 only
     cid) Cylinder from ID          csr) Cylinder status reg.
     req) Request                   fsr) Fault status reg.
     isr) Interlock stat. reg.      mrg) Margin
     ofd) Offset direction          mgn) Magnitude
     cv0) Correction vector 0       cv1) Correction vector 1
     cv2) Correction vector 2       cv3) Correction vector 3

               Figure 7-1. Disk Help Menu (2 of 2)

     dte) Date                      tme) Time
     cnt) Count                     ity) Initial type
     st ) Status                    sub) Subtype
     mde) Mode                      cpu) CPU
     syn) Syndrome                  chp) Chip-select
     bnk) Bank                      rh ) Rh
     add) Failing address           bit) Failing bit
     loc) Failing module            usr) Current user
     cmd) Current command

               Figure 7-2. Memory Help Menu

     dte) Date                      tme) Time
     et ) Error type                st ) Status
     ich) Initial channel           ios) IOS number
     idp) Initial device path       ids) Initial device stat.
     fch) Final channel             fdp) Final device path
     fds) Final device stat.        ifn) Initial function
     ffn) Final function            blk) Block
     dns) Density                   ret) Retry
     vol) Volume                    usr) User
     cmd) Command                   ipt) Input tags
     s0 ) SB00                      s1 ) SB01
     s22) SB22                      s23) SB23

     IBM 3480 only
     s24) SB24                      s25) SB25
     s26) SB26                      s27) SB27
     s28) SB28                      s29) SB29
     s30) SB30                      s31) SB31

               Figure 7-3. Tape Help Menu

     dte) Date                      tme) Time
     cha) Channel                   st ) Status
     sad) SSD-Address               mad) MEM-Address
     len) Length                    rwf) Read/write flag

               Figure 7-4. SSD Help Menu

7.1.3 PROGRAM EXAMPLES

This subsection contains olhpa execution examples. Depending on whether errors are in the current error file, it may be necessary to specify an error file. If you need assistance, contact your CRI representative.

To display disk, tape, memory, and SSD error reports, enter the following:

     olhpa

To display a disk error report, enter the following:

     olhpa -d

To display a disk error report for an error file, enter the following:

     olhpa -d errfile

To display the disk help menu, enter the following:

     olhpa -D H

To display a disk error report for the date, time, head, and channel fields only, enter the following:

     olhpa -D P,DTE,TME,HD,CHA

To display a disk error report of only the records for which the channel is equal to 26 and the IOP is equal to 2, enter the following:

     olhpa -D S,CHA=26,IOP=2

The following example searches for disk errors for a specific channel and IOP, and displays the associated error information in the specified fields. The disk error report will display the following fields for only the records for which the channel is equal to 26 and the IOP is equal to 2: date, time, device type, general status, and A1, A2, B1, and B2 of the general status. Enter the following:

     olhpa -DS,CHA=26,IOP=2 -DP,DTE,TME,DT,GS,A1,A2,B1,B2

To display a bar graph showing yesterday's disk errors in 2-hour intervals, enter the following (using yesterday's date for date):

     olhpa -d -s date -e date -g 2

7.1.4 SHELL SCRIPT GENERATION AND EXECUTION

Shell scripts can allow you to easily generate and execute olhpa command sequences. The following example shows a shell script that generates a disk error report for each disk drive for which errors are logged.

Example:

     #
     #
     # Shell script to report errors for each disk drive.
     echo "**************************************************************"
     echo "        R E P O R T   O F   D I S K   E R R O R S"
     echo
     echo Only devices which logged errors will create reports.
     echo
     echo "**************************************************************"
     for DEV in `olhpa -DPdtc $1 | awk '{print $1}' | uniq | grep "-"`
     do
         echo "**************************************************************"
         echo $DEV
         echo "**************************************************************"
         echo
         olhpa -DSdtc=${DEV} $1
     done
     echo "**************************************************************"
     echo "        E N D   O F   R E P O R T"
     echo "**************************************************************"

Error report output from preceding shell script:

     **************************************************************
             R E P O R T   O F   D I S K   E R R O R S

     Only devices which logged errors will create reports.

     **************************************************************
     **************************************************************
     40-1-34A
     **************************************************************

     Cray Hardware Performance Analyzer
     Run time         10:26 03/02/88
     Starting time    10:26 02/01/88
     Ending time      10:26 03/02/88

     Hardware Error Report For Disks
     Restrictions: Dt-IOP-channel-unit = 40-1-34A

     Date      Time      Errtyp  DT/IOP/CHA  HD  Sect  Cyl   Gen-Stat  Status
     88/02/26  03:10:51  Read    40-1-34A    00  0000  0013  011426    Corre.
     88/02/26  03:37:30  Read    40-1-34A    00  0000  0013  011426    Corre.
     88/02/26  04:46:23  Read    40-1-34A    00  0000  0013  011426    Recov.

     88/03/01  04:14:00  Read    40-1-34A    00  0000  0011  011411    Recov.
     88/03/01  04:14:13  Read    40-1-34A    00  0000  0011  011411    Recov.
     88/03/01  06:24:40  Read    40-1-34A    00  0000  0013  011426    Corre.
     Total Disk Errors          30
     Recovered Disk Errors      12
     Corrected Disk Errors      18
     Unrecovered Disk Errors     0
     Uncorrected Disk Errors     0
     Total Retries              70

     **************************************************************
     40-2-34A
     **************************************************************

     Cray Hardware Performance Analyzer
     Run time         10:26 03/02/88
     Starting time    10:26 02/01/88
     Ending time      10:26 03/02/88

     Hardware Error Report For Disks
     Restrictions: Dt-IOP-channel-unit = 40-2-34A

     Date      Time      Errtyp  DT/IOP/CHA  HD  Sect  Cyl   Gen-Stat  Status
     88/02/26  05:48:17  Read    40-2-34A    00  0000  0007  011433    Corre.
     88/02/26  06:01:49  Read    40-2-34A    00  0000  0003  011442    Corre.

     Total Disk Errors           2
     Recovered Disk Errors       0
     Corrected Disk Errors       2
     Unrecovered Disk Errors     0
     Uncorrected Disk Errors     0
     Total Retries               6

     Error information for all drives for which errors are logged is
     displayed.

     **************************************************************
             E N D   O F   R E P O R T
     **************************************************************

7.1.5 PROGRAM MESSAGES

If an invalid or nonexistent command option is entered, olhpa displays the incorrect entry and the complete program synopsis. If an invalid or nonexistent error file is entered, the following message is displayed:

     olhpa: Cannot open file

In an error report, a field can contain the following symbols:

     Symbol   Description
     N/A      No information was recorded in the system error log.
     (x)      No information was recorded in the system error log.
              The field is specific to device type x.

7.2 runsequence

The runsequence utility is used with the crontab(1) command to perform automatic test sequencing (scheduling and testing without operator intervention). Error messages are returned to specified users through the UNICOS mail. This alerts field engineers and analysts that there is an error. They can then examine the error log to determine where the error occurred.
The goal is to detect and isolate failures before a system or application failure occurs.

To initiate automatic test sequencing, do the following:

     1.  Set the shell variables in the runsequence shell script.
     2.  Create the sequence files.
     3.  Create the input file for the crontab(1) command.
     4.  Execute the crontab(1) command.

After being called in from the crontab(1) input file, runsequence reads a file containing a list of diagnostics and related command options, executes the diagnostics (one at a time), and saves any output in a file. After each diagnostic in the sequence file is executed, runsequence determines the number of lines of output generated, as follows:

     •  If there are more than five lines of output, runsequence
        assumes that the diagnostic detected an error and sends
        specified users a message.
     •  If no error is detected but standard error output is
        generated, runsequence sends specified users a message.
     •  If no error is detected, the output files from the diagnostic
        are removed.

7.2.1 crontab INPUT FILE

The crontab(1) input file contains the following information:

     •  Times at which the sequences are to be run
     •  Calls to runsequence

When defining the crontab(1) input file, you must include calls to runsequence. Each call to runsequence must contain an appropriate sequence file name and, optionally, a CPU designator. For additional information on the crontab(1) command, refer to the UNICOS User Commands Reference Manual, CRI publication SR-2011.

runsequence synopsis:

     runsequence seqfile [cpu]

     seqfile     Indicates the name of the file containing the
                 sequence of diagnostics to be run, the diagnostic
                 command options, and any comments. The comments are
                 the same as shell script comments; they start with a
                 pound sign (#) and continue to the end of the line.

     cpu         Indicates the CPU in which the diagnostics are to be
                 run. cpu can be a, b, c, d, e, f, g, or h.

                 If the cpu option is specified, the diagnostics in
                 the sequence file must be CPU tests. All the log and
                 core files are placed in a subdirectory of the
                 DIAGLOG directory, which is created if it does not
                 already exist.

                 If the cpu option is not specified, the diagnostic
                 uses the default value or you can specify the CPU
                 option for the diagnostic in the sequence file. All
                 log and core files are placed in the DIAGLOG
                 directory instead of a subdirectory.

The following example shows a sample crontab(1) input file:

     # Run in a different cpu every 15 minutes
     1  * * * * $HOME/scripts/runsequence hourlyseq a
     15 * * * * $HOME/scripts/runsequence hourlyseq b
     30 * * * * $HOME/scripts/runsequence hourlyseq c
     45 * * * * $HOME/scripts/runsequence hourlyseq d
     1  * * * * $HOME/scripts/runsequence sbtseq a,b,c,d
     15 * * * * $HOME/scripts/runsequence sbtseq b,c,d,a
     30 * * * * $HOME/scripts/runsequence sbtseq c,d,a,b
     45 * * * * $HOME/scripts/runsequence sbtseq d,a,b,c
     # Run at midnight each day
     1 0 * * 0-6 $HOME/scripts/runsequence dailyseq a
     1 0 * * 0-6 $HOME/scripts/runsequence dailyseq b
     1 0 * * 0-6 $HOME/scripts/runsequence dailyseq c
     1 0 * * 0-6 $HOME/scripts/runsequence dailyseq d
     1 0 * * 0-6 FSPATH=/tmp DT=DD49 $HOME/scripts/runsequence cfdtseq
     1 0 * * 0-6 FINDPATH=$HOME/log $HOME/scripts/findseq

The minute field is set to 1 to offset the diagnostic program execution to one minute after the hour. This allows scheduled system activities to be performed at the start of each hour.

7.2.2 SEQUENCE FILES

The sequence files contain a list of the diagnostics to be executed and their related command options. You must place these files in the directory specified by the DIAGSCRIPTS shell variable. Before creating sequence files, refer to appendix B, Test Execution Times. The following example shows the recommended sequence files for the crontab(1) input file.

Example:

     hourlyseq:

     # Run the following sequence once every 15 minutes in a different CPU.
     olcrit cputime 0:0:30 +getseed   # Read seed from olcrit.seed if available
     olcsvc cputime 0:0:30 +getseed   # Read seed from olcsvc.seed if available
     olibuf cputime 0:0:30 +getseed   # Read seed from olibuf.seed if available
     olcfpt cputime 0:0:30 +getseed   # Read seed from olcfpt.seed if available
     olcm   cputime 0:0:30 +getseed   # Read seed from olcm.seed if available

     dailyseq:

     # Run the following sequence once a day.
     olcrit cputime 0:6:0 +getseed    # Read seed from olcrit.seed if available
     olcsvc cputime 0:6:0 +getseed    # Read seed from olcsvc.seed if available
     olibuf cputime 0:6:0 +getseed    # Read seed from olibuf.seed if available
     olcfpt cputime 0:6:0 +getseed    # Read seed from olcfpt.seed if available
     olcm   cputime 0:6:0 +getseed    # Read seed from olcm.seed if available

     sbtseq:

     #
     # sbtseq: This sequence tests olsbt in all cpus available
     #         it should be run once every 15 minutes.
     #
     olsbt cputime 30 +getseed

     cfdtseq:

     # Run the following sequence to test a mass storage device.
     olcfdt maxp 50 fn $FSPATH/workfil.$$ rsz 512 sz 250 dt $DT
     find $FSPATH -name 'workfil.*' -user $LOGNAME -exec rm -f {} \;

     findseq:

     #
     # findseq: This sequence finds and removes any small log files
     #          or stderr files that the runsequence created.
     #
     TOO_OLD=180     # Number of days to save log files
     #FPATH          # Path to log files. default FPATH=$HOME/log in cronfile
     find $FPATH \( \( -name '*.[0-9]*[0-9]' -size -300c \) -o -name \
          'stderr.*' \) -atime +0 -type f -exec rm -f {} \; 2>/dev/null 1>&2
     #
     # Remove any log file that has not been touched recently
     #
     find $FPATH -name '*.[0-9]*[0-9]' -type f -atime +$TOO_OLD \
          -exec rm -f {} \;

Each site must determine if additional testing is desirable.

7.2.3 runsequence SHELL SCRIPT

The runsequence shell script runs under the Bourne shell and executes a series of diagnostics by reading a file containing a list of the diagnostics to be run.
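
As a rough illustration of how a sequence file might be reduced to the list of commands to run, the following Bourne shell sketch strips shell-style comments (a pound sign to the end of the line) and blank lines. The function name list_diagnostics is hypothetical and does not come from the runsequence source; the sketch also assumes that no diagnostic option contains a literal pound sign.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the actual runsequence script:
# print the commands in a sequence file, one per line, with comments
# and blank lines removed.
list_diagnostics() {
    # Delete everything from '#' to the end of each line, trim any
    # trailing white space, then drop the lines that are left empty.
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//' "$1" | grep -v '^$'
}
```

Calling list_diagnostics on a file such as sbtseq would, under these assumptions, print only the olsbt command line, which a caller could then execute one command at a time.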
The diagnostics should be run with the verbose option disabled (-verbose), because the size of each diagnostic output file is used to determine if the diagnostic has failed. The shell script maintains the diagnostic output and sends messages to a specified list of users when an error is detected.

You can set the following variables in the runsequence shell script:

     DIAGBIN=path
                 Indicates the full path name of the directory where
                 the executable binaries of the diagnostics reside. If
                 the binaries reside in more than one directory, enter
                 colons between each directory. The following entry
                 defines a single directory:

                      DIAGBIN=/ce/bin

                 The following entry defines several directories:

                      DIAGBIN=/ce/bin:$HOME/bin

     DIAGLOG=path
                 Indicates the full path name of the directory where
                 the log files are saved when a diagnostic detects an
                 error

     DIAGSCRIPTS=path
                 Indicates the path name where the sequence files
                 reside. You can specify only one full path name.

     MAILLIST="user ... user"
                 Provides a list of users to be notified when a
                 diagnostic detects an error. Enter a space between
                 each user name and enclose the list in double quotes.
                 It is recommended that the list contain more than one
                 user name.

     NICE=n      Indicates the amount by which the diagnostic's
                 priority in the execution queue is to be lowered. n
                 can be any integer within the range 1 through 19. If
                 a value greater than 19 is entered, it is processed
                 as if it were 19. If a value less than 0 is entered,
                 it has no effect.

     RUNLOG=logfile
                 Indicates the name of the log file containing
                 information on the sequence being run and any errors
                 detected. The log file resides in the DIAGLOG
                 directory.

     SAVECORE=ON|OFF
                 Enables (ON) or disables (OFF) the option that
                 renames and saves each core file generated. If
                 SAVECORE is set to OFF, any new core file overwrites
                 an existing one.
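
The NICE limits described above, and the output-size rule given in section 7.2, can be sketched in the Bourne shell as follows. This is an illustration only, not the runsequence source: the function names clamp_nice and classify_run are hypothetical, and treating a negative NICE value as an increment of 0 is an assumed reading of "it has no effect".

```shell
#!/bin/sh
# Hypothetical sketch; these functions are not part of runsequence.

# Limit a requested NICE increment: values above 19 are treated as 19,
# and negative values are assumed to mean "no priority change" (0).
clamp_nice() {
    n=$1
    if [ "$n" -gt 19 ]; then
        n=19
    elif [ "$n" -lt 0 ]; then
        n=0
    fi
    echo "$n"
}

# Classify one diagnostic run from its saved output files.
#   $1 = file holding standard output, $2 = file holding standard error
# Prints "error"  if there are more than five lines of output,
#        "stderr" if there are not but standard error output exists,
#        "clean"  otherwise (the output files could then be removed).
classify_run() {
    lines=$(wc -l < "$1" | tr -d ' ')
    if [ "$lines" -gt 5 ]; then
        echo error
    elif [ -s "$2" ]; then
        echo stderr
    else
        echo clean
    fi
}
```

In this sketch a caller would mail the users in MAILLIST when classify_run prints error or stderr, and would remove the output files when it prints clean. The line-count rule also shows why the diagnostics should be run with the verbose option disabled: verbose output alone could exceed five lines.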
The default values for the variables in the runsequence shell script are as follows:

     DIAGBIN=/ce/bin             # Location of the executable diagnostics
     DIAGLOG=$HOME/log           # Location of the diagnostic log files
     DIAGSCRIPT=$HOME/scripts    # Location of the diagnostic sequence lists
     RUNLOG=$DIAGLOG/runlog      # Program log
     NICE=4                      # Lower the diagnostic's priority by this amount
     SAVECORE=OFF                # Existing core file will be overwritten
     MAILIST="$LOGNAME"          # List of people to receive error messages


APPENDIX SECTION

A. ON-LINE DIAGNOSTIC PROGRAMS

This appendix lists and briefly describes the following types of on-line diagnostic programs:

     •  Confidence tests
     •  Maintenance tests
     •  Down-device programs
     •  Network communications test (olnet)
     •  I/O Subsystem (IOS) deadstart programs
     •  Utilities
     •  offmon tests

The on-line diagnostic programs listed in this section are supported on the following computer systems:

     •  CEA systems Y-mode (32-bit addressing)
     •  CRAY X-MP and CRAY-1 computer systems

A.1 CONFIDENCE TESTS

Table A-1 briefly describes each on-line confidence test.

               Table A-1. Confidence Tests

     Test      Description                                   Language
     olcfdt    Mass storage device test                      CFT77
     olcfpt    Comprehensive floating-point test             CAL 2
     olcm      Central memory test                           CAL 2
     olcrit    Comprehensive random instruction test         CAL 2
     olcsvc    Comprehensive scalar/vector compare test      CAL 2
     olibuf    Instruction buffer test                       CAL 2
     olsbt     Semaphore, shared B and T register test       CAL 2

A.2 MAINTENANCE TESTS

Table A-2 briefly describes each on-line CPU maintenance test.

     NOTE

     The CPU Maintenance Tests are supported for CX/CEA systems in
     X-mode only.

               Table A-2. CPU Maintenance Tests

     Test          Description                                   Language
     olaht         A register indexing test                      CAL 2
     olarb         A register data test                          CAL 2
     olarm         A register multiply test                      CAL 2
     olbrb         B register basic data test                    CAL 2
     [illegible]   Random instruction and operand test           CAL 2
     olcmp††       Vector compress instruction test              CAL 2
     olcmx†††      Random instruction and operand test           CAL 2
     olgth††       Scatter/gather test                           CAL 2
     olibz†††      Instruction buffer test                       CAL 2
     olmit         Moving inversions memory test                 CAL 2
     olsfa         Simulate floating-point add test              CAL 2
     olsfm         Simulate floating-point multiply test         CAL 2
     olsfr         Simulate floating-point reciprocal            CAL 2
     olsis         Scalar register instruction simulation test   CAL 2
     olsr3         Random instruction issue register conflicts   CAL 2
     olsra         Scalar register add test                      CAL 2
     olsrb         Scalar register basic test                    CAL 2
     olsrl         Scalar register logical test                  CAL 2
     olsrs         Scalar register shift test                    CAL 2
     olstan        Standard answer functional units test         CAL 2
     olsvc         Scalar and vector compare test                CAL 2
     oltrb         T register basic data test                    CAL 2
     olvpop†       Vector population count test                  CAL 2
     olvpp††       Vector population count test                  CAL 2
     olvra         Vector register add test                      CAL 2
     olvrl         Vector register logical test                  CAL 2
     olvrn         Vector register random test                   CAL 2
     olvrr         Vector register random length test            CAL 2
     olvrs         Vector register shift test                    CAL 2
     olvrz††       Vector register stress test                   CAL 2

     †    CRAY-1 computer systems only
     ††   CEA (X-mode) and CRAY X-MP computer systems only
     †††  CRAY X-MP EA (X-mode) and CRAY X-MP computer systems only

A.3 DOWN-DEVICE PROGRAMS

Table A-3 briefly describes the down-device programs, which reside on DIAGPL.

               Table A-3. Down-Device Programs

     Test      Description                          Language
     donut     On-line disk maintenance program     CFT77, C & CAL 2
     oldmon†   Down CPU monitor                     C & CAL 2
     unitap    On-line magnetic tape test           C

     †    Multiple CPU Cray computer systems only

Tables A-4 and A-5 briefly describe the down CPU tests, which reside on XMPPL and execute under oldmon, the down CPU monitor. These tests run on CRAY X-MP computer systems in multiple-CPU environments only (CRAY X-MP/4 and CRAY X-MP/2 computer systems).

               Table A-4. Down CPU Confidence Tests

     Test      Description                                   Language
     offcfpt   Comprehensive floating point test             CAL 2
     offcm     Central memory test                           CAL 2
     offcrit   Comprehensive random instruction test         CAL 2
     offcsvc   Comprehensive scalar/vector compare test      CAL 2
     offibuf   Instruction buffer test                       CAL 2

               Table A-5. Down CPU Maintenance Tests

     Test      Description                                   Language
     aht       A register indexing test                      CAL 2
     arb       A register data test                          CAL 2
     arm       A register multiply test                      CAL 2
     brb       B register basic data test                    CAL 2
     cmp       Vector compress instruction test              CAL 2
     cmx       Random instruction and operand test           CAL 2
     gth       Scatter/gather test                           CAL 2
     ibz       Instruction buffer test                       CAL 2
     mit       Moving inversions memory test                 CAL 2
     sfa       Simulate floating-point add test              CAL 2
     sfm       Simulate floating-point multiply test         CAL 2
     sfr       Simulate floating-point reciprocal            CAL 2
     sis       Scalar register instruction simulation test   CAL 2
     sr3       Random instruction issue register conflicts   CAL 2
     sra       Scalar register add test                      CAL 2
     srb       Scalar register basic test                    CAL 2
     srl       Scalar register logical test                  CAL 2
     srs       Scalar register shift test                    CAL 2
     stan      Standard answer functional units test         CAL 2
     svc       Scalar and vector compare test                CAL 2
     trb       T register basic data test                    CAL 2
     vpp       Vector population count test                  CAL 2
     vra       Vector register add test                      CAL 2
     vrl       Vector register logical test                  CAL 2
     vrn       Vector register random test                   CAL 2
     vrr       Vector register random length test            CAL 2
     vrs       Vector register shift test                    CAL 2
     vrz       Vector register stress test                   CAL 2

A.4 ON-LINE NETWORK COMMUNICATIONS PROGRAM

Table A-6 briefly describes the Cray-to-front end communications test, olnet.

               Table A-6. On-line Network Communications Program

     Test      Description                                   Language
     olnet     Cray-to-front end communications test         CFT77 & C†
               (exercises all or part of the path between
               a Cray mainframe and a front end)

     †    Motorola Operator Workstation (OWS) and Maintenance
          Workstation (MWS) only

The olnet test is described in the On-line Diagnostic Network Communications Program (OLNET) Maintenance Manual, CRI publication SMM-1016.

A.5 I/O SUBSYSTEM DEADSTART PROGRAMS

Table A-7 briefly describes the I/O Subsystem (IOS) deadstart programs, which reside on DIAGPL. The cleario program is executed independently from the other programs listed. The dsdiag program, the IOS deadstart diagnostic control program, loads and executes all of the programs (except cleario) from a diagnostic overlay file, after first executing a series of basic IOP-0 tests.

               Table A-7. I/O Subsystem Deadstart Programs

     Program    Description                                  Language
     cleario    Attempts to clear the IOS if the deadstart   APML
                procedure fails
     dsdiag     Deadstart diagnostic control program         APML
     dshsp      High-speed channel test from an I/O          APML
                processor (IOP) to central memory or to an
                SSD solid-state storage device
     dsiom      Local memory addressing and data test for    APML
                each IOP
     dsiop      Instruction test for each IOP                APML
     dslsp      Low-speed channel test from IOP-0 to         APML and CAL 1
                central memory
     dsmos      Buffer memory addressing and data path       APML
                test for each IOP
     dsmos16k   Test of the lower 16 Kbytes of buffer        APML
                memory from IOP-0 only

A.6 UTILITY PROGRAMS

Table A-8 briefly describes each on-line utility program.

               Table A-8. Utility Programs

     Utility       Description                       Language
     olhpa         Hardware performance analyzer     C
     runsequence   Diagnostic sequencer utility      Shell script

A.7 offmon TESTS

Table A-9 briefly describes each offmon test.

               Table A-9. offmon Tests

     Confidence Test   Description                                Language
     offcfpt           Comprehensive floating-point test          CAL 2
     offcm             Central memory test                        CAL 2
     offcrit           Comprehensive random instruction test      CAL 2
     offcsvc           Comprehensive scalar/vector compare test   CAL 2
     offibuf           Instruction buffer test                    CAL 2


B. TEST EXECUTION TIMES

This appendix lists the execution times for the following types of on-line diagnostic tests:

     •  Confidence
     •  Maintenance

The tests were run at Cray Research, Inc. during normal workday operations, using a default pass count of 512 (O'1000). The times are for test execution in a single CPU of a CRAY X-MP computer system and cannot be extrapolated to determine execution times for multiple CPU runs.
NOTE

The execution times may vary depending on system load, and should not be used for CPU or benchmark comparisons.

In the test execution tables, the following times are listed in the headings:

Time       Description

Elapsed    Wall-clock time
User       CPU time
System     System overhead time

B.1 EXECUTION TIMES FOR CONFIDENCE TESTS

Table B-1 lists the execution times for the confidence tests. Each test was run with a pass count of 512 (O'1000).

Table B-1. Execution Times for Confidence Tests

Test       Elapsed Time†   User Time   System Time

olcm       65.00 s         34.25 s     0.88 s
olcfpt     23.00 s          7.15 s     0.47 s
olcrit     15.00 s          7.55 s     0.28 s
olcsvc     12.00 s          4.27 s     0.21 s
olibuf     78.00 s         21.00 s     0.11 s
olsbt††     4.66 s          2.29 s     1.43 s

†  Execution times may be reduced or increased by the use of test-specific options.
†† Times are for test execution with four CPUs (cpu a,b,c,d)

B.2 EXECUTION TIMES FOR MAINTENANCE TESTS

Table B-2 lists the execution times for the maintenance tests. Each test was run with a pass count of 512 (O'1000) except olibz and olsfm; these tests were run for fewer than 512 (O'1000) passes, and their respective execution times were then used to extrapolate elapsed, user, and system times for 512 passes.

Table B-2. Execution Times for Maintenance Tests

Test          Elapsed Time   User Time   System Time

olaht         10.03 s         2.24 s     0.08 s
olarb          0.74 s         0.11 s     0.01 s
olarm         21.10 m        15.95 m    17.35 s
olbrb          0.69 s         0.24 s     0.01 s
[illegible]    7.10 s         2.92 s     0.04 s

†  CRAY-1 computer systems only
†† CEA (X-mode) and CRAY X-MP computer systems only

Table B-2. Execution Times for Maintenance Tests (continued)

Test          Elapsed Time   User Time   System Time

olcmx†        25.35 s         2.49 s     0.1
olgth††       15.11 s         7.41 s     0.12 s
olibz†         6.74 h         1.62 h     1.25 m
olmit          1.61 m        42.12 s     1.58 s
olsfa          9.39 s         7.95 s     0.17 s
olsfm        117.0 h         14.3 h     12.64 m
olsfr          8.02 m         6.33 m     5.77 s
olsis          0.46 s         0.02 s     0.01 s
olsr3          0.46 s         0.18 s     0.01 s
olsra          0.96 s         0.70 s     0.04 s
olsrb          1.00 s         0.34 s     0.02 s
olsrl          1.96 s         0.05 s     0.01 s
olsrs         20.64 s        18.04 s     0.37 s
olstan         0.31 s         0.21 s     0.01 s
olsvc          0.35 s         0.17 s     0.01 s
oltrb          6.07 s         5.13 s     0.12 s
olvpop†††      0.73 s         0.57 s     0.02 s
olvpp††        0.84 s         0.62 s     0.01 s
olvra          0.82 s         0.68 s     0.02 s
olvrl          0.87 s         0.59 s     0.01 s

†   CRAY X-MP EA (X-mode) and CRAY X-MP computer systems only
††  CEA (X-mode) and CRAY X-MP computer systems only
††† CRAY-1 computer systems only

Table B-2. Execution Times for Maintenance Tests (continued)

Test          Elapsed Time   User Time   System Time

olvrn          0.23 s         0.12 s     0.01 s
olvrr          0.28 s         0.12 s     0.01 s
olvrs         26.3 s         17.34 s     0.36 s
olvrz†         2.86 m         2.83 m     1.44 s

† CEA (X-mode) and CRAY X-MP computer systems only

C. ON-LINE DIAGNOSTIC PROGRAM LIBRARIES

This appendix describes the on-line diagnostic program libraries (PLs) and their contents and associated decks. The on-line diagnostic PLs are as follows:

PL         Description

DIAGPL     Contains on-line diagnostic programs that execute on CX/CEA and CRAY-1 computer systems
XMPPL      Contains diagnostic programs that execute on CX/CEA systems
CRAY1PL    Contains diagnostic programs that execute on a CRAY-1 computer system

Each deck contains source code that is used to generate a binary.

C.1 DIAGPL

DIAGPL contains on-line diagnostic programs that execute on CX/CEA and CRAY-1 computer systems.
The contents of DIAGPL are as follows:

Program       Deck

bmxtap        BMXTAP
cleario       CLEARIO
donut         DONUT
dsdiag        DSDIAG, DSDIAGD, DSMOS16K, DSIOM, DSIOP, DSMOS, DSHSP, DSLSP
olcm          OLCM
olcfdt        OLCFDT
olcfpt        OLCFPT
olcrit        OLCRIT
olcsvc        OLCSVC
oldmon        OLDMON
olhpa         OLHPA
olibuf        OLIBUF
olnet         OLNET
olsbt         OLSBT
runsequence   RUNSEQ

C.2 XMPPL

XMPPL contains diagnostic programs that execute on CX/CEA systems. The contents of XMPPL are as follows:

Program    Deck

olaht      AHT
olarb      ARB
olarm      ARM
olbrb      BRB
olcmp      CMP
olcmx      CMX
olgth      GTH
olibz      IBZ
olmit      MIT
olsfa      SFA
olsfm      SFM
olsfr      SFR
olsis      SIS
olsr3      SR3
olsra      SRA
olsrb      SRB
olsrl      SRL
olsrs      SRS
olstan     STAN
olsvc      SVC
oltrb      TRB
olvpp      VPP
olvra      VRA
olvrl      VRL
olvrn      VRN
olvrr      VRR
olvrs      VRS
olvrz      VRZ

C.3 CRAY1PL

CRAY1PL contains diagnostic programs that execute on CRAY-1 computer systems. The contents of CRAY1PL are as follows:

Program    Deck

olaht      AHT
olarb      ARB
olarm      ARM
olbrb      BRB
olcmd      CMD
olmit      MIT
olsfa      SFA
olsfm      SFM
olsfr      SFR
olsis      SIS
olsr3      SR3
olsra      SRA
olsrb      SRB
olsrl      SRL
olsrs      SRS
olstan     STAN
olsvc      SVC
oltrb      TRB
olvpop     VPOP
olvra      VRA
olvrl      VRL
olvrn      VRN
olvrr      VRR
olvrs      VRS

D. SOFTWARE PROBLEM REPORTING

This appendix describes the on-line diagnostic software problem reporting procedure.

The on-line diagnostics are released as part of the operating system software. To report problems with or request changes to the on-line diagnostic software, send the information electronically to the automated Software Technical Support database, or send a Software Problem Report (SPR) form to the Software Technical Support department. Figure D-1 shows an SPR form. You can order these forms from the CRI Distribution Center.

For additional SPR information, refer to the Software Problem Report (SPR) User's Guide, CRI publication SD-0235.
[Figure: three-part Software Problem Report form with fields for name, phone, mainframe, site code, date, software versions, title of problem, and SPR description. Send to: Cray Research, Inc., 1345 Northland Drive, Mendota Heights, MN 55120. Distribution: White - CRI file; Blue - SPR coordinator; Pink - AIC.]

Figure D-1. SPR Form

E. SYSTEM UTILITIES

This appendix briefly describes the UNICOS system utilities that have been identified as effective diagnostic tools. These utilities are as follows:

Utility       Description

dda(1)        The dda command (dynamic dump analyzer) allows you to examine the contents of a program memory dump.
icrash(1M)    The icrash command allows you to examine the I/O Subsystem (IOS) core image.

If you know of other system utilities that should be mentioned in this appendix, please use one of the following options to forward the information to the Technical Publications department:

• Call our Technical Publications department at (612) 681-5729 during the hours of 7:30 A.M. to 6:00 P.M. (Central Time).

• Send us electronic mail from a UNICOS or UNIX system, using the following UUCP addresses:

     uunet!cray!publications
     sun!tundra!hall!publications

• Send us electronic mail from a UNICOS or UNIX system, using the following ARPAnet address:

     publications@cray.com

• Send a facsimile of your comments to the attention of "Publications" at FAX number: (612) 681-5602

• Use the postage-paid Reader's Comment form at the back of this manual.

• Write to us at the following address:

     Cray Research, Inc.
     Technical Publications Department
     1345 Northland Drive
     Mendota Heights, Minnesota 55120

We value your comments and will respond to them promptly.

F. SITE COMMUNICATIONS

This appendix describes on-line diagnostic field support. This support includes the following:

• On-line diagnostic error dumps analysis
• On-line diagnostic formatted error output analysis
• On-line diagnostic installation, usage, and availability information

Please use one of the following options to forward inquiries to the On-line Diagnostic department:

• Call our On-line Diagnostic department at (612) 681-5642 during the hours of 8:00 A.M. to 5:00 P.M. (Central Time). From 5:00 P.M. to 8:00 A.M., you can leave a recorded message. Include the following information in your message:

     Your name
     Telephone number
     Site identification
     Operating system/release level
     On-line diagnostic release
     Failing on-line diagnostic
     Description of the problem

• Send us electronic mail from a UNICOS or UNIX system, using the following electronic mail address:

     oldiag@Crayamid

• Write to us at the following address:

     Cray Research, Inc.
     On-line Diagnostic Department
     1345 Northland Drive
     Mendota Heights, Minnesota 55120

G. INSTALLATION INFORMATION

Typically, the on-line diagnostics are installed as part of the system installation procedure documented in the UNICOS System Installation Bulletin (SIB). If you need to re-install the on-line diagnostics subsequent to system installation, a different procedure must be used. This appendix describes how to install the on-line diagnostics after system installation.
The following topics are discussed:

• On-line diagnostic directories
• Generating on-line diagnostic binaries and listings
• Saving off-line versions of on-line confidence tests and I/O Subsystem (IOS) deadstart programs
• Generating olnet
• Deleting proprietary source code

G.1 ON-LINE DIAGNOSTIC DIRECTORIES

The on-line diagnostics are located in the following directories:

Directory         Description

/usr/src/diag     Source code
/ce/bin           On-line diagnostic binaries
/ce/oldmon        Off-line diagnostic binaries for oldmon
/ce/olnet         olnet source code for front-end computer systems
/ce/scripts       runsequence scripts
/ce/log           Log directory for runsequence
/ce/ios           IOS deadstart programs for single IOS systems
/ce/iosa          IOS deadstart programs for two IOS systems
/ce/iosb

G.2 GENERATING ON-LINE DIAGNOSTIC BINARIES

Perform the following steps to generate on-line diagnostic binaries:

1. Load the on-line diagnostic tape. This tape is normally included with the UNICOS release package. If necessary, you can order another copy from the CRI Distribution Center.

2. Enter the following commands to execute the Makefile:

     cd /usr/src/diag
     update -p diagpl -q DIAGMAKE -c diag -a m
     mv diag.m diag.mk
     make -f diag.mk install SN=xxxx

   xxxx is your mainframe's serial number.

G.3 GENERATING ON-LINE DIAGNOSTIC LISTINGS

To generate the on-line diagnostic listings, enter the following commands:

     cd /usr/src/diag
     make -f diag.mk listings

NOTE

The listings include all on-line diagnostic test listings, off-line versions of CPU on-line test listings, and IOS deadstart and cleario test listings. The diagnostic listings are CRAY PROPRIETARY. Print the listings or write them to tape; do not keep the listings on-line.

G.4 SAVING OFF-LINE VERSIONS OF ON-LINE CONFIDENCE TESTS

This section describes where to save off-line versions of on-line confidence tests for Maintenance Workstation-based (MWS-based) systems running the Cray Maintenance System (CMS) or expander-based systems running DSS.

G.4.1 MWS-BASED SYSTEMS RUNNING CMS

Enter the following commands to copy the off-line confidence diagnostics to the MWS:

     rcp /ce/oldmon/offcrit mws:/CPUDIR
     rcp /ce/oldmon/offcsvc mws:/CPUDIR
     rcp /ce/oldmon/offcfpt mws:/CPUDIR
     rcp /ce/oldmon/offibuf mws:/CPUDIR
     rcp /ce/oldmon/offcm mws:/CPUDIR

CPUDIR is the directory on the MWS where the CPU off-line diagnostics reside. mws is the hostname for the MWS.

G.4.2 EXPANDER-BASED SYSTEMS RUNNING DSS

1. Enter the following commands to write the off-line confidence diagnostics to a scratch tape:

     extd -o -r -n 0 < /ce/oldmon/offcrit
     extd -o -r -n 1 < /ce/oldmon/offcsvc
     extd -o -r -n 2 < /ce/oldmon/offcfpt
     extd -o -r -n 3 < /ce/oldmon/offibuf
     extd -o -n 4 < /ce/oldmon/offcm

NOTE

Steps 2 and 3 cannot be performed while the operating system is running. Perform these steps the next time you shut down your system.

2. Copy the diagnostics to the off-line expander pack under FNT 4. To copy the diagnostics from the tape that was just written, enter the following commands under DSS:

     READ @ SCRIT 4
     READ @ SCSVC 4
     READ @ SCFPT 4
     READ @ SIBUF 4
     READ @ SCM 4

3. These off-line diagnostics are dependent on the latest off-line IOPPL release P2.0. This release of the Cray Maintenance Operating System (CMOS) allows diagnostics larger than 6000 words to be loaded and deadstarted. To load and execute these diagnostics, use the CMOS command DS L.

G.5 SAVING I/O SUBSYSTEM (IOS) DEADSTART PROGRAMS

This section describes where to save I/O Subsystem (IOS) deadstart programs for Operator Workstation (OWS), expander tape, or expander disk UNICOS.
G.5.1 OWS UNICOS

To copy the newly created dsdiag and cleario binaries to the OWS, enter the following commands:

     rcp /ce/ios/dsdiag ows:/IOSDIR
     rcp /ce/ios/dsdiag.ov ows:/IOSDIR
     rcp /ce/ios/cleario ows:/IOSDIR
     rcp /ce/ios/cleario.ov ows:/IOSDIR

IOSDIR is a site-specific parameter that indicates the location of the IOS kernel and overlays. ows is the hostname for the OWS.

The deadstart diagnostics should reside in the same OWS directory as the IOS kernel and overlays. Two IOS systems will store diagnostics in two OWS directories based on the IOS serial number.

NOTE

Two IOS systems store diagnostics in directories /ce/iosa/ and /ce/iosb/.

The deadstart diagnostic binaries are now saved on the OWS as files called dsdiag, dsdiag.ov, cleario, and cleario.ov.

G.5.2 EXPANDER TAPE UNICOS

Write the deadstart diagnostics to the same deadstart tape as the UNICOS kernel. To write the newly created deadstart diagnostic binaries to expander tape, enter the following commands:

     extd -o -r -n 7 < /ce/ios/cleario
     extd -o -r -n 8 < /ce/ios/dsdiag
     extd -o -n 9 < /ce/ios/dsdiag.ov

NOTE

Two IOS systems store diagnostics in directories /ce/iosa/ and /ce/iosb/.

The deadstart binaries are now saved on the expander tape as files called CLEARIO, DSDIAG, and DSDIAG.OV.

G.5.3 EXPANDER DISK UNICOS

To write the newly created dsdiag and cleario binaries to expander disk pack, enter the following commands:

     exdf -o /INSTALL/dsdiag < /ce/ios/dsdiag
     exdf -o /INSTALL/dsdiag.ov < /ce/ios/dsdiag.ov
     exdf -o /INSTALL/cleario < /ce/ios/cleario

INSTALL is a site-specific parameter that indicates the location of CLEARIO, DSDIAG, and DSDIAG.OV on an expander disk. The deadstart diagnostic binaries should reside in the same directory as the UNICOS kernel and overlays.

NOTE

Two IOS systems store diagnostics in directories /ce/iosa/ and /ce/iosb/.
The deadstart binaries are now saved on the expander disk pack as files called CLEARIO, DSDIAG, and DSDIAG.OV.

G.6 GENERATING olnet

This section describes how to generate olnet for computer systems with the following front-ends:

• IBM
• Sun Workstation
• Motorola workstation, OWS, or MWS

G.6.1 IBM FRONT-END

The following olnet build procedure is intended for sites with front-end computer systems running VM.

1. Transfer the following files created during the UNICOS build procedure:

     UNICOS Name    VM Name                Description

     olnet.vm.f     file name OLNET        olnet Fortran source code
                    file type FORTRAN
     driver.vm.a    file name OLFEIV       olnet driver (BAL code)
                    file type ASSEMBLE

Perform steps 2 through 6 from the CMS user environment:

2. Compile the olnet Fortran source code:

     FORTVS OLNET

3. Access the VM/SP macro libraries:

     LINK MAINT 194 194 RR     (a password may be required)
     ACCESS 194 B
     ACC 194 I
     GLOBAL MACLIB OSMACRO DMSSP DMKSP CMSLIB TSOMAC

4. Assemble the VM driver:

     ASSEMBLE OLFEIV
     REL B
     REL I

5. Link the olnet driver and source code modules to create an executable binary module named OLNET:

     GLOBAL TXTLIB VLNKMLIB VFORTLIB CMSLIB
     LOAD OLNET OLFEIV
     GENMOD OLNET

NOTE

The following step is required by the olnet licensing agreement.

6. Discard the following files:

     File Name    File Type

     OLNET        FORTRAN
     OLNET        TEXT
     OLFEIV       ASSEMBLE
     OLFEIV       TEXT
     LOAD         MAP

G.6.2 SUN WORKSTATION FRONT-END (NSC)

The following olnet NSC build procedure is intended for sites with Sun Workstation front-end computer systems.

1. Transfer the following files created during the UNICOS build procedure:

     UNICOS Name       Sun Name          Description

     olnet.sunnsc.f    olnet.sunnsc.f    olnet Fortran source code
     drv.sunnsc.c      drv.sunnsc.c      olnet driver (C code)

2. Compile the olnet Fortran source code and C driver:

     f77 -o olnet olnet.sunnsc.f drv.sunnsc.c

NOTE

The following step is required by the olnet licensing agreement.

3. Remove the following files:

     rm olnet.sunnsc.f
     rm olnet.sunnsc.o
     rm drv.sunnsc.c
     rm drv.sunnsc.o

G.6.3 SUN WORKSTATION FRONT-END (VME)

The following olnet VME build procedure is intended for sites with Sun Workstation front-end computer systems:

1. Transfer the following files created during the UNICOS build procedure:

     UNICOS Name       Sun Name          Description

     olnet.sunvme.f    olnet.sunvme.f    olnet Fortran source code
     drv.sunvme.c      drv.sunvme.c      olnet driver (C code)

2. Compile the olnet Fortran source code and C driver:

     f77 -o olnet olnet.sunvme.f drv.sunvme.c

NOTE

The following step is required by the olnet licensing agreement.

3. Remove the following files:

     rm olnet.sunvme.f
     rm olnet.sunvme.o
     rm drv.sunvme.c
     rm drv.sunvme.o

G.6.4 MOTOROLA WORKSTATION, OWS, OR MWS FRONT-END (VME)

The following olnet VME build procedure is intended for sites with Motorola workstation, OWS, or MWS front-end computer systems.

1. Transfer the following file created during the UNICOS build procedure:

     UNICOS Name    Sun Name       Description

     olnet.mot.c    olnet.mot.c    olnet C source code

2. Compile the olnet C source code and driver:

     cc -o olnet olnet.mot.c

NOTE

The following step is required by the olnet licensing agreement.

3. Discard the following files:

     rm olnet.mot.c
     rm olnet.mot.o

G.7 DELETING PROPRIETARY SOURCE CODE

The CRAY1PL, XMPPL, and DIAGPL libraries contain source code that is CRAY PROPRIETARY. Therefore, the program libraries, source code, binaries, and listings must not be maintained on system storage.
Remove the source code files, listings, binaries, and program libraries from system storage by entering the following commands:

     cd /usr/src/diag
     make -f diag.mk delete
     rm -f cray1pl xmppl diagpl cray1pl.mods xmppl.mods diagpl.mods

READER'S COMMENT FORM

CRAY Y-MP, CRAY X-MP EA, CRAY X-MP, and CRAY-1 Computer Systems UNICOS On-line Diagnostic Maintenance Manual, SMM-1012 C

Your reactions to this manual will help us provide you with better documentation. Please take a moment to check the spaces below, and use the blank space for additional comments.

1) Your experience with computers: __ 0-1 year __ 1-5 years __ 5+ years
2) Your experience with Cray computer systems: __ 0-1 year __ 1-5 years __ 5+ years
3) Your occupation: __ computer programmer __ non-computer professional __ other (please specify): ________
4) How you used this manual: __ in a class __ as a tutorial or introduction __ as a reference guide __ for troubleshooting

Using a scale from 1 (poor) to 10 (excellent), please rate this manual on the following criteria:

5) Accuracy __
6) Completeness __
7) Organization __
8) Physical qualities (binding, printing) __
9) Readability __
10) Amount and quality of examples __

Please use the space below, and an additional sheet if necessary, for your other comments about this manual.
If you have discovered any inaccuracies or omissions, please give us the page number on which the problem occurred. We promise a quick reply to your comments and questions.

Name ________  Title ________
Company ________  Telephone ________
Today's Date ________
Address ________
City ________  State/Country ________  Zip Code ________

BUSINESS REPLY CARD (no postage necessary if mailed in the United States)

     Cray Research, Inc.
     Attention: PUBLICATIONS
     1345 Northland Drive
     Mendota Heights, MN 55120