Page Count: 788
Step-by-Step Programming with Base SAS® Software

The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2001. Step-by-Step Programming with Base SAS® Software. Cary, NC: SAS Institute Inc.

Step-by-Step Programming with Base SAS® Software
Copyright © 2001 by SAS Institute Inc., Cary, NC, USA.
ISBN 978-1-58025-791-6

All rights reserved. Produced in the United States of America.

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

U.S. Government Restricted Rights Notice. Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19 Commercial Computer Software-Restricted Rights (June 1987).

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.

February 2007

SAS® Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hard-copy books, visit the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.

Contents

PART 1  Introduction to the SAS System
   Chapter 1  What Is the SAS System?
      Introduction to the SAS System
      Components of Base SAS Software
      Output Produced by the SAS System
      Ways to Run SAS Programs
      Running Programs in the SAS Windowing Environment
      Review of SAS Tools
      Learning More

PART 2  Getting Your Data into Shape
   Chapter 2  Introduction to DATA Step Processing
      Introduction to DATA Step Processing
      The SAS Data Set: Your Key to the SAS System
      How the DATA Step Works: A Basic Introduction
      Supplying Information to Create a SAS Data Set
      Review of SAS Tools
      Learning More
   Chapter 3  Starting with Raw Data: The Basics
      Introduction to Raw Data
      Examine the Structure of the Raw Data: Factors to Consider
      Reading Unaligned Data
      Reading Data That Is Aligned in Columns
      Reading Data That Requires Special Instructions
      Reading Unaligned Data with More Flexibility
      Mixing Styles of Input
      Review of SAS Tools
      Learning More
   Chapter 4  Starting with Raw Data: Beyond the Basics
      Introduction to Beyond the Basics with Raw Data
      Testing a Condition before Creating an Observation
      Creating Multiple Observations from a Single Record
      Reading Multiple Records to Create a Single Observation
      Problem Solving: When an Input Record Unexpectedly Does Not Have Enough Values
      Review of SAS Tools
      Learning More
   Chapter 5  Starting with SAS Data Sets
      Introduction to Starting with SAS Data Sets
      Understanding the Basics
      Input SAS Data Set for Examples
      Reading Selected Observations
      Reading Selected Variables
      Creating More Than One Data Set in a Single DATA Step
      Using the DROP= and KEEP= Data Set Options for Efficiency
      Review of SAS Tools
      Learning More

PART 3  Basic Programming
   Chapter 6  Understanding DATA Step Processing
      Introduction to DATA Step Processing
      Input SAS Data Set for Examples
      Adding Information to a SAS Data Set
      Defining Enough Storage Space for Variables
      Conditionally Deleting an Observation
      Review of SAS Tools
      Learning More
   Chapter 7  Working with Numeric Variables
      Introduction to Working with Numeric Variables
      About Numeric Variables in SAS
      Input SAS Data Set for Examples
      Calculating with Numeric Variables
      Comparing Numeric Variables
      Storing Numeric Variables Efficiently
      Review of SAS Tools
      Learning More
   Chapter 8  Working with Character Variables
      Introduction to Working with Character Variables
      Input SAS Data Set for Examples
      Identifying Character Variables and Expressing Character Values
      Setting the Length of Character Variables
      Handling Missing Values
      Creating New Character Values
      Saving Storage Space by Treating Numbers as Characters
      Review of SAS Tools
      Learning More
   Chapter 9  Acting on Selected Observations
      Introduction to Acting on Selected Observations
      Input SAS Data Set for Examples
      Selecting Observations
      Constructing Conditions
      Comparing Characters
      Review of SAS Tools
      Learning More
   Chapter 10  Creating Subsets of Observations
      Introduction to Creating Subsets of Observations
      Input SAS Data Set for Examples
      Selecting Observations for a New SAS Data Set
      Conditionally Writing Observations to One or More SAS Data Sets
      Review of SAS Tools
      Learning More
   Chapter 11  Working with Grouped or Sorted Observations
      Introduction to Working with Grouped or Sorted Observations
      Input SAS Data Set for Examples
      Working with Grouped Data
      Working with Sorted Data
      Review of SAS Tools
      Learning More
   Chapter 12  Using More Than One Observation in a Calculation
      Introduction to Using More Than One Observation in a Calculation
      Input File and SAS Data Set for Examples
      Accumulating a Total for an Entire Data Set
      Obtaining a Total for Each BY Group
      Writing to Separate Data Sets
      Using a Value in a Later Observation
      Review of SAS Tools
      Learning More
   Chapter 13  Finding Shortcuts in Programming
      Introduction to Shortcuts
      Input File and SAS Data Set
      Performing More Than One Action in an IF-THEN Statement
      Performing the Same Action for a Series of Variables
      Review of SAS Tools
      Learning More
   Chapter 14  Working with Dates in the SAS System
      Introduction to Working with Dates
      Understanding How SAS Handles Dates
      Input File and SAS Data Set for Examples
      Entering Dates
      Displaying Dates
      Using Dates in Calculations
      Using SAS Date Functions
      Comparing Durations and SAS Date Values
      Review of SAS Tools
      Learning More

PART 4  Combining SAS Data Sets
   Chapter 15  Methods of Combining SAS Data Sets
      Introduction to Combining SAS Data Sets
      Definition of Concatenating
      Definition of Interleaving
      Definition of Merging
      Definition of Updating
      Definition of Modifying
      Comparing Modifying, Merging, and Updating Data Sets
      Learning More
   Chapter 16  Concatenating SAS Data Sets
      Introduction to Concatenating SAS Data Sets
      Concatenating Data Sets with the SET Statement
      Concatenating Data Sets Using the APPEND Procedure
      Choosing between the SET Statement and the APPEND Procedure
      Review of SAS Tools
      Learning More
   Chapter 17  Interleaving SAS Data Sets
      Introduction to Interleaving SAS Data Sets
      Understanding BY-Group Processing Concepts
      Interleaving Data Sets
      Review of SAS Tools
      Learning More
   Chapter 18  Merging SAS Data Sets
      Introduction to Merging SAS Data Sets
      Understanding the MERGE Statement
      One-to-One Merging
      Match-Merging
      Choosing between One-to-One Merging and Match-Merging
      Review of SAS Tools
      Learning More
   Chapter 19  Updating SAS Data Sets
      Introduction to Updating SAS Data Sets
      Understanding the UPDATE Statement
      Understanding How to Select BY Variables
      Updating a Data Set
      Updating with Incremental Values
      Understanding the Differences between Updating and Merging
      Handling Missing Values
      Review of SAS Tools
      Learning More
   Chapter 20  Modifying SAS Data Sets
      Introduction
      Input SAS Data Set for Examples
      Modifying a SAS Data Set: The Simplest Case
      Modifying a Master Data Set with Observations from a Transaction Data Set
      Understanding How Duplicate BY Variables Affect File Update
      Handling Missing Values
      Review of SAS Tools
      Learning More
   Chapter 21  Conditionally Processing Observations from Multiple SAS Data Sets
      Introduction to Conditional Processing from Multiple SAS Data Sets
      Input SAS Data Sets for Examples
      Determining Which Data Set Contributed the Observation
      Combining Selected Observations from Multiple Data Sets
      Performing a Calculation Based on the Last Observation
      Review of SAS Tools
      Learning More

PART 5  Understanding Your SAS Session
   Chapter 22  Analyzing Your SAS Session with the SAS Log
      Introduction to Analyzing Your SAS Session with the SAS Log
      Understanding the SAS Log
      Locating the SAS Log
      Understanding the Log Structure
      Writing to the SAS Log
      Suppressing Information to the SAS Log
      Changing the Log's Appearance
      Review of SAS Tools
      Learning More
   Chapter 23  Directing SAS Output and the SAS Log
      Introduction to Directing SAS Output and the SAS Log
      Input File and SAS Data Set for Examples
      Routing the Output and the SAS Log with PROC PRINTTO
      Storing the Output and the SAS Log in the SAS Windowing Environment
      Redefining the Default Destination in a Batch or Noninteractive Environment
      Review of SAS Tools
      Learning More
   Chapter 24  Diagnosing and Avoiding Errors
      Introduction to Diagnosing and Avoiding Errors
      Understanding How the SAS Supervisor Checks a Job
      Understanding How SAS Processes Errors
      Distinguishing Types of Errors
      Diagnosing Errors
      Using a Quality Control Checklist
      Learning More

PART 6  Producing Reports
   Chapter 25  Producing Detail Reports with the PRINT Procedure
      Introduction to Producing Detail Reports with the PRINT Procedure
      Input File and SAS Data Sets for Examples
      Creating Simple Reports
      Creating Enhanced Reports
      Creating Customized Reports
      Making Your Reports Easy to Change
      Review of SAS Tools
      Learning More
   Chapter 26  Creating Summary Tables with the TABULATE Procedure
      Introduction to Creating Summary Tables with the TABULATE Procedure
      Understanding Summary Table Design
      Understanding the Basics of the TABULATE Procedure
      Input File and SAS Data Set for Examples
      Creating Simple Summary Tables
      Creating More Sophisticated Summary Tables
      Review of SAS Tools
      Learning More
   Chapter 27  Creating Detail and Summary Reports with the REPORT Procedure
      Introduction to Creating Detail and Summary Reports with the REPORT Procedure
      Understanding How to Construct a Report
      Input File and SAS Data Set for Examples
      Creating Simple Reports
      Creating More Sophisticated Reports
      Review of SAS Tools
      Learning More

PART 7  Producing Plots and Charts
   Chapter 28  Plotting the Relationship between Variables
      Introduction to Plotting the Relationship between Variables
      Input File and SAS Data Set for Examples
      Plotting One Set of Variables
      Enhancing the Plot
      Plotting Multiple Sets of Variables
      Review of SAS Tools
      Learning More
   Chapter 29  Producing Charts to Summarize Variables
      Introduction to Producing Charts to Summarize Variables
      Understanding the Charting Tools
      Input File and SAS Data Set for Examples
      Charting Frequencies with the CHART Procedure
      Customizing Frequency Charts
      Creating High-Resolution Histograms
      Review of SAS Tools
      Learning More

PART 8  Designing Your Own Output
   Chapter 30  Writing Lines to the SAS Log or to an Output File
      Introduction to Writing Lines to the SAS Log or to an Output File
      Understanding the PUT Statement
      Writing Output without Creating a Data Set
      Writing Simple Text
      Writing a Report
      Review of SAS Tools
      Learning More
   Chapter 31  Understanding and Customizing SAS Output: The Basics
      Introduction to the Basics of Understanding and Customizing SAS Output
      Understanding Output
      Input SAS Data Set for Examples
      Locating Procedure Output
      Making Output Informative
      Controlling Output Appearance
      Controlling the Appearance of Pages
      Representing Missing Values
      Review of SAS Tools
      Learning More
   Chapter 32  Understanding and Customizing SAS Output: The Output Delivery System (ODS)
      Introduction to Customizing SAS Output by Using the Output Delivery System
      Input Data Set for Examples
      Understanding ODS Output Formats and Destinations
      Selecting an Output Format
      Creating Formatted Output
      Selecting the Output That You Want to Format
      Customizing ODS Output
      Storing Links to ODS Output
      Review of SAS Tools
      Learning More

PART 9  Storing and Managing Data in SAS Files
   Chapter 33  Understanding SAS Data Libraries
      Introduction to Understanding SAS Data Libraries
      What Is a SAS Data Library?
      Accessing a SAS Data Library
      Storing Files in a SAS Data Library
      Referencing SAS Data Sets in a SAS Data Library
      Review of SAS Tools
      Learning More
   Chapter 34  Managing SAS Data Libraries
      Introduction
      Choosing Your Tools
      Understanding the DATASETS Procedure
      Looking at a PROC DATASETS Session
      Review of SAS Tools
      Learning More
   Chapter 35  Getting Information about Your SAS Data Sets
      Introduction to Getting Information about Your SAS Data Sets
      Input Data Library for Examples
      Requesting a Directory Listing for a SAS Data Library
      Requesting Contents Information about SAS Data Sets
      Requesting Contents Information in Different Formats
      Review of SAS Tools
      Learning More
   Chapter 36  Modifying SAS Data Set Names and Variable Attributes
      Introduction to Modifying SAS Data Set Names and Variable Attributes
      Input Data Library for Examples
      Renaming SAS Data Sets
      Modifying Variable Attributes
      Review of SAS Tools
      Learning More
   Chapter 37  Copying, Moving, and Deleting SAS Data Sets
      Introduction to Copying, Moving, and Deleting SAS Data Sets
      Input Data Libraries for Examples
      Copying SAS Data Sets
      Copying Specific SAS Data Sets
      Moving SAS Data Libraries and SAS Data Sets
      Deleting SAS Data Sets
      Deleting All Files in a SAS Data Library
      Review of SAS Tools
      Learning More

PART 10  Understanding Your SAS Environment
   Chapter 38  Introducing the SAS Environment
      Introduction to the SAS Environment
      Starting a SAS Session
      Selecting a SAS Processing Mode
      Review of SAS Tools
      Learning More
   Chapter 39  Using the SAS Windowing Environment
      Introduction to Using the SAS Windowing Environment
      Getting Organized
      Finding Online Help
      Using SAS Windowing Environment Command Types
      Working with SAS Windows
      Working with Text
      Working with Files
      Working with SAS Programs
      Working with Output
      Review of SAS Tools
      Learning More
   Chapter 40  Customizing the SAS Environment
      Introduction to Customizing the SAS Environment
      Customizing Your Current Session
      Customizing Session-to-Session Settings
      Customizing the SAS Windowing Environment
      Review of SAS Tools
      Learning More

PART 11  Appendix
   Appendix 1  Additional Data Sets
      Introduction
      Data Set CITY
      Raw Data Used for "Understanding Your SAS Session" Section
      Data Set SAT_SCORES
      Data Set YEAR_SALES
      Data Set HIGHLOW
      Data Set GRADES
      Data Sets for "Storing and Managing Data in SAS Files" Section

Glossary
Index


PART 1  Introduction to the SAS System
   Chapter 1  What Is the SAS System?


CHAPTER 1  What Is the SAS System?

   Introduction to the SAS System
   Components of Base SAS Software
      Overview of Base SAS Software
      Data Management Facility
      Programming Language
         Elements of the SAS Language
         Rules for SAS Statements
         Rules for Most SAS Names
         Special Rules for Variable Names
      Data Analysis and Reporting Utilities
   Output Produced by the SAS System
      Traditional Output
      Output from the Output Delivery System (ODS)
   Ways to Run SAS Programs
      Selecting an Approach
      SAS Windowing Environment
      SAS/ASSIST Software
      Noninteractive Mode
      Batch Mode
      Interactive Line Mode
   Running Programs in the SAS Windowing Environment
   Review of SAS Tools
      Statements
      Procedures
   Learning More

Introduction to the SAS System

SAS is an integrated system of software solutions that enables you to perform the following tasks:
- data entry, retrieval, and management
- report writing and graphics design
- statistical and mathematical analysis
- business forecasting and decision support
- operations research and project management
- applications development

How you use SAS depends on what you want to accomplish.
Some people use many of the capabilities of the SAS System, and others use only a few.

At the core of the SAS System is Base SAS software, which is the software product that you will learn to use in this documentation. This section presents an overview of Base SAS. It introduces the capabilities of Base SAS, addresses methods of running SAS, and outlines various types of output.

Components of Base SAS Software

Overview of Base SAS Software

Base SAS software contains the following:
- a data management facility
- a programming language
- data analysis and reporting utilities

Learning to use Base SAS enables you to work with these features of SAS. It also prepares you to learn other SAS products, because all SAS products follow the same basic rules.

Data Management Facility

SAS organizes data into a rectangular form or table that is called a SAS data set. The following figure shows a SAS data set. The data describes participants in a 16-week weight program at a health and fitness club. The data for each participant includes an identification number, name, team name, and weight (in U.S. pounds) at the beginning and end of the program.

Figure 1.1 Rectangular Form of a SAS Data Set

     IdNumber  Name             Team    StartWeight  EndWeight
  1  1023      David Shaw       red             189        165
  2  1049      Amelia Serrano   yellow          145        124
  3  1219      Alan Nance       red             210        192
  4  1246      Ravi Sinha       yellow          194        177
  5  1078      Ashley McKnight  red             127        118

(In the figure, callouts label a column as a variable, a row as an observation, and individual cells as data values.)

In a SAS data set, each row represents information about an individual entity and is called an observation. Each column represents the same type of information and is called a variable. Each separate piece of information is a data value. In a SAS data set, an observation contains all the data values for an entity; a variable contains the same type of data value for all entities.

To build a SAS data set with Base SAS, you write a program that uses statements in the SAS programming language. A SAS program that begins with a DATA statement and typically creates a SAS data set or a report is called a DATA step.

The following SAS program creates a SAS data set named WEIGHT_CLUB from the health club data:

data weight_club;
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
   Loss=StartWeight-EndWeight;
   datalines;
1023 David Shaw         red 189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red 210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red 127 118
;
run;

The following numbered list describes the parts of the preceding program in order:

1. The DATA statement tells SAS to begin building a SAS data set named WEIGHT_CLUB.
2. The INPUT statement identifies the fields to be read from the input data and names the SAS variables to be created from them (IdNumber, Name, Team, StartWeight, and EndWeight).
3. The third statement is an assignment statement. It calculates the weight each person lost and assigns the result to a new variable, Loss.
4. The DATALINES statement indicates that data lines follow.
5. The data lines follow the DATALINES statement. This approach to processing raw data is useful when you have only a few lines of data. (Later sections show ways to access larger amounts of data that are stored in files.)
6. The semicolon signals the end of the raw data and is a step boundary. It tells SAS that the preceding statements are ready for execution.

Note: By default, the data set WEIGHT_CLUB is temporary; that is, it exists only for the current job or session. For information about how to create a permanent SAS data set, see Chapter 2, "Introduction to DATA Step Processing," on page 19.

Programming Language

Elements of the SAS Language

The statements that created the data set WEIGHT_CLUB are part of the SAS programming language.
The SAS language contains statements, expressions, functions and CALL routines, options, formats, and informats, elements that many programming languages share. However, the way you use the elements of the SAS language depends on certain programming rules. The most important rules are listed in the next two sections.

Rules for SAS Statements

The conventions that are shown in the programs in this documentation, such as indenting of subordinate statements, extra spacing, and blank lines, are for the purpose of clarity and ease of use. They are not required by SAS. There are only a few rules for writing SAS statements:
- SAS statements end with a semicolon.
- You can enter SAS statements in lowercase, uppercase, or a mixture of the two.
- You can begin SAS statements in any column of a line and write several statements on the same line.
- You can begin a statement on one line and continue it on another line, but you cannot split a word between two lines.
- Words in SAS statements are separated by blanks or by special characters (such as the equal sign and the minus sign in the calculation of the Loss variable in the WEIGHT_CLUB example).

Rules for Most SAS Names

SAS names are used for SAS data set names, variable names, and other items. The following rules apply:
- A SAS name can contain from one to 32 characters.
- The first character must be a letter or an underscore (_).
- Subsequent characters must be letters, numbers, or underscores.
- Blanks cannot appear in SAS names.

Special Rules for Variable Names

For variable names only, SAS remembers the combination of uppercase and lowercase letters that you use when you create the variable name. Internally, the case of letters does not matter. "CAT," "cat," and "Cat" all represent the same variable. But for presentation purposes, SAS remembers the initial case of each letter and uses it to represent the variable name when printing it.
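
The statement and naming rules above can be illustrated with a short sketch. The two DATA steps below perform the same calculation; the second is deliberately written in mixed case, with several statements on one line and one statement continued on a second line. The data set names loss_demo and _loss_demo2 are hypothetical names chosen for this illustration and do not appear elsewhere in this documentation.

```sas
/* Conventional layout: one statement per line, indented for clarity. */
data loss_demo;
   StartWeight=189;
   EndWeight=165;
   Loss=StartWeight-EndWeight;
run;

/* Same logic in a legal but harder-to-read layout: mixed case,       */
/* several statements on one line, and one statement continued on a   */
/* second line. The line break falls between words, never inside one. */
/* The data set name starts with an underscore, which the naming      */
/* rules allow.                                                       */
DATA _loss_demo2; startweight=189; ENDWEIGHT=165;
   Loss=StartWeight
        -EndWeight; RUN;

/* StartWeight and startweight refer to the same variable. The column */
/* headings in this report use the case from each name's first        */
/* appearance in the step that created the data set.                  */
proc print data=_loss_demo2;
run;
```

Both steps are equally valid to SAS; the first layout is preferred only because it is easier to read and maintain.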

Data Analysis and Reporting Utilities

The SAS programming language is both powerful and flexible. You can program any number of analyses and reports with it. SAS can also simplify programming for you with its library of built-in programs known as SAS procedures. SAS procedures use data values from SAS data sets to produce preprogrammed reports, requiring minimal effort from you.

For example, the following SAS program produces a report that displays the values of the variables in the SAS data set WEIGHT_CLUB. Weight values are presented in U.S. pounds.

options linesize=80 pagesize=60 pageno=1 nodate;

proc print data=weight_club;
   title 'Health Club Data';
run;

This procedure, known as the PRINT procedure, displays the variables in a simple, organized form. The following output shows the results:

Output 1.1 Displaying the Values in a SAS Data Set

                                Health Club Data                               1

         Id                                    Start      End
Obs    Number    Name               Team       Weight    Weight    Loss

  1     1023     David Shaw         red          189       165      24
  2     1049     Amelia Serrano     yellow       145       124      21
  3     1219     Alan Nance         red          210       192      18
  4     1246     Ravi Sinha         yellow       194       177      17
  5     1078     Ashley McKnight    red          127       118       9

To produce a table showing mean starting weight, ending weight, and weight loss for each team, use the TABULATE procedure.

options linesize=80 pagesize=60 pageno=1 nodate;

proc tabulate data=weight_club;
   class team;
   var StartWeight EndWeight Loss;
   table team, mean*(StartWeight EndWeight Loss);
   title 'Mean Starting Weight, Ending Weight,';
   title2 'and Weight Loss';
run;

The following output shows the results:

Output 1.2 Table of Mean Values for Each Team

                      Mean Starting Weight, Ending Weight,                     1
                                and Weight Loss

-----------------------------------------------------------
|                  |                 Mean                 |
|                  |--------------------------------------|
|                  |StartWeight | EndWeight  |    Loss    |
|------------------+------------+------------+------------|
|Team              |            |            |            |
|------------------|            |            |            |
|red               |      175.33|      158.33|       17.00|
|------------------+------------+------------+------------|
|yellow            |      169.50|      150.50|       19.00|
-----------------------------------------------------------

A portion of a SAS program that begins with a PROC (procedure) statement and ends with a RUN statement (or is ended by another PROC or DATA statement) is called a PROC step. Both of the PROC steps that create the previous two outputs comprise the following elements:
- a PROC statement, which includes the word PROC, the name of the procedure you want to use, and the name of the SAS data set that contains the values. (If you omit the DATA= option and data set name, the procedure uses the SAS data set that was most recently created in the program.)
- additional statements that give SAS more information about what you want to do, for example, the CLASS, VAR, TABLE, and TITLE statements.
- a RUN statement, which indicates that the preceding group of statements is ready to be executed.

Output Produced by the SAS System

Traditional Output

A SAS program can produce some or all of the following kinds of output:

a SAS data set
   contains data values that are stored as a table of observations and variables.
   It also stores descriptive information about the data set, such as the names and arrangement of variables, the number of observations, and the creation date of the data set. A SAS data set can be temporary or permanent. The examples in this section create the temporary data set WEIGHT_CLUB.

the SAS log
   is a record of the SAS statements that you entered and of messages from SAS about the execution of your program. It can appear as a file on disk, a display on your monitor, or a hardcopy listing. The exact appearance of the SAS log varies according to your operating environment and your site. Output 1.3 shows a typical SAS log for the program in this section.

a report or simple listing
   ranges from a simple listing of data values to a subset of a large data set or a complex summary report that groups and summarizes data and displays statistics. The appearance of procedure output varies according to your site and the options that you specify in the program, but Output 1.1 and Output 1.2 illustrate typical procedure output. You can also use a DATA step to produce a completely customized report (see "Creating Customized Reports" on page 391).

other SAS files such as catalogs
   contain information that cannot be represented as tables of data values. Examples of items that can be stored in SAS catalogs include function key settings, letters that are produced by SAS/FSP software, and displays that are produced by SAS/GRAPH software.

external files or entries in other databases
   can be created and updated by SAS programs. SAS/ACCESS software enables you to create and update files that are stored in databases such as Oracle.

Output 1.3 Traditional Output: A SAS Log

NOTE: PROCEDURE PRINTTO used:
      real time           0.02 seconds
      cpu time            0.01 seconds

22
23   options pagesize=60 linesize=80 pageno=1 nodate;
24
25   data weight_club;
26      input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
27      Loss=StartWeight-EndWeight;
28      datalines;

NOTE: The data set WORK.WEIGHT_CLUB has 5 observations and 6 variables.
NOTE: DATA statement used:
      real time           0.14 seconds
      cpu time            0.07 seconds

34   ;
35
36
37   proc tabulate data=weight_club;
38      class team;
39      var StartWeight EndWeight Loss;
40      table team, mean*(StartWeight EndWeight Loss);
41      title 'Mean Starting Weight, Ending Weight,';
42      title2 'and Weight Loss';
43   run;

NOTE: There were 5 observations read from the data set WORK.WEIGHT_CLUB.
NOTE: PROCEDURE TABULATE used:
      real time           0.18 seconds
      cpu time            0.09 seconds

44   proc printto; run;

Output from the Output Delivery System (ODS)

The Output Delivery System (ODS) enables you to produce output in a variety of formats, such as
- an HTML file
- a traditional SAS Listing (monospace)
- a PostScript file
- an RTF file (for use with Microsoft Word)
- an output data set

The following figure illustrates the concept of output for SAS Version 8.

Figure 1.2 Model of the Production of ODS Output

[Figure: Data and a Table Definition (formatting instructions) combine to form an Output Object. The output object is sent to one or more ODS destinations (RTF, Output, Listing, HTML, Printer), which produce the corresponding ODS output: RTF output, SAS data sets, listing output, HTML output, and high-resolution printer output.]

The following definitions describe the terms in the preceding figure:

data
   Each procedure that supports ODS and each DATA step produces data, which contains the results (numbers and characters) of the step in a form similar to a SAS data set.

table definition
   The table definition is a set of instructions that describes how to format the data.
   This description includes but is not limited to
   - the order of the columns
   - text and order of column headings
   - formats for data
   - font sizes and font faces

output object
   ODS combines formatting instructions with the data to produce an output object. The output object, therefore, contains both the results of the procedure or DATA step and information about how to format the results. An output object has a name, a label, and a path.

   Note: Although many output objects include formatting instructions, not all do. In some cases the output object consists of only the data.

ODS destinations
   An ODS destination specifies a specific type of output. ODS supports a number of destinations, which include the following:

   RTF
      produces output that is formatted for use with Microsoft Word.
   Output
      produces a SAS data set.
   Listing
      produces traditional SAS output (monospace format).
   HTML
      produces output that is formatted in HyperText Markup Language (HTML). You can access the output on the web with your web browser.
   Printer
      produces output that is formatted for a high-resolution printer. An example of this type of output is a PostScript file.

ODS output
   ODS output consists of formatted output from any of the ODS destinations.

For more information about ODS output, see Chapter 23, "Directing SAS Output and the SAS Log," on page 349 and Chapter 32, "Understanding and Customizing SAS Output: The Output Delivery System (ODS)," on page 565.

For complete information about ODS, see SAS Output Delivery System: User's Guide.

Ways to Run SAS Programs

Selecting an Approach

There are several ways to run SAS programs. They differ in the speed with which they run, the amount of computer resources that are required, and the amount of interaction that you have with the program (that is, the kinds of changes you can make while the program is running).
The examples in this documentation produce the same results, regardless of the way you run the programs. However, in a few cases, the way that you run a program determines the appearance of output. The following sections briefly introduce different ways to run SAS programs.

SAS Windowing Environment

The SAS windowing environment enables you to interact with SAS directly through a series of windows. You can use these windows to perform common tasks, such as locating and organizing files, entering and editing programs, reviewing log information, viewing procedure output, setting options, and more. If needed, you can issue operating system commands from within this environment. Or, you can suspend the current SAS windowing environment session, enter operating system commands, and then resume the SAS windowing environment session at a later time.

Using the SAS windowing environment is a quick and convenient way to program in SAS. It is especially useful for learning SAS and developing programs on small test files. Although it uses more computer resources than other techniques, using the SAS windowing environment can save a lot of program development time.

For more information about the SAS windowing environment, see Chapter 39, "Using the SAS Windowing Environment," on page 655.

SAS/ASSIST Software

One important feature of SAS is the availability of SAS/ASSIST software. SAS/ASSIST provides a point-and-click interface that enables you to select the tasks that you want to perform. SAS then submits the SAS statements to accomplish those tasks. You do not need to know how to program in the SAS language in order to use SAS/ASSIST.

SAS/ASSIST works by submitting SAS statements just like the ones shown earlier in this section. In that way, it provides a number of features, but it does not represent the total functionality of SAS software.
If you want to perform tasks other than those that are available in SAS/ASSIST, you need to learn to program in SAS as described in this documentation.

Noninteractive Mode

In noninteractive mode, you prepare a file that contains SAS statements and any system statements that are required by your operating environment, and submit the program. The program runs immediately and occupies your current workstation session. You cannot continue to work in that session while the program is running,* and you usually cannot interact with the program.** The log and procedure output go to prespecified destinations, and you usually do not see them until the program ends. To modify the program or correct errors, you must edit and resubmit the program.

Noninteractive execution may be faster than batch execution because the computer system runs the program immediately rather than waiting to schedule your program among other programs.

Batch Mode

To run a program in batch mode, you prepare a file that contains SAS statements and any system statements that are required by your operating environment, and then you submit the program. You can then work on another task at your workstation. While you are working, the operating environment schedules your job for execution (along with jobs submitted by other people) and runs it. When execution is complete, you can look at the log and the procedure output.

The central feature of batch execution is that it is completely separate from other activities at your workstation. You do not see the program while it is running, and you cannot correct errors at the time they occur. The log and procedure output go to prespecified destinations; you can look at them only after the program has finished running. To modify the SAS program, you edit the program with the editor that is supported by your operating environment and submit a new batch job.

When sites charge for computer resources, batch processing is a relatively inexpensive way to execute programs.
It is particularly useful for large programs or when you need to use your workstation for other tasks while the program is executing. However, for learning SAS or developing and testing new programs, using batch mode might not be efficient.

* In a workstation environment, you can switch to another window and continue working.
** Limited ways of interaction are available. You can, for example, use the asterisk (*) option in a %INCLUDE statement in your program.

Interactive Line Mode

In an interactive line-mode session, you enter one line of a SAS program at a time, and SAS executes each DATA or PROC step automatically as soon as it recognizes the end of the step. You usually see procedure output immediately on your display monitor. Depending on your site’s computer system and on your workstation, you may be able to scroll backward and forward to see different parts of your log and procedure output, or you may lose them when they scroll off the top of your screen. There are limited facilities for modifying programs and correcting errors.

Interactive line-mode sessions use fewer computer resources than a windowing environment. If you use line mode, you should familiarize yourself with the %INCLUDE, %LIST, and RUN statements in SAS Language Reference: Dictionary.

Running Programs in the SAS Windowing Environment

You can run most programs in this documentation by using any of the methods that are described in the previous sections. This documentation uses the SAS windowing environment (as it appears on Windows and UNIX operating environments) when it is necessary to show programming within a SAS session. The SAS windowing environment appears differently depending on the operating environment that you use. For more information about the SAS windowing environment, see Chapter 39, “Using the SAS Windowing Environment,” on page 655.
The following example gives a brief overview of a SAS session that uses the SAS windowing environment. When you invoke SAS, the following windows appear.

Display 1.1 SAS Windowing Environment

The specific window placement, display colors, messages, and some other details vary according to your site, your monitor, and your operating environment. The window on the left side of the display is the SAS Explorer window, which you can use to assign and locate SAS libraries, files, and other items. The window at the top right is the Log window; it contains the SAS log for the session. The window at the bottom right is the Program Editor window. This window provides an editor in which you edit your SAS programs.

To create the program for the health and fitness club, type the statements in the Program Editor window. You can turn line numbers on or off to facilitate program creation. The following display shows the beginning of the program.

Display 1.2 Editing a Program in the Program Editor Window

When you fill the Program Editor window, scroll down to continue typing the program. When you finish editing the program, submit it to SAS and view the output. (If SAS does not create output, check the SAS log for error messages.) The following displays show the first and second pages of the Output window.

Display 1.3 The First Page of Output in the Output Window

Display 1.4 The Second Page of Output in the Output Window

After you finish viewing the output, you can return to the Program Editor window to begin creating a new program. By default, the output from all submissions remains in the Output window, and all statements that you submit remain in memory until the end of your session. You can view the output at any time, and you can recall previously submitted statements for editing and resubmitting. You can also clear a window of its contents.
All the commands that you use to move through the SAS windowing environment can be executed as words or as function keys. You can also customize the SAS windowing environment by determining which windows appear, as well as by assigning commands to function keys. For more information about customizing the SAS windowing environment, see Chapter 40, “Customizing the SAS Environment,” on page 693.

Review of SAS Tools

Statements

DATA SAS-data-set;
   begins a DATA step and tells SAS to begin creating a SAS data set. SAS-data-set names the data set that is being created.

%INCLUDE source(s);
   brings SAS programming statements, data lines, or both into a current SAS program.

RUN;
   tells SAS to begin executing the preceding group of SAS statements.

For more information, see Statements in SAS Language Reference: Dictionary.

Procedures

PROC procedure;
   begins a PROC step and tells SAS to invoke a particular SAS procedure to process the SAS data set that is specified in the DATA= option. If you omit the DATA= option, then the procedure processes the most recently created SAS data set in the program.

For more information about using procedures, see the Base SAS Procedures Guide.

Learning More

Basic SAS usage
   For an entry-level introduction to basic SAS programming language, see The Little SAS Book: A Primer, Second Edition.

DATA step
   For more information about how to create SAS data sets, see Chapter 2, “Introduction to DATA Step Processing,” on page 19.

DATA step processing
   For more information about DATA step processing, see Chapter 6, “Understanding DATA Step Processing,” on page 97.

For information about how to easily use the SAS environment, see Getting Started with the SAS System.

PART 2

Getting Your Data into Shape

Chapter 2   Introduction to DATA Step Processing
Chapter 3   Starting with Raw Data: The Basics
Chapter 4   Starting with Raw Data: Beyond the Basics
Chapter 5   Starting with SAS Data Sets

CHAPTER 2

Introduction to DATA Step Processing

Introduction to DATA Step Processing
Purpose
Prerequisites
The SAS Data Set: Your Key to the SAS System
Understanding the Function of the SAS Data Set
Understanding the Structure of the SAS Data Set
Temporary versus Permanent SAS Data Sets
Creating and Using Temporary SAS Data Sets
Creating and Using Permanent SAS Data Sets
Conventions That Are Used in This Documentation
How the DATA Step Works: A Basic Introduction
Overview of the DATA Step
During the Compile Phase
During the Execution Phase
Example of a DATA Step
The DATA Step
The Statements
The Process
Supplying Information to Create a SAS Data Set
Overview of Creating a SAS Data Set
Telling SAS How to Read the Data: Styles of Input
Reading Dates with Two-Digit and Four-Digit Year Values
Defining Variables in SAS
Indicating the Location of Your Data
Data Locations
Raw Data in the Job Stream
Data in an External File
Data in a SAS Data Set
Data in a DBMS File
Using External Files in Your SAS Job
Identifying an External File Directly
Referencing an External File with a Fileref
Review of SAS Tools
Statements
Learning More

Introduction to DATA Step Processing

Purpose

The DATA step is one of the basic building blocks of SAS programming. It creates the data sets that are used in a SAS program’s analysis and reporting procedures. Understanding the basic structure, functioning, and components of the DATA step is fundamental to learning how to create your own SAS data sets.
In this section, you will learn the following:
   - what a SAS data set is and why it is needed
   - how the DATA step works
   - what information you have to supply to SAS so that it can construct a SAS data set for you

Prerequisites

You should understand the concepts introduced in Chapter 1, “What Is the SAS System?,” on page 3 before continuing.

The SAS Data Set: Your Key to the SAS System

Understanding the Function of the SAS Data Set

SAS enables you to solve problems by providing methods to analyze or to process your data in some way. You need to first get the data into a form that SAS can recognize and process. After the data is in that form, you can analyze it and generate reports. The following figure shows this process in the simplest case.

Figure 2.1 From Raw Data to Final Analysis

You begin with raw data, that is, a collection of data that has not yet been processed by SAS. You use a set of statements known as a DATA step to get your data into a SAS data set. Then you can further process your data with additional DATA step programming or with SAS procedures.

In its simplest form, the DATA step can be represented by the three components that are shown in the following figure.

Figure 2.2 From Raw Data to a SAS Data Set

SAS processes input in the form of raw data and creates a SAS data set. When you have a SAS data set, you can use it as input to other DATA steps. The following figure shows the SAS statements that you can use to create a new SAS data set.

Figure 2.3 Using One SAS Data Set to Create Another

   input                   existing SAS data set
   DATA step statements    DATA statement;
                           SET, MERGE, MODIFY, or UPDATE;
                           more statements;
   output                  new SAS data set

Understanding the Structure of the SAS Data Set

Think of a SAS data set as a rectangular structure that identifies and stores data.
When your data is in a SAS data set, you can use additional DATA steps for further processing, or perform many types of analyses with SAS procedures.

The rectangular structure of a SAS data set consists of rows and columns in which data values are stored. The rows in a SAS data set are called observations, and the columns are called variables. In a raw data file, the rows are called records and the columns are called fields. Variables contain the data values for all of the items in an observation.

For example, the following figure shows a collection of raw data about participants in a health and fitness club. Each record contains information about one participant.

Figure 2.4 Raw Data from the Health and Fitness Club

The following figure shows how easily the health club records can be translated into parts of a SAS data set. Each record becomes an observation. In this case, each observation represents a participant in the program. Each field in the record becomes a variable. The variables represent each participant’s identification number, name, team name, and weight at the beginning and end of a 16-week program.

Figure 2.5 How Data Fits into a SAS Data Set

        IdNumber   Name              Team     StartWeight   EndWeight
   1    1023       David Shaw        red      189           165
   2    1049       Amelia Serrano    yellow   145           124
   3    1219       Alan Nance        red      210           192
   4    1246       Ravi Sinha        yellow   194           177
   5    1078       Ashley McKnight   red      127           118
   6    1221       Jim Brown         yellow   220           .

   Labels in the figure: variable (a column), observation (a row), data value (a cell), missing value (the period in observation 6).

In a SAS data set, every variable exists for every observation. What if you do not have all the data for each observation? If the raw data is incomplete because a value for the numeric variable EndWeight was not recorded for one observation, then this missing value is represented by a period that serves as a placeholder, as shown in observation 6 in the previous figure. (Missing values for character variables are represented by blanks.
Character and numeric variables are discussed later in this section.) By coding a value as missing, you can add an observation to the data set for which the data is incomplete and still retain the rectangular shape necessary for a SAS data set.

Along with data values, each SAS data set contains a descriptor portion, as illustrated in the following figure:

Figure 2.6 Parts of a SAS Data Set

The descriptor portion consists of details that SAS records about a data set, such as the names and attributes of all the variables, the number of observations in the data set, and the date and time that the data set was created and updated.

Operating Environment Information: Depending on your operating environment and the engine used to write the SAS data set, SAS may store additional information about a SAS data set in its descriptor portion. For more information, refer to the SAS documentation for your operating environment.

Temporary versus Permanent SAS Data Sets

Creating and Using Temporary SAS Data Sets

When you use a DATA step to create a SAS data set with a one-level name, you normally create a temporary SAS data set, one that exists only for the duration of your current session. SAS places this data set in a SAS data library referred to as WORK. In most operating environments, all files that SAS stores in the WORK library are deleted at the end of a session.

The following is an example of a DATA step that creates the temporary data set WEIGHT_CLUB.

data weight_club;
   input IdNumber Name $ 6-20 Team $ 22-27 StartWeight EndWeight;
   datalines;
1023 David Shaw      red    189 165
1049 Amelia Serrano  yellow 145 124
1219 Alan Nance      red    210 192
1246 Ravi Sinha      yellow 194 177
1078 Ashley McKnight red    127 118
1221 Jim Brown       yellow 220   .
;
run;

The preceding program code refers to the temporary data set as WEIGHT_CLUB. SAS, however, assigns the first-level name WORK to all temporary data sets, and refers to the WEIGHT_CLUB data set with its two-level name, WORK.WEIGHT_CLUB. The following output from the SAS log shows the name of the temporary data set.

Output 2.1 SAS Log: The WORK.WEIGHT_CLUB Temporary Data Set

162  data weight_club;
163     input IdNumber Name $ 6-20 Team $ 22-27 StartWeight EndWeight;
164     datalines;

NOTE: The data set WORK.WEIGHT_CLUB has 6 observations and 5 variables.

Because SAS assigns the first-level name WORK to all SAS data sets that have only a one-level name, you do not need to use WORK. You can refer to these temporary data sets with a one-level name, such as WEIGHT_CLUB.

To reference this SAS data set in a later DATA step or in a PROC step, you can use a one-level name:

proc print data = weight_club;
run;

Creating and Using Permanent SAS Data Sets

To create a permanent SAS data set, you must indicate a SAS data library other than WORK. (WORK is a reserved libref that SAS automatically assigns to a temporary SAS data library.) Use a LIBNAME statement to assign a libref to a SAS data library on your operating environment’s file system. The libref functions as a shorthand way of referring to a SAS data library. Here is the form of the LIBNAME statement:

LIBNAME libref 'your-data-library';

where

libref
   is a shortcut name to where your SAS files are stored. libref must be a valid SAS name. It must begin with a letter or an underscore, and it can contain uppercase and lowercase letters, numbers, or underscores. A libref has a maximum length of 8 characters.

'your-data-library'
   must be the physical name for your SAS data library. The physical name is the name that is recognized by the operating environment.

Operating Environment Information: Additional restrictions can apply to librefs and physical file names under some operating environments.
For more information, refer to the SAS documentation for your operating environment.

The following is an example of the LIBNAME statement that is used with a DATA step:

libname saveit 'your-data-library';      /* (1) */
data saveit.weight_club;                 /* (2) */
   ...more SAS statements...
;
proc print data = saveit.weight_club;    /* (3) */
run;

The following list corresponds to the numbered items:

(1) The LIBNAME statement associates the libref SAVEIT with your-data-library, where your-data-library is your operating environment’s name for a SAS data library.
(2) To create a new permanent SAS data set and store it in this SAS data library, you must use the two-level name SAVEIT.WEIGHT_CLUB in the DATA statement.
(3) To reference this SAS data set in a later DATA step or in a PROC step, you must use the two-level name SAVEIT.WEIGHT_CLUB in the PROC step.

For more information, see Chapter 33, “Understanding SAS Data Libraries,” on page 595.

Conventions That Are Used in This Documentation

Data sets that are used in examples are usually shown as temporary data sets specified with a one-level name:

data fitness;

In rare cases in this documentation, data sets are created as permanent SAS data sets. These data sets are specified with a two-level name, and a LIBNAME statement precedes each DATA step in which a permanent SAS data set is created:

libname saveit 'your-data-library';
data saveit.weight_club;

How the DATA Step Works: A Basic Introduction

Overview of the DATA Step

The DATA step consists of a group of SAS statements that begins with a DATA statement. The DATA statement begins the process of building a SAS data set and names the data set. The statements that make up the DATA step are compiled, and the syntax is checked. If the syntax is correct, then the statements are executed. In its simplest form, the DATA step is a loop with an automatic output and return action. The following figure illustrates the flow of action in a typical DATA step.
Figure 2.7 Flow of Action in a Typical DATA Step

Compile Phase
   compiles SAS statements (includes syntax checking)
   creates
      - an input buffer
      - a program data vector
      - descriptor information

Execution Phase
   begins with a DATA statement (counts iterations)
   sets variable values to missing in the program data vector
   data-reading statement: is there a record to read?
      YES  reads an input record
           executes additional executable statements
           writes an observation to the SAS data set
           returns to the beginning of the DATA step
      NO   closes data set; goes on to the next DATA or PROC step

During the Compile Phase

When you submit a DATA step for execution, SAS checks the syntax of the SAS statements and compiles them, that is, automatically translates the statements into machine code. SAS further processes the code, and creates the following three items:

input buffer
   is a logical area in memory into which SAS reads each record of data from a raw data file when the program executes. (When SAS reads from a SAS data set, however, the data is written directly to the program data vector.)

program data vector
   is a logical area of memory where SAS builds a data set, one observation at a time. When a program executes, SAS reads data values from the input buffer or creates them by executing SAS language statements. SAS assigns the values to the appropriate variables in the program data vector. From here, SAS writes the values to a SAS data set as a single observation.

   The program data vector also contains two automatic variables, _N_ and _ERROR_. The _N_ variable counts the number of times the DATA step begins to iterate. The _ERROR_ variable signals the occurrence of an error caused by the data during execution. These automatic variables are not written to the output data set.

descriptor information
   is information about each SAS data set, including data set attributes and variable attributes.
   SAS creates and maintains the descriptor information.

During the Execution Phase

All executable statements in the DATA step are executed once for each iteration. If your input file contains raw data, then SAS reads a record into the input buffer. SAS then reads the values in the input buffer and assigns the values to the appropriate variables in the program data vector. SAS also calculates values for variables created by program statements, and writes these values to the program data vector.

When the program reaches the end of the DATA step, three actions occur by default that make using the SAS language different from using most other programming languages:

   1. SAS writes the current observation from the program data vector to the data set.
   2. The program loops back to the top of the DATA step.
   3. Variables in the program data vector are reset to missing values.

Note: The following exceptions apply:
   - Variables that you specify in a RETAIN statement are not reset to missing values.
   - The automatic variables _N_ and _ERROR_ are not reset to missing.
For information about the RETAIN statement, see “Using a Value in a Later Observation” on page 196.

If there is another record to read, then the program executes again. SAS builds the second observation, and continues until there are no more records to read. The data set is then closed, and SAS goes on to the next DATA or PROC step.

Example of a DATA Step

The DATA Step

The following simple DATA step produces a SAS data set from the data collected for a health and fitness club.
As discussed earlier, the input data contains each participant’s identification number, name, team name, and weight at the beginning and end of a 16-week weight program:

data weight_club;                                               /* (1) */
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight; /* (2) */
   Loss = StartWeight - EndWeight;                              /* (3) */
   /* (4) */
   datalines;
1023 David Shaw          red    189 165
1049 Amelia Serrano      yellow 145 124
1219 Alan Nance          red    210 192
1246 Ravi Sinha          yellow 194 177
1078 Ashley McKnight     red    127 118
1221 Jim Brown           yellow 220   .
1095 Susan Stewart       blue   135 127
1157 Rosa Gomez          green  155 141
1331 Jason Schock        blue   187 172
1067 Kanoko Nagasaka     green  135 122
1251 Richard Rose        blue   181 166
1333 Li-Hwa Lee          green  141 129
1192 Charlene Armstrong  yellow 152 139
1352 Bette Long          green  156 137
1262 Yao Chen            blue   196 180
1087 Kim Sikorski        red    148 135
1124 Adrienne Fink       green  156 142
1197 Lynne Overby        red    138 125
1133 John VanMeter       blue   180 167
1036 Becky Redding       green  135 123
1057 Margie Vanhoy       yellow 146 132
1328 Hisashi Ito         red    155 142
1243 Deanna Hicks        blue   134 122
1177 Holly Choate        red    141 130
1259 Raoul Sanchez       green  189 172
1017 Jennifer Brooks     blue   138 127
1099 Asha Garg           yellow 148 132
1329 Larry Goss          yellow 188 174
;

The Statements

The following list corresponds to the numbered items in the preceding program:

(1) The DATA statement begins the DATA step and names the data set that is being created.
(2) The INPUT statement creates five variables, indicates how SAS reads the values from the input buffer, and assigns the values to variables in the program data vector.
(3) The assignment statement creates an additional variable called Loss, calculates the value of Loss during each iteration of the DATA step, and writes the value to the program data vector.
(4) The DATALINES statement marks the beginning of the input data. The single semicolon marks the end of the input data and the DATA step.

Note: A DATA step that does not contain a DATALINES statement must end with a RUN statement.
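The note above can be illustrated with a short sketch. It assumes a small version of WEIGHT_CLUB (three of the records shown earlier) and then builds a second data set from it with a SET statement. Because the second step has no DATALINES statement, it ends with RUN. The data set name CLUB_LOSS and the variable PctLoss are hypothetical names chosen only for this illustration.

```sas
/* Small version of WEIGHT_CLUB so that the sketch runs on its own. */
data weight_club;
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
   Loss = StartWeight - EndWeight;
   datalines;
1023 David Shaw          red    189 165
1049 Amelia Serrano      yellow 145 124
1219 Alan Nance          red    210 192
;

/* Hypothetical second step: reads a SAS data set, so there is no     */
/* DATALINES statement and the step is ended by the RUN statement.    */
data club_loss;
   set weight_club;                      /* read observations from WORK.WEIGHT_CLUB */
   PctLoss = 100 * Loss / StartWeight;   /* hypothetical variable: percentage lost  */
run;

proc print data=club_loss;
run;
```

In this sketch, the line that contains only a semicolon ends the first step, and the RUN statement ends the second.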
The Process

When you submit a DATA step for execution, SAS automatically compiles the DATA step and then executes it. At compile time, SAS creates the input buffer, program data vector, and descriptor information for the data set WEIGHT_CLUB. As the following figure shows, the program data vector contains the variables that are named in the INPUT statement, as well as the variable Loss. The values of the _N_ and the _ERROR_ variables are automatically generated for every DATA step. The _N_ automatic variable represents the number of times that the DATA step has iterated. The _ERROR_ automatic variable acts like a binary switch whose value is 0 if no errors exist in the DATA step, or 1 if one or more errors exist. These automatic variables are not written to the output data set.

All variable values, except _N_ and _ERROR_, are initially set to missing. Note that missing numeric values are represented by a period, and missing character values are represented by a blank.

Figure 2.8 Variable Values Initially Set to Missing

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7

Program Data Vector
IdNumber    Name                  Team      StartWeight    EndWeight    Loss
.                                           .              .            .

The syntax is correct, so the DATA step executes. As the following figure illustrates, the INPUT statement causes SAS to read the first record of raw data into the input buffer. Then, according to the instructions in the INPUT statement, SAS reads the data values in the input buffer and assigns them to variables in the program data vector.

Figure 2.9 Values Assigned to Variables by the INPUT Statement

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1023 David Shaw          red    189 165

Program Data Vector
IdNumber    Name                  Team      StartWeight    EndWeight    Loss
1023        David Shaw            red       189            165          .
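The state of the program data vector that these figures depict can also be observed in the SAS log. The following is a minimal sketch, not part of the original example: it uses PUT _ALL_ statements, which write every variable (including _N_ and _ERROR_) to the log, at three points in each iteration. The data set name PDV_DEMO is a hypothetical name, and only two of the records are used.

```sas
data pdv_demo;                        /* hypothetical data set name */
   put 'Top of iteration:';
   put _all_;                         /* values before a record is read */
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
   put 'After INPUT:';
   put _all_;                         /* values read from the input buffer */
   Loss = StartWeight - EndWeight;
   put 'After assignment:';
   put _all_;                         /* Loss now has a value */
   datalines;
1023 David Shaw          red    189 165
1049 Amelia Serrano      yellow 145 124
;
run;
```

Because the first PUT statements run before the INPUT statement, the log should also show a third "Top of iteration" entry: the step begins a third iteration and stops when the INPUT statement finds no more records.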
When SAS assigns values to all variables that are listed in the INPUT statement, SAS executes the next statement in the program:

Loss = StartWeight - EndWeight;

This assignment statement calculates the value for the variable Loss and writes that value to the program data vector, as the following figure shows.

Figure 2.10 Value Computed and Assigned to the Variable Loss

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1023 David Shaw          red    189 165

Program Data Vector
IdNumber    Name                  Team      StartWeight    EndWeight    Loss
1023        David Shaw            red       189            165          24

SAS has now reached the end of the DATA step, and the program automatically does the following:
   - writes the first observation to the data set
   - loops back to the top of the DATA step to begin the next iteration
   - increments the _N_ automatic variable by 1
   - resets the _ERROR_ automatic variable to 0
   - except for _N_ and _ERROR_, sets variable values in the program data vector to missing values, as the following figure shows

Figure 2.11 Values Set to Missing

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1023 David Shaw          red    189 165

Program Data Vector
IdNumber    Name                  Team      StartWeight    EndWeight    Loss
.                                           .              .            .

Execution continues. The INPUT statement looks for another record to read. If there are no more records, then SAS closes the data set and the system goes on to the next DATA or PROC step. In this example, however, more records exist and the INPUT statement reads the second record into the input buffer, as the following figure shows.

Figure 2.12 Second Record Is Read into the Input Buffer

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1049 Amelia Serrano      yellow 145 124

Program Data Vector
IdNumber    Name                  Team      StartWeight    EndWeight    Loss
.                                           .              .            .
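The reset to missing at the top of each iteration, and the RETAIN exception mentioned in "During the Execution Phase," can be seen in a small sketch. The data set RETAIN_DEMO and the variables Weight, FirstWeight, and NotRetained are hypothetical names used only for this illustration. Both new variables are assigned a value in the first iteration only.

```sas
data retain_demo;                     /* hypothetical data set name */
   retain FirstWeight;                /* FirstWeight keeps its value across iterations */
   input Weight;
   if _N_ = 1 then do;                /* assign both variables in the first iteration only */
      FirstWeight = Weight;
      NotRetained = Weight;
   end;
   datalines;
189
145
210
;
run;

proc print data=retain_demo;
run;
```

In the printed data set, FirstWeight holds 189 in all three observations, while NotRetained holds 189 in the first observation and a missing value (a period) in the second and third, because it is reset to missing when each new iteration begins.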
The following figure shows that SAS assigned values to the variables in the program data vector and calculated the value for the variable Loss, building the second observation just as it did the first one.

Figure 2.13 Results of Second Iteration of the DATA Step

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1049 Amelia Serrano      yellow 145 124

Program Data Vector
IdNumber    Name                  Team      StartWeight    EndWeight    Loss
1049        Amelia Serrano        yellow    145            124          21

This entire process continues until SAS detects the end of the file. The DATA step iterates as many times as there are records to read. Then SAS closes the data set WEIGHT_CLUB, and SAS looks for the beginning of the next DATA or PROC step.

Now that SAS has transformed the collected data from raw data into a SAS data set, it can be processed by a SAS procedure. The following output, produced with the PRINT procedure, shows the data set that has just been created.
proc print data=weight_club;
   title 'Fitness Center Weight Club';
run;

Output 2.2 PROC PRINT Output of the WEIGHT_CLUB Data Set

                   Fitness Center Weight Club                      1

         Id                                  Start       End
 Obs   Number   Name                Team     Weight    Weight    Loss

   1    1023    David Shaw          red        189       165      24
   2    1049    Amelia Serrano      yellow     145       124      21
   3    1219    Alan Nance          red        210       192      18
   4    1246    Ravi Sinha          yellow     194       177      17
   5    1078    Ashley McKnight     red        127       118       9
   6    1221    Jim Brown           yellow     220         .       .
   7    1095    Susan Stewart       blue       135       127       8
   8    1157    Rosa Gomez          green      155       141      14
   9    1331    Jason Schock        blue       187       172      15
  10    1067    Kanoko Nagasaka     green      135       122      13
  11    1251    Richard Rose        blue       181       166      15
  12    1333    Li-Hwa Lee          green      141       129      12
  13    1192    Charlene Armstrong  yellow     152       139      13
  14    1352    Bette Long          green      156       137      19
  15    1262    Yao Chen            blue       196       180      16
  16    1087    Kim Sikorski        red        148       135      13
  17    1124    Adrienne Fink       green      156       142      14
  18    1197    Lynne Overby        red        138       125      13
  19    1133    John VanMeter       blue       180       167      13
  20    1036    Becky Redding       green      135       123      12
  21    1057    Margie Vanhoy       yellow     146       132      14
  22    1328    Hisashi Ito         red        155       142      13
  23    1243    Deanna Hicks        blue       134       122      12
  24    1177    Holly Choate        red        141       130      11
  25    1259    Raoul Sanchez       green      189       172      17
  26    1017    Jennifer Brooks     blue       138       127      11
  27    1099    Asha Garg           yellow     148       132      16
  28    1329    Larry Goss          yellow     188       174      14

Supplying Information to Create a SAS Data Set

Overview of Creating a SAS Data Set

You supply SAS with specific information for reading raw data so that you can create a SAS data set from the raw data. You can use the data set for further processing, data analysis, or report writing. To process raw data in a DATA step, you must
   - use an INPUT statement to tell SAS how to read the data
   - define the variables and indicate whether they are character or numeric
   - specify the location of the raw data

Telling SAS How to Read the Data: Styles of Input

SAS provides many tools for reading raw data into a SAS data set.
These tools include three basic input styles as well as various format modifiers and pointer controls.

List input is used when each field in the raw data is separated by at least one space and does not contain embedded spaces. The INPUT statement simply contains a list of the variable names. List input, however, places numerous restrictions on your data. These restrictions are discussed in detail in Chapter 3, “Starting with Raw Data: The Basics,” on page 43. The following example shows list input. Note that there is at least one blank space between each data value.

data scores;
   input Name $ Test_1 Test_2 Test_3;
   datalines;
Bill 187 97 103
Carlos 156 76 74
Monique 99 102 129
;

Column input enables you to read the same data if it is located in fixed columns:

data scores;
   input Name $ 1-7 Test_1 9-11 Test_2 13-15 Test_3 17-19;
   datalines;
Bill    187  97 103
Carlos  156  76  74
Monique  99 102 129
;

Formatted input enables you to supply special instructions in the INPUT statement for reading data. For example, to read numeric data that contains special symbols, you need to supply SAS with special instructions so that it can read the data correctly. These instructions, called informats, are discussed in more detail in Chapter 3, “Starting with Raw Data: The Basics,” on page 43. In the INPUT statement, you can specify an informat to be used to read a data value, as in the example that follows:

data total_sales;
   input Date mmddyy10. +2 Amount comma5.;
   datalines;
09/05/2000  1,382
10/19/2000  1,235
11/30/2000  2,391
;

In this example, the MMDDYY10. informat for the variable Date tells SAS to interpret the raw data as a month, day, and year, ignoring the slashes. The COMMA5. informat for the variable Amount tells SAS to interpret the raw data as a number, ignoring the comma. The +2 is a pointer control that tells SAS where to look for the next item. For more information about pointer controls, see Chapter 3, “Starting with Raw Data: The Basics,” on page 43.
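To see what the informats in the formatted input example actually store, the following sketch prints TOTAL_SALES twice. It is an illustration added here, not part of the original example; the choice of the MMDDYY10. and COMMA8. formats for display is an assumption made for this sketch.

```sas
data total_sales;
   input Date mmddyy10. +2 Amount comma5.;
   datalines;
09/05/2000  1,382
10/19/2000  1,235
11/30/2000  2,391
;
run;

/* No FORMAT statement: Date prints as a SAS date value,      */
/* that is, a count of days relative to January 1, 1960.      */
proc print data=total_sales;
run;

/* Same stored values, displayed with formats chosen for this sketch. */
proc print data=total_sales;
   format Date mmddyy10. Amount comma8.;
run;
```

The comparison suggests why informats and formats are separate tools: the informat controls how raw text is converted to a stored value, and the format controls only how that stored value is displayed.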
SAS also enables you to mix these styles of input as required by the way values are arranged in the data records. Chapter 3, “Starting with Raw Data: The Basics,” on page 43 discusses in detail input styles (including their rules and restrictions), as well as additional data-reading tools.

Reading Dates with Two-Digit and Four-Digit Year Values

In the previous example, the year values in the dates in the raw data had four digits:

09/05/2000
10/19/2000
11/30/2000

However, SAS is also capable of reading two-digit year values (for example, 09/05/99). In that case, use the MMDDYY8. informat for the variable Date.

How does SAS know to which century a two-digit year belongs? SAS uses the value of the YEARCUTOFF= SAS system option. In Version 7 and later of SAS, the default value of the YEARCUTOFF= option is 1920. This means that two-digit years from 00 to 19 are assumed to be in the twenty-first century, that is, 2000 to 2019. Two-digit years from 20 to 99 are assumed to be in the twentieth century, that is, 1920 to 1999.

Note: The YEARCUTOFF= option and the default setting may be different at your site.

To avoid confusion, you should use four-digit year values in your raw data wherever possible. For more information, see the Dates, Times, and Intervals section of SAS Language Reference: Concepts.

Defining Variables in SAS

So far you have seen that the INPUT statement instructs SAS on how to read raw data lines. At the same time that the INPUT statement provides instructions for reading data, it defines the variables for the data set that come from the raw data. By assuming default values for variable attributes, the INPUT statement does much of the work for you.
Later in this documentation, you will learn other statements that enable you to define variables and assign attributes to variables, but this section and Chapter 3, “Starting with Raw Data: The Basics,” on page 43 concentrate on the use of the INPUT statement.

SAS variables can have these attributes:

- name
- type
- length
- informat
- format
- label
- position in observation
- index type

See the SAS Variables section of SAS Language Reference: Concepts for more information about variable attributes.

In an INPUT statement, you must supply each variable name. Unless you also supply an informat, the type is assumed to be numeric, and its length is assumed to be eight bytes. The following INPUT statement creates four numeric variables, each with a length of eight bytes, without requiring you to specify either type or length. The table summarizes this information.

input IdNumber Test_1 Test_2 Test_3;

Variable name   Type      Length
IdNumber        numeric   8
Test_1          numeric   8
Test_2          numeric   8
Test_3          numeric   8

The values of numeric variables can contain only numbers. To store values that contain alphabetic or special characters, you must create a character variable. By following a variable name in an INPUT statement with a dollar sign ($), you create a character variable. The default length of a character variable is also eight bytes. The following statement creates a data set that contains one character variable and four numeric variables, all with a default length of eight bytes. The table summarizes this information.

input IdNumber Name $ Test_1 Test_2 Test_3;

Variable name   Type        Length
IdNumber        numeric     8
Name            character   8
Test_1          numeric     8
Test_2          numeric     8
Test_3          numeric     8

In addition to specifying the types of variables in the INPUT statement, you can also specify the lengths of character variables. Character variables can be up to 32,767 bytes in length.
To specify the length of a character variable in an INPUT statement, you need to supply an informat or use column numbers. For example, following a variable name in the INPUT statement with the informat $20., or with column specifications such as 1-20, creates a character variable that is 20 bytes long.

Note that the length of numeric variables is not affected by informats or column specifications in an INPUT statement. See SAS Language Reference: Concepts for more information about numeric variables and lengths.

Two other variable attributes, format and label, affect how variable values and names are represented when they are printed or displayed. These attributes are assigned with different statements that you will learn about later.

Indicating the Location of Your Data

Data Locations

To create a SAS data set, you can read data from one of four locations:

- raw data in the data (job) stream, that is, following a DATALINES statement
- raw data in a file that you specify with an INFILE statement
- data from an existing SAS data set
- data in a database management system (DBMS) file

Raw Data in the Job Stream

You can place data directly in the job stream with the programming statements that make up the DATA step. The DATALINES statement tells SAS that raw data follows. The single semicolon that follows the last line of data marks the end of the data. The DATALINES statement and data lines must occur last in the DATA step statements:

data weight_club;
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
   Loss = StartWeight - EndWeight;
   datalines;
1023 David Shaw          red    189 165
1049 Amelia Serrano      yellow 145 124
1219 Alan Nance          red    210 192
1246 Ravi Sinha          yellow 194 177
1078 Ashley McKnight     red    127 118
;

Data in an External File

If your raw data is already stored in a file, then you do not have to bring that file into the data stream.
Use an INFILE statement to specify the file containing the raw data. (See “Using External Files in Your SAS Job” on page 38 for details about INFILE, FILE, and FILENAME statements.) The statements in the code that follows demonstrate the same example, this time showing that the raw data is stored in an external file:

data weight_club;
   infile 'your-input-file';
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
   Loss=StartWeight-EndWeight;
run;

Data in a SAS Data Set

You can also use data that is already stored in a SAS data set as input to a new data set. To read data from an existing SAS data set, you must specify the existing data set’s name in one of these statements:

- SET statement
- MERGE statement
- MODIFY statement
- UPDATE statement

For example, the statements that follow create a new SAS data set named RED that adds the variable LossPercent:

data red;
   set weight_club;
   LossPercent = Loss / StartWeight * 100;
run;

The SET statement indicates that the input data is already in the structure of a SAS data set and gives the name of the SAS data set to be read. In this example, the SET statement tells SAS to read the WEIGHT_CLUB data set in the WORK library.

Data in a DBMS File

If you have data that is stored in another vendor’s database management system (DBMS) files, then you can use SAS/ACCESS software to bring this data into a SAS data set. SAS/ACCESS software enables you to assign a libref to a library containing the DBMS file. In this example, a libref is declared, and points to a library containing Oracle data. SAS reads data from an Oracle file into a SAS data set:

libname dblib oracle user=scott password=tiger path='hrdept_002';

data employees;
   set dblib.employees;
run;

See SAS/ACCESS for Relational Databases: Reference for more information about using SAS/ACCESS software to access DBMS files.
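Of the four data locations described in this section, the job stream and an existing SAS data set can be tried together without any external file or DBMS. The following is a minimal sketch under that assumption: it rebuilds a five-record WEIGHT_CLUB data set from in-stream data, reads it with a SET statement as in the RED example, and prints the result. The title text is a hypothetical addition.

```sas
/* Step 1: create WEIGHT_CLUB from raw data in the job stream */
data weight_club;
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
   Loss = StartWeight - EndWeight;
   datalines;
1023 David Shaw          red    189 165
1049 Amelia Serrano      yellow 145 124
1219 Alan Nance          red    210 192
1246 Ravi Sinha          yellow 194 177
1078 Ashley McKnight     red    127 118
;

/* Step 2: read the existing SAS data set and add a computed variable */
data red;
   set weight_club;
   LossPercent = Loss / StartWeight * 100;
run;

/* Step 3: list the new data set (the title is illustrative) */
proc print data=red;
   title 'Weight Club with Percent Loss';
run;
```

Because no libref is given, both data sets are temporary and are stored in the WORK library.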
Using External Files in Your SAS Job

Your SAS programs often need to read raw data from a file, or write data or reports to a file that is not a SAS data set. To use a file that is not a SAS data set in a SAS program, you need to tell SAS where to find it. You can do the following:

- Identify the file directly in the INFILE, FILE, or other SAS statement that uses the file.
- Set up a fileref for the file by using the FILENAME statement, and then use the fileref in the INFILE, FILE, or other SAS statement.
- Use operating environment commands to set up a fileref, and then use the fileref in the INFILE, FILE, or other SAS statement.

The first two methods are described here. The third method depends on the operating environment that you use.

Operating Environment Information: For more information, refer to the SAS documentation for your operating environment.

Identifying an External File Directly

The simplest method for referring to an external file is to use the name of the file in the INFILE, FILE, or other SAS statement that needs to refer to the file.
For example, if your raw data is stored in a file in your operating environment, and you want to read the data using a SAS DATA step, you can tell SAS where to find the raw data by putting the name of the file in the INFILE statement:

data temp;
   infile 'your-input-file';
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;

The INFILE statement for this example may appear as follows for various operating environments:

Table 2.1 Example INFILE Statements for Various Operating Environments

Operating environment   INFILE statement example
z/OS                    infile 'fitness.weight.rawdata(club1)';
CMS                     infile 'club1 weight a';
OpenVMS                 infile '[fitness.weight.rawdata]club1.dat';
UNIX                    infile '/usr/local/fitness/club1.dat';
Windows                 infile 'c:\fitness\club1.dat';

Operating Environment Information: For more information, refer to the SAS documentation for your operating environment.

Referencing an External File with a Fileref

An alternate method for referencing an external file is to use the FILENAME statement to set up a fileref for a file. The fileref functions as a shorthand way of referring to an external file. You then use the fileref in later SAS statements that reference the file, such as the FILE or INFILE statement. The advantage of this method is that if the program contains many references to the same external file and the external filename changes, then the program needs to be modified in only one place, rather than in every place where the file is referenced.

Here is the form of the FILENAME statement:

FILENAME fileref 'your-input-or-output-file';

The fileref must be a valid SAS name, that is, it must

- begin with a letter or an underscore
- contain only letters, numbers, or underscores
- have no more than 8 characters.

Operating Environment Information: Additional restrictions may apply under some operating environments.
For more information, refer to the SAS documentation for your operating environment.

For example, you can reference the raw data that is stored in a file in your operating environment by first using the FILENAME statement to specify the name of the file and its fileref, and then using the INFILE statement with the same fileref to reference the file.

filename fitclub 'your-input-file';

data temp;
   infile fitclub;
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;

In this example, the INFILE statement stays the same for all operating environments. The FILENAME statement, however, can appear differently in different operating environments, as the following table shows:

Table 2.2 Example FILENAME Statements for Various Operating Environments

Operating environment   FILENAME statement example
z/OS                    filename fitclub 'fitness.weight.rawdata(club1)';
CMS                     filename fitclub 'club1 weight a';
OpenVMS                 filename fitclub '[fitness.weight.rawdata]club1.dat';
UNIX                    filename fitclub '/usr/local/fitness/club1.dat';
Windows                 filename fitclub 'c:\fitness\club1.dat';

If you need to use several files or members from the same directory, partitioned data set (PDS), or MACLIB, then you can use the FILENAME statement to create a fileref that identifies the name of the directory, PDS, or MACLIB. Then you can use the fileref in the INFILE statement and enclose the name of the file, PDS member, or MACLIB member in parentheses immediately after the fileref, as in this example:

filename fitclub 'directory-or-PDS-or-MACLIB';

data temp;
   infile fitclub(club1);
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;

data temp2;
   infile fitclub(club2);
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;

In this case, the INFILE statements stay the same for all operating environments.
The FILENAME statement, however, can appear differently for different operating environments, as the following table shows:

Table 2.3 Referencing Directories, PDSs, and MACLIBs in Various Operating Environments

Operating environment   FILENAME statement example
z/OS                    filename fitclub 'fitness.weight.rawdata';
CMS                     filename fitclub 'use1 maclib'; (see note 1)
OpenVMS                 filename fitclub '[fitness.weight.rawdata]';
UNIX                    filename fitclub '/usr/local/fitness';
Windows                 filename fitclub 'c:\fitness';

Note 1: Under CMS, the external file must be a CMS MACLIB, a CMS TXTLIB, or a z/OS PDS.

Review of SAS Tools

Statements

DATA SAS-data-set;
   tells SAS to begin creating a SAS data set. If you omit the libref, then SAS creates a temporary SAS data set. (SAS attaches the libref WORK for its internal processing.) If you give a previously defined libref as the first level of the name, then SAS stores the data set permanently in the library referenced by the libref. A SAS program or a portion of a program that begins with a DATA statement and ends with a RUN statement, another DATA statement, or a PROC statement is called a DATA step.

FILENAME fileref 'your-input-or-output-file';
   associates a fileref with an external file. Enclose the name of the external file in quotation marks.

INFILE fileref|'your-input-file';
   identifies an external file to be read by an INPUT statement. Specify a fileref that has been assigned with a FILENAME statement or with an appropriate operating environment command, or specify the actual name of the external file.

INPUT variable <$>;
   reads raw data using list input. At least one blank must occur between any two data values. The $ denotes a character variable.

INPUT variable <$> column-range;
   reads raw data that is aligned in columns. The $ denotes a character variable.

INPUT variable informat;
   reads raw data using formatted input. An informat supplies special instructions for reading the data.
LIBNAME libref 'your-SAS-data-library';
   associates a libref with a SAS data library. Enclose the name of the library in quotation marks. SAS locates a permanent SAS data set by matching the libref in a two-level SAS data set name with the library associated with that libref in a LIBNAME statement. The rules for creating a SAS data library depend on your operating environment.

Learning More

ATTRIBUTE statement
   For information about how the ATTRIBUTE statement enables you to assign attributes to variables, see SAS Language Reference: Dictionary.

DBMS access
   This documentation explains how to use SAS for reading files of raw data and SAS data sets and writing to SAS data sets. However, SAS documentation for SAS/ACCESS provides complete information about using SAS to read and write information stored in several types of database management system (DBMS) files.

Informats
   For a discussion about informats that you use with dates, see Chapter 14, “Working with Dates in the SAS System,” on page 211.

Length of variables
   For more information about how a variable’s length affects the values you can store in the variable, see Chapter 7, “Working with Numeric Variables,” on page 107 and Chapter 8, “Working with Character Variables,” on page 119.

LINESIZE= option
   For information about how to use the LINESIZE= option in an INPUT statement to limit how much of each data line the INPUT statement reads, see SAS Language Reference: Dictionary.

MERGE, MODIFY, or UPDATE statements
   In addition to the SET statement, you can read a SAS data set with the MERGE, MODIFY, or UPDATE statements. For more information, see Chapter 18, “Merging SAS Data Sets,” on page 269 and Chapter 19, “Updating SAS Data Sets,” on page 293.

SET statement
   For information about the SET statement, see Chapter 5, “Starting with SAS Data Sets,” on page 81.

USER= SAS system option
   You can specify the USER= SAS system option to use one-level names to point to permanent SAS files.
   (If you specify USER=WORK, then SAS assumes that files referenced with one-level names refer to temporary work files.) See the SAS System Options section in SAS Language Reference: Dictionary for details.

CHAPTER 3
Starting with Raw Data: The Basics

Introduction to Raw Data 44
Purpose 44
Prerequisites 44
Examine the Structure of the Raw Data: Factors to Consider 44
Reading Unaligned Data 44
Understanding List Input 44
Program: Basic List Input 45
Program: When the Data Is Delimited by Characters, Not Blanks 46
List Input: Points to Remember 46
Reading Data That Is Aligned in Columns 47
Understanding Column Input 47
Program: Reading Data Aligned in Columns 47
Understanding Some Advantages of Column Input over Simple List Input 48
Reading Embedded Blanks and Creating Longer Variables 48
Program: Skipping Fields When Reading Data Records 49
Column Input: Points to Remember 50
Reading Data That Requires Special Instructions 50
Understanding Formatted Input 50
Program: Reading Data That Requires Special Instructions 50
Understanding How to Control the Position of the Pointer 52
Formatted Input: Points to Remember 53
Reading Unaligned Data with More Flexibility 53
Understanding How to Make List Input More Flexible 53
Creating Longer Variables and Reading Numeric Data That Contains Special Characters 53
Reading Character Data That Contains Embedded Blanks 54
Mixing Styles of Input 55
An Example of Mixed Input 55
Understanding the Effect of Input Style on Pointer Location 56
Why You Can Get into Trouble by Mixing Input Styles 56
Pointer Location with Column and Formatted Input 56
Pointer Location with List Input 57
Review of SAS Tools 58
Statements 58
Column-Pointer Controls 59
Learning More 59

Introduction to Raw Data

Purpose

To create a SAS data set from raw data, you must examine the data records first to determine how the data values that you want to read are arranged.
Then you can look at the styles of reading input that are available in the INPUT statement. SAS provides three basic input styles:

- list
- column
- formatted

You can use these styles individually, in combination with each other, or in conjunction with various line-hold specifiers, line-pointer controls, and column-pointer controls. This section demonstrates various ways of using the INPUT statement to turn your raw data into SAS data sets.

You can enter the data directly in a DATA step or use an existing file of raw data. If your data is already machine readable, then you need to learn how to use the tools that enable SAS to read it. If your data is not yet entered, then you can choose the input style that enables you to enter the data most easily.

Prerequisites

You should understand the concepts presented in Chapter 1, “What Is the SAS System?,” on page 3 and Chapter 2, “Introduction to DATA Step Processing,” on page 19 before continuing.

Examine the Structure of the Raw Data: Factors to Consider

Before you can select the appropriate style of input, examine the structure of the raw data that you want to read. Consider some of the following factors:

- how the data is arranged in the input records (For example, are data fields aligned in columns or unaligned? Are they separated by blanks or by other characters?)
- whether character values contain embedded blanks
- whether numeric values contain non-numeric characters such as commas
- whether the data contains time or date values
- whether each input record contains data for more than one observation
- whether data for a single observation is spread over multiple input records

Reading Unaligned Data

Understanding List Input

The simplest form of the INPUT statement uses list input. List input is used to read data values that are separated by a delimiter character (by default, a blank space). With list input, SAS reads a data value until it encounters a blank space.
SAS assumes the value has ended and assigns the data to the appropriate variable in the program data vector. SAS continues to scan the record until it reaches a nonblank character again. It then reads the next data value until it encounters a blank space or the end of the input record.

Program: Basic List Input

This program uses the health and fitness club data from Chapter 2, “Introduction to DATA Step Processing,” on page 19 to illustrate a DATA step that uses list input in an INPUT statement.

data club1;
   input IdNumber Name $ Team $ StartWeight EndWeight;
   datalines;
1023 David red 189 165
1049 Amelia yellow 145 124
1219 Alan red 210 192
1246 Ravi yellow 194 177
1078 Ashley red 127 118
1221 Jim yellow 220 .
;

proc print data=club1;
   title 'Weight of Club Members';
run;

Note the following about the preceding program:

- The DATALINES statement marks the beginning of the data lines. The semicolon that follows the data lines marks the end of the data lines and the end of the DATA step.
- Each data value in the raw data record is separated from the next by at least one blank space. The last record contains a missing value, represented by a period, for the value of EndWeight.
- The variable names in the INPUT statement are specified in exactly the same order as the fields in the raw data records.

The output that follows shows the resulting data set. The PROC PRINT statement that follows the DATA step produces this listing.

Output 3.1 Data Set Created with List Input

Weight of Club Members

Obs  IdNumber  Name    Team    StartWeight  EndWeight
  1  1023      David   red     189          165
  2  1049      Amelia  yellow  145          124
  3  1219      Alan    red     210          192
  4  1246      Ravi    yellow  194          177
  5  1078      Ashley  red     127          118
  6  1221      Jim     yellow  220          .
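Because list input scans ahead to the next nonblank character, the values in a record do not have to line up. The following is a minimal sketch that reads three of the same records with deliberately uneven spacing; the data set name CLUB1_SPACED and the title are hypothetical.

```sas
data club1_spaced;
   input IdNumber Name $ Team $ StartWeight EndWeight;
   datalines;
1023 David red 189 165
1049    Amelia   yellow    145   124
1221 Jim yellow 220 .
;

/* All three records are read as five values each, regardless of spacing */
proc print data=club1_spaced;
   title 'List Input with Uneven Spacing';
run;
```

The period in the last record still matters: it holds the place of the missing EndWeight value so that list input does not look for a fifth value on the next line.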
Program: When the Data Is Delimited by Characters, Not Blanks

This program also uses the health and fitness club data, but notice that here the data is delimited by a comma instead of a blank space, the default delimiter.

options pagesize=60 linesize=80 pageno=1 nodate;

data club1;
   infile datalines dlm=',';
   input IdNumber Name $ Team $ StartWeight EndWeight;
   datalines;
1023,David,red,189,165
1049,Amelia,yellow,145,124
1219,Alan,red,210,192
1246,Ravi,yellow,194,177
1078,Ashley,red,127,118
1221,Jim,yellow,220,.
;

proc print data=club1;
   title 'Weight of Club Members';
run;

Note the following about the preceding program:

- These data values are separated by commas instead of blanks.
- List input, by default, scans the input records, looking for blank spaces to delimit each data value. The DLM= option enables list input to recognize a character, here a comma, as the delimiter.
- This example required the DLM= option, which is available only in the INFILE statement. Usually this statement is used only when the input data resides in an external file. The DATALINES specification, however, enables you to take advantage of INFILE statement options when you are reading data records from the job stream.

Output 3.2 Reading Data Delimited by Commas

Weight of Club Members

Obs  IdNumber  Name    Team    StartWeight  EndWeight
  1  1023      David   red     189          165
  2  1049      Amelia  yellow  145          124
  3  1219      Alan    red     210          192
  4  1246      Ravi    yellow  194          177
  5  1078      Ashley  red     127          118
  6  1221      Jim     yellow  220          .

List Input: Points to Remember

The points to remember when you use list input are:

- Use list input when each field is separated by at least one blank space or delimiter.
- Specify the fields in the order in which they appear in the records of raw data.
- Represent missing values by a placeholder such as a period.
  (Under the default behavior, a blank field causes the variable names and values to become mismatched.)
- Character values cannot contain embedded blanks.
- The default length of character variables is eight bytes. SAS truncates a longer value when it writes the value to the program data vector. (To read a character variable that contains more than eight characters with list input, use a LENGTH statement. See “Defining Enough Storage Space for Variables” on page 103.)
- Data must be in standard character or numeric format (that is, it can be read without an informat).

Note: List input requires the fewest specifications in the INPUT statement. However, the restrictions that are placed on the data may require that you learn to use other styles of input to read your data. For example, column input, which is discussed in the next section, is less restrictive. This section has introduced only simple list input. See “Understanding How to Make List Input More Flexible” on page 53 to learn about modified list input.

Reading Data That Is Aligned in Columns

Understanding Column Input

With column input, data values occupy the same fields within each data record. When you use column input in the INPUT statement, list the variable names and specify column positions that identify the location of the corresponding data fields. You can use column input when your raw data is in fixed columns and does not require the use of informats to be read.

Program: Reading Data Aligned in Columns

The following program also uses the health and fitness club data, but now two more data values are missing.
The data is aligned in columns and SAS reads the data with column input:

data club1;
   input IdNumber 1-4 Name $ 6-11 Team $ 13-18 StartWeight 20-22 EndWeight 24-26;
   datalines;
1023 David  red    189 165
1049 Amelia yellow 145
1219 Alan   red    210 192
1246 Ravi   yellow     177
1078 Ashley red    127 118
1221 Jim    yellow 220
;

proc print data=club1;
   title 'Weight Club Members';
run;

The specification that follows each variable name indicates the beginning and ending columns in which the variable value will be found. Note that with column input you are not required to indicate missing values with a placeholder such as a period.

The following output shows the resulting data set. Missing numeric values occur three times in the data set, and are indicated by periods.

Output 3.3 Data Set Created with Column Input

Weight Club Members

Obs  IdNumber  Name    Team    StartWeight  EndWeight
  1  1023      David   red     189          165
  2  1049      Amelia  yellow  145          .
  3  1219      Alan    red     210          192
  4  1246      Ravi    yellow  .            177
  5  1078      Ashley  red     127          118
  6  1221      Jim     yellow  220          .

Understanding Some Advantages of Column Input over Simple List Input

Here are several advantages of using column input:

- With column input, character variables can contain embedded blanks.
- Column input also enables the creation of variables that are longer than eight bytes. In the preceding example, the variable Name in the data set CLUB1 contains only the members’ first names. By using column input, you can read the first and last names as a single value. These differences between input styles are possible for two reasons:
   - Column input uses the columns that you specify to determine the length of character variables.
   - Column input, unlike list input, reads data until it reaches the last specified column, not until it reaches a blank space.
- Column input enables you to skip some data fields when reading records of raw data.
  It also enables you to read the data fields in any order and reread some fields or parts of fields.

Reading Embedded Blanks and Creating Longer Variables

This DATA step uses column input to create a new data set named CLUB2. The program still uses the health and fitness club weight data. However, the data has been modified to include members’ first and last names. Now the second data field in each record of raw data contains an embedded blank and is 18 bytes long.

data club2;
   input IdNumber 1-4 Name $ 6-23 Team $ 25-30 StartWeight 32-34 EndWeight 36-38;
   datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red    210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red    127 118
1221 Jim Brown          yellow 220
;

proc print data=club2;
   title 'Weight Club Members';
run;

The following output shows the resulting data set.

Output 3.4 Data Set Created with Column Input (Embedded Blanks)

Weight Club Members

Obs  IdNumber  Name             Team    StartWeight  EndWeight
  1  1023      David Shaw       red     189          165
  2  1049      Amelia Serrano   yellow  145          124
  3  1219      Alan Nance       red     210          192
  4  1246      Ravi Sinha       yellow  194          177
  5  1078      Ashley McKnight  red     127          118
  6  1221      Jim Brown        yellow  220          .

Program: Skipping Fields When Reading Data Records

Column input also enables you to skip over fields or to read the fields in any order. This example uses column input to read the same health and fitness club data, but it reads the value for the variable Team first and omits the variable IdNumber altogether.

You can read or reread part of a value when using column input. For example, because the team names begin with different letters, this program saves storage space by reading only the first character in the field that contains the team name.
Note the INPUT statement:

data club2;
   input Team $ 25 Name $ 6-23 StartWeight 32-34 EndWeight 36-38;
   datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red    210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red    127 118
1221 Jim Brown          yellow 220
;

proc print data=club2;
   title 'Weight Club Members';
run;

The following output shows the resulting data set. The variable that contains the identification number is no longer in the data set. Instead, Team is the first variable in the new data set, and it contains only one character to represent the team value.

Output 3.5 Data Set Created with Column Input (Skipping Fields)

Weight Club Members

Obs  Team  Name             StartWeight  EndWeight
  1  r     David Shaw       189          165
  2  y     Amelia Serrano   145          124
  3  r     Alan Nance       210          192
  4  y     Ravi Sinha       194          177
  5  r     Ashley McKnight  127          118
  6  y     Jim Brown        220          .

Column Input: Points to Remember

Remember the following rules when you use column input:

- Character variables can be up to 32,767 bytes (32KB) in length and are not limited to the default length of eight bytes.
- Character variables can contain embedded blanks.
- You can read fields in any order.
- A placeholder is not required to indicate a missing data value. A blank field is read as missing and does not cause other values to be read incorrectly.
- You can skip over part of the data in the data record.
- You can reread fields or parts of fields.
- You can read standard character and numeric data only. Informats are ignored.

Reading Data That Requires Special Instructions

Understanding Formatted Input

Sometimes the INPUT statement requires special instructions to read the data correctly. For example, SAS can read numeric data that is in special formats such as binary, packed decimal, or date/time. SAS can also read numeric values that contain special characters such as commas and currency symbols. In these situations, use formatted input.
Formatted input combines the features of column input with the ability to read nonstandard numeric or character values. The following values are examples of data that requires formatted input:

- 1,262
- $55.64
- 02JAN2003

Program: Reading Data That Requires Special Instructions

The data in this program includes numeric values that contain a comma, which is an invalid character for a numeric variable:

data january_sales;
   input Item $ 1-16 Amount comma5.;
   datalines;
trucks          1,382
vans            1,235
sedans          2,391
;

proc print data=january_sales;
   title 'January Sales in Thousands';
run;

The INPUT statement cannot read the values for the variable Amount as valid numeric values without the additional instructions provided by an informat. The informat COMMA5. enables the INPUT statement to read and store this data as a valid numeric value. The following figure shows that the informat COMMA5. instructs the program to read five characters of data (the comma counts as part of the length of the data), to remove the comma from the data, and to write the resulting numeric value to the program data vector. Note that the name of an informat always ends in a period (.).

Figure 3.1 Reading a Value with an Informat (COMMA5. informat)

The following figure shows that the data values are read into the input buffer exactly as they occur in the raw data records, but they are written to the program data vector (and then to the data set as an observation) as valid numeric values without any special characters.

Figure 3.2 Input Value Compared to Variable Value

Input Buffer
----+----1----+----2----+----3
trucks          1,382

Program Data Vector
Item      Amount
trucks    1382

The following output shows the resulting data set. The values for Amount contain only numbers. Note that the commas are removed.
Output 3.6  Data Set Created with Column and Formatted Input

        January Sales in Thousands        1

        Obs    Item      Amount

          1    trucks     1382
          2    vans       1235
          3    sedans     2391

In a report, you might want to include the comma in numeric values to improve readability. Just as the informat gives instructions on how to read a value and to remove the comma, a format gives instructions to add characters to variable values in the output. See “Writing Output without Creating a Data Set” on page 522 for an example.

Understanding How to Control the Position of the Pointer

As the INPUT statement reads data values, it uses an input pointer to keep track of the position of the data in the input buffer. Column-pointer controls provide additional control over pointer movement and are especially useful with formatted input. Column-pointer controls tell how far to advance the pointer before SAS reads the next value. In this example, SAS reads data lines with a combination of column and formatted input:

data january_sales;
   input Item $ 1-16 Amount comma5.;
   datalines;
trucks          1,382
vans            1,235
sedans          2,391
;

In the next example, SAS reads data lines by using formatted input with a column-pointer control:

data january_sales;
   input Item $10. @17 Amount comma5.;
   datalines;
trucks          1,382
vans            1,235
sedans          2,391
;

After SAS reads the first value for the variable Item, the pointer is left in the next position, column 11. The absolute column-pointer control, @17, then directs the pointer to move to column 17 in the input buffer. Now, it is in the correct position to read a value for the variable Amount.

In the following program, the relative column-pointer control, +6, instructs the pointer to move six columns to the right before SAS reads the next data value.

data january_sales;
   input Item $10. +6 Amount comma5.;
   datalines;
trucks          1,382
vans            1,235
sedans          2,391
;

The data in these two programs is aligned in columns. As with column input, you instruct the pointer to move from field to field. With column input you use column specifications; with formatted input you use the length that is specified in the informat together with pointer controls.

Formatted Input: Points to Remember

Remember the following rules when you use formatted input:
- SAS reads formatted input data until it has read the number of columns that the informat indicates. This method of reading the data is different from list input, which reads until a blank space (or other defined delimiter character) is reached.
- You can position the pointer to read the next value by using pointer controls.
- You can read data stored in nonstandard form such as packed decimal, or data that contains commas.
- You have the flexibility of using informats with all the features of column input, as described in “Column Input: Points to Remember” on page 50.

Reading Unaligned Data with More Flexibility

Understanding How to Make List Input More Flexible

While list input is the simplest to code, remember that it places restrictions on your data. By using format modifiers, you can take advantage of the simplicity of list input without the inconvenience of the usual restrictions. For example, you can use modified list input to do the following:
- Create character variables that are longer than the default length of eight bytes.
- Read numeric data with special characters like commas, dashes, and currency symbols.
- Read character data that contains embedded blanks.
- Read data values that can be stored as SAS date variables.
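The last item in the preceding list, reading date values, can be sketched as follows. This is a minimal illustration that assumes a hypothetical data set named hire_dates with invented names and dates. The colon format modifier applies the DATE9. informat to a value that is still delimited by blanks, as in simple list input, so the values do not need to be aligned in columns.

```sas
data hire_dates;
   /* : $12.   reads up to 12 characters, stopping at the next blank */
   /* : date9. converts text such as 02JAN2003 to a SAS date value   */
   input Name : $12. HireDate : date9.;
   format HireDate date9.;   /* print the date in readable form */
   datalines;
Serrano 02JAN2003
McKnight    15MAR2003
Shaw 07JUN2003
;

proc print data=hire_dates;
   title 'Hire Dates Read with Modified List Input';
run;
```

The extra blanks before the second date are deliberate: because this is list input, the pointer scans past them to the next nonblank column.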
Creating Longer Variables and Reading Numeric Data That Contains Special Characters

By simply modifying list input with the colon format modifier (:) you can read
- character data that contains more than eight characters
- numeric data that contains special characters.

To use the colon format modifier with list input, place the colon between the variable name and the informat. As in simple list input, at least one blank (or other defined delimiter character) must separate each value from the next, and character values cannot contain embedded blanks (or other defined delimiter characters). Consider this DATA step:

data january_sales;
   input Item : $12. Amount : comma5.;
   datalines;
Trucks 1,382
Vans 1,235
Sedans 2,391
SportUtility 987
;

proc print data=january_sales;
   title 'January Sales in Thousands';
run;

The variable Item has a length of 12, and the variable Amount requires an informat (in this case, COMMA5.) that removes commas from numbers so that they are read as valid numeric values. The data values are not aligned in columns as was required in the last example, which used formatted input to read the data.

The following output shows the resulting data set.

Output 3.7  Data Set Created with Modified List Input (: comma5.)

        January Sales in Thousands        1

        Obs    Item            Amount

          1    Trucks           1382
          2    Vans             1235
          3    Sedans           2391
          4    SportUtility      987

Reading Character Data That Contains Embedded Blanks

Because list input uses a blank space to determine where one value ends and the next one begins, values normally cannot contain blanks. However, with the ampersand format modifier (&) you can use list input to read data that contains single embedded blanks. The only restriction is that at least two blanks must divide each value from the next data value in the record.

To use the ampersand format modifier with list input, place the ampersand between the variable name and the informat.
The following DATA step uses the ampersand format modifier with list input to create the data set CLUB2. Note that the data is not in fixed columns; therefore, column input is not appropriate.

data club2;
   input IdNumber Name & $18. Team $ StartWeight EndWeight;
   datalines;
1023 David Shaw  red 189 165
1049 Amelia Serrano  yellow 145 124
1219 Alan Nance  red 210 192
1246 Ravi Sinha  yellow 194 177
1078 Ashley McKnight  red 127 118
1221 Jim Brown  yellow 220 .
;

proc print data=club2;
   title 'Weight Club Members';
run;

The character variable Name, with a length of 18, contains members’ first and last names separated by one blank space. The data lines must have two blank spaces between the values for the variable Name and the variable Team for the INPUT statement to correctly read the data.

The following output shows the resulting data set.

Output 3.8  Data Set Created with Modified List Input (& $18.)

                        Weight Club Members                        1

            Id                                    Start       End
   Obs    Number    Name               Team       Weight    Weight

     1     1023     David Shaw         red          189       165
     2     1049     Amelia Serrano     yellow       145       124
     3     1219     Alan Nance         red          210       192
     4     1246     Ravi Sinha         yellow       194       177
     5     1078     Ashley McKnight    red          127       118
     6     1221     Jim Brown          yellow       220         .

Mixing Styles of Input

An Example of Mixed Input

When you begin an INPUT statement in a particular style (list, column, or formatted), you are not restricted to using that style alone. You can mix input styles in a single INPUT statement as long as you mix them in a way that appropriately describes the raw data records. For example, this DATA step uses all three input styles:

data club1;
   input IdNumber                 /* (1) */
         Name $18.                /* (2) */
         Team $ 25-30             /* (3) */
         StartWeight EndWeight;   /* (1) */
   datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red    210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red    127 118
1221 Jim Brown          yellow 220   .
;

proc print data=club1;
   title 'Weight Club Members';
run;

The following list corresponds to the numbered items in the preceding program:
(1) The variables IdNumber, StartWeight, and EndWeight are read with list input.
(2) The variable Name is read with formatted input.
(3) The variable Team is read with column input.

The following output demonstrates that the data is read correctly.

Output 3.9  Data Set Created with Mixed Styles of Input

                        Weight Club Members                        1

            Id                                    Start       End
   Obs    Number    Name               Team       Weight    Weight

     1     1023     David Shaw         red          189       165
     2     1049     Amelia Serrano     yellow       145       124
     3     1219     Alan Nance         red          210       192
     4     1246     Ravi Sinha         yellow       194       177
     5     1078     Ashley McKnight    red          127       118
     6     1221     Jim Brown          yellow       220         .

Understanding the Effect of Input Style on Pointer Location

Why You Can Get into Trouble by Mixing Input Styles

CAUTION: When you mix styles of input in a single INPUT statement, you can get unexpected results if you do not understand where the input pointer is positioned after SAS reads a value in the input buffer. As the INPUT statement reads data values from the record in the input buffer, it uses a pointer to keep track of its position. Read the following sections so that you understand how the pointer movement differs between input styles before mixing multiple input styles in a single INPUT statement.

Pointer Location with Column and Formatted Input

With column and formatted input, you supply the instructions that determine the exact pointer location. With column input, SAS reads the columns that you specify in the INPUT statement. With formatted input, SAS reads the exact length that you specify with the informat. In both cases, the pointer moves as far as you instruct it and stops. The pointer is left in the column that immediately follows the last column that is read.

Here are two examples of input followed by an explanation of the pointer location.
The first DATA step shows column input:

data scores;
   input Team $ 1-6 Score 12-13;
   datalines;
red        59
blue       95
yellow     63
green      76
;

The second DATA step uses the same data to show formatted input:

data scores;
   input Team $6. +5 Score 2.;
   datalines;
red        59
blue       95
yellow     63
green      76
;

The following figure shows that the pointer is located in column 7 after the first value is read with either of the two previous INPUT statements.

Figure 3.3  Pointer Position: Column and Formatted Input

----+----1----+----2
red        59

Unlike list input, column and formatted input rely totally on your instructions to move the pointer and read the value for the second variable, Score. Column input uses column specifications to move the pointer to each data field. Formatted input uses informats and pointer controls to control the position of the pointer.

This INPUT statement uses column input with the column specifications 12-13 to move the pointer to column 12 and read the value for the variable Score:

input Team $ 1-6 Score 12-13;

This INPUT statement uses formatted input with the +5 column-pointer control to move the pointer to column 12. Then the value for the variable Score is read with the 2. numeric informat.

input Team $6. +5 Score 2.;

Without the use of a pointer control, which moves the pointer to the column where the value begins, this INPUT statement would attempt to read the value for Score in columns 7 and 8, which are blank.

Pointer Location with List Input

List input, on the other hand, uses a scanning method to determine the pointer location. With list input, the pointer reads until a blank is reached and then stops in the next column. To read the next variable value, the pointer moves automatically to the first nonblank column, discarding any leading blanks it encounters.
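The scanning behavior of list input also applies when list input follows another input style in the same INPUT statement. As a hedged sketch (the data set name scores_mixed is hypothetical), the following step reads Team with the $6. informat, which leaves the pointer in column 7, and then reads Score with list input, which scans forward to the first nonblank column without any pointer control:

```sas
data scores_mixed;
   /* Team:  formatted input, exactly 6 columns; pointer stops at column 7 */
   /* Score: list input; the pointer skips the blanks and reads the value  */
   input Team $6. Score;
   datalines;
red        59
blue       95
yellow     63
green      76
;

proc print data=scores_mixed;
   title 'Formatted Input Followed by List Input';
run;
```

Replacing Score with Score 2. in this sketch would turn the second read into formatted input, and SAS would then read the blank columns 7 and 8 as described earlier.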
Here is the same data that is read with list input:

data scores;
   input Team $ Score;
   datalines;
red 59
blue 95
yellow 63
green 76
;

The following figure shows that the pointer is located in column 5 after the value red is read. Because Score, the next variable, is read with list input, the pointer scans for the next nonblank space before it begins to read a value for Score. Unlike column and formatted input, you do not have to explicitly move the pointer to the beginning of the next field in list input.

Figure 3.4  Pointer Position: List Input

----+----1----+----2
red 59

Review of SAS Tools

Statements

DATALINES;
   indicates that data lines immediately follow the DATALINES statement. A semicolon in the line that immediately follows the last data line indicates the end of the data and causes the DATA step to compile and execute.

INFILE DATALINES DLM='character';
   identifies the source of the input records as data lines in the job stream rather than as an external file. When your program contains the input data, the data lines directly follow the DATALINES statement. Because you can specify DATALINES in the INFILE statement, you can take advantage of many data-reading options that are available only through the INFILE statement. The DLM= option specifies the character that is used to separate data values in the input records. By default, a blank space denotes the end of a data value. This option is useful when you want to use list input to read data records in which a character other than a blank separates data values.

INPUT variable <&> <$>;
   reads the input data record using list input. The & (ampersand format modifier) enables character values to contain embedded blanks. When you use the ampersand format modifier, two blanks are required to signal the end of a data value. The $ indicates a character variable.

INPUT variable start-column <-end-column>;
   reads the input data record using column input.
   You can omit end-column if the data is only 1 byte long. This style of input enables you to skip columns of data that you want to omit.

INPUT variable : informat;
INPUT variable & informat;
   read the input data record using modified list input. The : (colon format modifier) instructs SAS to use the informat that follows to read the data value. The & (ampersand format modifier) instructs SAS to use the informat that follows to read the data value. When you use the ampersand format modifier, two blanks are required to signal the end of a data value.

INPUT variable informat;
   reads raw data using formatted input. The informat supplies special instructions to read the data. You can also use a pointer control to direct SAS to start reading at a particular column.

The syntax given above for the three styles of input shows only one variable. Subsequent variables in the INPUT statement may or may not be described in the same input style as the first one. You may use any of the three styles of input (list, column, and formatted) in a single INPUT statement.

Column-Pointer Controls

@n
   moves the pointer to the nth column in the input buffer.

+n
   moves the pointer forward n columns in the input buffer.

/
   moves the pointer to the next line in the input buffer.

#n
   moves the pointer to the nth line in the input buffer.

Learning More

Advanced features
   For some more advanced data-reading features, see Chapter 4, “Starting with Raw Data: Beyond the Basics,” on page 61.

Character-delimited data
   For more information about reading data that is delimited by a character other than a blank space, see the DELIMITER= option in the INFILE statement in SAS Language Reference: Dictionary.

Pointer controls
   For a complete discussion and listing of column-pointer controls, line-pointer controls, and line-hold specifiers, see SAS Language Reference: Dictionary.
Types of input
   For more information about the INPUT statement, see SAS Language Reference: Dictionary.

CHAPTER 4
Starting with Raw Data: Beyond the Basics

Introduction to Beyond the Basics with Raw Data  61
Purpose  61
Prerequisites  62
Testing a Condition before Creating an Observation  62
Creating Multiple Observations from a Single Record  63
Using the Double Trailing @ Line-Hold Specifier  63
Understanding How the Double Trailing @ Affects DATA Step Execution  64
Reading Multiple Records to Create a Single Observation  67
How the Data Records Are Structured  67
Method 1: Using Multiple Input Statements  67
Method 2: Using the / Line-Pointer Control  69
Reading Variables from Multiple Records in Any Order  70
Understanding How the #n Line-Pointer Control Affects DATA Step Execution  71
Problem Solving: When an Input Record Unexpectedly Does Not Have Enough Values  74
Understanding the Default Behavior  74
Methods of Control: Your Options  75
Four Options: FLOWOVER, STOPOVER, MISSOVER, and TRUNCOVER  75
Understanding the MISSOVER Option  76
Understanding the TRUNCOVER Option  77
Review of SAS Tools  77
Column-Pointer Controls  77
Line-Hold Specifiers  78
Statements  78
Learning More  79

Introduction to Beyond the Basics with Raw Data

Purpose

To create a SAS data set from raw data, you often need more than the most basic features.
In this section, you will learn advanced features for reading raw data that include the following:
- how to understand and then control what happens when a value is unexpectedly missing in an input record
- how to read a record more than once so that you may test a condition before taking action on the current record
- how to create multiple observations from a single input record
- how to read multiple records to create a single observation

Prerequisites

You should understand the concepts presented in Chapter 1, “What Is the SAS System?,” on page 3 and Chapter 2, “Introduction to DATA Step Processing,” on page 19 before continuing.

Testing a Condition before Creating an Observation

Sometimes you need to read a record and hold that record in the input buffer while you test for a specified condition before a decision can be made about further processing. The ability to hold a record so that you can read from it again, if necessary, is useful when you need to test for a condition before SAS creates an observation from a data record. To do this, you can use the trailing at-sign (@).

For example, to create a SAS data set that is a subset of a larger group of records, you might need to test for a condition to decide if a particular record will be used to create an observation. The trailing at-sign placed before the semicolon at the end of an INPUT statement instructs SAS to hold the current data line in the input buffer. This makes the data line available for a subsequent INPUT statement. Otherwise, the next INPUT statement causes SAS to read a new record into the input buffer.

You can set up the process to read each record twice by following these steps:
1. Use an INPUT statement to read a portion of the record.
2. Use a trailing @ at the end of the INPUT statement to hold the record in the input buffer for the execution of the next INPUT statement.
3. Use an IF statement on the portion that is read in to test for a condition.
4. If the condition is met, use another INPUT statement to read the remainder of the record to create an observation.
5. If the condition is not met, the record is released and control passes back to the top of the DATA step.

To read from a record twice, you must prevent SAS from automatically placing a new record into the input buffer when the next INPUT statement executes. Use of a trailing @ in the first INPUT statement serves this purpose. The trailing @ is one of two line-hold specifiers that enable you to hold a record in the input buffer for further processing.

For example, the health and fitness club data contains information about all members. This DATA step creates a SAS data set that contains only members of the red team:

data red_team;
   input Team $ 13-18 @;                                   /* (1) */
   if Team='red';                                          /* (2) */
   input IdNumber 1-4 StartWeight 20-22 EndWeight 24-26;   /* (3) */
   datalines;
1023 David  red    189 165
1049 Amelia yellow 145 124
1219 Alan   red    210 192
1246 Ravi   yellow 194 177
1078 Ashley red    127 118
1221 Jim    yellow 220   .
;
/* (4) */

proc print data=red_team;
   title 'Red Team';
run;

In this DATA step, these actions occur:
(1) The INPUT statement reads a record into the input buffer, reads a data value from columns 13 through 18, and assigns that value to the variable Team in the program data vector. The single trailing @ holds the record in the input buffer.
(2) The IF statement enables the current iteration of the DATA step to continue only when the value for Team is red. When the value is not red, the current iteration stops and SAS returns to the top of the DATA step, resets values in the program data vector to missing, and releases the held record from the input buffer.
(3) The INPUT statement executes only when the value of Team is red. It reads the remaining data values from the record held in the input buffer and assigns values to the variables IdNumber, StartWeight, and EndWeight.
(4) The record is released from the input buffer when the program returns to the top of the DATA step.

The following output shows the resulting data set:

Output 4.1  Subset Data Set Created with Trailing @

                     Red Team                     1

                      Id       Start       End
   Obs    Team      Number     Weight    Weight

     1    red        1023       189       165
     2    red        1219       210       192
     3    red        1078       127       118

Creating Multiple Observations from a Single Record

Using the Double Trailing @ Line-Hold Specifier

Sometimes you may need to create multiple observations from a single record of raw data. One way to tell SAS how to read such a record is to use the other line-hold specifier, the double trailing at-sign (@@ or “double trailing @”). The double trailing @ not only prevents SAS from reading a new record into the input buffer when a new INPUT statement is encountered, but it also prevents the record from being released when the program returns to the top of the DATA step. (Remember that the trailing @ does not hold a record in the input buffer across iterations of the DATA step.)

For example, this DATA step uses the double trailing @ in the INPUT statement:

data body_fat;
   input Gender $ PercentFat @@;
   datalines;
m 13.3 f 22
m 22 f 23.2
m 16 m 12
;

proc print data=body_fat;
   title 'Results of Body Fat Testing';
run;

The following output shows the resulting data set:

Output 4.2  Data Set Created with Double Trailing @

        Results of Body Fat Testing        1

                           Percent
        Obs    Gender        Fat

          1      m           13.3
          2      f           22.0
          3      m           22.0
          4      f           23.2
          5      m           16.0
          6      m           12.0

Understanding How the Double Trailing @ Affects DATA Step Execution

To understand how the data records in the previous example were read, look at the data lines that were used in the previous DATA step:

m 13.3 f 22
m 22 f 23.2
m 16 m 12

Each record contains the raw data for two observations instead of one. Consider this example in terms of the flow of the DATA step, as explained in Chapter 2, “Introduction to DATA Step Processing,” on page 19.
When SAS reaches the end of the DATA step, it returns to the top of the program and begins the next iteration, executing until there are no more records to read. Each time it returns to the top of the DATA step and executes the INPUT statement, it automatically reads a new record into the input buffer. Without a line-hold specifier, therefore, the second set of data values in each record would never be read:

m 13.3 f 22
m 22 f 23.2
m 16 m 12

To allow the second set of data values in each record to be read, the double trailing @ tells SAS to hold the record in the input buffer. Each record is held in the input buffer until the end of the record is reached. The program does not automatically place the next record into the input buffer each time the INPUT statement is executed, and the current record is not automatically released when it returns to the top of the DATA step. As a result, the pointer location is maintained on the current record, which enables the program to read each value in that record. Each time the DATA step completes an iteration, an observation is written to the data set.

The next five figures demonstrate what happens in the input buffer when a double trailing @ appears in the INPUT statement, as in this example:

input Gender $ PercentFat @@;

The first figure shows that all values in the program data vector are set to missing. The INPUT statement reads the first record into the input buffer. The program begins to read values from the current pointer location, which is the beginning of the input buffer.

Figure 4.1  First Iteration: First Record Is Read

Input Buffer
----+----1----+----2
m 13.3 f 22

Program Data Vector
Gender    PercentFat
                   .

The following figure shows that the value m is written to the program data vector. When the pointer reaches the blank space that follows 13.3, the complete value for the variable PercentFat has been read.
The pointer stops in the next column, and the value 13.3 is written to the program data vector.

Figure 4.2  First Observation Is Created

Input Buffer
----+----1----+----2
m 13.3 f 22

Program Data Vector
Gender    PercentFat
m               13.3

There are no other variables in the INPUT statement and no more statements in the DATA step, so three actions take place:
1. The first observation is written to the data set.
2. The DATA step begins its next iteration.
3. The values in the program data vector are set to missing.

The following figure shows the current position of the pointer. SAS is ready to read the next piece of data in the same record.

Figure 4.3  Second Iteration: First Record Remains in the Input Buffer

Input Buffer
----+----1----+----2
m 13.3 f 22

Program Data Vector
Gender    PercentFat
                   .

The following figure shows that the INPUT statement reads the next two values from the input buffer and writes them to the program data vector.

Figure 4.4  Second Observation Is Created

Input Buffer
----+----1----+----2
m 13.3 f 22

Program Data Vector
Gender    PercentFat
f                 22

When the DATA step completes the second iteration, the values in the program data vector are written to the data set as the second observation. Then the DATA step begins its third iteration. Values in the program data vector are set to missing, and the INPUT statement executes. The pointer, which is now at column 13 (two columns to the right of the last data value that was read), continues reading. Because this is list input, the pointer scans for the next nonblank character to begin reading the next value. When the pointer reaches the end of the input buffer and fails to find a nonblank character, SAS reads a new record into the input buffer.

The final figure shows that values for the third observation are read from the beginning of the second record.
Figure 4.5  Third Iteration: Second Record Is Read into the Input Buffer

Input Buffer
----+----1----+----2
m 22 f 23.2

Program Data Vector
Gender    PercentFat
                   .

The process continues until SAS reads all the records. The resulting SAS data set contains six observations instead of three.

Note: Although this program successfully reads all of the data in the input records, SAS writes a message to the log noting that the program had to go to a new line.

Reading Multiple Records to Create a Single Observation

How the Data Records Are Structured

An earlier example (see “Reading Character Data That Contains Embedded Blanks” on page 54) shows data in which all the values for each observation are contained in a single record of raw data:

1023 David Shaw red 189 165

This INPUT statement reads all the data values arranged across a single record:

input IdNumber 1-4 Name $ 6-23 Team $ StartWeight EndWeight;

Now, consider the opposite situation: when information for a single observation is not contained in a single record of raw data but is scattered across several records. For example, the health and fitness club data could be constructed in such a way that the information about a single member is spread across several records instead of in a single record:

1023 David Shaw
red
189 165

Method 1: Using Multiple Input Statements

Multiple INPUT statements, one for each record, can read each record into a single observation, as in this example:

input IdNumber 1-4 Name $ 6-23;
input Team $ 1-6;
input StartWeight 1-3 EndWeight 5-7;

To understand how to use multiple INPUT statements, consider what happens as a DATA step executes. Remember that one record is read into the input buffer automatically as each INPUT statement is encountered during each iteration.
SAS reads the data values from the input buffer and writes them to the program data vector as variable values. At the end of the DATA step, all the variable values in the program data vector are written automatically as a single observation.

This example uses multiple INPUT statements in a DATA step to read only selected data fields and create a data set containing only the variables IdNumber, StartWeight, and EndWeight.

data club2;
   input IdNumber 1-4;                    /* (1) */
   input;                                 /* (2) */
   input StartWeight 1-3 EndWeight 5-7;   /* (3) */
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220 .
;

proc print data=club2;
   title 'Weight Club Members';
run;

The following list corresponds to the numbered items in the preceding program:
(1) The first INPUT statement reads only one data field in the first record and assigns a value to the variable IdNumber.
(2) The second INPUT statement, without arguments, is a null INPUT statement that reads the second record into the input buffer. However, it does not assign a value to a variable.
(3) The third INPUT statement reads the third record into the input buffer and assigns values to the variables StartWeight and EndWeight.

The following output shows the resulting data set:

Output 4.3  Data Set Created with Multiple INPUT Statements

           Weight Club Members           1

            Id       Start       End
   Obs    Number     Weight    Weight

     1     1023       189       165
     2     1049       145       124
     3     1219       210       192
     4     1246       194       177
     5     1078       127       118
     6     1221       220         .

Method 2: Using the / Line-Pointer Control

Writing a separate INPUT statement for each record is not the only way to create a single observation. You can write a single INPUT statement and use the slash (/) line-pointer control. The slash line-pointer control forces a new record into the input buffer and positions the pointer at the beginning of that record.
This example uses only one INPUT statement to read multiple records:

data club2;
   input IdNumber 1-4 / /
         StartWeight 1-3 EndWeight 5-7;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220 .
;

proc print data=club2;
   title 'Weight Club Members';
run;

The / line-pointer control appears exactly where a new INPUT statement begins in the previous example (see “Method 1: Using Multiple Input Statements” on page 67). The sequence of events in the input buffer and the program data vector as this DATA step executes is identical to the previous example in method 1. The / is the signal to read a new record into the input buffer, which happens automatically when the DATA step encounters a new INPUT statement. The preceding example shows two slashes (/ /), indicating that SAS skips a record. SAS reads the first record, skips the second record, and reads the third record.

The following output shows the resulting data set:

Output 4.4  Data Set Created with the / Line-Pointer Control

           Weight Club Members           1

            Id       Start       End
   Obs    Number     Weight    Weight

     1     1023       189       165
     2     1049       145       124
     3     1219       210       192
     4     1246       194       177
     5     1078       127       118
     6     1221       220         .

Reading Variables from Multiple Records in Any Order

You can also read multiple records to create a single observation by pointing to a specific record in a set of input records with the #n line-pointer control. As you saw in the last section, the advantage of using the / line-pointer control over multiple INPUT statements is that it requires fewer statements. However, using the #n line-pointer control enables you to read the variables in any order, no matter which record contains the data values. It is also useful if you want to skip data lines.
This example uses one INPUT statement to read multiple data lines in a different order:

data club2;
   input #2 Team $ 1-6
         #1 Name $ 6-23 IdNumber 1-4
         #3 StartWeight 1-3 EndWeight 5-7;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220 .
;

proc print data=club2;
   title 'Weight Club Members';
run;

The following output shows the resulting data set:

Output 4.5 Data Set Created with the #n Line-Pointer Control

Weight Club Members                                 1

                                  Id   Start     End
Obs  Team    Name             Number  Weight  Weight
  1  red     David Shaw         1023     189     165
  2  yellow  Amelia Serrano     1049     145     124
  3  red     Alan Nance         1219     210     192
  4  yellow  Ravi Sinha         1246     194     177
  5  red     Ashley McKnight    1078     127     118
  6  yellow  Jim Brown          1221     220       .

The order of the observations is the same as in the raw records (shown in the section “Reading Variables from Multiple Records in Any Order” on page 70). However, the order of the variables in the data set differs from the order of the variables in the raw input data records. This occurs because the order of the variables in the INPUT statements corresponds with their order in the resulting data sets.

Understanding How the #n Line-Pointer Control Affects DATA Step Execution

To understand the importance of the #n line-pointer control, remember the sequence of events in the DATA steps that demonstrate the / line-pointer control and multiple INPUT statements. Each record is read into the input buffer sequentially. The data is read, and then a / or a new INPUT statement causes the program to read the next record into the input buffer. It is impossible for the program to read a value from the first record after a value from the second record is read because the data in the first record is no longer available in the input buffer.
To solve this problem, use the #n line-pointer control. The #n line-pointer control signals the program to create a multiple-line input buffer so that all the data for a single observation is available while the observation is being built in the program data vector. The #n line-pointer control also identifies the record in which data for each variable appears. To use the #n line-pointer control, the raw data must have the same number of records for each observation; for example, it cannot have three records for one observation and two for the next.

When the program compiles and builds the input buffer, it looks at the INPUT statement and creates an input buffer with as many lines as are necessary to contain the number of records it needs to read for a single observation. In this example, the highest number of records specified is three, so the input buffer is built to contain three records at one time. The following figures demonstrate the flow of the DATA step in this example.

This figure shows that the values are set to missing in the program data vector and that the INPUT statement reads the first three records into the input buffer.

Figure 4.6 Three Records Are Read into the Input Buffer as a Single Observation

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1023 David Shaw
----+----1----+----2----+----3----+----4----+----5----+----6
red
----+----1----+----2----+----3----+----4----+----5----+----6
189 165

Program Data Vector
Team   Name         IdNumber   StartWeight   EndWeight
                           .             .           .

The INPUT statement for this example is as follows:

input #2 Team $ 1-6
      #1 Name $ 6-23 IdNumber 1-4
      #3 StartWeight 1-3 EndWeight 5-7;

The first variable is preceded by #2 to indicate that the value in the second record is assigned to the variable Team.
The following figure shows that the pointer advances to the second line in the input buffer, reads the value, and writes it to the program data vector.

Figure 4.7 Reading from the Second Record First

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1023 David Shaw
----+----1----+----2----+----3----+----4----+----5----+----6
red
----+----1----+----2----+----3----+----4----+----5----+----6
189 165

Program Data Vector
Team   Name         IdNumber   StartWeight   EndWeight
red                        .             .           .

The following figure shows that the pointer then moves to the sixth column in the first record, reads a value, and assigns it to the variable Name in the program data vector. It then moves to the first column to read the ID number, and assigns it to the variable IdNumber.

Figure 4.8 Reading from the First Record

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1023 David Shaw
----+----1----+----2----+----3----+----4----+----5----+----6
red
----+----1----+----2----+----3----+----4----+----5----+----6
189 165

Program Data Vector
Team   Name         IdNumber   StartWeight   EndWeight
red    David Shaw       1023             .           .

The following figure shows that the process continues with the pointer moving to the third record in the first observation. Values are read and assigned to StartWeight and EndWeight, the last variable that is listed.

Figure 4.9 Reading from the Third Record

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1023 David Shaw
----+----1----+----2----+----3----+----4----+----5----+----6
red
----+----1----+----2----+----3----+----4----+----5----+----6
189 165

Program Data Vector
Team   Name         IdNumber   StartWeight   EndWeight
red    David Shaw       1023           189         165

When the bottom of the DATA step is reached, variable values in the program data vector are written as an observation to the data set.
The DATA step returns to the top, and values in the program data vector are set to missing. The INPUT statement executes again. The final figure shows that the next three records are read into the input buffer, ready to create the second observation.

Figure 4.10 Reading the Next Three Records into the Input Buffer

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1049 Amelia Serrano
----+----1----+----2----+----3----+----4----+----5----+----6
yellow
----+----1----+----2----+----3----+----4----+----5----+----6
145 124

Program Data Vector
Team   Name         IdNumber   StartWeight   EndWeight
                           .             .           .

Problem Solving: When an Input Record Unexpectedly Does Not Have Enough Values

Understanding the Default Behavior

When a DATA step reads raw data from an external file, problems can occur when SAS encounters the end of an input line before reading in data for all variables specified in the INPUT statement. This problem can occur when reading variable-length records and/or records containing missing values.

The following is an example of an external file that contains variable-length records:

----+-----1-----+-----2
22
333
4444
55555

This DATA step uses the numeric informat 5. to read a single field in each record of raw data and to assign values to the variable TestNumber:

data numbers;
   infile 'your-external-file';
   input TestNumber 5.;
run;

proc print data=numbers;
   title 'Test DATA Step';
run;

The DATA step reads the first value (22). Because the value is shorter than the 5 characters expected by the informat, the DATA step attempts to finish filling the value with the next record (333). This value is entered into the PDV and becomes the value of the TestNumber variable for the first observation.
The DATA step then goes to the next record, but encounters the same problem because the value (4444) is shorter than the value that is expected by the informat. Again, the DATA step goes to the next record, reads the value (55555), and assigns that value to the TestNumber variable for the second observation.

The following output shows the results. After this program runs, the SAS log contains a note to indicate the places where SAS went to the next record to search for data values.

Output 4.6 Reading Raw Data Past the End of a Line: Default Behavior

Test DATA Step                1

       Test
Obs  Number
  1     333
  2   55555

Methods of Control: Your Options

Four Options: FLOWOVER, STOPOVER, MISSOVER, and TRUNCOVER

To control how SAS behaves after it attempts to read past the end of a data line, you can use the following options in the INFILE statement:

infile 'your-external-file' flowover;
   is the default behavior. The DATA step simply reads the next record into the input buffer, attempting to find values to assign to the rest of the variable names in the INPUT statement.

infile 'your-external-file' stopover;
   causes the DATA step to stop processing if an INPUT statement reaches the end of the current record without finding values for all variables in the statement. Use this option if you expect all of the data in the external file to conform to a given standard and if you want the DATA step to stop when it encounters a data record that does not conform to the standard.

infile 'your-external-file' missover;
   prevents the DATA step from going to the next line if it does not find values in the current record for all of the variables in the INPUT statement. Instead, the DATA step assigns a missing value for all variables that do not have values.

infile 'your-external-file' truncover;
   causes the DATA step to assign the raw data value to the variable even if the value is shorter than expected by the INPUT statement.
   If, when the DATA step encounters the end of an input record, there are variables without values, the variables are assigned missing values for that observation.

You can also use these options even when your data lines are in the program itself, that is, when they follow the DATALINES statement. Simply use datalines instead of a reference to an external file to indicate that the data records are in the DATA step itself:

- infile datalines flowover;
- infile datalines stopover;
- infile datalines missover;
- infile datalines truncover;

Note: The examples in this section show the use of the MISSOVER and TRUNCOVER options with formatted input. You can also use these options with list input and column input.

Understanding the MISSOVER Option

The MISSOVER option prevents the DATA step from going to the next line if it does not find values in the current record for all of the variables in the INPUT statement. Instead, the DATA step assigns a missing value for all variables that do not have complete values according to any specified informats. The input file contains the following raw data:

----+-----1-----+-----2
22
333
4444
55555

The following example uses the MISSOVER option:

data numbers;
   infile 'your-external-file' missover;
   input TestNumber 5.;
run;

proc print data=numbers;
   title 'Test DATA Step';
run;

Output 4.7 Output from the MISSOVER Option

Test DATA Step                1

       Test
Obs  Number
  1       .
  2       .
  3       .
  4   55555

Because the fourth record is the only one whose value matches the informat, it is the only record whose value is assigned to the TestNumber variable. The other observations receive missing values. This result is probably not the desired outcome for this example, but the MISSOVER option can sometimes be valuable. For an example, see “Updating a Data Set” on page 295.

Note: If there is a blank line at the end of the last record, the DATA step attempts to load another record into the input buffer.
Because there are no more records, the MISSOVER option instructs the DATA step to assign missing values to all variables, and an extra observation is added to the data set. To prevent this situation from occurring, make sure that your input data does not have a blank line at the end of the last record.

Understanding the TRUNCOVER Option

The TRUNCOVER option causes the DATA step to assign the raw data value to the variable even if the value is shorter than the length that is expected by the INPUT statement. If, when the DATA step encounters the end of an input record, there are variables without values, the variables are assigned missing values for that observation. The following example demonstrates the use of the TRUNCOVER option:

data numbers;
   infile 'your-external-file' truncover;
   input TestNumber 5.;
run;

proc print data=numbers;
   title 'Test DATA Step';
run;

Output 4.8 Output from the TRUNCOVER Option

Test DATA Step                1

       Test
Obs  Number
  1      22
  2     333
  3    4444
  4   55555

This result shows that all of the values were assigned to the TestNumber variable, despite the fact that three of them did not match the informat. For another example using the TRUNCOVER option, see “Input SAS Data Set for Examples” on page 140.

Review of SAS Tools

Column-Pointer Controls

@n
   moves the pointer to the nth column in the input buffer.

+n
   moves the pointer forward n columns in the input buffer.

Line-Pointer Controls

/
   moves the pointer to the next line in the input buffer.

#n
   moves the pointer to the nth line in the input buffer.

Line-Hold Specifiers

@ (trailing @)
   prevents SAS from automatically reading a new data record into the input buffer when a new INPUT statement is executed within the same iteration of the DATA step. When used, the trailing @ must be the last item in the INPUT statement.
@@ (double trailing @)
   prevents SAS from automatically reading a new data record into the input buffer when the next INPUT statement is executed, even if the DATA step returns to the top for another iteration. When used, the double trailing @ must be the last item in the INPUT statement.

Statements

DATALINES;
   indicates that data lines immediately follow. A semicolon in the line that immediately follows the last data line indicates the end of the data and causes the DATA step to compile and execute.

INFILE fileref <FLOWOVER | STOPOVER | MISSOVER | TRUNCOVER>;
INFILE 'external-file' <FLOWOVER | STOPOVER | MISSOVER | TRUNCOVER>;
   identifies an external file to be read by an INPUT statement. Specify a fileref that has been assigned with a FILENAME statement or with an appropriate operating environment command. Or you can specify the actual name of the external file.

   These options give you control over how SAS behaves if the end of a data record is encountered before all of the variables are assigned values. You can use these options with list, modified list, formatted, and column input.

   FLOWOVER
      is the default behavior. It causes the DATA step to look in the next record if the end of the current record is encountered before all of the variables are assigned values.

   MISSOVER
      causes the DATA step to assign missing values to any variables that do not have values when the end of a data record is encountered. The DATA step continues processing.

   STOPOVER
      causes the DATA step to stop execution immediately and write a note to the SAS log.

   TRUNCOVER
      causes the DATA step to assign values to variables, even if the values are shorter than expected by the INPUT statement, and to assign missing values to any variables that do not have values when the end of a record is encountered.

INPUT variable <&> <$>;
   reads the input data record using list input. The & (ampersand format modifier) allows character values to contain embedded blanks.
   When you use the ampersand format modifier, two blanks are required to signal the end of a data value. The $ indicates a character variable.

INPUT variable start-column<-end-column>;
   reads the input data record using column input. You can omit end-column if the data is only 1 byte long. This style of input enables you to skip columns of data that you want to omit.

INPUT variable : informat;
INPUT variable & informat;
   reads the input data record using modified list input. The : (colon format modifier) instructs SAS to use the informat that follows to read the data value. The & (ampersand format modifier) instructs SAS to use the informat that follows to read the data value. When you use the ampersand format modifier, two blanks are required to signal the end of a data value.

INPUT variable informat;
   reads raw data using formatted input. The informat supplies special instructions to read the data. You can also use a pointer control to direct SAS to start reading at a particular column.

The syntax given above for the three styles of input shows only one variable. Subsequent variables in the INPUT statement may or may not be described in the same input style as the first one. You may use any of the three styles of input (list, column, and formatted) in a single INPUT statement.

Learning More

Handling missing data values
   For complete details about the FLOWOVER, STOPOVER, MISSOVER, and TRUNCOVER options in the INFILE statement, see SAS Language Reference: Dictionary.

Reading multiple input records
Testing a condition
   - For more information about performing conditional processing with the IF statement, see Chapter 9, “Acting on Selected Observations,” on page 139 and Chapter 10, “Creating Subsets of Observations,” on page 159.
   - For a complete discussion and listing of line-pointer controls and line-hold specifiers, see SAS Language Reference: Dictionary.
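As a recap of the INFILE options reviewed above, the following is a minimal sketch of the MISSOVER option used with list input and in-stream data. It is not one of the manual's own examples: the data set name SCORES and the variables Id and Test1 through Test3 are hypothetical, and the data values are made up. The second data line deliberately has only three values.

```sas
/* Hypothetical data: the second record is missing a value for Test3. */
data scores;
   infile datalines missover;   /* do not go to the next line for Test3 */
   input Id Test1 Test2 Test3;
   datalines;
101 88 92 79
102 75 81
103 90 85 94
;

proc print data=scores;
   title 'MISSOVER with List Input';
run;
```

With MISSOVER, the step should produce three observations, and Test3 is missing for Id 102. Under the default FLOWOVER behavior, SAS would instead go to the third record to find a value for Test3, so the records would no longer map one-to-one onto observations.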
CHAPTER 5
Starting with SAS Data Sets

Introduction to Starting with SAS Data Sets 81
   Purpose 81
   Prerequisites 81
Understanding the Basics 82
Input SAS Data Set for Examples 82
Reading Selected Observations 84
Reading Selected Variables 85
   Overview of Reading Selected Variables 85
   Keeping Selected Variables 86
   Dropping Selected Variables 87
   Choosing between Data Set Options and Statements 88
   Choosing between the DROP= and KEEP= Data Set Option 88
Creating More Than One Data Set in a Single DATA Step 89
Using the DROP= and KEEP= Data Set Options for Efficiency 91
Review of SAS Tools 92
   Data Set Options 92
   Procedures 93
   Statements 93
Learning More 93

Introduction to Starting with SAS Data Sets

Purpose

In this section, you will learn how to do the following:

- display information about a SAS data set
- create a new SAS data set from an existing SAS data set rather than creating it from raw data records

Reading a SAS data set in a DATA step is simpler than reading raw data because the work of describing the data to SAS has already been done.

Prerequisites

You should understand the concepts presented in Chapter 1, “What Is the SAS System?,” on page 3 and Chapter 2, “Introduction to DATA Step Processing,” on page 19 before continuing with this section.

Understanding the Basics

When you use a SAS data set as input into a DATA step, the description of the data set is available to SAS. In your DATA step, use a SET, MERGE, MODIFY, or UPDATE statement to read the SAS data set. Use SAS programming statements to process the data and create an output SAS data set.

In a DATA step, you can create a new data set that is a subset of the original data set. For example, if you have a large data set of personnel data, you might want to look at a subset of observations that meet certain conditions, such as observations for employees hired after a certain date.
Alternatively, you might want to see all observations but only a few variables, such as the number of years of education or years of service to the company.

Working with existing SAS data sets, and with subsets created from them, makes more efficient use of computer resources than working with raw data or with entire large data sets. Reading fewer variables means that SAS creates a smaller program data vector, and reading fewer observations means that fewer iterations of the DATA step occur. Reading data directly from a SAS data set is more efficient than reading the raw data again, because the work of describing and converting the data has already been done.

One way of looking at a SAS data set is to produce a listing of its data by using the PRINT procedure. Another way is to display information that describes its structure rather than its data values. To display information about the structure of a data set, use the DATASETS procedure with the CONTENTS statement. If you need to work with a SAS data set that is unfamiliar to you, the CONTENTS statement in the DATASETS procedure displays valuable information such as the name, type, and length of all the variables in the data set. An example that uses the CONTENTS statement in the DATASETS procedure appears in “Input SAS Data Set for Examples” on page 82.

Input SAS Data Set for Examples

The examples in this section use a SAS data set named CITY, which contains information about expenditures for a small city. It reports total city expenditures for the years 1980 through 2000 and divides the expenses into two major categories: services and administration. (To see the program that creates the CITY data set, see “DATA Step to Create the Data Set CITY” on page 712.)

The following example uses the DATASETS procedure with the NOLIST option to display a description of the CITY data set.
The NOLIST option prevents the DATASETS procedure from listing other data sets that are also located in the WORK library:

proc datasets library=work nolist;
   contents data=city;
run;

Output 5.1 The Structure of CITY as Shown by PROC DATASETS

The SAS System                                                          1

The DATASETS Procedure

Data Set Name:  WORK.CITY                         Observations:          21  (1)
Member Type:    DATA                              Variables:             10  (1)
Engine:         V8                                Indexes:               0
Created:        9:54 Wednesday, October 6, 1999   Observation Length:    80
Last Modified:  9:54 Wednesday, October 6, 1999   Deleted Observations:  0
Protection:                                       Compressed:            NO
Data Set Type:                                    Sorted:                NO
Label:

-----Engine/Host Dependent Information-----  (2)

Data Set Page Size:          8192
Number of Data Set Pages:    1
First Data Page:             1
Max Obs per Page:            101
Obs in First Data Page:      21
Number of Data Set Repairs:  0
File Name:                   /usr/tmp/code_editor_saswork/SAS_work63ED00006E98/city.sas7bdat
Release Created:             8.0001M0
Host Created:                HP-UX
Inode Number:                62403
Access Permission:           rw-r--r--
Owner Name:                  abcdef
File Size (bytes):           16384

-----Alphabetic List of Variables and Attributes-----  (3)

 #  Variable             Type  Len  Pos  Label  (4)
---------------------------------------------------------------------
 5  AdminLabor           Num     8   32  Administration: Labor
 6  AdminSupplies        Num     8   40  Administration: Supplies
 9  AdminTotal           Num     8   64  Administration: Total
 7  AdminUtilities       Num     8   48  Administration: Utilities
 3  ServicesFire         Num     8   16  Services: Fire
 2  ServicesPolice       Num     8    8  Services: Police
 8  ServicesTotal        Num     8   56  Services: Total
 4  ServicesWater_Sewer  Num     8   24  Services: Water & Sewer
10  Total                Num     8   72  Total Outlays
 1  Year                 Num     8    0

The following list corresponds to the numbered items in the previous SAS output:

(1) The Observations and the Variables fields identify the number of observations and the number of variables.
(2) The Engine/Host Dependent Information section lists detailed information about the data set. This information is generated by the engine, which is the mechanism for reading from and writing to files.

    Operating Environment Information: The output in this section may differ, depending on your operating environment. For more information, refer to the SAS documentation for your operating environment.

(3) The Alphabetic List of Variables and Attributes lists the name, type, length, and position of each variable.
(4) The Label lists the format, informat, and label for each variable, if they exist.

Reading Selected Observations

If you are interested in only part of a large data set, you can use data set options to create a subset of your data. Data set options specify which observations you want the new data set to include. In Chapter 10, “Creating Subsets of Observations,” on page 159 you learn how to use the subsetting IF statement to create a subset of a large SAS data set. In this section, you learn how to use the FIRSTOBS= and OBS= data set options to create subsets of a larger data set.

For example, you might not want to read the observations at the beginning of the data set. You can use the FIRSTOBS= data set option to define which observation should be the first one that is processed. For the data set CITY, this example creates a data set that excludes observations that contain data prior to 1991 by specifying FIRSTOBS=12. As a result, SAS does not read the first 11 observations, which contain data prior to 1991. (To see the program that creates the CITY data set, see “DATA Step to Create the Data Set CITY” on page 712.)

The following program creates the data set CITY2, which contains the same number of variables but fewer observations than CITY.
data city2;
   set city(firstobs=12);
run;

proc print;
   title 'City Expenditures';
   title2 '1991 - 2000';
run;

The following output shows the results:

Output 5.2 Subsetting a Data Set by Observations

City Expenditures                                                                          1
1991 - 2000

           Services  Services     Services  Admin     Admin      Admin  Services  Admin
Obs  Year    Police      Fire  Water_Sewer  Labor  Supplies  Utilities     Total  Total  Total
  1  1991      2195      1002          643    256        24         55      3840    335   4175
  2  1992      2204       964          692    256        28         70      3860    354   4214
  3  1993      2175      1144          735    241        19         83      4054    343   4397
  4  1994      2556      1341          813    238        25         97      4710    360   5070
  5  1995      2026      1380          868    226        24         97      4274    347   4621
  6  1996      2526      1454          946    317        13         89      4926    419   5345
  7  1997      2027      1486         1043    226         .         82      4556      .      .
  8  1998      2037      1667         1152    244        20         88      4856    352   5208
  9  1999      2852      1834         1318    270        23         74      6004    367   6371
 10  2000      2787      1701         1317    307        26         66      5805    399   6204

You can also specify the last observation you want to include in a new data set with the OBS= data set option. For example, the next program creates a SAS data set containing only the observations for 1989 (the 10th observation) through 1994 (the 15th observation).

data city3;
   set city (firstobs=10 obs=15);
run;

Reading Selected Variables

Overview of Reading Selected Variables

You can create a subset of a larger data set not only by excluding observations but also by specifying which variables you want the new data set to contain. In a DATA step you can use the SET statement and the KEEP= or DROP= data set options (or the DROP and KEEP statements) to create a subset from a larger data set by specifying which variables you want the new data set to include.

Keeping Selected Variables

This example uses the KEEP= data set option in the SET statement to read only the variables that represent the services-related expenditures of the data set CITY.
data services;
   set city (keep=Year ServicesTotal ServicesPolice
                  ServicesFire ServicesWater_Sewer);
run;

proc print data=services;
   title 'City Services-Related Expenditures';
run;

The following output shows the resulting data set. Note that the data set SERVICES contains only those variables that are specified in the KEEP= option.

Output 5.3 Selecting Variables with the KEEP= Option

City Services-Related Expenditures                 1

           Services  Services     Services  Services
Obs  Year    Police      Fire  Water_Sewer     Total
  1  1980      2819      1120          422      4361
  2  1981      2477      1160          500      4137
  3  1982      2028      1061          510      3599
  4  1983      2754       893          540      4187
  5  1984      2195       963          541      3699
  6  1985      1877       926          535      3338
  7  1986      1727      1111          535      3373
  8  1987      1532      1220          519      3271
  9  1988      1448      1156          577      3181
 10  1989      1500      1076          606      3182
 11  1990      1934       969          646      3549
 12  1991      2195      1002          643      3840
 13  1992      2204       964          692      3860
 14  1993      2175      1144          735      4054
 15  1994      2556      1341          813      4710
 16  1995      2026      1380          868      4274
 17  1996      2526      1454          946      4926
 18  1997      2027      1486         1043      4556
 19  1998      2037      1667         1152      4856
 20  1999      2852      1834         1318      6004
 21  2000      2787      1701         1317      5805

The following example uses the KEEP statement instead of the KEEP= data set option to read all of the variables from the CITY data set. The KEEP statement creates a new data set (SERVICES) that contains only the variables listed in the KEEP statement. The following program gives results that are identical to those in the previous example:

data services;
   set city;
   keep Year ServicesTotal ServicesPolice ServicesFire
        ServicesWater_Sewer;
run;

The following example uses the KEEP= data set option in the DATA statement, which has the same effect as the KEEP statement.
All of the variables are read into the program data vector, but only the specified variables are written to the SERVICES data set:

data services (keep=Year ServicesTotal ServicesPolice
                    ServicesFire ServicesWater_Sewer);
   set city;
run;

Dropping Selected Variables

Use the DROP= option to create a subset of a larger data set when you want to specify which variables are being excluded rather than which ones are being included. The following DATA step reads all of the variables from the data set CITY except for those that are specified with the DROP= option, and then creates a data set named SERVICES2:

data services2;
   set city (drop=Total AdminTotal AdminLabor AdminSupplies
                  AdminUtilities);
run;

proc print data=services2;
   title 'City Services-Related Expenditures';
run;

The following output shows the resulting data set:

Output 5.4 Excluding Variables with the DROP= Option

City Services-Related Expenditures                 1

           Services  Services     Services  Services
Obs  Year    Police      Fire  Water_Sewer     Total
  1  1980      2819      1120          422      4361
  2  1981      2477      1160          500      4137
  3  1982      2028      1061          510      3599
  4  1983      2754       893          540      4187
  5  1984      2195       963          541      3699
  6  1985      1877       926          535      3338
  7  1986      1727      1111          535      3373
  8  1987      1532      1220          519      3271
  9  1988      1448      1156          577      3181
 10  1989      1500      1076          606      3182
 11  1990      1934       969          646      3549
 12  1991      2195      1002          643      3840
 13  1992      2204       964          692      3860
 14  1993      2175      1144          735      4054
 15  1994      2556      1341          813      4710
 16  1995      2026      1380          868      4274
 17  1996      2526      1454          946      4926
 18  1997      2027      1486         1043      4556
 19  1998      2037      1667         1152      4856
 20  1999      2852      1834         1318      6004
 21  2000      2787      1701         1317      5805

The following example uses the DROP statement instead of the DROP= data set option to read all of the variables from the CITY data set and to exclude the variables that are listed in the DROP statement from being written to the new data set.
The results are identical to those in the previous example:

data services2;
   set city;
   drop Total AdminTotal AdminLabor AdminSupplies AdminUtilities;
run;

proc print data=services2;
run;

Choosing between Data Set Options and Statements

When you create only one data set in the DATA step, the data set options to drop and keep variables have the same effect on the output data set as the statements to drop and keep variables. When you want to control which variables are read into the program data vector, using the data set options in the statement (such as a SET statement) that reads the SAS data set is generally more efficient than using the statements. Later topics in this section show you how to use the data set options in some cases where the statements will not work.

Choosing between the DROP= and KEEP= Data Set Option

In a simple case, you might decide to use the DROP= or KEEP= option, depending on which method enables you to specify fewer variables. If you work with large jobs that read data sets, and you expect that variables might be added between the times your batch jobs run, you may want to use the KEEP= option to specify which variables are included in the subset data set.

The following figure shows two data sets named SMALL. They have different contents because the new variable F was added to data set BIG before the DATA step ran on Tuesday. The DATA step uses the DROP= option to keep variables D and E from being written to the output data set. The result is that the data sets contain different contents: the second SMALL data set has an extra variable, F. If the DATA step used the KEEP= option to specify A, B, and C, then both of the SMALL data sets would have the same variables (A, B, and C). The addition of variable F to the original data set BIG would have no effect on the creation of the SMALL data set.
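The KEEP= alternative described above can be sketched as follows. This is only an illustration under stated assumptions: the variable names A through F and the data set names BIG and SMALL come from the text, but the values in BIG are made up so that the step can run on its own.

```sas
/* Hypothetical stand-in for BIG after variable F has been added. */
data big;
   input a b c d e f;
   datalines;
1 2 3 4 5 6
7 8 9 10 11 12
;

/* KEEP= names the variables to read, so SMALL contains A, B, and C */
/* whether or not BIG has gained new variables such as F.           */
data small;
   set big(keep=a b c);
run;

proc print data=small;
   title 'SMALL Created with KEEP=';
run;
```

Because the list names what to include rather than what to exclude, rerunning this step after more variables are added to BIG should still produce a SMALL data set with only A, B, and C.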
Figure 5.1 Using the DROP= Option

   Before F is added:  BIG (A B C D E)     data small; set big(drop=d e); run;     SMALL (A B C)
   After F is added:   BIG (A B C D E F)   data small; set big(drop=d e); run;     SMALL (A B C F)

Creating More Than One Data Set in a Single DATA Step

You can use a single DATA step to create more than one data set at a time. You can create data sets with different contents by using the KEEP= or DROP= data set options. For example, the following DATA step creates two SAS data sets: SERVICES contains variables that show services-related expenditures, and ADMIN contains variables that represent the administration-related expenditures. Use the KEEP= option after each data set name in the DATA statement to determine which variables are written to each SAS data set being created.

   data services(keep=ServicesTotal ServicesPolice ServicesFire
                      ServicesWater_Sewer)
        admin(keep=AdminTotal AdminLabor AdminSupplies AdminUtilities);
      set city;
   run;

   proc print data=services;
      title 'City Expenditures: Services';
   run;

   proc print data=admin;
      title 'City Expenditures: Administration';
   run;

The following output shows both data sets. Note that each data set contains only the variables that are specified with the KEEP= option after its name in the DATA statement.
Output 5.5 Creating Two Data Sets in One DATA Step

                   City Expenditures: Services                    1

                                    Services
            Services    Services     Water_     Services
    Obs      Police       Fire        Sewer       Total
      1       2819        1120         422        4361
      2       2477        1160         500        4137
      3       2028        1061         510        3599
      4       2754         893         540        4187
      5       2195         963         541        3699
      6       1877         926         535        3338
      7       1727        1111         535        3373
      8       1532        1220         519        3271
      9       1448        1156         577        3181
     10       1500        1076         606        3182
     11       1934         969         646        3549
     12       2195        1002         643        3840
     13       2204         964         692        3860
     14       2175        1144         735        4054
     15       2556        1341         813        4710
     16       2026        1380         868        4274
     17       2526        1454         946        4926
     18       2027        1486        1043        4556
     19       2037        1667        1152        4856
     20       2852        1834        1318        6004
     21       2787        1701        1317        5805

                City Expenditures: Administration                 2

            Admin      Admin        Admin       Admin
    Obs     Labor     Supplies    Utilities     Total
      1      391         63          98          552
      2      172         47          70          289
      3      269         29          79          377
      4      227         21          67          315
      5      214         21          59          294
      6      198         16          80          294
      7      213         27          70          310
      8      195         11          69          275
      9      225         12          58          295
     10      235         19          62          316
     11      266         11          63          340
     12      256         24          55          335
     13      256         28          70          354
     14      241         19          83          343
     15      238         25          97          360
     16      226         24          97          347
     17      317         13          89          419
     18      226          .          82            .
     19      244         20          88          352
     20      270         23          74          367
     21      307         26          66          399

Note: In this case, using the KEEP= data set option is necessary, because when you use the KEEP statement, all data sets that are created in the DATA step contain the same variables.

Using the DROP= and KEEP= Data Set Options for Efficiency

The DROP= and KEEP= data set options are valid in both the DATA statement and the SET statement. However, you can write a more efficient DATA step if you understand the consequences of using these options in the DATA statement rather than the SET statement.

In the DATA statement, these options affect which variables SAS writes from the program data vector to the resulting SAS data set. In the SET statement, these options determine which variables SAS reads from the input SAS data set. Therefore, they determine how the program data vector is built.
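As a rough sketch of the two placements working together, the step below reads only three variables from CITY and then leaves one of them out of the result. The data set name SHARE and the variable ServicesShare are hypothetical names introduced only for this illustration. Year, ServicesTotal, and Total are the CITY variables used earlier in this section.

```sas
/* KEEP= in the SET statement limits what is read into the program data vector. */
/* DROP= in the DATA statement removes Total when the observation is written,   */
/* after it has been used in the calculation.                                   */
data share(drop=Total);
   set city(keep=Year ServicesTotal Total);
   ServicesShare = ServicesTotal / Total;   /* hypothetical derived variable */
run;

proc print data=share;
run;
```

Total has to be read because the calculation needs it, so it is excluded on the output side and not the input side.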
When you specify the DROP= or KEEP= option in the SET statement, SAS does not read the excluded variables into the program data vector. If you work with a large data set (perhaps one containing thousands or millions of observations), you can construct a more efficient DATA step by not reading unneeded variables from the input data set.

Note also that if you use a variable from the input data set to perform a calculation, the variable must be read into the program data vector. If you do not want that variable to appear in the new data set, however, use the DROP= option in the DATA statement to exclude it.

The following DATA step creates the same two data sets as the DATA step in the previous example, but it does not read the variable Total into the program data vector. Compare the SET statement here to the one in “Creating More Than One Data Set in a Single DATA Step” on page 89.

   data services (keep=ServicesTotal ServicesPolice ServicesFire
                       ServicesWater_Sewer)
        admin (keep=AdminTotal AdminLabor AdminSupplies AdminUtilities);
      set city(drop=Total);
   run;

   proc print data=services;
      title 'City Expenditures: Services';
   run;

   proc print data=admin;
      title 'City Expenditures: Administration';
   run;

In contrast with previous examples, the data set options in this example appear in both the DATA and SET statements. In the SET statement, the DROP= option determines which variables are omitted from the program data vector. In the DATA statement, the KEEP= option controls which variables are written from the program data vector to each data set being created.

Note: Using a DROP or KEEP statement is comparable to using a DROP= or KEEP= option in the DATA statement. All variables are included in the program data vector; they are excluded when the observation is written from the program data vector to the new data set.
When you create more than one data set in a single DATA step, using the data set options enables you to drop or keep different variables in each of the new data sets. A DROP or KEEP statement, on the other hand, affects all of the data sets that are created.

Review of SAS Tools

Data Set Options

DROP=variable(s)
   specifies the variables to be excluded. Used in the SET statement, DROP= specifies the variables that are not to be read from the existing SAS data set into the program data vector. Used in the DATA statement, DROP= specifies the variables to be excluded from the data set that is being created.

FIRSTOBS=n
   specifies the first observation to be read from the SAS data set that you specify in the SET statement.

KEEP=variable(s)
   specifies the variables to be included. Used in the SET statement, KEEP= specifies the variables to be read from the existing SAS data set into the program data vector. Used in the DATA statement, KEEP= specifies which variables in the program data vector are to be written to the data set being created.

OBS=n
   specifies the last observation to be read from the SAS data set that you specify in the SET statement.

Procedures

PROC DATASETS; CONTENTS;
   describes the structure of a SAS data set, including the name, type, and length of all variables in the data set.

Statements

DATA SAS-data-set<(data-set-options)>;
   begins a DATA step and names the SAS data set or data sets that are being created. You can specify the DROP= or KEEP= data set options in parentheses after each data set name to control which variables are written to the output data set from the program data vector.

DROP variable(s);
   specifies the variables to be excluded from the data set that is being created. See also the DROP= data set option.

KEEP variable(s);
   specifies the variables to be written to the data set that is being created. See also the KEEP= data set option.
SET SAS-data-set(data-set-options);
   reads observations from a SAS data set rather than records of raw data. You can specify the DROP= or KEEP= data set options in parentheses after a data set name to control which variables are read into the program data vector from the input data set.

Learning More

Creating SAS data sets
   For a general discussion about creating SAS data sets from other SAS data sets by merging, concatenating, interleaving, and updating, see Chapter 15, “Methods of Combining SAS Data Sets,” on page 233.

Data set options
   See the “Data Set Options” section of SAS Language Reference: Dictionary, and the SAS documentation for your operating environment.

DROP and KEEP statements
   See the “Statements” section of SAS Language Reference: Dictionary.

Engines
   See SAS Language Reference: Concepts.

Subsetting IF statement
   You can use the subsetting IF statement and conditional (IF-THEN) logic when creating a new SAS data set from an existing one. For more information, see Chapter 9, “Acting on Selected Observations,” on page 139 and Chapter 10, “Creating Subsets of Observations,” on page 159.

PART 3
Basic Programming

   Chapter 6    Understanding DATA Step Processing
   Chapter 7    Working with Numeric Variables
   Chapter 8    Working with Character Variables
   Chapter 9    Acting on Selected Observations
   Chapter 10   Creating Subsets of Observations
   Chapter 11   Working with Grouped or Sorted Observations
   Chapter 12   Using More Than One Observation in a Calculation
   Chapter 13   Finding Shortcuts in Programming
   Chapter 14   Working with Dates in the SAS System

CHAPTER 6
Understanding DATA Step Processing

   Introduction to DATA Step Processing
      Purpose
      Prerequisites
   Input SAS Data Set for Examples
   Adding Information to a SAS Data Set
      Understanding the Assignment Statement
      Making Uniform Changes to Data by Creating a Variable
      Adding Information to Some Observations but Not Others
      Making Uniform Changes to Data Without Creating Variables
      Using Variables Efficiently
   Defining Enough Storage Space for Variables
   Conditionally Deleting an Observation
   Review of SAS Tools
      Statements
   Learning More

Introduction to DATA Step Processing

Purpose

To add, modify, and delete information in a SAS data set, you use a DATA step. In this section, you will learn how the DATA step works, the general form of the statements, and some programming techniques.

Prerequisites

You should understand the concepts presented in Chapter 2, “Introduction to DATA Step Processing,” on page 19 and Chapter 3, “Starting with Raw Data: The Basics,” on page 43 before proceeding with this section.

Input SAS Data Set for Examples

Tradewinds Travel Inc. has an external file that they use to manipulate and store data about their tours. The external file contains the following information:

   (1)     (2)  (3)  (4)  (5)
   France    8  793  575  Major
   Spain    10  805  510  Hispania
   India    10    .  489  Royal
   Peru      7  722  590  Mundial

The numbered fields represent

   (1) the name of the country toured
   (2) the number of nights on the tour
   (3) the airfare in US dollars
   (4) the cost of the land package in US dollars
   (5) the name of the company that offers the tour

Notice that the cost of the airfare for the tour to India has a missing value, which is indicated by a period.
The following DATA step creates a permanent SAS data set named MYLIB.INTERNATIONALTOURS:

   options pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'permanent-data-library';

   data mylib.internationaltours;
      infile 'input-file';
      input Country $ Nights AirCost LandCost Vendor $;
   run;

   proc print data=mylib.internationaltours;
      title 'Data Set MYLIB.INTERNATIONALTOURS';
   run;

The PROC PRINT step that follows the DATA step produces this display of the MYLIB.INTERNATIONALTOURS data set:

Output 6.1 Creating a Permanent SAS Data Set

               Data Set MYLIB.INTERNATIONALTOURS                  1

                                  Air     Land
    Obs    Country    Nights     Cost     Cost    Vendor
      1    France        8        793      575    Major
      2    Spain        10        805      510    Hispania
      3    India        10          .      489    Royal
      4    Peru          7        722      590    Mundial

Adding Information to a SAS Data Set

Understanding the Assignment Statement

One of the most common reasons for using program statements in the DATA step is to produce new information from the original information or to change the information read by the INPUT or SET/MERGE/MODIFY/UPDATE statement. How do you add information to observations with a DATA step?

The basic method of adding information to a SAS data set is to create a new variable in a DATA step with an assignment statement. An assignment statement has the form:

   variable=expression;

The variable receives the new information; the expression creates the new information. You specify the calculation necessary to produce the information and write the calculation as the expression. When the expression contains character data, you must enclose the data in quotation marks. SAS evaluates the expression and stores the new information in the variable that you name.

It is important to remember that even if you need to add the information to only one or two observations out of many, SAS creates that variable for all observations. The SAS data set that is being created must have information in every observation and every variable.
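The two kinds of expression described above, character and numeric, can be sketched in one short step. The data set name TOURNOTES and the variables Currency and CostPerNight are hypothetical names chosen for this illustration only. The input data set is the MYLIB.INTERNATIONALTOURS data set created above.

```sas
data tournotes;
   set mylib.internationaltours;
   Currency = 'USD';                   /* character data is enclosed in quotation marks */
   CostPerNight = LandCost / Nights;   /* numeric expression built from existing variables */
run;

proc print data=tournotes;
   var Country Currency CostPerNight;
run;
```

Both new variables exist in every observation of the output data set, because the assignment statements run once for each observation that the SET statement reads.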
Making Uniform Changes to Data by Creating a Variable

Sometimes you want to make a particular change to every observation. For example, at Tradewinds Travel the airfare must be increased for every tour by $10 because of a new tax. One way to do this is to write an assignment statement that creates a new variable that calculates the new airfare:

   NewAirCost = AirCost+10;

This statement directs SAS to read the value of AirCost, add 10 to it, and assign the result to the new variable, NewAirCost. When this assignment statement is included in a DATA step, the DATA step looks like this:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data newair;
      set mylib.internationaltours;
      NewAirCost = AirCost + 10;
   run;

   proc print data=newair;
      var Country AirCost NewAirCost;
      title 'Increasing the Air Fare by $10 for All Tours';
   run;

Note: In this example, the VAR statement in the PROC PRINT step determines which variables are displayed in the output.

The following output shows the resulting SAS data set, NEWAIR:

Output 6.2 Adding Information to All Observations by Using a New Variable

          Increasing the Air Fare by $10 for All Tours            1

                                   New (1)
                         Air        Air
    Obs    Country      Cost       Cost
      1    France        793        803
      2    Spain         805        815
      3    India           .          .  (2)
      4    Peru          722        732

Notice in this data set that

   (1) because SAS carries out each statement in the DATA step for every observation, NewAirCost is calculated during each iteration of the DATA step.
   (2) the observation for India contains a missing value for AirCost; SAS therefore assigns a missing value to NewAirCost for that observation.

The SAS data set has information in every observation and every variable.

Adding Information to Some Observations but Not Others

Often you need to add information to some observations but not to others. For example, some tour operators award bonus points to travel agencies for scheduling particular tours. Two companies, Hispania and Mundial, are offering bonus points this year.
IF-THEN/ELSE statements can cause assignment statements to be carried out only when a condition is met. In the following DATA step, the IF statements check the value of the variable Vendor. If the value is either Hispania or Mundial, information about the bonus points is added to those observations.

   options pagesize=60 linesize=80 pageno=1 nodate;

   data bonus;
      set mylib.internationaltours;
      if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
      else if Vendor = 'Mundial' then BonusPoints = 'Yes';
   run;

   proc print data=bonus;
      var Country Vendor BonusPoints;
      title1 'Adding Information to Observations for';
      title2 'Vendors Who Award Bonus Points';
   run;

The following output displays the results:

Output 6.3 Specifying Values for Specific Observations by Using a New Variable

             Adding Information to Observations for               1
                 Vendors Who Award Bonus Points

    Obs    Country    Vendor      BonusPoints
      1    France     Major       (1)
      2    Spain      Hispania    For 10+ people (2)
      3    India      Royal       (1)
      4    Peru       Mundial     Yes

The new variable BonusPoints has the following information:

   (1) In the two observations that are not assigned a value for BonusPoints, SAS assigns a missing value, represented by a blank in this case, to indicate the absence of a character value.
   (2) The first value that SAS encounters for BonusPoints contains 14 characters; therefore, SAS sets aside 14 bytes of storage in each observation for BonusPoints, regardless of the length of the value for that observation.

Making Uniform Changes to Data Without Creating Variables

Sometimes you want to change the value of existing variables without adding new variables. For example, in one DATA step a new variable, NewAirCost, was created to contain the value of the airfare plus the new $10 tax:

   NewAirCost = AirCost + 10;

You can also decide to change the value of an existing variable rather than create a new variable.
Following the example, AirCost is changed as follows:

   AirCost = AirCost + 10;

SAS processes this statement just as it does other assignment statements. It evaluates the expression on the right side of the equal sign and assigns the result to the variable on the left side of the equal sign. The fact that the same variable appears on the right and left sides of the equal sign does not matter. SAS evaluates the expression on the right side of the equal sign before looking at the variable on the left side.

The following program contains the new assignment statement:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data newair2;
      set mylib.internationaltours;
      AirCost = AirCost + 10;
   run;

   proc print data=newair2;
      var Country AirCost;
      title 'Adding Tax to the Air Cost Without Adding a New Variable';
   run;

The following output displays the results:

Output 6.4 Changing the Information in a Variable

     Adding Tax to the Air Cost Without Adding a New Variable     1

                         Air
    Obs    Country      Cost
      1    France        803
      2    Spain         815
      3    India           .
      4    Peru          732

When you change the kind of information that a variable contains, you change the meaning of that variable. In this case, you are changing the meaning of AirCost from airfare without tax to airfare with tax. If you remember the current meaning and if you know that you do not need the original information, then changing a variable’s values is useful. However, for many programmers, having separate variables is easier than recalling one variable whose definition changes.

Using Variables Efficiently

Variables that contain information that applies to only one or two observations use more storage space than necessary. When possible, create fewer variables that apply to more observations in the data set, and allow the different values in different observations to supply the information. For example, the Major company offers discounts, not bonus points, for groups of 30 or more people.
An inefficient program would create separate variables for bonus points and discounts, as follows:

   /* inefficient use of variables */
   options pagesize=60 linesize=80 pageno=1 nodate;

   data tourinfo;
      set mylib.internationaltours;
      if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
      else if Vendor = 'Mundial' then BonusPoints = 'Yes';
      else if Vendor = 'Major' then Discount = 'For 30+ people';
   run;

   proc print data=tourinfo;
      var Country Vendor BonusPoints Discount;
      title 'Information About Vendors';
   run;

The following output displays the results:

Output 6.5 Inefficient: Using Variables That Scatter Information Across Multiple Variables

                    Information About Vendors                     1

    Obs    Country    Vendor      BonusPoints       Discount
      1    France     Major                         For 30+ people
      2    Spain      Hispania    For 10+ people
      3    India      Royal
      4    Peru       Mundial     Yes

As you can see, storage space is used inefficiently. Both BonusPoints and Discount have a significant number of missing values.

With a little planning, you can make the SAS data set much more efficient. In the following DATA step, the variable Remarks contains information about bonus points, discounts, and any other special features of any tour.
   /* efficient use of variables */
   options pagesize=60 linesize=80 pageno=1 nodate;

   data newinfo;
      set mylib.internationaltours;
      if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
      else if Vendor = 'Mundial' then Remarks = 'Bonus points';
      else if Vendor = 'Major' then Remarks = 'Discount: 30+ people';
   run;

   proc print data=newinfo;
      var Country Vendor Remarks;
      title 'Information About Vendors';
   run;

The following output displays a more efficient use of variables:

Output 6.6 Efficient: Using Variables to Contain Maximum Information

                    Information About Vendors                     1

    Obs    Country    Vendor      Remarks
      1    France     Major       Discount: 30+ people
      2    Spain      Hispania    Bonus for 10+ people
      3    India      Royal
      4    Peru       Mundial     Bonus points

Remarks has fewer missing values and contains all the information that is used by BonusPoints and Discount in the inefficient example. Using variables efficiently can save storage space and optimize your SAS data set.

Defining Enough Storage Space for Variables

The first time that a value is assigned to a character variable, SAS allocates as many bytes of storage space for the variable as there are characters in the first value assigned to it. At times, you may need to specify the amount of storage space that a variable requires.

For example, as shown in the preceding example, the variable Remarks contains miscellaneous information about tours:

   if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';

In this assignment statement, SAS allocates 20 bytes of storage space for Remarks because there are 20 characters in the first value assigned to it. Because the longest value might not be the first one assigned, you can specify a more appropriate length for the variable before the first value is assigned to it:

   length Remarks $ 30;

This statement, called a LENGTH statement, applies to the entire data set. It defines the number of bytes of storage that is used for the variable Remarks in every observation.
SAS uses the LENGTH statement during compilation, not when it is processing statements on individual observations. The following DATA step shows the use of the LENGTH statement:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data newlength;
      set mylib.internationaltours;
      length Remarks $ 30;
      if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
      else if Vendor = 'Mundial' then Remarks = 'Bonus points';
      else if Vendor = 'Major' then Remarks = 'Discount for 30+ people';
   run;

   proc print data=newlength;
      var Country Vendor Remarks;
      title 'Information About Vendors';
   run;

The following output displays the NEWLENGTH data set:

Output 6.7 Using a LENGTH Statement

                    Information About Vendors                     1

    Obs    Country    Vendor      Remarks
      1    France     Major       Discount for 30+ people
      2    Spain      Hispania    Bonus for 10+ people
      3    India      Royal
      4    Peru       Mundial     Bonus points

Because the LENGTH statement affects variable storage, not the spacing of columns in printed output, the Remarks variable appears the same in Output 6.6 and Output 6.7. To see the effect of the LENGTH statement on variable storage by using the DATASETS procedure, see Chapter 35, “Getting Information about Your SAS Data Sets,” on page 607.

Conditionally Deleting an Observation

To prevent SAS from writing an observation from the program data vector to the data set when a condition is met, use the DELETE statement in the DATA step. For example, if the tour to Peru has been discontinued, it is no longer necessary to include the observation for Peru in the data set that is being created.
The following example uses the DELETE statement to prevent SAS from writing that observation to the output data set:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data subset;
      set mylib.internationaltours;
      if Country = 'Peru' then delete;
   run;

   proc print data=subset;
      title 'Omitting a Discontinued Tour';
   run;

The following output displays the results:

Output 6.8 Deleting an Observation

                  Omitting a Discontinued Tour                    1

                                  Air     Land
    Obs    Country    Nights     Cost     Cost    Vendor
      1    France        8        793      575    Major
      2    Spain        10        805      510    Hispania
      3    India        10          .      489    Royal

The observation for Peru has been deleted from the data set.

Review of SAS Tools

Statements

DELETE;
   prevents SAS from writing a particular observation to the output data set. It usually appears as part of an IF-THEN/ELSE statement.

IF condition THEN action; ELSE action;
   tests whether the condition is true. When the condition is true, the THEN clause specifies the action to take. When the condition is false, the ELSE statement provides an alternative action. The action can be one or more statements, including assignment statements.

LENGTH variable <$> length;
   assigns the number of bytes of storage (length) for a variable. Include a dollar sign ($) if the variable is character. The LENGTH statement must appear before the first use of the variable.

variable=expression;
   is an assignment statement. It causes SAS to evaluate the expression on the right side of the equal sign and assign the result to the variable on the left. You must select the name of the variable and create the proper expression for calculating its value. The same variable name can appear on the left and right sides of the equal sign because SAS evaluates the right side before assigning the result to the variable on the left side.
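The statements reviewed above interact in a way that a small sketch can make concrete. The data set names NOLENGTH and WITHLENGTH are hypothetical, and the assignments are deliberately reordered from the chapter's examples so that the shorter value comes first. The values themselves are the ones used earlier in this chapter.

```sas
/* Without a LENGTH statement, the first assignment in the step sets the length. */
/* 'Bonus points' has 12 characters, so Remarks is stored with a length of 12,  */
/* and the longer value for Hispania is truncated to 'Bonus for 10'.            */
data nolength;
   set mylib.internationaltours;
   if Vendor = 'Mundial' then Remarks = 'Bonus points';
   else if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
run;

/* With LENGTH placed before the first use of Remarks, both values fit. */
/* The DELETE statement omits the observation for Peru from the output. */
data withlength;
   set mylib.internationaltours;
   length Remarks $ 30;
   if Country = 'Peru' then delete;
   if Vendor = 'Mundial' then Remarks = 'Bonus points';
   else if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
run;

proc print data=nolength;
   var Country Vendor Remarks;
run;
```

Printing NOLENGTH next to WITHLENGTH is one way to check that a LENGTH statement is doing what you expect before relying on it in a larger job.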
Learning More

Character variables
   For information about expressions involving alphabetic and special characters as well as numbers, see Chapter 8, “Working with Character Variables,” on page 119.

DATA step
   For general DATA step information, see Chapter 2, “Introduction to DATA Step Processing,” on page 19. Complete information about the DATA step can be found in the “DATA Step Concepts” section of SAS Language Reference: Concepts.

IF-THEN/ELSE statements
   The IF-THEN/ELSE statements are discussed in Chapter 9, “Acting on Selected Observations,” on page 139.

LENGTH statement
   Additional information about the LENGTH statement can be found in Chapter 7, “Working with Numeric Variables,” on page 107 and Chapter 8, “Working with Character Variables,” on page 119. To see the effect of the LENGTH statement on variable storage by using the DATASETS procedure, see Chapter 35, “Getting Information about Your SAS Data Sets,” on page 607.

Missing values
   For more information about missing values, see Chapter 7, “Working with Numeric Variables,” on page 107 and Chapter 8, “Working with Character Variables,” on page 119.

Numeric variables
   Information about working with numeric variables and expressions can be found in Chapter 7, “Working with Numeric Variables,” on page 107.

SAS statements
   For complete reference information about the IF-THEN/ELSE, LENGTH, DELETE, assignment, and comment statements, see SAS Language Reference: Dictionary.
CHAPTER 7
Working with Numeric Variables

   Introduction to Working with Numeric Variables
      Purpose
      Prerequisites
   About Numeric Variables in SAS
   Input SAS Data Set for Examples
   Calculating with Numeric Variables
      Using Arithmetic Operators in Assignment Statements
      Understanding Numeric Expressions and Assignment Statements
   Understanding How SAS Handles Missing Values
      Why SAS Assigns Missing Values
      Rules for Missing Values
      Propagating Missing Values
   Calculating Numbers Using SAS Functions
      Rounding Values
      Calculating a Cost When There Are Missing Values
      Combining Functions
   Comparing Numeric Variables
   Storing Numeric Variables Efficiently
   Review of SAS Tools
      Functions
      Statements
   Learning More

Introduction to Working with Numeric Variables

Purpose

In this section, you will learn the following:

   - how to perform arithmetic calculations in SAS using arithmetic operators and the SAS functions ROUND and SUM
   - how to compare numeric variables using logical operators
   - how to store numeric variables efficiently when disk space is limited

Prerequisites

Before proceeding with this section, you should understand the concepts presented in the following topics:

   - Part 1, “Introduction to the SAS System”
   - Part 2, “Getting Your Data into Shape”
   - Chapter 6, “Understanding DATA Step Processing,” on page 97

About Numeric Variables in SAS

A numeric variable is a variable whose values are numbers.

Note: SAS uses double-precision floating point representation for calculations and, by default, for storing numeric variables in SAS data sets.

SAS accepts numbers in many forms, such as scientific notation and hexadecimal. For more information, see the discussion on the types of numbers that SAS can read from data lines in SAS Language Reference: Concepts.
For simplicity, this documentation concentrates on numbers in standard representation, as shown here:

   1254
   336.05
   -243

You can use SAS to perform all kinds of mathematical operations. To perform a calculation in a DATA step, you can write an assignment statement in which the expression contains arithmetic operators, SAS functions, or a combination of the two. To compare numeric variables, you can write an IF-THEN/ELSE statement using logical operators. For more information on numeric functions, see the discussion in the “Functions and CALL Routines” section in SAS Language Reference: Dictionary.

Input SAS Data Set for Examples

Tradewinds Travel Inc. has an external file that contains information about their most popular tours:

   (1)         (2) (3)  (4)  (5)
   Japan         8  982 1020 Express
   Greece       12    .  748 Express
   New Zealand  16 1368 1539 Southsea
   Ireland       7  787  628 Express
   Venezuela     9  426  505 Mundial
   Italy         8  852  598 Express
   Russia       14 1106 1024 A-B-C
   Switzerland   9  816  834 Tour2000
   Australia    12 1299 1169 Southsea
   Brazil        8  682  610 Almeida

The numbered fields represent

   (1) the name of the country toured
   (2) the number of nights on the tour
   (3) the airfare in US dollars
   (4) the cost of the land package in US dollars
   (5) the name of the company that offers the tour

The following program creates a permanent SAS data set named MYLIB.POPULARTOURS:

   options pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'permanent-data-library';

   data mylib.populartours;
      infile 'input-file';
      input Country $ 1-11 Nights AirCost LandCost Vendor $;
   run;

   proc print data=mylib.populartours;
      title 'Data Set MYLIB.POPULARTOURS';
   run;

The following output shows the data set:

Output 7.1 Data Set MYLIB.POPULARTOURS

                  Data Set MYLIB.POPULARTOURS                     1

                                     Air     Land
    Obs    Country        Nights    Cost     Cost    Vendor
      1    Japan             8       982     1020    Express
      2    Greece           12         .      748    Express
      3    New Zealand      16      1368     1539    Southsea
      4    Ireland           7       787      628    Express
      5    Venezuela         9       426      505    Mundial
      6    Italy             8       852      598    Express
      7    Russia           14      1106     1024    A-B-C
      8    Switzerland       9       816      834    Tour2000
      9    Australia        12      1299     1169    Southsea
     10    Brazil            8       682      610    Almeida

In MYLIB.POPULARTOURS, the variables Nights, AirCost, and LandCost contain numbers and are stored as numeric variables. For comparison, variables Country and Vendor contain alphabetic and special characters as well as numbers; they are stored as character variables.

Calculating with Numeric Variables

Using Arithmetic Operators in Assignment Statements

One way to perform calculations on numeric variables is to write an assignment statement using arithmetic operators. Arithmetic operators indicate addition, subtraction, multiplication, division, and exponentiation (raising to a power). For more information on arithmetic expressions, see the discussion in SAS Language Reference: Concepts. The following table shows operators that you can use in arithmetic expressions.

Table 7.1 Operators in Arithmetic Expressions

   Operation         Symbol    Example
   addition          +         x = y + z;
   subtraction       -         x = y - z;
   multiplication    *         x = y * z;
   division          /         x = y / z;
   exponentiation    **        x = y ** z;

The following examples show some typical calculations using the Tradewinds Travel sample data.

Table 7.2 Examples of Using Arithmetic Operators

   Action:         Add the airfare and land cost to produce the total cost.
   SAS Statement:  TotalCost = AirCost + LandCost;

   Action:         Calculate the peak season airfares by increasing the basic fare by 10% and adding an $8 departure tax.
   SAS Statement:  PeakAir = (AirCost * 1.10) + 8;

   Action:         Show the cost per night of each land package.
   SAS Statement:  NightCost = LandCost / Nights;

In each case, the variable on the left side of the equal sign receives the calculated value from the numeric expression on the right side of the equal sign.
Including these statements in the following DATA step produces data set NEWTOUR:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data newtour;
      set mylib.populartours;
      TotalCost = AirCost + LandCost;
      PeakAir = (AirCost * 1.10) + 8;
      NightCost = LandCost / Nights;
   run;

   proc print data=newtour;
      var Country Nights AirCost LandCost TotalCost PeakAir NightCost;
      title 'Costs for Tours';
   run;

The VAR statement in the PROC PRINT step causes only the variables listed in the statement to be displayed in the output.

Output 7.2 Creating New Variables by Using Arithmetic Expressions

                               Costs for Tours                               1

                                  Air     Land    Total      Peak     Night
   Obs    Country        Nights   Cost    Cost     Cost       Air      Cost

     1    Japan             8      982    1020     2002    1088.2   127.500
     2    Greece           12        .     748        .         .    62.333
     3    New Zealand      16     1368    1539     2907    1512.8    96.188
     4    Ireland           7      787     628     1415     873.7    89.714
     5    Venezuela         9      426     505      931     476.6    56.111
     6    Italy             8      852     598     1450     945.2    74.750
     7    Russia           14     1106    1024     2130    1224.6    73.143
     8    Switzerland       9      816     834     1650     905.6    92.667
     9    Australia        12     1299    1169     2468    1436.9    97.417
    10    Brazil            8      682     610     1292     758.2    76.250

Understanding Numeric Expressions and Assignment Statements

Numeric expressions in SAS share some features with mathematical expressions:

- When an expression contains more than one operator, the operations have the same order of precedence as in a mathematical expression: exponentiation is done first, then multiplication and division, and finally addition and subtraction.
- When operators of equal precedence appear, the operations are performed from left to right (except exponentiation, which is performed right to left).
- Parentheses are used to group parts of an expression; as in mathematical expressions, operations in parentheses are performed first.

Note: The equal sign in an assignment statement does not perform the same function as the equal sign in a mathematical equation.
The sequence variable= in an assignment statement defines the statement, and the variable must appear on the left side of the equal sign. You cannot switch the positions of the result variable and the expression as you can in a mathematical equation.

Understanding How SAS Handles Missing Values

Why SAS Assigns Missing Values

What if an observation lacks a value for a particular numeric variable? For example, in the data set MYLIB.POPULARTOURS, as shown in Output 7.1, the observation for Greece has no value for the variable AirCost. To maintain the rectangular structure of a SAS data set, SAS assigns a missing value to the variable in that observation. A missing value indicates that no information is present for the variable in that observation.

Rules for Missing Values

The following rules describe missing values in several situations:

- In data lines, a missing numeric value is represented by a period, for example,

     Greece       12    .  748 Express

  By default, SAS interprets a single period in a numeric field as a missing value. (If the INPUT statement reads the value from particular columns, as in column input, a field that contains only blanks also produces a missing value.)
- In an expression, a missing numeric value is represented by a period, for example,

     if AirCost = . then Status = 'Need air cost';

- In a comparison and in sorting, a missing numeric value is a lower value than any other numeric value.
- In procedure output, SAS by default represents a missing numeric value with a period.
- Some procedures eliminate missing values from their analyses; others do not. Documentation for individual procedures describes how each procedure handles missing values.

Propagating Missing Values

When you use a missing value in an arithmetic expression, SAS sets the result of the expression to missing. If you use that result in another expression, the next result is also missing.
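
The following sketch illustrates this behavior on a single hypothetical observation. The data set name PROPAGATE_DEMO and the variable WithTax (with an arbitrary multiplier of 1.05) are assumptions introduced for the example; the AirCost and LandCost values mirror the Greece observation.

```sas
/* Sketch only: hypothetical data set and WithTax variable. */
data propagate_demo;
   AirCost   = .;                      /* missing airfare, as for Greece        */
   LandCost  = 748;
   TotalCost = AirCost + LandCost;     /* one missing operand: result is missing */
   WithTax   = TotalCost * 1.05;       /* uses a missing result: also missing    */
   put TotalCost= WithTax=;            /* both are written to the log as a period */
run;
```

When a step like this runs, SAS also writes a note to the log reporting that missing values were generated as a result of performing an operation on missing values, which is a useful signal when a missing result is unexpected.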
In SAS, this method of treating missing values is called propagation of missing values. For example, Output 7.2 shows that in the data set NEWTOUR, the values for TotalCost and PeakAir are also missing in the observation for Greece.

Note: SAS enables you to distinguish between various kinds of numeric missing values. See the "Missing Values" section of SAS Language Reference: Concepts. The SAS language contains 27 special missing values based on the letters A-Z and the underscore (_).

Calculating Numbers Using SAS Functions

Rounding Values

In the example data that lists costs of the different tours (Output 7.1), some of the tours have odd prices: $748 instead of $750, $1299 instead of $1300, and so on. Rounded numbers, created by rounding the tour prices to the nearest $10, would be easier to work with.

Programming a rounding calculation with only the arithmetic operators is a lengthy process. However, SAS contains around 280 built-in numeric expressions called functions. You can use them in expressions just as you do the arithmetic operators. For example, the following assignment statement rounds the value of AirCost to the nearest $50:

   RoundAir = round(AirCost,50);

The following statement calculates the total cost of each tour, rounded to the nearest $100:

   TotalCostR = round(AirCost + LandCost,100);

Calculating a Cost When There Are Missing Values

As another example, the travel agent can calculate a total cost for the tours based on all nonmissing costs. Therefore, when the airfare is missing (as it is for Greece) the total cost represents the land cost, not a missing value. (Of course, you must decide whether skipping missing values in a particular calculation is a good idea.) The SUM function calculates the sum of its arguments, ignoring missing values. This example illustrates the SUM function:

   CostSum = sum(AirCost,LandCost);

Combining Functions

You can combine functions.
The ROUND function rounds the quantity given in the first argument to the nearest unit given in the second argument. The SUM function adds any number of arguments, ignoring missing values. The calculation in the following assignment statement rounds the sum of all nonmissing airfares and land costs to the nearest $100 and assigns the value to RoundSum:

   RoundSum = round(sum(AirCost,LandCost),100);

Using the ROUND and SUM functions in the following DATA step creates the data set MORETOUR:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data moretour;
      set mylib.populartours;
      RoundAir = round(AirCost,50);
      TotalCostR = round(AirCost + LandCost,100);
      CostSum = sum(AirCost,LandCost);
      RoundSum = round(sum(AirCost,LandCost),100);
   run;

   proc print data=moretour;
      var Country AirCost LandCost RoundAir TotalCostR CostSum RoundSum;
      title 'Rounding and Summing Values';
   run;

The following output displays the results:

Output 7.3 Creating New Variables with ROUND and SUM Functions

                         Rounding and Summing Values                         1

                           Air     Land    Round    Total     Cost    Round
   Obs    Country         Cost     Cost      Air    CostR      Sum      Sum

     1    Japan            982     1020     1000     2000     2002     2000
     2    Greece             .      748        .        .      748      700
     3    New Zealand     1368     1539     1350     2900     2907     2900
     4    Ireland          787      628      800     1400     1415     1400
     5    Venezuela        426      505      450      900      931      900
     6    Italy            852      598      850     1500     1450     1500
     7    Russia          1106     1024     1100     2100     2130     2100
     8    Switzerland      816      834      800     1700     1650     1700
     9    Australia       1299     1169     1300     2500     2468     2500
    10    Brazil           682      610      700     1300     1292     1300

Comparing Numeric Variables

Often in a program you need to know if variables are equal to each other, or if they are greater than or less than each other. To compare two numeric variables, you can write an IF-THEN/ELSE statement using logical operators. The following table lists some of the logical operators you can use for variable comparisons.
Table 7.3 Logical Operators

   Symbol        Mnemonic Equivalent   Logical Operation
   =             eq                    equal
   ¬=, ^=, ~=    ne                    not equal to (the ¬=, ^=, or ~= symbol, depending on your keyboard)
   >             gt                    greater than
   >=            ge                    greater than or equal to
   <             lt                    less than
   <=            le                    less than or equal to

In this example, the total cost of each tour in the POPULARTOURS data set is compared to 2000 using the greater-than logical operator (gt). If the total cost of the tour is greater than 2000, the tour is excluded from the data set. The resulting data set TOURSUNDER2K contains tours that are $2000 or less.

   options pagesize=60 linesize=80 pageno=1 nodate;

   data toursunder2K;
      set mylib.populartours;
      TotalCost = AirCost + LandCost;
      if TotalCost gt 2000 then delete;
   run;

   proc print data=toursunder2K;
      var Country Nights AirCost LandCost TotalCost Vendor;
      title 'Tours $2000 or Less';
   run;

The following output shows the tours that were not deleted, that is, the tours whose total cost is not greater than $2000:

Output 7.4 Comparing Numeric Variables

                            Tours $2000 or Less                            1

                                   Air     Land    Total
   Obs    Country        Nights    Cost    Cost     Cost    Vendor

     1    Greece           12         .     748        .    Express
     2    Ireland           7       787     628     1415    Express
     3    Venezuela         9       426     505      931    Mundial
     4    Italy             8       852     598     1450    Express
     5    Switzerland       9       816     834     1650    Tour2000
     6    Brazil            8       682     610     1292    Almeida

The TotalCost value for Greece is a missing value because any calculation that includes a missing value results in a missing value. In a comparison, missing numeric values are lower than any other numeric value, so the missing total is not greater than 2000 and the observation for Greece is kept.

If you need to compare a variable to more than one value, you can include multiple comparisons in a condition. To eliminate tours with missing values, a second comparison is added:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data toursunder2K2;
      set mylib.populartours;
      TotalCost = AirCost + LandCost;
      if TotalCost gt 2000 or TotalCost = . then delete;
   run;

   proc print data=toursunder2K2;
      var Country Nights TotalCost Vendor;
      title 'Tours $2000 or Less';
   run;

The following output displays the results:

Output 7.5 Multiple Comparisons in a Condition

                    Tours $2000 or Less                    1

                                   Total
   Obs    Country        Nights     Cost    Vendor

     1    Ireland           7       1415    Express
     2    Venezuela         9        931    Mundial
     3    Italy             8       1450    Express
     4    Switzerland       9       1650    Tour2000
     5    Brazil            8       1292    Almeida

Notice that Greece is no longer included in the tours for under $2000.

Storing Numeric Variables Efficiently

The data sets shown in this section are very small, but data sets are often very large. If you have a large data set, you may need to think about the storage space that your data set occupies. There are ways to save space when you store numeric variables in SAS data sets.

Note: The SAS documentation for your operating environment provides information about storing numeric variables whose values are limited to 1 or 0 in the minimum number of bytes used by SAS (either 2 or 3 bytes, depending on your operating environment).

By default, SAS uses 8 bytes of storage in a data set for each numeric variable. Therefore, storing the variables for each observation in the earlier data set MORETOUR requires 75 bytes:

   56 bytes for numeric variables
      (8 bytes per variable * 7 numeric variables)
   11 bytes for Country
    8 bytes for Vendor
   __________________________
   75 bytes for all variables

When numeric variables contain only integers (whole numbers), you can often shorten them in the data set being created. For example, a length of 4 bytes accurately stores all integers up to at least 2,000,000.

Note: Under some operating environments, the maximum number of bytes is much greater. For more information, refer to the documentation provided by the vendor for your operating environment.

To change the number of bytes used for each variable, use a LENGTH statement.
A LENGTH statement contains the names of the variables followed by the number of bytes to be used for their storage. For numeric variables, the LENGTH statement affects only the data set being created; it does not affect the program data vector. The following program changes the storage space for all numeric variables that are in the data set SHORTER:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data shorter;
      set mylib.populartours;
      length Nights AirCost LandCost RoundAir TotalCostR
             CostSum RoundSum 4;
      RoundAir = round(AirCost,50);
      TotalCostR = round(AirCost + LandCost,100);
      CostSum = sum(AirCost,LandCost);
      RoundSum = round(sum(AirCost,LandCost),100);
   run;

By calculating the storage space that is needed for the variables in each observation of SHORTER, you can see how the LENGTH statement changes the amount of storage space used:

   28 bytes for numeric variables
      (4 bytes per variable in the LENGTH statement * 7 numeric variables)
   11 bytes for Country
    8 bytes for Vendor
   __________________________
   47 bytes for all variables

Because all 7 numeric variables in SHORTER are shortened by the LENGTH statement, the storage space for the variables in each observation drops from 75 bytes to 47 bytes, a reduction of more than one third.

CAUTION: Be careful in shortening the length of numeric variables if your variable values are not integers. Fractional numbers lose precision permanently if they are truncated. In general, use the LENGTH statement to truncate values only when disk space is limited. Use the default length of 8 bytes to store variables containing fractions.

Review of SAS Tools

Functions

ROUND (expression, round-off-unit)
   rounds the quantity in expression to the figure given in round-off-unit. The expression can be a numeric variable name, a numeric constant, or an arithmetic expression. Separate round-off-unit from expression with a comma.

SUM (expression-1<, . . . expression-n>)
   produces the sum of all expressions that you specify in the parentheses.
   The SUM function ignores missing values as it calculates the sum of the expressions. Each expression can be a numeric variable, a numeric constant, another arithmetic expression, or another numeric function.

Statements

LENGTH variable-list number-of-bytes;
   indicates that the variables in the variable-list are to be stored in the data set according to the number-of-bytes that you specify. Numeric variables are not affected while they are in the program data vector. The default length for a numeric variable is 8 bytes. In general, the minimum you should use is 4 bytes for variables that contain integers and 8 bytes for variables that contain fractions. You can assign lengths to both numeric and character variables (discussed in the next section) in a single LENGTH statement.

variable=expression;
   is an assignment statement. It causes SAS to calculate the value of the expression on the right side of the equal sign and assign the result to the variable on the left. When variable is numeric, the expression can be an arithmetic calculation, a numeric constant, or a numeric function.

Learning More

Abbreviating lists of variables
   Ways to abbreviate lists of variables in function arguments are documented in SAS Language Reference: Concepts. Many functions, including the SUM function, accept abbreviated lists of variables as arguments.

DEFAULT= option
   Information about using the DEFAULT= option in the LENGTH statement to assign a default storage length to all newly created numeric variables can be found in SAS Language Reference: Dictionary.

Logical expressions
   Additional information about the use of logical expressions can be found in SAS Language Reference: Concepts.

Numeric precision
   For a discussion about numeric precision, see SAS Language Reference: Concepts.
   Because the computer's hardware determines the way that a computer stores numbers, the precision with which SAS can store numbers depends on the hardware of the computer system on which it is installed. Specific limits for hardware are discussed in the SAS documentation for each operating environment.

Saving space
   For information about how you can save space by treating some numeric values as character values, see Chapter 8, "Working with Character Variables."


CHAPTER 8
Working with Character Variables

   Introduction to Working with Character Variables
   Purpose
   Prerequisites
   Character Variables in SAS
   Input SAS Data Set for Examples
   Identifying Character Variables and Expressing Character Values
   Setting the Length of Character Variables
   Handling Missing Values
   Reading Missing Values
   Checking for Missing Character Values
   Setting a Character Variable Value to Missing
   Creating New Character Values
   Extracting a Portion of a Character Value
   Understanding the SCAN Function
   Aligning New Values
   Saving Storage Space When Using the SCAN Function
   Combining Character Values: Using Concatenation
   Understanding Concatenation of Variable Values
   Performing a Simple Concatenation
   Removing Interior Blanks
   Adding Additional Characters
   Troubleshooting: When New Variables Appear Truncated
   Saving Storage Space by Treating Numbers as Characters
   Review of SAS Tools
   Functions
   Statements
   Learning More

Introduction to Working with Character Variables

Purpose

In this section, you will learn how to do the following:

- identify character variables
- set the length of character variables
- align character values within character variables
- handle missing values of character variables
- work with character variables, character constants, and character expressions in SAS program statements
- instruct SAS to read fields that contain numbers as character variables in order to save space

Prerequisites

Before proceeding with this section, you should understand the concepts presented in the following topics:

- Part 1, "Introduction to SAS"
- Part 2, "Getting Your Data into Shape"
- Chapter 6, "Understanding DATA Step Processing"

Character Variables in SAS

A character variable is a variable whose value contains letters, numbers, and special characters, and whose length can be from 1 to 32,767 characters. Character variables can be used in declarative statements, comparison statements, or assignment statements where they can be manipulated to create new character variables.

Input SAS Data Set for Examples

Tradewinds Travel has an external file with data on flight schedules for tours. The following DATA step reads the information and stores it in a data set named MYLIB.DEPARTURES. (The title in the PROC PRINT step, and therefore in the output, labels the data set AIR.DEPARTURES.) In the printed manual, the callout numbers 1 through 4 appear above the four fields of the data lines; they are annotations, not part of the data.

   options pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'permanent-data-library';

   data mylib.departures;
      input Country $ 1-9 CitiesInTour 11-12 USGate $ 14-26
            ArrivalDepartureGates $ 28-48;
      datalines;
   Japan      5 San Francisco Tokyo, Osaka
   Italy      8 New York      Rome, Naples
   Australia 12 Honolulu      Sydney, Brisbane
   Venezuela  4 Miami         Caracas, Maracaibo
   Brazil     4               Rio de Janeiro, Belem
   ;

   proc print data=mylib.departures;
      title 'Data Set AIR.DEPARTURES';
   run;

The numbered fields represent

   (1) the name of the country toured
   (2) the number of cities in the tour
   (3) the city from which the tour leaves the United States (the gateway city)
   (4) the cities of arrival and departure in the destination country

The PROC PRINT step that follows the DATA step produces this display of the data set:

Output 8.1 Data Set AIR.DEPARTURES

                         Data Set AIR.DEPARTURES                         1

                       Cities
   Obs    Country      InTour    USGate           ArrivalDepartureGates

    1     Japan           5      San Francisco    Tokyo, Osaka
    2     Italy           8      New York         Rome, Naples
    3     Australia      12      Honolulu         Sydney, Brisbane
    4     Venezuela       4      Miami            Caracas, Maracaibo
    5     Brazil          4                       Rio de Janeiro, Belem

In this data set, the variables Country, USGate, and ArrivalDepartureGates contain information other than numbers, so they must be stored as character variables. The variable CitiesInTour contains only numbers; therefore, it can be created and stored as either a character or numeric variable.

Identifying Character Variables and Expressing Character Values

To store character values in a SAS data set, you need to create a character variable. One way to create a character variable is to define it in an INPUT statement. Simply place a dollar sign after the variable name in the INPUT statement, as shown in the DATA step that created the departures data set:

   input Country $ 1-9 CitiesInTour 11-12 USGate $ 14-26
         ArrivalDepartureGates $ 28-48;

You can also create a character variable and assign a value to it in an assignment statement. Simply enclose the value in quotation marks:

   Schedule = '3-4 tours per season';

Either single quotation marks (apostrophes) or double quotation marks are acceptable. If the value itself contains a single quotation mark, then surround the value with double quotation marks, as in

   Remarks = "See last year's schedule";

Note: Matching quotation marks properly is important. Missing or extraneous quotation marks cause SAS to misread both the erroneous statement and the statements following it.

When you specify a character value in an expression, you must also enclose the value in quotation marks. For example, the following statement compares the value of USGate to San Francisco and, when a match occurs, assigns the airport code SFO to the variable Airport:

   if USGate = 'San Francisco' then Airport = 'SFO';

In character values, SAS distinguishes uppercase letters from lowercase letters. For example, in the departures data set, the value of USGate in the observation for Australia is Honolulu.
The following IF condition is true; therefore, SAS assigns to Airport the value HNL:

   else if USGate = 'Honolulu' then Airport = 'HNL';

However, the following condition is false:

   if USGate = 'HONOLULU' then Airport = 'HNL';

SAS does not select that observation because the characters in Honolulu and HONOLULU are not equivalent.

The following program places these statements in a DATA step:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data charvars;
      set mylib.departures;
      Schedule = '3-4 tours per season';
      Remarks = "See last year's schedule";
      if USGate = 'San Francisco' then Airport = 'SFO';
      else if USGate = 'Honolulu' then Airport = 'HNL';
   run;

   proc print data=charvars noobs;
      var Country Schedule Remarks USGate Airport;
      title 'Tours By City of Departure';
   run;

The NOOBS option in the PROC PRINT statement suppresses the display of observation numbers in the output.

The following output displays the character variables in the data set CHARVARS:

Output 8.2 Examples of Character Variables

                         Tours By City of Departure                          1

   Country    Schedule              Remarks                   USGate         Airport

   Japan      3-4 tours per season  See last year's schedule  San Francisco  SFO
   Italy      3-4 tours per season  See last year's schedule  New York
   Australia  3-4 tours per season  See last year's schedule  Honolulu       HNL
   Venezuela  3-4 tours per season  See last year's schedule  Miami
   Brazil     3-4 tours per season  See last year's schedule

Setting the Length of Character Variables

This example illustrates why you may want to specify a length for a character variable, rather than let the first assigned value determine the length. Because tours from New York can use either John F. Kennedy International Airport or La Guardia Airport, the DATA step that follows assigns both abbreviations (JFK or LGA) to the Airport variable.

Note: When you create character variables, SAS determines the length of the variable from its first occurrence in the DATA step.
Therefore, you must allow for the longest possible value in the first statement that mentions the variable. If you do not assign the longest value the first time the variable is assigned, then data can be truncated.

   /* first attempt */
   options pagesize=60 linesize=80 pageno=1 nodate;

   data aircode;
      set mylib.departures;
      if USGate = 'San Francisco' then Airport = 'SFO';
      else if USGate = 'Honolulu' then Airport = 'HNL';
      else if USGate = 'New York' then Airport = 'JFK or LGA';
   run;

   proc print data=aircode;
      var Country USGate Airport;
      title 'Country by US Point of Departure';
   run;

The following output displays the results:

Output 8.3 Truncation of Character Values

              Country by US Point of Departure              1

   Obs    Country      USGate           Airport

    1     Japan        San Francisco    SFO
    2     Italy        New York         JFK
    3     Australia    Honolulu         HNL
    4     Venezuela    Miami
    5     Brazil

Only the characters JFK appear in the observation for New York. SAS first encounters Airport in the statement that assigns the value SFO. Therefore, SAS creates Airport with a length of three bytes and uses only the first three characters in the New York observation.

To allow space to write JFK or LGA, use a LENGTH statement as the first reference to Airport. The LENGTH statement is a declarative statement and has the form

   LENGTH variable-list $ number-of-bytes;

where variable-list is the variable or variables to which you are assigning the length number-of-bytes. The dollar sign ($) indicates that the variable is a character variable. The LENGTH statement determines the length of a character variable in both the program data vector and the data set that are being created. (In contrast, a LENGTH statement determines the length of a numeric variable only in the data set that is being created.) The maximum length of any character value in SAS is 32,767 bytes.
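
The following is a minimal sketch of the truncation rule on one observation, separate from the Tradewinds Travel programs. The data set name LENGTH_DEMO and the variables Code1, Code2, Len1, and Len2 are hypothetical. The sketch uses the VLENGTH function to report the storage length that SAS assigned to each variable.

```sas
/* Sketch only: hypothetical data set and variable names. */
data length_demo;
   length Code2 $ 10;           /* declared before any assignment             */
   Code1 = 'SFO';               /* first reference: Code1 gets a length of 3  */
   Code1 = 'JFK or LGA';        /* does not fit in 3 bytes, so it is truncated */
   Code2 = 'SFO';
   Code2 = 'JFK or LGA';        /* fits in the declared 10 bytes              */
   Len1 = vlength(Code1);       /* storage length of Code1                    */
   Len2 = vlength(Code2);       /* storage length of Code2                    */
   put Code1= Len1= Code2= Len2=;   /* writes the values to the log           */
run;
```

Comparing Code1 with Code2 in the log shows the effect of the declaration: only the variable whose length was set before its first assignment keeps the full value.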
This LENGTH statement assigns a length of 10 to the character variable Airport:

   length Airport $ 10;

Note: If you use a LENGTH statement to assign a length to a character variable, then it must be the first reference to the character variable in the DATA step. Therefore, the best position in the DATA step for a LENGTH statement is immediately after the DATA statement.

The following DATA step includes the LENGTH statement for Airport. Remember that you can use the DATASETS procedure to display the length of variables in a SAS data set.

   /* correct method */
   options pagesize=60 linesize=80 pageno=1 nodate;

   data aircode2;
      length Airport $ 10;
      set mylib.departures;
      if USGate = 'San Francisco' then Airport = 'SFO';
      else if USGate = 'Honolulu' then Airport = 'HNL';
      else if USGate = 'New York' then Airport = 'JFK or LGA';
      else if USGate = 'Miami' then Airport = 'MIA';
   run;

   proc print data=aircode2;
      var Country USGate Airport;
      title 'Country by US Point of Departure';
   run;

The following output displays the results:

Output 8.4 Using a LENGTH Statement to Capture Complete Variable Information

              Country by US Point of Departure              1

   Obs    Country      USGate           Airport

    1     Japan        San Francisco    SFO
    2     Italy        New York         JFK or LGA
    3     Australia    Honolulu         HNL
    4     Venezuela    Miami            MIA
    5     Brazil

Handling Missing Values

Reading Missing Values

SAS uses a blank to represent a missing value of a character variable. For example, the data line for Brazil lacks the departure city from the United States:

   Japan      5 San Francisco Tokyo, Osaka
   Italy      8 New York      Rome, Naples
   Australia 12 Honolulu      Sydney, Brisbane
   Venezuela  4 Miami         Caracas, Maracaibo
   Brazil     4               Rio de Janeiro, Belem

As Output 8.1 shows, when the INPUT statement reads the data line for Brazil and determines that the value for USGate in columns 14-26 is missing, SAS assigns a missing value to USGate for that observation. The missing value is represented by a blank when printing.
One special case occurs when you read character data values with list input. In that case, you must use a period to represent a missing value in data lines. (Blanks in list input separate values; therefore, SAS interprets blanks as a signal to keep searching for the value, not as a missing value.)

In the following DATA step, the TourGuide information for Venezuela is missing and is represented with a period:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data missingval;
      length Country $ 10 TourGuide $ 10;
      input Country TourGuide;
      datalines;
   Japan Yamada
   Italy Militello
   Australia Edney
   Venezuela .
   Brazil Cardoso
   ;

   proc print data=missingval;
      title 'Missing Values for Character List Input Data';
   run;

The following output displays the results:

Output 8.5 Using a Period in List Input for Missing Character Data

        Missing Values for Character List Input Data        1

   Obs    Country      TourGuide

    1     Japan        Yamada
    2     Italy        Militello
    3     Australia    Edney
    4     Venezuela
    5     Brazil       Cardoso

SAS recognized the period as a missing value in the fourth data line; therefore, it recorded a missing value for the character variable TourGuide in the resulting data set.

Checking for Missing Character Values

When you want to check for missing character values, compare the character variable to a blank surrounded by quotation marks:

   if USGate = ' ' then GateInformation = 'Missing';

The following DATA step includes this statement to check USGate for missing information.
The results are recorded in GateInformation:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data checkgate;
      length GateInformation $ 15;
      set mylib.departures;
      if USGate = ' ' then GateInformation = 'Missing';
      else GateInformation = 'Available';
   run;

   proc print data=checkgate;
      var Country CitiesInTour USGate ArrivalDepartureGates GateInformation;
      title 'Checking For Missing Gate Information';
   run;

The following output displays the results:

Output 8.6 Checking for Missing Character Values

                     Checking For Missing Gate Information                     1

                     Cities                                               Gate
   Obs   Country     InTour   USGate          ArrivalDepartureGates      Information

    1    Japan          5     San Francisco   Tokyo, Osaka               Available
    2    Italy          8     New York        Rome, Naples               Available
    3    Australia     12     Honolulu        Sydney, Brisbane           Available
    4    Venezuela      4     Miami           Caracas, Maracaibo         Available
    5    Brazil         4                     Rio de Janeiro, Belem      Missing

Setting a Character Variable Value to Missing

You can assign missing character values in assignment statements by setting the character variable to a blank surrounded by quotation marks. For example, the following statements set the day of departure based on the number of cities in the tour. If the tour visits seven or fewer cities, then the day of departure is a Sunday. Otherwise, the day of departure is not known and is set to a missing value.
   if CitiesInTour <= 7 then DayOfDeparture = 'Sunday';
   else DayOfDeparture = ' ';

The following DATA step includes these statements:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data departuredays;
      set mylib.departures;
      length DayOfDeparture $ 8;
      if CitiesInTour <= 7 then DayOfDeparture = 'Sunday';
      else DayOfDeparture = ' ';
   run;

   proc print data=departuredays;
      var Country CitiesInTour DayOfDeparture;
      title 'Departure Day is Sunday or Missing';
   run;

The following output displays the results:

Output 8.7 Assigning Missing Character Values

           Departure Day is Sunday or Missing           1

                       Cities      DayOf
   Obs    Country      InTour    Departure

    1     Japan           5       Sunday
    2     Italy           8
    3     Australia      12
    4     Venezuela       4       Sunday
    5     Brazil          4       Sunday

Creating New Character Values

Extracting a Portion of a Character Value

Understanding the SCAN Function

Some character values may contain multiple pieces of information that need to be isolated and assigned to separate character variables. For example, the value of ArrivalDepartureGates contains two cities: the city of arrival and the city of departure. How can the individual values be isolated so that separate variables can be created for the two cities?

The SCAN function returns a character string when it is given the source string, the position of the desired character string, and a character delimiter:

   SCAN (source,n<,list-of-delimiters>)

The source is the value that you want to examine. It can be any kind of character expression, including character variables, character constants, and so on. The n is the position of the term to be selected from the source. The list-of-delimiters can list one, multiple, or no delimiters. If you specify more than one delimiter, then SAS uses any of them; if you omit the delimiter, then SAS divides words according to a default list of delimiters (including the blank and some special characters).
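
As a brief sketch of the delimiter argument, the DATA step below applies SCAN to a made-up value with two different delimiters and then to a city list with the default delimiters. The data set name SCAN_DEMO, the variables Source, Term2, Term3, and FirstWord, and the sample string 'Tokyo,Osaka;Kyoto' are assumptions for illustration only.

```sas
/* Sketch only: hypothetical data set, variables, and sample string. */
data scan_demo;
   length Term2 Term3 FirstWord $ 20;   /* set lengths before first use       */
   Source = 'Tokyo,Osaka;Kyoto';
   Term2 = scan(Source, 2, ',;');       /* comma or semicolon ends a term: Osaka */
   Term3 = scan(Source, 3, ',;');       /* third term: Kyoto                  */
   FirstWord = scan('Rio de Janeiro, Belem', 1);  /* default list includes the blank: Rio */
   put Term2= Term3= FirstWord=;        /* writes the values to the log       */
run;
```

The last assignment shows why naming the delimiter matters: with the default list, the blank inside a multiword city name ends the first term early.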
For example, to select the first term in the value of ArrivalDepartureGates and assign it to a new variable named ArrivalGate, write

   ArrivalGate = scan(ArrivalDepartureGates,1,',');

The SCAN function examines the value of ArrivalDepartureGates and selects the first string as identified by a comma. Although default values can be used for the delimiter, it is a good idea to specify the delimiter to be used. If the default delimiter is used in the SCAN function when the observation for Brazil is processed, then SAS recognizes a blank space as the delimiter and selects Rio rather than Rio de Janeiro as the first term. Specifying the delimiter enables you to control where the division of the term occurs.

To select the second term from ArrivalDepartureGates and assign it to a new variable named DepartureGate, specify the following:

   DepartureGate = scan(ArrivalDepartureGates,2,',');

Note: The default length of a target variable where the expression contains the SCAN function is 200 bytes.

Aligning New Values

Remember that SAS maintains the existing alignment of a character value used in an expression; it does not perform any automatic realignment. This example creates the values for a new variable DepartureGate from the values of ArrivalDepartureGates. The value of ArrivalDepartureGates contains a comma and a blank between the two city names as shown in the following output:

Output 8.8 Dividing Values into Separate Words Using the SCAN Function

                         Data Set AIR.DEPARTURES                         1

                       Cities
   Obs    Country      InTour    USGate           ArrivalDepartureGates

    1     Japan           5      San Francisco    Tokyo, Osaka
    2     Italy           8      New York         Rome, Naples
    3     Australia      12      Honolulu         Sydney, Brisbane
    4     Venezuela       4      Miami            Caracas, Maracaibo
    5     Brazil          4                       Rio de Janeiro, Belem

When the SCAN function divides the names at the comma, the second term begins with a blank; therefore, all the values that are assigned to DepartureGate begin with a blank.
To left-align the values, use the LEFT function:

   LEFT (source)

The LEFT function produces a value that has all leading blanks in the source moved to the right side of the value; therefore, the result is left aligned. The source can be any kind of character expression, including a character variable, a character constant enclosed in quotation marks, or another character function.

This example uses the LEFT function in the second assignment statement:

   DepartureGate = scan(ArrivalDepartureGates,2,',');
   DepartureGate = left(DepartureGate);

You can also nest the two functions:

   DepartureGate = left(scan(ArrivalDepartureGates,2,','));

When you nest functions, SAS performs the action in the innermost function first. It uses the result of that function as the argument of the next function, and so on.

The following DATA step creates separate variables for the arrival gates and the departure gates:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data gates;
      set mylib.departures;
      ArrivalGate = scan(ArrivalDepartureGates,1,',');
      DepartureGate = left(scan(ArrivalDepartureGates,2,','));
   run;

   proc print data=gates;
      var Country ArrivalDepartureGates ArrivalGate DepartureGate;
      title 'Arrival and Departure Gates';
   run;

The following output displays the results:

Output 8.9  Dividing Values into Separate Words with the SCAN Function

                      Arrival and Departure Gates

                                                                 Departure
   Obs    Country      ArrivalDepartureGates    ArrivalGate      Gate

    1     Japan        Tokyo, Osaka             Tokyo            Osaka
    2     Italy        Rome, Naples             Rome             Naples
    3     Australia    Sydney, Brisbane         Sydney           Brisbane
    4     Venezuela    Caracas, Maracaibo       Caracas          Maracaibo
    5     Brazil       Rio de Janeiro, Belem    Rio de Janeiro   Belem

Saving Storage Space When Using the SCAN Function

The SCAN function causes SAS to assign a length of 200 bytes to the target variable in an assignment statement. Most of the other character functions cause the target to have the same length as the original value.
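One way to check the lengths that SAS actually assigns is the VLENGTH function, which this section uses later for the same purpose. The following is a minimal sketch under stated assumptions: a constant stands in for a data set variable, and the names Scanned, Aligned, ScannedLength, and AlignedLength are illustrative. No expected values are shown, because the length that SCAN gives to a new target variable can differ between SAS releases; check the result in your own session.

```sas
data _null_;
   /* Neither target has a LENGTH statement, so SAS chooses the lengths. */
   Scanned = scan('Tokyo, Osaka', 2, ',');
   Aligned = left(Scanned);

   /* VLENGTH returns the declared length of a variable, not the   */
   /* number of significant characters in the current value.       */
   ScannedLength = vlength(Scanned);
   AlignedLength = vlength(Aligned);
   put ScannedLength= AlignedLength=;
run;
```

If both lengths are larger than the values require, a LENGTH statement placed before the assignment statements sets them explicitly.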
In the data set GATES, the variable ArrivalGate has a length of 200 because the SCAN function creates it. The variable DepartureGate also has a length of 200 because the argument of the LEFT function contains the SCAN function.

Setting the lengths of ArrivalGate and DepartureGate to the needed values rather than to the default length saves a lot of storage space. Because SAS sets the length of a character variable the first time SAS encounters it, the LENGTH statement must appear before the assignment statements that create values for the variables:

   data gatelength;
      length ArrivalGate $ 14 DepartureGate $ 9;
      set mylib.departures;
      ArrivalGate = scan(ArrivalDepartureGates,1,',');
      DepartureGate = left(scan(ArrivalDepartureGates,2,','));
   run;

Combining Character Values: Using Concatenation

Understanding Concatenation of Variable Values

SAS enables you to combine character values into longer ones using an operation known as concatenation. Concatenation combines character values by placing them one after the other and assigning them to a variable. In SAS programming, the concatenation operator is a pair of vertical bars (||). If your keyboard does not have a solid vertical bar, use two broken vertical bars (¦¦) or two exclamation points (!!). The length of the new variable is the sum of the lengths of the pieces, or the number of characters that is specified in a LENGTH statement for the new variable. Concatenation is illustrated in the following figure:

Display 8.1  Concatenation of Two Values

Performing a Simple Concatenation

The following statement combines all the cities named as gateways into a single variable named AllGates:

   AllGates = USGate || ArrivalDepartureGates;

SAS attaches the beginning of each value of ArrivalDepartureGates to the end of each value of USGate and assigns the results to AllGates.
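The length rule for concatenation can be checked on a small scale. This is a minimal sketch, not part of the original example; the variable names Part1, Part2, Both, and BothLength are illustrative assumptions.

```sas
data _null_;
   length Part1 $ 5 Part2 $ 3;
   Part1 = 'ab';      /* stored as 'ab' plus three trailing blanks */
   Part2 = 'cd';      /* stored as 'cd' plus one trailing blank    */

   /* Both has no LENGTH statement, so its length comes from the   */
   /* lengths of the two pieces.                                   */
   Both = Part1 || Part2;
   BothLength = vlength(Both);
   put BothLength=;
run;
```

BothLength should be reported as 8, the sum of 5 and 3, even though only four characters are significant. The same rule determines the length of AllGates in the assignment statement for AllGates shown above.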
The following DATA step includes the AllGates assignment statement:

   /* first try */
   options pagesize=60 linesize=80 pageno=1 nodate;
   data all;
      set mylib.departures;
      AllGates = USGate || ArrivalDepartureGates;
   run;

   proc print data=all;
      var Country USGate ArrivalDepartureGates AllGates;
      title 'All Tour Gates';
   run;

The following output displays the results:

Output 8.10  Simple Concatenation: Interior Blanks Not Removed

                            All Tour Gates

   Obs    Country      USGate           ArrivalDepartureGates

    1     Japan        San Francisco    Tokyo, Osaka
    2     Italy        New York         Rome, Naples
    3     Australia    Honolulu         Sydney, Brisbane
    4     Venezuela    Miami            Caracas, Maracaibo
    5     Brazil                        Rio de Janeiro, Belem

   Obs    AllGates

    1     San FranciscoTokyo, Osaka
    2     New York     Rome, Naples
    3     Honolulu     Sydney, Brisbane            (1)
    4     Miami        Caracas, Maracaibo
    5                  Rio de Janeiro, Belem       (2)

Removing Interior Blanks

Why, in the previous output, does
(1) the middle of AllGates contain blanks?
(2) the beginning of AllGates in the Brazil observation contain blanks?

When a character value is shorter than the length of the variable to which it belongs, SAS pads the value with trailing blanks. The length of USGate is 13 bytes, but only San Francisco uses all of them. Therefore, the other values contain blanks at the end, and the value for Brazil is entirely blank. SAS concatenates USGate and ArrivalDepartureGates without change; therefore, the middle of AllGates contains blanks for most observations. Most of the values of ArrivalDepartureGates also contain trailing blanks. If you concatenate another variable such as Country to ArrivalDepartureGates, you will see the trailing blanks in ArrivalDepartureGates.

To eliminate trailing blanks, use the TRIM function:

   TRIM (source)

The TRIM function produces a value without the trailing blanks in the source.

Note: Other rules about trailing blanks in SAS still apply.
If the trimmed result is shorter than the length of the variable to which the result is assigned, SAS pads the result with new blanks as it makes the assignment.

To eliminate the trailing blanks in USGate from AllGates, add the TRIM function to the expression:

   AllGate2 = trim(USGate) || ArrivalDepartureGates;

The following program adds this statement to the DATA step:

   /* removing interior blanks */
   options pagesize=60 linesize=80 pageno=1 nodate;
   data all2;
      set mylib.departures;
      AllGate2 = trim(USGate) || ArrivalDepartureGates;
   run;

   proc print data=all2;
      var Country USGate ArrivalDepartureGates AllGate2;
      title 'All Tour Gates';
   run;

The following output displays the results:

Output 8.11  Removing Blanks with the TRIM Function

                                        All Tour Gates

   Obs    Country      USGate           ArrivalDepartureGates    AllGate2

    1     Japan        San Francisco    Tokyo, Osaka             San FranciscoTokyo, Osaka
    2     Italy        New York         Rome, Naples             New YorkRome, Naples
    3     Australia    Honolulu         Sydney, Brisbane         HonoluluSydney, Brisbane
    4     Venezuela    Miami            Caracas, Maracaibo       MiamiCaracas, Maracaibo
    5     Brazil                        Rio de Janeiro, Belem     Rio de Janeiro, Belem    (1)

Notice at (1) that the AllGate2 value for Brazil has a blank space before Rio de Janeiro, Belem. When the TRIM function encounters a missing value in the argument, one blank space is returned. In this observation, USGate has a missing value; therefore, one blank space is concatenated with Rio de Janeiro, Belem.

Adding Additional Characters

Data set ALL2 shows that removing the trailing blanks from USGate causes all the values of ArrivalDepartureGates to appear immediately after the corresponding values of USGate. To make the result easier to read, you can concatenate a comma and blank between the trimmed value of USGate and the value of ArrivalDepartureGates.
Also, to align the AllGate3 value for Brazil with all other values of AllGate3, use an IF-THEN statement to equate the value of AllGate3 with the value of ArrivalDepartureGates in that observation.

   AllGate3 = trim(USGate)||', '||ArrivalDepartureGates;
   if Country = 'Brazil' then AllGate3 = ArrivalDepartureGates;

This DATA step includes these statements:

   /* final version */
   options pagesize=60 linesize=80 pageno=1 nodate;
   data all3;
      set mylib.departures;
      AllGate3 = trim(USGate)||', '||ArrivalDepartureGates;
      if Country = 'Brazil' then AllGate3 = ArrivalDepartureGates;
   run;

   proc print data=all3;
      var Country USGate ArrivalDepartureGates AllGate3;
      title 'All Tour Gates';
   run;

The following output displays the results:

Output 8.12  Concatenating Additional Characters for Readability

                                        All Tour Gates

   Obs    Country      USGate           ArrivalDepartureGates    AllGate3

    1     Japan        San Francisco    Tokyo, Osaka             San Francisco, Tokyo, Osaka
    2     Italy        New York         Rome, Naples             New York, Rome, Naples
    3     Australia    Honolulu         Sydney, Brisbane         Honolulu, Sydney, Brisbane
    4     Venezuela    Miami            Caracas, Maracaibo       Miami, Caracas, Maracaibo
    5     Brazil                        Rio de Janeiro, Belem    Rio de Janeiro, Belem

Troubleshooting: When New Variables Appear Truncated

When you concatenate variables, you might see the apparent loss of part of a concatenated value. Earlier in this section, ArrivalDepartureGates was divided into two new variables, ArrivalGate and DepartureGate, each with a default length of 200 bytes. (Remember that when a variable is created by an expression that uses the SCAN function, the variable length is 200 bytes.)
For reference, this example re-creates the DATA step:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data gates;
      set mylib.departures;
      ArrivalGate = scan(ArrivalDepartureGates,1,',');
      DepartureGate = left(scan(ArrivalDepartureGates,2,','));
   run;

If the variables ArrivalGate and DepartureGate are concatenated, as they are in the next DATA step, then the length of the resulting concatenation is 402 bytes: 200 bytes for each variable and 1 byte each for the comma and the blank space. This example uses the VLENGTH function to show the length of ADGates.

   /* accidentally omitting the TRIM function */
   options pagesize=60 linesize=80 pageno=1 nodate;
   data gates2;
      set gates;
      ADGates = ArrivalGate||', '||DepartureGate;
      ADLength = vlength(ADGates);
   run;

   proc print data=gates2;
      var Country ArrivalDepartureGates ADGates ADLength;
      title 'All Tour Gates';
   run;

The following output displays the results:

Output 8.13  Losing Part of a Concatenated Value

                            All Tour Gates

   Obs    Country      ArrivalDepartureGates

    1     Japan        Tokyo, Osaka
    2     Italy        Rome, Naples
    3     Australia    Sydney, Brisbane
    4     Venezuela    Caracas, Maracaibo
    5     Brazil       Rio de Janeiro, Belem

   Obs    ADGates

    1     Tokyo
    2     Rome
    3     Sydney
    4     Caracas
    5     Rio de Janeiro

   Obs    ADLength

    1       402
    2       402
    3       402
    4       402
    5       402

The concatenated value from DepartureGate appears to be truncated in the output. It has been concatenated after the trailing blanks of ArrivalGate, and it does not appear because the output does not display 402 bytes. There is a two-step solution to the problem:

1. The TRIM function can trim the trailing blanks from ArrivalGate, as shown in the preceding section. The significant characters from all three pieces that are assigned to ADGates can then fit in the output.
2. The length of ADGates remains 402 bytes. The LENGTH statement can assign to the variable a length that is shorter but large enough to contain the significant pieces.
The following DATA step uses the TRIM function and the LENGTH statement to remove interior blanks from the concatenation:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data gates3;
      length ADGates $ 30;
      set gates;
      ADGates = trim(ArrivalGate)||', '||DepartureGate;
   run;

   proc print data=gates3;
      var Country ArrivalDepartureGates ADGates;
      title 'All Tour Gates';
   run;

The following output displays the results:

Output 8.14  Showing All of a Newly Concatenated Value

                            All Tour Gates

   Obs    Country      ArrivalDepartureGates    ADGates

    1     Japan        Tokyo, Osaka             Tokyo, Osaka
    2     Italy        Rome, Naples             Rome, Naples
    3     Australia    Sydney, Brisbane         Sydney, Brisbane
    4     Venezuela    Caracas, Maracaibo       Caracas, Maracaibo
    5     Brazil       Rio de Janeiro, Belem    Rio de Janeiro, Belem

Saving Storage Space by Treating Numbers as Characters

Remember that SAS uses eight bytes of storage for every numeric value in the DATA step; by default, SAS also uses eight bytes of storage for each numeric value in an output data set. However, a character value can contain a minimum of one character; in that case, SAS uses one byte for the character variable, both in the program data vector and in the output data set. In addition, SAS treats the digits 0 through 9 in a character value like any other character. When you are not going to perform calculations on a variable, you can save storage space by treating a value that contains digits as a character value.

For example, some tours offer various prices, depending on the quality of the hotel room. The brochures rank the rooms as two stars, three stars, and so on. In this case the values 2, 3, and 4 are really the names of categories, and arithmetic operations are not expected to be performed on them. Therefore, the values can be read into a character variable.
The following DATA step reads HotelRank as a character variable and assigns it a length of one byte:

   data hotels;
      input Country $ 1-9 HotelRank $ 11 LandCost;
      datalines;
Italy     2 498
Italy     4 698
Australia 2 915
Australia 3 1169
Australia 4 1399
   ;

   proc print data=hotels;
      title 'Hotel Rankings';
   run;

In the previous example, the INPUT statement assigns HotelRank a length of one byte because the INPUT statement reads one column to find the value (shown by the use of column input). If you are using list input, place a LENGTH statement before the INPUT statement to set the length to one byte.

If you read a number as a character value and then discover that you need to use it in a numeric expression, then you can do so without making changes in your program. SAS automatically produces a numeric value from the character value for use in the expression; it also issues a note in the log that the conversion occurred. (Of course, the conversion causes the DATA step to use slightly more computer resources.) The original variable remains unchanged.

The following output displays the results:

Output 8.15  Saving Storage Space by Creating a Character Variable

                  Hotel Rankings

                          Hotel     Land
   Obs    Country         Rank      Cost

    1     Italy            2         498
    2     Italy            4         698
    3     Australia        2         915
    4     Australia        3        1169
    5     Australia        4        1399

Note: The width of the column is not the default width of eight.

Review of SAS Tools

Functions

LEFT (source)
   left-aligns the source by moving any leading blanks to the end of the value. The source can be any kind of character expression, including a character variable, a character constant enclosed in quotation marks, or another character function. Because any blanks removed from the left are added to the right, the length of the result matches the length of the source.

SCAN (source,n<,list-of-delimiters>)
   selects the nth term from the source.
   The source can be any kind of character expression, including a character variable, a character constant enclosed in quotation marks, or another character function. To choose the character that divides the terms, use a delimiter; if you omit the delimiter, then SAS divides the terms using a default list of delimiters (the blank and some special characters).

TRIM (source)
   trims trailing blanks from the source. The source can be any kind of character expression, including a character variable, a character constant enclosed in quotation marks, or another character function. The TRIM function does not affect the way a variable is stored. If you use the TRIM function to remove trailing blanks and assign the trimmed value to a variable that is longer than that value, then SAS pads the value with new trailing blanks to make the value match the length of the new variable.

Statements

LENGTH variable-list $ number-of-bytes;
   assigns a length that you specify in number-of-bytes to the character variable or variables in variable-list. You can assign any number of lengths in a single LENGTH statement, and you can assign lengths to both character and numeric variables in the same statement. Place a dollar sign ($) before the length of any character variable.

Learning More

Character values
   This section illustrates the flexibility that SAS provides for manipulating character values. In addition to the functions that are described in this section, the following character functions are also frequently used:

   COMPBL      removes multiple blanks from a character string.
   COMPRESS    removes specified character(s) from the source.
   INDEX       searches the source data for a pattern of characters.
   LOWCASE     converts all letters in an argument to lowercase.
   RIGHT       right-aligns the source.
   SUBSTR      extracts a group of characters.
   TRANSLATE   replaces specific characters in a character expression.
   UPCASE      returns the source data in uppercase.
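The functions in this list can be tried together in one short step. The following is a minimal sketch, not taken from the original manual; the sample text and every variable name are illustrative assumptions, and the comments describe only behavior that follows from the one-line descriptions above.

```sas
data _null_;
   length Text $ 24;
   Text = 'Rio  de   Janeiro';

   Single   = compbl(Text);           /* runs of blanks reduced to one blank   */
   NoBlanks = compress(Text);         /* with no second argument, blanks go    */
   Upper    = upcase(Text);
   Lower    = lowcase(Text);
   Position = index(Single, 'Jan');   /* 8 for 'Rio de Janeiro'                */
   Part     = substr(Single, 1, 3);   /* three characters starting at column 1 */
   Swapped  = translate('M-G', '/', '-');   /* each hyphen becomes a slash     */
   Aligned  = right(Text);            /* trailing blanks move to the front     */

   put Single= / NoBlanks= / Upper= / Lower= / Position= / Part= / Swapped=;
   put '[' Aligned $char24. ']';
run;
```

Note the argument order of TRANSLATE: the replacement character is listed before the character that it replaces.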
   The INDEX and UPCASE functions are discussed in Chapter 9, “Acting on Selected Observations,” on page 139. Complete descriptions of all character functions appear in SAS Language Reference: Dictionary.

Character variables
   Detailed information about character variables is found in SAS Language Reference: Concepts. Additional information about aligning character variables is explained in the TEMPLATE procedure in SAS Output Delivery System: User’s Guide, and in the REPORT procedure in Base SAS Procedures Guide.

Comparing uppercase and lowercase characters
   How to compare uppercase and lowercase characters is shown in Chapter 9, “Acting on Selected Observations,” on page 139.

Concatenation operator
   Information about the concatenation operator can be found in SAS Language Reference: Concepts.

DATASETS procedure
   Using the DATASETS procedure to display the length of variables in a SAS data set is explained in Chapter 35, “Getting Information about Your SAS Data Sets,” on page 607.

IF-THEN statements
   A detailed explanation of the IF-THEN statements can be found in Chapter 9, “Acting on Selected Observations,” on page 139.

Informats and formats
   Complete information about the SAS System’s numerous informats and formats for reading and writing character variables is found in SAS Language Reference: Dictionary.

Missing values
   Detailed information about missing values is found in SAS Language Reference: Concepts.

VLENGTH function
   The VLENGTH function is explained in detail in SAS Language Reference: Dictionary.
CHAPTER 9

Acting on Selected Observations

   Introduction to Acting on Selected Observations
      Purpose
      Prerequisites
   Input SAS Data Set for Examples
   Selecting Observations
      Understanding the Selection Process
      Selecting Observations Based on a Simple Condition
      Providing an Alternative Action
      Creating a Series of Mutually Exclusive Conditions
   Constructing Conditions
      Understanding Construct Conditions
      Selecting an Observation Based on Simple Conditions
      Using More Than One Comparison in a Condition
         Specifying Multiple Comparisons
         Making Comparisons When All of the Conditions Must Be True
         When Only One Condition Must Be True
         Using Negative Operators with AND or OR
         Using Complex Comparisons That Require AND and OR
      Abbreviating Numeric Comparisons
   Comparing Characters
      Types of Character Comparisons
      Comparing Uppercase and Lowercase Characters
      Selecting All Values That Begin with the Same Group of Characters
      Selecting a Range of Character Values
      Finding a Value Anywhere within Another Character Value
   Review of SAS Tools
      Statements
      Functions
   Learning More

Introduction to Acting on Selected Observations

Purpose

One of the most useful features of SAS is its ability to perform an action on only the observations that you have selected. In this section, you will learn the following:

   - how the selection process works
   - how to write statements that select observations based on a condition
   - some special points about selecting numeric and character variables

Prerequisites

You should understand the concepts presented in all previous sections before proceeding with this section.

Input SAS Data Set for Examples

Tradewinds Travel offers tours to art museums and galleries in various cities. The company has decided that in order to make its process more efficient, additional information is needed.
For example, if the tour covers too many museums and galleries within a time period, then the number of museums visited must be decreased or the number of days for the tour needs to change. If the guide who is assigned to the tour is not available, then another guide must be assigned. Most of the process involves selecting observations that meet or that do not meet various criteria and then taking the required action.

The Tradewinds Travel tour data is stored in an external file that contains the following information:

   (1)      (2) (3) (4) (5)               (6)      (7)
   Rome      3  750  7  4 M, 3 G          D'Amico  Torres
   Paris     8 1680  6  5 M, 1 other      Lucas    Lucas
   London    6 1230  5  3 M, 2 G          Wilson   Lucas
   New York  6    .  8  5 M, 1 G, 2 other Lucas    D'Amico
   Madrid    3  370  5  3 M, 2 other      Torres   D'Amico
   Amsterdam 4  580  6  3 M, 3 G                   Vandever

The numbered fields represent

   (1) the name of the city
   (2) the number of nights in the city
   (3) the cost of the land package (not airfare) in US dollars
   (4) the number of events the trip offers (such as visits to museums and galleries)
   (5) a brief description of the events (where M indicates a museum; G, a gallery; and other, another kind of event)
   (6) the name of the tour guide
   (7) the name of the backup tour guide

The following DATA step creates MYLIB.ARTTOURS:

   options pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'permanent-data-library';

   data mylib.arttours;
      infile 'input-file' truncover;
      input City $ 1-9 Nights 11 LandCost 13-16 NumberOfEvents 18
            EventDescription $ 20-36 TourGuide $ 38-45
            BackUpGuide $ 47-54;
   run;

   proc print data=mylib.arttours;
      title 'Data Set MYLIB.ARTTOURS';
   run;

Note: The TRUNCOVER option in the INFILE statement enables SAS to read variable-length records. When a record is shorter than the INPUT statement expects, SAS assigns the data that is available to the variable instead of moving to the next record.
The PROC PRINT statement that follows the DATA step produces this display of the MYLIB.ARTTOURS data set:

Output 9.1  Data Set MYLIB.ARTTOURS

                               Data Set MYLIB.ARTTOURS

                                     (1)       (2)                  (3)        (4)
                           Land    Number                           Tour       BackUp
   Obs  City       Nights  Cost   OfEvents  EventDescription        Guide      Guide

    1   Rome          3     750       7     4 M, 3 G                D'Amico    Torres
    2   Paris         8    1680       6     5 M, 1 other            Lucas      Lucas
    3   London        6    1230       5     3 M, 2 G                Wilson     Lucas
    4   New York      6       .       8     5 M, 1 G, 2 other       Lucas      D'Amico
    5   Madrid        3     370       5     3 M, 2 other            Torres     D'Amico
    6   Amsterdam     4     580       6     3 M, 3 G                           Vandever

The following list corresponds to the numbered items in the preceding output:

   (1) the variable NumberOfEvents contains the number of attractions visited during the tour
   (2) EventDescription lists the number of museums (M), art galleries (G), and other attractions (other) visited
   (3) TourGuide lists the name of the tour guide assigned to the tour
   (4) BackUpGuide lists the alternate tour guide in case the original tour guide is unavailable

Selecting Observations

Understanding the Selection Process

The most common way that SAS selects observations for action in a DATA step is through the IF-THEN statement:

   IF condition THEN action;

The condition is one or more comparisons, for example,

   - City = 'Rome'
   - NumberOfEvents > Nights
   - TourGuide = 'Lucas' and Nights > 7

(The symbol > stands for greater than. You will see how to use symbols as comparison operators in “Understanding Construct Conditions” on page 145.)

For a given observation, a comparison is either true or false. In the first example, the value of City is either Rome or it is not. In the second example, the value of NumberOfEvents in the current observation is either greater than the value of Nights in the same observation or it is not. If the condition contains more than one comparison, as in the third example, then SAS evaluates all of them according to its rules (discussed later) and declares the entire condition to be true or false.
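The true-or-false result of a condition can be observed directly, because SAS represents a true comparison as 1 and a false comparison as 0. The following is a minimal sketch that assigns the Rome values by hand instead of reading MYLIB.ARTTOURS; the flag names IsRome, Busy, and LongLucas are illustrative assumptions.

```sas
data _null_;
   /* Values copied from the Rome observation. */
   City = 'Rome';
   Nights = 3;
   NumberOfEvents = 7;
   TourGuide = "D'Amico";

   /* Each condition resolves to 1 (true) or 0 (false). */
   IsRome    = (City = 'Rome');
   Busy      = (NumberOfEvents > Nights);
   LongLucas = (TourGuide = 'Lucas' and Nights > 7);
   put IsRome= Busy= LongLucas=;
run;
```

For these values the log should show IsRome=1, Busy=1, and LongLucas=0: the first two conditions are true for Rome, and the third is false because neither of its comparisons holds.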
When the condition is true, SAS takes the action in the THEN clause. The action must be expressed as a SAS statement that can be executed in an individual iteration of the DATA step. Such statements are called executable statements. The most common executable statements are assignment statements, such as

   - LandCost = LandCost + 30;
   - Calendar = 'Check schedule';
   - TourGuide = 'Torres';

This section concentrates on assignment statements in the THEN clause, but examples in other sections show other types of statements that are used with the THEN clause.

Statements that provide information about a data set are not executable. Such statements are called declarative statements. For example, the LENGTH statement affects a variable as a whole, not how the variable is treated in a particular observation. Therefore, you cannot use a LENGTH statement in a THEN clause.

When the condition is false, SAS ignores the THEN clause and proceeds to the next statement in the DATA step.

Selecting Observations Based on a Simple Condition

The following DATA step uses the previous example conditions and actions in IF-THEN statements:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data revise;
      set mylib.arttours;
      if City = 'Rome' then LandCost = LandCost + 30;
      if NumberOfEvents > Nights then Calendar = 'Check schedule';
      if TourGuide = 'Lucas' and Nights > 7 then TourGuide = 'Torres';
   run;

   proc print data=revise;
      var City Nights LandCost NumberOfEvents TourGuide Calendar;
      title 'Tour Information';
   run;

The following output displays the results:

Output 9.2  Selecting Observations with IF-THEN Statements

                              Tour Information

                              Land        Number     Tour
   Obs  City       Nights     Cost       OfEvents    Guide         Calendar (2)

    1   Rome          3        780 (1)       7       D'Amico       Check schedule
    2   Paris         8       1680           6       Torres (3)
    3   London        6       1230           5       Wilson
    4   New York      6          .           8       Lucas         Check schedule
    5   Madrid        3        370           5       Torres        Check schedule
    6   Amsterdam     4        580           6                     Check schedule

You can see in the output that

   (1) the land cost was increased by $30 in the observation for Rome
   (2) four observations have a greater number of events than they do number of days in the tour
   (3) the tour guide for Paris is replaced by Torres because the original tour guide is Lucas and the number of nights in the tour is greater than 7

Providing an Alternative Action

Remember that SAS creates a variable in all observations, even if you do not assign the variable a value in all observations. In the previous output, the value of Calendar is blank in two observations. A second IF-THEN statement can assign a different value, as in these examples:

   if NumberOfEvents > Nights then Calendar = 'Check schedule';
   if NumberOfEvents <= Nights then Calendar = 'No problems';

(The symbol <= means less than or equal to.) In this case, SAS compares the values of NumberOfEvents and Nights twice, once in each IF condition. A more efficient way to provide an alternative action is to use an ELSE statement:

   ELSE action;

An ELSE statement names an alternative action to be taken when the IF condition is false.
It must immediately follow the corresponding IF-THEN statement, as shown here:

   if NumberOfEvents > Nights then Calendar = 'Check schedule';
   else Calendar = 'No problems';

The REVISE2 DATA step adds the preceding ELSE statement to the previous DATA step:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data revise2;
      set mylib.arttours;
      if City = 'Rome' then LandCost = LandCost + 30;
      if NumberOfEvents > Nights then Calendar = 'Check schedule';
      else Calendar = 'No problems';
      if TourGuide = 'Lucas' and Nights > 7 then TourGuide = 'Torres';
   run;

   proc print data=revise2;
      var City Nights LandCost NumberOfEvents TourGuide Calendar;
      title 'Tour Information';
   run;

The following output displays the results:

Output 9.3  Providing an Alternative Action with the ELSE Statement

                              Tour Information

                              Land     Number     Tour
   Obs  City       Nights     Cost    OfEvents    Guide      Calendar

    1   Rome          3        780        7       D'Amico    Check schedule
    2   Paris         8       1680        6       Torres     No problems
    3   London        6       1230        5       Wilson     No problems
    4   New York      6          .        8       Lucas      Check schedule
    5   Madrid        3        370        5       Torres     Check schedule
    6   Amsterdam     4        580        6                  Check schedule

Creating a Series of Mutually Exclusive Conditions

Using an ELSE statement after an IF-THEN statement provides one alternative action when the IF condition is false. However, many cases involve a series of mutually exclusive conditions, each of which requires a separate action. In this example, tour prices can be classified as high, medium, or low. A series of IF-THEN and ELSE statements classifies the tour prices appropriately:

   if LandCost >= 1500 then Price = 'High  ';
   else if LandCost >= 700 then Price = 'Medium';
   else Price = 'Low';

(The symbol >= is greater than or equal to.) To see how SAS executes this series of statements, consider two observations: Amsterdam, whose value of LandCost is 580, and Paris, whose value is 1680.
When the value of LandCost is 580:

1. SAS tests whether 580 is equal to or greater than 1500, determines that the comparison is false, ignores the THEN clause, and proceeds to the ELSE statement.
2. The action in the ELSE statement is to evaluate another condition. SAS tests whether 580 is equal to or greater than 700, determines that the comparison is false, ignores the THEN clause, and proceeds to the accompanying ELSE statement.
3. SAS executes the action in the ELSE statement and assigns Price the value Low.

When the value of LandCost is 1680:

1. SAS tests whether 1680 is greater than or equal to 1500, determines that the comparison is true, and executes the action in the THEN clause. The value of Price becomes High.
2. SAS ignores the ELSE statement. Because the entire remaining series is part of the first ELSE statement, SAS skips all remaining actions in the series.

A simple way to think of these actions is to remember that when an observation satisfies one condition in a series of mutually exclusive IF-THEN/ELSE statements, SAS processes that THEN action and skips the rest of the statements. (Therefore, you can increase the efficiency of a program by ordering the IF-THEN/ELSE statements so that the most common conditions appear first.)

The following DATA step includes the preceding series of statements:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data prices;
      set mylib.arttours;
      if LandCost >= 1500 then Price = 'High  ';
      else if LandCost >= 700 then Price = 'Medium';
      else Price = 'Low';
   run;

   proc print data=prices;
      var City LandCost Price;
      title 'Tour Prices';
   run;

The following output displays the results:

Output 9.4  Assigning Mutually Exclusive Values with IF-THEN/ELSE Statements

                  Tour Prices

                         Land
   Obs    City           Cost    Price

    1     Rome            750    Medium
    2     Paris          1680    High
    3     London         1230    Medium
    4     New York          .    Low
    5     Madrid          370    Low
    6     Amsterdam       580    Low

Note the value of Price in the fourth observation.
The Price value is Low because the LandCost value for the New York trip is a missing value. Remember that a missing value is the lowest possible numeric value. Constructing Conditions Understanding Construct Conditions When you use an IF-THEN statement, you ask SAS to make a comparison. SAS must determine whether a value is equal to another value, greater than another value, and so on. SAS has six main comparison operators: Table 9.1 Comparison Operators Symbol = Mnemonic Operator Meaning EQ equal to NE not equal to (the , ^, or ~ symbol, depending on your keyboard) > GT greater than < LT less than =, ^= , ~= 146 Selecting an Observation Based on Simple Conditions Symbol 4 Chapter 9 Mnemonic Operator Meaning >= GE greater than or equal to <= LE less than or equal to The symbols in the table are based on mathematical symbols; the letter abbreviations, known as mnemonic operators, have the same effect. Use the form that you prefer, but remember that you can use the mnemonic operators only in comparisons. For example, the equal sign in an assignment statement must be represented by the symbol =, not the mnemonic operator. Both of the following statements compare the number of nights in the tour to six: 3 if Nights >= 6 then Stay = ’Week+’; 3 if Nights ge 6 then Stay = ’Week+’; The terms on each side of the comparison operator can be variables, expressions, or constants. The side a particular term appears on does not matter, as long as you use the correct operator. All of the following comparisons are constructed correctly for use in SAS statements: 3 Guide = ’ ’ 3 LandCost ne . 3 LandCost lt 600 3 600 ge LandCost 3 NumberOfEvents / Nights > 2 3 2 <= NumberOfEvents / Nights Selecting an Observation Based on Simple Conditions The following DATA step illustrates some of these conditions: options pagesize=60 linesize=80 pageno=1 nodate; data changes; set mylib.arttours; if Nights >= 6 then Stay = ’Week+’; else Stay = ’Days’; if LandCost ne . 
         then Remarks = 'OK  ';
         else Remarks = 'Redo';
      if LandCost lt 600 then Budget = 'Low   ';
         else Budget = 'Medium';
      if NumberOfEvents / Nights > 2 then Pace = 'Too fast';
         else Pace = 'OK';
   run;

   proc print data=changes;
      var City Nights LandCost NumberOfEvents Stay Remarks Budget Pace;
      title 'Tour Information';
   run;

The following output displays the results:

Output 9.5 Assigning Values to Variables According to Specific Conditions

                                Tour Information                               1

   Obs  City       Nights  LandCost  NumberOfEvents  Stay   Remarks  Budget  Pace
    1   Rome          3       750          7         Days   OK       Medium  Too fast
    2   Paris         8      1680          6         Week+  OK       Medium  OK
    3   London        6      1230          5         Week+  OK       Medium  OK
    4   New York      6         .          8         Week+  Redo     Low     OK
    5   Madrid        3       370          5         Days   OK       Low     OK
    6   Amsterdam     4       580          6         Days   OK       Low     OK

Using More Than One Comparison in a Condition

Specifying Multiple Comparisons

You can specify more than one comparison in a condition with these operators:

- & or AND
- | or OR

A condition can contain any number of ANDs, ORs, or both.

Making Comparisons When All of the Conditions Must Be True

When comparisons are connected by AND, all of the comparisons must be true for the condition to be true. Consider this example:

   if City = 'Paris' and TourGuide = 'Lucas' then
      Remarks = 'Bilingual';

The comparison is true for observations in which the value of City is Paris and the value of TourGuide is Lucas.

A common comparison is to determine whether a value lies between two quantities, that is, whether it is greater than one quantity and less than another. For example, to select observations in which the value of LandCost is greater than or equal to 1000, and less than or equal to 1500, you can write a comparison with AND:

   if LandCost >= 1000 and LandCost <= 1500 then
      Price = '1000-1500';

A simpler way to write this comparison is

   if 1000 <= LandCost <= 1500 then Price = '1000-1500';

This comparison has the same meaning as the previous one.
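As a quick check of that equivalence, the following is a minimal sketch and not part of the Tradewinds Travel examples. The data set name RANGECHECK, the variables UsingAnd and UsingChained, and the in-stream values are assumptions made for illustration. Because a comparison in SAS resolves to 1 (true) or 0 (false), the two flag variables should match in every observation, including the one with a missing LandCost.

```sas
/* Illustrative sketch: RANGECHECK, UsingAnd, and UsingChained are */
/* hypothetical names, and the data values are made up.            */
data rangecheck;
   input LandCost;
   /* Each comparison evaluates to 1 (true) or 0 (false). */
   UsingAnd     = (LandCost >= 1000 and LandCost <= 1500);
   UsingChained = (1000 <= LandCost <= 1500);
   datalines;
750
1000
1230
1500
1680
.
;
run;

proc print data=rangecheck;
   title 'AND Form Versus Chained Form';
run;
```

In this sketch the boundary values 1000 and 1500 satisfy both forms, and the missing value satisfies neither, because a missing value is lower than 1000.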
You can use any of the operators <, <=, >, >=, or their mnemonic equivalents in this way.

The following DATA step includes these multiple comparison statements:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data showand;
      set mylib.arttours;
      if City = 'Paris' and TourGuide = 'Lucas' then
         Remarks = 'Bilingual';
      if 1000 <= LandCost <= 1500 then Price = '1000-1500';
   run;

   proc print data=showand;
      var City LandCost TourGuide Remarks Price;
      title 'Tour Information';
   run;

The following output displays the results:

Output 9.6 Using AND When Making Multiple Comparisons

                          Tour Information                          1

   Obs    City         LandCost    TourGuide    Remarks      Price
    1     Rome              750    D'Amico
    2     Paris            1680    Lucas        Bilingual
    3     London           1230    Wilson                    1000-1500
    4     New York            .    Lucas
    5     Madrid            370    Torres
    6     Amsterdam         580

When Only One Condition Must Be True

When comparisons are connected by OR, only one of the comparisons needs to be true for the condition to be true. Consider the following example:

   if LandCost gt 1500 or LandCost / Nights gt 200 then
      Level = 'Deluxe';

Any observation in which the land cost is over $1500, the cost per night is over $200, or both, satisfies the condition. The following DATA step shows this condition:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data showor;
      set mylib.arttours;
      if LandCost gt 1500 or LandCost / Nights gt 200 then
         Level = 'Deluxe';
   run;

   proc print data=showor;
      var City LandCost Nights Level;
      title 'Tour Information';
   run;

The following output displays the results:

Output 9.7 Using OR When Making Multiple Comparisons

                          Tour Information                          1

   Obs    City         LandCost    Nights    Level
    1     Rome              750       3      Deluxe
    2     Paris            1680       8      Deluxe
    3     London           1230       6      Deluxe
    4     New York            .       6
    5     Madrid            370       3
    6     Amsterdam         580       4

Using Negative Operators with AND or OR

Be careful when you combine negative operators with OR. Often, the operator that you really need is AND.
For example, the values of the variable TourGuide contain some problems. In the observation for Paris, the tour guide and the backup tour guide are both Lucas; in the observation for Amsterdam, the name of the tour guide is missing. You want to label the observations that have no problems with TourGuide as OK. Should you write the IF condition with OR or with AND? The following DATA step shows both conditions:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data test;
      set mylib.arttours;
      if TourGuide ne BackUpGuide or TourGuide ne ' ' then
         GuideCheckUsingOR = 'OK';
      else GuideCheckUsingOR = 'No';
      if TourGuide ne BackUpGuide and TourGuide ne ' ' then
         GuideCheckUsingAND = 'OK';
      else GuideCheckUsingAND = 'No';
   run;

   proc print data = test;
      var City TourGuide BackUpGuide GuideCheckUsingOR GuideCheckUsingAND;
      title 'Negative Operators with OR and AND';
   run;

The following output displays the results:

Output 9.8 Using Negative Operators When Making Comparisons

                      Negative Operators with OR and AND                      1

   Obs  City       TourGuide  BackUpGuide  GuideCheckUsingOR  GuideCheckUsingAND
    1   Rome       D'Amico    Torres       OK                 OK
    2   Paris      Lucas      Lucas        OK  (1)            No
    3   London     Wilson     Lucas        OK                 OK
    4   New York   Lucas      D'Amico      OK                 OK
    5   Madrid     Torres     D'Amico      OK                 OK
    6   Amsterdam             Vandever     OK  (2)            No

In the IF-THEN/ELSE statements that create GuideCheckUsingOR, only one comparison needs to be true to make the condition true. Note the following for the Paris and Amsterdam observations in the data set MYLIB.ARTTOURS:

(1) In the observation for Paris, TourGuide does not have a missing value, so the comparison TourGuide NE ' ' is true.
(2) In the observation for Amsterdam, the comparison TourGuide NE BackUpGuide is true.

Because one OR comparison is true in each observation, GuideCheckUsingOR is labeled OK for all observations. The IF-THEN/ELSE statements that create GuideCheckUsingAND achieve better results. That is, the AND operator selects the observations in which the value of TourGuide is not the same as BackUpGuide and is not missing.
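One way to see why AND is the operator needed here is to state the problem positively and then negate it: an observation is a problem when TourGuide equals BackUpGuide or TourGuide is missing, so an observation is OK when that whole condition is not true. The following minimal sketch, which is not part of the original examples, compares the two spellings. The data set name GUIDECHECK, the variables CheckWithAnd and CheckWithNot, and the three in-stream records are assumptions made for illustration.

```sas
/* Illustrative sketch: GUIDECHECK, CheckWithAnd, and CheckWithNot */
/* are hypothetical names; the records mimic three of the tours.   */
data guidecheck;
   input TourGuide $ 1-8 BackUpGuide $ 10-17;
   /* Form used in the text: two negative comparisons joined by AND */
   if TourGuide ne BackUpGuide and TourGuide ne ' ' then
      CheckWithAnd = 'OK';
   else CheckWithAnd = 'No';
   /* Same test written as the negation of the "problem" condition */
   if not (TourGuide = BackUpGuide or TourGuide = ' ') then
      CheckWithNot = 'OK';
   else CheckWithNot = 'No';
   datalines;
D'Amico  Torres
Lucas    Lucas
         Vandever
;
run;

proc print data=guidecheck;
   title 'AND Form Versus Negated OR Form';
run;
```

The two check variables agree in every observation: OK for the first record and No for the other two. Writing the condition with NOT and parentheses can make the intent easier to read when several negative comparisons are involved.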
Using Complex Comparisons That Require AND and OR

A condition can contain both ANDs and ORs. When it does, SAS evaluates the ANDs before the ORs. The following example specifies a list of cities and a list of guides:

   /* first attempt */
   if City = 'Paris' or City = 'Rome' and
      TourGuide = 'Lucas' or TourGuide = "D'Amico" then
      Topic = 'Art history';

SAS first joins the items that are connected by AND:

   City = 'Rome' and TourGuide = 'Lucas'

Then SAS makes the following OR comparisons:

   City = 'Paris'
      or
   City = 'Rome' and TourGuide = 'Lucas'
      or
   TourGuide = "D'Amico"

To group the City comparisons and the TourGuide comparisons, use parentheses:

   /* correct method */
   if (City = 'Paris' or City = 'Rome') and
      (TourGuide = 'Lucas' or TourGuide = "D'Amico") then
      Topic = 'Art history';

SAS evaluates the comparisons within parentheses first and uses the results as the terms of the larger comparison. You can use parentheses in any condition to control the grouping of comparisons or to make the condition easier to read.

The following DATA step illustrates these conditions:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data combine;
      set mylib.arttours;
      if (City = 'Paris' or City = 'Rome') and
         (TourGuide = 'Lucas' or TourGuide = "D'Amico") then
         Topic = 'Art history';
   run;

   proc print data=combine;
      var City TourGuide Topic;
      title 'Tour Information';
   run;

The following output displays the results:

Output 9.9 Using Parentheses to Combine Comparisons with AND and OR

                          Tour Information                          1

   Obs    City         TourGuide    Topic
    1     Rome         D'Amico      Art history
    2     Paris        Lucas        Art history
    3     London       Wilson
    4     New York     Lucas
    5     Madrid       Torres
    6     Amsterdam

Abbreviating Numeric Comparisons

Two points about numeric comparisons are especially helpful to know:

- An abbreviated form of comparison is possible.
- Abbreviated comparisons with OR require you to use caution.
In computing terms, a value of TRUE is 1 and a value of FALSE is 0. In SAS, the following rules apply:

- Any numeric value other than 0 or missing is true.
- A value of 0 or missing is false.

Therefore, a numeric variable or expression can stand alone in a condition. If its value is a number other than 0 (and is not missing), then the condition is true; if its value is 0 or missing, then the condition is false.

The following example assigns a value to the variable Remarks only if the value of LandCost is present and nonzero for a given observation:

   if LandCost then Remarks = 'Ready to budget';

This statement is equivalent to

   if LandCost ne . and LandCost ne 0 then
      Remarks = 'Ready to budget';

Be careful when you abbreviate comparisons with OR; it is easy to produce unexpected results. For example, this IF-THEN statement is intended to select tours that last six or eight nights:

   /* first try */
   if Nights = 6 or 8 then Stay = 'Medium';

SAS treats the condition as the following comparisons:

   Nights=6
      or
   8

The second comparison does not use the values of Nights; it is simply the number 8 standing alone. Because the number 8 is neither 0 nor a missing value, it always has the value TRUE. Because only one comparison in a series of OR comparisons needs to be true to make the condition true, this condition is true for all observations.
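The pitfall can be made visible by storing the result of each condition in a variable. The following is a minimal sketch under stated assumptions: the data set name ORPITFALL, the variables FirstTry and Correct, and the four in-stream values of Nights are hypothetical and are used only for illustration.

```sas
/* Illustrative sketch: ORPITFALL, FirstTry, and Correct are        */
/* hypothetical names. A condition resolves to 1 (true) or 0 (false). */
data orpitfall;
   input Nights;
   /* The constant 8 stands alone, is nonzero, and is therefore true, */
   /* so FirstTry is 1 in every observation.                          */
   FirstTry = (Nights = 6 or 8);
   /* Each side of OR is a complete comparison against Nights. */
   Correct  = (Nights = 6 or Nights = 8);
   datalines;
3
6
8
4
;
run;

proc print data=orpitfall;
   title 'Abbreviated OR Versus Complete Comparisons';
run;
```

In this sketch FirstTry is 1 for all four observations, while Correct is 1 only when Nights is 6 or 8.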
The following comparisons correctly select observations that have six or eight nights:

   /* correct way */
   if Nights = 6 or Nights = 8 then Stay = 'Medium';

The following DATA step includes these IF-THEN statements:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data morecomp;
      set mylib.arttours;
      if LandCost then Remarks = 'Ready to budget';
      else Remarks = 'Need land cost';
      if Nights = 6 or Nights = 8 then Stay = 'Medium';
      else Stay = 'Short';
   run;

   proc print data=morecomp;
      var City Nights LandCost Remarks Stay;
      title 'Tour Information';
   run;

The following output displays the results:

Output 9.10 Abbreviating Numeric Comparisons

                          Tour Information                          1

   Obs    City         Nights    LandCost    Remarks            Stay
    1     Rome            3         750      Ready to budget    Short
    2     Paris           8        1680      Ready to budget    Medium
    3     London          6        1230      Ready to budget    Medium
    4     New York        6           .      Need land cost     Medium
    5     Madrid          3         370      Ready to budget    Short
    6     Amsterdam       4         580      Ready to budget    Short

Comparing Characters

Types of Character Comparisons

Some special situations occur when you make character comparisons. You might need to do the following:

- Compare uppercase and lowercase characters.
- Select all values beginning with a particular group of characters.
- Select all values beginning with a particular range of characters.
- Find a particular value anywhere within another character value.

Comparing Uppercase and Lowercase Characters

SAS distinguishes between uppercase and lowercase letters in comparisons. For example, the values Madrid and MADRID are not equivalent.
To compare values that may occur in different cases, use the UPCASE function to produce an uppercase value; then make the comparison between two uppercase values, as shown here:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data newguide;
      set mylib.arttours;
      if upcase(City) = 'MADRID' then TourGuide = 'Balarezo';
   run;

   proc print data=newguide;
      var City TourGuide;
      title 'Tour Guides';
   run;

Within the comparison, SAS produces an uppercase version of the value of City and compares it to the uppercase constant MADRID. The value of City in the observation remains in its original case. The following output displays the results:

Output 9.11 Data Set Produced by an Uppercase Comparison

                            Tour Guides                            1

   Obs    City         TourGuide
    1     Rome         D'Amico
    2     Paris        Lucas
    3     London       Wilson
    4     New York     Lucas
    5     Madrid       Balarezo
    6     Amsterdam

Now Balarezo is assigned as the tour guide for Madrid because the UPCASE function compares the uppercase value of Madrid with the value MADRID. The UPCASE function enables SAS to read the two values as equal.

Selecting All Values That Begin with the Same Group of Characters

Sometimes you need to select a group of character values, such as all tour guides whose names begin with the letter D. By default, SAS compares values of different lengths by adding blanks to the end of the shorter value and testing the result against the longer value. In this example,

   /* first attempt */
   if Tourguide = 'D' then Chosen = 'Yes';
   else Chosen = 'No';

SAS interprets the comparison as

   TourGuide = 'D       '

where D is followed by seven blanks (because TourGuide, a character variable created by column input, has a length of eight bytes). Because the value of TourGuide never consists of the single letter D, the comparison is never true.
To compare a long value to a shorter standard, put a colon (:) after the operator, as in this example:

   /* correct method */
   if TourGuide =: 'D' then Chosen = 'Yes';
   else Chosen = 'No';

The colon causes SAS to compare the same number of characters in the shorter value and the longer value. In this case, the shorter string contains one character; therefore, SAS tests only the first character from the longer value. All names beginning with a D make the comparison true. (If you are not sure that all the values of TourGuide begin with a capital letter, then use the UPCASE function.)

The following DATA step selects names beginning with D:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data dguide;
      set mylib.arttours;
      if TourGuide =: 'D' then Chosen = 'Yes';
      else Chosen = 'No';
   run;

   proc print data=dguide;
      var City TourGuide Chosen;
      title 'Guides Whose Names Begin with D';
   run;

The following output displays the results:

Output 9.12 Selecting All Values That Begin with a Particular String

                   Guides Whose Names Begin with D                   1

   Obs    City         TourGuide    Chosen
    1     Rome         D'Amico      Yes
    2     Paris        Lucas        No
    3     London       Wilson       No
    4     New York     Lucas        No
    5     Madrid       Torres       No
    6     Amsterdam                 No

Selecting a Range of Character Values

You may want to select values beginning with a range of characters, such as all names beginning with A through L or M through Z. To select a range of character values, you need to understand the following points:

- In computer processing, letters have magnitude. A is the smallest letter in the alphabet and Z is the largest. Therefore, the comparison A<B is true; so is the comparison D>C.*
- A blank is smaller than any letter.
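These two points can be checked directly, because a character comparison, like a numeric one, resolves to 1 (true) or 0 (false). The following is a minimal sketch, not part of the original examples; the variable names are hypothetical, and the DATA _NULL_ step simply writes the results to the SAS log.

```sas
/* Illustrative sketch: the variable names are hypothetical. */
data _null_;
   ALessThanB     = ('A' < 'B');   /* letters have magnitude     */
   DGreaterThanC  = ('D' > 'C');
   BlankLessThanA = (' ' < 'A');   /* a blank sorts before any letter */
   put ALessThanB= DGreaterThanC= BlankLessThanA=;
run;
```

Each of the three variables should be written to the log with the value 1.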
The following statements divide the names of the guides into two groups beginning with A-L and M-Z by combining the comparison operator with the colon:

   if TourGuide <=: 'L' then TourGuideGroup = 'A-L';
   else TourGuideGroup = 'M-Z';

The following DATA step creates the groups:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data guidegrp;
      set mylib.arttours;
      if TourGuide <=: 'L' then TourGuideGroup = 'A-L';
      else TourGuideGroup = 'M-Z';
   run;

   proc print data=guidegrp;
      var City TourGuide TourGuideGroup;
      title 'Tour Guide Groups';
   run;

The following output displays the results:

Output 9.13 Selecting All Values Beginning with a Range of Characters

                         Tour Guide Groups                         1

   Obs    City         TourGuide    TourGuideGroup
    1     Rome         D'Amico      A-L
    2     Paris        Lucas        A-L
    3     London       Wilson       M-Z
    4     New York     Lucas        A-L
    5     Madrid       Torres       M-Z
    6     Amsterdam                 A-L

All names beginning with A through L, as well as the missing value, go into group A-L. The missing value goes into that group because a blank is smaller than any letter.

* The magnitude of letters in the alphabet is true for all operating environments under which SAS runs. Other points, such as whether uppercase or lowercase letters are larger and how to treat numbers in character values, depend on your operating system. For more information about how character values are sorted under various operating environments, see Chapter 11, "Working with Grouped or Sorted Observations," on page 173.

Finding a Value Anywhere within Another Character Value

A data set is needed that lists tours that visit other attractions in addition to museums and galleries. In the data set MYLIB.ARTTOURS, the variable EventDescription refers to those events as other. However, the position of the word other varies in different observations. How can it be determined that other exists anywhere in the value of EventDescription for a given observation?
The INDEX function determines whether a specified character string (the excerpt) is present within a particular character value (the source):

   INDEX(source,excerpt)

Both source and excerpt can be any kind of character expression, including character strings enclosed in quotation marks, character variables, and other character functions. If excerpt does occur within source, then the function returns the position of the first character of excerpt, which is a positive number. If it does not, then the function returns a 0. By testing for a value greater than 0, you can determine whether a particular character string is present in another character value.

The following statements select observations containing the string other:

   if index(EventDescription,'other') > 0 then OtherEvents = 'Yes';
   else OtherEvents = 'No';

You can also write the condition as

   if index(EventDescription,'other') then OtherEvents = 'Yes';
   else OtherEvents = 'No';

The second example uses the fact that any value other than 0 or missing makes the condition true. This statement is included in the following DATA step:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data otherevent;
      set mylib.arttours;
      if index(EventDescription,'other') then OtherEvents = 'Yes';
      else OtherEvents = 'No';
   run;

   proc print data=otherevent;
      var City EventDescription OtherEvents;
      title 'Tour Events';
   run;

The following output displays the results:

Output 9.14 Finding a Character String within Another Value

                            Tour Events                            1

   Obs    City         EventDescription     OtherEvents
    1     Rome         4 M, 3 G             No
    2     Paris        5 M, 1 other         Yes
    3     London       3 M, 2 G             No
    4     New York     5 M, 1 G, 2 other    Yes
    5     Madrid       3 M, 2 other         Yes
    6     Amsterdam    3 M, 3 G             No

In the observations for Paris and Madrid, the INDEX function returns the value 8 because the string other begins at the eighth character position of the value (5 M, 1 other for Paris and 3 M, 2 other for Madrid).
For New York, it returns the value 13 because the string other begins at the thirteenth character position of the value (5 M, 1 G, 2 other). In the remaining observations, the function does not find the string other and returns a 0.

Review of SAS Tools

Statements

IF condition THEN action;
   tests whether the condition is true; if so, the action in the THEN clause is carried out. If the condition is false and an ELSE statement is present, then the ELSE action is carried out. If the condition is false and no ELSE statement is present, then the next statement in the DATA step is processed. The condition is one or more numeric or character comparisons. The action must be an executable statement; that is, one that can be processed in an individual iteration of the DATA step. (Statements that affect the entire DATA step, such as LENGTH, are not executable.)

   In SAS processing, any numeric value other than 0 or missing is true; 0 and missing are false. Therefore, a numeric value can stand alone in a comparison. If its value is 0 or missing, then the comparison is false; otherwise, the comparison is true.

Functions

INDEX(source,excerpt)
   searches the source for the string given in excerpt. Both the source and excerpt can be any kind of character expression, such as character variables, character strings enclosed in quotation marks, other character functions, and so on. When excerpt is present in source, the function returns the position of the first character of excerpt (a positive number). When excerpt is not present, the function returns a 0.

UPCASE(argument)
   produces an uppercase value of argument, which can be any kind of character expression, such as character variables, character strings enclosed in quotation marks, other character functions, and so on.

Learning More

Base SAS functions
   Base SAS functions are documented in SAS Language Reference: Dictionary.
Comparison and logical operators
   Complete information about comparison and logical operators is provided in SAS Language Reference: Concepts.

Executable statements
   You can issue only executable statements in IF-THEN/ELSE statements. For a complete list of executable and nonexecutable statements, see SAS Language Reference: Dictionary.

IF-THEN and ELSE statements and clauses
   The IF-THEN and ELSE statements and their clauses are documented in SAS Language Reference: Dictionary.

IN operator
   Information about the IN operator can be found in SAS Language Reference: Concepts. You can use the IN operator to shorten a comparison when you are comparing a value to a series of numeric or character constants (not variables or expressions).

SELECT statement
   The SELECT statement, which selects observations based on a condition, is documented in SAS Language Reference: Dictionary. Its action is equivalent to a series of IF-THEN/ELSE statements. If you have a long series of conditions and actions, then the DATA step may be easier to read if you write them in a SELECT group.

TRUNCOVER option
   The TRUNCOVER option in the INFILE statement is described in Chapter 3, "Starting with Raw Data: The Basics," on page 43.
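The IN operator and the SELECT group mentioned above can be sketched briefly. The following is an illustrative sketch only, not an example taken from the references listed here. The data set name INSELECT is hypothetical, and the in-stream records repeat the tour values shown earlier so that the step does not depend on MYLIB.ARTTOURS.

```sas
/* Illustrative sketch: INSELECT is a hypothetical data set name. */
data inselect;
   input City $ 1-9 Nights 11 LandCost 13-16;

   /* IN shortens: Nights = 6 or Nights = 8 */
   if Nights in (6, 8) then Stay = 'Medium';
   else Stay = 'Short';

   /* SELECT group that mirrors the IF-THEN/ELSE series for Price. */
   /* The first WHEN condition that is true determines the action. */
   select;
      when (LandCost >= 1500) Price = 'High  ';
      when (LandCost >= 700)  Price = 'Medium';
      otherwise               Price = 'Low';
   end;
   datalines;
Rome      3  750
Paris     8 1680
London    6 1230
New York  6    .
Madrid    3  370
Amsterdam 4  580
;
run;

proc print data=inselect;
   title 'IN Operator and SELECT Group';
run;
```

As with the IF-THEN/ELSE version, the New York observation falls through to the OTHERWISE statement and receives the Price value Low, because its LandCost is missing.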
CHAPTER 10
Creating Subsets of Observations

Introduction to Creating Subsets of Observations
   Purpose
   Prerequisites
Input SAS Data Set for Examples
Selecting Observations for a New SAS Data Set
   Deleting Observations Based on a Condition
   Accepting Observations Based on a Condition
   Comparing the DELETE and Subsetting IF Statements
Conditionally Writing Observations to One or More SAS Data Sets
   Understanding the OUTPUT Statement
   Example for Conditionally Writing Observations to Multiple Data Sets
   A Common Mistake When Writing to Multiple Data Sets
   Understanding Why the Placement of the OUTPUT Statement Is Important
   Writing an Observation Multiple Times to One or More Data Sets
Review of SAS Tools
   Statements
Learning More

Introduction to Creating Subsets of Observations

Purpose

In this section, you will learn to select specific observations from existing SAS data sets in order to create the following:

- a new SAS data set that includes only some of the observations from the input data source
- several new SAS data sets by writing observations from an input data source, using a single DATA step

Prerequisites

Before proceeding with this section, you should understand the concepts presented in the following topics:

- Part 1, "Introduction to the SAS System"
- Part 2, "Getting Your Data into Shape"
- Chapter 6, "Understanding DATA Step Processing," on page 97

Input SAS Data Set for Examples

Tradewinds Travel has a schedule for tours to various art museums and galleries. It would be convenient to keep different SAS data sets that contain different information about the tours. The tour data is stored in an external file that contains the following information:

   1         2 3    4      5
   Rome      3  750 Medium D'Amico
   Paris     8 1680 High   Lucas
   London    6 1230 High   Wilson
   New York  6    .        Lucas
   Madrid    3  370 Low    Torres
   Amsterdam 4  580 Low

The numbered fields represent

1. the name of the destination city
2. the number of nights on the tour
3. the cost of the land package in US dollars
4. a rating of the budget
5. the name of the tour guide

The following program creates a permanent SAS data set named MYLIB.ARTS:

   options pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'permanent-data-library';
   data mylib.arts;
      infile 'input-file' truncover;
      input City $ 1-9 Nights 11 LandCost 13-16 Budget $ 18-23
            TourGuide $ 25-32;
   run;

   proc print data=mylib.arts;
      title 'Data Set MYLIB.ARTS';
   run;

The PROC PRINT statement that follows the DATA step produces this display of the MYLIB.ARTS data set:

Output 10.1 Data Set MYLIB.ARTS

                        Data Set MYLIB.ARTS                        1

   Obs    City         Nights    LandCost    Budget    TourGuide
    1     Rome            3         750      Medium    D'Amico
    2     Paris           8        1680      High      Lucas
    3     London          6        1230      High      Wilson
    4     New York        6           .                Lucas
    5     Madrid          3         370      Low       Torres
    6     Amsterdam       4         580      Low

Selecting Observations for a New SAS Data Set

Deleting Observations Based on a Condition

There are two ways to select specific observations in a SAS data set when creating a new SAS data set:

1. Delete the observations that do not meet a condition, keeping only the ones that you want.
2. Accept only the observations that meet a condition.

To delete an observation, first identify it with an IF condition; then use a DELETE statement in the THEN clause:

   IF condition THEN DELETE;

Processing the DELETE statement for an observation causes SAS to return immediately to the beginning of the DATA step for a new observation without writing the current observation to the output data set. The DELETE statement does not include the observation in the output data set, but it does not delete the observation from the input data set. For example, the following statement deletes observations that contain a missing value for LandCost:

   if LandCost = .
      then delete;

The following DATA step includes this statement:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data remove;
      set mylib.arts;
      if LandCost = . then delete;
   run;

   proc print data=remove;
      title 'Tours With Complete Land Costs';
   run;

The following output displays the results:

Output 10.2 Deleting Observations That Have a Particular Value

                   Tours With Complete Land Costs                   1

   Obs    City         Nights    LandCost    Budget    TourGuide
    1     Rome            3         750      Medium    D'Amico
    2     Paris           8        1680      High      Lucas
    3     London          6        1230      High      Wilson
    4     Madrid          3         370      Low       Torres
    5     Amsterdam       4         580      Low

New York, the observation that is missing a value for LandCost, is not included in the resulting data set, REMOVE.

You can also delete observations as you enter data from an external file. The following DATA step produces the same SAS data set as the REMOVE data set:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data remove2;
      infile 'input-file' truncover;
      input City $ 1-9 Nights 11 LandCost 13-16 Budget $ 18-23
            TourGuide $ 25-32;
      if LandCost = . then delete;
   run;

   proc print data=remove2;
      title 'Tours With Complete Land Costs';
   run;

The following output displays the results:

Output 10.3 Deleting Observations While Reading from an External File

                   Tours With Complete Land Costs                   1

   Obs    City         Nights    LandCost    Budget    TourGuide
    1     Rome            3         750      Medium    D'Amico
    2     Paris           8        1680      High      Lucas
    3     London          6        1230      High      Wilson
    4     Madrid          3         370      Low       Torres
    5     Amsterdam       4         580      Low

Accepting Observations Based on a Condition

One data set that is needed by the travel agency contains observations for tours that last only six nights. One way to make the selection is to delete observations in which the value of Nights is not equal to 6:

   if Nights ne 6 then delete;

A more straightforward way is to select only observations meeting the criterion. The subsetting IF statement selects the observations that you specify.
It contains only a condition:

   IF condition;

The implicit action in a subsetting IF statement is always the same: if the condition is true, then continue processing the observation; if it is false, then stop processing the observation and return to the top of the DATA step for a new observation. The statement is called subsetting because the result is a subset of the original observations. For example, if you want to select only observations in which the value of Nights is equal to 6, then you specify the following statement:

   if Nights = 6;

The following DATA step includes the subsetting IF:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data subset6;
      set mylib.arts;
      if nights=6;
   run;

   proc print data=subset6;
      title 'Six-Night Tours';
   run;

The following output displays the results:

Output 10.4 Selecting Observations with a Subsetting IF Statement

                          Six-Night Tours                          1

   Obs    City         Nights    LandCost    Budget    TourGuide
    1     London          6        1230      High      Wilson
    2     New York        6           .                Lucas

Two observations met the criterion for a six-night tour.

Comparing the DELETE and Subsetting IF Statements

Two considerations usually guide the choice between a DELETE statement and a subsetting IF statement:

- It is usually easier to choose the statement that requires the fewest comparisons to identify the condition.
- It is usually easier to think in positive terms than negative ones (this favors the subsetting IF).

One additional situation favors the subsetting IF: it is the safer method if your data has missing or misspelled values. Consider the following situation.

Tradewinds Travel needs a SAS data set of low- to medium-priced tours. Knowing that the values of Budget are Low, Medium, and High, a first thought would be to delete observations with a value of High.
The following program creates a SAS data set by deleting observations that have a Budget value of HIGH:

   /* first attempt */
   options pagesize=60 linesize=80 pageno=1 nodate;
   data lowmed;
      set mylib.arts;
      if upcase(Budget) = 'HIGH' then delete;
   run;

   proc print data=lowmed;
      title 'Medium and Low Priced Tours';
   run;

The following output displays the results:

Output 10.5 Producing a Subset by Deletion

                    Medium and Low Priced Tours                    1

   Obs    City         Nights    LandCost    Budget    TourGuide
    1     Rome            3         750      Medium    D'Amico
    2     New York        6           .                Lucas
    3     Madrid          3         370      Low       Torres
    4     Amsterdam       4         580      Low

The data set LOWMED contains both the tours that you want and the tour to New York. The inclusion of the tour to New York is erroneous because the value of Budget for the New York observation is missing.

Using a subsetting IF statement ensures that the data set contains exactly the observations you want. This DATA step creates the subset with a subsetting IF statement:

   /* a safer method */
   options pagesize=60 linesize=80 pageno=1 nodate;
   data lowmed2;
      set mylib.arts;
      if upcase(Budget) = 'MEDIUM' or upcase(Budget) = 'LOW';
   run;

   proc print data=lowmed2;
      title 'Medium and Low Priced Tours';
   run;

The following output displays the results:

Output 10.6 Producing an Exact Subset with Subsetting IF

                    Medium and Low Priced Tours                    1

   Obs    City         Nights    LandCost    Budget    TourGuide
    1     Rome            3         750      Medium    D'Amico
    2     Madrid          3         370      Low       Torres
    3     Amsterdam       4         580      Low

The result is a SAS data set with no missing values for Budget.
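The difference between the two approaches can be reproduced without the permanent library. The following is a minimal sketch under stated assumptions: the data set names TOURS, BYDELETE, and BYSUBSET and the four in-stream records are hypothetical, and the subsetting IF is written with the IN operator as a shorter spelling of the two comparisons joined by OR.

```sas
/* Illustrative sketch: TOURS, BYDELETE, and BYSUBSET are hypothetical */
/* names; the records are made up, with one missing Budget value.      */
data tours;
   input City $ 1-9 Budget $ 11-16;
   datalines;
Rome      Medium
Paris     High
New York
Madrid    low
;
run;

/* Deletion removes only what is named, so the observation with a */
/* missing Budget (New York) stays in the result.                 */
data bydelete;
   set tours;
   if upcase(Budget) = 'HIGH' then delete;
run;

/* The subsetting IF keeps only what is named, so the observation */
/* with a missing Budget is left out.                             */
data bysubset;
   set tours;
   if upcase(Budget) in ('MEDIUM', 'LOW');
run;

proc print data=bydelete;
   title 'Subset Produced by Deletion';
run;

proc print data=bysubset;
   title 'Subset Produced by Subsetting IF';
run;
```

In this sketch BYDELETE contains Rome, New York, and Madrid, while BYSUBSET contains only Rome and Madrid. The UPCASE function lets the lowercase value low match the constant LOW.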
Conditionally Writing Observations to One or More SAS Data Sets

Understanding the OUTPUT Statement

SAS enables you to create multiple SAS data sets in a single DATA step using an OUTPUT statement:

   OUTPUT <SAS-data-set(s)>;

When you use an OUTPUT statement without specifying a data set name, SAS writes the current observation to all data sets named in the DATA statement. If you want to write observations to a selected data set, then you specify that data set name directly in the OUTPUT statement. Any data set name appearing in the OUTPUT statement must also appear in the DATA statement.

Example for Conditionally Writing Observations to Multiple Data Sets

Suppose that you need two SAS data sets: one that contains tours guided by the tour guide Lucas and another that contains tours led by other guides. Writing to multiple data sets is accomplished by doing all of the following:

1. naming both data sets in the DATA statement
2. selecting the observations using an IF condition
3. using an OUTPUT statement in the THEN and ELSE clauses to output the observations to the appropriate data sets

The following DATA step shows these steps:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data lucastour othertours;
      set mylib.arts;
      if TourGuide = 'Lucas' then output lucastour;
      else output othertours;
   run;

   proc print data=lucastour;
      title "Data Set with TourGuide = 'Lucas'";
   run;

   proc print data=othertours;
      title "Data Set with Other Guides";
   run;

The following output displays the results:

Output 10.7 Creating Two Data Sets with One DATA Step

                 Data Set with TourGuide = 'Lucas'                 1

   Obs    City         Nights    LandCost    Budget    TourGuide
    1     Paris           8        1680      High      Lucas
    2     New York        6           .                Lucas

                    Data Set with Other Guides                     2

   Obs    City         Nights    LandCost    Budget    TourGuide
    1     Rome            3         750      Medium    D'Amico
    2     London          6        1230      High      Wilson
    3     Madrid          3         370      Low       Torres
    4     Amsterdam       4         580      Low

A Common Mistake When Writing to Multiple Data Sets

If you use an OUTPUT statement, then you suppress the automatic output of observations at the end of the DATA step. Therefore, if you plan to use any OUTPUT statements in a DATA step, then you must program all output for that step with OUTPUT statements. For example, in the previous DATA step you sent output to both LUCASTOUR and OTHERTOURS. For comparison, the following program shows what would happen if you omit the ELSE statement in the DATA step:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data lucastour2 othertour2;
      set mylib.arts;
      if TourGuide = 'Lucas' then output lucastour2;
   run;

   proc print data=lucastour2;
      title "Data Set with Guide = 'Lucas'";
   run;

   proc print data=othertour2;
      title "Data Set with Other Guides";
   run;

The following output displays the results:

Output 10.8 Failing to Direct Output to a Second Data Set

                   Data Set with Guide = 'Lucas'                   1

   Obs    City         Nights    LandCost    Budget    TourGuide
    1     Paris           8        1680      High      Lucas
    2     New York        6           .                Lucas

No observations are written to OTHERTOUR2 because output was not directed to it.

Understanding Why the Placement of the OUTPUT Statement Is Important

By default SAS writes an observation to the output data set at the end of each iteration. When you use an OUTPUT statement, you override the automatic output feature. Where you place the OUTPUT statement, therefore, is very important. For example, if a variable value is calculated after the OUTPUT statement executes, then that value is not available when the observation is written to the output data set.
For example, in the following DATA step, an assignment statement is placed after the IF-THEN/ELSE group:

   /* first attempt to combine assignment and OUTPUT statements */
   options pagesize=60 linesize=80 pageno=1 nodate;
   data lucasdays otherdays;
      set mylib.arts;
      if TourGuide = 'Lucas' then output lucasdays;
      else output otherdays;
      Days = Nights+1;
   run;

   proc print data=lucasdays;
      title "Number of Days in Lucas's Tours";
   run;

   proc print data=otherdays;
      title "Number of Days in Other Guides' Tours";
   run;

Output 10.9  Unintended Results: Outputting Observations before Assigning Values

                   Number of Days in Lucas's Tours                   1

                                   Land                 Tour
   Obs    City         Nights      Cost    Budget       Guide      Days

    1     Paris           8        1680    High         Lucas        .
    2     New York        6           .                 Lucas        .

                Number of Days in Other Guides' Tours                2

                                   Land                 Tour
   Obs    City         Nights      Cost    Budget       Guide      Days

    1     Rome            3         750    Medium       D'Amico      .
    2     London          6        1230    High         Wilson       .
    3     Madrid          3         370    Low          Torres       .
    4     Amsterdam       4         580    Low                       .

The value of Days is missing in all observations because the OUTPUT statement writes the observation to the SAS data sets before the assignment statement is processed. If you want the value of Days to appear in the data sets, then use the assignment statement before you use the OUTPUT statement. The following program shows the correct position:

   /* correct position of assignment statement */
   options pagesize=60 linesize=80 pageno=1 nodate;
   data lucasdays2 otherdays2;
      set mylib.arts;
      Days = Nights + 1;
      if TourGuide = 'Lucas' then output lucasdays2;
      else output otherdays2;
   run;

   proc print data=lucasdays2;
      title "Number of Days in Lucas's Tours";
   run;

   proc print data=otherdays2;
      title "Number of Days in Other Guides' Tours";
   run;

Output 10.10  Intended Results: Assigning Values before Outputting Observations

                   Number of Days in Lucas's Tours                   1

                                   Land                 Tour
   Obs    City         Nights      Cost    Budget       Guide      Days

    1     Paris           8        1680    High         Lucas        9
    2     New York        6           .                 Lucas        7

                Number of Days in Other Guides' Tours                2

                                   Land                 Tour
   Obs    City         Nights      Cost    Budget       Guide      Days

    1     Rome            3         750    Medium       D'Amico      4
    2     London          6        1230    High         Wilson       7
    3     Madrid          3         370    Low          Torres       4
    4     Amsterdam       4         580    Low                       5

Writing an Observation Multiple Times to One or More Data Sets

After SAS processes an OUTPUT statement, the observation remains in the program data vector and you can continue programming with it. You can even output it again to the same SAS data set or to a different one. The following example creates two pairs of data sets, one pair based on the name of the tour guide and one pair based on the number of nights.

   options pagesize=60 linesize=80 pageno=1 nodate;
   data lucastour othertour weektour daytour;
      set mylib.arts;
      if TourGuide = 'Lucas' then output lucastour;
      else output othertour;
      if nights >= 6 then output weektour;
      else output daytour;
   run;

   proc print data=lucastour;
      title "Lucas's Tours";
   run;

   proc print data=othertour;
      title "Other Guides' Tours";
   run;

   proc print data=weektour;
      title 'Tours Lasting a Week or More';
   run;

   proc print data=daytour;
      title 'Tours Lasting Less Than a Week';
   run;

The following output displays the results:

Output 10.11  Assigning Observations to More Than One Data Set

                            Lucas's Tours                            1

                                   Land                 Tour
   Obs    City         Nights      Cost    Budget       Guide

    1     Paris           8        1680    High         Lucas
    2     New York        6           .                 Lucas

                         Other Guides' Tours                         2

                                   Land                 Tour
   Obs    City         Nights      Cost    Budget       Guide

    1     Rome            3         750    Medium       D'Amico
    2     London          6        1230    High         Wilson
    3     Madrid          3         370    Low          Torres
    4     Amsterdam       4         580    Low

                    Tours Lasting a Week or More                     3

                                   Land                 Tour
   Obs    City         Nights      Cost    Budget       Guide

    1     Paris           8        1680    High         Lucas
    2     London          6        1230    High         Wilson
    3     New York        6           .                 Lucas

                   Tours Lasting Less Than a Week                    4

                                   Land                 Tour
   Obs    City         Nights      Cost    Budget       Guide

    1     Rome            3         750    Medium       D'Amico
    2     Madrid          3         370    Low          Torres
    3     Amsterdam       4         580    Low

The first IF-THEN/ELSE group outputs all observations to either data set LUCASTOUR or OTHERTOUR. The second IF-THEN/ELSE group outputs the same observations to a different pair of data sets, WEEKTOUR and DAYTOUR. This repetition is possible because each observation remains in the program data vector after the first OUTPUT statement is processed and can be output again.

Review of SAS Tools

Statements

DATA SAS-data-set-1 <. . . SAS-data-set-n>;
   names the SAS data set(s) to be created in the DATA step.

DELETE;
   deletes the current observation. The DELETE statement is usually used as part of an IF-THEN/ELSE group.

IF condition;
   tests whether the condition is true. If it is true, then SAS continues processing the current observation; if it is not true, then SAS stops processing the observation, does not add it to the SAS data set, and returns to the top of the DATA step. The conditions used are the same as in the IF-THEN/ELSE statements. This type of IF statement is called a subsetting IF statement because it produces a subset of the original observations.

OUTPUT <SAS-data-set(s)>;
   immediately writes the current observation to the SAS data set. The observation remains in the program data vector, and you can continue programming with it, including outputting it again if you desire. When an OUTPUT statement appears in a DATA step, SAS does not automatically output observations to the SAS data set; you must specify the destination for all output in the DATA step with OUTPUT statements. Any SAS data set that you specify in an OUTPUT statement must also appear in the DATA statement.

Learning More

Comparison and logical operators
   See Chapter 9, “Acting on Selected Observations,” on page 139 and SAS Language Reference: Concepts.
DROP= and KEEP= data set options
   Using the DROP= and KEEP= data set options to output a subset of variables to a SAS data set is discussed in Chapter 5, “Starting with SAS Data Sets,” on page 81.

FIRSTOBS= and OBS= data set options
   Using these data set options to select observations from the beginning, middle, or end of a SAS data set is discussed in Chapter 5, “Starting with SAS Data Sets,” on page 81. They are documented completely in SAS Language Reference: Dictionary.

IF-THEN/ELSE, DELETE, and OUTPUT statements
   The IF-THEN/ELSE, DELETE, and OUTPUT statements are completely documented in SAS Language Reference: Dictionary.

WHERE statement
   See Chapter 25, “Producing Detail Reports with the PRINT Procedure,” on page 371. The WHERE statement selects observations based on a condition. Its action is similar to that of a subsetting IF statement. The WHERE statement is extremely useful in PROC steps, and it can also be useful in some DATA steps. The WHERE statement selects observations before they enter the program data vector (in contrast to the subsetting IF statement, which selects observations already in the program data vector).

   Note: In some cases, the same condition in a WHERE statement in the DATA step and in a subsetting IF statement produces different subsets. The difference is described in the discussion of the WHERE statement in SAS Language Reference: Dictionary. Be sure you understand the difference before you use the WHERE statement in the DATA step. With that caution in mind, a WHERE statement can increase the efficiency of the DATA step considerably.
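To make the contrast between the WHERE statement and the subsetting IF statement concrete, here is a minimal sketch that assumes the MYLIB.ARTS data set used earlier in this section. The output data set names (LOWMED_WHERE, LONGTOURS) are hypothetical. The first step filters on a variable that already exists in the input data set, so a WHERE statement can be used. The second step filters on Days, which is computed inside the DATA step, so a subsetting IF statement is needed; a WHERE statement can refer only to variables in the input data set.

```sas
/* Sketch only: output data set names are hypothetical. */

/* WHERE selects observations before they enter the program data vector */
data lowmed_where;
   set mylib.arts;
   where upcase(Budget) in ('MEDIUM', 'LOW');
run;

/* Days is created in this step, so it is not available to WHERE; */
/* a subsetting IF tests it after the assignment has executed.    */
data longtours;
   set mylib.arts;
   Days = Nights + 1;
   if Days >= 7;
run;
```

For a simple condition on input variables such as the first one, the two statements are expected to select the same observations. The cases in which they differ are the ones that the note above refers you to SAS Language Reference: Dictionary for.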
CHAPTER 11
Working with Grouped or Sorted Observations

   Introduction to Working with Grouped or Sorted Observations
      Purpose
      Prerequisites
   Input SAS Data Set for Examples
   Working with Grouped Data
      Understanding the Basics of Grouping Data
      Grouping Observations with the SORT Procedure
      Grouping by More Than One Variable
      Arranging Groups in Descending Order
      Finding the First or Last Observation in a Group
   Working with Sorted Data
      Understanding Sorted Data
      Sorting Data
      Deleting Duplicate Observations
   Understanding Collating Sequences
      ASCII Collating Sequence
      EBCDIC Collating Sequence
   Review of SAS Tools
      Procedures
      Statements
   Learning More

Introduction to Working with Grouped or Sorted Observations

Purpose

Sometimes you need to create reports where observations are grouped according to the values of a particular variable, or where observations are sorted alphabetically. In this section you will learn the following:

   - how to group observations by variables and how to work with grouped observations
   - how to sort the observations and how to work with sorted observations

Prerequisites

Before proceeding with this section, you should understand the concepts presented in the following parts:

   - Part 1, “Introduction to the SAS System”
   - Part 2, “Getting Your Data into Shape”
   - Chapter 6, “Understanding DATA Step Processing,” on page 97.

Input SAS Data Set for Examples

Tradewinds Travel has an external file that contains data about tours that emphasize either architecture or scenery. After the data is created in a SAS data set and the observations for those tours are grouped together, SAS can produce reports on each group separately. In addition, if the observations need to be alphabetized by country, SAS can sort them.
The external file looks like this:

   (1)         (2)          (3) (4) (5)
   Spain       architecture 10  510 World
   Japan       architecture  8  720 Express
   Switzerland scenery       9  734 World
   France      architecture  8  575 World
   Ireland     scenery       7  558 Express
   New Zealand scenery      16 1489 Southsea
   Italy       architecture  8  468 Express
   Greece      scenery      12  698 Express

The numbered fields represent

   (1) the name of the destination country
   (2) the tour's area of emphasis
   (3) the number of nights on the tour
   (4) the cost of the land package in US dollars
   (5) the name of the tour vendor

The following DATA step creates the permanent SAS data set MYLIB.ARCH_OR_SCEN:

   options pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'permanent-data-library';

   data mylib.arch_or_scen;
      infile 'input-file' truncover;
      input Country $ 1-11 TourType $ 13-24 Nights LandCost Vendor $;
   run;

   proc print data=mylib.arch_or_scen;
      title 'Data Set MYLIB.ARCH_OR_SCEN';
   run;

The PROC PRINT step that follows the DATA step produces this display of the MYLIB.ARCH_OR_SCEN data set:

Output 11.1  Data Set MYLIB.ARCH_OR_SCEN

                     Data Set MYLIB.ARCH_OR_SCEN                     1

                                                   Land
   Obs    Country        TourType        Nights    Cost    Vendor

    1     Spain          architecture      10       510    World
    2     Japan          architecture       8       720    Express
    3     Switzerland    scenery            9       734    World
    4     France         architecture       8       575    World
    5     Ireland        scenery            7       558    Express
    6     New Zealand    scenery           16      1489    Southsea
    7     Italy          architecture       8       468    Express
    8     Greece         scenery           12       698    Express

Working with Grouped Data

Understanding the Basics of Grouping Data

The basic method for grouping data is to use a BY statement:

   BY list-of-variables;

The BY statement can be used in a DATA step with a SET, MERGE, MODIFY, or UPDATE statement, or it can be used in SAS procedures. To work with grouped data using the SET, MERGE, MODIFY, or UPDATE statements, the data must meet these conditions:

   - The observations must be in a SAS data set, not an external file.
   - The variables that define the groups must appear in the BY statement.
   - All observations in the input data set must be in ascending or descending numeric or character order, or grouped in some way, such as by calendar month or a formatted value, according to the variables that will be specified in the BY statement.

   Note: If you use the MODIFY statement, the input data does not need to be in any order. However, ordering the data can improve performance.

If the third condition is not met (the data are in a SAS data set but are not arranged in the groups you want), you can order the data using the SORT procedure (discussed in the next section). Once the SAS data set is arranged in some order, you can use the BY statement to group values of one or more common variables.

Grouping Observations with the SORT Procedure

All observations in the input data set must be in a particular order. To meet this condition, the observations in MYLIB.ARCH_OR_SCEN can be ordered by the values of TourType, architecture and scenery:

   proc sort data=mylib.arch_or_scen out=tourorder;
      by TourType;
   run;

The SORT procedure sorts the data set MYLIB.ARCH_OR_SCEN alphabetically according to the values of TourType. The sorted observations go into a new data set specified by the OUT= option. In this example, TOURORDER is the sorted data set. If the OUT= option is omitted, the sorted version of the data set replaces the data set MYLIB.ARCH_OR_SCEN.

The SORT procedure does not produce output other than the sorted data set. A message in the SAS log says that the SORT procedure was executed:

Output 11.2  Message That the SORT Procedure Has Executed Successfully

   2    proc sort data=mylib.arch_or_scen out=tourorder;
   3       by TourType;
   4    run;

   NOTE: There were 8 observations read from the data set MYLIB.ARCH_OR_SCEN.
   NOTE: The data set WORK.TOURORDER has 8 observations and 5 variables.
   NOTE: PROCEDURE SORT used:

To see the sorted data set, add a PROC PRINT step to the program:

   options pagesize=60 linesize=80 pageno=1 nodate;
   proc sort data=mylib.arch_or_scen out=tourorder;
      by TourType;
   run;

   proc print data=tourorder;
      var TourType Country Nights LandCost Vendor;
      title 'Tours Sorted by Architecture or Scenic Tours';
   run;

The following output displays the results:

Output 11.3  Displaying the Sorted Output

             Tours Sorted by Architecture or Scenic Tours            1

                                                   Land
   Obs    TourType        Country        Nights    Cost    Vendor

    1     architecture    Spain            10       510    World
    2     architecture    Japan             8       720    Express
    3     architecture    France            8       575    World
    4     architecture    Italy             8       468    Express
    5     scenery         Switzerland       9       734    World
    6     scenery         Ireland           7       558    Express
    7     scenery         New Zealand      16      1489    Southsea
    8     scenery         Greece           12       698    Express

By default, SAS arranges groups in ascending order of the BY values, smallest to largest. Sorting a data set does not change the order of the variables within it. However, most examples in this section use a VAR statement in the PRINT procedure to display the BY variable in the first column. (The PRINT procedure and other procedures used in this documentation can also produce a separate report for each BY group.)

Grouping by More Than One Variable

You can group observations by as many variables as you want.
This example groups observations by TourType, Vendor, and LandCost:

   options pagesize=60 linesize=80 pageno=1 nodate;
   proc sort data=mylib.arch_or_scen out=tourorder2;
      by TourType Vendor Landcost;
   run;

   proc print data=tourorder2;
      var TourType Vendor Landcost Country Nights;
      title 'Tours Grouped by Type of Tour, Vendor, and Price';
   run;

The following output displays the results:

Output 11.4  Grouping by Several Variables

           Tours Grouped by Type of Tour, Vendor, and Price          1

                                       Land
   Obs    TourType        Vendor       Cost    Country        Nights

    1     architecture    Express       468    Italy             8
    2     architecture    Express       720    Japan             8
    3     architecture    World         510    Spain            10
    4     architecture    World         575    France            8
    5     scenery         Express       558    Ireland           7
    6     scenery         Express       698    Greece           12
    7     scenery         Southsea     1489    New Zealand      16
    8     scenery         World         734    Switzerland       9

As this example shows, SAS groups the observations by the first variable that is named; within those groups, by the second variable named; and so on. The groups defined by all variables contain only one observation each. In this example, no two observations have the same values for all BY variables. In other words, this example does not have any duplicate entries.

Arranging Groups in Descending Order

In the data sets that are grouped by TourType, the group for architecture comes before the group for scenery because architecture begins with an “a”; “a” is smaller than “s” in computer processing. (The order of characters, known as their collating sequence, is discussed later in this section.) To produce a descending order for a particular variable, place the DESCENDING option before the name of the variable in the BY statement of the SORT procedure.
In the next example, the observations are grouped in descending order by TourType, but in ascending order by Vendor and LandCost:

   options pagesize=60 linesize=80 pageno=1 nodate;
   proc sort data=mylib.arch_or_scen out=tourorder3;
      by descending TourType Vendor LandCost;
   run;

   proc print data=tourorder3;
      var TourType Vendor LandCost Country Nights;
      title 'Descending Order of TourType';
   run;

The following output displays the results:

Output 11.5  Combining Descending and Ascending Sorted Observations

                     Descending Order of TourType                    1

                                       Land
   Obs    TourType        Vendor       Cost    Country        Nights

    1     scenery         Express       558    Ireland           7
    2     scenery         Express       698    Greece           12
    3     scenery         Southsea     1489    New Zealand      16
    4     scenery         World         734    Switzerland       9
    5     architecture    Express       468    Italy             8
    6     architecture    Express       720    Japan             8
    7     architecture    World         510    Spain            10
    8     architecture    World         575    France            8

Finding the First or Last Observation in a Group

If you do not want to display the entire data set, how can you create a data set containing only the least expensive tour that features architecture, and the least expensive tour that features scenery?
First, sort the data set by TourType and LandCost:

   options pagesize=60 linesize=80 pageno=1 nodate;
   proc sort data=mylib.arch_or_scen out=tourorder4;
      by TourType LandCost;
   run;

   proc print data=tourorder4;
      var TourType LandCost Country Nights Vendor;
      title 'Tours Arranged by TourType and LandCost';
   run;

The following output displays the results:

Output 11.6  Sorting to Find the Least Expensive Tours

               Tours Arranged by TourType and LandCost               1

                          Land
   Obs    TourType        Cost    Country        Nights    Vendor

    1     architecture     468    Italy             8      Express
    2     architecture     510    Spain            10      World
    3     architecture     575    France            8      World
    4     architecture     720    Japan             8      Express
    5     scenery          558    Ireland           7      Express
    6     scenery          698    Greece           12      Express
    7     scenery          734    Switzerland       9      World
    8     scenery         1489    New Zealand      16      Southsea

You sorted LandCost in ascending order, so the first observation for each value of TourType has the lowest value of LandCost. If you can locate the first observation in each BY group in a DATA step, you can use a subsetting IF statement to select that observation. But how can you locate the first observation with each value of TourType?

When you use a BY statement in a DATA step, SAS automatically creates two additional variables for each variable in the BY statement. One is named FIRST.variable, where variable is the name of the BY variable, and the other is named LAST.variable. Their values are either 1 or 0. They exist in the program data vector and are available for DATA step programming, but SAS does not add them to the SAS data set being created. For example, the DATA step begins with these statements:

   data lowcost;
      set tourorder4;
      by TourType;
      ...more SAS statements...
   run;

The BY statement causes SAS to create one variable called FIRST.TOURTYPE and another variable called LAST.TOURTYPE.
When SAS processes the first observation with the value architecture, the value of FIRST.TOURTYPE is 1; in other observations with the value architecture, it is 0. Similarly, when SAS processes the last observation with the value architecture, the value of LAST.TOURTYPE is 1; in other architecture observations, it is 0. The same is true of the observations in the scenery group.

SAS does not write FIRST. and LAST. variables to the output data set, so you cannot display their values with the PRINT procedure. Therefore, the simplest method of displaying the values of FIRST. and LAST. variables is to assign their values to other variables. This example assigns the value of FIRST.TOURTYPE to a variable named FirstTour and the value of LAST.TOURTYPE to a variable named LastTour:

   options pagesize=60 linesize=80 pageno=1 nodate;
   data temp;
      set tourorder4;
      by TourType;
      FirstTour = first.TourType;
      LastTour = last.TourType;
   run;

   proc print data=temp;
      var Country Tourtype FirstTour LastTour;
      title 'Specifying FIRST.TOURTYPE and LAST.TOURTYPE';
   run;

The following output displays the results:

Output 11.7  Demonstrating FIRST. and LAST. Values

             Specifying FIRST.TOURTYPE and LAST.TOURTYPE             1

                                          First    Last
   Obs    Country        TourType          Tour    Tour

    1     Italy          architecture        1       0
    2     Spain          architecture        0       0
    3     France         architecture        0       0
    4     Japan          architecture        0       1
    5     Ireland        scenery             1       0
    6     Greece         scenery             0       0
    7     Switzerland    scenery             0       0
    8     New Zealand    scenery             0       1

In this data set, Italy is the first observation with the value architecture; for that observation, the value of FIRST.TOURTYPE is 1. Italy is not the last observation with the value architecture, so its value of LAST.TOURTYPE is 0. The observations for Spain and France are neither the first nor the last with the value architecture; both FIRST.TOURTYPE and LAST.TOURTYPE are 0 for them. Japan is the last observation with the value architecture; its value of LAST.TOURTYPE is 1.
The same rules apply to observations in the scenery group.

Now you're ready to use FIRST.TOURTYPE in a subsetting IF statement. When the data are sorted by TourType and LandCost, selecting the first observation in each type of tour gives you the lowest price of any tour in that category:

   options pagesize=60 linesize=80 pageno=1 nodate;
   proc sort data=mylib.arch_or_scen out=tourorder4;
      by TourType LandCost;
   run;

   data lowcost;
      set tourorder4;
      by TourType;
      if first.TourType;
   run;

   proc print data=lowcost;
      title 'Least Expensive Tour for Each Type of Tour';
   run;

The following output displays the results:

Output 11.8  Selecting One Observation from Each BY Group

              Least Expensive Tour for Each Type of Tour             1

                                               Land
   Obs    Country    TourType        Nights    Cost    Vendor

    1     Italy      architecture       8       468    Express
    2     Ireland    scenery            7       558    Express

Working with Sorted Data

Understanding Sorted Data

By default, groups appear in ascending order of the BY values. In some cases you want to emphasize the order in which the observations are sorted, not the fact that they can be grouped. For example, you may want to alphabetize the tours by country.

To sort your data in a particular order, use the SORT procedure just as you do for grouped data. When the sorted order is more important than the grouping, you usually want only one observation with a given BY value in the resulting data set. Therefore, you may need to remove duplicate observations.

Operating Environment Information: The SORT procedure accesses either a sorting utility that is supplied as part of SAS, or a sorting utility supplied by the host operating system. All examples in this documentation use the SAS sorting utility. Some operating system utilities do not accept particular options, including the NODUPRECS option described later in this section. The default sorting utility is set by your site.
For more information about the utilities available to you, see the documentation for your operating system.

Sorting Data

The following example sorts data set MYLIB.ARCH_OR_SCEN by COUNTRY:

   options pagesize=60 linesize=80 pageno=1 nodate;
   proc sort data=mylib.arch_or_scen out=bycountry;
      by Country;
   run;

   proc print data=bycountry;
      title 'Tours in Alphabetical Order by Country';
   run;

The following output displays the results:

Output 11.9  Sorting Data

                Tours in Alphabetical Order by Country               1

                                                   Land
   Obs    Country        TourType        Nights    Cost    Vendor

    1     France         architecture       8       575    World
    2     Greece         scenery           12       698    Express
    3     Ireland        scenery            7       558    Express
    4     Italy          architecture       8       468    Express
    5     Japan          architecture       8       720    Express
    6     New Zealand    scenery           16      1489    Southsea
    7     Spain          architecture      10       510    World
    8     Switzerland    scenery            9       734    World

Deleting Duplicate Observations

You can eliminate duplicate observations in a SAS data set by using the NODUPRECS option with the SORT procedure. The following programs show you how to create a SAS data set and then remove duplicate observations. The external file shown below contains a duplicate observation for Switzerland:

   Spain       architecture 10  510 World
   Japan       architecture  8  720 Express
   Switzerland scenery       9  734 World
   France      architecture  8  575 World
   Switzerland scenery       9  734 World
   Ireland     scenery       7  558 Express
   New Zealand scenery      16 1489 Southsea
   Italy       architecture  8  468 Express
   Greece      scenery      12  698 Express

The following DATA step creates a permanent SAS data set named MYLIB.ARCH_OR_SCEN2.
   options pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'SAS-data-library';

   data mylib.arch_or_scen2;
      infile 'input-file';
      input Country $ 1-11 TourType $ 13-24 Nights LandCost Vendor $;
   run;

   proc print data=mylib.arch_or_scen2;
      title 'Data Set MYLIB.ARCH_OR_SCEN2';
   run;

The following output shows that this data set contains a duplicate observation for Switzerland:

Output 11.10  Data Set MYLIB.ARCH_OR_SCEN2

                     Data Set MYLIB.ARCH_OR_SCEN2                    1

                                                   Land
   Obs    Country        TourType        Nights    Cost    Vendor

    1     Spain          architecture      10       510    World
    2     Japan          architecture       8       720    Express
    3     Switzerland    scenery            9       734    World
    4     France         architecture       8       575    World
    5     Switzerland    scenery            9       734    World
    6     Ireland        scenery            7       558    Express
    7     New Zealand    scenery           16      1489    Southsea
    8     Italy          architecture       8       468    Express
    9     Greece         scenery           12       698    Express

The following program uses the NODUPRECS option in the SORT procedure to delete duplicate observations from MYLIB.ARCH_OR_SCEN2. The program creates a new data set called FIXED.

   options pagesize=60 linesize=80 pageno=1 nodate;
   proc sort data=mylib.arch_or_scen2 out=fixed noduprecs;
      by Country;
   run;

   proc print data=fixed;
      title 'Data Set FIXED: MYLIB.ARCH_OR_SCEN2 With Duplicates Removed';
   run;

The following output displays messages that appear in the SAS log:

Output 11.11  Partial SAS Log Indicating Duplicate Observations Deleted

   311  options pagesize=60 linesize=80 pageno=1 nodate;
   312  proc sort data=mylib.arch_or_scen2 out=fixed noduprecs;
   313     by Country;
   314  run;

   NOTE: 1 duplicate observations were deleted.
   NOTE: There were 9 observations read from the data set MYLIB.ARCH_OR_SCEN2.
   NOTE: The data set WORK.FIXED has 8 observations and 5 variables.

   315
   316  proc print data=fixed;
   317     title 'Data Set FIXED: MYLIB.ARCH_OR_SCEN2 With Duplicates Removed';
   318  run;

   NOTE: There were 8 observations read from the data set WORK.FIXED.
The following output shows the results of the NODUPRECS option:

Output 11.12  Data Set FIXED with No Duplicate Observations

     Data Set FIXED: MYLIB.ARCH_OR_SCEN2 With Duplicates Removed     1

                                                   Land
   Obs    Country        TourType        Nights    Cost    Vendor

    1     France         architecture       8       575    World
    2     Greece         scenery           12       698    Express
    3     Ireland        scenery            7       558    Express
    4     Italy          architecture       8       468    Express
    5     Japan          architecture       8       720    Express
    6     New Zealand    scenery           16      1489    Southsea
    7     Spain          architecture      10       510    World
    8     Switzerland    scenery            9       734    World

Understanding Collating Sequences

Both numeric and character variables can be sorted into ascending or descending order. For numeric variables, ascending or descending order is easy to understand, but what about the order of characters? Character values include uppercase and lowercase letters, special characters, and the digits 0 through 9 when they are treated as characters rather than as numbers. How does SAS sort these characters?

The order in which characters sort is called a collating sequence. By default, SAS sorts characters in one of two sequences: EBCDIC or ASCII, depending on the operating environment under which SAS is running. For reference, both sequences are displayed here.

As long as you work under a single operating system, you seldom need to think about the details of collating sequences. However, when you transfer files from an operating system using EBCDIC to an operating system using ASCII or vice versa, character values that are sorted on one operating system are not necessarily in the correct order for the other operating system. The simplest solution to the problem is to re-sort character data (not numeric data) on the destination operating system. For detailed information about collating sequences, see the documentation for your operating environment.
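To see the effect of the collating sequence on a small scale, the following sketch sorts the same character values twice, once in each sequence. It is an illustration only: the data set names (CHARS, CHARS_ASCII, CHARS_EBCDIC) and the sample values are hypothetical, and it assumes that the SAS sorting utility is in use so that the SORTSEQ= option of PROC SORT is honored.

```sas
/* Sketch only: data set names and sample values are hypothetical. */
data chars;
   input Value $;
   datalines;
apple
Zebra
42
banana
Apple
;
run;

/* Request the ASCII collating sequence explicitly */
proc sort data=chars out=chars_ascii sortseq=ascii;
   by Value;
run;

/* Request the EBCDIC collating sequence explicitly */
proc sort data=chars out=chars_ebcdic sortseq=ebcdic;
   by Value;
run;

proc print data=chars_ascii;
   title 'Values Sorted in ASCII Sequence';
run;

proc print data=chars_ebcdic;
   title 'Values Sorted in EBCDIC Sequence';
run;
```

In the ASCII sequence, digits sort before uppercase letters and uppercase letters before lowercase letters, so the expected order is 42, Apple, Zebra, apple, banana. In the EBCDIC sequence the order of those classes is reversed, so the expected order is apple, banana, Apple, Zebra, 42.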
ASCII Collating Sequence

The following operating systems use the ASCII collating sequence:

   Macintosh
   MS-DOS
   OpenVMS
   OS/2
   PC DOS
   UNIX and its derivatives
   Windows

From the smallest to the largest displayable character, the English-language ASCII sequence is

   blank!"#$%&'()*+,-./0123456789:;<=>?@
   ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
   abcdefghijklmnopqrstuvwxyz{}~

The main features of the ASCII sequence are that digits are smaller than uppercase letters and uppercase letters are smaller than lowercase ones. The blank is the smallest displayable character, followed by the other types of characters:

   blank < digits < uppercase letters < lowercase letters

EBCDIC Collating Sequence

The following operating systems use the EBCDIC collating sequence:

   CMS
   z/OS

From the smallest to largest displayable character, the English-language EBCDIC sequence is

   blank.<(+|&!$*);-/,%_>?:#@'="
   abcdefghijklmnopqr~stuvwxyz
   {ABCDEFGHI}JKLMNOPQR\STUVWXYZ
   0123456789

The main features of the EBCDIC sequence are that lowercase letters are smaller than uppercase letters and uppercase letters are smaller than digits. The blank is the smallest displayable character, followed by the other types of characters:

   blank < lowercase letters < uppercase letters < digits

Review of SAS Tools

Procedures

PROC SORT <DATA=SAS-data-set> <OUT=SAS-data-set> <NODUPRECS>;
   sorts a SAS data set by the values of variables listed in the BY statement. If you specify the OUT= option, the sorted data are stored in a different SAS data set than the input data. The NODUPRECS option tells PROC SORT to eliminate identical observations.

Statements

BY variable-1 <. . . variable-n>;
   in a DATA step causes SAS to create FIRST. and LAST. variables for each variable named in the statement. The value of FIRST.variable-1 is 1 for the first observation with a given BY value and 0 for other observations. Similarly, the value of LAST.variable-1 is 1 for the last observation for a given BY value and 0 for other observations.
   The BY statement can follow a SET, MERGE, MODIFY, or UPDATE statement in the DATA step; it cannot be used with an INPUT statement. By default, SAS assumes that data being read with a BY statement are in ascending order of the BY values. The DESCENDING option indicates that values of the variable that follows are in the opposite order, that is, largest to smallest.

Learning More

Alternative to sorting observations
   Information about an alternative to sorting observations, creating an index that identifies the observations with particular values of a variable, can be found in the “SAS Data Files” section of SAS Language Reference: Concepts.

BY statement and BY-group processing
   See SAS Language Reference: Dictionary and SAS Language Reference: Concepts.

Interleaving, merging, and updating SAS data sets
   See Chapter 17, “Interleaving SAS Data Sets,” on page 263, Chapter 18, “Merging SAS Data Sets,” on page 269, and Chapter 19, “Updating SAS Data Sets,” on page 293. These operations depend on the BY statement in the DATA step. Interleaving combines data sets in sorted order (Chapter 17, “Interleaving SAS Data Sets,” on page 263); match-merging joins observations identified by the value of a BY variable (Chapter 18, “Merging SAS Data Sets,” on page 269); and updating uses a data set containing transactions to change values in a master file (Chapter 19, “Updating SAS Data Sets,” on page 293).

NOTSORTED option
   The NOTSORTED option can be used in both DATA and PROC steps, except for the SORT procedure. Information about the NOTSORTED option can be found in Chapter 30, “Writing Lines to the SAS Log or to an Output File,” on page 521. The NOTSORTED option is useful when data are grouped according to the values of a variable, but the groups are not in ascending or descending order. Using the NOTSORTED option in the BY statement enables SAS to process them.
SORT procedure
   The SORT procedure and the role of the BY statement in it are documented in Base SAS Procedures Guide. It also describes how to specify different sorting utilities.

   - When you work with large data sets, plan your work so that you sort the data set as few times as possible. For example, if you need to sort a data set by STATE at the beginning of a program and by CITY within STATE later, sort the data set by STATE and CITY at the beginning of the program.
   - To eliminate observations whose BY values duplicate BY values in other observations (but not necessarily values of other variables), use the NODUPKEY option in the SORT procedure.
   - SAS can sort data in sequences other than English-language EBCDIC or ASCII. Examples include the Danish-Norwegian and Finnish/Swedish sequences.

   The SAS documentation for your operating system presents operating system-specific information about the SORT procedure. In general, many points about sorting data depend on the operating system and other local conditions at your site (such as whether various operating system utilities are available).

CHAPTER 12
Using More Than One Observation in a Calculation

   Introduction to Using More Than One Observation in a Calculation
      Purpose
      Prerequisites
   Input File and SAS Data Set for Examples
   Accumulating a Total for an Entire Data Set
      Creating a Running Total
      Printing Only the Total
   Obtaining a Total for Each BY Group
   Writing to Separate Data Sets
      Writing Observations to Separate Data Sets
      Writing Totals to Separate Data Sets
      The Program
   Using a Value in a Later Observation
   Review of SAS Tools
      Statements
   Learning More

Introduction to Using More Than One Observation in a Calculation

Purpose

In this section you will learn about calculations that require more than one observation.
Examples of those calculations include:

   - accumulating a total across a data set or a BY group
   - saving a value from one observation in order to compare it to a value in a later observation

Prerequisites

Before proceeding with this section, you should understand the concepts presented in the following parts:

   - Chapter 6, “Understanding DATA Step Processing,” on page 97
   - Chapter 11, “Working with Grouped or Sorted Observations,” on page 173.

Input File and SAS Data Set for Examples

Tradewinds Travel needs to know how much business the company did with various tour vendors during the peak season. The data that the company wants to look at are the total number of people that are scheduled on tours with various vendors, and the total value of the tours that are scheduled.

The following external file contains data about Tradewinds Travel tours:

   (1)         (2)  (3)      (4)
   France      575  Express  10
   Spain       510  World    12
   Brazil      540  World    6
   India       489  Express  .
   Japan       720  Express  10
   Greece      698  Express  20
   New Zealand 1489 Southsea 6
   Venezuela   425  World    8
   Italy       468  Express  9
   USSR        924  World    6
   Switzerland 734  World    20
   Australia   1079 Southsea 10
   Ireland     558  Express  9

The numbered fields represent

   (1) the destination country for the tour
   (2) the cost of the land package in US dollars
   (3) the name of the vendor
   (4) the number of people that were scheduled on that tour

The first step is to create a permanent SAS data set.
The following program creates the data set MYLIB.TOURREVENUE:

   options pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'permanent-data-library';

   data mylib.tourrevenue;
      infile 'input-file' truncover;
      input Country $ 1-11 LandCost Vendor $ NumberOfBookings;
   run;

   proc print data=mylib.tourrevenue;
      title 'SAS Data Set MYLIB.TOURREVENUE';
   run;

The PROC PRINT step that follows the DATA step produces this display of the MYLIB.TOURREVENUE data set:

Output 12.1  Data Set MYLIB.TOURREVENUE

                     SAS Data Set MYLIB.TOURREVENUE                     1

                          Land                  Number Of
   Obs    Country         Cost    Vendor         Bookings

     1    France           575    Express           10
     2    Spain            510    World             12
     3    Brazil           540    World              6
     4    India            489    Express            .
     5    Japan            720    Express           10
     6    Greece           698    Express           20
     7    New Zealand     1489    Southsea           6
     8    Venezuela        425    World              8
     9    Italy            468    Express            9
    10    USSR             924    World              6
    11    Switzerland      734    World             20
    12    Australia       1079    Southsea          10
    13    Ireland          558    Express            9

Each observation in the data set MYLIB.TOURREVENUE contains the cost of a tour and the number of people who scheduled that tour. The tasks of Tradewinds Travel are as follows:

   - to determine how much money was spent with each vendor and with all vendors together
   - to store the totals in a SAS data set that is separate from the individual vendors' records
   - to find the tour that produced the most revenue, which is determined by the land cost times the number of people who scheduled the tour

Accumulating a Total for an Entire Data Set

Creating a Running Total

The first task in performing calculations on the data set MYLIB.TOURREVENUE is to find out the total number of people who scheduled tours with Tradewinds Travel. Therefore, a variable is needed whose value starts at 0 and increases by the number of schedulings in each observation.
The sum statement gives you that capability:

   variable + expression;

In a sum statement, the value of the variable on the left side of the plus sign is 0 before the statement is processed for the first time. Processing the statement adds the value of the expression on the right side of the plus sign to the initial value; the sum variable retains the new value until the next processing of the statement. The sum statement ignores a missing value for the expression; the previous total remains unchanged.

The following statement creates the total number of schedulings:

   TotalBookings + NumberOfBookings;

The following DATA step includes the sum statement above:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data total;
      set mylib.tourrevenue;
      TotalBookings + NumberOfBookings;
   run;

   proc print data=total;
      var Country NumberOfBookings TotalBookings;
      title 'Total Tours Booked';
   run;

The following output displays the results:

Output 12.2  Accumulating a Total for a Data Set

                           Total Tours Booked                           1

                          Number Of      Total
   Obs    Country          Bookings     Bookings

     1    France              10           10
     2    Spain               12           22
     3    Brazil               6           28
     4    India                .           28
     5    Japan               10           38
     6    Greece              20           58
     7    New Zealand          6           64
     8    Venezuela            8           72
     9    Italy                9           81
    10    USSR                 6           87
    11    Switzerland         20          107
    12    Australia           10          117
    13    Ireland              9          126

The TotalBookings variable in the last observation of the TOTAL data set contains the total number of schedulings for the year.

Printing Only the Total

If the total is the only information that is needed from the data set, a data set that contains only one observation and one variable (the TotalBookings variable) can be created by writing a DATA step that does all of the following:

   - specifies the END= option in the SET statement to determine if the current observation is the last observation
   - uses a subsetting IF to write only the last observation to the SAS data set
   - specifies the KEEP= data set option in the DATA statement to keep only the variable that totals the schedulings.
When the END= option in the SET statement is specified, the variable that is named in the END= option is set to 1 when the DATA step is processing the last observation; the variable that is named in the END= option is set to 0 for other observations:

   SET SAS-data-set <END=variable>;

SAS does not add the END= variable to the data set that is being created. By testing the value of the END= variable, you can determine which observation is the last observation.

The following program selects the last observation with a subsetting IF statement and uses a KEEP= data set option to keep only the variable TotalBookings in the data set:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data total2(keep=TotalBookings);
      set mylib.tourrevenue end=Lastobs;
      TotalBookings + NumberOfBookings;
      if Lastobs;
   run;

   proc print data=total2;
      title 'Total Number of Tours Booked';
   run;

The following output displays the results:

Output 12.3  Selecting the Last Observation in a Data Set

                      Total Number of Tours Booked                      1

                      Total
             Obs     Bookings

               1       126

The condition in the subsetting IF statement is true when Lastobs has a value of 1. When SAS is processing the last observation from MYLIB.TOURREVENUE, it assigns to Lastobs the value 1. Therefore, the subsetting IF statement accepts only the last observation from MYLIB.TOURREVENUE, and SAS writes the last observation to the data set TOTAL2.

Obtaining a Total for Each BY Group

An additional requirement of Tradewinds Travel is to determine the number of tours that are scheduled with each vendor. In order to accomplish this task, a program must group the data by a variable; that is, the program must organize the data set into groups of observations, with one group for each vendor. In this case, the program must group the data by the Vendor variable. Each group is known generically as a BY group; the variable that is used to determine the groupings is called a BY variable.
In order to group the data by the Vendor variable, the program must

   - include a PROC SORT step to group the observations by the Vendor variable
   - use a BY statement in the DATA step
   - use a sum statement to total the schedulings
   - reset the sum variable to 0 at the beginning of each group of observations.

The following program sorts the data set by Vendor and sums the total schedulings for each vendor.

   options pagesize=60 linesize=80 pageno=1 nodate;

   proc sort data=mylib.tourrevenue out=mylib.sorttour;
      by Vendor;
   run;

   data totalby;
      set mylib.sorttour;
      by Vendor;
      if First.Vendor then VendorBookings = 0;
      VendorBookings + NumberOfBookings;
   run;

   proc print data=totalby;
      title 'Summary of Bookings by Vendor';
   run;

In the preceding program, the FIRST.Vendor variable is used in an IF-THEN statement to set the sum variable (VendorBookings) to 0 in the first observation of each BY group. (For more information on the FIRST.variable and LAST.variable temporary variables, see “Finding the First or Last Observation in a Group” on page 178.) The following output displays the results.

Output 12.4  Creating Totals for BY Groups

                     Summary of Bookings by Vendor                      1

                          Land                  Number Of     Vendor
   Obs    Country         Cost    Vendor         Bookings    Bookings

     1    France           575    Express           10          10
     2    India            489    Express            .          10
     3    Japan            720    Express           10          20
     4    Greece           698    Express           20          40
     5    Italy            468    Express            9          49
     6    Ireland          558    Express            9          58
     7    New Zealand     1489    Southsea           6           6
     8    Australia       1079    Southsea          10          16
     9    Spain            510    World             12          12
    10    Brazil           540    World              6          18
    11    Venezuela        425    World              8          26
    12    USSR             924    World              6          32
    13    Switzerland      734    World             20          52

Notice that while this output does in fact include the total number of schedulings for each vendor, it also includes a great deal of extraneous information. Reporting the total schedulings for each vendor requires only the variables Vendor and VendorBookings from the last observation for each vendor.
Therefore, the program can

   - use the DROP= or KEEP= data set options to eliminate the variables Country, LandCost, and NumberOfBookings from the output data set
   - use the LAST.Vendor variable in a subsetting IF statement to write only the last observation in each group to the data set TOTALBY.

The following program creates data set TOTALBY:

   options pagesize=60 linesize=80 pageno=1 nodate;

   proc sort data=mylib.tourrevenue out=mylib.sorttour;
      by Vendor;
   run;

   data totalby(drop=Country LandCost NumberOfBookings);
      set mylib.sorttour;
      by Vendor;
      if First.Vendor then VendorBookings = 0;
      VendorBookings + NumberOfBookings;
      if Last.Vendor;
   run;

   proc print data=totalby;
      title 'Total Bookings by Vendor';
   run;

The following output displays the results:

Output 12.5  Putting Totals for Each BY Group in a New Data Set

                        Total Bookings by Vendor                        1

                                  Vendor
             Obs    Vendor       Bookings

               1    Express         58
               2    Southsea        16
               3    World           52

Writing to Separate Data Sets

Writing Observations to Separate Data Sets

Tradewinds Travel wants overall information about the tours that were conducted this year. One SAS data set is needed to contain detailed information about each tour, including the total money that was spent on that tour. Another SAS data set is needed to contain the total number of schedulings with each vendor and the total money spent with that vendor. Both of these data sets can be created using the techniques that you have learned so far.

Begin the program by creating two SAS data sets from the SAS data set MYLIB.SORTTOUR using the following DATA and SET statements:

   data tourdetails vendordetails;
      set mylib.sorttour;

The data set TOURDETAILS will contain the individual records, and VENDORDETAILS will contain the information about vendors. The observations do not need to be grouped for TOURDETAILS, but they need to be grouped by Vendor for VENDORDETAILS.
If the data are not already grouped by Vendor, first use the SORT procedure. Add a BY statement to the DATA step for use with VENDORDETAILS.

   proc sort data=mylib.tourrevenue out=mylib.sorttour;
      by Vendor;
   run;

   data tourdetails vendordetails;
      set mylib.sorttour;
      by Vendor;
   run;

The only calculation that is needed for the individual tours is the amount of money that was spent on each tour. Therefore, calculate the amount in an assignment statement and write the record to TOURDETAILS.

   Money = LandCost * NumberOfBookings;
   output tourdetails;

The portion of the DATA step that builds TOURDETAILS is now complete.

Writing Totals to Separate Data Sets

Because observations remain in the program data vector after an OUTPUT statement executes, you can continue using them in programming statements. The rest of the DATA step creates information for the VENDORDETAILS data set.

Use the FIRST.Vendor variable to determine when SAS is processing the first observation in each group. Then set the sum variables VendorBookings and VendorMoney to 0 in that observation. VendorBookings totals the schedulings for each vendor, and VendorMoney totals the costs. Add the following statements to the DATA step:

   if First.Vendor then
      do;
         VendorBookings = 0;
         VendorMoney = 0;
      end;
   VendorBookings + NumberOfBookings;
   VendorMoney + Money;

Note: The program uses a DO group. Using DO groups enables the program to evaluate a condition once and take more than one action as a result. For more information on DO groups, see “Performing More Than One Action in an IF-THEN Statement” on page 202.
The last observation in each BY group contains the totals for that vendor; therefore, use the following statement to output the last observation to the data set VENDORDETAILS:

   if Last.Vendor then output vendordetails;

As a final step, use KEEP= and DROP= data set options to remove extraneous variables from the two data sets so that each data set has just the variables that are wanted.

   data tourdetails(drop=VendorBookings VendorMoney)
        vendordetails(keep=Vendor VendorBookings VendorMoney);

The Program

The following is the complete program that creates the VENDORDETAILS and TOURDETAILS data sets:

   options pagesize=60 linesize=80 pageno=1 nodate;

   proc sort data=mylib.tourrevenue out=mylib.sorttour;
      by Vendor;
   run;

   data tourdetails(drop=VendorBookings VendorMoney)
        vendordetails(keep=Vendor VendorBookings VendorMoney);
      set mylib.sorttour;
      by Vendor;
      Money = LandCost * NumberOfBookings;
      output tourdetails;
      if First.Vendor then
         do;
            VendorBookings = 0;
            VendorMoney = 0;
         end;
      VendorBookings + NumberOfBookings;
      VendorMoney + Money;
      if Last.Vendor then output vendordetails;
   run;

   proc print data=tourdetails;
      title 'Detail Records: Dollars Spent on Individual Tours';
   run;

   proc print data=vendordetails;
      title 'Vendor Totals: Dollars Spent and Bookings by Vendor';
   run;

The following output displays the results:

Output 12.6  Detail Tour Records in One SAS Data Set and Vendor Totals in Another

           Detail Records: Dollars Spent on Individual Tours            1

                          Land                  Number Of
   Obs    Country         Cost    Vendor         Bookings     Money

     1    France           575    Express           10         5750
     2    India            489    Express            .            .
     3    Japan            720    Express           10         7200
     4    Greece           698    Express           20        13960
     5    Italy            468    Express            9         4212
     6    Ireland          558    Express            9         5022
     7    New Zealand     1489    Southsea           6         8934
     8    Australia       1079    Southsea          10        10790
     9    Spain            510    World             12         6120
    10    Brazil           540    World              6         3240
    11    Venezuela        425    World              8         3400
    12    USSR             924    World              6         5544
    13    Switzerland      734    World             20        14680

          Vendor Totals: Dollars Spent and Bookings by Vendor           2

                             Vendor      Vendor
        Obs    Vendor       Bookings      Money

          1    Express         58         36144
          2    Southsea        16         19724
          3    World           52         32984

Using a Value in a Later Observation

A further requirement of Tradewinds Travel is a separate SAS data set that contains the tour that generated the most revenue. (The revenue total equals the price of the tour multiplied by the number of schedulings.) One method of creating the new data set might be to follow these three steps:

   1 Calculate the revenue in a DATA step.
   2 Sort the data set in descending order by the revenue.
   3 Use another DATA step with the OBS= data set option to write that observation.

A more efficient method compares the revenue from all observations in a single DATA step. SAS can retain a value from the current observation to use in future observations. When the processing of the DATA step reaches the next observation, the held value represents information from the previous observation.

The RETAIN statement causes a variable that is created in the DATA step to retain its value from the current observation into the next observation rather than being set to missing at the beginning of each iteration of the DATA step. It is a declarative statement, not an executable statement. This statement has the following form:

   RETAIN variable-1 <...variable-n>;

To compare the Revenue value in one observation to the Revenue value in the next observation, create a retained variable named HoldRevenue and assign the value of the current Revenue variable to it. In the next observation, the HoldRevenue variable contains the Revenue value from the previous observation, and its value can be compared to that of Revenue in the current observation.

To see how the RETAIN statement works, look at the next example.
The following DATA step outputs observations to data set TEMP before SAS assigns the current revenue to HoldRevenue:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data temp;
      set mylib.tourrevenue;
      retain HoldRevenue;
      Revenue = LandCost * NumberOfBookings;
      output;
      HoldRevenue = Revenue;
   run;

   proc print data=temp;
      var Country LandCost NumberOfBookings Revenue HoldRevenue;
      title 'Tour Revenue';
   run;

The following output displays the results:

Output 12.7  Retaining a Value by Using the RETAIN Statement

                              Tour Revenue                              1

                          Land    Number Of                 Hold
   Obs    Country         Cost     Bookings    Revenue    Revenue

     1    France           575        10         5750          .
     2    Spain            510        12         6120       5750
     3    Brazil           540         6         3240       6120
     4    India            489         .            .       3240
     5    Japan            720        10         7200          .
     6    Greece           698        20        13960       7200
     7    New Zealand     1489         6         8934      13960
     8    Venezuela        425         8         3400       8934
     9    Italy            468         9         4212       3400
    10    USSR             924         6         5544       4212
    11    Switzerland      734        20        14680       5544
    12    Australia       1079        10        10790      14680
    13    Ireland          558         9         5022      10790

The value of HoldRevenue is missing at the beginning of the first observation; it is still missing when the OUTPUT statement writes the first observation to TEMP. After the OUTPUT statement, an assignment statement assigns the value of Revenue to HoldRevenue. Because HoldRevenue is retained, that value is present at the beginning of the next iteration of the DATA step. When the OUTPUT statement executes again, the value of HoldRevenue still contains that value.
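Once the previous observation's value is available in a retained variable, it can take part in a calculation with the current value. The following is a minimal sketch of that idea, not part of the Tradewinds Travel tasks: the data set name REVENUECHANGE and the variable RevenueChange are hypothetical names introduced only for illustration.

```sas
/* Sketch: use the retained value from the previous observation.   */
/* REVENUECHANGE and RevenueChange are hypothetical names.         */
data revenuechange;
   set mylib.tourrevenue;
   retain HoldRevenue;
   Revenue = LandCost * NumberOfBookings;
   /* HoldRevenue still holds the previous observation's Revenue.  */
   /* The result is missing if either value is missing.            */
   RevenueChange = Revenue - HoldRevenue;
   output;
   /* Save the current Revenue for the next iteration. */
   HoldRevenue = Revenue;
run;
```

As in the TEMP example, the order of the statements matters: the calculation and the OUTPUT statement come before the assignment that replaces the held value.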
To find the largest value of Revenue, assign the value of Revenue to HoldRevenue only when Revenue is larger than HoldRevenue, as shown in the following program:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data mostrevenue;
      set mylib.tourrevenue;
      retain HoldRevenue;
      Revenue = LandCost * NumberOfBookings;
      if Revenue > HoldRevenue then HoldRevenue = Revenue;
   run;

   proc print data=mostrevenue;
      var Country LandCost NumberOfBookings Revenue HoldRevenue;
      title 'Tour Revenue';
   run;

The following output displays the results:

Output 12.8  Holding the Largest Value in a Retained Variable

                              Tour Revenue                              1

                          Land    Number Of                 Hold
   Obs    Country         Cost     Bookings    Revenue    Revenue

     1    France           575        10         5750       5750
     2    Spain            510        12         6120       6120
     3    Brazil           540         6         3240       6120
     4    India            489         .            .       6120
     5    Japan            720        10         7200       7200
     6    Greece           698        20        13960      13960
     7    New Zealand     1489         6         8934      13960
     8    Venezuela        425         8         3400      13960
     9    Italy            468         9         4212      13960
    10    USSR             924         6         5544      13960
    11    Switzerland      734        20        14680      14680
    12    Australia       1079        10        10790      14680
    13    Ireland          558         9         5022      14680

The value of HoldRevenue in the last observation represents the largest revenue that is generated by any tour. To determine which observation the value came from, create a variable named HoldCountry to hold the name of the country from the observations with the largest revenue. Include HoldCountry in the RETAIN statement to retain its value until explicitly changed. Then use the END= option in the SET statement to select the last observation, and use the KEEP= data set option to keep only HoldRevenue and HoldCountry in MOSTREVENUE.
   options pagesize=60 linesize=80 pageno=1 nodate;

   data mostrevenue (keep=HoldCountry HoldRevenue);
      set mylib.tourrevenue end=LastOne;
      retain HoldRevenue HoldCountry;
      Revenue = LandCost * NumberOfBookings;
      if Revenue > HoldRevenue then
         do;
            HoldRevenue = Revenue;
            HoldCountry = Country;
         end;
      if LastOne;
   run;

   proc print data=mostrevenue;
      title 'Country with the Largest Value of Revenue';
   run;

Note: The program uses a DO group. Using DO groups enables the program to evaluate a condition once and take more than one action as a result. For more information on DO groups, see “Performing More Than One Action in an IF-THEN Statement” on page 202.

The following output displays the results:

Output 12.9  Selecting a New Data Set Using RETAIN and Subsetting IF Statements

               Country with the Largest Value of Revenue                1

                      Hold
             Obs    Revenue    HoldCountry

               1     14680     Switzerland

Review of SAS Tools

Statements

RETAIN variable-1 <...variable-n>;
   retains the value of the variable for use in a subsequent observation. The RETAIN statement prevents the value of the variable from being reinitialized to missing when control returns to the top of the DATA step.

   The RETAIN statement affects variables that are created in the current DATA step (for example, variables that are created with an INPUT or assignment statement). Variables that are read with a SET, MERGE, or UPDATE statement are retained automatically; naming them in a RETAIN statement has no effect.

   The RETAIN statement can assign an initial value to a variable. If you need a variable to have the same value in all observations of a DATA step, it is more efficient to put the value in a RETAIN statement rather than in an assignment statement. SAS assigns the value in the RETAIN statement when it is compiling the DATA step, but it carries out the assignment statement during each execution of the DATA step.
SET SAS-data-set <END=variable>;
   reads from the SAS-data-set specified. The variable specified in the END= option has the value 0 until SAS is processing the last observation in the data set. Then the variable has the value 1. SAS does not include the END= variable in the data set that is being created.

variable + expression;
   is called a sum statement; it adds the result of the expression on the right side of the plus sign to the variable on the left side of the plus sign and holds the new value of variable for use in subsequent observations. The expression can be a numeric variable or expression. The value of variable is retained. If the expression is a missing value, the variable maintains its previous value. Before the sum statement is executed for the first time, the default value of the variable is 0.

   The plus sign is required in the sum statement; to subtract successive values from a starting value, add negative values to the sum variable.

Learning More

Automatic variable _N_
   The automatic variable _N_, which provides a way to count the number of times SAS executes a DATA step, is discussed in Chapter 30, “Writing Lines to the SAS Log or to an Output File,” on page 521. Using _N_ is more efficient than using a sum statement. SAS creates _N_ in each DATA step. The first time SAS begins to execute the DATA step, the value of _N_ is 1; the second time, 2; and so on. SAS does not add _N_ to the output data set.

DO groups
   Information about DO groups can be found in Chapter 13, “Finding Shortcuts in Programming,” on page 201.

END= option
   Another example of using the END= option in the SET statement is presented in Chapter 21, “Conditionally Processing Observations from Multiple SAS Data Sets,” on page 323.

KEEP= and DROP= data set options
   See Chapter 5, “Starting with SAS Data Sets,” on page 81.
LAG family of functions
   See SAS Language Reference: Dictionary. LAG functions provide another way to retain a value from one observation for use in a subsequent observation. LAG functions can retain a value for up to 100 observations.

RETAIN, SUM, and SET statements
   See SAS Language Reference: Dictionary.

SUM and SUMBY statements
   The SUM and SUMBY statements in the PRINT procedure are discussed in Chapter 25, “Producing Detail Reports with the PRINT Procedure,” on page 371. The SUM and SUMBY statements can be used in the PRINT procedure if the only purpose in getting a total is to display it in a report.

SUMMARY and MEANS procedures
   The SUMMARY and MEANS procedures, which can also be used to compute totals, are documented in the Base SAS Procedures Guide.

CHAPTER 13
Finding Shortcuts in Programming

   Introduction to Shortcuts
      Purpose
      Prerequisites
   Input File and SAS Data Set
   Performing More Than One Action in an IF-THEN Statement
   Performing the Same Action for a Series of Variables
      Using a Series of IF-THEN Statements
      Grouping Variables into Arrays
      Repeating the Action
      Selecting the Current Variable
   Review of SAS Tools
      Statements
   Learning More

Introduction to Shortcuts

Purpose

In this section you will learn two DATA step programming techniques that make the code easier to write and read. They are the following:

   - using a DO group to perform more than one action after evaluating an IF condition
   - using arrays to perform the same action on more than one variable with a single group of statements

Prerequisites

You should understand the topics presented in Chapter 6, “Understanding DATA Step Processing,” on page 97 and Chapter 9, “Acting on Selected Observations,” on page 139 before proceeding with this section.

Input File and SAS Data Set

In the following example, Tradewinds Travel is making adjustments to their data about tours to art museums and galleries. The data for the tours is as follows:

   (1)      (2)(3)(4)(5)     (6)
   Rome      4 3 . D'Amico  2
   Paris     5 . 1 Lucas    5
   London    3 2 . Wilson   3
   New York  5 1 2 Lucas    5
   Madrid    . . 5 Torres   4
   Amsterdam 3 3 .          .

The numbered fields represent

   (1) the name of the city
   (2) the number of museums to be visited
   (3) the number of art galleries in the tour
   (4) the number of other attractions to be toured
   (5) the last name of the tour guide
   (6) the number of years of experience the guide has

The following program creates the permanent SAS data set MYLIB.ATTRACTIONS:

   options pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'permanent-data-library';

   data mylib.attractions;
      infile 'input-file';
      input City $ 1-9 Museums 11 Galleries 13
            Other 15 TourGuide $ 17-24 YearsExperience 26;
   run;

   proc print data=mylib.attractions;
      title 'Data Set MYLIB.ATTRACTIONS';
   run;

The PROC PRINT step that follows the DATA step produces this report of the MYLIB.ATTRACTIONS data set:

Output 13.1  Data Set MYLIB.ATTRACTIONS

                       Data Set MYLIB.ATTRACTIONS                       1

                                                       Tour        Years
   Obs    City         Museums    Galleries    Other   Guide     Experience

    1     Rome            4           3          .     D'Amico       2
    2     Paris           5           .          1     Lucas         5
    3     London          3           2          .     Wilson        3
    4     New York        5           1          2     Lucas         5
    5     Madrid          .           .          5     Torres        4
    6     Amsterdam       3           3          .                   .

Performing More Than One Action in an IF-THEN Statement

Several changes are needed in the observations for Madrid and Amsterdam. One way to select those observations is to evaluate an IF condition in a series of IF-THEN statements, as follows:

   /* multiple actions based on the same condition */
   data updatedattractions;
      set mylib.attractions;
      if City = 'Madrid' then Museums = 3;
      if City = 'Madrid' then Other = 2;
      if City = 'Amsterdam' then TourGuide = 'Vandever';
      if City = 'Amsterdam' then YearsExperience = 4;
   run;

To avoid writing the IF condition twice for each city, use a DO group in the THEN clause, for example:

   IF condition THEN
      DO;
         ...more SAS statements...
      END;

The DO statement causes all statements following it to be treated as a unit until a matching END statement appears. A group of SAS statements that begin with DO and end with END is called a DO group.

The following DATA step replaces the multiple IF-THEN statements with DO groups:

   options pagesize=60 linesize=80 pageno=1 nodate;

   /* a more efficient method */
   data updatedattractions2;
      set mylib.attractions;
      if City = 'Madrid' then
         do;
            Museums = 3;
            Other = 2;
         end;
      else if City = 'Amsterdam' then
         do;
            TourGuide = 'Vandever';
            YearsExperience = 4;
         end;
   run;

   proc print data=updatedattractions2;
      title 'Data Set MYLIB.UPDATEDATTRACTIONS';
   run;

Output 13.2  Using DO Groups to Produce a Data Set

                   Data Set MYLIB.UPDATEDATTRACTIONS                    1

                                                       Tour        Years
   Obs    City         Museums    Galleries    Other   Guide     Experience

    1     Rome            4           3          .     D'Amico       2
    2     Paris           5           .          1     Lucas         5
    3     London          3           2          .     Wilson        3
    4     New York        5           1          2     Lucas         5
    5     Madrid          3           .          2     Torres        4
    6     Amsterdam       3           3          .     Vandever      4

Using DO groups makes the program faster to write and easier to read. It also makes the program more efficient for SAS in two ways:

   1 The IF condition is evaluated fewer times. (Although there are more statements in this DATA step than in the preceding one, the DO and END statements require very few computer resources.)
   2 The conditions City = 'Madrid' and City = 'Amsterdam' are mutually exclusive, as condensing the multiple IF-THEN statements into two statements reveals. You can make the second IF-THEN statement part of an ELSE statement; therefore, the second IF condition is not evaluated when the first IF condition is true.

Performing the Same Action for a Series of Variables

Using a Series of IF-THEN Statements

In the data set MYLIB.ATTRACTIONS, the variables Museums, Galleries, and Other contain missing values when the tour does not feature that kind of attraction.
To change the missing values to 0, you can write a series of IF-THEN statements with assignment statements, as the following program illustrates:

   /* same action for different variables */
   data changes;
      set mylib.attractions;
      if Museums = . then Museums = 0;
      if Galleries = . then Galleries = 0;
      if Other = . then Other = 0;
   run;

The pattern of action is the same in the three IF-THEN statements; only the variable name is different. To make the program easier to read, you can write SAS statements that perform the same action several times, changing only the variable that is affected. This technique is called array processing, and consists of the following three steps:

   1. grouping variables into arrays
   2. repeating the action
   3. selecting the current variable to be acted upon

Grouping Variables into Arrays

In DATA step programming you can put variables into a temporary group called an array. To define an array, use an ARRAY statement. A simple ARRAY statement has the following form:

   ARRAY array-name{number-of-variables} variable-1 <...variable-n>;

The array-name is a SAS name that you choose to identify the group of variables. The number-of-variables, enclosed in braces, tells SAS how many variables you are grouping, and variable-1 <...variable-n> lists their names.

Note: If you have worked with arrays in other programming languages, note that arrays in SAS are different from those in many other languages. In SAS, an array is simply a convenient way of temporarily identifying a group of variables by assigning an alias to them. It is not a permanent data structure; it exists only for the duration of the DATA step. The array-name identifies the array and distinguishes it from any other arrays in the same DATA step; it is not a variable.
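The point of the note can be checked with a small sketch. It assumes a throwaway data set named WORK.ARRAYDEMO (a hypothetical name, not part of the manual's examples) and three variables created by assignment rather than read from MYLIB.ATTRACTIONS. Because the array is only an alias for the duration of the DATA step, PROC CONTENTS should list the three variables and nothing named CHANGELIST.

```sas
/* Sketch only: WORK.ARRAYDEMO is a hypothetical scratch data set. */
data work.arraydemo;
   Museums = 4;
   Galleries = 3;
   Other = .;
   /* Group the three existing variables under one temporary name. */
   array changelist{3} Museums Galleries Other;
run;

/* The variable list shows Galleries, Museums, and Other only;     */
/* the array name itself is not stored in the data set.            */
proc contents data=work.arraydemo;
run;
```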
The following ARRAY statement lists the three variables Museums, Galleries, and Other:

   array changelist{3} Museums Galleries Other;

This statement tells SAS to do the following:

   - make a group named CHANGELIST for the duration of this DATA step
   - put three variable names in CHANGELIST: Museums, Galleries, and Other

In addition, by listing a variable in an ARRAY statement, you assign the variable an extra name with the form array-name{position}, where position is the position of the variable in the list (1, 2, or 3 in this case). The position can be a number, or the name of a variable whose value is the number. This additional name is called an array reference, and the position is called the subscript. The previous ARRAY statement assigns to Museums the array reference CHANGELIST{1}; Galleries, CHANGELIST{2}; and Other, CHANGELIST{3}. From that point in the DATA step, you can refer to the variable by either its original name or by its array reference. For example, the names Museums and CHANGELIST{1} are equivalent.

Repeating the Action

To tell SAS to perform the same action several times, use an iterative DO loop of the following form:

   DO index-variable=1 TO number-of-variables-in-array;
      ...SAS statements...
   END;

An iterative DO loop begins with an iterative DO statement, contains other SAS statements, and ends with an END statement. The loop is processed repeatedly (iterated) according to the directions in the iterative DO statement. The iterative DO statement contains an index-variable whose name you choose and whose value changes in each iteration of the loop. In array processing, you usually want the loop to execute as many times as there are variables in the array; therefore, you specify that the values of index-variable are 1 TO number-of-variables-in-array. By default, SAS increases the value of index-variable by 1 before each new iteration of the loop. When the value becomes greater than number-of-variables-in-array, SAS stops processing the loop.
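To watch the index variable change, a minimal sketch can write its value to the SAS log on each pass. The index name i is chosen here only for illustration, and DATA _NULL_ is used so that no data set is created.

```sas
/* Sketch only: trace an iterative DO loop in the SAS log. */
data _null_;
   do i = 1 to 3;
      put 'Inside the loop: ' i=;
   end;
   /* The loop stopped because i went past the upper bound. */
   put 'After the loop:  ' i=;
run;
```

The log should show i=1, i=2, and i=3 inside the loop, and i=4 after it, which is the value that caused SAS to stop iterating.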
By default, SAS adds the index variable to the data set that is being created. An iterative DO loop that processes three times and has an index variable named Count looks like this:

   do Count = 1 to 3;
      SAS statements
   end;

The first time the loop is processed, the value of Count is 1; the second time, the value is 2; and the third time, the value is 3. At the beginning of the fourth execution, the value of Count is 4, exceeding the specified range of 1 TO 3. SAS stops processing the loop.

Selecting the Current Variable

Now that you have grouped the variables and you know how many times the loop will be processed, you must tell SAS which variable in the array to use in each iteration of the loop. Recall that variables in an array can be identified by their array references, and that the subscript of the reference can be a variable name as well as a number. Therefore, you can write programming statements in which the index variable of the DO loop is the subscript of the array reference:

   array-name{index-variable}

When the value of the index variable changes, the subscript of the array reference (and, therefore, the variable that is referenced) also changes. The following statement uses the index variable Count as the subscript of array references:

   if changelist{Count} = . then changelist{Count} = 0;

You can place this statement inside an iterative DO loop. When the value of Count is 1, SAS reads the array reference as CHANGELIST{1} and processes the IF-THEN statement on CHANGELIST{1}, that is, Museums. When Count has the value 2 or 3, SAS processes the statement on CHANGELIST{2}, Galleries, or CHANGELIST{3}, Other. The complete iterative DO loop with array references looks like this:

   do Count = 1 to 3;
      if changelist{Count} = .
         then changelist{Count} = 0;
   end;

These statements tell SAS to do the following:

   - perform the actions in the loop three times
   - replace the array subscript Count with the current value of Count for each iteration of the IF-THEN statement
   - locate the variable with that array reference and process the IF-THEN statement on that variable

The following DATA step uses the ARRAY statement and iterative DO loop:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data changes;
      set mylib.attractions;
      array changelist{3} Museums Galleries Other;
      do Count = 1 to 3;
         if changelist{Count} = . then changelist{Count} = 0;
      end;
   run;

   proc print data=changes;
      title 'Tour Attractions';
   run;

The following output displays the results:

Output 13.3   Using an Array and an Iterative DO Loop to Produce a Data Set

                            Tour Attractions                             1

                                                     Tour       Years
Obs    City         Museums    Galleries    Other    Guide      Experience    Count

 1     Rome            4           3          0      D’Amico        2           4
 2     Paris           5           0          1      Lucas          5           4
 3     London          3           2          0      Wilson         3           4
 4     New York        5           1          2      Lucas          5           4
 5     Madrid          0           0          5      Torres         4           4
 6     Amsterdam       3           3          0                     .           4

The data set CHANGES shows that the missing values for the variables Museums, Galleries, and Other are now zero. In addition, the data set contains the variable Count with the value 4 (the value that caused processing of the loop to cease in each observation). To exclude Count from the data set, use a DROP= data set option:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data changes2 (drop=Count);
      set mylib.attractions;
      array changelist{3} Museums Galleries Other;
      do Count = 1 to 3;
         if changelist{Count} = .
            then changelist{Count} = 0;
      end;
   run;

   proc print data=changes2;
      title 'Tour Attractions';
   run;

The following output displays the results:

Output 13.4   Dropping the Index Variable from a Data Set

                            Tour Attractions                             1

                                                     Tour       Years
Obs    City         Museums    Galleries    Other    Guide      Experience

 1     Rome            4           3          0      D’Amico        2
 2     Paris           5           0          1      Lucas          5
 3     London          3           2          0      Wilson         3
 4     New York        5           1          2      Lucas          5
 5     Madrid          0           0          5      Torres         4
 6     Amsterdam       3           3          0                     .

Review of SAS Tools

Statements

ARRAY array-name{number-of-variables} variable-1 <...variable-n>;
   creates a named, ordered list of variables that exists for processing of the current DATA step. The array-name must be a valid SAS name. Each variable is the name of a variable to be included in the array. Number-of-variables is the number of variables listed.

   When you place a variable in an array, the variable can also be accessed by array-name{position}, where position is the position of the variable in the list (from 1 to number-of-variables). This way of accessing the variable is called an array reference, and the position is known as the subscript of the array reference. After you list a variable in an ARRAY statement, programming statements in the same DATA step can use either the original name of the variable or the array reference.

   This documentation uses curly braces around the subscript. Parentheses ( ) are also acceptable, and square brackets [ ] are acceptable on operating environments that support those characters. Refer to the documentation provided by the vendor for your operating environment to determine the supported characters.

DO;
   ...SAS statements...
END;
   treats the enclosed SAS statements as a unit. A group of statements beginning with DO and ending with END is called a DO group. DO groups usually appear in THEN clauses or ELSE statements.

DO index-variable=1 TO number-of-variables-in-array;
   ...SAS statements...
END;
   is known as an iterative DO loop.
   In each execution of the DATA step, an iterative DO loop is processed repeatedly (is iterated) based on the value of index-variable. To create an index variable, simply use a SAS variable name in an iterative DO statement.

   When you use iterative DO loops for array processing, the value of index-variable usually starts at 1 and increases by 1 before each iteration of the loop. When the value becomes greater than the number-of-variables-in-array (usually the number of variables in the array being processed), SAS stops processing the loop and proceeds to the next statement in the DATA step.

   In array processing, the SAS statements in an iterative DO loop usually contain array references whose subscript is the name of the index variable (as in array-name{index-variable}). In each iteration of the loop, SAS replaces the subscript in the reference with the index variable’s current value. Therefore, successive iterations of the loop cause SAS to process the statements on the first variable in the array, then on the second variable, and so on.

Learning More

Arrays
   Detailed information about using arrays can be found in SAS Language Reference: Concepts. Arrays can be single or multidimensional.

DO groups
   Information about DO groups and iterative DO loops can be found in SAS Language Reference: Dictionary.

   Iterative DO statements are flexible and powerful; they are useful in many situations other than array processing. The range of the index variable can start and stop with any number, and the increment can be any positive or negative number. The range of the index variable can be given as starting and stopping values; the values of the DIM, LBOUND, and HBOUND functions; a list of values separated by commas; or a combination of these. A range can also contain a WHILE or UNTIL clause. The index variable can also be a character variable (in that case, the range must be given as a list of character values).
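   A few of these range forms can be sketched in one DATA step. The sketch below is illustrative only: it creates its own Museums, Galleries, and Other variables (all missing) instead of reading MYLIB.ATTRACTIONS, and the LENGTH statement is an added precaution so that the character index variable is long enough for every value in the list.

```sas
/* Sketch only: several ways to specify the range of an index variable. */
data _null_;
   array changelist{3} Museums Galleries Other;
   length City $ 9;

   /* DIM returns the number of elements, so the bound need not be hard-coded. */
   do Count = 1 to dim(changelist);
      if changelist{Count} = . then changelist{Count} = 0;
   end;
   put Museums= Galleries= Other=;

   /* A negative increment: Count takes the values 10, 6, and 2. */
   do Count = 10 to 2 by -4;
      put Count=;
   end;

   /* A list of values separated by commas. */
   do Count = 1, 5, 9;
      put Count=;
   end;

   /* A character index variable with a list of character values. */
   do City = 'Rome', 'Paris', 'Amsterdam';
      put City=;
   end;
run;
```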
   The DIM, LBOUND, and HBOUND functions are documented in SAS Language Reference: Dictionary.

DO WHILE and DO UNTIL statements
   A DO WHILE statement processes a loop as long as a condition is true; a DO UNTIL statement processes a loop until a condition is true. (A DO UNTIL loop always processes at least once; a DO WHILE loop is not processed at all if the condition is initially false.) For more information, see SAS Language Reference: Dictionary.

CHAPTER 14
Working with Dates in the SAS System

   Introduction to Working with Dates
      Purpose
      Prerequisites
   Understanding How SAS Handles Dates
      How SAS Stores Date Values
      Determining the Century for Dates with Two-Digit Years
   Input File and SAS Data Set for Examples
   Entering Dates
      Understanding Informats for Date Values
      Reading a Date Value
      Using Good Programming Practices to Read Dates
      Using Dates as Constants
   Displaying Dates
      Understanding How SAS Displays Values
      Formatting a Date Value
      Assigning Permanent Date Formats to Variables
      Changing Formats Temporarily
   Using Dates in Calculations
      Sorting Dates
      Creating New Date Variables
   Using SAS Date Functions
      Finding the Day of the Week
      Calculating a Date from Today
   Comparing Durations and SAS Date Values
   Review of SAS Tools
      Statements
      Formats and Informats for Dates
      Functions
      System Options
   Learning More

Introduction to Working with Dates

Purpose

SAS stores dates as single, unique numbers so that they can be used in programs like any other numeric variable.
In this section you will learn how to do the following:

   - make SAS read dates in raw data files and store them as SAS date values
   - indicate which calendar form SAS should use to display SAS date values
   - calculate with dates, that is, determine the number of days between dates, find the day of the week on which a date falls, and use today’s date in calculations

Prerequisites

You should understand the following topics before proceeding with this section:

   - Chapter 6, “Understanding DATA Step Processing,” on page 97
   - Chapter 10, “Creating Subsets of Observations,” on page 159
   - Chapter 11, “Working with Grouped or Sorted Observations,” on page 173

Understanding How SAS Handles Dates

How SAS Stores Date Values

Dates are written in many different ways. Some dates contain only numbers, while others contain various combinations of numbers, letters, and characters. For example, all the following forms represent the date July 26, 2000:

   072600
   26JUL00
   002607
   7/26/00
   26JUL2000
   July 26, 2000

With so many different forms of dates, there must be some common ground, a way to store dates and use them in calculations, regardless of how dates are entered or displayed.

The common ground that SAS uses to represent dates is called a SAS date value. No matter which form you use to write a date, SAS can convert and store that date as the number of days between January 1, 1960, and the date that you enter. The following figure shows some dates written in calendar form and as SAS date values:

Figure 14.1   Comparing Calendar Dates to SAS Date Values

In SAS, every date is a unique number on a number line. Dates before January 1, 1960, are negative numbers; those after January 1, 1960, are positive. Because SAS date values are numeric variables, you can sort them easily, determine time intervals, and use dates as constants, as arguments in SAS functions, or in calculations.
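The number line can be made concrete with a short sketch that writes a few SAS date values to the log. The variable names Base, Before, and Tour are chosen here for illustration only.

```sas
/* Sketch only: SAS date values are day counts relative to January 1, 1960. */
data _null_;
   Base   = '01jan1960'd;   /* day 0 on the number line      */
   Before = '31dec1959'd;   /* one day earlier, so negative  */
   Tour   = '26jul2000'd;   /* a date after 1960             */
   put Base= Before= Tour=;
   /* The same value displayed as a calendar date. */
   put Tour= date9.;
run;
```

The first PUT statement should write Base=0 Before=-1 Tour=14817, and the second should write Tour=26JUL2000.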
Note: SAS date values are valid for dates based on the Gregorian calendar from A.D. 1582 through A.D. 19,900. Use caution when working with historical dates. Although the Gregorian calendar was used throughout most of Europe from 1582, Great Britain and the American colonies did not adopt the calendar until 1752.

Determining the Century for Dates with Two-Digit Years

If dates in your external data sources or SAS program statements contain two-digit years, then you can determine which century prefix should be assigned to them by using the YEARCUTOFF= system option. The YEARCUTOFF= system option specifies the first year of the 100-year span that is used to determine the century of a two-digit year.

Before you use the YEARCUTOFF= system option, examine the dates in your data:

   - If the dates in your data fall within a 100-year span, then you can use the YEARCUTOFF= system option.
   - If the dates in your data do not fall within a 100-year span, then you must either convert the two-digit years to four-digit years or use a DATA step with conditional logic to assign the proper century prefix.

After you have determined that the YEARCUTOFF= system option is appropriate for your range of data, you can determine the setting to use. The best setting for YEARCUTOFF= is the year before the lowest year in your data. For example, if you have data in a range from 1921 to 2001, then set YEARCUTOFF= to 1920, if that is not already your system default. The result of setting YEARCUTOFF= to 1920 is that

   - SAS interprets all two-digit years in the range of 20 through 99 as 1920 through 1999.
   - SAS interprets all two-digit years in the range of 00 through 19 as 2000 through 2019.

With YEARCUTOFF= set to 1920, a two-digit year of 10 would be interpreted as 2010 and a two-digit year of 22 would be interpreted as 1922.
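The two cases in the preceding sentence can be tried directly. The sketch below uses the INPUT function with the DATE7. informat to convert two text strings that contain two-digit years; the variable names Early and Late are illustrative only.

```sas
/* Sketch only: how YEARCUTOFF=1920 assigns a century to two-digit years. */
options yearcutoff=1920;

data _null_;
   Early = input('01jan10', date7.);   /* year 10 falls in 2000 through 2019 */
   Late  = input('01jan22', date7.);   /* year 22 falls in 1920 through 1999 */
   put Early= date9. Late= date9.;
run;
```

The log should show Early=01JAN2010 and Late=01JAN1922.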
Input File and SAS Data Set for Examples

In the travel industry, some of the most important data about a tour includes dates: when the tour leaves and returns, when payments are due, when refunds are allowed, and so on. Tradewinds Travel has data that contains dates of past and upcoming popular tours as well as the number of nights spent on the tour. The raw data is stored in an external file that looks like this:

   (1)         (2)        (3)
   Japan       13may2000   8
   Greece      17oct99    12
   New Zealand 03feb2001  16
   Brazil      28feb2001   8
   Venezuela   10nov00     9
   Italy       25apr2001   8
   USSR        03jun1997  14
   Switzerland 14jan2001   9
   Australia   24oct98    12
   Ireland     27aug2000   7

The numbered fields represent

   (1) the name of the country toured
   (2) the departure date
   (3) the number of nights on the tour

Entering Dates

Understanding Informats for Date Values

In order for SAS to read a value as a SAS date value, you must give it a set of directions called an informat. By default, SAS reads numeric variables with a standard numeric informat that does not include letters or special characters. When a field that contains data does not match the standard patterns, you specify the appropriate informat in the INPUT statement. SAS provides many informats. Four informats that are commonly used to read date values are:

   MMDDYY8.     reads dates written as mm/dd/yy.
   MMDDYY10.    reads dates written as mm/dd/yyyy.
   DATE7.       reads dates in the form ddMMMyy.
   DATE9.       reads dates in the form ddMMMyyyy.

Note that each informat name ends with a period and contains a width specification that tells SAS how many columns to read.

Reading a Date Value

To create a SAS data set for the Tradewinds Travel data, the DATE9. informat is used in the INPUT statement to read the variable DepartureDate.

   input Country $ 1-11 @13 DepartureDate date9. Nights;

Using an informat in the INPUT statement is called formatted input.
The formatted input in this example contains the following items:

   - a pointer to indicate the column in which the value begins (@13)
   - the name of the variable to be read (DepartureDate)
   - the name of the informat to use (DATE9.)

The following DATA step creates MYLIB.TOURDATES using the DATE9. informat to create SAS date values:

   options yearcutoff=1920 pagesize=60 linesize=80 pageno=1 nodate;
   libname mylib 'permanent-data-library';

   data mylib.tourdates;
      infile 'input-file';
      input Country $ 1-11 @13 DepartureDate date9. Nights;
   run;

   proc print data=mylib.tourdates;
      title 'Tour Departure Dates as SAS Date Values';
   run;

The following output displays the results:

Output 14.1   Creating SAS Date Values from Calendar Dates

               Tour Departure Dates as SAS Date Values                1

                          Departure
   Obs    Country              Date    Nights

     1    Japan               14743       8
     2    Greece              14534      12
     3    New Zealand         15009      16
     4    Brazil              15034       8
     5    Venezuela           14924       9
     6    Italy               15090       8
     7    Russia              13668      14
     8    Switzerland         14989       9
     9    Australia           14176      12
    10    Ireland             14849       7

Compare the SAS values of the variable DepartureDate with the values of the raw data shown in the previous section. The data set MYLIB.TOURDATES shows that SAS read the departure dates and created SAS date values. Now you need a way to display the dates in a recognizable form.

Using Good Programming Practices to Read Dates

When reading dates, it is good programming practice to always use the DATE9. or MMDDYY10. informats to be sure that the data is read correctly. If you use the DATE7. or MMDDYY8. informat, then SAS reads only the first two digits of the year. If the data contains four-digit years, then SAS reads the century and not the year.
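The effect of a too-narrow informat can be previewed on a single value. This sketch assumes YEARCUTOFF=1920 and applies both informats to the same text string with the INPUT function; the variable names Text, Short, and Full are illustrative only.

```sas
/* Sketch only: the same text read with DATE7. and with DATE9. */
options yearcutoff=1920;

data _null_;
   Text  = '13may2000';
   Short = input(Text, date7.);   /* reads only 13may20, then applies YEARCUTOFF= */
   Full  = input(Text, date9.);   /* reads the complete four-digit year           */
   put Short= date9. Full= date9.;
run;
```

The log should show Short=13MAY1920 and Full=13MAY2000: the DATE7. informat stops after the century digits, so the year 2000 is lost.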
Consider the Tradewinds Travel external file with both two-digit years and four-digit years:

   Japan       13may2000   8
   Greece      17oct99    12
   New Zealand 03feb2001  16
   Brazil      28feb2001   8
   Venezuela   10nov00     9
   Italy       25apr2001   8
   USSR        03jun1997  14
   Switzerland 14jan2001   9
   Australia   24oct98    12
   Ireland     27aug2000   7

The following DATA step creates a SAS data set MYLIB.TOURDATES7 by using the DATE7. informat:

   options yearcutoff=1920 pagesize=60 linesize=80 pageno=1 nodate;

   data mylib.tourdates7;
      infile 'input-file';
      input Country $ 1-11 @13 DepartureDate date7. Nights;
   run;

   proc print data=mylib.tourdates7;
      title 'Tour Departure Dates Using the DATE7. Informat';
      title2 'Displayed as Two-Digit Calendar Dates';
      format DepartureDate date7.;
   run;

   proc print data=mylib.tourdates7;
      title 'Tour Departure Dates Using the DATE7. Informat';
      title2 'Displayed as Four-Digit Calendar Dates';
      format DepartureDate date9.;
   run;

The PRINT procedures format DepartureDate using two-digit year (DATE7.) and four-digit year (DATE9.) calendar dates. The following output displays the results:

Output 14.2   Using the Wrong Informat Can Produce Invalid SAS Data Sets

            Tour Departure Dates Using the DATE7. Informat
                Displayed as Two-Digit Calendar Dates

                          Departure
   Obs    Country          Date (1)    Nights (2)

     1    Japan             13MAY20         0
     2    Greece            17OCT99        12
     3    New Zealand       03FEB20         1
     4    Brazil            28FEB20         1
     5    Venezuela         10NOV00         9
     6    Italy             25APR20         1
     7    Russia            03JUN19        97
     8    Switzerland       14JAN20         1
     9    Australia         24OCT98        12
    10    Ireland           27AUG20         0

            Tour Departure Dates Using the DATE7.
            Informat
                Displayed as Four-Digit Calendar Dates

                          Departure
   Obs    Country          Date (3)    Nights

     1    Japan           13MAY1920         0
     2    Greece          17OCT1999        12
     3    New Zealand     03FEB1920         1
     4    Brazil          28FEB1920         1
     5    Venezuela       10NOV2000         9
     6    Italy           25APR1920         1
     7    Russia          03JUN2019        97
     8    Switzerland     14JAN1920         1
     9    Australia       24OCT1998        12
    10    Ireland         27AUG1920         0

Notice that the four-digit years in the input file do not match the years in MYLIB.TOURDATES7 for observations 1, 3, 4, 6, 7, 8, and 10:

   (1) SAS stopped reading the date after seven characters; it read the first two digits, the century, and not the complete four-digit year.

   (2) To read the data for the next variable, SAS moved the pointer one column and read the next two numeric characters (the years 00, 01, and 97) as the value for the variable Nights. The data for Nights in the input file was ignored.

   (3) When the dates were formatted for four-digit calendar dates, SAS used the YEARCUTOFF= 1920 system option to determine the century for the two-digit year. What was originally 1997 in observation 7 became 2019, and what was originally 2000 and 2001 in observations 1, 3, 4, 6, 8, and 10 became 1920.

Using Dates as Constants

If the tour of Switzerland leaves on January 21, 2001 instead of January 14, then you can use the following assignment statement to make the update:

   if Country = 'Switzerland' then DepartureDate = '21jan2001'd;

The value '21jan2001'D is a SAS date constant. To write a SAS date constant, enclose a date in quotation marks in the standard SAS form ddMMMyyyy and immediately follow the final quotation mark with the letter D. The D suffix tells SAS to convert the calendar date to a SAS date value.
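Because a date constant is converted to an ordinary number, it can be used in arithmetic like any other numeric value. The following sketch, with the illustrative variable names NewDate, OldDate, and DaysLater, compares the old and new Switzerland departure dates.

```sas
/* Sketch only: date constants are numeric SAS date values. */
data _null_;
   NewDate   = '21jan2001'd;
   OldDate   = '14jan2001'd;
   DaysLater = NewDate - OldDate;   /* difference in days */
   put NewDate= OldDate= DaysLater=;
run;
```

The log should show NewDate=14996 OldDate=14989 DaysLater=7. The value 14989 is the same number that Output 14.1 shows for Switzerland.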
The following DATA step includes the use of the SAS date constant:

   options pagesize=60 linesize=80 pageno=1 nodate;

   data correctdates;
      set mylib.tourdates;
      if Country = 'Switzerland' then DepartureDate = '21jan2001'd;
   run;

   proc print data=correctdates;
      title 'Corrected Departure Date for Switzerland';
      format DepartureDate date9.;
   run;

The following output displays the results:

Output 14.3   Changing a Date by Using a SAS Date Constant

               Corrected Departure Date for Switzerland               1

                          Departure
   Obs    Country              Date    Nights

     1    Japan           13MAY2000       8
     2    Greece          17OCT1999      12
     3    New Zealand     03FEB2001      16
     4    Brazil          28FEB2001       8
     5    Venezuela       10NOV2000       9
     6    Italy           25APR2001       8
     7    Russia          03JUN1997      14
     8    Switzerland     21JAN2001       9
     9    Australia       24OCT1998      12
    10    Ireland         27AUG2000       7

Displaying Dates

Understanding How SAS Displays Values

To understand how to display the departure dates, you need to understand how SAS displays values in general. SAS displays all data values with a set of directions called a format. By default, SAS uses a standard numeric format with no commas, letters, or other special notation to display the values of numeric variables. Output 14.1 shows that printing SAS date values with the standard numeric format produces numbers that are difficult to recognize. To display these numbers as calendar dates, you need to specify a SAS date format for the variable.

SAS date formats are available for the most common ways of writing calendar dates. The DATE9. format represents dates in the form ddMMMyyyy. If you want the month, day, and year to be spelled out, then use the WORDDATE18. format. The WEEKDATE29. format includes the day of the week. There are also formats available for number representations such as the format MMDDYY8., which displays the calendar date in the form mm/dd/yy, or the format MMDDYY10., which displays the calendar date in the form mm/dd/yyyy.
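The formats named above can be compared side by side by writing one date value to the log several times. This is a sketch only; it uses a single value created from a date constant rather than the MYLIB.TOURDATES data set.

```sas
/* Sketch only: one SAS date value displayed with several formats. */
data _null_;
   DepartureDate = '13may2000'd;
   put DepartureDate;                /* no format: 14743               */
   put DepartureDate date9.;         /* 13MAY2000                      */
   put DepartureDate mmddyy8.;       /* 05/13/00                       */
   put DepartureDate mmddyy10.;      /* 05/13/2000                     */
   put DepartureDate worddate18.;    /* May 13, 2000                   */
   put DepartureDate weekdate29.;    /* Saturday, May 13, 2000         */
run;
```

The WORDDATE18. and WEEKDATE29. values are right-aligned within their widths, so they appear in the log with leading blanks.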
Like informat names, each format name ends with a period and contains a width specification that tells SAS how many columns to use when displaying the date value.

Formatting a Date Value

You tell SAS which format to use by specifying the variable and the format name in a FORMAT statement. The following FORMAT statement assigns the MMDDYY10. format to the variable DepartureDate:

   format DepartureDate mmddyy10.;

In this example, the FORMAT statement contains the following items:

   - the name of the variable (DepartureDate)
   - the name of the format to be used (MMDDYY10.)

The following PRINT procedures format the variable DepartureDate in both the two-digit year calendar format and the four-digit year calendar format:

   options pagesize=60 linesize=80 pageno=1 nodate;

   proc print data=mylib.tourdates;
      title 'Departure Dates in Two-Digit Calendar Format';
      format DepartureDate mmddyy8.;
   run;

   proc print data=mylib.tourdates;
      title 'Departure Dates in Four-Digit Calendar Format';
      format DepartureDate mmddyy10.;
   run;

The following output displays the results:

Output 14.4   Displaying a Formatted Date Value

             Departure Dates in Two-Digit Calendar Format             1

                          Departure
   Obs    Country              Date    Nights

     1    Japan            05/13/00       8
     2    Greece           10/17/99      12
     3    New Zealand      02/03/01      16
     4    Brazil           02/28/01       8
     5    Venezuela        11/10/00       9
     6    Italy            04/25/01       8
     7    Russia           06/03/97      14
     8    Switzerland      01/14/01       9
     9    Australia        10/24/98      12
    10    Ireland          08/27/00       7

             Departure Dates in Four-Digit Calendar Format            2

                           Departure
   Obs    Country               Date    Nights

     1    Japan           05/13/2000       8
     2    Greece          10/17/1999      12
     3    New Zealand     02/03/2001      16
     4    Brazil          02/28/2001       8
     5    Venezuela       11/10/2000       9
     6    Italy           04/25/2001       8
     7    Russia          06/03/1997      14
     8    Switzerland     01/14/2001       9
     9    Australia       10/24/1998      12
    10    Ireland         08/27/2000       7

Placing a FORMAT statement in a PROC step associates the format with the variable only for that step.
To associate a format with a variable permanently, use the FORMAT statement in a DATA step.

Assigning Permanent Date Formats to Variables

The next example creates a new permanent SAS data set and assigns the DATE9. format in the DATA step. Now all subsequent procedures and DATA steps that use the variable DepartureDate will use the DATE9. format by default. The PROC CONTENTS step displays the characteristics of the data set MYLIB.FMTTOURDATE.

   options yearcutoff=1920 pagesize=60 linesize=80 pageno=1 nodate;

   data mylib.fmttourdate;
      set mylib.tourdates;
      format DepartureDate date9.;
   run;

   proc contents data=mylib.fmttourdate nodetails;
   run;

The following output shows that the DATE9. format is permanently associated with DepartureDate:

Output 14.5   Assigning a Format in a DATA Step

                            The SAS System                            1

                        The CONTENTS Procedure

   Data Set Name: MYLIB.FMTTOURDATE                 Observations:         10
   Member Type:   DATA                              Variables:            3
   Engine:        V8                                Indexes:              0
   Created:       14:15 Friday, November 19, 1999   Observation Length:   32
   Last Modified: 14:15 Friday, November 19, 1999   Deleted Observations: 0
   Protection:                                      Compressed:           NO
   Data Set Type:                                   Sorted:               NO
   Label:

              -----Engine/Host Dependent Information-----

   Data Set Page Size:         8192
   Number of Data Set Pages:   1
   First Data Page:            1
   Max Obs per Page:           254
   Obs in First Data Page:     10
   Number of Data Set Repairs: 0
   filename:                   /SAS_DATA_LIBRARY/fmttourdate.sas7bdat
   Release Created:            8.0001M0
   Host Created:               HP-UX
   Inode Number:               1498874206
   Access Permission:          rw-r--r--
   Owner Name:                 user01
   File Size (bytes):          16384

          -----Alphabetic List of Variables and Attributes-----

          #    Variable         Type    Len    Pos    Format
          --------------------------------------------------
          1    Country          Char     11     16
          2    DepartureDate    Num       8      0    DATE9.
          3    Nights           Num       8      8

Changing Formats Temporarily

If you are preparing a report that requires the date in a different format, then you can override the permanent format by using a FORMAT statement in a PROC step.
For example, to display the value for DepartureDate in the data set MYLIB.TOURDATES in the form of month-name dd, yyyy, you can issue a FORMAT statement in a PROC PRINT step. The following program specifies the WORDDATE18. format for the variable DepartureDate:

   options pagesize=60 linesize=80 pageno=1 nodate;

   proc print data=mylib.tourdates;
      title 'Tour Departure Dates';
      format DepartureDate worddate18.;
   run;

The following output displays the results:

Output 14.6   Overriding a Previously Specified Format

                         Tour Departure Dates                         1

   Obs    Country             DepartureDate    Nights

     1    Japan                May 13, 2000       8
     2    Greece           October 17, 1999      12
     3    New Zealand      February 3, 2001      16
     4    Brazil          February 28, 2001       8
     5    Venezuela       November 10, 2000       9
     6    Italy              April 25, 2001       8
     7    Russia               June 3, 1997      14
     8    Switzerland      January 14, 2001       9
     9    Australia        October 24, 1998      12
    10    Ireland           August 27, 2000       7

The format DATE9. is still permanently assigned to DepartureDate. Calendar dates in the remaining examples are in the form ddMMMyyyy unless a FORMAT statement is included in the PROC PRINT step.

Using Dates in Calculations

Sorting Dates

Because SAS date values are numeric variables, you can sort them and use them in calculations. The following example uses the data set MYLIB.TOURDATES to extract other information about the Tradewinds Travel data.

To help determine how frequently tours are scheduled, you can print a report with the tours listed in chronological order.
The first step is to specify the following BY statement in a PROC SORT step to tell SAS to arrange the observations in ascending order of the date variable DepartureDate:

   by DepartureDate;

By using a VAR statement in the following PROC PRINT step, you can list the departure date as the first column in the report:

   options pagesize=60 linesize=80 pageno=1 nodate;

   proc sort data=mylib.fmttourdate out=sortdate;
      by DepartureDate;
   run;

   proc print data=sortdate;
      var DepartureDate Country Nights;
      title 'Departure Dates Listed in Chronological Order';
   run;

The following output displays the results:

Output 14.7   Sorting by SAS Date Values

            Departure Dates Listed in Chronological Order             1

          Departure
   Obs         Date    Country        Nights

     1    03JUN1997    Russia           14
     2    24OCT1998    Australia        12
     3    17OCT1999    Greece           12
     4    13MAY2000    Japan             8
     5    27AUG2000    Ireland           7
     6    10NOV2000    Venezuela         9
     7    14JAN2001    Switzerland       9
     8    03FEB2001    New Zealand      16
     9    28FEB2001    Brazil            8
    10    25APR2001    Italy             8

The observations in the data set SORTDATE are now arranged in chronological order. Note that there are no FORMAT statements in this example, so the dates are displayed in the DATE9. format you assigned to DepartureDate when you created the data set MYLIB.FMTTOURDATE.

Creating New Date Variables

Because you know the departure date and the number of nights spent on each tour, you can calculate the return date for each tour. To start, create a new variable by adding the number of nights to the departure date, as follows:

   Return = DepartureDate + Nights;

The result is a SAS date value for the return date that you can display by assigning it the DATE9.
format, as follows:

options yearcutoff=1920 pagesize=60 linesize=80 pageno=1 nodate;

data home;
   set mylib.tourdates;
   Return = DepartureDate + Nights;
   format Return date9.;
run;

proc print data=home;
   title 'Dates of Departure and Return';
run;

Output 14.8 Adding Days to a Date Value

                Dates of Departure and Return                  1

Obs    Country        DepartureDate    Nights       Return
  1    Japan                  14743         8    21MAY2000
  2    Greece                 14534        12    29OCT1999
  3    New Zealand            15009        16    19FEB2001
  4    Brazil                 15034         8    08MAR2001
  5    Venezuela              14924         9    19NOV2000
  6    Italy                  15090         8    03MAY2001
  7    Russia                 13668        14    17JUN1997
  8    Switzerland            14989         9    23JAN2001
  9    Australia              14176        12    05NOV1998
 10    Ireland                14849         7    03SEP2000

Note that because the variable DepartureDate in the data set MYLIB.TOURDATES has no permanent format, you see a numeric value instead of a readable calendar date for that variable.

Using SAS Date Functions

Finding the Day of the Week

SAS has various functions that produce calendar dates from SAS date values. SAS date functions enable you to do such things as derive partial date information or use the current date in calculations.

If the final payment for a tour is due 30 days before the tour leaves, then the final payment date can be calculated using subtraction; however, Tradewinds Travel is closed on Sundays. If the payment is due on a Sunday, then an additional day must be subtracted to make the payment due on Saturday. The WEEKDAY function, which returns the day of the week as a number from 1 through 7 (Sunday through Saturday), can be used to determine whether the due date falls on a Sunday. The following statements determine the final payment date by

• subtracting 30 from the departure date
• checking the value returned by the WEEKDAY function
• subtracting an additional day if necessary

DueDate = DepartureDate - 30;
if Weekday(DueDate) = 1 then DueDate = DueDate - 1;

Constructing a data set with these statements produces a list of payment due dates.
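As a quick check of the rule described above, the following sketch applies the same two statements to a single date constant and writes the result to the SAS log. It is only an illustration: the DATA _NULL_ step and the choice of the departure date 03JUN1997 (the Russia tour) are assumptions made for this sketch, not part of the Tradewinds Travel programs.

```sas
/* Sketch only: apply the due-date rule to one departure date. */
data _null_;
   DepartureDate = '03jun1997'd;
   DueDate = DepartureDate - 30;      /* 30 days earlier is 04MAY1997 */
   /* WEEKDAY returns 1 for Sunday, so move the due date back one day */
   if weekday(DueDate) = 1 then DueDate = DueDate - 1;
   put DueDate= weekdate29.;          /* write the formatted value to the log */
run;
```

Because May 4, 1997, is a Sunday, the due date should move back to Saturday, May 3, 1997, which agrees with the value shown for Russia in Output 14.9.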
The following program includes these statements and assigns the format WEEKDATE29. to the new variable DueDate:

options yearcutoff=1920 pagesize=60 linesize=80 pageno=1 nodate;

data pay;
   set mylib.tourdates;
   DueDate = DepartureDate - 30;
   if Weekday(DueDate) = 1 then DueDate = DueDate - 1;
   format DueDate weekdate29.;
run;

proc print data=pay;
   var Country DueDate;
   title 'Date and Day of Week Payment Is Due';
run;

Output 14.9 Using the WEEKDAY Function

             Date and Day of Week Payment Is Due               1

Obs    Country                              DueDate
  1    Japan               Thursday, April 13, 2000
  2    Greece            Friday, September 17, 1999
  3    New Zealand        Thursday, January 4, 2001
  4    Brazil              Monday, January 29, 2001
  5    Venezuela        Wednesday, October 11, 2000
  6    Italy                 Monday, March 26, 2001
  7    Russia                 Saturday, May 3, 1997
  8    Switzerland        Friday, December 15, 2000
  9    Australia       Thursday, September 24, 1998
 10    Ireland                Friday, July 28, 2000

Calculating a Date from Today

Tradewinds Travel occasionally gets the opportunity to do special advertising promotions. In general, tours that depart more than 90 days from today’s date, but less than 180 days from today’s date, are advertised. The following figure illustrates the time frame for advertising:

Figure 14.2 Optimum Interval for Advertising Tours Based on Today’s Date

A program is needed that determines which tours leave between 90 and 180 days from the date the program is run, regardless of when you run the program. The TODAY function produces a SAS date value that corresponds to the date when the program is run. The following statements determine which tours depart at least 90 days from today’s date but not more than 180 days from now:

Now = today();
if Now + 90 <= DepartureDate <= Now + 180;

To print the value that is returned by the TODAY function, this example creates a variable that is equal to the value returned by the TODAY function.
This step is not necessary but is used here to clarify the program. You can also use the function as part of the program statement.

if today() + 90 <= DepartureDate <= today() + 180;

The following program uses the TODAY function to determine which tours to advertise:

options yearcutoff=1920 pagesize=60 linesize=80 pageno=1 nodate;

data ads;
   set mylib.tourdates;
   Now = today();
   if Now + 90 <= DepartureDate <= Now + 180;
run;

proc print data=ads;
   title 'Tours Departing between 90 and 180 Days from Today';
   format DepartureDate Now date9.;
run;

The following output displays the results:

Output 14.10 Using the Current Date as a SAS Date Value

      Tours Departing between 90 and 180 Days from Today       1

Obs    Country    DepartureDate    Nights          Now
  1    Japan          13MAY2000         8    23NOV1999

Note that the PROC PRINT step contains a FORMAT statement that temporarily assigns the format DATE9. to the variables DepartureDate and Now.

Comparing Durations and SAS Date Values

You can use SAS date values to find the units of time between dates. Tradewinds Travel was founded on February 8, 1982. On November 23, 1999, you decide to find out how old Tradewinds Travel is, and you write the following program:

options yearcutoff=1920 pagesize=60 linesize=80 pageno=1 nodate;

/* Calculating a duration in days */
data ttage;
   Start = '08feb82'd;
   RightNow = today();
   Age = RightNow - Start;
   format Start RightNow date9.;
run;

proc print data=ttage;
   title 'Age of Tradewinds Travel';
run;

Output 14.11 Calculating a Duration in Days

                  Age of Tradewinds Travel                     1

Obs        Start     RightNow     Age
  1    08FEB1982    23NOV1999    6497

The value of Age is 6497, a number that looks like an unformatted SAS date value. However, Age is actually the difference between February 8, 1982, and November 23, 1999, and represents a duration in days, not a SAS date value. To make the value of Age more understandable, divide the number of days by 365 (more precisely, 365.25) to produce a duration in years.
The following DATA step calculates the age of Tradewinds Travel in years:

options yearcutoff=1920 pagesize=60 linesize=80 pageno=1 nodate;

/* Calculating a duration in years */
data ttage2;
   Start = '08feb82'd;
   RightNow = today();
   AgeInDays = RightNow - Start;
   AgeInYears = AgeInDays / 365.25;
   format AgeInYears 4.1 Start RightNow date9.;
run;

proc print data=ttage2;
   title 'Age in Years of Tradewinds Travel';
run;

The following output displays the results:

Output 14.12 Calculating a Duration in Years

              Age in Years of Tradewinds Travel                1

Obs        Start     RightNow    AgeInDays    AgeInYears
  1    08FEB1982    23NOV1999         6497          17.8

To show a portion of a year, the value for AgeInYears is assigned a numeric format of 4.1 in the FORMAT statement of the DATA step. The 4 tells SAS that the number contains up to four characters. The 1 tells SAS that the number includes one digit after the decimal point.

Review of SAS Tools

Statements

date-variable='ddMMMyy'D;
   is an assignment statement that tells SAS to convert the date in quotation marks to a SAS date value and assign it to date-variable. The SAS date constant 'ddMMMyy'D specifies a particular date, for example, '23NOV00'D, and can be used in many SAS statements and expressions, not only assignment statements.

FORMAT date-variable date-format;
   tells SAS to format the values of the date-variable using the date-format. A FORMAT statement within a DATA step permanently associates a format with a date-variable.

INPUT date-variable date-informat;
   tells SAS how to read the values for the date-variable from an external file. The date-informat is an instruction that tells SAS the form of the date in the external file.

Formats and Informats for Dates

DATE9.
   the form of the date-variable is ddMMMyyyy, for example, 23NOV2000.

DATE7.
   the form of the date-variable is ddMMMyy, for example, 23NOV00.

MMDDYY10.
   the form of the date-variable is mm/dd/yyyy, for example, 11/23/2000.

MMDDYY8.
   the form of the date-variable is mm/dd/yy, for example, 11/23/00.

WORDDATE18.
   the form of the date-variable is month-name dd, yyyy, for example, November 23, 2000.

WEEKDATE29.
   the form of the date-variable is day-of-the-week, month-name dd, yyyy, for example, Thursday, November 23, 2000.

Functions

WEEKDAY (SAS-date-value)
   is a function that returns the day of the week on which the SAS-date-value falls as a number 1 through 7, with Sunday assigned the value 1.

TODAY()
   is a function that returns a SAS date value corresponding to the date on which the SAS program is initiated.

System Options

YEARCUTOFF=
   specifies the first year of a 100-year span that is used by informats and functions to read two-digit years, and used by formats to display two-digit years. The value that is specified in YEARCUTOFF= can result in a range of years that span two centuries. If YEARCUTOFF=1950, then any two-digit value between 50 and 99 inclusive refers to the first half of the 100-year span, which is in the 1900s. Any two-digit value between 00 and 49 inclusive refers to the second half of the 100-year span, which is in the 2000s. YEARCUTOFF= has no effect on existing SAS dates or dates that are read from input data that include a four-digit year.

Learning More

ATTRIB statement
   Information about using the ATTRIB statement to assign or change a permanent format can be found in SAS Language Reference: Dictionary.

DATASETS procedure
   To assign or change a variable to a permanent format, see the DATASETS procedure in Chapter 34, “Managing SAS Data Libraries,” on page 603.

PUT and INPUT functions
   The PUT and INPUT functions can be used for correcting two common errors in working with SAS dates: treating date values that contain letters or symbols as character variables or storing dates written as numbers as ordinary numeric variables. Neither method enables you to use dates in calculations.
   Information about these functions can be found in SAS Language Reference: Dictionary.

SAS date values
   Documentation on informats, formats, and functions for working with SAS date values, SAS time, and SAS datetime values can be found in SAS Language Reference: Concepts. This documentation includes the following date and time information:

   • SAS stores a time as the number of seconds since midnight of the current day. For example, 9:30 a.m. is 34200. A number of this type is known as a SAS time value. A SAS time value is independent of the date; the count begins at 0 each midnight.
   • When a date and a time are both present, SAS stores the value as the number of seconds since midnight, January 1, 1960. For example, 9:30 a.m., November 23, 2000, is 1290591000. This type of number is known as a SAS datetime value.
   • SAS date and time informats read fields of different widths. SAS date and time formats can display date variables in different ways according to the widths that you specify in the format name. The number at the end of the format or informat name indicates the number of columns that SAS can use. For example, the DATE9. informat reads up to nine columns (as in 23NOV2000). The WEEKDATE8. format displays eight columns, as in Thursday, and WEEKDATE27. displays 27 columns, as in Thursday, November 23, 2000.
   • SAS provides date, time, and datetime intervals for counting different periods of elapsed time, such as MONTH, which represents an interval from the beginning of one month to the next, not a period of 30 or 31 days.
   • International date, time, and datetime formats.

SYSDATE9
   To include the current date in a title, you can use the macro variable SYSDATE9, which is explained in Chapter 25, “Producing Detail Reports with the PRINT Procedure,” on page 371.

PART 4 Combining SAS Data Sets

Chapter 15  Methods of Combining SAS Data Sets
Chapter 16  Concatenating SAS Data Sets
Chapter 17  Interleaving SAS Data Sets
Chapter 18  Merging SAS Data Sets
Chapter 19  Updating SAS Data Sets
Chapter 20  Modifying SAS Data Sets
Chapter 21  Conditionally Processing Observations from Multiple SAS Data Sets

CHAPTER 15 Methods of Combining SAS Data Sets

Introduction to Combining SAS Data Sets
   Purpose
   Prerequisites
Definition of Concatenating
Definition of Interleaving
Definition of Merging
Definition of Updating
Definition of Modifying
Comparing Modifying, Merging, and Updating Data Sets
Learning More

Introduction to Combining SAS Data Sets

Purpose

SAS provides several different methods for combining SAS data sets. In this section, you will be introduced to five methods of combining data sets:

• concatenating
• interleaving
• merging
• updating
• modifying

Subsequent sections teach you how to use these methods.

Prerequisites

Before continuing with this section, you should understand the concepts presented in the following sections:

• Chapter 2, “Introduction to DATA Step Processing,” on page 19
• Chapter 5, “Starting with SAS Data Sets,” on page 81
• Chapter 6, “Understanding DATA Step Processing,” on page 97

Definition of Concatenating

Concatenating combines two or more SAS data sets, one after the other, into a single SAS data set. You concatenate data sets using either the SET statement in a DATA step or the APPEND procedure. The following figure shows the results of concatenating two SAS data sets, and the DATA step that produces the results.
Figure 15.1 Concatenating Two SAS Data Sets

DATA1        DATA2        COMBINED
Year         Year         Year
1996         1996         1996
1997         1997         1997
1998    +    1998    =    1998
1999         1999         1999
2000         2000         2000
                          1996
                          1997
                          1998
                          1999
                          2000

data combined;
   set data1 data2;
run;

Definition of Interleaving

Interleaving combines individual, sorted SAS data sets into one sorted SAS data set. For each observation, the following figure shows the value of the variable by which the data sets are sorted. (In this example, the data sets are sorted by the variable Year.) You interleave data sets using a SET statement along with a BY statement.

Figure 15.2 Interleaving SAS Data Sets

DATA1        DATA2        COMBINED
Year         Year         Year
1995         1996         1995
1996         1997         1996
1997    +    1998    =    1996
1998         1999         1997
1999         2000         1997
                          1998
                          1998
                          1999
                          1999
                          2000

data combined;
   set data1 data2;
   by Year;
run;

Definition of Merging

Merging combines observations from two or more SAS data sets into a single observation in a new data set. A one-to-one merge, shown in the following figure, combines observations based on their position in the data sets. You use the MERGE statement for one-to-one merging.

Figure 15.3 One-to-One Merging

DATA1        DATA2        COMBINED
VarX         VarY         VarX  VarY
X1           Y1           X1    Y1
X2           Y2           X2    Y2
X3      +    Y3      =    X3    Y3
X4           Y4           X4    Y4
X5           Y5           X5    Y5

data combined;
   merge data1 data2;
run;

A match-merge, shown in the following figure, combines observations based on the values of one or more common variables. If you are performing a match-merge, then use the MERGE statement along with a BY statement. (In this example, two data sets are match-merged by the value of the variable Year.)
Figure 15.4 Match-Merging Two SAS Data Sets

DATA1             DATA2             COMBINED
Year  VarX        Year  VarY        Year  VarX  VarY
1996  X1          1996  Y1          1996  X1    Y1
1997  X2          1996  Y2          1996  X1    Y2
1998  X3     +    1998  Y3     =    1997  X2
1999  X4          1999  Y4          1998  X3    Y3
2000  X5          2000  Y5          1999  X4    Y4
                                    2000  X5    Y5

data combined;
   merge data1 data2;
   by Year;
run;

Definition of Updating

Updating a SAS data set replaces the values of variables in one data set (the master data set) with values from another data set (the transaction data set). If the UPDATEMODE= option in the UPDATE statement is set to MISSINGCHECK, then missing values in a transaction data set do not replace existing values in a master data set. If the UPDATEMODE= option is set to NOMISSINGCHECK, then missing values in a transaction data set replace existing values in a master data set. The default setting is MISSINGCHECK.

You update a data set by using the UPDATE statement along with a BY statement. Both of the input data sets must be sorted by the variable that you use in the BY statement. The following figure shows the results of updating a SAS data set.

Figure 15.5 Updating a Master Data Set

MASTER                 TRANSACTION            MASTER
Year  VarX  VarY       Year  VarX  VarY       Year  VarX  VarY
1990  X1    Y1         1996  X2               1990  X1    Y1
1991  X1    Y1         1997  X2    Y2         1991  X1    Y1
1992  X1    Y1         1998  X2               1992  X1    Y1
1993  X1    Y1         1998        Y2         1993  X1    Y1
1994  X1    Y1    +    2000  X2    Y2    =    1994  X1    Y1
1995  X1    Y1                                1995  X1    Y1
1996  X1    Y1                                1996  X2    Y1
1997  X1    Y1                                1997  X2    Y2
1998  X1    Y1                                1998  X2    Y2
1999  X1    Y1                                1999  X1    Y1
                                              2000  X2    Y2

data master;
   update master transaction;
   by Year;
run;

Definition of Modifying

Modifying a SAS data set replaces, deletes, or appends observations in an existing data set. Modifying a SAS data set is similar to updating a SAS data set, but the following differences exist:

• Modifying cannot create a new data set, while updating can.
• Unlike updating, modifying does not require that the master data set or the transaction data set be sorted.

You change an existing file by using the MODIFY statement along with a BY statement. The following figure shows the results.

Figure 15.6 Modifying a Data Set

(Figure: the MASTER data set, the TRANSACTION data set, and the resulting MASTER data set, each with the variables Year, VarX, and VarY.)

data master;
   modify master transaction;
   by Year;
run;

Comparing Modifying, Merging, and Updating Data Sets

The table that follows summarizes several differences among the MERGE, UPDATE, and MODIFY statements.

Criterion                  MERGE                    UPDATE                    MODIFY
-------------------------  -----------------------  ------------------------  ------------------------
Data sets must be sorted   Match-merge: Yes         Yes                       No
or indexed                 One-to-one merge: No

BY values must be unique   No                       Master data set: Yes      No
                                                    Transaction data set: No

Can create or delete       Yes                      Yes                       No
variables

Number of data sets        Any number               2                         2
combined

Processing missing values  Overwrites nonmissing    Default behavior:         Depends on the value of
                           values from first data   missing values in the     the UPDATEMODE= option
                           set with missing values  transaction data set do   (see “Comparing
                           from second data set     not replace values in     Modifying, Merging, and
                                                    the master data set       Updating Data Sets” on
                                                                              page 238)
                                                                              Default: MISSINGCHECK

Learning More

Concatenating data sets
   For more information about concatenating data sets, see Chapter 16, “Concatenating SAS Data Sets,” on page 241.

Interleaving data sets
   For more information about interleaving data sets, see Chapter 17, “Interleaving SAS Data Sets,” on page 263.

Manipulating data sets
   You can manipulate data sets as you combine them.
   For example, you can select certain observations from each data set and determine which data set an observation came from. For more information, see Chapter 21, “Conditionally Processing Observations from Multiple SAS Data Sets,” on page 323.

MERGE, MODIFY, and UPDATE statements
   For more information about these statements, see the Statements section of SAS Language Reference: Dictionary, and the Reading, Combining, and Modifying SAS Data Sets section of SAS Language Reference: Concepts.

Merging data sets
   For more information about merging data sets, see Chapter 18, “Merging SAS Data Sets,” on page 269.

Modifying data sets
   For more information about modifying data sets, see Chapter 20, “Modifying SAS Data Sets,” on page 311, and Chapter 21, “Conditionally Processing Observations from Multiple SAS Data Sets,” on page 323.

Updating data sets
   For more information about updating data sets, see Chapter 19, “Updating SAS Data Sets,” on page 293.

CHAPTER 16 Concatenating SAS Data Sets

Introduction to Concatenating SAS Data Sets
   Purpose
   Prerequisites
Concatenating Data Sets with the SET Statement
   Understanding the SET Statement
   Using the SET Statement: The Simplest Case
   Using the SET Statement When Data Sets Contain Different Variables
   Using the SET Statement When Variables Have Different Attributes
      Understanding Attributes
      Using the SET Statement When Variables Have Different Types
      Changing the Type of a Variable
      Using the SET Statement When Variables Have Different Formats, Informats, or Labels
      Using the SET Statement When Variables Have Different Lengths
Concatenating Data Sets Using the APPEND Procedure
   Understanding the APPEND Procedure
   Using the APPEND Procedure: The Simplest Case
   Using the APPEND Procedure When Data Sets Contain Different Variables
   Using the APPEND Procedure When Variables Have Different Attributes
Choosing between the SET Statement and the APPEND Procedure
Review of SAS Tools
   Statements
   Procedures
Learning More

Introduction to Concatenating SAS Data Sets

Purpose

Concatenating combines two or more SAS data sets, one after the other, into a single data set. The number of observations in the new data set is the sum of the number of observations in the original data sets. You can concatenate SAS data sets by using

• the SET statement in a DATA step
• the APPEND procedure

If the data sets that you concatenate contain the same variables, and each variable has the same attributes in all data sets, then the results of the SET statement and PROC APPEND are the same. In other cases, the results differ. In this section you will learn both of these methods and their differences so that you can decide which one to use.

Prerequisites

Before continuing with this section, you should be familiar with the concepts presented in Chapter 5, “Starting with SAS Data Sets,” on page 81 through Chapter 8, “Working with Character Variables,” on page 119.

Concatenating Data Sets with the SET Statement

Understanding the SET Statement

The SET statement reads observations from one or more SAS data sets and uses them to build a new data set. The SET statement for concatenating data sets has the following form:

SET SAS-data-set(s);

where SAS-data-set is two or more SAS data sets to concatenate. The observations from the first data set that you name in the SET statement appear first in the new data set. The observations from the second data set follow those from the first data set, and so on. The list can contain any number of data sets.

Using the SET Statement: The Simplest Case

In the simplest situation, the data sets that you concatenate contain the same variables (variables with the same name). In addition, the type, length, informat, format, and label of each variable match across all data sets.
In this case, SAS copies all observations from the first data set into the new data set, then copies all observations from the second data set into the new data set, and so on. Each observation is an exact copy of the original.

In the following example, a company that uses SAS to maintain personnel records for six separate departments decided to combine all personnel records. Two departments, Sales and Customer Support, store their data in the same form. Each observation in both data sets contains values for these variables:

EmployeeID
   is a character variable that contains the employee’s identification number.

Name
   is a character variable that contains the employee’s name in the form last name, comma, first name.

HireDate
   is a numeric variable that contains the date the employee was hired. This variable has a format of DATE9.

Salary
   is a numeric variable that contains the employee’s annual salary in US dollars.

HomePhone
   is a character variable that contains the employee’s home telephone number.

The following program creates the SAS data sets SALES and CUSTOMER_SUPPORT:

options pagesize=60 linesize=80 pageno=1 nodate;

data sales;
   input EmployeeID $ 1-9 Name $ 11-29 @30 HireDate date9.
         Salary HomePhone $;
   format HireDate date9.;
   datalines;
429685482 Martin, Virginia   09aug1990 34800 493-0824
244967839 Singleton, MaryAnn 24apr1995 27900 929-2623
996740216 Leighton, Maurice  16dec1993 32600 933-6908
675443925 Freuler, Carl      15feb1998 29900 493-3993
845729308 Cage, Merce        19oct1992 39800 286-0519
;

proc print data=sales;
   title 'Sales Department Employees';
run;

data customer_support;
   input EmployeeID $ 1-9 Name $ 11-29 @30 HireDate date9.
         Salary HomePhone $;
   format HireDate date9.;
   datalines;
324987451 Sayre, Jay         15nov1994 44800 933-2998
596771321 Tolson, Andrew     18mar1998 41200 929-4800
477562122 Jensen, Helga      01feb1991 47400 286-2816
894724859 Kulenic, Marie     24jun1993 41400 493-1472
988427431 Zweerink, Anna     07jul1995 43700 929-3885
;

proc print data=customer_support;
   title 'Customer Support Department Employees';
run;

The following output shows the results of both DATA steps:

Output 16.1 The SALES and the CUSTOMER_SUPPORT Data Sets

                  Sales Department Employees                   1

Obs    EmployeeID    Name                   HireDate    Salary    HomePhone
  1    429685482     Martin, Virginia      09AUG1990     34800    493-0824
  2    244967839     Singleton, MaryAnn    24APR1995     27900    929-2623
  3    996740216     Leighton, Maurice     16DEC1993     32600    933-6908
  4    675443925     Freuler, Carl         15FEB1998     29900    493-3993
  5    845729308     Cage, Merce           19OCT1992     39800    286-0519

             Customer Support Department Employees             2

Obs    EmployeeID    Name                   HireDate    Salary    HomePhone
  1    324987451     Sayre, Jay            15NOV1994     44800    933-2998
  2    596771321     Tolson, Andrew        18MAR1998     41200    929-4800
  3    477562122     Jensen, Helga         01FEB1991     47400    286-2816
  4    894724859     Kulenic, Marie        24JUN1993     41400    493-1472
  5    988427431     Zweerink, Anna        07JUL1995     43700    929-3885

To concatenate the two data sets, list them in the SET statement. Use the PRINT procedure to display the resulting DEPT1_2 data set.

options pagesize=60 linesize=80 pageno=1 nodate;

data dept1_2;
   set sales customer_support;
run;

proc print data=dept1_2;
   title 'Employees in Sales and Customer Support Departments';
run;

The following output shows the new DEPT1_2 data set.
The data set contains all observations from SALES followed by all observations from CUSTOMER_SUPPORT:

Output 16.2 The Concatenated DEPT1_2 Data Set

      Employees in Sales and Customer Support Departments      1

Obs    EmployeeID    Name                   HireDate    Salary    HomePhone
  1    429685482     Martin, Virginia      09AUG1990     34800    493-0824
  2    244967839     Singleton, MaryAnn    24APR1995     27900    929-2623
  3    996740216     Leighton, Maurice     16DEC1993     32600    933-6908
  4    675443925     Freuler, Carl         15FEB1998     29900    493-3993
  5    845729308     Cage, Merce           19OCT1992     39800    286-0519
  6    324987451     Sayre, Jay            15NOV1994     44800    933-2998
  7    596771321     Tolson, Andrew        18MAR1998     41200    929-4800
  8    477562122     Jensen, Helga         01FEB1991     47400    286-2816
  9    894724859     Kulenic, Marie        24JUN1993     41400    493-1472
 10    988427431     Zweerink, Anna        07JUL1995     43700    929-3885

Using the SET Statement When Data Sets Contain Different Variables

The two data sets in the previous example contain the same variables, and each variable is defined the same way in both data sets. However, you might want to concatenate data sets when not all variables are common to the data sets that are named in the SET statement. In this case, each observation in the new data set includes all variables from the SAS data sets that are named in the SET statement.

The examples in this section show the SECURITY data set, and the concatenation of this data set to the SALES and the CUSTOMER_SUPPORT data sets. Not all variables are common to the three data sets. The personnel records for the Security department do not include the variable HomePhone, and do include the new variable Gender, which does not appear in the SALES or the CUSTOMER_SUPPORT data sets.

The following program creates the SECURITY data set:

options pagesize=60 linesize=80 pageno=1 nodate;

data security;
   input EmployeeID $ 1-9 Name $ 11-29 Gender $ 30
         @32 HireDate date9. Salary;
   format HireDate date9.;
   datalines;
744289612 Saparilas, Theresa F 09may1998 33400
824904032 Brosnihan, Dylan   M 04jan1992 38200
242779184 Chao, Daeyong      M 28sep1995 37500
544382887 Slifkin, Leah      F 24jul1994 45000
933476520 Perry, Marguerite  F 19apr1992 39900
;

proc print data=security;
   title 'Security Department Employees';
run;

The following output shows the results:

Output 16.3 The SECURITY Data Set

                Security Department Employees                  1

Obs    EmployeeID    Name                  Gender     HireDate    Salary
  1    744289612     Saparilas, Theresa      F       09MAY1998     33400
  2    824904032     Brosnihan, Dylan        M       04JAN1992     38200
  3    242779184     Chao, Daeyong           M       28SEP1995     37500
  4    544382887     Slifkin, Leah           F       24JUL1994     45000
  5    933476520     Perry, Marguerite       F       19APR1992     39900

The following program concatenates the SALES, CUSTOMER_SUPPORT, and SECURITY data sets, and creates the new data set, DEPT1_3:

options pagesize=60 linesize=80 pageno=1 nodate;

data dept1_3;
   set sales customer_support security;
run;

proc print data=dept1_3;
   title 'Employees in Sales, Customer Support,';
   title2 'and Security Departments';
run;

The following output shows the results:

Output 16.4 The Concatenated DEPT1_3 Data Set

             Employees in Sales, Customer Support,             1
                   and Security Departments

Obs    EmployeeID    Name                   HireDate    Salary    HomePhone    Gender
  1    429685482     Martin, Virginia      09AUG1990     34800    493-0824
  2    244967839     Singleton, MaryAnn    24APR1995     27900    929-2623
  3    996740216     Leighton, Maurice     16DEC1993     32600    933-6908
  4    675443925     Freuler, Carl         15FEB1998     29900    493-3993
  5    845729308     Cage, Merce           19OCT1992     39800    286-0519
  6    324987451     Sayre, Jay            15NOV1994     44800    933-2998
  7    596771321     Tolson, Andrew        18MAR1998     41200    929-4800
  8    477562122     Jensen, Helga         01FEB1991     47400    286-2816
  9    894724859     Kulenic, Marie        24JUN1993     41400    493-1472
 10    988427431     Zweerink, Anna        07JUL1995     43700    929-3885
 11    744289612     Saparilas, Theresa    09MAY1998     33400                   F
 12    824904032     Brosnihan, Dylan      04JAN1992     38200                   M
 13    242779184     Chao, Daeyong         28SEP1995     37500                   M
 14    544382887     Slifkin, Leah         24JUL1994     45000                   F
 15    933476520     Perry, Marguerite     19APR1992     39900                   F

All observations in the data set DEPT1_3 have values for both the variable Gender and the variable HomePhone. Observations from data sets SALES and CUSTOMER_SUPPORT, the data sets that do not contain the variable Gender, have missing values for Gender (indicated by blanks under the variable name). Observations from SECURITY, the data set that does not contain the variable HomePhone, have missing values for HomePhone (indicated by blanks under the variable name).

Using the SET Statement When Variables Have Different Attributes

Understanding Attributes

Each variable in a SAS data set can have as many as six attributes that are associated with it. These attributes are

name
   identifies a variable. That is, when SAS looks at two or more data sets, it considers variables with the same name to be the same variable.

type
   identifies a variable as character or numeric.

length
   refers to the number of bytes that SAS uses to store each of the variable’s values in a SAS data set. Length is an especially important consideration when you use character variables, because the default length of character variables is eight bytes. If your data values are greater than eight bytes, then you can use a LENGTH statement to specify the number of bytes of storage that you need so that your data is not truncated.

informat
   refers to the instructions that SAS uses when reading data values. These instructions specify the form of an input value.

format
   refers to the instructions that SAS uses when writing data values. These instructions specify the form of an output value.

label
   refers to descriptive text that is associated with a specific variable.
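The attributes listed above can be set explicitly in a DATA step and then inspected. The following is a minimal sketch under stated assumptions: the data set name ATTRDEMO, the length of 19 bytes, and the label text are hypothetical and chosen only for illustration.

```sas
/* Sketch only: ATTRDEMO, the length, and the label are hypothetical. */
data attrdemo;
   length Name $ 19;                     /* length: 19 bytes of storage */
   Name = 'Singleton, MaryAnn';          /* type: character             */
   HireDate = '24apr1995'd;              /* type: numeric (a SAS date)  */
   format HireDate date9.;               /* format used when writing    */
   label HireDate = 'Date of hire';      /* descriptive label           */
run;

/* PROC CONTENTS lists each variable's name, type, length, format, and label */
proc contents data=attrdemo;
run;
```

Comparing the PROC CONTENTS listings of two data sets is one way to see whether their variables agree in type and length before you name them in the same SET statement.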
If the data sets that you name in the SET statement contain variables with the same names and types, then you can concatenate the data sets without modification. However, if variable types differ, then you must modify one or more data sets before concatenating them. When lengths, formats, informats, or labels differ, you might want to modify one or more data sets before proceeding.

Using the SET Statement When Variables Have Different Types

If a variable is defined as a character variable in one data set that is named in the SET statement, and as a numeric variable in another, then SAS issues an error message and does not concatenate the data sets.

In the following example, the Accounting department in the company treats the employee identification number (EmployeeID) as a numeric variable, whereas all other departments treat it as a character variable. The following program creates the ACCOUNTING data set:

options pagesize=60 linesize=80 pageno=1 nodate;

data accounting;
   input EmployeeID 1-9 Name $ 11-29 Gender $ 30
         @32 HireDate date9. Salary;
   format HireDate date9.;
   datalines;
634875680 Gardinski, Barbara F 29may1998 49800
824576630 Robertson, Hannah  F 14mar1995 52700
744826703 Gresham, Jean      F 28apr1992 54000
824447605 Kruize, Ronald     M 23may1994 49200
988674342 Linzer, Fritz      M 23jul1992 50400
;

proc print data=accounting;
   title 'Accounting Department Employees';
run;

The following output shows the results:

Output 16.5  The ACCOUNTING Data Set

              Accounting Department Employees                1

     Employee
Obs  ID         Name                 Gender  HireDate   Salary

  1  634875680  Gardinski, Barbara   F       29MAY1998   49800
  2  824576630  Robertson, Hannah    F       14MAR1995   52700
  3  744826703  Gresham, Jean        F       28APR1992   54000
  4  824447605  Kruize, Ronald       M       23MAY1994   49200
  5  988674342  Linzer, Fritz        M       23JUL1992   50400

The following program attempts to concatenate the data sets for all four departments:

data dept1_4;
   set sales customer_support security accounting;
run;

The program fails because of the difference in variable type among the four departments, and SAS writes the following error message to the log:

ERROR: Variable EmployeeID has been defined as both character and numeric.

Changing the Type of a Variable

One way to correct the error in the previous example is to change the type of the variable EmployeeID in ACCOUNTING from numeric to character. Because performing calculations on employee identification numbers is unlikely, EmployeeID can be a character variable.

To change the type of the variable EmployeeID, you can

- re-create the data set, changing the INPUT statement so that it identifies EmployeeID as a character variable
- use the PUT function to create a new variable, and data set options to rename and drop variables.
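The first option can be sketched as follows. This is only an illustration: it reuses the ACCOUNTING raw data shown earlier and adds a dollar sign ($) after EmployeeID in the INPUT statement. The data set name ACCOUNTING_CHAR is an assumption that is used here so that the original ACCOUNTING data set is left unchanged.

```sas
/* Sketch: re-create the Accounting data with EmployeeID read as character. */
/* ACCOUNTING_CHAR is a hypothetical name chosen for this illustration.     */
data accounting_char;
   input EmployeeID $ 1-9 Name $ 11-29 Gender $ 30   /* $ makes EmployeeID character */
         @32 HireDate date9. Salary;
   format HireDate date9.;
   datalines;
634875680 Gardinski, Barbara F 29may1998 49800
824576630 Robertson, Hannah  F 14mar1995 52700
744826703 Gresham, Jean      F 28apr1992 54000
824447605 Kruize, Ronald     M 23may1994 49200
988674342 Linzer, Fritz      M 23jul1992 50400
;
run;
```

This approach requires access to the raw data. When only the SAS data set is available, the PUT function approach is the practical choice.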
The following program uses the PUT function and data set options to change the variable type of EmployeeID from numeric to character:

options pagesize=60 linesize=80 pageno=1 nodate;

data new_accounting (rename=(TempVar=EmployeeID) drop=EmployeeID);   /* (1) */
   set accounting;                                                   /* (2) */
   TempVar=put(EmployeeID, 9.);                                      /* (3) */
run;

proc datasets library=work;                                          /* (4) */
   contents data=new_accounting;
run;
quit;

The following list corresponds to the numbered items in the preceding program:

(1) The RENAME= data set option renames the variable TempVar to EmployeeID when SAS writes an observation to the output data set. The DROP= data set option is applied before the RENAME= option. The result is a change in the variable type for EmployeeID from numeric to character.

    Note: Although this example creates a new data set called NEW_ACCOUNTING, you can create a data set that has the same name as the data set that is listed on the SET statement. If you do this, then the type attribute for EmployeeID will be permanently altered in the ACCOUNTING data set.

(2) The SET statement reads observations from the ACCOUNTING data set.

(3) The PUT function converts a numeric value to a character value, and applies a format to the variable EmployeeID. The assignment statement assigns the result of the PUT function to the variable TempVar.

(4) The DATASETS procedure enables you to verify the new attribute type for EmployeeID. Because PROC DATASETS keeps running after a RUN statement, the QUIT statement ends the procedure.

The following output shows a partial listing from PROC DATASETS:

Output 16.6  PROC DATASETS Output for the NEW_ACCOUNTING Data Set

-----Alphabetic List of Variables and Attributes-----

#    Variable      Type    Len    Pos    Format
-----------------------------------------------
5    EmployeeID    Char      9     36
2    Gender        Char      1     35
3    HireDate      Num       8      0    DATE9.
1    Name          Char     19     16
4    Salary        Num       8      8

Now that the types of all variables match, you can easily concatenate all four data sets using the following program:

options pagesize=60 linesize=80 pageno=1 nodate;

data dept1_4;
   set sales customer_support security new_accounting;
run;

proc print data=dept1_4;
   title 'Employees in Sales, Customer Support, Security,';
   title2 'and Accounting Departments';
run;

The following output shows the results:

Output 16.7  The Concatenated DEPT1_4 Data Set

          Employees in Sales, Customer Support, Security,            1
                    and Accounting Departments

     Employee                                           Home
Obs  ID         Name                 HireDate   Salary  Phone     Gender

  1  429685482  Martin, Virginia     09AUG1990   34800  493-0824
  2  244967839  Singleton, MaryAnn   24APR1995   27900  929-2623
  3  996740216  Leighton, Maurice    16DEC1993   32600  933-6908
  4  675443925  Freuler, Carl        15FEB1998   29900  493-3993
  5  845729308  Cage, Merce          19OCT1992   39800  286-0519
  6  324987451  Sayre, Jay           15NOV1994   44800  933-2998
  7  596771321  Tolson, Andrew       18MAR1998   41200  929-4800
  8  477562122  Jensen, Helga        01FEB1991   47400  286-2816
  9  894724859  Kulenic, Marie       24JUN1993   41400  493-1472
 10  988427431  Zweerink, Anna       07JUL1995   43700  929-3885
 11  744289612  Saparilas, Theresa   09MAY1998   33400            F
 12  824904032  Brosnihan, Dylan     04JAN1992   38200            M
 13  242779184  Chao, Daeyong        28SEP1995   37500            M
 14  544382887  Slifkin, Leah        24JUL1994   45000            F
 15  933476520  Perry, Marguerite    19APR1992   39900            F
 16  634875680  Gardinski, Barbara   29MAY1998   49800            F
 17  824576630  Robertson, Hannah    14MAR1995   52700            F
 18  744826703  Gresham, Jean        28APR1992   54000            F
 19  824447605  Kruize, Ronald       23MAY1994   49200            M
 20  988674342  Linzer, Fritz        23JUL1992   50400            M

Using the SET Statement When Variables Have Different Formats, Informats, or Labels

When you concatenate data sets with the SET statement, the following rules determine which formats, informats, and labels are associated with variables in the new data set:
- An explicitly defined format, informat, or label overrides a default, regardless of the position of the data sets in the SET statement.
- If two or more data sets explicitly define different formats, informats, or labels for the same variable, then the variable in the new data set assumes the attribute from the first data set in the SET statement that explicitly defines that attribute.

Returning to the examples, you may have noticed that the DATA steps that created the SALES, CUSTOMER_SUPPORT, SECURITY, and ACCOUNTING data sets use a FORMAT statement to explicitly assign a format of DATE9. to the variable HireDate. Therefore, although HireDate is a numeric variable, it appears in all displays as DDMMMYYYY (for example, 13DEC2000). The SHIPPING data set that is created in the following example, however, uses a format of DATE7. for HireDate. The DATE7. format displays as DDMMMYY (for example, 13DEC00).

In addition, the SALES, CUSTOMER_SUPPORT, SECURITY, and ACCOUNTING data sets contain a default format for Salary, whereas the SHIPPING data set contains an explicitly defined format, COMMA6., for the same variable. The COMMA6. format inserts a comma in the appropriate place when SAS displays the numeric variable Salary.

The following program creates the data set for the Shipping department:

options pagesize=60 linesize=80 pageno=1 nodate;

data shipping;
   input employeeID $ 1-9 Name $ 11-29 Gender $ 30
         @32 HireDate date9.
         @42 Salary;
   format HireDate date7.
          Salary comma6.;
   datalines;
688774609 Carlton, Susan      F 28jan1995 29200
922448328 Hoffmann, Gerald    M 12oct1997 27600
544909752 DePuis, David       M 23aug1994 32900
745609821 Hahn, Kenneth       M 23aug1994 33300
634774295 Landau, Jennifer    F 30apr1996 32900
;

proc print data=shipping;
   title 'Shipping Department Employees';
run;

The following output shows the results:

Output 16.8  The SHIPPING Data Set

               Shipping Department Employees                 1

     employee                                Hire
Obs  ID         Name                 Gender  Date     Salary

  1  688774609  Carlton, Susan       F       28JAN95  29,200
  2  922448328  Hoffmann, Gerald     M       12OCT97  27,600
  3  544909752  DePuis, David        M       23AUG94  32,900
  4  745609821  Hahn, Kenneth        M       23AUG94  33,300
  5  634774295  Landau, Jennifer     F       30APR96  32,900

Now consider what happens when you concatenate SHIPPING with the previous four data sets.

options pagesize=60 linesize=80 pageno=1 nodate;

data dept1_5;
   set sales customer_support security new_accounting shipping;
run;

proc print data=dept1_5;
   title 'Employees in Sales, Customer Support, Security,';
   title2 'Accounting, and Shipping Departments';
run;

The following output shows the results:

Output 16.9  The DEPT1_5 Data Set: Concatenation of Five Data Sets

          Employees in Sales, Customer Support, Security,            1
               Accounting, and Shipping Departments

     Employee                                           Home
Obs  ID         Name                 HireDate   Salary  Phone     Gender

  1  429685482  Martin, Virginia     09AUG1990  34,800  493-0824
  2  244967839  Singleton, MaryAnn   24APR1995  27,900  929-2623
  3  996740216  Leighton, Maurice    16DEC1993  32,600  933-6908
  4  675443925  Freuler, Carl        15FEB1998  29,900  493-3993
  5  845729308  Cage, Merce          19OCT1992  39,800  286-0519
  6  324987451  Sayre, Jay           15NOV1994  44,800  933-2998
  7  596771321  Tolson, Andrew       18MAR1998  41,200  929-4800
  8  477562122  Jensen, Helga        01FEB1991  47,400  286-2816
  9  894724859  Kulenic, Marie       24JUN1993  41,400  493-1472
 10  988427431  Zweerink, Anna       07JUL1995  43,700  929-3885
 11  744289612  Saparilas, Theresa   09MAY1998  33,400            F
 12  824904032  Brosnihan, Dylan     04JAN1992  38,200            M
 13  242779184  Chao, Daeyong        28SEP1995  37,500            M
 14  544382887  Slifkin, Leah        24JUL1994  45,000            F
 15  933476520  Perry, Marguerite    19APR1992  39,900            F
 16  634875680  Gardinski, Barbara   29MAY1998  49,800            F
 17  824576630  Robertson, Hannah    14MAR1995  52,700            F
 18  744826703  Gresham, Jean        28APR1992  54,000            F
 19  824447605  Kruize, Ronald       23MAY1994  49,200            M
 20  988674342  Linzer, Fritz        23JUL1992  50,400            M
 21  688774609  Carlton, Susan       28JAN1995  29,200            F
 22  922448328  Hoffmann, Gerald     12OCT1997  27,600            M
 23  544909752  DePuis, David        23AUG1994  32,900            M
 24  745609821  Hahn, Kenneth        23AUG1994  33,300            M
 25  634774295  Landau, Jennifer     30APR1996  32,900            F

In this concatenation, the input data sets contain the variable HireDate, which was explicitly defined using two different formats. The data sets also contain the variable Salary, which has both a default and an explicit format. You can see from the output that SAS creates the new data set according to the rules mentioned earlier:

- In the case of HireDate, SAS uses the format that is defined in the first data set that is named in the SET statement (DATE9. in SALES).
- In the case of Salary, SAS uses the explicit format (COMMA6.) that is defined in the SHIPPING data set. In this case, SAS does not use the default format.

Notice the difference if you perform a similar concatenation but reverse the order of the data sets in the SET statement.
options pagesize=60 linesize=80 pageno=1 nodate;

data dept5_1;
   set shipping new_accounting security customer_support sales;
run;

proc print data=dept5_1;
   title 'Employees in Shipping, Accounting, Security,';
   title2 'Customer Support, and Sales Departments';
run;

The following output shows the results:

Output 16.10  The DEPT5_1 Data Set: Changing the Order of Concatenation

           Employees in Shipping, Accounting, Security,              1
              Customer Support, and Sales Departments

     employee                                Hire             Home
Obs  ID         Name                 Gender  Date     Salary  Phone

  1  688774609  Carlton, Susan       F       28JAN95  29,200
  2  922448328  Hoffmann, Gerald     M       12OCT97  27,600
  3  544909752  DePuis, David        M       23AUG94  32,900
  4  745609821  Hahn, Kenneth        M       23AUG94  33,300
  5  634774295  Landau, Jennifer     F       30APR96  32,900
  6  634875680  Gardinski, Barbara   F       29MAY98  49,800
  7  824576630  Robertson, Hannah    F       14MAR95  52,700
  8  744826703  Gresham, Jean        F       28APR92  54,000
  9  824447605  Kruize, Ronald       M       23MAY94  49,200
 10  988674342  Linzer, Fritz        M       23JUL92  50,400
 11  744289612  Saparilas, Theresa   F       09MAY98  33,400
 12  824904032  Brosnihan, Dylan     M       04JAN92  38,200
 13  242779184  Chao, Daeyong        M       28SEP95  37,500
 14  544382887  Slifkin, Leah        F       24JUL94  45,000
 15  933476520  Perry, Marguerite    F       19APR92  39,900
 16  324987451  Sayre, Jay                   15NOV94  44,800  933-2998
 17  596771321  Tolson, Andrew               18MAR98  41,200  929-4800
 18  477562122  Jensen, Helga                01FEB91  47,400  286-2816
 19  894724859  Kulenic, Marie               24JUN93  41,400  493-1472
 20  988427431  Zweerink, Anna               07JUL95  43,700  929-3885
 21  429685482  Martin, Virginia             09AUG90  34,800  493-0824
 22  244967839  Singleton, MaryAnn           24APR95  27,900  929-2623
 23  996740216  Leighton, Maurice            16DEC93  32,600  933-6908
 24  675443925  Freuler, Carl                15FEB98  29,900  493-3993
 25  845729308  Cage, Merce                  19OCT92  39,800  286-0519

Compared with the output in Output 16.9, this example shows that not only does the order of the observations change, but in the case of HireDate, the DATE7. format specified in SHIPPING now prevails because that data set now appears first in the SET statement. The COMMA6. format prevails for the variable Salary because SHIPPING is the only data set that explicitly specifies a format for the variable.

Using the SET Statement When Variables Have Different Lengths

If you use the SET statement to concatenate data sets in which the same variable has different lengths, then the outcome of the concatenation depends on whether the variable is character or numeric. The SET statement determines the length of variables as follows:

- For a character or numeric variable, an explicitly defined length overrides a default, regardless of the position of the data sets in the SET statement.
- If two or more data sets explicitly define different lengths for the same numeric variable, then the variable in the new data set has the same length as the variable in the data set that appears first in the SET statement.
- If the length of a character variable differs among data sets, whether or not the differences are explicit, then the variable in the new data set has the same length as the variable in the data set that appears first in the SET statement.

The following program creates the RESEARCH data set for the sixth department, Research. Notice that the INPUT statement for this data set creates the variable Name with a length of 27; in all other data sets, Name has a length of 19.

options pagesize=60 linesize=80 pageno=1 nodate;

data research;
   input EmployeeID $ 1-9 Name $ 11-37 Gender $ 38
         @40 HireDate date9. Salary;
   format HireDate date9.;
   datalines;
922854076 Schoenberg, Marguerite      F 19nov1994 39800
770434994 Addison-Hardy, Jonathon     M 23feb1992 41400
242784883 McNaughton, Elizabeth       F 24jul1993 45000
377882806 Tharrington, Catherine      F 28sep1994 38600
292450691 Frangipani, Christopher     M 12aug1990 43900
;

proc print data=research;
   title 'Research Department Employees';
run;

The following output shows the results:

Output 16.11  The RESEARCH Data Set

                 Research Department Employees                   1

     Employee
Obs  ID         Name                     Gender  HireDate   Salary

  1  922854076  Schoenberg, Marguerite   F       19NOV1994   39800
  2  770434994  Addison-Hardy, Jonathon  M       23FEB1992   41400
  3  242784883  McNaughton, Elizabeth    F       24JUL1993   45000
  4  377882806  Tharrington, Catherine   F       28SEP1994   38600
  5  292450691  Frangipani, Christopher  M       12AUG1990   43900

If you concatenate all six data sets, naming RESEARCH in any position except the first in the SET statement, then SAS defines Name with a length of 19.

If you want your program to use the Name variable that has a length of 27, then you have two options. You can

- change the order of data sets in the SET statement
- change the length of Name in the new data set.

In the first case, list the data set (RESEARCH) that uses the longer length first:

data dept6_1;
   set research shipping new_accounting security customer_support sales;
run;

In the second case, include a LENGTH statement in the DATA step that creates the new data set. If you change the length of a numeric variable, then the LENGTH statement can appear anywhere in the DATA step. However, if you change the length of a character variable, then the LENGTH statement must precede the SET statement.

The following program creates the data set DEPT1_6A. The LENGTH statement gives the character variable Name a length of 27, even though the first data set in the SET statement (SALES) assigns it a length of 19.
options pagesize=60 linesize=80 pageno=1 nodate;

data dept1_6a;
   length Name $ 27;
   set sales customer_support security new_accounting shipping research;
run;

proc print data=dept1_6a;
   title 'Employees in All Departments';
run;

The following output shows that all values of Name are complete. Note that the order of the variables in the new data set changes because Name is the first variable encountered in the DATA step.

Output 16.12  The DEPT1_6A Data Set: Effects of Using a LENGTH Statement

                     Employees in All Departments                        1

                              Employee                      Home
Obs  Name                     ID         HireDate   Salary  Phone     Gender

  1  Martin, Virginia         429685482  09AUG1990  34,800  493-0824
  2  Singleton, MaryAnn       244967839  24APR1995  27,900  929-2623
  3  Leighton, Maurice        996740216  16DEC1993  32,600  933-6908
  4  Freuler, Carl            675443925  15FEB1998  29,900  493-3993
  5  Cage, Merce              845729308  19OCT1992  39,800  286-0519
  6  Sayre, Jay               324987451  15NOV1994  44,800  933-2998
  7  Tolson, Andrew           596771321  18MAR1998  41,200  929-4800
  8  Jensen, Helga            477562122  01FEB1991  47,400  286-2816
  9  Kulenic, Marie           894724859  24JUN1993  41,400  493-1472
 10  Zweerink, Anna           988427431  07JUL1995  43,700  929-3885
 11  Saparilas, Theresa       744289612  09MAY1998  33,400            F
 12  Brosnihan, Dylan         824904032  04JAN1992  38,200            M
 13  Chao, Daeyong            242779184  28SEP1995  37,500            M
 14  Slifkin, Leah            544382887  24JUL1994  45,000            F
 15  Perry, Marguerite        933476520  19APR1992  39,900            F
 16  Gardinski, Barbara       634875680  29MAY1998  49,800            F
 17  Robertson, Hannah        824576630  14MAR1995  52,700            F
 18  Gresham, Jean            744826703  28APR1992  54,000            F
 19  Kruize, Ronald           824447605  23MAY1994  49,200            M
 20  Linzer, Fritz            988674342  23JUL1992  50,400            M
 21  Carlton, Susan           688774609  28JAN1995  29,200            F
 22  Hoffmann, Gerald         922448328  12OCT1997  27,600            M
 23  DePuis, David            544909752  23AUG1994  32,900            M
 24  Hahn, Kenneth            745609821  23AUG1994  33,300            M
 25  Landau, Jennifer         634774295  30APR1996  32,900            F
 26  Schoenberg, Marguerite   922854076  19NOV1994  39,800            F
 27  Addison-Hardy, Jonathon  770434994  23FEB1992  41,400            M
 28  McNaughton, Elizabeth    242784883  24JUL1993  45,000            F
 29  Tharrington, Catherine   377882806  28SEP1994  38,600            F
 30  Frangipani, Christopher  292450691  12AUG1990  43,900            M

Concatenating Data Sets Using the APPEND Procedure

Understanding the APPEND Procedure

The APPEND procedure adds the observations from one SAS data set to the end of another SAS data set. PROC APPEND does not process the observations in the first data set. It adds the observations in the second data set directly to the end of the original data set.

The APPEND procedure has the following form:

PROC APPEND BASE=base-SAS-data-set <DATA=SAS-data-set-to-append> <FORCE>;

where

base-SAS-data-set
   names the SAS data set to which you want to append the observations. If this data set does not exist, then SAS creates it. At the completion of PROC APPEND, the value of base-SAS-data-set becomes the current (most recently created) SAS data set.

SAS-data-set-to-append
   names the SAS data set that contains the observations to add to the end of the base data set. If you omit this option, then PROC APPEND adds the observations in the current SAS data set to the end of the base data set.

FORCE
   forces PROC APPEND to concatenate the files in some situations in which the procedure would normally fail.

Using the APPEND Procedure: The Simplest Case

The following program appends the data set CUSTOMER_SUPPORT to the data set SALES. Both data sets contain the same variables and each variable has the same attributes in both data sets.
options pagesize=60 linesize=80 pageno=1 nodate;

proc append base=sales data=customer_support;
run;

proc print data=sales;
   title 'Employees in Sales and Customer Support Departments';
run;

The following output shows the results:

Output 16.13  Output from PROC APPEND

        Employees in Sales and Customer Support Departments          1

     Employee                                           Home
Obs  ID         Name                 HireDate   Salary  Phone

  1  429685482  Martin, Virginia     09AUG1990   34800  493-0824
  2  244967839  Singleton, MaryAnn   24APR1995   27900  929-2623
  3  996740216  Leighton, Maurice    16DEC1993   32600  933-6908
  4  675443925  Freuler, Carl        15FEB1998   29900  493-3993
  5  845729308  Cage, Merce          19OCT1992   39800  286-0519
  6  324987451  Sayre, Jay           15NOV1994   44800  933-2998
  7  596771321  Tolson, Andrew       18MAR1998   41200  929-4800
  8  477562122  Jensen, Helga        01FEB1991   47400  286-2816
  9  894724859  Kulenic, Marie       24JUN1993   41400  493-1472
 10  988427431  Zweerink, Anna       07JUL1995   43700  929-3885

The resulting data set is identical to the data set that was created by naming SALES and CUSTOMER_SUPPORT in the SET statement (see Output 16.2). It is important to realize that PROC APPEND permanently alters the SALES data set, which is the data set for the BASE= option. SALES now contains observations from both the Sales and the Customer Support departments.

Using the APPEND Procedure When Data Sets Contain Different Variables

Recall that the SECURITY data set contains the variable Gender, which is not in the SALES data set, and lacks the variable HomePhone, which is present in the SALES data set. What happens if you try to use PROC APPEND to concatenate data sets that contain different variables?
If you try to append SECURITY to SALES using the following program, then the concatenation fails:

proc append base=sales data=security;
run;

SAS writes the following messages to the log:

Output 16.14  SAS Log: PROC APPEND Error

2    proc append base=sales data=security;
3    run;

NOTE: Appending WORK.SECURITY to WORK.SALES.
WARNING: Variable Gender was not found on BASE file.
WARNING: Variable HomePhone was not found on DATA file.
ERROR: No appending done because of anomalies listed above.
       Use FORCE option to append these files.
NOTE: 0 observations added.
NOTE: The data set WORK.SALES has 5 observations and 5 variables.
NOTE: Statements not processed because of errors noted above.
NOTE: The SAS System stopped processing this step because of errors.

You must use the FORCE option with PROC APPEND when the DATA= data set contains a variable that is not in the BASE= data set. If you modify the program to include the FORCE option, then it successfully concatenates the files.

options pagesize=60 linesize=80 pageno=1 nodate;

proc append base=sales data=security force;
run;

proc print data=sales;
   title 'Employees in the Sales and the Security Departments';
run;

The following output shows the results:

Output 16.15  The SALES Data Set: Using FORCE with PROC APPEND

        Employees in the Sales and the Security Departments          1

     Employee                                           Home
Obs  ID         Name                 HireDate   Salary  Phone

  1  429685482  Martin, Virginia     09AUG1990   34800  493-0824
  2  244967839  Singleton, MaryAnn   24APR1995   27900  929-2623
  3  996740216  Leighton, Maurice    16DEC1993   32600  933-6908
  4  675443925  Freuler, Carl        15FEB1998   29900  493-3993
  5  845729308  Cage, Merce          19OCT1992   39800  286-0519
  6  744289612  Saparilas, Theresa   09MAY1998   33400
  7  824904032  Brosnihan, Dylan     04JAN1992   38200
  8  242779184  Chao, Daeyong        28SEP1995   37500
  9  544382887  Slifkin, Leah        24JUL1994   45000
 10  933476520  Perry, Marguerite    19APR1992   39900

This output illustrates two important points about using PROC APPEND to concatenate data sets with different variables:

- If the BASE= data set contains a variable that is not in the DATA= data set (for example, HomePhone), then PROC APPEND concatenates the data sets and assigns a missing value to that variable in the observations that are taken from the DATA= data set.
- If the DATA= data set contains a variable that is not in the BASE= data set (for example, Gender), then the FORCE option in PROC APPEND forces the procedure to concatenate the two data sets. But because that variable is not in the descriptor portion of the BASE= data set, the procedure cannot include it in the concatenated data set.

Note: In the current example, each data set contains a variable that is not in the other. It is only the case of a variable in the DATA= data set that is not in the BASE= data set that requires the use of the FORCE option. However, both cases display a warning in the log.

Using the APPEND Procedure When Variables Have Different Attributes

When you use PROC APPEND with variables that have different attributes, the following applies:

- If a variable has different attributes in the BASE= data set than it does in the DATA= data set, then the attributes in the BASE= data set prevail. In the cases of differing formats, informats, and labels, the concatenation succeeds.
- If the length of a variable is longer in the BASE= data set than in the DATA= data set, then the concatenation succeeds.
- If the length of a variable is longer in the DATA= data set than in the BASE= data set, or if the same variable is a character variable in one data set and a numeric variable in the other, then PROC APPEND fails to concatenate the files unless you specify the FORCE option.

Using the FORCE option has these consequences:

- The length that is specified in the BASE= data set prevails. Therefore, SAS truncates values from the DATA= data set to fit them into the length that is specified in the BASE= data set.
- The type that is specified in the BASE= data set prevails. The procedure replaces values of the wrong type (all values for the variable in the DATA= data set) with missing values.

Choosing between the SET Statement and the APPEND Procedure

If two data sets contain the same variables and the variables possess the same attributes, then the file that results from concatenating them with the SET statement is the same as the file that results from concatenating them with the APPEND procedure. The APPEND procedure concatenates much faster than the SET statement, particularly when the BASE= data set is large, because the APPEND procedure does not process the observations from the BASE= data set. However, the two methods behave differently when the variables or their attributes differ between data sets. In this case, you must consider the differences in behavior before you decide which method to use.

The following table summarizes the major differences between using the SET statement and using the APPEND procedure to concatenate files.

Table 16.1  Differences between the SET Statement and the APPEND Procedure

Criterion: Number of data sets that you can concatenate
   SET statement:     Uses any number of data sets.
   APPEND procedure:  Uses two data sets.

Criterion: Handling of data sets that contain different variables
   SET statement:     Uses all variables and assigns missing values where appropriate.
   APPEND procedure:  Uses all variables in the BASE= data set and assigns missing values to observations from the DATA= data set where appropriate. Requires the FORCE option to concatenate data sets if the DATA= data set contains variables that are not in the BASE= data set. Cannot include variables found only in the DATA= data set when concatenating the data sets.

Criterion: Handling of different formats, informats, or labels
   SET statement:     Uses explicitly defined formats, informats, and labels rather than defaults. If two or more data sets explicitly define the format, informat, or label, then SAS uses the definition from the data set you name first in the SET statement.
   APPEND procedure:  Uses formats, informats, and labels from the BASE= data set.

Criterion: Handling of different variable lengths
   SET statement:     If the same variable has a different length in two or more data sets, then SAS uses the length from the data set you name first in the SET statement.
   APPEND procedure:  Requires the FORCE option if the length of a variable is longer in the DATA= data set. Truncates the values of the variable to match the length in the BASE= data set.

Criterion: Handling of different variable types
   SET statement:     Does not concatenate the data sets.
   APPEND procedure:  Requires the FORCE option to concatenate data sets. Uses the type attribute from the BASE= data set and assigns missing values to the variable in observations from the DATA= data set.

Review of SAS Tools

Statements

LENGTH variable(s) <$> length;
   specifies the number of bytes that are used for storing variables.

SET SAS-data-set(s);
   reads one or more SAS data sets and creates a single SAS data set that you specify in the DATA statement.

Procedures

PROC APPEND BASE=base-SAS-data-set <DATA=SAS-data-set-to-append> <FORCE>;
   appends the DATA= data set to the BASE= data set.

   base-SAS-data-set
      names the SAS data set to which you want to append the observations. If this data set does not exist, then SAS creates it. At the completion of PROC APPEND the base data set becomes the current (most recently created) SAS data set.

   SAS-data-set-to-append
      names the SAS data set that contains the observations to add to the end of the base data set. If you omit this option, then PROC APPEND adds the observations in the current SAS data set to the end of the base data set.

   FORCE
      forces PROC APPEND to concatenate the files in situations in which the procedure would otherwise fail.
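As a small illustration of the FORCE option when a character variable is longer in the DATA= data set than in the BASE= data set, consider the following sketch. The data sets BASE_DEMO and ADD_DEMO and their values are hypothetical and exist only for this illustration.

```sas
/* Hypothetical data sets: BASE_DEMO and ADD_DEMO are illustration-only names. */
data base_demo;
   length Name $ 8;        /* shorter length in the BASE= data set */
   Name='Martin';
run;

data add_demo;
   length Name $ 20;       /* longer length in the DATA= data set  */
   Name='Schoenberg';
run;

/* Without FORCE, the append is expected to fail because Name is longer */
/* in ADD_DEMO than in BASE_DEMO. With FORCE, the BASE= length prevails. */
proc append base=base_demo data=add_demo force;
run;

proc print data=base_demo;
   title 'Effect of FORCE on a Longer Character Value';
run;
```

According to the rules described in this chapter, the appended value should be truncated to the 8-byte length of Name in BASE_DEMO. Checking variable lengths before appending helps to avoid losing data in this way.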
Learning More

CONTENTS statement
   The CONTENTS statement in the DATASETS procedure displays information about a data set, including the names and attributes of all variables. This information reveals any problems that you might have when you try to concatenate data sets, and helps you decide whether to use the SET statement or PROC APPEND. For more information about using the CONTENTS statement in the DATASETS procedure, see Chapter 33, “Understanding SAS Data Libraries,” on page 595.

END= statement option
   enables you to determine when SAS is processing the last observation in the DATA step. For more information about using the END= option in the SET statement, see Chapter 21, “Conditionally Processing Observations from Multiple SAS Data Sets,” on page 323.

IN= data set option
   enables you to process observations from each data set differently. For more information about using the IN= option in the SET statement, see Chapter 21, “Conditionally Processing Observations from Multiple SAS Data Sets,” on page 323.

Variable attributes
   For more information about variable attributes, see SAS Language Reference: Dictionary.

CHAPTER 17  Interleaving SAS Data Sets

Introduction to Interleaving SAS Data Sets
   Purpose
   Prerequisites
Understanding BY-Group Processing Concepts
Interleaving Data Sets
   Preparing to Interleave Data Sets
   Understanding the Interleaving Process
   Using the Interleaving Process
Review of SAS Tools
   Statements
Learning More

Introduction to Interleaving SAS Data Sets

Purpose

Interleaving combines individual sorted SAS data sets into one sorted data set. You interleave data sets using a SET statement and a BY statement in a DATA step. The number of observations in the new data set is the sum of the number of observations in the original data sets.
In this section, you will learn how to use the BY statement, how to sort data sets to prepare for interleaving, and how to use the SET and BY statements together to interleave observations.

Prerequisites

Before continuing with this section, you should be familiar with the concepts presented in Chapter 3, “Starting with Raw Data: The Basics,” on page 43 and Chapter 5, “Starting with SAS Data Sets,” on page 81.

Understanding BY-Group Processing Concepts

The BY statement specifies the variable or variables by which you want to interleave the data sets. In order to understand interleaving, you must understand BY variables, BY values, and BY groups.

BY variable
   is a variable that is named in a BY statement and by which the data is sorted or needs to be sorted.

BY value
   is the value of a BY variable.

BY group
   is the set of all observations with the same value for a BY variable (when only one BY variable is specified). If you use more than one variable in a BY statement, then a BY group is a group of observations with a unique combination of values for those variables. In discussions of interleaving, BY groups commonly span more than one data set.

Interleaving Data Sets

Preparing to Interleave Data Sets

Before you can interleave data sets, the data must be sorted by the same variable or variables you will use with the BY statement that accompanies your SET statement.

For example, the Research and Development division and the Publications division of a company both maintain data sets containing information about each project currently under way. Each data set includes these variables:

Project
   is a unique code that identifies the project.

Department
   is the name of a department involved in the project.

Manager
   is the last name of the manager from Department.

StaffCount
   is the number of people working for Manager on this project.
Senior management for the company wants to combine the data sets by Project so that the new data set shows the resources that both divisions are devoting to each project. Both data sets must be sorted by Project before they can be interleaved.

The program that follows creates and displays the data set RESEARCH_DEVELOPMENT. See Output 17.1. Note that the input data is already sorted by Project.

data research_development;
   length Department Manager $ 10;
   input Project $ Department $ Manager $ StaffCount;
   datalines;
MP971 Designing Daugherty 10
MP971 Coding Newton 8
MP971 Testing Miller 7
SL827 Designing Ramirez 8
SL827 Coding Cho 10
SL827 Testing Baker 7
WP057 Designing Hascal 11
WP057 Coding Constant 13
WP057 Testing Slivko 10
;
run;

proc print data=research_development;
   title 'Research and Development Project Staffing';
run;

Output 17.1   The RESEARCH_DEVELOPMENT Data Set

           Research and Development Project Staffing            1

                                                   Staff
   Obs    Department    Manager      Project      Count

    1     Designing     Daugherty    MP971          10
    2     Coding        Newton       MP971           8
    3     Testing       Miller       MP971           7
    4     Designing     Ramirez      SL827           8
    5     Coding        Cho          SL827          10
    6     Testing       Baker        SL827           7
    7     Designing     Hascal       WP057          11
    8     Coding        Constant     WP057          13
    9     Testing       Slivko       WP057          10

The following program creates, sorts, and displays the second data set, PUBLICATIONS. Output 17.2 shows the data set sorted by Project.
data publications;
   length Department Manager $ 10;
   input Manager $ Department $ Project $ StaffCount;
   datalines;
Cook Writing WP057 5
Deakins Writing SL827 7
Franscombe Editing MP971 4
Henry Editing WP057 3
King Production SL827 5
Krysonski Production WP057 3
Lassiter Graphics SL827 3
Miedema Editing SL827 5
Morard Writing MP971 6
Posey Production MP971 4
Spackle Graphics WP057 2
;
run;

proc sort data=publications;
   by Project;
run;

proc print data=publications;
   title 'Publications Project Staffing';
run;

Output 17.2   The PUBLICATIONS Data Set

                 Publications Project Staffing                  1

                                                   Staff
   Obs    Department    Manager       Project     Count

     1    Editing       Franscombe    MP971          4
     2    Writing       Morard        MP971          6
     3    Production    Posey         MP971          4
     4    Writing       Deakins       SL827          7
     5    Production    King          SL827          5
     6    Graphics      Lassiter      SL827          3
     7    Editing       Miedema       SL827          5
     8    Writing       Cook          WP057          5
     9    Editing       Henry         WP057          3
    10    Production    Krysonski     WP057          3
    11    Graphics      Spackle       WP057          2

Understanding the Interleaving Process

When interleaving, SAS creates a new data set as follows:

1. Before executing the SET statement, SAS reads the descriptor portion of each data set that you name in the SET statement. Then SAS creates a program data vector that, by default, contains all the variables from all data sets as well as any variables created by the DATA step. SAS sets the value of each variable to missing.
2. SAS looks at the first BY group in each data set in the SET statement in order to determine which BY group should appear first in the new data set.
3. SAS copies to the new data set all observations in that BY group from each data set that contains observations in the BY group. SAS copies from the data sets in the same order as they appear in the SET statement.
4. SAS looks at the next BY group in each data set to determine which BY group should appear next in the new data set.
5. SAS sets the value of each variable in the program data vector to missing.
6. SAS repeats steps 3 through 5 until it has copied all observations to the new data set.

Using the Interleaving Process

The following program uses the SET and BY statements to interleave the data sets RESEARCH_DEVELOPMENT and PUBLICATIONS. Output 17.3 shows the new data set.

data rnd_pubs;
   set research_development publications;
   by Project;
run;

proc print data=rnd_pubs;
   title 'Project Participation by Research and Development';
   title2 'and Publications Departments';
   title3 'Sorted by Project';
run;

Output 17.3   Interleaving the Data Sets

       Project Participation by Research and Development        1
                  and Publications Departments
                       Sorted by Project

                                                   Staff
   Obs    Department    Manager       Project     Count

     1    Designing     Daugherty     MP971         10
     2    Coding        Newton        MP971          8
     3    Testing       Miller        MP971          7
     4    Editing       Franscombe    MP971          4
     5    Writing       Morard        MP971          6
     6    Production    Posey         MP971          4
     7    Designing     Ramirez       SL827          8
     8    Coding        Cho           SL827         10
     9    Testing       Baker         SL827          7
    10    Writing       Deakins       SL827          7
    11    Production    King          SL827          5
    12    Graphics      Lassiter      SL827          3
    13    Editing       Miedema       SL827          5
    14    Designing     Hascal        WP057         11
    15    Coding        Constant      WP057         13
    16    Testing       Slivko        WP057         10
    17    Writing       Cook          WP057          5
    18    Editing       Henry         WP057          3
    19    Production    Krysonski     WP057          3
    20    Graphics      Spackle       WP057          2

The new data set RND_PUBS includes all observations from both data sets. Each BY group in the new data set contains observations from RESEARCH_DEVELOPMENT followed by observations from PUBLICATIONS.

Review of SAS Tools

Statements

SET SAS-data-set-list;
BY variable-list;
   read multiple sorted SAS data sets and create one sorted SAS data set. SAS-data-set-list is a list of the SAS data sets to interleave; variable-list contains the names of one or more variables (BY variables) by which to interleave the data sets. All of the data sets must be sorted by the same variable(s) before you can interleave them.
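An interleave can also record which input data set supplied each observation. The following is only a sketch under stated assumptions: it assumes that the RESEARCH_DEVELOPMENT and PUBLICATIONS data sets from this section already exist and are sorted by Project. The names RND_PUBS_TAGGED, InRnD, InPubs, and Division are hypothetical and chosen only for illustration. The sketch relies on the IN= data set option, which names a temporary variable whose value is 1 when the current observation comes from that data set and 0 otherwise.

```sas
/* Sketch: interleave by Project and tag each observation with its source. */
/* RND_PUBS_TAGGED, InRnD, InPubs, and Division are hypothetical names.    */
data rnd_pubs_tagged;
   length Division $ 24;                     /* long enough for both labels */
   set research_development (in=InRnD)
       publications         (in=InPubs);
   by Project;                               /* both inputs sorted by Project */
   if InRnD then Division = 'Research and Development';
   else if InPubs then Division = 'Publications';
run;

proc print data=rnd_pubs_tagged;
   title 'Interleaved Project Staffing by Division';
run;
```

Because IN= variables are temporary, InRnD and InPubs are not written to the new data set; only Division is added to the variables that come from the input data sets.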
Learning More

Indexes
   You do not need to sort unordered data sets before interleaving them if the data sets have an index on the variable or variables by which you want to interleave. For more information about indexes, see SAS Language Reference: Concepts and the Base SAS Procedures Guide.

Interleaving data sets
   For information about interleaving data sets when they contain different variables or when the same variables have different attributes, see Chapter 16, “Concatenating SAS Data Sets,” on page 241. The same rules apply to interleaving data sets as to concatenating them.

SORT procedure and the BY statement
   See Chapter 11, “Working with Grouped or Sorted Observations,” on page 173.

CHAPTER 18
Merging SAS Data Sets

Introduction to Merging SAS Data Sets
   Purpose
   Prerequisites
Understanding the MERGE Statement
One-to-One Merging
   Definition of One-to-One Merging
   Performing a Simple One-to-One Merge
      Input SAS Data Set for Examples
      The Program
      Explanation
   Performing a One-to-One Merge on Data Sets with the Same Variables
      Input SAS Data Set for Examples
      The Program
      Explanation
Match-Merging
   Merging with a BY Statement
      Input SAS Data Set for Examples
      The Program
      Explanation
   Match-Merging Data Sets with Multiple Observations in a BY Group
      Input SAS Data Set for Examples
      The Program
      Explanation
   Match-Merging Data Sets with Dropped Variables
   Match-Merging Data Sets with the Same Variables
   Match-Merging Data Sets That Lack a Common Variable
Choosing between One-to-One Merging and Match-Merging
   Comparing Match-Merge Methods
   Input SAS Data Set for Examples
   When to Use a One-to-One Merge
   When to Use a Match-Merge
Review of SAS Tools
   Statements
Learning More

Introduction to Merging SAS Data Sets

Purpose

Merging combines observations from two or more SAS
data sets into a single observation in a new SAS data set. The new data set contains all variables from all the original data sets unless you specify otherwise.

In this section, you will learn about two types of merging: one-to-one merging and match-merging. In one-to-one merging, you do not use a BY statement. Observations are combined based on their positions in the input data sets. In match-merging, you use a BY statement to combine observations from the input data sets based on common values of the variable by which you merge the data sets.

Prerequisites

Before continuing with this section, you should be familiar with the concepts presented in Chapter 3, “Starting with Raw Data: The Basics,” on page 43 and Chapter 5, “Starting with SAS Data Sets,” on page 81.

Understanding the MERGE Statement

You merge data sets using the MERGE statement in a DATA step. The form of the MERGE statement that is used in this section is the following:

MERGE SAS-data-set-list;
BY variable-list;

SAS-data-set-list
   is the names of two or more SAS data sets to merge. The list may contain any number of data sets.

variable-list
   is one or more variables by which to merge the data sets. If you use a BY statement, then the data sets must be sorted by the same BY variables before you can merge them.

One-to-One Merging

Definition of One-to-One Merging

When you use the MERGE statement without a BY statement, SAS combines the first observation in all data sets you name in the MERGE statement into the first observation in the new data set, the second observation in all data sets into the second observation in the new data set, and so on. In a one-to-one merge, the number of observations in the new data set is equal to the number of observations in the largest data set you name in the MERGE statement.
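The observation-count rule in this definition can be tried on two very small data sets. The following is only a sketch: the data sets LETTERS, NUMBERS, and PAIRED and the variables Letter and Number are hypothetical names invented for illustration and are not part of the examples in this section.

```sas
/* Sketch: a one-to-one merge of data sets with different numbers of   */
/* observations. All data set and variable names here are hypothetical. */
data letters;                 /* three observations */
   input Letter $;
   datalines;
A
B
C
;

data numbers;                 /* two observations */
   input Number;
   datalines;
1
2
;

data paired;
   merge letters numbers;     /* no BY statement: observations pair by position */
run;

proc print data=paired;
   title 'One-to-One Merge of Data Sets of Unequal Size';
run;
```

Under the rule stated above, PAIRED should contain three observations, the number in the larger input data set. The third observation has no counterpart in NUMBERS, so its value of Number is expected to be missing.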
Performing a Simple One-to-One Merge

Input SAS Data Set for Examples

For example, the instructor of a college acting class wants to schedule a conference with each student. One data set, CLASS, contains these variables:

Name
   is the student’s name.

Year
   is the student’s year: first, second, third, or fourth.

Major
   is the student’s area of specialization. This value is always missing for first-year and second-year students, who have not selected a major subject yet.

The following program creates and displays the data set CLASS:

data class;
   input Name $ 1-25 Year $ 26-34 Major $ 36-50;
   datalines;
Abbott, Jennifer         first
Carter, Tom              third     Theater
Kirby, Elissa            fourth    Mathematics
Tucker, Rachel           first
Uhl, Roland              second
Wacenske, Maurice        third     Theater
;

proc print data=class;
   title 'Acting Class Roster';
run;

The following output displays the data set CLASS:

Output 18.1   The CLASS Data Set

                      Acting Class Roster                       1

   Obs    Name                 Year      Major

    1     Abbott, Jennifer     first
    2     Carter, Tom          third     Theater
    3     Kirby, Elissa        fourth    Mathematics
    4     Tucker, Rachel       first
    5     Uhl, Roland          second
    6     Wacenske, Maurice    third     Theater

A second data set contains a list of the dates and times the instructor has scheduled conferences and the rooms in which the conferences are to take place. The following program creates and displays the data set TIME_SLOT. Note the use of the date format and informat.

data time_slot;
   input Date date9. @12 Time $ @19 Room $;
   format date date9.;
   datalines;
14sep2000  10:00  103
14sep2000  10:30  103
14sep2000  11:00  207
15sep2000  10:00  105
15sep2000  10:30  105
17sep2000  11:00  207
;

proc print data=time_slot;
   title 'Dates, Times, and Locations of Conferences';
run;

The following output displays the data set TIME_SLOT:

Output 18.2   The TIME_SLOT Data Set

           Dates, Times, and Locations of Conferences           1

   Obs    Date         Time     Room

    1     14SEP2000    10:00    103
    2     14SEP2000    10:30    103
    3     14SEP2000    11:00    207
    4     15SEP2000    10:00    105
    5     15SEP2000    10:30    105
    6     17SEP2000    11:00    207

The Program

The following program performs a one-to-one merge of these data sets, assigning a time slot for a conference to each student in the class.

data schedule;
   merge class time_slot;
run;

proc print data=schedule;
   title 'Student Conference Assignments';
run;

The following output displays the conference schedule data set:

Output 18.3   One-to-One Merge

                 Student Conference Assignments                 1

   Obs    Name                 Year      Major          Date         Time     Room

    1     Abbott, Jennifer     first                    14SEP2000    10:00    103
    2     Carter, Tom          third     Theater        14SEP2000    10:30    103
    3     Kirby, Elissa        fourth    Mathematics    14SEP2000    11:00    207
    4     Tucker, Rachel       first                    15SEP2000    10:00    105
    5     Uhl, Roland          second                   15SEP2000    10:30    105
    6     Wacenske, Maurice    third     Theater        17SEP2000    11:00    207

Explanation

Output 18.3 shows that the new data set combines the first observation from CLASS with the first observation from TIME_SLOT, the second observation from CLASS with the second observation from TIME_SLOT, and so on.

Performing a One-to-One Merge on Data Sets with the Same Variables

Input SAS Data Set for Examples

The previous example illustrates the simplest case of a one-to-one merge: the data sets contain the same number of observations, all variables have unique names, and you want to keep all variables from both data sets in the new data set.
This example merges data sets that contain variables with the same names. Also, the second data set in this example contains one more observation than the first data set. Each data set contains data on a separate acting class.

In addition to the data set CLASS, the instructor also uses the data set CLASS2, which contains the same variables as CLASS but one more observation. The following program creates and displays the data set CLASS2:

data class2;
   input Name $ 1-25 Year $ 26-34 Major $ 36-50;
   datalines;
Hitchcock-Tyler, Erin    second
Keil, Deborah            third     Theater
Nacewicz, Chester        third     Theater
Norgaard, Rolf           second
Prism, Lindsay           fourth    Anthropology
Singh, Rajiv             second
Wittich, Stefan          third     Physics
;

proc print data=class2;
   title 'Acting Class Roster';
   title2 '(second section)';
run;

The following output displays the data set CLASS2:

Output 18.4   The CLASS2 Data Set

                      Acting Class Roster                       1
                        (second section)

   Obs    Name                     Year      Major

    1     Hitchcock-Tyler, Erin    second
    2     Keil, Deborah            third     Theater
    3     Nacewicz, Chester        third     Theater
    4     Norgaard, Rolf           second
    5     Prism, Lindsay           fourth    Anthropology
    6     Singh, Rajiv             second
    7     Wittich, Stefan          third     Physics

The Program

Instead of scheduling conferences for one class, the instructor wants to schedule acting exercises for pairs of students, one student from each class. The instructor wants to create a data set in which each observation contains the name of one student from each class and the date, time, and location of the exercise. The variables Year and Major should not be in the new data set.

This new data set can be created by merging the data sets CLASS, CLASS2, and TIME_SLOT. Because Year and Major are not wanted in the new data set, the DROP= data set option can be used to drop them. Notice that the data sets CLASS and CLASS2 both contain the variable Name, but the values for Name are different in each data set.
To preserve both sets of values, the RENAME= data set option must be used to rename the variable in one of the data sets. The following program uses these data set options to merge the three data sets:

data exercise;
   merge class (drop=Year Major)
         class2 (drop=Year Major rename=(Name=Name2))
         time_slot;
run;

proc print data=exercise;
   title 'Acting Class Exercise Schedule';
run;

The following output displays the new data set:

Output 18.5   Merging Three Data Sets

                 Acting Class Exercise Schedule                 1

   Obs    Name                 Name2                    Date         Time     Room

    1     Abbott, Jennifer     Hitchcock-Tyler, Erin    14SEP2000    10:00    103
    2     Carter, Tom          Keil, Deborah            14SEP2000    10:30    103
    3     Kirby, Elissa        Nacewicz, Chester        14SEP2000    11:00    207
    4     Tucker, Rachel       Norgaard, Rolf           15SEP2000    10:00    105
    5     Uhl, Roland          Prism, Lindsay           15SEP2000    10:30    105
    6     Wacenske, Maurice    Singh, Rajiv             17SEP2000    11:00    207
    7                          Wittich, Stefan                  .

Explanation

The following steps describe how SAS merges the data sets:

1. Before executing the DATA step, SAS reads the descriptor portion of each data set that you name in the MERGE statement. Then SAS creates a program data vector for the new data set that, by default, contains all the variables from all data sets, as well as variables created by the DATA step. In this case, however, the DROP= data set option excludes the variables Year and Major from the program data vector. The RENAME= data set option adds the variable Name2 to the program data vector. Therefore, the program data vector contains the variables Name, Name2, Date, Time, and Room.

2. SAS sets the value of each variable in the program data vector to missing, as the next figure illustrates.

Figure 18.1   Program Data Vector before Reading from Data Sets

   Name                 Name2                    Date         Time     Room

                                                         .

3. Next, SAS reads and copies the first observation from each data set into the program data vector (reading the data sets in the same order they appear in the MERGE statement), as the next figure illustrates.
Figure 18.2   Program Data Vector after Reading from Each Data Set

   After reading CLASS:

   Name                 Name2                    Date         Time     Room

   Abbott, Jennifer                                      .

   After reading CLASS2:

   Name                 Name2                    Date         Time     Room

   Abbott, Jennifer     Hitchcock-Tyler, Erin            .

   After reading TIME_SLOT:

   Name                 Name2                    Date         Time     Room

   Abbott, Jennifer     Hitchcock-Tyler, Erin    14SEP2000    10:00    103

4. After processing the first observation from the last data set and executing any other statements in the DATA step, SAS writes the contents of the program data vector to the new data set. If the DATA step attempts to read past the end of a data set, then the values of all variables from that data set in the program data vector are set to missing.

This behavior has two important consequences:

- If a variable exists in more than one data set, then the value from the last data set SAS reads is the value that goes into the new data set, even if that value is missing. If you want to keep all the values for like-named variables from different data sets, then you must rename one or more of the variables with the RENAME= data set option so that each variable has a unique name.

- After SAS processes all observations in a data set, the program data vector and all subsequent observations in the new data set have missing values for the variables unique to that data set. So, as the next figure shows, the program data vector for the last observation in the new data set contains missing values for all variables except Name2.

Figure 18.3   Program Data Vector for the Last Observation

   Name                 Name2                    Date         Time     Room

                        Wittich, Stefan                  .

5. SAS continues to merge observations until it has copied all observations from all data sets.

Match-Merging

Merging with a BY Statement

Merging with a BY statement enables you to match observations according to the values of the BY variables that you specify. Before you can perform a match-merge, all data sets must be sorted by the variables that you want to use for the merge.
In order to understand match-merging, you must understand three key concepts:

BY variable
   is a variable named in a BY statement.

BY value
   is the value of a BY variable.

BY group
   is the set of all observations with the same value for the BY variable (if there is only one BY variable). If you use more than one variable in a BY statement, then a BY group is the set of observations with a unique combination of values for those variables. In discussions of match-merging, BY groups commonly span more than one data set.

Input SAS Data Set for Examples

For example, the director of a small repertory theater company, the Little Theater, maintains company records in two SAS data sets, COMPANY and FINANCE.

   Data Set    Variable    Description

   COMPANY     Name        player’s name
               Age         player’s age
               Gender      player’s gender

   FINANCE     Name        player’s name
               IdNumber    player’s employee ID number
               Salary      player’s annual salary

The following program creates, sorts, and displays COMPANY and FINANCE:

data company;
   input Name $ 1-25 Age 27-28 Gender $ 30;
   datalines;
Vincent, Martina          34 F
Phillipon, Marie-Odile    28 F
Gunter, Thomas            27 M
Harbinger, Nicholas       36 M
Benito, Gisela            32 F
Rudelich, Herbert         39 M
Sirignano, Emily          12 F
Morrison, Michael         32 M
;

proc sort data=company;
   by Name;
run;

data finance;
   input IdNumber $ 1-11 Name $ 13-40 Salary;
   datalines;
074-53-9892 Vincent, Martina             35000
776-84-5391 Phillipon, Marie-Odile       29750
929-75-0218 Gunter, Thomas               27500
446-93-2122 Harbinger, Nicholas          33900
228-88-9649 Benito, Gisela               28000
029-46-9261 Rudelich, Herbert            35000
442-21-8075 Sirignano, Emily              5000
;

proc sort data=finance;
   by Name;
run;

proc print data=company;
   title 'Little Theater Company Roster';
run;

proc print data=finance;
   title 'Little Theater Employee Information';
run;

The following output displays the data sets. Notice that the FINANCE data set does not contain an observation for Michael Morrison.
Output 18.6   The COMPANY and FINANCE Data Sets

                 Little Theater Company Roster                  1

   Obs    Name                      Age    Gender

    1     Benito, Gisela             32      F
    2     Gunter, Thomas             27      M
    3     Harbinger, Nicholas        36      M
    4     Morrison, Michael          32      M
    5     Phillipon, Marie-Odile     28      F
    6     Rudelich, Herbert          39      M
    7     Sirignano, Emily           12      F
    8     Vincent, Martina           34      F

              Little Theater Employee Information               2

   Obs    IdNumber       Name                      Salary

    1     228-88-9649    Benito, Gisela             28000
    2     929-75-0218    Gunter, Thomas             27500
    3     446-93-2122    Harbinger, Nicholas        33900
    4     776-84-5391    Phillipon, Marie-Odile     29750
    5     029-46-9261    Rudelich, Herbert          35000
    6     442-21-8075    Sirignano, Emily            5000
    7     074-53-9892    Vincent, Martina           35000

The Program

To avoid having to maintain two separate data sets, the director wants to merge the records for each player from both data sets into a new data set that contains all the variables. The variable that is common to both data sets is Name. Therefore, Name is the appropriate BY variable.

The data sets are already sorted by NAME, so no further sorting is required. The following program merges them by NAME:

data employee_info;
   merge company finance;
   by name;
run;

proc print data=employee_info;
   title 'Little Theater Employee Information';
   title2 '(including personal and financial information)';
run;

The following output displays the merged data set:

Output 18.7   Match-Merging

              Little Theater Employee Information               1
         (including personal and financial information)

   Obs    Name                      Age    Gender    IdNumber       Salary

    1     Benito, Gisela             32      F       228-88-9649     28000
    2     Gunter, Thomas             27      M       929-75-0218     27500
    3     Harbinger, Nicholas        36      M       446-93-2122     33900
    4     Morrison, Michael          32      M                           .
    5     Phillipon, Marie-Odile     28      F       776-84-5391     29750
    6     Rudelich, Herbert          39      M       029-46-9261     35000
    7     Sirignano, Emily           12      F       442-21-8075      5000
    8     Vincent, Martina           34      F       074-53-9892     35000

Explanation

The new data set contains one observation for each player in the company. Each observation contains all the variables from both data sets. Notice in particular the fourth observation.
The data set FINANCE does not have an observation for Michael Morrison. In this case, the values of the variables that are unique to FINANCE (IdNumber and Salary) are missing.

Match-Merging Data Sets with Multiple Observations in a BY Group

Input SAS Data Set for Examples

The Little Theater has a third data set, REPERTORY, that tracks the casting assignments in each of the season’s plays. REPERTORY contains these variables:

Play
   is the name of one of the plays in the repertory.

Role
   is the name of a character in Play.

IdNumber
   is the employee ID number of the player playing Role.

The following program creates and displays REPERTORY:

data repertory;
   input Play $ 1-23 Role $ 25-48 IdNumber $ 50-60;
   datalines;
No Exit                 Estelle                  074-53-9892
No Exit                 Inez                     776-84-5391
No Exit                 Valet                    929-75-0218
No Exit                 Garcin                   446-93-2122
Happy Days              Winnie                   074-53-9892
Happy Days              Willie                   446-93-2122
The Glass Menagerie     Amanda Wingfield         228-88-9649
The Glass Menagerie     Laura Wingfield          776-84-5391
The Glass Menagerie     Tom Wingfield            929-75-0218
The Glass Menagerie     Jim O'Connor             029-46-9261
The Dear Departed       Mrs. Slater              228-88-9649
The Dear Departed       Mrs. Jordan              074-53-9892
The Dear Departed       Henry Slater             029-46-9261
The Dear Departed       Ben Jordan               446-93-2122
The Dear Departed       Victoria Slater          442-21-8075
The Dear Departed       Abel Merryweather        929-75-0218
;

proc print data=repertory;
   title 'Little Theater Season Casting Assignments';
run;

The following output displays the REPERTORY data set:

Output 18.8   The REPERTORY Data Set

           Little Theater Season Casting Assignments            1

   Obs    Play                   Role                 IdNumber

     1    No Exit                Estelle              074-53-9892
     2    No Exit                Inez                 776-84-5391
     3    No Exit                Valet                929-75-0218
     4    No Exit                Garcin               446-93-2122
     5    Happy Days             Winnie               074-53-9892
     6    Happy Days             Willie               446-93-2122
     7    The Glass Menagerie    Amanda Wingfield     228-88-9649
     8    The Glass Menagerie    Laura Wingfield      776-84-5391
     9    The Glass Menagerie    Tom Wingfield        929-75-0218
    10    The Glass Menagerie    Jim O'Connor         029-46-9261
    11    The Dear Departed      Mrs. Slater          228-88-9649
    12    The Dear Departed      Mrs. Jordan          074-53-9892
    13    The Dear Departed      Henry Slater         029-46-9261
    14    The Dear Departed      Ben Jordan           446-93-2122
    15    The Dear Departed      Victoria Slater      442-21-8075
    16    The Dear Departed      Abel Merryweather    929-75-0218

To maintain confidentiality during preliminary casting, this data set identifies players by employee ID number. However, casting decisions are now final, and the manager wants to replace each employee ID number with the player’s name.

Of course, it is possible to re-create the data set, entering each player’s name instead of the employee ID number in the raw data. However, it is more efficient to make use of the data set FINANCE, which already contains the name and employee ID number of all players (see Output 18.6). When the data sets are merged, SAS takes care of adding the players’ names to the data set.

Of course, before you can merge the data sets, you must sort them by IdNumber.
proc sort data=finance;
   by IdNumber;
run;

proc sort data=repertory;
   by IdNumber;
run;

proc print data=finance;
   title 'Little Theater Employee Information';
   title2 '(sorted by employee ID number)';
run;

proc print data=repertory;
   title 'Little Theater Season Casting Assignments';
   title2 '(sorted by employee ID number)';
run;

The following output displays the FINANCE and REPERTORY data sets, sorted by IdNumber:

Output 18.9   Sorting the FINANCE and REPERTORY Data Sets by IdNumber

              Little Theater Employee Information               1
                 (sorted by employee ID number)

   Obs    IdNumber       Name                      Salary

    1     029-46-9261    Rudelich, Herbert          35000
    2     074-53-9892    Vincent, Martina           35000
    3     228-88-9649    Benito, Gisela             28000
    4     442-21-8075    Sirignano, Emily            5000
    5     446-93-2122    Harbinger, Nicholas        33900
    6     776-84-5391    Phillipon, Marie-Odile     29750
    7     929-75-0218    Gunter, Thomas             27500

           Little Theater Season Casting Assignments            2
                 (sorted by employee ID number)

   Obs    Play                   Role                 IdNumber

     1    The Glass Menagerie    Jim O'Connor         029-46-9261
     2    The Dear Departed      Henry Slater         029-46-9261
     3    No Exit                Estelle              074-53-9892
     4    Happy Days             Winnie               074-53-9892
     5    The Dear Departed      Mrs. Jordan          074-53-9892
     6    The Glass Menagerie    Amanda Wingfield     228-88-9649
     7    The Dear Departed      Mrs. Slater          228-88-9649
     8    The Dear Departed      Victoria Slater      442-21-8075
     9    No Exit                Garcin               446-93-2122
    10    Happy Days             Willie               446-93-2122
    11    The Dear Departed      Ben Jordan           446-93-2122
    12    No Exit                Inez                 776-84-5391
    13    The Glass Menagerie    Laura Wingfield      776-84-5391
    14    No Exit                Valet                929-75-0218
    15    The Glass Menagerie    Tom Wingfield        929-75-0218
    16    The Dear Departed      Abel Merryweather    929-75-0218

These two data sets contain seven BY groups; that is, among the 23 observations are seven different values for the BY variable, IdNumber. The first BY group has a value of 029-46-9261 for IdNumber. FINANCE has one observation in this BY group; REPERTORY has two.
The last BY group has a value of 929-75-0218 for IdNumber. FINANCE has one observation in this BY group; REPERTORY has three.

The Program

The following program merges the data sets FINANCE and REPERTORY and illustrates what happens when a BY group in one data set has more observations in it than the same BY group in the other data set. The resulting data set contains all variables from both data sets.

options linesize=120;

data repertory_name;
   merge finance repertory;
   by IdNumber;
run;

proc print data=repertory_name;
   title 'Little Theater Season Casting Assignments';
   title2 'with employee financial information';
run;

Note: The OPTIONS statement extends the line size to 120 so that PROC PRINT can display all variables on one line. Most output in this section is created with line size set to 76 in the OPTIONS statement. An OPTIONS statement appears only in examples using a different line size. When you set the LINESIZE= option, it remains in effect until you reset it or end the SAS session.

The following output displays the merged data set:

Output 18.10   Match-Merge with Multiple Observations in a BY Group

                         Little Theater Season Casting Assignments                          1
                            with employee financial information

   Obs    IdNumber       Name                      Salary    Play                   Role

     1    029-46-9261    Rudelich, Herbert          35000    The Glass Menagerie    Jim O'Connor
     2    029-46-9261    Rudelich, Herbert          35000    The Dear Departed      Henry Slater
     3    074-53-9892    Vincent, Martina           35000    No Exit                Estelle
     4    074-53-9892    Vincent, Martina           35000    Happy Days             Winnie
     5    074-53-9892    Vincent, Martina           35000    The Dear Departed      Mrs. Jordan
     6    228-88-9649    Benito, Gisela             28000    The Glass Menagerie    Amanda Wingfield
     7    228-88-9649    Benito, Gisela             28000    The Dear Departed      Mrs. Slater
     8    442-21-8075    Sirignano, Emily            5000    The Dear Departed      Victoria Slater
     9    446-93-2122    Harbinger, Nicholas        33900    No Exit                Garcin
    10    446-93-2122    Harbinger, Nicholas        33900    Happy Days             Willie
    11    446-93-2122    Harbinger, Nicholas        33900    The Dear Departed      Ben Jordan
    12    776-84-5391    Phillipon, Marie-Odile     29750    No Exit                Inez
    13    776-84-5391    Phillipon, Marie-Odile     29750    The Glass Menagerie    Laura Wingfield
    14    929-75-0218    Gunter, Thomas             27500    No Exit                Valet
    15    929-75-0218    Gunter, Thomas             27500    The Glass Menagerie    Tom Wingfield
    16    929-75-0218    Gunter, Thomas             27500    The Dear Departed      Abel Merryweather

Explanation

Carefully examine the first few observations in the new data set and consider how SAS creates them.

1. Before executing the DATA step, SAS reads the descriptor portion of the two data sets and creates a program data vector that contains all variables from both data sets:

- IdNumber, Name, and Salary from FINANCE
- Play and Role from REPERTORY.

IdNumber is already in the program data vector because it is in FINANCE. SAS sets the values of all variables to missing, as the following figure illustrates.

Figure 18.4   Program Data Vector before Reading from Data Sets

   IdNumber       Name                 Salary    Play                   Role

                                            .

2. SAS looks at the first BY group in each data set to determine which BY group should appear first. In this case, the first BY group, observations with the value 029-46-9261 for IdNumber, is the same in both data sets.

3. SAS reads and copies the first observation from FINANCE into the program data vector, as the next figure illustrates.

Figure 18.5   Program Data Vector after Reading FINANCE Data Set

   IdNumber       Name                 Salary    Play                   Role

   029-46-9261    Rudelich, Herbert     35000

4. SAS reads and copies the first observation from REPERTORY into the program data vector, as the next figure illustrates.
If a data set does not have any observations in a BY group, then the program data vector contains missing values for the variables that are unique to that data set.

Figure 18.6   Program Data Vector after Reading REPERTORY Data Set

   IdNumber       Name                 Salary    Play                   Role

   029-46-9261    Rudelich, Herbert     35000    The Glass Menagerie    Jim O'Connor

5. SAS writes the observation to the new data set and retains the values in the program data vector. (If the program data vector contained variables created by the DATA step, then SAS would set them to missing after writing to the new data set.)

6. SAS looks for a second observation in the BY group in each data set. REPERTORY has one; FINANCE does not. The MERGE statement reads the second observation in the BY group from REPERTORY. Because FINANCE has only one observation in the BY group, the statement uses the values of Name (Rudelich, Herbert) and Salary (35000) retained in the program data vector for the second observation in the new data set. The next figure illustrates this behavior.

Figure 18.7   Program Data Vector with Second Observation in the BY Group

   IdNumber       Name                 Salary    Play                   Role

   029-46-9261    Rudelich, Herbert     35000    The Dear Departed      Henry Slater

7. SAS writes the observation to the new data set. Neither data set contains any more observations in this BY group. Therefore, as the final figure illustrates, SAS sets all values in the program data vector to missing and begins processing the next BY group. It continues processing observations until it exhausts all observations in both data sets.

Figure 18.8   Program Data Vector before New BY Groups

   IdNumber       Name                 Salary    Play                   Role

                                            .

Match-Merging Data Sets with Dropped Variables

Now that casting decisions are final, the director wants to post the casting list, but does not want to include salary or employee ID information. As the next program illustrates, Salary and IdNumber can be eliminated by using the DROP= data set option when creating the new data set.
   data newrep (drop=IdNumber);
      merge finance (drop=Salary) repertory;
      by IdNumber;
   run;

   proc print data=newrep;
      title 'Final Little Theater Season Casting Assignments';
   run;

Note: The difference in placement of the two DROP= data set options is crucial. Dropping IdNumber in the DATA statement means that the variable is available to the MERGE and BY statements (to which it is essential) but that it does not go into the new data set. Dropping Salary in the MERGE statement means that the MERGE statement does not even read this variable, so Salary is unavailable to the program statements. Because the variable Salary is not needed for processing, it is more efficient to prevent it from being read into the program data vector in the first place.

The following output displays the merged data set without the IdNumber and Salary variables:

Output 18.11 Match-Merging Data Sets with Dropped Variables

   Final Little Theater Season Casting Assignments

   Obs   Name                     Play                  Role
     1   Rudelich, Herbert        The Glass Menagerie   Jim O'Connor
     2   Rudelich, Herbert        The Dear Departed     Henry Slater
     3   Vincent, Martina         No Exit               Estelle
     4   Vincent, Martina         Happy Days            Winnie
     5   Vincent, Martina         The Dear Departed     Mrs. Jordan
     6   Benito, Gisela           The Glass Menagerie   Amanda Wingfield
     7   Benito, Gisela           The Dear Departed     Mrs. Slater
     8   Sirignano, Emily         The Dear Departed     Victoria Slater
     9   Harbinger, Nicholas      No Exit               Garcin
    10   Harbinger, Nicholas      Happy Days            Willie
    11   Harbinger, Nicholas      The Dear Departed     Ben Jordan
    12   Phillipon, Marie-Odile   No Exit               Inez
    13   Phillipon, Marie-Odile   The Glass Menagerie   Laura Wingfield
    14   Gunter, Thomas           No Exit               Valet
    15   Gunter, Thomas           The Glass Menagerie   Tom Wingfield
    16   Gunter, Thomas           The Dear Departed     Abel Merryweather

Match-Merging Data Sets with the Same Variables

You can match-merge data sets that contain the same variables (variables with the same name) by using the RENAME= data set option, just as you would when performing a one-to-one merge (see “Performing a One-to-One Merge on Data Sets with the Same Variables” on page 273).

If you do not use the RENAME= option and a variable exists in more than one data set, then the value of that variable in the last data set read is the value that goes into the new data set.

Match-Merging Data Sets That Lack a Common Variable

You can name any number of data sets in the MERGE statement. However, if you are match-merging the data sets, then you must be sure that they all have a common variable and are sorted by that variable. If the data sets do not have a common variable, then you might be able to use another data set that has variables common to the original data sets to merge them.

For instance, consider the data sets that are used in the match-merge examples. The table that follows shows the names of the data sets and the names of the variables in each data set.

   Data Set     Variables
   COMPANY      Name, Age, Gender
   FINANCE      Name, IdNumber, Salary
   REPERTORY    Play, Role, IdNumber

No single variable is common to all three data sets. However, COMPANY and FINANCE share the variable Name. Similarly, FINANCE and REPERTORY share the variable IdNumber. Therefore, as the next program shows, you can merge the data sets into one with two separate DATA steps. As usual, you must sort the data sets by the appropriate BY variable. (REPERTORY is already sorted by IdNumber.)

   options linesize=120;

   /* Sort FINANCE and COMPANY by Name */
   proc sort data=finance;
      by Name;
   run;

   proc sort data=company;
      by Name;
   run;

   /* Merge COMPANY and FINANCE into a */
   /* temporary data set.              */
   data temp;
      merge company finance;
      by Name;
   run;

   proc sort data=temp;
      by IdNumber;
   run;

   /* Merge the temporary data set with REPERTORY */
   data all;
      merge temp repertory;
      by IdNumber;
   run;

   proc print data=all;
      title 'Little Theater Complete Casting Information';
   run;

In order to merge the three data sets, this program
   - sorts FINANCE and COMPANY by Name
   - merges COMPANY and FINANCE into a temporary data set, TEMP
   - sorts TEMP by IdNumber
   - merges TEMP and REPERTORY by IdNumber.

The following output displays the resulting data set, ALL:

Output 18.12 Match-Merging Data Sets That Lack a Common Variable

   Little Theater Complete Casting Information

   Obs   Name                     Age   Gender   IdNumber      Salary   Play                  Role
     1   Morrison, Michael         32   M                           .
     2   Rudelich, Herbert         39   M        029-46-9261    35000   The Glass Menagerie   Jim O'Connor
     3   Rudelich, Herbert         39   M        029-46-9261    35000   The Dear Departed     Henry Slater
     4   Vincent, Martina          34   F        074-53-9892    35000   No Exit               Estelle
     5   Vincent, Martina          34   F        074-53-9892    35000   Happy Days            Winnie
     6   Vincent, Martina          34   F        074-53-9892    35000   The Dear Departed     Mrs. Jordan
     7   Benito, Gisela            32   F        228-88-9649    28000   The Glass Menagerie   Amanda Wingfield
     8   Benito, Gisela            32   F        228-88-9649    28000   The Dear Departed     Mrs. Slater
     9   Sirignano, Emily          12   F        442-21-8075     5000   The Dear Departed     Victoria Slater
    10   Harbinger, Nicholas       36   M        446-93-2122    33900   No Exit               Garcin
    11   Harbinger, Nicholas       36   M        446-93-2122    33900   Happy Days            Willie
    12   Harbinger, Nicholas       36   M        446-93-2122    33900   The Dear Departed     Ben Jordan
    13   Phillipon, Marie-Odile    28   F        776-84-5391    29750   No Exit               Inez
    14   Phillipon, Marie-Odile    28   F        776-84-5391    29750   The Glass Menagerie   Laura Wingfield
    15   Gunter, Thomas            27   M        929-75-0218    27500   No Exit               Valet
    16   Gunter, Thomas            27   M        929-75-0218    27500   The Glass Menagerie   Tom Wingfield
    17   Gunter, Thomas            27   M        929-75-0218    27500   The Dear Departed     Abel Merryweather

Choosing between One-to-One Merging and Match-Merging

Comparing Match-Merge Methods

Use one-to-one merging when you want to combine one observation from each data set, but it is not important to match observations. For example, when merging an observation that contains a student’s name, year, and major with an observation that contains a date, time, and location for a conference, it does not matter which student gets which time slot; therefore, a one-to-one merge is appropriate.

In cases where you must merge certain observations, use a match-merge. For example, when merging employee information from two different data sets, it is crucial that you merge observations that relate to the same employee. Therefore, you must use a match-merge.

Sometimes you might want to merge by a particular variable, but your data is arranged in such a way that you can see that a one-to-one merge will work. The next example illustrates a case when you could use a one-to-one merge for matching observations because you are certain that your data is ordered correctly. However, as a subsequent example shows, it is risky to use a one-to-one merge in such situations.

Input SAS Data Set for Examples

Consider the data set COMPANY2.
Each observation in this data set corresponds to an observation with the same value of Name in FINANCE. The program that follows creates and displays COMPANY2; it also displays FINANCE for comparison.

   data company2;
      input name $ 1-25 age 27-28 gender $ 30;
      datalines;
Benito, Gisela            32 F
Gunter, Thomas            27 M
Harbinger, Nicholas       36 M
Phillipon, Marie-Odile    28 F
Rudelich, Herbert         39 M
Sirignano, Emily          12 F
Vincent, Martina          34 F
;

   proc print data=company2;
      title 'Little Theater Company Roster';
   run;

   proc print data=finance;
      title 'Little Theater Employee Information';
   run;

The following output displays the two data sets:

Output 18.13 The COMPANY2 and FINANCE Data Sets

   Little Theater Company Roster

   Obs   name                     age   gender
     1   Benito, Gisela            32   F
     2   Gunter, Thomas            27   M
     3   Harbinger, Nicholas       36   M
     4   Phillipon, Marie-Odile    28   F
     5   Rudelich, Herbert         39   M
     6   Sirignano, Emily          12   F
     7   Vincent, Martina          34   F

   Little Theater Employee Information

   Obs   IdNumber      Name                     Salary
     1   228-88-9649   Benito, Gisela            28000
     2   929-75-0218   Gunter, Thomas            27500
     3   446-93-2122   Harbinger, Nicholas       33900
     4   776-84-5391   Phillipon, Marie-Odile    29750
     5   029-46-9261   Rudelich, Herbert         35000
     6   442-21-8075   Sirignano, Emily           5000
     7   074-53-9892   Vincent, Martina          35000

When to Use a One-to-One Merge

The following program shows that because both data sets are sorted by Name and because each observation in one data set has a corresponding observation in the other data set, a one-to-one merge has the same result as merging by Name.

   /* One-to-one merge */
   data one_to_one;
      merge company2 finance;
   run;

   proc print data=one_to_one;
      title 'Using a One-to-One Merge to Combine';
      title2 'COMPANY2 and FINANCE';
   run;

   /* Match-merge */
   data match;
      merge company2 finance;
      by name;
   run;

   proc print data=match;
      title 'Using a Match-Merge to Combine';
      title2 'COMPANY2 and FINANCE';
   run;

The following output displays the results of the two merges. You can see that they are identical.
Output 18.14 Comparing a One-to-One Merge with a Match-Merge When Observations Correspond

   Using a One-to-One Merge to Combine
   COMPANY2 and FINANCE

   Obs   name                     age   gender   IdNumber      Salary
     1   Benito, Gisela            32   F        228-88-9649    28000
     2   Gunter, Thomas            27   M        929-75-0218    27500
     3   Harbinger, Nicholas       36   M        446-93-2122    33900
     4   Phillipon, Marie-Odile    28   F        776-84-5391    29750
     5   Rudelich, Herbert         39   M        029-46-9261    35000
     6   Sirignano, Emily          12   F        442-21-8075     5000
     7   Vincent, Martina          34   F        074-53-9892    35000

   Using a Match-Merge to Combine
   COMPANY2 and FINANCE

   Obs   name                     age   gender   IdNumber      Salary
     1   Benito, Gisela            32   F        228-88-9649    28000
     2   Gunter, Thomas            27   M        929-75-0218    27500
     3   Harbinger, Nicholas       36   M        446-93-2122    33900
     4   Phillipon, Marie-Odile    28   F        776-84-5391    29750
     5   Rudelich, Herbert         39   M        029-46-9261    35000
     6   Sirignano, Emily          12   F        442-21-8075     5000
     7   Vincent, Martina          34   F        074-53-9892    35000

Even though the resulting data sets are identical, it is not wise to use a one-to-one merge when it is essential to merge a particular observation from one data set with a particular observation from another data set.

When to Use a Match-Merge

In the previous example, you can easily determine that the data sets contain the same values for Name and that the values appear in the same order. However, if the data sets contained hundreds of observations, then it would be difficult to ascertain that all the values match. If the observations do not match, then serious problems can occur. The next example illustrates why you should not use a one-to-one merge for matching observations.

Consider the original data set, COMPANY, which contains an observation for Michael Morrison (see Output 18.6). FINANCE has no corresponding observation. If a programmer did not realize this fact and tried to use the following program to perform a one-to-one merge with FINANCE, then several problems could appear.
   data badmerge;
      merge company finance;
   run;

   proc print data=badmerge;
      title 'Using a One-to-One Merge Instead of a Match-Merge';
   run;

The following output shows the potential problems:

Output 18.15 One-to-One Merge with Unequal Numbers of Observations in Each Data Set

   Using a One-to-One Merge Instead of a Match-Merge

   Obs   Name                     Age   Gender   IdNumber      Salary
     1   Benito, Gisela            32   F        228-88-9649    28000
     2   Gunter, Thomas            27   M        929-75-0218    27500
     3   Harbinger, Nicholas       36   M        446-93-2122    33900
     4   Phillipon, Marie-Odile    32   M        776-84-5391    29750
     5   Rudelich, Herbert         28   F        029-46-9261    35000
     6   Sirignano, Emily          39   M        442-21-8075     5000
     7   Vincent, Martina          12   F        074-53-9892    35000
     8   Vincent, Martina          34   F                           .

The first three observations merge correctly. However, FINANCE does not have an observation for Michael Morrison. A one-to-one merge makes no attempt to match parts of the observations from the different data sets. It simply combines observations based on their positions in the data sets that you name in the MERGE statement. Therefore, the fourth observation in BADMERGE combines the fourth observation in COMPANY (Michael’s name, age, and gender) with the fourth observation in FINANCE (Marie-Odile’s name, employee ID number, and salary). As SAS combines the observations, Marie-Odile’s name overwrites Michael’s. After writing this observation to the new data set, SAS processes the next observation in each data set. These observations are similarly mismatched.

This type of mismatch continues until the seventh observation, when the MERGE statement exhausts the observations in the smaller data set, FINANCE. After writing the seventh observation to the new data set, SAS begins the next iteration of the DATA step. Because SAS has read all observations in FINANCE, it sets the values for variables from that data set to missing in the program data vector.
Then it reads the values for Name, Age, and Gender from COMPANY and writes the contents of the program data vector to the new data set. Therefore, the last observation has the same value for Name as the previous observation and contains missing values for IdNumber and Salary.

These missing values and the duplication of the value for Name might make you suspect that the observations did not merge as you intended them to. However, if instead of being an additional observation, the observation for Michael Morrison replaced another observation in COMPANY2, then no observations would have missing values, and the problem would not be as easy to spot. Therefore, you are safer using a match-merge in situations that call for it even if you think the data is arranged so that a one-to-one merge will have the same results.

Review of SAS Tools

Statements

MERGE SAS-data-set-list;
BY variable-list;
   read observations in multiple SAS data sets and combine them into one observation in one new SAS data set. SAS-data-set-list is a list of the SAS data sets to merge. The list may contain any number of data sets. variable-list is the name of one or more variables by which to merge the data sets. If you use a BY statement, then the data sets must be sorted by the same BY variables before you can merge them. If you do not use a BY statement, then SAS merges observations based on their positions in the original data sets.

Learning More

Indexes
   If a data set has an index on the variable or variables named in the BY statement that accompanies the MERGE statement, then you do not need to sort that data set. For more information about indexes, see SAS Language Reference: Concepts and the Base SAS Procedures Guide.

SAS date and time formats and informats
   The examples in this section read Time as a character variable, and they read Date with a SAS date informat. You could read Time using special SAS time informats.
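   As a minimal sketch of that suggestion, the following step reads Time with the TIME5. informat so that it is stored as a numeric SAS time value rather than as character text. The data set name CLASS_TIMES and the two data lines are hypothetical and are not taken from the examples in this section; the sketch assumes that times are written as hh:mm.

```sas
/* Hypothetical data: Date is read with DATE9. and Time with TIME5., */
/* so both variables are stored as numeric SAS date/time values.     */
data class_times;
   input Date date9. +1 Time time5.;
   format Date date9. Time time5.;
   datalines;
14SEP2000 10:00
14SEP2000 13:30
;
run;

proc print data=class_times;
   title 'Date and Time Read with SAS Informats';
run;
```

   Because Time is numeric in this sketch, it can be sorted and compared chronologically, which is not reliable for a character value.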
   For more information about SAS date and time formats and informats, see SAS Language Reference: Dictionary.

CHAPTER 19
Updating SAS Data Sets

   Introduction to Updating SAS Data Sets
      Purpose
      Prerequisites
   Understanding the UPDATE Statement
   Understanding How to Select BY Variables
   Updating a Data Set
   Updating with Incremental Values
   Understanding the Differences between Updating and Merging
      General Comparisons between Updating and Merging
      How the UPDATE and MERGE Statements Process Missing Values Differently
      How the UPDATE and MERGE Statements Process Multiple Observations in a BY Group Differently
   Handling Missing Values
   Review of SAS Tools
      Statements
   Learning More

Introduction to Updating SAS Data Sets

Purpose

Updating replaces the values of variables in one data set with nonmissing values from another data set. In this section, you will learn about the following:
   - master data sets and transaction data sets
   - using the UPDATE statement
   - how to choose between updating and merging

Prerequisites

Before using this section, you should be familiar with the concepts presented in
   - Chapter 3, “Starting with Raw Data: The Basics,” on page 43
   - Chapter 5, “Starting with SAS Data Sets,” on page 81
   - Chapter 18, “Merging SAS Data Sets,” on page 269

Understanding the UPDATE Statement

When you update, you work with two SAS data sets. The data set that contains the original information is the master data set. The data set that contains the new information is the transaction data set. Many applications, such as maintaining mailing lists and inventories, call for periodic updates of information.

In a DATA step, the UPDATE statement reads observations from the transaction data set and updates corresponding observations (observations with the same value of all BY variables) from the master data set.
All nonmissing values for variables in the transaction data set replace the corresponding values that are read from the master data set. SAS writes the modified observations to the data set that you name in the DATA statement without modifying either the master or the transaction data set.

The general form of the UPDATE statement is

   UPDATE master-SAS-data-set transaction-SAS-data-set;
   BY identifier-list;

where

master-SAS-data-set
   is the SAS data set containing information you want to update.

transaction-SAS-data-set
   is the SAS data set containing information with which you want to update the master data set.

identifier-list
   is the list of BY variables by which you identify corresponding observations.

If the master data set contains an observation that does not correspond to an observation in the transaction data set, the DATA step writes that observation to the new data set without modification. An observation from the transaction data set that does not correspond to any observation in the master data set becomes the basis for a new observation. The new observation may be modified by other observations from the transaction data set before it is written to the new data set.

Understanding How to Select BY Variables

The master data set and the transaction data set must be sorted by the same variable or variables that you specify in the BY statement. Select a variable that meets these criteria:
   - The value of the variable is unique for each observation in the master data set. If you use more than one BY variable, no two observations in the master data set should have the same values for all BY variables.
   - The variable or variables never need to be updated.

Some examples of variables that you can use in the BY statement include employee or student identification numbers, stock numbers, and the names of objects in an inventory.

If you are updating a data set, you probably do not want duplicate values of BY variables in the master data set.
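One way to confirm that a candidate BY variable is unique before you update is sketched here. The names MASTER, Id, and DUP_CHECK are placeholders rather than names used in this chapter, and the sketch assumes that MASTER is already sorted by Id.

```sas
/* Hypothetical names: MASTER is the master data set and Id is the   */
/* BY variable. DUP_CHECK receives every observation whose Id value  */
/* occurs more than once; if DUP_CHECK is empty, Id is unique.       */
data dup_check;
   set master;
   by Id;
   if not (first.Id and last.Id);
run;

proc print data=dup_check;
   title 'Observations with Duplicate BY Values';
run;
```

The subsetting IF keeps an observation only when it is not both the first and the last observation in its BY group, that is, when the group contains more than one observation.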
For example, if you update by NAME, each observation in the master data set should have a unique value of NAME. If you update by NAME and AGE, two or more observations can have the same value for either NAME or AGE but should not have the same values for both. SAS warns you if it finds duplicates but proceeds with the update. It applies all transactions only to the first observation in the BY group in the master data set.

Updating a Data Set

In this example, the circulation department of a magazine maintains a mailing list that contains tens of thousands of names. Each issue of the magazine contains a form for readers to fill out when they change their names or addresses. To simplify the maintenance job, the form requests that readers send only new information. New subscribers can start a subscription by completing the entire form. When a form is received, a data entry operator enters the information on the form into a raw data file. The mailing list is updated once per month from the raw data file.

The mailing list includes these variables for each subscriber:

SubscriberId
   is a unique number assigned to the subscriber at the time the subscription begins. A subscriber’s SubscriberId never changes.

Name
   is the subscriber’s name. The last name appears first, followed by a comma and the first name.

StreetAddress
   is the subscriber’s street address.

City
   is the subscriber’s city.

StateProv
   is the subscriber’s state or province. This variable is missing for addresses outside the United States and Canada.

PostalCode
   is the subscriber’s postal code (zip code for addresses in the United States).

Country
   is the subscriber’s country.

The following program creates and displays the first part of this data set. The raw data are already sorted by SubscriberId.
   options pagesize=60 linesize=80 pageno=1 nodate;

   data mail_list;
      input SubscriberId 1-8 Name $ 9-27 StreetAddress $ 28-47 City $ 48-62
            StateProv $ 63-64 PostalCode $ 67-73 Country $ 75-80;
      datalines;
1001    Ericson, Jane      111 Clancey Court   Chapel Hill    NC  27514   USA
1002    Dix, Martin        4 Shepherd St.      Vancouver      BC  V6C 3E8 Canada
1003    Gabrielli, Theresa Via Pisanelli, 25   Roma               00196   Italy
1004    Clayton, Aria      14 Bridge St.       San Francisco  CA  94124   USA
1005    Archuleta, Ruby    Box 108             Milagro        NM  87429   USA
1006    Misiewicz, Jeremy  43-C Lakeview Apts. Madison        WI  53704   USA
1007    Ahmadi, Hafez      52 Rue Marston      Paris              75019   France
1008    Jacobson, Becky    1 Lincoln St.       Tallahassee    FL  32312   USA
1009    An, Ing            95 Willow Dr.       Toronto        ON  M5J 2T3 Canada
1010    Slater, Emily      1009 Cherry St.     York           PA  17407   USA
...more data lines...
;

   proc print data=mail_list (obs=10);
      title 'Magazine Master Mailing List';
   run;

The following output shows the results:

Output 19.1 The MAIL_LIST Data Set

   Magazine Master Mailing List

   Obs   SubscriberId   Name                StreetAddress        City           StateProv  PostalCode  Country
     1   1001           Ericson, Jane       111 Clancey Court    Chapel Hill    NC         27514       USA
     2   1002           Dix, Martin         4 Shepherd St.       Vancouver      BC         V6C 3E8     Canada
     3   1003           Gabrielli, Theresa  Via Pisanelli, 25    Roma                      00196       Italy
     4   1004           Clayton, Aria       14 Bridge St.        San Francisco  CA         94124       USA
     5   1005           Archuleta, Ruby     Box 108              Milagro        NM         87429       USA
     6   1006           Misiewicz, Jeremy   43-C Lakeview Apts.  Madison        WI         53704       USA
     7   1007           Ahmadi, Hafez       52 Rue Marston       Paris                     75019       France
     8   1008           Jacobson, Becky     1 Lincoln St.        Tallahassee    FL         32312       USA
     9   1009           An, Ing             95 Willow Dr.        Toronto        ON         M5J 2T3     Canada
    10   1010           Slater, Emily       1009 Cherry St.      York           PA         17407       USA

This month the information that follows is received for updating the mailing list:
   - Martin Dix changed his name to Martin Dix-Rosen.
   - Jane Ericson’s postal code changed.
   - Jeremy Misiewicz moved to a new street address. His city, state, and postal code remain the same.
   - Ing An moved from Toronto, Ontario, to Calgary, Alberta.
   - Martin Dix-Rosen, shortly after changing his name, moved from Vancouver, British Columbia, to Seattle, Washington.
   - Two new subscribers joined the list. They are given SubscriberId numbers 1011 and 1012.

Each change is entered into the raw data file as soon as it is received. In each case, only the customer’s SubscriberId and the new information are entered. The raw data file looks like this:

1002    Dix-Rosen, Martin
1001                                                              27516
1006                       932 Webster St.
1009                       2540 Pleasant St.   Calgary        AB  T2P 4H2
1011    Mitchell, Wayne    28 Morningside Dr.  New York       NY  10017   USA
1002                       P.O. Box 1850       Seattle        WA  98101   USA
1012    Stavros, Gloria    212 Northampton Rd. South Hadley   MA  01075   USA

The data is in fixed columns, matching the INPUT statement that created MAIL_LIST.

First, you must transform the raw data into a SAS data set and sort that data set by SubscriberId so that you can use it to update the master list.

   data mail_trans;
      infile 'your-input-file' missover;
      input SubscriberId 1-8 Name $ 9-27 StreetAddress $ 28-47 City $ 48-62
            StateProv $ 63-64 PostalCode $ 67-73 Country $ 75-80;
   run;

   proc sort data=mail_trans;
      by SubscriberId;
   run;

   proc print data=mail_trans;
      title 'Magazine Mailing List Changes';
      title2 '(for current month)';
   run;

Note the MISSOVER option in the INFILE statement. The MISSOVER option prevents the INPUT statement from going to a new line to search for values for variables that have not received values; instead, any variables that have not received values are set to missing. For example, when the first record is read, the end of the record is encountered before any value has been assigned to the Country variable; instead of going to the next record to search for a value for Country, the Country variable is assigned a missing value.
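The effect of MISSOVER can be seen in a small, self-contained sketch. The data set name MISSOVER_DEMO, the variables, and the two data lines are hypothetical; the sketch uses list input with in-stream data rather than the column input and external file of the mailing-list example.

```sas
/* Hypothetical data: the first record ends before a value for */
/* Country is found. With MISSOVER, Country is set to missing  */
/* for that record and the second record is read on its own.   */
data missover_demo;
   infile datalines missover;
   input Id City $ Country $;
   datalines;
1001 Seattle
1002 Toronto Canada
;
run;

proc print data=missover_demo;
   title 'Reading Short Records with MISSOVER';
run;
```

If MISSOVER is removed from this sketch, the INPUT statement moves to the second record to find a value for Country, so the two records are combined into a single observation. For column input from files whose records are not padded to a fixed length, the TRUNCOVER option may be worth considering, because MISSOVER sets a variable to missing when the record ends partway through that variable's column range.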
For more information about the MISSOVER option, see Chapter 4, “Starting with Raw Data: Beyond the Basics,” on page 61.

The following output shows the sorted data set MAIL_TRANS:

Output 19.2 The MAIL_TRANS Data Set

   Magazine Mailing List Changes
   (for current month)

   Obs   SubscriberId   Name                StreetAddress        City           StateProv  PostalCode  Country
     1   1001                                                                              27516
     2   1002           Dix-Rosen, Martin
     3   1002                               P.O. Box 1850        Seattle        WA         98101       USA
     4   1006                               932 Webster St.
     5   1009                               2540 Pleasant St.    Calgary        AB         T2P 4H2
     6   1011           Mitchell, Wayne     28 Morningside Dr.   New York       NY         10017       USA
     7   1012           Stavros, Gloria     212 Northampton Rd.  South Hadley   MA         01075       USA

Now that the new data are in a sorted SAS data set, the following program updates the mailing list.

   data mail_newlist;
      update mail_list mail_trans;
      by SubscriberId;
   run;

   proc print data=mail_newlist;
      title 'Magazine Mailing List';
      title2 '(updated for current month)';
   run;

The following output shows the resulting data set MAIL_NEWLIST:

Output 19.3 Updating a Data Set

   Magazine Mailing List
   (updated for current month)

   Obs   SubscriberId   Name                StreetAddress        City           StateProv  PostalCode  Country
     1   1001           Ericson, Jane       111 Clancey Court    Chapel Hill    NC         27516       USA
     2   1002           Dix-Rosen, Martin   P.O. Box 1850        Seattle        WA         98101       USA
     3   1003           Gabrielli, Theresa  Via Pisanelli, 25    Roma                      00196       Italy
     4   1004           Clayton, Aria       14 Bridge St.        San Francisco  CA         94124       USA
     5   1005           Archuleta, Ruby     Box 108              Milagro        NM         87429       USA
     6   1006           Misiewicz, Jeremy   932 Webster St.      Madison        WI         53704       USA
     7   1007           Ahmadi, Hafez       52 Rue Marston       Paris                     75019       France
     8   1008           Jacobson, Becky     1 Lincoln St.        Tallahassee    FL         32312       USA
     9   1009           An, Ing             2540 Pleasant St.    Calgary        AB         T2P 4H2     Canada
    10   1010           Slater, Emily       1009 Cherry St.      York           PA         17407       USA
    11   1011           Mitchell, Wayne     28 Morningside Dr.   New York       NY         10017       USA
    12   1012           Stavros, Gloria     212 Northampton Rd.  South Hadley   MA         01075       USA

The data for subscriber 1002, who has two update transactions, is used below to show what happens when you update an observation in the master data set with corresponding observations from the transaction data set.

1. Before executing the DATA step, SAS reads the descriptor portion of each data set named in the UPDATE statement and, by default, creates a program data vector that contains all the variables from all data sets. As the following figure illustrates, SAS sets the value of each variable to missing. (Use the DROP= or KEEP= data set option to exclude one or more variables.)

   Figure 19.1 Program Data Vector before Execution of the DATA Step

   SubscriberId   Name   StreetAddress   City   StateProv   PostalCode   Country
              .

2. Next, SAS reads the first observation from the master data set (in this walkthrough, the observation for subscriber 1002) and copies it into the program data vector, as the following figure illustrates.

   Figure 19.2 Program Data Vector after Reading the First Observation from the Master Data Set

   SubscriberId   Name          StreetAddress    City        StateProv   PostalCode   Country
   1002           Dix, Martin   4 Shepherd St.   Vancouver   BC          V6C 3E8      Canada

3. SAS applies the first transaction by copying all nonmissing values (the value of Name) from the first observation in this BY group (SubscriberId=1002) into the program data vector, as the following figure illustrates.

   Figure 19.3 Program Data Vector after Applying the First Transaction

   SubscriberId   Name                StreetAddress    City        StateProv   PostalCode   Country
   1002           Dix-Rosen, Martin   4 Shepherd St.   Vancouver   BC          V6C 3E8      Canada

4. After completing this transaction, SAS looks for another observation in the same BY group in the transaction data set. If it finds a second observation with the same value for SubscriberId, then it applies the second transaction too (new values for StreetAddress, City, StateProv, PostalCode, and Country).
   Now the observation contains the new values from both transactions, as the following figure illustrates.

   Figure 19.4 Program Data Vector after Applying the Second Transaction

   SubscriberId   Name                StreetAddress   City      StateProv   PostalCode   Country
   1002           Dix-Rosen, Martin   P.O. Box 1850   Seattle   WA          98101        USA

5. After completing the second transaction, SAS looks for a third observation in the same BY group. Because no such observation exists, it writes the observation in its current form to the new data set and sets the values in the program data vector to missing.

As the DATA step iterates, the UPDATE statement continues processing observations in this way until it reaches the end of the master and transaction data sets. The two observations in the transaction data set that describe new subscribers (and therefore have no corresponding observation in the master data set) become observations in the new data set.

Remember that if there are duplicate observations in the master data set, all matching observations in the transaction data set are applied only to the first of the duplicate observations in the master data set.

Updating with Incremental Values

Some applications do not update a data set by overwriting values in the master data set with new values from a transaction data set. Instead, they update a variable by mathematically manipulating its value based on the value of a variable in the transaction data set.

In this example, a bookstore uses SAS to keep track of weekly sales and year-to-date sales. The program that follows creates, sorts by Title, and displays the data set YEAR_SALES, which contains the year-to-date information.
   data year_sales;
      input Title $ 1-25 Author $ 27-50 Sales;
      datalines;
The Milagro Beanfield War Nichols, John            303
The Stranger              Camus, Albert            150
Always Coming Home        LeGuin, Ursula           79
Falling through Space     Gilchrist, Ellen         128
Don Quixote               Cervantes, Miguel de     87
The Handmaid's Tale       Atwood, Margaret         64
;

   proc sort data=year_sales;
      by title;
   run;

   proc print data=year_sales (obs=6);
      title 'Bookstore Sales, Year-to-Date';
      title2 'By Title';
   run;

The following output displays the YEAR_SALES data set:

Output 19.4 The YEAR_SALES Data Set, Sorted by Title

   Bookstore Sales, Year-to-Date
   By Title

   Obs   Title                       Author                 Sales
     1   Always Coming Home          LeGuin, Ursula            79
     2   Don Quixote                 Cervantes, Miguel de      87
     3   Falling through Space       Gilchrist, Ellen         128
     4   The Handmaid's Tale         Atwood, Margaret          64
     5   The Milagro Beanfield War   Nichols, John            303
     6   The Stranger                Camus, Albert            150

Every Saturday a SAS data set is created containing information about all the books that were sold during the past week. The following program creates, sorts by Title, and displays the data set WEEK_SALES, which contains the current week’s information.
data week_sales; input Title $ 1-25 Author $ 27-50 Sales; datalines; The Milagro Beanfield War Nichols, John 32 Updating SAS Data Sets 4 The Stranger Camus, Albert Always Coming Home LeGuin, Ursula Falling through Space Gilchrist, Ellen The Accidental Tourist Tyler, Anne The Handmaid’s Tale Atwood, Margaret ; proc sort data=week_sales; by title; run; Updating with Incremental Values 301 17 10 12 15 8 proc print data=week_sales; title ’Bookstore Sales for Current Week’; title2 ’By Title’; run; The following output shows the data set, which contains the same variables as the year-to-date data set, but the variable Sales represents sales for only one week: Output 19.5 The WEEK_SALES Data Set, Sorted by Title Bookstore Sales for Current Week By Title Obs 1 2 3 4 5 6 Title Always Coming Home Falling through Space The Accidental Tourist The Handmaid’s Tale The Milagro Beanfield War The Stranger Author 1 Sales LeGuin, Ursula Gilchrist, Ellen Tyler, Anne Atwood, Margaret Nichols, John Camus, Albert 10 12 15 8 32 17 Note: If the transaction data set is updating only titles that are already in YEAR_SALES, it does not need to contain the variable Author. However, because this variable is there, the transaction data set can be used to add complete observations to the master data set. 4 The program that follows uses the weekly information to update the year-to-date data set and displays the new data set. data total_sales; drop NewSales; w update year_sales week_sales (rename=(Sales=NewSales)); u by Title; sales=sum(Sales,NewSales); v run; proc print data=total_sales; title ’Updated Year-to-Date Sales’; run; The following list corresponds to the numbered items in the preceding program: u The RENAME= data set option in the UPDATE statement changes the name of the variable Sales in the transaction data set (WEEK_SALES) to NewSales. 
As a result, these values do not replace the values of Sales that are read from the master data set (YEAR_SALES).

(2) The Sales value that is in the updated data set (TOTAL_SALES) is the sum of the year-to-date sales and the weekly sales.

(3) The program drops the variable NewSales because it is not needed in the new data set.

The following output shows that in addition to updating sales information for the titles already in the master data set, the UPDATE statement has added a new title, The Accidental Tourist.

Output 19.6  Updating Year-to-Date Sales with Weekly Sales

                   Updated Year-to-Date Sales                  1

Obs    Title                        Author                  Sales
 1     Always Coming Home           LeGuin, Ursula             89
 2     Don Quixote                  Cervantes, Miguel de       87
 3     Falling through Space        Gilchrist, Ellen          140
 4     The Accidental Tourist       Tyler, Anne                15
 5     The Handmaid's Tale          Atwood, Margaret           72
 6     The Milagro Beanfield War    Nichols, John             335
 7     The Stranger                 Camus, Albert             167

Understanding the Differences between Updating and Merging

General Comparisons between Updating and Merging

The MERGE statement and the UPDATE statement both match observations from two SAS data sets; however, the two statements differ significantly. It is important to distinguish between the two processes and to choose the one that is appropriate for your application. The most straightforward differences are as follows:

- The UPDATE statement uses only two data sets. The number of data sets that the MERGE statement can use is limited only by machine-dependent factors such as memory and disk space.
- A BY statement must accompany an UPDATE statement. The MERGE statement performs a one-to-one merge if no BY statement follows it.
- The two statements also process observations differently when a data set contains missing values or multiple observations in a BY group.

To illustrate the differences, compare updating the SAS data set MAIL_LIST with the data set MAIL_TRANS to merging the two data sets.
You have already seen the results of updating in the example that created Output 19.3. That output appears again in the following output for easy comparison.

Output 19.7  Updating a Data Set

                               Magazine Mailing List                              1
                            (updated for current month)

Obs  SubscriberId  Name                StreetAddress        City           StateProv  PostalCode  Country
  1      1001      Ericson, Jane       111 Clancey Court    Chapel Hill       NC      27516       USA
  2      1002      Dix-Rosen, Martin   P.O. Box 1850        Seattle           WA      98101       USA
  3      1003      Gabrielli, Theresa  Via Pisanelli, 25    Roma                      00196       Italy
  4      1004      Clayton, Aria       14 Bridge St.        San Francisco     CA      94124       USA
  5      1005      Archuleta, Ruby     Box 108              Milagro           NM      87429       USA
  6      1006      Misiewicz, Jeremy   932 Webster St.      Madison           WI      53704       USA
  7      1007      Ahmadi, Hafez       52 Rue Marston       Paris                     75019       France
  8      1008      Jacobson, Becky     1 Lincoln St.        Tallahassee       FL      32312       USA
  9      1009      An, Ing             2540 Pleasant St.    Calgary           AB      T2P 4H2     Canada
 10      1010      Slater, Emily       1009 Cherry St.      York              PA      17407       USA
 11      1011      Mitchell, Wayne     28 Morningside Dr.   New York          NY      10017       USA
 12      1012      Stavros, Gloria     212 Northampton Rd.  South Hadley      MA      01075       USA

In contrast, the following program merges the two data sets.

data mail_merged;
   merge mail_list mail_trans;
   by SubscriberId;
run;

proc print data=mail_merged;
   title 'Magazine Mailing List';
run;

The following output shows the results of the merge:

Output 19.8  Results of Merging the Master and Transaction Data Sets

[Listing titled "Magazine Mailing List" with the same columns as Output 19.7. The cell-by-cell layout is not recoverable from the source. The listing contains 13 observations with the SubscriberId values 1001, 1002, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, and 1012, and several of its cells are blank.]

The MERGE statement produces a data set containing 13 observations, whereas UPDATE produces a data set containing 12 observations. In addition, merging the data sets results in several missing values, whereas updating does not. Obviously, using the wrong statement may result in incorrect data. The differences between the merged and updated data sets result from the ways the two statements handle missing values and multiple observations in a BY group.

How the UPDATE and MERGE Statements Process Missing Values Differently

During an update, if a value for a variable is missing in the transaction data set, SAS uses the value from the master data set when it writes the observation to the new data set. When merging the same observations, SAS overwrites the value in the program data vector with the missing value. For example, the following observation exists in data set MAILING.MASTER.

1001  ERICSON, JANE   111 CLANCEY COURT   CHAPEL HILL   NC   27514

The following corresponding observation exists in MAILING.TRANS.

1001                                                         27516

Updating combines the two observations and creates the following observation:

1001  ERICSON, JANE   111 CLANCEY COURT   CHAPEL HILL   NC   27516

Merging combines the two observations and creates this observation:

1001                                                         27516

How the UPDATE and MERGE Statements Process Multiple Observations in a BY Group Differently

During an update, SAS does not write an updated observation to the new data set until it has applied all the transactions in a BY group.
When merging data sets, SAS writes one new observation for each observation in the data set with the largest number of observations in the BY group. For example, consider this observation from MAILING.MASTER:

1002  DIX, MARTIN         4 SHEPHERD ST.     NORWICH   VT   05055

and the corresponding observations from MAILING.TRANS:

1002  DIX-ROSEN, MARTIN
1002                      R.R. 2, BOX 1850   HANOVER   NH   03755

The UPDATE statement applies both transactions and combines these observations into a single one:

1002  DIX-ROSEN, MARTIN   R.R. 2, BOX 1850   HANOVER   NH   03755

The MERGE statement, on the other hand, first merges the observation from MAILING.MASTER with the first observation in the corresponding BY group in MAILING.TRANS. All values of variables from the observation in MAILING.TRANS are used, even if they are missing. Then SAS writes the observation to the new data set:

1002  DIX-ROSEN, MARTIN

Next, SAS looks for other observations in the same BY group in each data set. Because more observations are in the BY group in MAILING.TRANS, all the values in the program data vector are retained. SAS merges them with the second observation in the BY group from MAILING.TRANS and writes the result to the new data set:

1002                      R.R. 2, BOX 1850   HANOVER   NH   03755

Therefore, merging creates two observations for the new data set, whereas updating creates only one.

Handling Missing Values

If you update a master data set with a transaction data set, and the transaction data set contains missing values, you can use the UPDATEMODE option on the UPDATE statement to tell SAS how you want to handle the missing values. The UPDATEMODE option specifies whether missing values in a transaction data set will replace existing values in a master data set.
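The BY-group walk-through for subscriber 1002 and the missing-value rule can be modeled outside SAS. The following Python sketch is only an illustration of the behavior described above, not SAS code and not how SAS is implemented. The function names apply_update and match_merge, the missing_check flag, and the field names Name, Street, City, State, and Zip are all hypothetical; None stands in for a SAS missing value.

```python
KEY = "SubscriberId"


def apply_update(master, transactions, key, missing_check=True):
    """Model of UPDATE: every transaction in a BY group is applied
    before the single resulting record is returned."""
    result = {rec[key]: dict(rec) for rec in master}
    for tx in transactions:
        if tx[key] not in result:
            # Unmatched transaction: it becomes a new record.
            result[tx[key]] = dict(tx)
            continue
        target = result[tx[key]]
        for name, value in tx.items():
            if value is None and missing_check:
                continue  # keep the master value
            target[name] = value
    return [result[k] for k in sorted(result)]


def match_merge(master, transactions, key):
    """Model of MERGE with BY: one output record per position in the
    longer BY group; later values overwrite, even when missing."""
    keys = sorted({r[key] for r in master} | {r[key] for r in transactions})
    out = []
    for k in keys:
        left = [r for r in master if r[key] == k]
        right = [r for r in transactions if r[key] == k]
        pdv = {}  # stand-in for the program data vector
        for i in range(max(len(left), len(right))):
            if i < len(left):
                pdv.update(left[i])
            if i < len(right):
                pdv.update(right[i])
            out.append(dict(pdv))
    return out


master = [{KEY: 1002, "Name": "DIX, MARTIN", "Street": "4 SHEPHERD ST.",
           "City": "NORWICH", "State": "VT", "Zip": "05055"}]
transactions = [
    {KEY: 1002, "Name": "DIX-ROSEN, MARTIN", "Street": None,
     "City": None, "State": None, "Zip": None},
    {KEY: 1002, "Name": None, "Street": "R.R. 2, BOX 1850",
     "City": "HANOVER", "State": "NH", "Zip": "03755"},
]

updated = apply_update(master, transactions, KEY)
merged = match_merge(master, transactions, KEY)
overwritten = apply_update(master, transactions, KEY, missing_check=False)
```

Under these assumptions, updated holds one record that combines both transactions, merged holds two records with gaps, and overwritten shows what happens when missing transaction values are allowed to replace master values.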
The syntax for using the UPDATEMODE option with the UPDATE statement is as follows:

UPDATE master-SAS-data-set transaction-SAS-data-set UPDATEMODE=MISSINGCHECK | NOMISSINGCHECK;
BY by-variable;

The UPDATEMODE= argument is optional. The MISSINGCHECK value in the UPDATEMODE option prevents missing values in a transaction data set from replacing values in a master data set. This is the default. The NOMISSINGCHECK value in the UPDATEMODE option enables missing values in a transaction data set to replace values in a master data set by preventing the check for missing data from being performed.

The following examples show how SAS handles missing values when you use the UPDATEMODE option on the UPDATE statement. The following example creates and sorts a master data set:

options pagesize=60 linesize=80 pageno=1 nodate;

data inventory;
   input PartNumber $ Description $ Stock @17 ReceivedDate date9. @27 Price;
   format ReceivedDate date9.;
   datalines;
K89R seal    34 27jul2004 245.00
M4J7 sander  98 20jun2004 45.88
LK43 filter 121 19may2005 10.99
MN21 brace   43 10aug2005 27.87
BC85 clamp   80 16aug2005 9.55
NCF3 valve  198 20mar2005 24.50
;

proc sort data=inventory;
   by PartNumber;
run;

proc print data=inventory;
   title 'Master Data Set';
   title2 'Tool Warehouse Inventory';
run;

The following output shows the results:

Output 19.9  The Master Data Set

                        Master Data Set                        1
                    Tool Warehouse Inventory

       Part                              Received
Obs    Number    Description    Stock    Date          Price
 1     BC85      clamp             80    16AUG2005      9.55
 2     K89R      seal              34    27JUL2004    245.00
 3     LK43      filter           121    19MAY2005     10.99
 4     M4J7      sander            98    20JUN2004     45.88
 5     MN21      brace             43    10AUG2005     27.87
 6     NCF3      valve            198    20MAR2005     24.50

The following example creates and sorts a transaction data set:

options linesize=80 pagesize=64 nodate pageno=1;

data add_inventory;
   input PartNumber $ 1-4 Description $ 6-11 Stock 13-15 @17 Price;
   datalines;
K89R seal       245.00
M4J7 sander 121 45.88
LK43 filter  34 10.99
MN21 brace      28.87
BC85 clamp   57 11.64
NCF3 valve  121 .
;

proc sort data=add_inventory;
   by PartNumber;
run;

proc print data=add_inventory;
   title 'Transaction Data Set';
   title2 'Tool Warehouse Inventory';
run;

The following output shows the results:

Output 19.10  The Transaction Data Set

                      Transaction Data Set                     1
                    Tool Warehouse Inventory

       Part
Obs    Number    Description    Stock     Price
 1     BC85      clamp             57     11.64
 2     K89R      seal               .    245.00
 3     LK43      filter            34     10.99
 4     M4J7      sander           121     45.88
 5     MN21      brace              .     28.87
 6     NCF3      valve            121       .

In the following example, SAS uses the NOMISSINGCHECK value of the UPDATEMODE option on the UPDATE statement:

options pagesize=60 linesize=80 pageno=1 nodate;

data new_inventory;
   update inventory add_inventory updatemode=nomissingcheck;
   by PartNumber;
   ReceivedDate=today();
run;

proc print data=new_inventory;
   title 'Updated Master Data Set';
   title2 'Tool Warehouse Inventory';
run;

The following output shows the results of using the NOMISSINGCHECK value. Observations 2 and 5 contain missing values for STOCK because the transaction data set contains missing values for STOCK for these items. Because SAS does not check for missing values in the transaction data set, the original values of STOCK are replaced by missing values. In the sixth observation, the original value of PRICE is replaced by a missing value.

Output 19.11  Updated Master Data Set: UPDATEMODE=NOMISSINGCHECK

                    Updated Master Data Set                    1
                    Tool Warehouse Inventory

       Part                              Received
Obs    Number    Description    Stock    Date          Price
 1     BC85      clamp             57    12JAN2007     11.64
 2     K89R      seal               .    12JAN2007    245.00
 3     LK43      filter            34    12JAN2007     10.99
 4     M4J7      sander           121    12JAN2007     45.88
 5     MN21      brace              .    12JAN2007     28.87
 6     NCF3      valve            121    12JAN2007       .

The following output shows the results of using the MISSINGCHECK value. Note that no missing values are written to the updated master data set. The missing data in observations 2, 5, and 6 of the transaction data set is ignored, and the original data from the master data set remains.
Output 19.12  Updated Master Data Set: UPDATEMODE=MISSINGCHECK

                    Updated Master Data Set                    1
                    Tool Warehouse Inventory

       Part                              Received
Obs    Number    Description    Stock    Date          Price
 1     BC85      clamp             57    12JAN2007     11.64
 2     K89R      seal              34    12JAN2007    245.00
 3     LK43      filter            34    12JAN2007     10.99
 4     M4J7      sander           121    12JAN2007     45.88
 5     MN21      brace             43    12JAN2007     28.87
 6     NCF3      valve            121    12JAN2007     24.50

For more information about using the UPDATE statement, see SAS Language Reference: Dictionary.

Review of SAS Tools

Statements

UPDATE master-SAS-data-set transaction-SAS-data-set;
BY identifier-list;
   replaces the values of variables in one SAS data set with nonmissing values from another SAS data set. Master-SAS-data-set is the SAS data set containing information that you want to update; transaction-SAS-data-set is the SAS data set containing information with which you want to update the master data set; identifier-list is the list of BY variables by which you identify corresponding observations.

Learning More

DATASETS procedure
   When you update a data set, you create a new data set containing the updated information. Typically, you want to use PROC DATASETS to delete the old master data set and rename the new one so that you can use the same program the next time you update the information. For more information about the DATASETS procedure, see Chapter 34, "Managing SAS Data Libraries," on page 603.

Indexes
   If a data set has an index on the variable or variables named in the BY statement that accompanies the UPDATE statement, you do not need to sort that data set. For more information about indexes, see the SAS Language Reference: Dictionary and the SAS Language Reference: Concepts.

MERGE statement
   See Chapter 18, "Merging SAS Data Sets," on page 269.
CHAPTER 20
Modifying SAS Data Sets

Introduction 311
   Purpose 311
   Prerequisites 311
Input SAS Data Set for Examples 312
Modifying a SAS Data Set: The Simplest Case 313
Modifying a Master Data Set with Observations from a Transaction Data Set 314
   Understanding the MODIFY Statement 314
   Adding New Observations to the Master Data Set 314
   Checking for Program Errors 315
   The Program 315
Understanding How Duplicate BY Variables Affect File Update 317
   How the DATA Step Processes Duplicate BY Variables 317
   The Program 318
Handling Missing Values 319
Review of SAS Tools 320
   Statements 320
Learning More 321

Introduction

Purpose

In this section, you will learn how to use the MODIFY statement in a DATA step to do the following:

- replace values in a data set
- replace values in a master data set with values from a transaction data set
- append observations to an existing SAS data set
- delete observations from an existing SAS data set.

The MODIFY statement modifies observations directly in the original master file. It does not create a copy of the file.

Prerequisites

Before continuing with this section, you should be familiar with the concepts presented in the following parts:

- Chapter 3, "Starting with Raw Data: The Basics," on page 43
- Chapter 5, "Starting with SAS Data Sets," on page 81
- Chapter 18, "Merging SAS Data Sets," on page 269
- Chapter 19, "Updating SAS Data Sets," on page 293.

Input SAS Data Set for Examples

In this section you will look at examples from an inventory tracking system that is used by a tool vendor. The examples use the SAS data set INVENTORY as input. The data set contains these variables:

PartNumber
   is a character variable that contains a unique value that identifies each item.

Description
   is a character variable that contains the text description of each item.

InStock
   is a numeric variable that contains a value that describes how many units of each tool the warehouse has in stock.

ReceivedDate
   is a numeric variable that contains the SAS date value that is the day for which InStock values are current.

Price
   is a numeric variable that contains the price of each item.

The following program creates and displays the INVENTORY data set:

options pagesize=60 linesize=80 pageno=1 nodate;

data inventory;
   input PartNumber $ Description $ InStock @17 ReceivedDate date9. @27 Price;
   format ReceivedDate date9.;
   datalines;
K89R seal    34 27jul1998 245.00
M4J7 sander  98 20jun1998 45.88
LK43 filter 121 19may1999 10.99
MN21 brace   43 10aug1999 27.87
BC85 clamp   80 16aug1999 9.55
NCF3 valve  198 20mar1999 24.50
KJ66 cutter   6 18jun1999 19.77
UYN7 rod    211 09sep1999 11.55
JD03 switch 383 09jan2000 13.99
BV1E timer   26 03aug2000 34.50
;

proc print data=inventory;
   title 'Tool Warehouse Inventory';
run;

The following output shows the results:

Output 20.1  The INVENTORY Data Set

                    Tool Warehouse Inventory                   1

       Part                     In       Received
Obs    Number    Description    Stock    Date          Price
  1    K89R      seal              34    27JUL1998    245.00
  2    M4J7      sander            98    20JUN1998     45.88
  3    LK43      filter           121    19MAY1999     10.99
  4    MN21      brace             43    10AUG1999     27.87
  5    BC85      clamp             80    16AUG1999      9.55
  6    NCF3      valve            198    20MAR1999     24.50
  7    KJ66      cutter             6    18JUN1999     19.77
  8    UYN7      rod              211    09SEP1999     11.55
  9    JD03      switch           383    09JAN2000     13.99
 10    BV1E      timer             26    03AUG2000     34.50

Modifying a SAS Data Set: The Simplest Case

You can use the MODIFY statement to replace all values for a specific variable or variables in a data set. The syntax for using the MODIFY statement for this purpose is

MODIFY SAS-data-set;

In the following program, the price of each part in the inventory is increased by 15%. The new values for PRICE replace the old values on all records in the original INVENTORY data set. The FORMAT statement in the print procedure writes the price of each item with two-digit decimal precision.
data inventory;
   modify inventory;
   price=price+(price*.15);
run;

proc print data=inventory;
   title 'Tool Warehouse Inventory';
   title2 '(Price reflects 15% increase)';
   format price 8.2;
run;

The following output shows the results:

Output 20.2  The INVENTORY Data Set with Updated Prices

                    Tool Warehouse Inventory                   1
                  (Price reflects 15% increase)

       Part                     In       Received
Obs    Number    Description    Stock    Date          Price
  1    K89R      seal              34    27JUL1998    281.75
  2    M4J7      sander            98    20JUN1998     52.76
  3    LK43      filter           121    19MAY1999     12.64
  4    MN21      brace             43    10AUG1999     32.05
  5    BC85      clamp             80    16AUG1999     10.98
  6    NCF3      valve            198    20MAR1999     28.18
  7    KJ66      cutter             6    18JUN1999     22.74
  8    UYN7      rod              211    09SEP1999     13.28
  9    JD03      switch           383    09JAN2000     16.09
 10    BV1E      timer             26    03AUG2000     39.68

Modifying a Master Data Set with Observations from a Transaction Data Set

Understanding the MODIFY Statement

The MODIFY statement replaces data in a master data set with data from a transaction data set, and makes the changes in the original master data set. You can use a BY statement to match observations from the transaction data set with observations in the master data set. The syntax for using the MODIFY statement and the BY statement is

MODIFY master-SAS-data-set transaction-SAS-data-set;
BY by-variable;

The master-SAS-data-set specifies the SAS data set that you want to modify. The transaction-SAS-data-set specifies the SAS data set that provides the values for updating the master data set. The by-variable specifies one or more variables by which you identify corresponding observations.

When you use a BY statement with the MODIFY statement, the DATA step uses dynamic WHERE processing to find observations in the master data set. Neither the master data set nor the transaction data set needs to be sorted. For large data sets, however, sorting the data before you modify it can enhance performance significantly.
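The two ideas in this section, changing records in place and locating a master observation by its BY value without sorting, can be pictured with a small Python sketch. This is an analogy only and does not describe SAS internals: the list of dictionaries stands in for the master data set, and the function names raise_prices and find_first are hypothetical. The linear scan is a deliberately simple stand-in for the lookup that SAS performs.

```python
# A few rows of INVENTORY, deliberately left unsorted.
inventory = [
    {"PartNumber": "K89R", "Price": 245.00},
    {"PartNumber": "M4J7", "Price": 45.88},
    {"PartNumber": "LK43", "Price": 10.99},
]
original_reference = inventory  # same object, to show nothing is copied


def raise_prices(records, rate):
    """Change every record in place, as MODIFY does with one data set."""
    for rec in records:
        rec["Price"] = rec["Price"] + rec["Price"] * rate


def find_first(records, key, value):
    """Return the position of the first record whose key matches.

    No sorting is required; the search simply scans until it finds
    a match, or returns None when there is none.
    """
    for index, rec in enumerate(records):
        if rec[key] == value:
            return index
    return None


raise_prices(inventory, 0.15)
position = find_first(inventory, "PartNumber", "LK43")
missing = find_first(inventory, "PartNumber", "ZZ99")
```

After the call, the first record carries the same 281.75 price that Output 20.2 shows for K89R, and the list object is the one that existed before the change.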
Adding New Observations to the Master Data Set

You can use the MODIFY statement to add observations to an existing master data set. If the transaction data set contains an observation that does not match an observation in the master data set, then SAS enables you to write a new observation to the master data set if you use an explicit OUTPUT statement in your program. When you specify an explicit OUTPUT statement, you must also specify a REPLACE statement if you want to replace observations in place. All new observations are appended to the end of the master data set.

Checking for Program Errors

You can use the _IORC_ automatic variable for error checking in your DATA step program. The _IORC_ automatic variable contains the return code for each I/O operation that the MODIFY statement attempts to perform.

The best way to test the values of _IORC_ is with the mnemonic codes that are provided by the SYSRC autocall macro. Each mnemonic code describes one condition. The mnemonics provide an easy method for testing problems in a DATA step program. The following is a partial list of codes:

_DSENMR
   specifies that the transaction data set observation does not exist in the master data set (used only with MODIFY and BY statements). If consecutive observations with different BY values do not find a match in the master data set, then both of them return _DSENMR.

_DSEMTR
   specifies that multiple transaction data set observations with a given BY value do not exist in the master data set (used only with MODIFY and BY statements). If consecutive observations with the same BY values do not find a match in the master data set, then the first observation returns _DSENMR and the subsequent observations return _DSEMTR.

_SOK
   specifies that the observation was located in the master data set.

For a complete list of mnemonic codes, see the MODIFY statement in SAS Language Reference: Dictionary.
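The rules for _SOK, _DSENMR, and _DSEMTR can be traced with a short Python sketch. It is a simplified model and rests on several assumptions: the master data set is held fixed while the transactions are classified, the strings are only labels (the real values of _IORC_ are numeric), and the function name classify is hypothetical.

```python
SOK, DSENMR, DSEMTR = "_SOK", "_DSENMR", "_DSEMTR"


def classify(master_keys, transaction_keys):
    """Label each transaction according to the rules stated above.

    master_keys is the set of BY values present in the master;
    transaction_keys is the sequence of BY values in the order
    in which the transactions are read.
    """
    codes = []
    previous_key = None
    previous_matched = True
    for key in transaction_keys:
        matched = key in master_keys
        if matched:
            code = SOK
        elif key == previous_key and not previous_matched:
            # Same unmatched BY value as the observation just before.
            code = DSEMTR
        else:
            code = DSENMR
        codes.append(code)
        previous_key, previous_matched = key, matched
    return codes


codes = classify(
    {"K89R", "KJ66"},
    ["K89R", "AA11", "BB22", "CC33", "CC33", "KJ66"],
)
```

In this run, AA11 and BB22 are consecutive unmatched observations with different BY values, so both are labeled _DSENMR; the second CC33 repeats an unmatched BY value and is labeled _DSEMTR.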
The Program

The program in this section updates values in a master data set with values from a transaction data set. If a transaction does not exist in the master data set, then the program adds the transaction to the master data set.

In this example, a warehouse received a shipment of new items, and the INVENTORY master data set must be modified to reflect the changes. The master data set contains a complete list of the inventory items. The transaction data set contains items that are on the master inventory as well as new inventory items.

The following program creates the ADD_INVENTORY transaction data set, which contains items for updating the master data set. The PartNumber variable contains the part number for the item and corresponds to PartNumber in the INVENTORY data set. The Description variable names the item. The NewStock variable contains the number of each item in the current shipment. The NewPrice variable contains the new price of the item.

The program attempts to update the master data set INVENTORY (see Output 20.1) according to the values in the transaction data set ADD_INVENTORY. The program uses the _IORC_ automatic variable to detect errors.

data add_inventory;                                     /* (1) */
   input PartNumber $ Description $ NewStock @16 NewPrice;
   datalines;
K89R seal    6 247.50
AA11 hammer 55 32.26
BB22 wrench 21 17.35
KJ66 cutter 10 24.50
CC33 socket  7 22.19
BV1E timer  30 36.50
;

options pagesize=60 linesize=80 pageno=1 nodate;

data inventory;
   modify inventory add_inventory;                      /* (2) */
   by PartNumber;
   select (_iorc_);                                     /* (3) */
      /* The observation exists in the master data set. */
      when (%sysrc(_sok)) do;                           /* (4) */
         InStock=InStock+NewStock;
         ReceivedDate=today();
         Price=NewPrice;
         replace;                                       /* (5) */
      end;
      /* The observation does not exist in the master data set. */
      when (%sysrc(_dsenmr)) do;                        /* (6) */
         InStock=NewStock;
         ReceivedDate=today();
         Price=NewPrice;
         output;                                        /* (7) */
         _error_=0;
      end;
      otherwise do;
         put 'An unexpected I/O error has occurred.'/   /* (8) */
             'Check your data and your program.';
         _error_=0;
         stop;
      end;
   end;
run;

proc print data=inventory;
   title 'Tool Warehouse Inventory';
run;

The following list corresponds to the numbered items in the preceding program:

(1) The DATA statement creates the transaction data set ADD_INVENTORY.

(2) The MODIFY statement loads the data from the INVENTORY and ADD_INVENTORY data sets.

(3) The _IORC_ automatic variable is used for error checking. The value of _IORC_ is a numeric return code that indicates the status of the most recent I/O operation.

(4) The SYSRC autocall macro checks to see if the value of _IORC_ is _SOK. If the value is _SOK, then an observation in the transaction data set matches an observation in the master data set.

(5) The REPLACE statement updates the master data set INVENTORY by replacing the observation in the master data set with the observation from the transaction data set.

(6) The SYSRC autocall macro checks to see if the value of _IORC_ is _DSENMR. If the value is _DSENMR, then an observation in the transaction data set does not exist in the master data set.

(7) The OUTPUT statement writes the current observation to the end of the master data set.

(8) If neither condition is met, the PUT statement writes a message to the log.
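The control flow of the numbered program can be followed in a Python sketch that uses the same part numbers, stock counts, and prices. It is a model of the logic only: the function name modify_inventory is hypothetical, ReceivedDate is left out to keep the result deterministic, and the error branch is omitted because a lookup in a Python list has only two outcomes.

```python
inventory = [
    {"PartNumber": "K89R", "Description": "seal", "InStock": 34, "Price": 245.00},
    {"PartNumber": "M4J7", "Description": "sander", "InStock": 98, "Price": 45.88},
    {"PartNumber": "LK43", "Description": "filter", "InStock": 121, "Price": 10.99},
    {"PartNumber": "MN21", "Description": "brace", "InStock": 43, "Price": 27.87},
    {"PartNumber": "BC85", "Description": "clamp", "InStock": 80, "Price": 9.55},
    {"PartNumber": "NCF3", "Description": "valve", "InStock": 198, "Price": 24.50},
    {"PartNumber": "KJ66", "Description": "cutter", "InStock": 6, "Price": 19.77},
    {"PartNumber": "UYN7", "Description": "rod", "InStock": 211, "Price": 11.55},
    {"PartNumber": "JD03", "Description": "switch", "InStock": 383, "Price": 13.99},
    {"PartNumber": "BV1E", "Description": "timer", "InStock": 26, "Price": 34.50},
]

add_inventory = [
    {"PartNumber": "K89R", "Description": "seal", "NewStock": 6, "NewPrice": 247.50},
    {"PartNumber": "AA11", "Description": "hammer", "NewStock": 55, "NewPrice": 32.26},
    {"PartNumber": "BB22", "Description": "wrench", "NewStock": 21, "NewPrice": 17.35},
    {"PartNumber": "KJ66", "Description": "cutter", "NewStock": 10, "NewPrice": 24.50},
    {"PartNumber": "CC33", "Description": "socket", "NewStock": 7, "NewPrice": 22.19},
    {"PartNumber": "BV1E", "Description": "timer", "NewStock": 30, "NewPrice": 36.50},
]


def modify_inventory(master, transactions):
    """Apply each transaction to the master list in place.

    Returns the number of records rewritten and the number added.
    """
    rewritten = added = 0
    for tx in transactions:
        match = next(
            (rec for rec in master if rec["PartNumber"] == tx["PartNumber"]),
            None,
        )
        if match is not None:
            # Counterpart of the _SOK branch followed by REPLACE.
            match["InStock"] += tx["NewStock"]
            match["Price"] = tx["NewPrice"]
            rewritten += 1
        else:
            # Counterpart of the _DSENMR branch followed by OUTPUT.
            master.append({
                "PartNumber": tx["PartNumber"],
                "Description": tx["Description"],
                "InStock": tx["NewStock"],
                "Price": tx["NewPrice"],
            })
            added += 1
    return rewritten, added


counts = modify_inventory(inventory, add_inventory)
stock = {rec["PartNumber"]: rec["InStock"] for rec in inventory}
```

The sketch appends unmatched transactions to the end of the list, which mirrors the statement that new observations go to the end of the master data set.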
The following output shows the results:

Output 20.3  The Updated INVENTORY Data Set

                    Tool Warehouse Inventory                   1

       Part                     In       Received
Obs    Number    Description    Stock    Date          Price
  1    K89R      seal              40    19JAN2001    247.50
  2    M4J7      sander            98    20JUN1998     45.88
  3    LK43      filter           121    19MAY1999     10.99
  4    MN21      brace             43    10AUG1999     27.87
  5    BC85      clamp             80    16AUG1999      9.55
  6    NCF3      valve            198    20MAR1999     24.50
  7    KJ66      cutter            16    19JAN2001     24.50
  8    UYN7      rod              211    09SEP1999     11.55
  9    JD03      switch           383    09JAN2000     13.99
 10    BV1E      timer             56    19JAN2001     36.50
 11    AA11      hammer            55    19JAN2001     32.26
 12    BB22      wrench            21    19JAN2001     17.35
 13    CC33      socket             7    19JAN2001     22.19

SAS writes the following message to the log:

NOTE: The data set WORK.INVENTORY has been updated. There were 3
      observations rewritten, 3 observations added and 0 observations
      deleted.

CAUTION: If you execute your program without the OUTPUT and REPLACE statements, then your master file might not update correctly. Using OUTPUT or REPLACE in a DATA step overrides the default replacement of observations. If you use these statements in a DATA step, then you must explicitly program each action that you want to take.

For more information about the MODIFY, OUTPUT, and REPLACE statements, see the Statements section in SAS Language Reference: Dictionary.

Understanding How Duplicate BY Variables Affect File Update

How the DATA Step Processes Duplicate BY Variables

When you use a BY statement with MODIFY, both the master and the transaction data sets can have observations with duplicate values of BY variables. Neither the master nor the transaction data set needs to be sorted, because BY-group processing uses dynamic WHERE processing to find an observation in the master data set.

The DATA step processes duplicate observations in the following ways:

- If duplicate BY values exist in the master data set, then MODIFY applies the current transaction to the first occurrence in the master data set.
- If duplicate BY values exist in the transaction data set, then the observations are applied one on top of another so that the values overwrite each other. The value in the last transaction is the final value in the master data set.
- If both the master and the transaction data sets contain duplicate BY values, then MODIFY applies each transaction to the first occurrence in the group in the master data set.

The Program

The program in this section updates the master data set INVENTORY_2 with observations from the transaction data set ADD_INVENTORY_2. Both data sets contain consecutive and nonconsecutive duplicate values of the BY variable PartNumber.

The following program creates the master data set INVENTORY_2. Note that the data set contains three observations for PartNumber M4J7.

data inventory_2;
   input PartNumber $ Description $ InStock @17 ReceivedDate date9. @27 Price;
   format ReceivedDate date9.;
   datalines;
K89R seal    34 27jul1998 245.00
M4J7 sander  98 20jun1998 45.88
M4J7 sander  98 20jun1998 45.88
LK43 filter 121 19may1999 10.99
MN21 brace   43 10aug1999 27.87
M4J7 sander  98 20jun1998 45.88
BC85 clamp   80 16aug1999 9.55
NCF3 valve  198 20mar1999 24.50
KJ66 cutter   6 18jun1999 19.77
;

The following program creates the transaction data set ADD_INVENTORY_2, and then modifies the master data set INVENTORY_2. Note that the data set ADD_INVENTORY_2 contains three observations for PartNumber M4J7.
options pagesize=60 linesize=80 pageno=1 nodate;

data add_inventory_2;
   input PartNumber $ Description $ NewStock;
   datalines;
K89R abc 17
M4J7 def 72
M4J7 ghi 66
LK43 jkl 311
M4J7 mno 43
BC85 pqr 75
;

data inventory_2;
   modify inventory_2 add_inventory_2;
   by PartNumber;
   ReceivedDate=today();
   InStock=InStock+NewStock;
run;

proc print data=inventory_2;
   title "Tool Warehouse Inventory";
run;

The following output shows the results:

Output 20.4  The Updated INVENTORY_2 Data Set: Duplicate BY Variables

                    Tool Warehouse Inventory                   1

       Part                     In       Received
Obs    Number    Description    Stock    Date          Price
 1     K89R      abc               51    22JAN2001    245.00
 2     M4J7      mno              279    22JAN2001     45.88
 3     M4J7      sander            98    20JUN1998     45.88
 4     LK43      jkl              432    22JAN2001     10.99
 5     MN21      brace             43    10AUG1999     27.87
 6     M4J7      sander            98    20JUN1998     45.88
 7     BC85      pqr              155    22JAN2001      9.55
 8     NCF3      valve            198    20MAR1999     24.50
 9     KJ66      cutter             6    18JUN1999     19.77

Handling Missing Values

By default, if the transaction data set contains missing values for a variable that is common to both the master and the transaction data sets, then the MODIFY statement does not replace values in the master data set with missing values.

If you want to replace values in the master data set with missing values, then you use the UPDATEMODE= option on the MODIFY statement. UPDATEMODE specifies whether missing values in a transaction data set will replace existing values in a master data set.

The syntax for using the UPDATEMODE= option with the MODIFY statement is

MODIFY master-SAS-data-set transaction-SAS-data-set UPDATEMODE=MISSINGCHECK | NOMISSINGCHECK;
BY by-variable;

MISSINGCHECK prevents missing values in a transaction data set from replacing values in a master data set. This is the default. NOMISSINGCHECK enables missing values in a transaction data set to replace values in a master data set by preventing the check for missing data from being performed.

The following example creates the master data set Event_List, which contains the schedule and codes for athletic events.
The example then updates Event_List with the transaction data set Event_Change, which contains new information about the schedule. Because the MODIFY statement uses the NOMISSINGCHECK value of the UPDATEMODE= option, values in the master data set are replaced by missing values from the transaction data set.

The following program creates the EVENT_LIST master data set:

data Event_List;
   input Event $ 1-10 Weekday $ 12-20 TimeofDay $ 22-30 Fee Code;
   datalines;
Basketball Monday    evening   10 58
Soccer     Tuesday   morning   5  33
Yoga       Wednesday afternoon 15 92
Swimming   Wednesday morning   10 63
;

The following program creates the EVENT_CHANGE transaction data set:

data Event_Change;
   input Event $ 1-10 Weekday $ 12-20 Fee Code;
   datalines;
Basketball Wednesday 10 .
Yoga       Monday    .  63
Swimming             .  .
;

The following program modifies and prints the master data set:

options pagesize=60 linesize=80 pageno=1 nodate;

data Event_List;
   modify Event_List Event_Change updatemode=nomissingcheck;
   by Event;
run;

proc print data=Event_List;
   title 'Schedule of Athletic Events';
run;

The following output shows the results:

Output 20.5  The EVENT_LIST Master Data Set: Missing Values

                  Schedule of Athletic Events                  1

Obs    Event         Weekday      TimeofDay    Fee    Code
 1     Basketball    Wednesday    evening       10       .
 2     Soccer        Tuesday      morning        5      33
 3     Yoga          Monday       afternoon      .      63
 4     Swimming                   morning        .       .

Review of SAS Tools

Statements

BY by-variable;
   specifies one or more variables to use with the BY statement. You use the BY variable to identify corresponding observations in a master data set and a transaction data set.

MODIFY master-SAS-data-set transaction-SAS-data-set UPDATEMODE=MISSINGCHECK | NOMISSINGCHECK;
   replaces the values of variables in one SAS data set with values from another SAS data set. The master-SAS-data-set contains data that you want to update. The transaction-SAS-data-set contains observations with which to update the master data set. The optional UPDATEMODE argument determines whether missing values in the transaction data set overwrite values in the master data set. The MISSINGCHECK option prevents missing values in a transaction data set from replacing values in a master data set. This is the default. The NOMISSINGCHECK option enables missing values in a transaction data set to replace values in a master data set by preventing the check for missing data from being performed.

MODIFY SAS-data-set;
   replaces the values of variables in a data set with values that you specify in your program.

OUTPUT;
   if a MODIFY statement is present, writes the current observation to the end of the master data set.

REPLACE;
   if a MODIFY statement is present, writes the current observation to the same physical location from which it was read in a data set that is named in the DATA statement.

Learning More

MERGE statement
   See Chapter 18, "Merging SAS Data Sets," on page 269.

MODIFY statement
   For complete information about the various applications of the MODIFY statement, see SAS Language Reference: Dictionary.

UPDATE statement
   See Chapter 19, "Updating SAS Data Sets," on page 293.

CHAPTER 21
Conditionally Processing Observations from Multiple SAS Data Sets

Introduction to Conditional Processing from Multiple SAS Data Sets 323
   Purpose 323
   Prerequisites 323
Input SAS Data Sets for Examples 324
Determining Which Data Set Contributed the Observation 326
   Understanding the IN= Data Set Option 326
   The Program 326
Combining Selected Observations from Multiple Data Sets 328
Performing a Calculation Based on the Last Observation 330
   Understanding When the Last Observation Is Processed 330
   The Program 330
Review of SAS Tools 332
   Statements 332
Learning More 332

Introduction to Conditional Processing from Multiple SAS Data Sets

Purpose

When combining SAS data sets, you can process observations conditionally, based on which data set contributed that observation.
You can do the following:

- Determine which data set contributed each observation in the combined data set.
- Create a new data set that includes only selected observations from the data sets that you combine.
- Determine when SAS is processing the last observation in the DATA step so that you can execute conditional operations, such as creating totals.

You have seen some of these concepts in earlier topics, but in this section you will apply them to the processing of multiple data sets. The examples use the SET statement, but you can also use all of the features that are discussed here with the MERGE, MODIFY, and UPDATE statements.

Prerequisites

Before using this section, you should understand the concepts presented in the following sections:

- Chapter 3, “Starting with Raw Data: The Basics,” on page 43
- Chapter 5, “Starting with SAS Data Sets,” on page 81
- Chapter 17, “Interleaving SAS Data Sets,” on page 263

Input SAS Data Sets for Examples

The following program creates two SAS data sets, SOUTHAMERICAN and EUROPEAN. Each data set contains the following variables:

Year
   is the year that South American and European countries competed in the World Cup Finals from 1954 to 1998.
Country
   is the name of the competing country.
Score
   is the final score of the game.
Result
   is the result of the game. The value for winners is won; the value for losers is lost.
data southamerican;
   title "South American World Cup Finalists from 1954 to 1998";
   input Year $ Country $ 9-23 Score $ 25-28 Result $ 32-36;
   datalines;
1998    Brazil          0-3    lost
1994    Brazil          3-2    won
1990    Argentina       0-1    lost
1986    Argentina       3-2    won
1978    Argentina       3-1    won
1970    Brazil          4-1    won
1962    Brazil          3-1    won
1958    Brazil          5-2    won
;

data european;
   title "European World Cup Finalists From 1954 to 1998";
   input Year $ Country $ 9-23 Score $ 25-28 Result $ 32-36;
   datalines;
1998    France          3-0    won
1994    Italy           2-3    lost
1990    West Germany    1-0    won
1986    West Germany    2-3    lost
1982    Italy           3-1    won
1982    West Germany    1-3    lost
1978    Holland         1-2    lost
1974    West Germany    2-1    won
1974    Holland         1-2    lost
1970    Italy           1-4    lost
1966    England         4-2    won
1966    West Germany    2-4    lost
1962    Czechoslovakia  1-3    lost
1958    Sweden          2-5    lost
1954    West Germany    3-2    won
1954    Hungary         2-3    lost
;

options pagesize=60 linesize=80 pageno=1 nodate;

proc sort data=southamerican;   /* 1 */
   by year;                     /* 1 */
run;

proc print data=southamerican;
   title 'World Cup Finalists:';
   title2 'South American Countries';
   title3 'from 1954 to 1998';
run;

proc sort data=european;        /* 1 */
   by year;                     /* 1 */
run;

proc print data=european;
   title 'World Cup Finalists:';
   title2 'European Countries';
   title3 'from 1954 to 1998';
run;

The following item corresponds to the numbered comments in the preceding program:

1 The PROC SORT statement sorts the data set in ascending order according to the BY variable. To create the interleaved data set in the next example, the data must be in ascending order.
Output 21.1 World Cup Finalists by Continent

World Cup Finalists:
South American Countries
from 1954 to 1998

Obs    Year    Country      Score    Result
  1    1958    Brazil       5-2      won
  2    1962    Brazil       3-1      won
  3    1970    Brazil       4-1      won
  4    1978    Argentina    3-1      won
  5    1986    Argentina    3-2      won
  6    1990    Argentina    0-1      lost
  7    1994    Brazil       3-2      won
  8    1998    Brazil       0-3      lost

World Cup Finalists:
European Countries
from 1954 to 1998

Obs    Year    Country           Score    Result
  1    1954    West Germany      3-2      won
  2    1954    Hungary           2-3      lost
  3    1958    Sweden            2-5      lost
  4    1962    Czechoslovakia    1-3      lost
  5    1966    England           4-2      won
  6    1966    West Germany      2-4      lost
  7    1970    Italy             1-4      lost
  8    1974    West Germany      2-1      won
  9    1974    Holland           1-2      lost
 10    1978    Holland           1-2      lost
 11    1982    Italy             3-1      won
 12    1982    West Germany      1-3      lost
 13    1986    West Germany      2-3      lost
 14    1990    West Germany      1-0      won
 15    1994    Italy             2-3      lost
 16    1998    France            3-0      won

Determining Which Data Set Contributed the Observation

Understanding the IN= Data Set Option

When you create a new data set by combining observations from two or more data sets, knowing which data set an observation came from can be useful. For example, you might want to perform a calculation based on which data set contributed an observation. Otherwise, you might lose important contextual information that you need for later processing. You can determine which data set contributed a particular observation by using the IN= data set option.

The IN= data set option enables you to determine which data sets have contributed to the observation that is currently in the program data vector. The syntax for this option on the SET statement is

SET SAS-data-set-1 (IN=variable) SAS-data-set-2;
BY a-common-variable;

When you use the IN= option with a data set in a SET, MERGE, MODIFY, or UPDATE statement, SAS creates a temporary variable associated with that data set. The value of variable is 1 if the data set has contributed to the observation currently in the program data vector.
The value is 0 if it has not contributed. You can use the IN= option with any or all the data sets you name in a SET, MERGE, MODIFY, or UPDATE statement, but use a different variable name in each case.

Note: The IN= variable exists during the execution of the DATA step only; it is not written to the output data set that is created.

The Program

The original data sets, SOUTHAMERICAN and EUROPEAN, do not need a variable that identifies the countries' continent because all observations in SOUTHAMERICAN pertain to the South American continent, and all observations in EUROPEAN pertain to the European continent. However, when you combine the data sets, you lose the context, which in this case is the relevant continent for each observation. The following example uses the SET statement with a BY statement to combine the two data sets into one data set that contains all the observations in chronological order:

options pagesize=60 linesize=80 pageno=1 nodate;

data finalists;
   set southamerican european;
   by year;
run;

proc print data=finalists;
   title 'World Cup Finalists';
   title2 'from 1954 to 1998';
run;

Output 21.2 World Cup Finalists Grouped by Year

World Cup Finalists
from 1954 to 1998

Obs    Year    Country           Score    Result
  1    1954    West Germany      3-2      won
  2    1954    Hungary           2-3      lost
  3    1958    Brazil            5-2      won
  4    1958    Sweden            2-5      lost
  5    1962    Brazil            3-1      won
  6    1962    Czechoslovakia    1-3      lost
  7    1966    England           4-2      won
  8    1966    West Germany      2-4      lost
  9    1970    Brazil            4-1      won
 10    1970    Italy             1-4      lost
 11    1974    West Germany      2-1      won
 12    1974    Holland           1-2      lost
 13    1978    Argentina         3-1      won
 14    1978    Holland           1-2      lost
 15    1982    Italy             3-1      won
 16    1982    West Germany      1-3      lost
 17    1986    Argentina         3-2      won
 18    1986    West Germany      2-3      lost
 19    1990    Argentina         0-1      lost
 20    1990    West Germany      1-0      won
 21    1994    Brazil            3-2      won
 22    1994    Italy             2-3      lost
 23    1998    Brazil            0-3      lost
 24    1998    France            3-0      won

Notice that this output would be more useful if it showed from which data set each observation originated. To solve this problem, the following program uses the IN= data set option in conjunction with IF-THEN/ELSE statements. The conditional statements test which data set contributed each observation and assign the appropriate value to the variable Continent in the new data set FINALISTS.

options pagesize=60 linesize=80 pageno=1 nodate;

data finalists;
   set southamerican (in=S) european;       /* 1 */
   by Year;
   if S then Continent='South America';     /* 2 */
   else Continent='Europe';
run;

proc print data=finalists;
   title 'World Cup Finalists';
   title2 'from 1954 to 1998';
run;

The following list corresponds to the numbered comments in the preceding program:

1 The IN= option in the SET statement tells SAS to create a variable named S.

2 When the current observation comes from the data set SOUTHAMERICAN, the value of S is 1. Otherwise, the value is 0. The IF-THEN/ELSE statements execute one of two assignment statements, depending on the value of S. If the observation comes from the data set SOUTHAMERICAN, then the value that is assigned to Continent is South America. If the observation comes from the data set EUROPEAN, then the value that is assigned to Continent is Europe.
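
As a further illustration (not part of the original example), the following sketch names an IN= variable for each input data set instead of for SOUTHAMERICAN alone. The variable names FromSA and FromEU are hypothetical, and the LENGTH statement is an added precaution so that the length of Continent does not depend on which assignment statement SAS encounters first.

```sas
data finalists;
   /* FromSA and FromEU are illustrative names; each is a temporary
      0/1 flag and is not written to FINALISTS. */
   set southamerican (in=FromSA)
       european      (in=FromEU);
   by Year;

   /* Declare the length so that 'South America' (13 characters) always fits. */
   length Continent $ 13;

   if FromSA then Continent='South America';
   else if FromEU then Continent='Europe';
run;
```

Because the SET statement with a BY statement interleaves the data sets, each observation comes from exactly one of them, so only one of the two IN= variables has the value 1 in any iteration. Testing each flag explicitly can make the program easier to extend if a third data set is added later.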
The following output shows the results:

Output 21.3 World Cup Finalists with Continent

World Cup Finalists
from 1954 to 1998

Obs    Year    Country           Score    Result    Continent
  1    1954    West Germany      3-2      won       Europe
  2    1954    Hungary           2-3      lost      Europe
  3    1958    Brazil            5-2      won       South America
  4    1958    Sweden            2-5      lost      Europe
  5    1962    Brazil            3-1      won       South America
  6    1962    Czechoslovakia    1-3      lost      Europe
  7    1966    England           4-2      won       Europe
  8    1966    West Germany      2-4      lost      Europe
  9    1970    Brazil            4-1      won       South America
 10    1970    Italy             1-4      lost      Europe
 11    1974    West Germany      2-1      won       Europe
 12    1974    Holland           1-2      lost      Europe
 13    1978    Argentina         3-1      won       South America
 14    1978    Holland           1-2      lost      Europe
 15    1982    Italy             3-1      won       Europe
 16    1982    West Germany      1-3      lost      Europe
 17    1986    Argentina         3-2      won       South America
 18    1986    West Germany      2-3      lost      Europe
 19    1990    Argentina         0-1      lost      South America
 20    1990    West Germany      1-0      won       Europe
 21    1994    Brazil            3-2      won       South America
 22    1994    Italy             2-3      lost      Europe
 23    1998    Brazil            0-3      lost      South America
 24    1998    France            3-0      won       Europe

Combining Selected Observations from Multiple Data Sets

To create a data set that contains only the observations that are selected according to a particular criterion, you can use the subsetting IF statement and a SET statement that specifies multiple data sets. The following DATA step reads two input data sets to create a combined data set that lists only the winning teams:

data champions(drop=result);                /* 1 */
   set southamerican (in=S) european;       /* 2 */
   by Year;
   if result='won';                         /* 3 */
   if S then Continent='South America';     /* 4 */
   else Continent='Europe';
run;

proc print data=champions;
   title 'World Cup Champions from 1954 to 1998';
   title2 'including Countries'' Continent';
run;

The following list corresponds to the numbered comments in the preceding program:

1 The DROP= data set option drops the variable Result from the new data set CHAMPIONS because all values for this variable will be the same.

2 The SET statement reads observations from two data sets: SOUTHAMERICAN and EUROPEAN. The IN= data set option creates the variable S, which is set to 1 each time an observation is contributed by the SOUTHAMERICAN data set.

3 A subsetting IF statement writes the observation to the output data set CHAMPIONS only if the value of the Result variable is won.

4 When the current observation comes from the data set SOUTHAMERICAN, the value of S is 1. Otherwise, the value is 0. The IF-THEN/ELSE statements execute one of two assignment statements, depending on the value of S. If the observation comes from the data set SOUTHAMERICAN, then the value assigned to Continent is South America. If the observation comes from the data set EUROPEAN, then the value assigned to Continent is Europe.

The following output shows the resulting data set CHAMPIONS:

Output 21.4 Combining Selected Observations

World Cup Champions from 1954 to 1998
including Countries' Continent

Obs    Year    Country         Score    Continent
  1    1954    West Germany    3-2      Europe
  2    1958    Brazil          5-2      South America
  3    1962    Brazil          3-1      South America
  4    1966    England         4-2      Europe
  5    1970    Brazil          4-1      South America
  6    1974    West Germany    2-1      Europe
  7    1978    Argentina       3-1      South America
  8    1982    Italy           3-1      Europe
  9    1986    Argentina       3-2      South America
 10    1990    West Germany    1-0      Europe
 11    1994    Brazil          3-2      South America
 12    1998    France          3-0      Europe

Performing a Calculation Based on the Last Observation

Understanding When the Last Observation Is Processed

Many applications require that you determine when the DATA step processes the last observation in the input data set. For example, you might want to perform calculations only on the last observation in a data set, or you might want to write an observation only after the last observation has been processed. For this purpose, you can use the END= option for the SET, MERGE, MODIFY, or UPDATE statement.
The syntax for this option is:

SET SAS-data-set-list END=variable;

The END= option defines a temporary variable whose value is 1 when the DATA step is processing the last observation. At all other times, the value of variable is 0. Although the DATA step can use the END= variable, SAS does not add it to the resulting data set.

Note: Chapter 12, “Using More Than One Observation in a Calculation,” on page 187 explains how to use the END= option in the SET statement with a single data set. The END= option works the same way with multiple data sets, but it is important to note that the END= variable is set to 1 only when the last observation from all input data sets is being processed.

The Program

This example uses the data in SOUTHAMERICAN and EUROPEAN to calculate how many years a team from each continent won the World Cup from 1954 to 1998. To perform this calculation, this program must perform the following tasks:

1. Identify on which continent a country is located.
2. Keep a running total of how many times a team from each continent won the World Cup.
3. After processing all observations, multiply the final total for each continent by 4 (the length of time between World Cups) to determine the length of time each continent has been a World Cup champion.
4. Write only the final observation to the output data set. The variables that contain the totals do not contain the final total until the last observation is processed.

The following DATA step calculates the running totals and produces the output data set that contains only those totals.
data timespan (keep=YearsSouthAmerican YearsEuropean);      /* 4 */
   set southamerican (in=S) european end=LastYear;          /* 1, 3 */
   by Year;
   if result='won' then
      do;
         if S then SouthAmericanWins+1;                     /* 2 */
         else EuropeanWins+1;                               /* 2 */
      end;
   if LastYear then                                         /* 3 */
      do;
         YearsSouthAmerican=SouthAmericanWins*4;
         YearsEuropean=EuropeanWins*4;
         output;                                            /* 4 */
      end;
run;

proc print data=timespan;
   title 'Total Years as Reigning World Cup Champions';
   title2 'from 1954 to 1998';
run;

The following list corresponds to the numbered comments in the preceding program:

1 The END= option creates the temporary variable LastYear. The value of LastYear is 0 until the DATA step begins processing the last observation. At that point, the value of LastYear is set to 1.

2 Two new variables, SouthAmericanWins and EuropeanWins, keep a running total of the number of victories each continent achieves. For each observation in which the value of the variable Result is won, a different sum statement executes, based on the data set that the observation came from:

SouthAmericanWins+1;

or

EuropeanWins+1;

3 When the DATA step begins processing the last observation, the value of LastYear changes from 0 to 1. When this change occurs, the condition in the IF LastYear statement becomes true, and the statements that follow it are executed. The assignment statements multiply the total number of victories for each continent by 4 and assign the result to the appropriate variable, YearsSouthAmerican or YearsEuropean.

4 The OUTPUT statement writes the observation to the newly created data set. Remember that the DATA step automatically writes an observation at the end of each iteration. However, when an OUTPUT statement is present in the DATA step, SAS turns off this automatic feature, and the DATA step writes only the last observation to TIMESPAN.
When the DATA step writes the observation from the program data vector to the output data set, it writes only two variables, YearsSouthAmerican and YearsEuropean, as directed by the KEEP= data set option in the DATA statement.

Output 21.5 Using the END= Option to Perform a Calculation Based on the Last Observation in the Data Sets

Total Years as Reigning World Cup Champions
from 1954 to 1998

Obs    Years South American    Years European
  1              24                   24

Review of SAS Tools

Statements

IF condition;
   tests whether the condition is true. If it is true, then SAS continues processing the current observation; if it is false, then SAS stops processing the observation and returns to the beginning of the DATA step. This type of IF statement is called a subsetting IF statement because it produces a subset of the original observations.

IF condition THEN action;
   tests whether the condition is true; if so, then the action in the THEN clause is executed. If the condition is false and an ELSE statement is present, then the ELSE action is executed. If the condition is false and no ELSE statement is present, then execution proceeds to the next statement in the DATA step.

SET SAS-data-set (IN=variable) SAS-data-set-list;
   creates a variable that is associated with a SAS data set. The value of variable is 1 if the data set has contributed to the observation currently in the program data vector; 0 if it has not. The IN= variable exists only while the DATA step executes; it is not written to the output data set. You can use the option with any data set that you name in the SET, MERGE, MODIFY, or UPDATE statement, but use a different variable name for each one.

SET SAS-data-set-list END=variable;
   creates a variable whose value is 0 until the DATA step starts to process its last observation. When processing of the last observation begins, the value of variable changes to 1.
   The END= variable exists only while the DATA step executes; it is not written to the output data set. You can also use the END= option with the MERGE, MODIFY, and UPDATE statements.

Learning More

DATA set options
   For an introduction to data set options, see Chapter 5, “Starting with SAS Data Sets,” on page 81.

DO statement
   See Chapter 13, “Finding Shortcuts in Programming,” on page 201.

IF statements
   For more information about both the subsetting and conditional IF statements, see Chapter 9, “Acting on Selected Observations,” on page 139.

OUTPUT and subsetting IF statement
   See Chapter 10, “Creating Subsets of Observations,” on page 159.

SUM statement and END= option
   See Chapter 12, “Using More Than One Observation in a Calculation,” on page 187.

PART 5
Understanding Your SAS Session

Chapter 22  Analyzing Your SAS Session with the SAS Log
Chapter 23  Directing SAS Output and the SAS Log
Chapter 24  Diagnosing and Avoiding Errors

CHAPTER 22
Analyzing Your SAS Session with the SAS Log

Introduction to Analyzing Your SAS Session with the SAS Log
   Purpose
   Prerequisites
Understanding the SAS Log
   Understanding the Role of the SAS Log
   Resolving Errors with the Log
   Locating the SAS Log
Understanding the Log Structure
   Detecting a Syntax Error
   Examining the Components of a Log
Writing to the SAS Log
   Default Output to the SAS Log
   Using the PUT Statement
   Using the LIST Statement
Suppressing Information to the SAS Log
   Using SAS System Options to Suppress Log Output
   Suppressing SAS Statements
   Suppressing System Notes
   Limiting the Number of Error Messages
   Suppressing SAS Statements, Notes, and Error Messages
Changing the Log's Appearance
Review of SAS Tools
   Statements
   System Options
Learning More

Introduction to Analyzing Your SAS Session with the SAS Log

Purpose

The SAS log is a useful tool for analyzing your SAS
session and programs. In this section, you will learn about the following:

- the log in relation to output
- the log structure
- the log's default destination, which depends on the method that you use to run SAS

You will also learn how to do the following:

- write to the log
- suppress information from being written to the log

Prerequisites

You should understand the basic SAS programming concepts that are presented in the following sections:

- Chapter 1, “What Is the SAS System?,” on page 3
- Chapter 2, “Introduction to DATA Step Processing,” on page 19
- Chapter 3, “Starting with Raw Data: The Basics,” on page 43

Understanding the SAS Log

Understanding the Role of the SAS Log

The SAS log results from executing a SAS program, and in that sense it is output. The SAS log provides a record of everything that you do in your SAS session or with your SAS program, from the names of the data sets that you have created to the number of observations and variables in those data sets. This record can tell you what statements were executed, how much time the DATA and PROC steps required, and whether your program contains errors.

As with SAS output, the destination of the SAS log varies depending on your method of running SAS and on your operating environment. The content of the SAS log varies according to the DATA and PROC steps that are executed and the options that are used.

The sample log in the following output was generated by a SAS program that contains two PROC steps.* Another typical log is described in detail later in the section.
Output 22.1 A Sample SAS Log

NOTE: Libref OUT was successfully assigned as follows:
      Engine:        V8
      Physical Name: YOUR-DATA-LIBRARY
57   options linesize=120;
58
59   proc sort data=out.sat_scores;
60      by test;
61   run;
62
63   proc plot data=out.sat_scores;
64      by test;
65      label SATscore='SAT score';
66      plot SATscore*year / haxis= 1972 1975 1978 1981 1984 1987 1990 1993 1996 1999;
67      title1 'SAT Scores by Year, 1972-1999';
68      title3 'Separate statistics by Test Type';
69   run;

NOTE: There were 108 observations read from the data set OUT.SAT_SCORES.

* The DATA step that created this data set is shown in the Appendix. The data set is stored in a SAS data library referenced by the libref OUT throughout the rest of this section. For examples in which raw data is read, the raw data is shown in the Appendix.

Resolving Errors with the Log

The SAS program that generated the log in the previous example ran without errors. If the program had contained errors, then those errors would have been reflected, as part of the session, in the log. SAS generates messages for data errors, syntax errors, and programming errors. You can browse those messages, make necessary changes to your program, and then rerun it successfully.

Locating the SAS Log

The destination of your log depends on the method you are using to start, run, and exit SAS. It also depends on your operating environment and on the setting of SAS system options. The following table shows the default destination for each method of operation:

Method of Operation                                    Destination of SAS Log
SAS windowing environment (interactive full-screen)    Log window
interactive line mode                                  on the terminal display, as statements are entered
noninteractive SAS programs                            depends on the operating environment
batch jobs                                             line printer or disk file

Understanding the Log Structure

Detecting a Syntax Error

The following SAS program contains one DATA step and two PROC steps.
However, the DATA statement has a syntax error: it does not end with a semicolon.

/* omitted semicolon */
data out.sat_scores4
   infile 'your-input-file';
   input test $ 1-8 gender $ 18 year 20-23 score 25-27;
run;

proc sort data = out.sat_scores4;
   by test;
run;

proc print data = out.sat_scores4;
   by test;
run;

The following output shows the results. Although some variation occurs across operating environments and among methods of running SAS, this SAS log is a representative sample. The numbers in parentheses are callouts that are explained after the output; they are not part of the log.

Output 22.2 Analyzing a SAS Log with Error Messages

/* omitted semicolon */
data out.sat_scores4                                                    (1)
   infile 'your-input-file';
   input test $ 1-8 gender $ 18 year 20-23 score 25-27;
run;

ERROR: No CARDS or INFILE statement.                                    (2)
ERROR: The value YOUR-INPUT-FILE is not a valid SAS name.
NOTE: The SAS System stopped processing this step because of errors.    (3)
WARNING: The data set OUT.SAT_SCORES4 may be incomplete. When this step
         was stopped there were 0 observations and 4 variables.
WARNING: Data set OUT.SAT_SCORES4 was not replaced because this step
         was stopped.
WARNING: The data set WORK.INFILE may be incomplete. When this step
         was stopped there were 0 observations and 4 variables.

proc sort data=out.sat_scores4;                                         (1)
   by test;
run;

NOTE: Input data set is empty.                                          (3)
NOTE: The data set OUT.SAT_SCORES4 has 0 observations and 4 variables.  (4)

proc print data=out.sat_scores4;                                        (1)
   by test;
run;

NOTE: No observations in data set OUT.SAT_SCORES4.

Examining the Components of a Log

The SAS log provides valuable information, especially if you have questions and need to contact your site's SAS Support Consultant or SAS Technical Support, because the contents of the log will help them diagnose your problem.

The following list corresponds to the numbered items in the preceding log:

1 SAS statements for the DATA and PROC steps
2 error messages
3 notes, which might include warning messages
4 notes that contain the number of observations and variables for each data set that is created

Writing to the SAS Log

Default Output to the SAS Log

The previous sample logs show the information that appears on the log by default. You can also write to the log by using the PUT statement or the LIST statement within a DATA step. These statements can be used to debug your SAS programs.

Using the PUT Statement

The PUT statement enables you to write information that you specify, including text strings and variable values, to the log. Values can be written in column, list, formatted, or named output style.* Used as a screening device, the PUT statement can also be a useful debugging tool. For example, the following statement writes the values of all variables, including the automatic variables _ERROR_ and _N_, that are defined in the current DATA step:

put _all_;

The following program reads the data set OUT.SAT_SCORES and uses the PUT statement to write to the SAS log the records for which the score is 500 points or more. The partial output that follows the program shows that the records are written to the log immediately after the SAS statements:

libname out 'your-data-library';

data _null_;
   set out.sat_scores;
   if SATscore >= 500 then put test gender year;
run;

Output 22.3 Writing to the SAS Log with the PUT Statement

NOTE: Libref OUT was successfully assigned as follows:
      Engine:        V8
      Physical Name: YOUR-DATA-LIBRARY
123
124  data _null_;
125     set out.sat_scores;
126     if SATscore >= 500 then put test gender year;
127  run;

Math m 1972
Math m 1973
Math m 1974
.
.
.

* Named output enables you to write a variable's name as well as its value to the SAS log. For more information, see “PUT, Named” in the Statements section of SAS Language Reference: Dictionary.

Using the LIST Statement

Use the LIST statement in the DATA step to list on the log the current input record.
The following program shows that the LIST statement, like the PUT statement, can be very effective when combined with conditional processing to write selected information to the log:

data out.sat_scores3;
   infile 'your-input-file';
   input test $ gender $ year SATscore @@;
   if SATscore < 500 then delete;
   else list;
run;

When the LIST statement is executed, SAS causes the current input buffer to be printed following the DATA step. The following partial output shows the results. Note the presence of the columns ruler before the first line. The ruler indicates that input data has been written to the log. It can be used to reference column positions in the input buffer. Also notice that, because two observations are created from each input record, the entire input record is printed whenever either value of the SATscore variable from that input line is at least 500. Finally, note that the LIST statement causes the record length to be printed at the end of each line (in this case, each record has a length of 36). This feature of the LIST statement works only in operating environments that support variable-length (as opposed to fixed-length) input records.

Output 22.4 Writing to the SAS Log with the LIST Statement

NOTE: Libref OUT was successfully assigned as follows:
      Engine:        V8
      Physical Name: YOUR-DATA-LIBRARY
248  data out.sat_scores3;
249     infile 'YOUR-DATA-FILE';
250     input test $ gender $ year SATscore @@;
251     if SATscore < 500 then delete;
252     else list;
253  run;

NOTE: The infile 'YOUR-DATA-FILE' is:
      File Name=YOUR-DATA-FILE,
      Owner Name=userid,Group Name=dev,
      Access Permission=rw-r--r--,
      File Size (bytes)=1998

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1         Verbal m 1972 531 Verbal f 1972 529 36
2         Verbal m 1973 523 Verbal f 1973 521 36
3         Verbal m 1974 524 Verbal f 1974 520 36
.
.
.
53        Math m 1997 530 Math f 1997 494 36
54        Math m 1998 531 Math f 1998 496 36
NOTE: 54 records were read from the infile 'YOUR-DATA-FILE'.
      The minimum record length was 36.
      The maximum record length was 36.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set OUT.SAT_SCORES3 has 69 observations and 4 variables.

Suppressing Information to the SAS Log

Using SAS System Options to Suppress Log Output

There might be times when you want to prevent some information from being written to the SAS log. You can suppress SAS statements, system messages, and error messages with the NOSOURCE, NONOTES, and ERRORS= SAS system options. You can specify these options when you invoke SAS, in the OPTIONS window, or in an OPTIONS statement. In this section, the options are specified in OPTIONS statements. Note that all SAS system options remain in effect for the duration of your session or until you change them.

Suppressing SAS Statements

If you regularly execute large SAS programs without making changes, then you can use the NOSOURCE system option as follows to suppress the listing of the SAS statements to the log:

options nosource;

The NOSOURCE option causes only source lines that contain errors to be printed. You can return to the default by specifying the SOURCE system option as follows:

options source;

The SOURCE option causes all subsequent source lines to be printed.

You can also control whether secondary source statements (from files that are included with a %INCLUDE statement) are printed on the SAS log.
Specify the following statement to suppress secondary statements:

options nosource2;

The following OPTIONS statement causes secondary source statements to print to the log:

options source2;

Suppressing System Notes

Much of the information that is supplied by the log appears as notes, including

- copyright information
- licensing and site information
- number of observations and variables in the data set.

SAS also issues a note to tell you that it has stopped processing a step because of errors.

If you do not want the notes to appear on the log, then use the NONOTES system option to suppress their printing:

options nonotes;

All messages starting with NOTE: are suppressed. You can return to the default by specifying the NOTES system option:

options notes;

Limiting the Number of Error Messages

SAS prints messages for data input errors that appear in your SAS program; the default number is usually 20 but might vary from site to site. Use the ERRORS= system option to specify the maximum number of observations for which error messages are printed.

Note that this option limits only the error messages that are produced for incorrect data. This kind of error is caused primarily by trying to read character values for a variable that the INPUT statement defines as numeric.

If data errors are detected in more observations than the number you specify, then processing continues, but error messages do not print for the additional errors. For example, the following OPTIONS statement specifies printing for a maximum of five observations:

options errors=5;

However, as discussed in “Suppressing SAS Statements, Notes, and Error Messages” on page 343, it might be dangerous to suppress error messages.

Note: No option is available to eliminate warning messages.
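
Because system options stay in effect until you change them, one common pattern is to suppress log output only around a program that has already been tested and then restore the settings afterward. The following lines are a minimal sketch of that pattern, not an example from the original text. The restored value ERRORS=20 assumes the usual default mentioned above, which might differ at your site.

```sas
/* Suppress source lines and notes, and limit data-error messages to
   five observations, before running an already-tested program. */
options nosource nonotes errors=5;

/* ... the tested DATA and PROC steps would run here ... */

/* Restore the settings so that later steps are fully logged.
   ERRORS=20 is assumed to be the site default. */
options source notes errors=20;
```

Restoring the options in the same program keeps the suppression from silently carrying over to whatever you run next in the session.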
Suppressing SAS Statements, Notes, and Error Messages

The following SAS program reads the test score data as in the other examples in this section, but in this example the dollar sign ($) that identifies the variable GENDER as a character variable is omitted. Also, the data is not sorted before using a BY statement with PROC PRINT. At the same time, for efficiency, SAS statements, notes, and error messages are suppressed.

   libname out 'your-data-library';
   options nosource nonotes errors=0;

   data out.sats5;
      infile 'your-input-file';
      input test $ gender year SATscore 25-27;
   run;

   proc print;
      by test;
   run;

This program does not generate output. The SAS log that appears is shown in the following output. Because the SAS system option ERRORS=0 is specified, the error limit is reached immediately, and the errors that result from trying to read GENDER as a numeric value are not printed. Also, specifying the NOSOURCE and NONOTES system options causes the log to contain no SAS statements that can be verified and no notes to explain what happened. The log does contain an error message that explains that OUT.SATS5 is not sorted in ascending sequence. This error is not caused by invalid input data, so the ERRORS=0 option has no effect on this error.

Output 22.5  Suppressing Information to the SAS Log

NOTE: Libref OUT was successfully assigned as follows:
      Engine:        V8
      Physical Name: YOUR-DATA-LIBRARY
370  options nosource nonotes errors=0;
ERROR: Limit set by ERRORS= option reached. Further errors of this type will not be printed.
ERROR: Data set OUT.SAT_SCORES5 is not sorted in ascending sequence. The current by-group has test = Verbal and the next by-group has test = Math.

Note: The NOSOURCE, NONOTES, and ERRORS= system options are used to save space. They are most useful with an already-tested program, perhaps one that is run regularly. However, as demonstrated in this section, they are not always appropriate.
During development of a new program, the error messages in the log might be essential for debugging, and should not be limited. Similarly, notes should not be suppressed because they can help you pinpoint problems with a program. They are especially important if you seek help in debugging your program from someone not already familiar with it. In short, you should not suppress any information in the log until you have already executed the program without errors.

The following partial output shows the results if the previous sample SAS code is reexecuted with the SOURCE, NOTES, and ERRORS= options.

Output 22.6  Debugging with the SAS Log

412  options source notes errors=20;
413
414  data out.sat_scores5;
415     infile 'YOUR-DATA-FILE';
416     input test $ gender year score @@;
417  run;

NOTE: The infile 'YOUR-DATA-FILE' is:
      File Name=YOUR-DATA-FILE,
      Owner Name=userid,Group Name=dev,
      Access Permission=rw-r--r--,
      File Size (bytes)=1998

NOTE: Invalid data for gender in line 1 8-8.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1         Verbal m 1972 531  Verbal f 1972 529 36
test=Verbal gender=. year=1972 score=531 _ERROR_=1 _N_=1
NOTE: Invalid data for gender in line 1 27-27.
test=Verbal gender=. year=1972 score=529 _ERROR_=1 _N_=2
NOTE: Invalid data for gender in line 2 8-8.
2         Verbal m 1973 523  Verbal f 1973 521 36
test=Verbal gender=. year=1973 score=523 _ERROR_=1 _N_=3
NOTE: Invalid data for gender in line 2 27-27.
test=Verbal gender=. year=1973 score=521 _ERROR_=1 _N_=4
.
.
.
NOTE: Invalid data for gender in line 10 8-8.
10        Verbal m 1981 508  Verbal f 1981 496 36
test=Verbal gender=. year=1981 score=508 _ERROR_=1 _N_=19
NOTE: Invalid data for gender in line 10 27-27.
ERROR: Limit set by ERRORS= option reached. Further errors of this type will not be printed.
test=Verbal gender=. year=1981 score=496 _ERROR_=1 _N_=20
NOTE: 54 records were read from the infile 'YOUR-DATA-FILE'.
      The minimum record length was 36.
      The maximum record length was 36.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set OUT.SAT_SCORES5 has 108 observations and 4 variables.
418
419  proc print;
420     by test;
421  run;
ERROR: Data set OUT.SAT_SCORES5 is not sorted in ascending sequence. The current by-group has test = Verbal and the next by-group has test = Math.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 55 observations read from the data set OUT.SAT_SCORES5.

Again, this program does not generate output, but this time the log is a more effective problem-solving tool. The log includes all the SAS statements from the program as well as many informative notes. Specifically, it includes enough messages about the invalid data for the variable GENDER that the problem can be spotted. With this information, the program can be modified and rerun successfully.

Changing the Log’s Appearance

Chapter 31, “Understanding and Customizing SAS Output: The Basics,” on page 537 shows you how to customize your output. Except in an interactive session, you can also customize the log by using the PAGE and SKIP statements. Use the PAGE statement to move to a new page on the log; use the SKIP statement to skip lines on the log. With the SKIP statement, specify the number of lines that you want to skip; if you do not specify a number, then one line is skipped. If the number that you specify exceeds the number of lines remaining on the page, then SAS treats the SKIP statement like a PAGE statement and skips to the top of the next page. The PAGE and SKIP statements do not appear on the log.
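As a minimal sketch of where such a statement can go, the program from Output 22.6 is repeated here with a PAGE statement placed between the DATA step and the PROC PRINT step. The libref and file names are the same placeholders that the earlier examples use, and the program still contains the deliberate errors discussed above (GENDER is read as numeric and the data is unsorted).

```sas
libname out 'your-data-library';
options source notes errors=20;

data out.sat_scores5;
   infile 'your-input-file';
   input test $ gender year score @@;
run;

/* PAGE is a global statement: the log for the next step starts on a    */
/* new page. Writing SKIP 3; here instead would leave three blank lines */
/* rather than starting a new page.                                     */
page;

proc print;
   by test;
run;
```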
The following output shows the result if a PAGE statement is inserted before the PROC PRINT step in the previous example:

Output 22.7  Using the PAGE Statement

456  options source notes errors=20;
457
458  data out.sat_scores5;
459     infile
459! '/dept/pub/doc/901/authoring/basess/miscsrc/rawdata/sat_scores.raw';
460     input test $ gender year score @@;
461  run;

NOTE: The infile 'YOUR-DATA-FILE' is:
      File Name=YOUR-DATA-FILE,
      Owner Name=userid,Group Name=dev,
      Access Permission=rw-r--r--,
      File Size (bytes)=1998

NOTE: Invalid data for gender in line 1 8-8.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1         Verbal m 1972 531  Verbal f 1972 529 36
test=Verbal gender=. year=1972 score=531 _ERROR_=1 _N_=1
NOTE: Invalid data for gender in line 1 27-27.
test=Verbal gender=. year=1972 score=529 _ERROR_=1 _N_=2
NOTE: Invalid data for gender in line 2 8-8.
2         Verbal m 1973 523  Verbal f 1973 521 36
test=Verbal gender=. year=1973 score=523 _ERROR_=1 _N_=3
NOTE: Invalid data for gender in line 2 27-27.
test=Verbal gender=. year=1973 score=521 _ERROR_=1 _N_=4
.
.
.
NOTE: Invalid data for gender in line 10 8-8.
10        Verbal m 1981 508  Verbal f 1981 496 36
test=Verbal gender=. year=1981 score=508 _ERROR_=1 _N_=19
NOTE: Invalid data for gender in line 10 27-27.
ERROR: Limit set by ERRORS= option reached. Further errors of this type will not be printed.
test=Verbal gender=. year=1981 score=496 _ERROR_=1 _N_=20
NOTE: 54 records were read from the infile '/dept/pub/doc/901/authoring/basess/miscsrc/rawdata/sat_scores.raw'.
      The minimum record length was 36.
      The maximum record length was 36.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set OUT.SAT_SCORES5 has 108 observations and 4 variables.

465  proc print;
466     by test;
467  run;
ERROR: Data set OUT.SAT_SCORES5 is not sorted in ascending sequence. The current by-group has test = Verbal and the next by-group has test = Math.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 55 observations read from the data set OUT.SAT_SCORES5.

Review of SAS Tools

Statements

The following statements are used to write to the log and to change the log’s appearance:

LIST;
   lists on the SAS log the contents of the input buffer for the observation being processed.

PAGE;
   skips to a new page on the log.

PUT <variable-list> | <_ALL_>;
   writes lines to the SAS log, the output file, or any file that is specified in a FILE statement. If no FILE statement has been executed in this iteration of the DATA step, then the PUT statement writes to the SAS log. Variable-list names the variables whose values are to be written, and _ALL_ signifies that the values of all variables, including _ERROR_ and _N_, are to be written to the log.

SKIP <n>;
   on the SAS log, skips the number of lines that you specify with the value n. If the number is greater than the number of lines remaining on the page, then SAS treats the SKIP statement like a PAGE statement and skips to the top of the next page.

System Options

The following system options are used to suppress information to the log. In this section, they are specified in OPTIONS statements.

ERRORS=n
   specifies the maximum number of observations for which error messages about data input errors are printed.

NOTES|NONOTES
   controls whether notes are printed to the log.

SOURCE|NOSOURCE
   controls whether SAS statements are printed to the log.

SOURCE2|NOSOURCE2
   controls whether secondary SAS statements from files included by %INCLUDE statements are printed to the log.

Learning More

Automatic variables
   Chapter 24, “Diagnosing and Avoiding Errors,” on page 357 discusses the automatic variables _N_ and _ERROR_.

FILE and PUT statements
   Chapter 31, “Understanding and Customizing SAS Output: The Basics,” on page 537 discusses the FILE and PUT statements.
The Log window
   Chapter 39, “Using the SAS Windowing Environment,” on page 655 discusses the Log window.

Operating environment-specific information
   The SAS documentation for your operating environment contains information about the appearance and destination of the SAS log, as well as for routing output.

The SAS environment
   Chapter 38, “Introducing the SAS Environment,” on page 643 provides information about methods of operation and on specifying SAS system options when you invoke SAS. It also discusses executing SAS statements automatically.

The SAS log
   SAS Language Reference: Concepts provides complete reference information about the SAS log.

SAS statements
   SAS Language Reference: Dictionary provides complete reference information about the SAS statements that are discussed in this section.

SAS system options
   SAS Language Reference: Dictionary provides complete reference information about SAS options that work across all operating environments. Refer to the SAS documentation for your operating environment for information about operating environment-specific options.

Your SAS session
   Other sections provide more information about your SAS session. See especially Chapter 24, “Diagnosing and Avoiding Errors,” on page 357, which contains more information about error messages.
CHAPTER 23
Directing SAS Output and the SAS Log

Introduction to Directing SAS Output and the SAS Log 349
   Purpose 349
   Prerequisites 350
   Input File and SAS Data Set for Examples 350
Routing the Output and the SAS Log with PROC PRINTTO 351
   Routing Output to an Alternate Location 351
   Routing the SAS Log to an Alternate Location 352
   Restoring the Default Destination 353
Storing the Output and the SAS Log in the SAS Windowing Environment 353
   Understanding the Default Destination 353
   Storing the Contents of the Output and Log Windows 354
Redefining the Default Destination in a Batch or Noninteractive Environment 354
   Determining the Default Destination 354
   Changing the Default Destination 354
   Understanding the Configuration File 355
Review of SAS Tools 355
   PROC PRINTTO Statement Options 355
   SAS Windowing Environment Commands 356
   SAS System Options 356
Learning More 356

Introduction to Directing SAS Output and the SAS Log

Purpose

SAS provides several methods to direct SAS output and the SAS log to different destinations. In this section, you will learn how to use the following SAS language elements:
- PRINTTO procedure from within a program or session to route DATA step output, the SAS log, or procedure output from their default destinations to another destination
- FILE command, in the SAS windowing environment, to store the contents of the Log and Output windows in files
- PRINT= and LOG= system options when you invoke SAS to redefine the destination of the log and output for an entire SAS session

Prerequisites

Before proceeding with this section, you should be familiar with the following features and concepts:
- creating DATA step or PROC step output
- locating the log and procedure output
- referencing external files

Input File and SAS Data Set for Examples

The examples in this section are based on data from a university entrance exam called the Scholastic Aptitude Test, or SAT.
The data is provided in one input file that contains the average SAT scores of entering university classes from 1972 to 1998.* The input file has the following structure:

   Verbal m 1972 531  Verbal f 1972 529
   Verbal m 1973 523  Verbal f 1973 521
   Verbal m 1974 524  Verbal f 1974 520
   Verbal m 1975 515  Verbal f 1975 509
   Verbal m 1976 511  Verbal f 1976 508

The input file contains the following values from left to right:
- type of SAT exam
- gender of student
- year of the exam
- average exam score of the first-year class

The following program creates the data set that this section uses:

   data sat_scores;
      input Test $ Gender $ Year SATscore @@;
      datalines;
   Verbal m 1972 531  Verbal f 1972 529
   Verbal m 1973 523  Verbal f 1973 521
   Verbal m 1974 524  Verbal f 1974 520
      ...more data lines...
   Math   m 1996 527  Math   f 1996 492
   Math   m 1997 530  Math   f 1997 494
   Math   m 1998 531  Math   f 1998 496
   ;

* See Chapter 31, “Understanding and Customizing SAS Output: The Basics,” on page 537 for a complete listing of the input data.

Routing the Output and the SAS Log with PROC PRINTTO

Routing Output to an Alternate Location

You can use the PRINTTO procedure to redirect SAS procedure output from the listing destination to an alternate location. These locations are:
- a permanent file
- a SAS catalog entry
- a dummy file, which serves to suppress the output

After PROC PRINTTO executes, all procedure output is sent to the alternate location until you execute another PROC PRINTTO statement or until your program or session ends. The default destination for the procedure output depends on how you configure SAS to handle output. For more information, see the discussion of SAS output in Chapter 31, “Understanding and Customizing SAS Output: The Basics,” on page 537.

Note: If you used the Output Delivery System (ODS) to close the listing destination, then PROC PRINTTO does not receive any output to redirect.
However, the procedure results still go to the destination that you specified with ODS.

You use the PRINT= option in the PROC PRINTTO statement to specify the name of the file or SAS catalog that will contain the procedure output. If you specify a file, then either use the complete name of the file in quotation marks or use a fileref for the file. (See “Using External Files in Your SAS Job” on page 38 for more information about filerefs and filenames.) You can also specify the NEW option in the PROC PRINTTO statement so that SAS replaces the previous contents of the output file. Otherwise, SAS appends the output to any output that is currently in the file.

To route output to an alternate file, insert a PROC PRINTTO step in the program before the PROC step that generates the procedure output. The following program routes the output from PROC PRINT to an external file:

   proc printto print='alternate-output-file' new;
   run;

   proc print data=sat_scores;
      title 'Mean SAT Scores for Entering University Classes';
   run;

   proc printto;
   run;

After the PROC PRINT step executes, alternate-output-file contains the procedure output. The second PROC PRINTTO step redirects output back to its default destination.

The PRINTTO procedure does not produce the output. Instead, it tells SAS to route the results of all subsequent procedures to the alternate location until another PROC PRINTTO statement executes. Therefore, the PROC PRINTTO statement must precede the procedure whose output you want to route.

Figure 23.1 on page 352 shows how SAS uses PROC PRINTTO to route procedure output. You can also use PROC PRINTTO multiple times in a program so that output from different steps of a SAS job is stored in different files.
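A minimal sketch of that last point follows, assuming the SAT_SCORES data set created earlier in this section. The two file names are hypothetical placeholders, and the PROC MEANS step is added here only so that there is a second procedure whose output can be routed separately.

```sas
/* Route the listing from the first procedure to one file. */
proc printto print='alternate-output-file-1' new;
run;

proc print data=sat_scores;
   title 'Mean SAT Scores for Entering University Classes';
run;

/* Switch to a second file before the next procedure runs. */
proc printto print='alternate-output-file-2' new;
run;

proc means data=sat_scores mean;
   class Test Gender;
   var SATscore;
run;

/* Send the log and procedure output back to their default destinations. */
proc printto;
run;
```

Each PROC PRINTTO step affects only the procedures that follow it, so the order of the steps determines which file receives which results.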
Figure 23.1  Using PROC PRINTTO to Route Output

   proc ...;
   proc ...;
   proc printto file=alt-dest-1;
   proc ...;
   proc printto file=alt-dest-2;
   proc ...;

Routing the SAS Log to an Alternate Location

You can use the PRINTTO procedure to redirect the SAS log to an alternate location. The location can be one of the following:
- a permanent file
- a SAS catalog entry
- a dummy file to suppress the log

After PROC PRINTTO executes, the log is sent either to a permanent external file or to a SAS catalog entry until you execute another PROC PRINTTO statement, or until your program or session ends.

You use the LOG= option in the PROC PRINTTO statement to specify the name of the file or SAS catalog that will contain the log. If you specify a file, then either use the complete name of the file in quotation marks or use a fileref for the file. You can also specify the NEW option in the PROC PRINTTO statement so that SAS replaces the previous contents of the file. Otherwise, SAS appends the log to any log that is currently in the file.

The following program routes the SAS log to an alternate file:

   proc printto log='alternate-log-file';
   run;

After the PROC PRINT step executes, alternate-log-file contains the SAS log. The contents of this file are shown in the following output:

Output 23.1  Using the PRINTTO Procedure to Route the SAS Log to an Alternate File

8    data sat_scores;
9       input Test $ Gender $ Year SATscore @@;
10      datalines;
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.SAT_SCORES has 108 observations and 4 variables.
65   ;
66   proc print data=sat_scores;
67      title 'Mean SAT Scores for Entering University Classes';
68   run;
NOTE: There were 108 observations read from the dataset WORK.SAT_SCORES.
69   proc printto; run;

Restoring the Default Destination

Specify the PROC PRINTTO statement with no argument when you want to route the log and the output back to their default destinations:

   proc printto;
   run;

You might want to return only the log or only the procedure output to its default destination. The following PROC PRINTTO statement routes only the log back to the default destination:

   proc printto log=log;
   run;

The following PROC PRINTTO statement routes only the procedure output to the default destination:

   proc printto print=print;
   run;

Storing the Output and the SAS Log in the SAS Windowing Environment

Understanding the Default Destination

Within the SAS windowing environment, the default destination for most procedure output is a monospace listing that appears in the Output window. However, you can use the Output Delivery System (ODS) to change which destinations are opened and closed. Each time you execute a procedure within a single session, SAS appends the output to the existing output. To view the results, you can
- scroll the Output window, which contains the output in the order in which you generated it
- use the Results window to select a pointer that is a link to the procedure output.

The SAS windowing environment interacts with certain aspects of the ODS to format, control, and manage your output.

In the SAS windowing environment, the default destination for the SAS log messages is the Log window. When you execute a procedure, SAS appends the log messages to the existing log messages in the Log window. You can scroll the Log window to see the results. To print your log messages, execute the PRINT command. To clear the contents of the Log window, execute the CLEAR command. When your session ends, SAS automatically clears the window.
Within the SAS windowing environment, you can use the PRINTTO procedure to route log messages or procedure output to a location other than the default location, just as you can in other methods of operation. For details, see “Routing the Output and the SAS Log with PROC PRINTTO” on page 351.

You can also use ODS to change the destination of the procedure output. For additional information about using ODS, viewing procedure output, and changing the destination of the procedure output, see Chapter 31, “Understanding and Customizing SAS Output: The Basics,” on page 537.

Storing the Contents of the Output and Log Windows

If you want to store a copy of the contents of the Output or Log window in a file, then use the FILE command. On the command line, specify the FILE command followed by the name of the file:

   file 'file-to-store-contents-of-window'

SAS has a built-in safeguard that prevents you from accidentally overwriting a file. If you inadvertently specify an existing file, then a dialog box appears. The dialog box gives you information about the file and asks you to choose a course of action, so that you do not overwrite the file by mistake. You are asked whether to:
- replace the contents of the file
- append the contents of the window to the file
- cancel the FILE command

Redefining the Default Destination in a Batch or Noninteractive Environment

Determining the Default Destination

Usually, in a batch or noninteractive environment, SAS routes procedure output to the listing file and routes the SAS log to a log file. These files are usually defined by your installation and are created automatically when you invoke SAS. Contact your SAS Support Consultant if you have questions pertaining to your site.

Changing the Default Destination

If you want to redefine the default destination for procedure output, then use the PRINT= system option. If you want to redefine the default destination for the SAS log, then use the LOG= system option.
You specify these options only at initialization.

Operating Environment Information: The way that you specify output destinations when you use SAS system options depends on your operating environment. For details, see the SAS documentation for your operating environment.

Options that you must specify at initialization are called configuration options. The configuration options affect the following:
- the initialization of the SAS System
- the hardware interface
- the operating system interface

In contrast to other SAS system options, which affect the appearance of output, file handling, use of system variables, or processing of observations, you cannot change configuration options in the middle of a program. You specify configuration options when SAS is invoked, either in the configuration file or in the SAS command.

Understanding the Configuration File

The configuration file is a special file that contains configuration options as well as other SAS system options and their settings. Each time you invoke SAS, the settings of the configuration file are examined. You can specify the options in the configuration file in the same format as they are used in the SAS command for your operating environment. For example, under UNIX this file’s contents might include the following:

   WORK=WORK
   SASUSER=SASUSER
   EXPLORER

SAS automatically sets the options as they appear in the configuration file. If you specify options both in the configuration file and in the SAS command, then the options are concatenated. If you specify the same option in the SAS command and in the configuration file, then the setting in the SAS command overrides the setting in the file. For example, specifying the NOEXPLORER option in the SAS command overrides the EXPLORER option in the configuration file and tells SAS to start your session without displaying the Explorer window.
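Because settings can come from both the configuration file and the SAS command, it can help to confirm which values are actually in effect once the session has started. The following is a minimal sketch that uses the OPTIONS procedure for this purpose. It assumes that the LOG=, PRINT=, and EXPLORER options are available in your operating environment; the values that are written to the log depend on how SAS was invoked.

```sas
/* Write the current setting of each option to the SAS log.        */
/* These steps only report the values; configuration options       */
/* cannot be changed after SAS has been initialized.               */
proc options option=log;
run;

proc options option=print;
run;

proc options option=explorer;
run;
```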
Review of SAS Tools

PROC PRINTTO Statement Options

PROC PRINTTO <options>;

LOG='alternate-log-file'
   identifies the location and routes the SAS log to this alternate location.

NEW
   specifies that the current log or procedure output writes over the previous contents of the file.

PRINT='alternate-output-file'
   identifies the location and routes the procedure output to this alternate location.

SAS Windowing Environment Commands

CLEAR
   clears the contents of a window, as specified.

FILE
   routes a copy of the contents of a window to the file that you specify; the original contents remain in place.

PRINT
   prints the contents of the window.

SAS System Options

LOG=system-filename
   redefines the default destination for the SAS log to the file named system-filename.

PRINT=system-filename
   redefines the default destination for procedure output to the file named system-filename.

Learning More

Output Delivery System
   For complete reference documentation about the Output Delivery System, see SAS Output Delivery System: User’s Guide.

PROC PRINTTO
   For complete reference documentation about PROC PRINTTO, see Base SAS Procedures Guide.

SAS environment
   For details about the methods of operating SAS and interactive processing in the windowing environment, see Part 10, “Understanding Your SAS Environment.”

SAS log
   For complete reference information about the SAS log and procedure output, see SAS Language Reference: Concepts.

SAS output
   For more information, see the other sections in “Understanding Your SAS Session.”

SAS system options
   For details about SAS system options, including configuration options, see SAS Language Reference: Dictionary. For operating environment-specific information about routing output, the PRINT= option, LOG= option, and other SAS system options, see the SAS documentation for your operating environment.
CHAPTER 24
Diagnosing and Avoiding Errors

Introduction to Diagnosing and Avoiding Errors 357
   Purpose 357
   Prerequisites 357
Understanding How the SAS Supervisor Checks a Job 357
Understanding How SAS Processes Errors 358
Distinguishing Types of Errors 358
Diagnosing Errors 359
   Examples in This Section 359
   Diagnosing Syntax Errors 359
   Diagnosing Execution-Time Errors 361
   Diagnosing Data Errors 362
Using a Quality Control Checklist 366
Learning More 366

Introduction to Diagnosing and Avoiding Errors

Purpose

In this section, you will learn how to diagnose errors in your programs by learning the following:
- how the SAS Supervisor checks a program for errors
- how to distinguish among the types of errors
- how to interpret the notes, warning messages, and error messages in the log
- what to check for as you develop a program

Prerequisites

You should understand the concepts that are presented in the following sections:
- Chapter 2, “Introduction to DATA Step Processing,” on page 19
- Chapter 3, “Starting with Raw Data: The Basics,” on page 43
- Chapter 6, “Understanding DATA Step Processing,” on page 97
- Chapter 22, “Analyzing Your SAS Session with the SAS Log,” on page 335

Understanding How the SAS Supervisor Checks a Job

To better understand the errors that you make so that you can avoid others, it is important to understand how the SAS Supervisor checks a job. The SAS Supervisor is the part of SAS that is responsible for executing SAS programs.
To check the syntax of a SAS program, the SAS Supervisor does the following:
- reads the SAS statements and data
- translates the program statements into executable machine code or intermediate code
- creates data sets
- calls SAS procedures, as requested
- prints error messages
- ends the job

The SAS Supervisor knows
- the forms and types of statements that can be present in a DATA step
- the types of statements and the options that can be present in a PROC step

To process a program, the SAS Supervisor scans all the SAS statements and breaks each statement into words. Each word is processed separately; when all the words in a step are processed, the step is executed. If the SAS Supervisor detects an error, then it flags the error at its location and prints an explanation. The SAS Supervisor assumes that anything it does not recognize is an error.

Understanding How SAS Processes Errors

When SAS detects an error, it usually underlines the error or underlines the point at which it detects the error, identifying the error with a number. Each number is uniquely associated with an error message. Then SAS enters syntax check mode. SAS reads the remaining program statements, checks their syntax, and underlines additional errors if necessary.

In a batch or noninteractive program, an error in a DATA step statement causes SAS to remain in syntax check mode for the rest of the program. It does not execute any more DATA or PROC steps that create external files or SAS data sets. Procedures that read from SAS data sets execute with 0 observations, and procedures that do not read SAS data sets execute normally. A syntax error in a PROC step usually affects only that step. At the end of the step, SAS writes a message in the SAS log for each error that is detected.

Distinguishing Types of Errors

SAS recognizes four kinds of errors:
- syntax errors
- execution-time errors
- data errors
- semantic errors

Syntax errors are errors made in the SAS statements of a program.
They include misspelled keywords, missing or invalid punctuation, and invalid statement or data set options. SAS detects syntax errors as it compiles each DATA or PROC step.

Execution-time errors cause a program to fail when it is submitted for execution. Most execution-time errors that are not serious produce notes in the SAS log, but the program is allowed to run to completion. For more serious errors, however, SAS issues error messages and stops all processing.

Data errors are actually a type of execution-time error. They occur when the raw data that you are analyzing with a SAS program contains invalid values. For example, a data error occurs if you specify numeric variables in the INPUT statement for character data. Data errors do not cause a program to stop but instead generate notes in the SAS log.

Semantic errors, another type of execution-time error, occur when the form of a SAS statement is correct, but some elements are not valid in that usage. Examples include the following:
- specifying the wrong number of arguments for a function
- using a numeric variable name where only character variables are valid
- using a libref that has not yet been assigned

Diagnosing Errors

Examples in This Section

This section uses nationwide test results from the Scholastic Aptitude Test (SAT) for university-bound students from 1972 through 1998* to show what happens when errors occur.

Diagnosing Syntax Errors

The SAS Supervisor detects syntax errors as it compiles each step, and then SAS does the following:
- prints the word ERROR
- identifies the error’s location
- prints an explanation of the error.

In the following program, the CHART procedure is used to analyze the data. Note that a semicolon in the DATA statement is omitted, and the keyword INFILE is misspelled.
   /* omitted semicolon and misspelled keyword */
   libname out 'your-data-library';

   data out.error1
      infill 'your-input-file';
      input test $ gender $ year SATscore @@;
   run;

   proc chart data = out.error1;
      hbar test / sumvar=SATscore type=mean group=gender discrete;
   run;

* See the Appendix for a complete listing of the input data that is used to create the data sets in this section.

The following output shows the result of the two syntax errors:

Output 24.1  Diagnosing Syntax Errors

NOTE: Libref OUT was successfully assigned as follows:
      Engine:        V8
      Physical Name: 'YOUR-DATA-LIBRARY'
50   data out.error1
51      infill 'YOUR-INPUT-FILE';
52      input test $ gender $ year SATscore @@;
53   run;
ERROR: No CARDS or INFILE statement.
ERROR: Memtype field is invalid.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set OUT.ERROR1 may be incomplete. When this step was stopped there were 0 observations and 4 variables.
WARNING: Data set OUT.ERROR1 was not replaced because this step was stopped.
WARNING: The data set WORK.INFILL may be incomplete. When this step was stopped there were 0 observations and 4 variables.
WARNING: Data set WORK.INFILL was not replaced because this step was stopped.
54
55   proc chart data=out.error1;
56      hbar test / sumvar=SATscore type=mean group=gender discrete;
57   run;
NOTE: No observations in data set OUT.ERROR1.

As the log indicates, SAS recognizes the keyword DATA and attempts to process the DATA step. Because the DATA statement must end with a semicolon, SAS assumes that INFILL is a data set name and that two data sets are being created: OUT.ERROR1 and WORK.INFILL. Because it considers INFILL the name of a data set, it does not recognize it as part of another statement and, therefore, does not detect the spelling error. Because the quoted string is invalid in a DATA statement, SAS stops processing here and creates no observations for either data set.
SAS attempts to execute the program logically based on the statements that it contains, according to the steps outlined earlier in this section. The second syntax error, the misspelled keyword, is never recognized because SAS considers the DATA statement to be in effect until a semicolon ends the statement. The point to remember is that when multiple errors are made in the same program, not all of them might be detected the first time the program is executed, or they might be flagged differently in a group than if they were made alone. You might find that one correction uncovers another error or at least changes its explanation in the log.

To illustrate this point, the previous program is reexecuted with the semicolon added to the DATA statement. An attempt to correct the misspelled keyword simply introduces a different spelling error, as follows.

   /* misspelled keyword */
   libname out 'your-data-library';

   data out.error2;
      unfile 'your-input-file';
      input test $ gender $ year SATscore @@;
   run;

   proc chart data = out.error1;
      hbar test / sumvar=SATscore type=mean group=gender discrete;
   run;

The following output shows the results:

Output 24.2 Correcting Syntax and Finding Different Error Messages

   NOTE: Libref OUT was successfully assigned as follows:
         Engine:        V8
         Physical Name: YOUR-DATA-LIBRARY
   70   data out.error2;
   71      unfile 'YOUR-INPUT-FILE'
           -----
           180
   ERROR 180-322: Statement is not valid or it is used out of proper order.

   72      input test $ gender $ year SATscore @@;
   73   run;

   ERROR: No CARDS or INFILE statement.
   NOTE: The SAS System stopped processing this step because of errors.
   WARNING: The data set OUT.ERROR2 may be incomplete. When this step was stopped
            there were 0 observations and 4 variables.

   74
   75   proc chart data=out.error1;
   76      hbar test / sumvar=SATscore type=mean group=gender discrete;
   77   run;

   NOTE: No observations in data set OUT.ERROR1.
With the semicolon added, SAS now attempts to create only one data set. From that point on, SAS reads the SAS statements as it did before and issues many of the same messages. However, this time SAS considers the UNFILE statement invalid or out of proper order, and it creates no observations for the data set.

Diagnosing Execution-Time Errors

Several types of errors are detected at execution time. Execution-time errors include the following:
- illegal mathematical operations
- observations out of order for BY-group processing
- an incorrect reference in an INFILE statement (for example, misspelling or otherwise incorrectly stating the external file)

When the SAS Supervisor encounters an execution-time error, it does the following:
- prints a note, warning, or error message, depending on the seriousness of the error
- in some cases, lists the values that are stored in the program data vector
- continues or stops processing, depending on the seriousness of the error

If the previous program is rerun with the correct spelling for INFILE but with a misspelling of the filename in the INFILE statement, then the error is detected at execution time and the data is not read.

   /* misspelled file in the INFILE statement */
   libname out 'your-data-library';

   data out.error3;
      infile 'an-incorrect-filename';
      input test $ gender $ year SATscore @@;
   run;

   proc chart data = out.error3;
      hbar test / sumvar=SATscore type=mean group=gender discrete;
   run;

As the SAS log in the following output indicates, SAS cannot find the file. SAS stops processing because of errors and creates no observations in the data set.
Output 24.3 Diagnosing an Error in the INFILE Statement

   NOTE: Libref OUT was successfully assigned as follows:
         Engine:        V8
         Physical Name: YOUR-DATA-LIBRARY
   10   data out.error3;
   11      infile 'AN-INCORRECT-FILENAME';
   12      input test $ gender $ year SATscore @@;
   13   run;

   ERROR: Physical file does not exist, AN-INCORRECT-FILENAME
   NOTE: The SAS System stopped processing this step because of errors.
   WARNING: The data set OUT.ERROR3 may be incomplete. When this step was stopped
            there were 0 observations and 4 variables.

   14
   15   proc chart data=out.error3;
   16      hbar test / sumvar=SATscore type=mean group=gender discrete;
   17   run;

   NOTE: No observations in data set OUT.ERROR3.

Diagnosing Data Errors

When SAS detects data errors during execution, it continues processing and then does the following:
- prints a note that describes the error
- lists the values that are stored in the input buffer
- lists the values that are stored in the program data vector

Note that the values listed in the program data vector include two variables created automatically by SAS:

_N_
   counts the number of times the DATA step iterates.

_ERROR_
   indicates the occurrence of an error during an execution of the DATA step. The value that is assigned to the variable _ERROR_ is 0 when no error is encountered and 1 when an error is encountered.

These automatic variables are assigned temporarily to each observation and are not stored with the data set.

The raw data that is shown here is read by a program that uses formats to determine how variable values are printed:

   verbal           m 1967 463
    verbal           f 1967 468
    verbal           m 1970 459
    verbal           f 1970 461
    math             m 1967 514
     math             f 1967 467
    math             m 1970 509
    math             f 1970 509

However, the data is not aligned correctly in the columns that are described by the INPUT statement.
The sixth data line is shifted two spaces to the right, and the rest of the data lines, except for the first, are shifted one space to the right, as shown by a comparison of the raw data with the following program:

   /* data in wrong columns */
   libname out 'your-data-library';

   proc format;
      value xscore . ='accurate scores unavailable';
   run;

   data out.error4;
      infile 'your-input-file';
      input test $ 1-8 gender $ 18 year 20-23 score 25-27;
      format score xscore.;
   run;

   proc print data = out.error4;
      title 'Viewing Incorrect Output';
   run;

The following output shows the results of the SAS program:

Output 24.4 Detecting Data Errors with Incorrect Output

                       Viewing Incorrect Output                        1

   Obs   test     gender   year   score
     1   verbal   m        1967   463
     2   verbal             196   46
     3   verbal             197   45
     4   verbal             197   46
     5   math               196   51
     6   math                 .   accurate scores unavailable
     7   math               197   50
     8   math               197   50

This program generates output, but it is not the expected output. The first observation appears to be correct, but subsequent observations have the following problems:
- The values for the variable GENDER are missing.
- Only the first three digits of the value for the variable YEAR are shown except in the sixth observation where a missing value is indicated.
- The third digit of the value for the variable SCORE is missing, again except in the sixth observation, which does show the assigned value for the missing value.

The SAS log in the following output contains an explanation:

Output 24.5 Diagnosing Data Errors

   NOTE: Libref OUT was successfully assigned as follows:
         Engine:        V8
         Physical Name: YOUR-DATA-LIBRARY
   10   proc format;
   11      value xscore . ='accurate scores unavailable';
   NOTE: Format XSCORE has been output.
   12   run;
   13
   14   data out.error4;
   15      infile 'YOUR-INPUT-FILE';
   16      input test $ 1-8 gender $ 18 year 20-23
   17            score 25-27;
   18      format score xscore.;
   19   run;

   NOTE: The infile 'YOUR-INPUT-FILE' is:
         File Name=YOUR-INPUT-FILE,
         Owner Name=userid,Group Name=dev,
         Access Permission=rw-r--r--,
         File Size (bytes)=233

   NOTE: Invalid data for year in line 6 20-23.
   NOTE: Invalid data for score in line 6 25-27.
   RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
   6           math             f 1967 467 29
   test=math gender=  year=. score=accurate scores unavailable _ERROR_=1 _N_=6
   NOTE: 9 records were read from the infile 'YOUR-INPUT-FILE'.
         The minimum record length was 0.
         The maximum record length was 29.
   NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
   NOTE: The data set OUT.ERROR4 has 8 observations and 4 variables.

   20
   21   proc print data=out.error4;
   22      title 'Viewing Incorrect Output';
   23   run;

   NOTE: There were 8 observations read from the data set OUT.ERROR4.

The errors are flagged, starting with the first message that line 6 contains invalid data for the variable YEAR. The rule indicates that input data has been written to the log. SAS lists on the log the values that are stored in the program data vector. The following lines from the log indicate that SAS has encountered an error:

   NOTE: Invalid data for year in line 6 20-23.
   NOTE: Invalid data for score in line 6 25-27.
   RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
   6           math             f 1967 467 29
   test=math gender=  year=. score=accurate scores unavailable _ERROR_=1 _N_=6

Missing values are shown for the variables GENDER and YEAR. The NOTEs in the log indicate that the sixth line of input contained the error.

To debug the program, either the raw data can be repositioned or the INPUT statement can be rewritten, remembering that all the data lines were shifted at least one space to the right.
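As one possible way to rewrite the INPUT statement, the sketch below switches from column input to list input, so that SAS locates each value by the blanks around it rather than by fixed column positions. It assumes that every value in the raw data is separated by at least one blank and that no value is missing. The data set name OUT.FIXED4 and the title text are hypothetical names chosen for this illustration.

```sas
/* Sketch only: list input instead of column input.               */
/* Assumes blank-delimited values with no missing fields.         */
/* OUT.FIXED4 is a hypothetical data set name.                    */
libname out 'your-data-library';

proc format;
   value xscore . ='accurate scores unavailable';
run;

data out.fixed4;
   infile 'your-input-file';
   /* list input: values are found by scanning for blanks,        */
   /* so a line that is shifted to the right is still read        */
   input test $ gender $ year score;
   format score xscore.;
run;

proc print data=out.fixed4;
   title 'Viewing Output Read with List Input';
run;
```

List input tolerates shifted lines, but it is a different technique from the column input used in this section. If a value can be blank or can contain embedded blanks, column input with corrected column positions is likely the safer choice.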
The variable TEST was unaffected, but the variable GENDER was completely removed from its designated field; therefore, SAS reads the variable GENDER as a missing value. In the sixth observation, for which the data was shifted right an additional space, the character value for GENDER occupied part of the field for the numeric variable YEAR. When SAS encounters invalid data, it treats the value as a missing value but also notes on the log that the data is invalid. The important point to remember is that SAS can use only the information that you provide to it, not what you intend to provide to it.

Using a Quality Control Checklist

If you follow some basic guidelines as you develop a program, then you can avoid common errors. Use the following checklist to flag and correct common mistakes before you submit your program.

- Check the syntax of your program. In particular, check the following:
   - All SAS statements end with a semicolon; be sure you have not omitted any semicolons or accidentally typed the wrong character.
   - Any starting and ending quotation marks must match; you can use either single or double quotation marks.
   - Most SAS statements begin with a SAS keyword. (Exceptions are assignment statements and sum statements.) Be sure you have not misspelled or omitted any of the keywords.
   - Every DO and SELECT statement must be followed by an END statement.
- Check the order of your program. SAS usually executes the statements in a DATA step one by one, in the order they appear. After executing the DATA step, SAS moves to the next step and continues in the same fashion. Be sure that all the SAS statements appear in order so that SAS can execute them properly. For example, an INFILE statement, if used, must precede an INPUT statement. Also, be sure to end steps with the RUN statement. This is especially important at the end of your program because the RUN statement causes the previous step to be executed.
- Check your INPUT statement and your data. SAS classifies all variables as either character or numeric. The assignment in the INPUT statement as either character or numeric must correspond to the actual values of variables in your data. Also, SAS allows for list, column, formatted, or named input. The method of input that you specify in the INPUT statement must correspond with the actual arrangement of raw data.

Learning More

INFILE statement options
   SAS Language Reference: Dictionary contains information about using the MISSOVER and STOPOVER options in the INFILE statement as debugging tools. The MISSOVER option prevents a SAS program from going past the end of a line to read values with list input if it does not find values in the current line for all INPUT statement variables. Then SAS assigns missing values to variables for which no values appear on the current input line. The STOPOVER option stops processing the DATA step when an INPUT statement using list input reaches the end of the current record without finding values for all variables in the statement. Then SAS sets _ERROR_ to 1, stops building the data set, and prints an incomplete data line.

Program data vector and input buffer
   Chapter 2, “Introduction to DATA Step Processing,” on page 19 and Chapter 3, “Starting with Raw Data: The Basics,” on page 43 contain information about the program data vector and the input buffer.

The SAS log
   SAS Language Reference: Concepts contains complete reference information about the SAS log.

SAS output
   SAS Language Reference: Concepts contains complete reference information about SAS output.

Your SAS session
   Other sections provide more information about your SAS session. Chapter 23, “Directing SAS Output and the SAS Log,” on page 349 discusses warnings, notes, and error messages and presents debugging guidelines.

PART 6 Producing Reports

Chapter 25  Producing Detail Reports with the PRINT Procedure
Chapter 26  Creating Summary Tables with the TABULATE Procedure
Chapter 27  Creating Detail and Summary Reports with the REPORT Procedure

CHAPTER 25 Producing Detail Reports with the PRINT Procedure

Introduction to Producing Detail Reports with the PRINT Procedure
   Purpose
   Prerequisites
   Input File and SAS Data Sets for Examples
Creating Simple Reports
   Showing All the Variables
   Labeling the Observation Column
   Suppressing the Observation Column
   Emphasizing a Key Variable
      Understanding the ID Statement
      Using an Unsorted Key Variable
      Using a Sorted Key Variable
   Reporting the Values of Selected Variables
   Selecting Observations
      Understanding the WHERE Statement
      Making a Single Comparison
      Making Multiple Comparisons
Creating Enhanced Reports
   Ways to Enhance a Report
   Specifying Formats for the Variables
   Summing Numeric Variables
   Grouping Observations by Variable Values
   Computing Group Subtotals
   Identifying Group Subtotals
   Computing Multiple Group Subtotals
   Computing Group Totals
   Grouping Observations on Separate Pages
Creating Customized Reports
   Ways to Customize a Report
   Understanding Titles and Footnotes
   Adding Titles and Footnotes
   Defining Labels
   Splitting Labels across Two or More Lines
   Adding Double Spacing
   Requesting Uniform Column Widths
Making Your Reports Easy to Change
   Understanding the SAS Macro Facility
   Using Automatic Macro Variables
   Using Your Own Macro Variables
      Defining Macro Variables
      Referring to Macro Variables
Review of SAS Tools
   PROC PRINT Statements
   PROC SORT Statements
   SAS Macro Language
Learning More

Introduction to Producing Detail Reports with the PRINT Procedure

Purpose

Detail reports,
or simple data listings, contain one row for every observation that is selected for inclusion in the report. A detail report provides information about every record that is processed. For example, a detail report for a sales company includes all the information about every sale made during a particular quarter of the year. The PRINT procedure is one of several report writing tools that you can use to create a variety of detail reports.

In this section, you will learn how to do the following:
- produce simple reports by using a few basic PROC PRINT options and statements
- produce enhanced reports by adding additional statements that format values, sum columns, group observations, and compute totals
- customize the appearance of reports by adding titles, footnotes, and column labels
- substitute text by using macro variables

Prerequisites

Before proceeding with this section, you should be familiar with the following features and concepts:
- the assignment statement
- the OUTPUT statement
- the SORT procedure
- the BY statement
- the location of the procedure output

Input File and SAS Data Sets for Examples

The examples in this section use one input file* and five SAS data sets. The input file contains sales records for a company, TruBlend Coffee Makers, that distributes the coffee machines. The file has the following structure:

   01   1   Hollingsworth   Deluxe      260   49.50
   01   1   Garcia          Standard     41   30.97
   01   1   Hollingsworth   Deluxe      330   49.50
   01   1   Jensen          Standard   1110   30.97
   01   1   Garcia          Standard    715   30.97
   01   1   Jensen          Deluxe      675   49.50
   02   1   Jensen          Standard     45   30.97
   02   1   Garcia          Deluxe       10   49.50

   …more data lines…

   12   4   Hollingsworth   Deluxe      125   49.50
   12   4   Jensen          Standard   1254   30.97
   12   4   Hollingsworth   Deluxe      175   49.50

* See the “Data Set YEAR_SALES” on page 715 for a complete listing of the input data.

The input file contains the following values from left to right:
- the month that a sale was made
- the quarter of the year that a sale was made
- the name of the sales representative
- the type of coffee maker sold (standard or deluxe)
- the number of units sold
- the price of each unit in US dollars

The first of the five SAS data sets is named YEAR_SALES. This data set contains all the sales data from the input file, and a new variable named AmountSold, which is created by multiplying Units by Price. The following program creates the five SAS data sets that this section uses:

   data year_sales;
      infile 'your-input-file';
      input Month $ Quarter $ SalesRep $14. Type $ Units Price;
      AmountSold = Units * Price;

Creating Simple Reports

Showing All the Variables

By default, the PRINT procedure generates a simple report that shows the values of all the variables and the observations in the data set.
For example, the following PROC PRINT step creates a report for the first sales quarter:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr01;
      title 'TruBlend Coffee Makers Quarterly Sales Report';
   run;

The following output shows the values of all the variables for all the observations in QTR01:

Output 25.1 Showing All Variables and All Observations

              TruBlend Coffee Makers Quarterly Sales Report (2)              1

   (1)                                                                  Amount
   Obs   Month   Quarter   SalesRep        Type       Units   Price       Sold
     1   01      1         Hollingsworth   Deluxe       260   49.50   12870.00
     2   01      1         Garcia          Standard      41   30.97    1269.77
     3   01      1         Hollingsworth   Standard     330   30.97   10220.10
     4   01      1         Jensen          Standard     110   30.97    3406.70
     5   01      1         Garcia          Deluxe       715   49.50   35392.50
     6   01      1         Jensen          Standard     675   30.97   20904.75
     7   02      1         Garcia          Standard    2045   30.97   63333.65
     8   02      1         Garcia          Deluxe        10   49.50     495.00
     9   02      1         Garcia          Standard      40   30.97    1238.80
    10   02      1         Hollingsworth   Standard    1030   30.97   31899.10
    11   02      1         Jensen          Standard     153   30.97    4738.41
    12   02      1         Garcia          Standard      98   30.97    3035.06
    13   03      1         Hollingsworth   Standard     125   30.97    3871.25
    14   03      1         Jensen          Standard     154   30.97    4769.38
    15   03      1         Garcia          Standard     118   30.97    3654.46
    16   03      1         Hollingsworth   Standard      25   30.97     774.25
    17   03      1         Jensen          Standard     525   30.97   16259.25
    18   03      1         Garcia          Standard     310   30.97    9600.70

The following list corresponds to the numbered items in the preceding output:

(1) The Obs column identifies each observation by a number. By default, SAS automatically displays the observation number at the beginning of each row.

(2) The top of the report has a title and a page number. The TITLE statement in the PROC PRINT step produces the title. “Creating Customized Reports” on page 391 discusses the TITLE statement in more detail. For now, be aware that all the examples include at least one TITLE statement that produces a descriptive title similar to the one in this example.
The content of the report is very similar to the contents of the original data set QTR01; however, the report is easy to produce and to enhance.

Labeling the Observation Column

A quick way to modify the report is to label the observation number (Obs column). The following SAS program includes the OBS= option in the PROC PRINT statement to change the column label for the Obs column:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr01 obs='Observation Number';
      title 'TruBlend Coffee Makers Quarterly Sales Report';
   run;

The following output shows the report:

Output 25.2 Labeling the Observation Column

              TruBlend Coffee Makers Quarterly Sales Report                  1

   Observation                                                                  Amount
        Number   Month   Quarter   SalesRep        Type       Units   Price       Sold
             1   01      1         Hollingsworth   Deluxe       260   49.50   12870.00
             2   01      1         Garcia          Standard      41   30.97    1269.77
             3   01      1         Hollingsworth   Standard     330   30.97   10220.10
             4   01      1         Jensen          Standard     110   30.97    3406.70
             5   01      1         Garcia          Deluxe       715   49.50   35392.50
             6   01      1         Jensen          Standard     675   30.97   20904.75
             7   02      1         Garcia          Standard    2045   30.97   63333.65
             8   02      1         Garcia          Deluxe        10   49.50     495.00
             9   02      1         Garcia          Standard      40   30.97    1238.80
            10   02      1         Hollingsworth   Standard    1030   30.97   31899.10
            11   02      1         Jensen          Standard     153   30.97    4738.41
            12   02      1         Garcia          Standard      98   30.97    3035.06
            13   03      1         Hollingsworth   Standard     125   30.97    3871.25
            14   03      1         Jensen          Standard     154   30.97    4769.38
            15   03      1         Garcia          Standard     118   30.97    3654.46
            16   03      1         Hollingsworth   Standard      25   30.97     774.25
            17   03      1         Jensen          Standard     525   30.97   16259.25
            18   03      1         Garcia          Standard     310   30.97    9600.70

Suppressing the Observation Column

A quick way to simplify the report is to suppress the observation number (Obs column). Usually it is unnecessary to identify each observation by number. (In some cases, you might want to show the observation numbers.)
The following SAS program includes the NOOBS option in the PROC PRINT statement to suppress the Obs column:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr01 noobs;
      title 'TruBlend Coffee Makers Quarterly Sales Report';
   run;

The following output shows the report:

Output 25.3 Suppressing the Observation Column

              TruBlend Coffee Makers Quarterly Sales Report                  1

                                                                  Amount
   Month   Quarter   SalesRep        Type       Units   Price       Sold
   01      1         Hollingsworth   Deluxe       260   49.50   12870.00
   01      1         Garcia          Standard      41   30.97    1269.77
   01      1         Hollingsworth   Standard     330   30.97   10220.10
   01      1         Jensen          Standard     110   30.97    3406.70
   01      1         Garcia          Deluxe       715   49.50   35392.50
   01      1         Jensen          Standard     675   30.97   20904.75
   02      1         Garcia          Standard    2045   30.97   63333.65
   02      1         Garcia          Deluxe        10   49.50     495.00
   02      1         Garcia          Standard      40   30.97    1238.80
   02      1         Hollingsworth   Standard    1030   30.97   31899.10
   02      1         Jensen          Standard     153   30.97    4738.41
   02      1         Garcia          Standard      98   30.97    3035.06
   03      1         Hollingsworth   Standard     125   30.97    3871.25
   03      1         Jensen          Standard     154   30.97    4769.38
   03      1         Garcia          Standard     118   30.97    3654.46
   03      1         Hollingsworth   Standard      25   30.97     774.25
   03      1         Jensen          Standard     525   30.97   16259.25
   03      1         Garcia          Standard     310   30.97    9600.70

Emphasizing a Key Variable

Understanding the ID Statement

To emphasize a key variable in a data set, you can use the ID statement in the PROC PRINT step. When you identify a variable in the ID statement, PROC PRINT displays the values of this variable in the first column of each row of the report. Highlighting a key variable in this way can help answer questions about your data. For example, the report can answer this question: “For each sales representative, what are the sales figures for the first quarter of the year?” The following two examples demonstrate how to answer this question quickly using data that is unsorted and sorted.
Using an Unsorted Key Variable

To produce a report that emphasizes the sales representative, the PROC PRINT step includes an ID statement that specifies the variable SalesRep. The revised program follows:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr01;
      id SalesRep;
      title 'TruBlend Coffee Makers Quarterly Sales Report';
   run;

Because the ID statement automatically suppresses the observation numbers, the NOOBS option is not needed in the PROC PRINT statement. The following output shows the new report:

Output 25.4 Using the ID Statement with an Unsorted Key Variable

              TruBlend Coffee Makers Quarterly Sales Report                  1

                                                                  Amount
   SalesRep        Month   Quarter   Type       Units   Price       Sold
   Hollingsworth   01      1         Deluxe       260   49.50   12870.00
   Garcia          01      1         Standard      41   30.97    1269.77
   Hollingsworth   01      1         Standard     330   30.97   10220.10
   Jensen          01      1         Standard     110   30.97    3406.70
   Garcia          01      1         Deluxe       715   49.50   35392.50
   Jensen          01      1         Standard     675   30.97   20904.75
   Garcia          02      1         Standard    2045   30.97   63333.65
   Garcia          02      1         Deluxe        10   49.50     495.00
   Garcia          02      1         Standard      40   30.97    1238.80
   Hollingsworth   02      1         Standard    1030   30.97   31899.10
   Jensen          02      1         Standard     153   30.97    4738.41
   Garcia          02      1         Standard      98   30.97    3035.06
   Hollingsworth   03      1         Standard     125   30.97    3871.25
   Jensen          03      1         Standard     154   30.97    4769.38
   Garcia          03      1         Standard     118   30.97    3654.46
   Hollingsworth   03      1         Standard      25   30.97     774.25
   Jensen          03      1         Standard     525   30.97   16259.25
   Garcia          03      1         Standard     310   30.97    9600.70

Notice that the names of the sales representatives are not in any particular order. The report will be easier to read when the observations are grouped together in alphabetical order by sales representative.

Using a Sorted Key Variable

If your data is not already ordered by the key variable, then use PROC SORT to sort the observations by this variable. If you do not specify an output data set, then PROC SORT permanently changes the order of the observations in the input data set.
The following program shows how to alphabetically order the observations by sales representative:

   options linesize=80 pageno=1 nodate;

   proc sort data=qtr01;      /* (1) */
      by SalesRep;            /* (2) */
   run;

   proc print data=qtr01;
      id SalesRep;            /* (3) */
      title 'TruBlend Coffee Makers Quarterly Sales Report';
   run;

The following list corresponds to the numbered items in the preceding program:

(1) A PROC SORT step precedes the PROC PRINT step. PROC SORT orders the observations in the data set alphabetically by the values of the BY variable and overwrites the input data set.

(2) A BY statement sorts the observations alphabetically by SalesRep.

(3) An ID statement identifies the observations with the value of SalesRep rather than with the observation number. PROC PRINT uses the sorted order of SalesRep to create the report.

The following output shows the report:

Output 25.5 Using the ID Statement with a Sorted Key Variable

              TruBlend Coffee Makers Quarterly Sales Report                  1

                                                                  Amount
   SalesRep        Month   Quarter   Type       Units   Price       Sold
   Garcia          01      1         Standard      41   30.97    1269.77
   Garcia          01      1         Deluxe       715   49.50   35392.50
   Garcia          02      1         Standard    2045   30.97   63333.65
   Garcia          02      1         Deluxe        10   49.50     495.00
   Garcia          02      1         Standard      40   30.97    1238.80
   Garcia          02      1         Standard      98   30.97    3035.06
   Garcia          03      1         Standard     118   30.97    3654.46
   Garcia          03      1         Standard     310   30.97    9600.70
   Hollingsworth   01      1         Deluxe       260   49.50   12870.00
   Hollingsworth   01      1         Standard     330   30.97   10220.10
   Hollingsworth   02      1         Standard    1030   30.97   31899.10
   Hollingsworth   03      1         Standard     125   30.97    3871.25
   Hollingsworth   03      1         Standard      25   30.97     774.25
   Jensen          01      1         Standard     110   30.97    3406.70
   Jensen          01      1         Standard     675   30.97   20904.75
   Jensen          02      1         Standard     153   30.97    4738.41
   Jensen          03      1         Standard     154   30.97    4769.38
   Jensen          03      1         Standard     525   30.97   16259.25

Now, the report clearly shows what each sales representative sold during the first three months of the year.
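The PROC SORT step in this example overwrites QTR01. If the original order of the observations needs to be kept, one option is to name an output data set with the OUT= option, as in the minimal sketch below. The data set name QTR01_SORTED is a hypothetical name chosen for this illustration.

```sas
/* Sketch only: QTR01_SORTED is a hypothetical data set name.     */
options linesize=80 pageno=1 nodate;

/* OUT= writes the sorted observations to a new data set,         */
/* so the input data set QTR01 keeps its original order           */
proc sort data=qtr01 out=qtr01_sorted;
   by SalesRep;
run;

/* print the sorted copy rather than the original data set        */
proc print data=qtr01_sorted;
   id SalesRep;
   title 'TruBlend Coffee Makers Quarterly Sales Report';
run;
```

With this approach, later steps that rely on the original order of QTR01 are unaffected by the sort.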
Reporting the Values of Selected Variables

By default, the PRINT procedure reports the values of all the variables in the data set. However, to control which variables are shown and in what order, add a VAR statement to the PROC PRINT step. For example, the information for the variables Quarter, Type, and Price is unnecessary. Therefore, the report needs to show only the values of the variables that are specified in the following order:

   SalesRep Month Units AmountSold

The following program adds the VAR statement to create a report that lists the values of the four variables in a specific order:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr01 noobs;
      var SalesRep Month Units AmountSold;
      title 'TruBlend Coffee Makers Quarterly Sales Report';
   run;

This program does not include the ID statement. It is unnecessary to identify the observations because the variable SalesRep is the first variable that is specified in the VAR statement. The NOOBS option in the PROC PRINT statement suppresses the observation numbers so that the sales representative appears in the first column of the report. The following output shows the report:

Output 25.6 Showing Selected Variables

              TruBlend Coffee Makers Quarterly Sales Report                  1

                                     Amount
   SalesRep        Month   Units       Sold
   Hollingsworth   01        260   12870.00
   Garcia          01         41    1269.77
   Hollingsworth   01        330   10220.10
   Jensen          01        110    3406.70
   Garcia          01        715   35392.50
   Jensen          01        675   20904.75
   Garcia          02       2045   63333.65
   Garcia          02         10     495.00
   Garcia          02         40    1238.80
   Hollingsworth   02       1030   31899.10
   Jensen          02        153    4738.41
   Garcia          02         98    3035.06
   Hollingsworth   03        125    3871.25
   Jensen          03        154    4769.38
   Garcia          03        118    3654.46
   Hollingsworth   03         25     774.25
   Jensen          03        525   16259.25
   Garcia          03        310    9600.70

The report is concise because it contains only those variables that are specified in the VAR statement. The next example revises the report to show only those observations that satisfy a particular condition.
Selecting Observations

Understanding the WHERE Statement

To select observations that meet a particular condition from a data set, use a WHERE statement. The WHERE statement subsets the input data by specifying certain conditions that each observation must meet before it is available for processing.

The condition that you define in a WHERE statement is an arithmetic or logical expression that generally consists of a sequence of operands and operators.* To compare character values, you must enclose them in single or double quotation marks and the values must match exactly, including capitalization. You can also specify multiple comparisons that are joined by logical operators in the WHERE statement.

Using the WHERE statement might improve the efficiency of your SAS programs because SAS is not required to read all the observations in the input data set.

* The construction of the WHERE statement is similar to the construction of IF and IF-THEN statements.

Making a Single Comparison

You can select observations based on a single comparison by using the WHERE statement. The following program uses a single comparison in a WHERE statement to produce a report that shows the sales activity for a sales representative named Garcia:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr01 noobs;
      var SalesRep Month Units AmountSold;
      where SalesRep='Garcia';
      title 'TruBlend Coffee Makers Quarterly Sales for Garcia';
   run;

In the WHERE statement, the value Garcia is enclosed in quotation marks because SalesRep is a character variable. In addition, the letter G in the value Garcia is uppercase so that it matches exactly the value in the data set QTR01.
The following output shows the report:

Output 25.7 Making a Single Comparison

          TruBlend Coffee Makers Quarterly Sales for Garcia              1

   Sales                      Amount
   Rep      Month   Units       Sold
   Garcia   01         41    1269.77
   Garcia   01        715   35392.50
   Garcia   02       2045   63333.65
   Garcia   02         10     495.00
   Garcia   02         40    1238.80
   Garcia   02         98    3035.06
   Garcia   03        118    3654.46
   Garcia   03        310    9600.70

Making Multiple Comparisons

You can also select observations based on two or more comparisons by using the WHERE statement. However, when you use multiple WHERE statements in a PROC step, only the last statement is used. You can create a compound comparison by using the AND operator. For example, the following WHERE statement selects observations where Garcia sold only the deluxe coffee maker:

   where SalesRep = 'Garcia' and Type='Deluxe';

The following program uses two comparisons in a WHERE statement to produce a report that shows sales activities for a sales representative (Garcia) during the first month of the year:

   options linesize=80 pageno=1 nodate;

   proc print data=year_sales noobs;
      var SalesRep Month Units AmountSold;
      where SalesRep='Garcia' and Month='01';
      title 'TruBlend Coffee Makers Monthly Sales for Garcia';
   run;

The WHERE statement uses the logical AND operator. Therefore, both comparisons must be true for PROC PRINT to include an observation in the report. The following output shows the report:

Output 25.8 Making Two Comparisons

           TruBlend Coffee Makers Monthly Sales for Garcia               1

   Sales                      Amount
   Rep      Month   Units       Sold
   Garcia   01         41    1269.77
   Garcia   01        715   35392.50

You might also want to select observations that meet at least one of several conditions.
The following program uses two comparisons in the WHERE statement to create a report that shows every sale during the first quarter of the year that was greater than 500 units or more than $20,000:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr01 noobs;
      var SalesRep Month Units AmountSold;
      where Units>500 or AmountSold>20000;
      title 'Quarterly Report for Sales above 500 Units or $20,000';
   run;

Notice that this WHERE statement uses the logical OR operator. Therefore, only one of the comparisons must be true for PROC PRINT to include an observation in the report.

The following output shows the report:

Output 25.9   Making Comparisons for One Condition or Another

         Quarterly Report for Sales above 500 Units or $20,000             1

          SalesRep         Month    Units    Amount Sold

          Garcia            01        715       35392.50
          Jensen            01        675       20904.75
          Garcia            02       2045       63333.65
          Hollingsworth     02       1030       31899.10
          Jensen            03        525       16259.25

Creating Enhanced Reports

Ways to Enhance a Report

With just a few PROC PRINT statements and options, you can produce a variety of detail reports. By using additional statements and options that enhance the reports, you can do the following:

- format the columns
- sum the numeric variables
- group the observations based on variable values
- sum the groups of variable values
- group the observations on separate pages

The examples in this section use the SAS data set QTR02, which was created in "Input File and SAS Data Sets for Examples" on page 372.

Specifying Formats for the Variables

Specifying the formats of variables is a simple yet effective way to enhance the readability of your reports. By adding the FORMAT statement to your program, you can specify formats for variables. The format of a variable is a pattern that SAS uses to write the values of the variable. For example, SAS contains formats that add commas to numeric values, that add dollar signs to figures, or that report values as Roman numerals.
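As a small illustration of how a format changes only the way that a value is written, consider the following sketch. It is not one of the chapter's examples: the DATA _NULL_ step, the variable Year, and the values are assumptions chosen for illustration. The step writes three numbers to the SAS log by using the COMMA, DOLLAR, and ROMAN formats:

```sas
/* Illustrative sketch only: the variable names and values are assumptions. */
data _null_;
   Units=1715;
   AmountSold=53113.55;
   Year=2001;
   put Units comma7.;          /* commas, no decimal places             */
   put AmountSold dollar14.2;  /* dollar sign, commas, two decimals     */
   put Year roman10.;          /* value written as a Roman numeral      */
run;
```

The values that are stored in Units, AmountSold, and Year stay numeric and unchanged. Only the text that each PUT statement writes reflects the format.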
Using a format can make the values of the variables Units and AmountSold easier to read than in the previous reports. Specifically, Units can use a COMMA format with a total field width of 7, which includes commas to separate every three digits and omits decimal values. AmountSold can use a DOLLAR format with a total field width of 14, which includes commas to separate every three digits, a decimal point, two decimal places, and a dollar sign.

The following program illustrates how to apply these formats in a FORMAT statement:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr02 noobs;
      var SalesRep Month Units AmountSold;
      where Units>500 or AmountSold>20000;
      format Units comma7. AmountSold dollar14.2;
      title 'Quarterly Report for Sales above 500 Units or $20,000';
   run;

PROC PRINT applies the COMMA7. format to the values of the variable Units and the DOLLAR14.2 format to the values of the variable AmountSold.

The following output shows the report:

Output 25.10   Formatting Numeric Variables

         Quarterly Report for Sales above 500 Units or $20,000             1

          SalesRep         Month      Units        AmountSold

          Hollingsworth     04          530        $16,414.10 (1)
          Jensen            04        1,110 (2)    $34,376.70
          Garcia            04        1,715        $53,113.55
          Jensen            04          675        $20,904.75
          Hollingsworth     05        1,120        $34,686.40
          Hollingsworth     05        1,030        $31,899.10
          Garcia            06          512        $15,856.64
          Garcia            06        1,000        $30,970.00

The following list corresponds to the numbered items in the preceding output:

(1) AmountSold uses the DOLLAR14.2 format. The maximum column width is 14 spaces. Two spaces are reserved for the decimal part of a value. The remaining 12 spaces include the decimal point, whole numbers, the dollar sign, commas, and a minus sign if a value is negative.

(2) Units uses the COMMA7. format. The maximum column width is seven spaces. The column width includes the numeric value, commas, and a minus sign if a value is negative.
The formats do not affect the internal data values that are stored in the SAS data set. The formats change only how the current PROC step displays the values in the report.

Note: Be sure to specify enough columns in the format to contain the largest value. If the format that you specify is not wide enough to contain the largest value, including special characters such as commas and dollar signs, then SAS applies the most appropriate format.

Summing Numeric Variables

In addition to reporting the values in a data set, you can add the SUM statement to compute subtotals and totals for the numeric variables. The SUM statement enables you to request totals for one or more variables.

The following program produces a report that shows totals for the two numeric variables Units and AmountSold:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr02 noobs;
      var SalesRep Month Units AmountSold;
      where Units>500 or AmountSold>20000;
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      title 'Quarterly Sales Totals for Sales above 500 Units or $20,000';
   run;

The following output shows the report:

Output 25.11   Summing Numeric Variables

      Quarterly Sales Totals for Sales above 500 Units or $20,000          1

          SalesRep         Month      Units        AmountSold

          Hollingsworth     04          530        $16,414.10
          Jensen            04        1,110        $34,376.70
          Garcia            04        1,715        $53,113.55
          Jensen            04          675        $20,904.75
          Hollingsworth     05        1,120        $34,686.40
          Hollingsworth     05        1,030        $31,899.10
          Garcia            06          512        $15,856.64
          Garcia            06        1,000        $30,970.00
                                    =======    ==============
                                      7,692       $238,221.24

The totals for Units and AmountSold are computed by summing the values for each sale made by all the sales representatives. As the next example shows, the PRINT procedure can also separately compute subtotals for each sales representative.

Grouping Observations by Variable Values

The BY statement enables you to obtain separate analyses on groups of observations.
The previous example used the SUM statement to compute totals for the variables Units and AmountSold. However, the totals were for all three sales representatives as one group. The next two examples show how to use the BY and ID statements as a part of the PROC PRINT step to separate the sales representatives into three groups with three separate subtotals and one grand total.

Computing Group Subtotals

To obtain separate subtotals for specific numeric variables, add a BY statement to the PROC PRINT step. When you use a BY statement, the PRINT procedure expects that you have already sorted the data set by the BY variables. Therefore, if your data is not sorted in the proper order, you must add a PROC SORT step before the PROC PRINT step.

The BY statement produces a separate section of the report for each BY group. Do not specify in the VAR statement the variable that you use in the BY statement. Otherwise, the values of the BY variable appear twice in the report, as a header across the page and in columns down the page.

The following program uses the BY statement in the PROC PRINT step to obtain separate subtotals of the variables Units and AmountSold for each sales representative:

   options linesize=80 pageno=1 nodate;

   proc sort data=qtr02;
      by SalesRep;                      /* (1) */
   run;

   proc print data=qtr02 noobs;
      var Month Units AmountSold;       /* (2) */
      where Units>500 or AmountSold>20000;
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      by SalesRep;                      /* (2) */
      title1 'Sales Rep Quarterly Totals for Sales above 500 Units or $20,000';
   run;

The following list corresponds to the numbered items in the preceding program:

(1) The BY statement in the PROC SORT step sorts the data.

(2) The variable SalesRep becomes part of the BY statement instead of the VAR statement.
The following output shows the report:

Output 25.12   Grouping Observations with the BY Statement

      Sales Rep Quarterly Totals for Sales above 500 Units or $20,000      1

------------------------------- SalesRep=Garcia ------------------------------- (1)

                  Month         Units        AmountSold

                   04           1,715        $53,113.55
                   06             512        $15,856.64
                   06           1,000        $30,970.00
                --------      -------    --------------
                SalesRep        3,227 (2)    $99,940.19 (2)

---------------------------- SalesRep=Hollingsworth ----------------------------

                  Month         Units        AmountSold

                   04             530        $16,414.10
                   05           1,120        $34,686.40
                   05           1,030        $31,899.10
                --------      -------    --------------
                SalesRep        2,680        $82,999.60

------------------------------- SalesRep=Jensen -------------------------------

                  Month         Units        AmountSold

                   04           1,110        $34,376.70
                   04             675        $20,904.75
                --------      -------    --------------
                SalesRep        1,785        $55,281.45
                              =======    ==============
                                7,692 (3)   $238,221.24 (3)

The following list corresponds to the numbered items in the preceding report:

(1) The values of the BY variables appear in dashed lines, called BY lines, above the output for the BY group.

(2) The subtotal for the numeric variables is computed for each BY group (the three sales representatives).

(3) A grand total is computed for the numeric variables.

Identifying Group Subtotals

You can use both the BY and ID statements in the PROC PRINT step to modify the appearance of your report. When you specify the same variables in both the BY and ID statements, the PRINT procedure uses the ID variable to identify the start of the BY group.

The following example uses the data set that was sorted in the last example and adds the ID statement to the PROC PRINT step:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr02;
      var Month Units AmountSold;
      where Units>500 or AmountSold>20000;
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      by SalesRep;
      id SalesRep;
      title1 'Sales Rep Quarterly Totals for Sales above 500 Units or $20,000';
   run;

The following output shows the report:

Output 25.13   Grouping Observations with the BY and ID Statements

      Sales Rep Quarterly Totals for Sales above 500 Units or $20,000      1

          SalesRep         Month      Units        AmountSold

          Garcia            04        1,715        $53,113.55
                            06          512        $15,856.64
                            06        1,000        $30,970.00
          -------------             -------    --------------
          Garcia                      3,227        $99,940.19

          Hollingsworth     04          530        $16,414.10
                            05        1,120        $34,686.40
                            05        1,030        $31,899.10
          -------------             -------    --------------
          Hollingsworth               2,680        $82,999.60

          Jensen            04        1,110        $34,376.70
                            04          675        $20,904.75
          -------------             -------    --------------
          Jensen                      1,785        $55,281.45
                                    =======    ==============
                                      7,692       $238,221.24

The report has two distinct features. PROC PRINT separates the report into groups and suppresses the repetitive values of the BY and ID variables. The dashed lines above the BY groups do not appear because the BY and ID statements are used together in the PROC PRINT step.

Remember these general rules about the SUM, BY, and ID statements:

- You can specify a variable in the SUM statement while omitting it in the VAR statement. PROC PRINT simply adds the variable to the list of variables in the VAR statement.
- Do not specify in the SUM statement any variable that you used in the ID or BY statement.
- When you use a BY statement and you specify only one BY variable, PROC PRINT subtotals the SUM variable for each BY group that contains more than one observation.
- When you use a BY statement and you specify multiple BY variables, PROC PRINT shows a subtotal for a BY variable only when the value changes and when there are multiple observations with that value.

Computing Multiple Group Subtotals

You can also use two or more variables in a BY statement to define groups and subgroups.
The following program produces a report that groups observations first by sales representative and then by month:

   options linesize=80 pageno=1 nodate;

   proc sort data=qtr02;
      by SalesRep Month;                        /* (1) */
   run;

   proc print data=qtr02 noobs
              n='Sales Transactions:'           /* (2) */
                'Total Sales Transactions:';    /* (2) */
      var Units AmountSold;                     /* (3) */
      where Units>500 or AmountSold>20000;
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      by SalesRep Month;                        /* (3) */
      title1 'Monthly Sales Rep Totals for Sales above 500 Units or $20,000';
   run;

The following list corresponds to the numbered items in the preceding program:

(1) The BY statement in the PROC SORT step sorts the data by SalesRep and Month.

(2) The N= option in the PROC PRINT statement reports the number of observations in a BY group and (because of the SUM statement) the overall total number of observations at the end of the report. The first piece of explanatory text that N= provides precedes the number for each BY group. The second piece of explanatory text that N= provides precedes the number for the overall total.

(3) The variables SalesRep and Month are omitted in the VAR statement because the variables are specified in the BY statement. This prevents PROC PRINT from reporting the values for these variables twice.
The following output shows the report:

Output 25.14   Grouping Observations with Multiple BY Variables

       Monthly Sales Rep Totals for Sales above 500 Units or $20,000       1

--------------------------- SalesRep=Garcia Month=04 ---------------------------

                          Units        AmountSold

                          1,715        $53,113.55

                        Sales Transactions:1 (1)

--------------------------- SalesRep=Garcia Month=06 ---------------------------

                          Units        AmountSold

                            512        $15,856.64
                          1,000        $30,970.00
                        -------    --------------
                          1,512 (2)    $46,826.64 (2)
                          3,227 (3)    $99,940.19 (3)

                        Sales Transactions:2

----------------------- SalesRep=Hollingsworth Month=04 ------------------------

                          Units        AmountSold

                            530        $16,414.10

                        Sales Transactions:1

----------------------- SalesRep=Hollingsworth Month=05 ------------------------

                          Units        AmountSold

                          1,120        $34,686.40
                          1,030        $31,899.10
                        -------    --------------
                          2,150        $66,585.50
                          2,680        $82,999.60

                        Sales Transactions:2

--------------------------- SalesRep=Jensen Month=04 ---------------------------

                          Units        AmountSold

                          1,110        $34,376.70
                            675        $20,904.75
                        -------    --------------
                          1,785        $55,281.45
                          1,785        $55,281.45
                        =======    ==============
                          7,692 (4)   $238,221.24 (4)

                        Sales Transactions:2 (1)
                     Total Sales Transactions:8 (5)

The following list corresponds to the numbered items in the preceding report:

(1) The number of observations in the BY group is computed. This corresponds to the number of sales transactions for a sales representative in the month.

(2) When the BY group contains two or more observations, a subtotal is computed for each numeric variable.

(3) When the value of the first variable in the BY group changes, an overall subtotal is computed for each numeric variable. The values of Units and AmountSold are summed for every month that Garcia had sales transactions because the sales representative changes in the next BY group.

(4) The grand total is computed for the numeric variables.

(5) The number of observations in the whole report is computed. This corresponds to the total number of sales transactions for every sales representative during the second quarter.

Computing Group Totals

When you use multiple BY variables as in the previous example, you can suppress the subtotals that otherwise appear every time the value of a BY variable changes. Use the SUMBY statement to control which BY variable causes subtotals to appear. You can specify only one SUMBY variable, and this variable must also be specified in the BY statement. PROC PRINT computes sums when a change occurs to the following values:

- the value of the SUMBY variable
- the value of any variable in the BY statement that is specified before the SUMBY variable

For example, consider the following statements:

   by Quarter SalesRep Month;
   sumby SalesRep;

SalesRep is the SUMBY variable. In the BY statement, Quarter comes before SalesRep while Month comes after SalesRep. Therefore, these statements cause PROC PRINT to compute totals when either Quarter or SalesRep changes value, but not when Month changes value.

The following program omits the monthly subtotals for each sales representative by designating SalesRep as the variable to sum by:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr02;
      var Units AmountSold;
      where Units>500 or AmountSold>20000;
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      by SalesRep Month;
      id SalesRep Month;
      sumby SalesRep;
      title1 'Sales Rep Quarterly Totals for Sales above 500 Units or $20,000';
   run;

This program assumes that the QTR02 data set has been previously sorted by the variables SalesRep and Month.
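If QTR02 is not already in that order, a sort step along the following lines could be run first. This is a minimal sketch that reuses the PROC SORT step from the earlier multiple BY variable example. It assumes that sorting QTR02 in place is acceptable:

```sas
/* Sketch: put QTR02 in the order that the BY statement expects. */
/* Assumes that it is acceptable to sort the data set in place.  */
proc sort data=qtr02;
   by SalesRep Month;
run;
```

With the data in this order, the PROC PRINT step can form its BY groups and compute the SUMBY subtotals.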
The following output shows the report:

Output 25.15   Combining Subtotals for Groups of Observations

      Sales Rep Quarterly Totals for Sales above 500 Units or $20,000      1

          SalesRep         Month      Units        AmountSold

          Garcia            04        1,715        $53,113.55

          Garcia            06          512        $15,856.64
                                      1,000        $30,970.00
          -------------    -----    -------    --------------
          Garcia                      3,227        $99,940.19

          Hollingsworth     04          530        $16,414.10

          Hollingsworth     05        1,120        $34,686.40
                                      1,030        $31,899.10
          -------------    -----    -------    --------------
          Hollingsworth               2,680        $82,999.60

          Jensen            04        1,110        $34,376.70
                                        675        $20,904.75
          -------------    -----    -------    --------------
          Jensen                      1,785        $55,281.45
                                    =======    ==============
                                      7,692       $238,221.24

Grouping Observations on Separate Pages

You can also create a report with multiple sections that appear on separate pages by using the PAGEBY statement with the BY statement. The PAGEBY statement identifies a variable in the BY statement that causes the PRINT procedure to begin the report on a new page when a change occurs to the following values:

- the value of the BY variable that is named in the PAGEBY statement
- the value of any BY variable that precedes it in the BY statement

The following program uses a PAGEBY statement with the BY statement to create a report with multiple sections:

   options linesize=80 pageno=1 nodate;

   proc print data=qtr02 noobs;
      var Units AmountSold;
      where Units>500 or AmountSold>20000;
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      by SalesRep Month;
      id SalesRep Month;
      sumby SalesRep;
      pageby SalesRep;
      title1 'Sales Rep Quarterly Totals for Sales above 500 Units or $20,000';
   run;

This program assumes that the QTR02 data set has been previously sorted by the variables SalesRep and Month.
The following output shows the report:

Output 25.16   Grouping Observations on Separate Pages

      Sales Rep Quarterly Totals for Sales above 500 Units or $20,000      1

          SalesRep         Month      Units        AmountSold

          Garcia            04        1,715        $53,113.55

          Garcia            06          512        $15,856.64
                                      1,000        $30,970.00
          -------------    -----    -------    --------------
          Garcia                      3,227        $99,940.19

      Sales Rep Quarterly Totals for Sales above 500 Units or $20,000      2

          SalesRep         Month      Units        AmountSold

          Hollingsworth     04          530        $16,414.10

          Hollingsworth     05        1,120        $34,686.40
                                      1,030        $31,899.10
          -------------    -----    -------    --------------
          Hollingsworth               2,680        $82,999.60

      Sales Rep Quarterly Totals for Sales above 500 Units or $20,000      3

          SalesRep         Month      Units        AmountSold

          Jensen            04        1,110        $34,376.70
                                        675        $20,904.75
          -------------    -----    -------    --------------
          Jensen                      1,785        $55,281.45
                                    =======    ==============
                                      7,692       $238,221.24

A page break occurs in the report when the value of the variable SalesRep changes from Garcia to Hollingsworth and from Hollingsworth to Jensen.

Creating Customized Reports

Ways to Customize a Report

As you have seen from the previous examples, the PRINT procedure produces simple detail reports quickly and easily. With additional statements and options, you can enhance the readability of your reports. For example, you can do the following:

- Add descriptive titles and footnotes.
- Define and split labels across multiple lines.
- Add double spacing.
- Ensure that the column widths are uniform across the pages of the report.

Understanding Titles and Footnotes

Adding descriptive titles and footnotes is one of the easiest and most effective ways to improve the appearance of a report. You can use the TITLE statement to include from 1 to 10 lines of text at the top of the report. You can use the FOOTNOTE statement to include from 1 to 10 lines of text at the bottom of the report.
In the TITLE statement, you can specify n immediately following the keyword TITLE to indicate the level of the TITLE statement. n is a number from 1 to 10 that specifies the line number of the title. You must enclose the text of each title in single or double quotation marks.

Skipping over some values of n indicates that those lines are blank. For example, if you specify TITLE1 and TITLE3 statements but skip TITLE2, then a blank line occurs between the first and third lines.

When you specify a title, SAS uses that title for all subsequent output until you cancel it or define another title for that line. A TITLE statement for a given line cancels the previous TITLE statement for that line and for all lines below it, that is, for those with larger n values.

To cancel all existing titles, specify a TITLE statement without the n value:

   title;

To suppress the nth title and all titles below it, use the following statement:

   titlen;

Footnotes work the same way as titles. In the FOOTNOTE statement, you can specify n immediately following the keyword FOOTNOTE to indicate the level of the FOOTNOTE statement. n is a number from 1 to 10 that specifies the line number of the footnote. You must enclose the text of each footnote in single or double quotation marks. As with the TITLE statement, skipping over some values of n indicates that those lines are blank.

Remember that the footnotes are pushed up from the bottom of the report. In other words, the FOOTNOTE statement with the largest number appears on the bottom line.

When you specify a footnote, SAS uses that footnote for all subsequent output until you cancel it or define another footnote for that line. You cancel and suppress footnotes in the same way that you cancel and suppress titles.

Note: The maximum title length and footnote length that is allowed depends on your operating environment and the value of the LINESIZE= system option. Refer to the SAS documentation for your operating environment for more information.
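The cancellation rules described above can be illustrated with a short sketch. The title text and the use of the QTR02 data set in the PROC PRINT steps are assumptions made for illustration and are not part of the chapter's examples:

```sas
/* Illustrative sketch only: the title text is hypothetical. */
title1 'TruBlend Coffee Makers, Inc.';
title2 'Internal Use Only';
title3 'Quarterly Sales Report';

proc print data=qtr02;     /* report shows title lines 1, 2, and 3 */
run;

title2 'Second Quarter';   /* replaces line 2 and cancels line 3   */

proc print data=qtr02;     /* report shows title lines 1 and 2     */
run;

title;                     /* cancels all titles                   */
footnote;                  /* cancels all footnotes                */
```

Because titles and footnotes stay in effect until they are canceled, ending a program with null TITLE and FOOTNOTE statements is one way to keep them from appearing in later, unrelated output.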
Adding Titles and Footnotes

The following program includes titles and footnotes in a report of second quarter sales during the month of April:

   options linesize=80 pageno=1 nodate;

   proc sort data=qtr02;
      by SalesRep;
   run;

   proc print data=qtr02 noobs;
      var SalesRep Month Units AmountSold;
      where Month='04';
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      title1 'TruBlend Coffee Makers, Inc.';
      title3 'Quarterly Sales Report';
      footnote1 'April Sales Totals';
      footnote2 'COMPANY CONFIDENTIAL INFORMATION';
   run;

The report includes three title lines and two footnote lines. The program omits the TITLE2 statement so that the second title line is blank.

The following output shows the report:

Output 25.17   Adding Titles and Footnotes

                       TruBlend Coffee Makers, Inc. (1)                    1
                                    (2)
                          Quarterly Sales Report (1)

          SalesRep         Month      Units        AmountSold

          Garcia            04          150         $4,645.50
          Garcia            04        1,715        $53,113.55
          Hollingsworth     04          260         $8,052.20
          Hollingsworth     04          530        $16,414.10
          Jensen            04        1,110        $34,376.70
          Jensen            04          675        $20,904.75
                                    =======    ==============
                                      4,440       $137,506.80

                            April Sales Totals (3)
                     COMPANY CONFIDENTIAL INFORMATION (3)

The following list corresponds to the numbered items in the preceding report:

(1) a descriptive title line that is generated by a TITLE statement

(2) a blank title line that is generated by omitting a TITLE statement for the second line

(3) a descriptive footnote line that is generated by a FOOTNOTE statement

Defining Labels

By default, SAS uses variable names for column headings. However, to improve the appearance of a report, you can specify your own column headings. To override the default headings, you need to do the following:

- Add the LABEL option to the PROC PRINT statement.
- Define the labels in the LABEL statement.

The LABEL option causes the report to display labels, instead of variable names, for the column headings. You use the LABEL statement to assign the labels for the specific variables.
A label can be up to 256 characters long, including blanks, and must be enclosed in single or double quotation marks. If you assigned labels when you created the SAS data set, then you can omit the LABEL statement from the PROC PRINT step.

The following program modifies the previous program and defines labels for the variables SalesRep, Units, and AmountSold:

   options linesize=80 pageno=1 nodate;

   proc sort data=qtr02;
      by SalesRep;
   run;

   proc print data=qtr02 noobs label;
      var SalesRep Month Units AmountSold;
      where Month='04';
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      label SalesRep   = 'Sales Rep.'
            Units      = 'Units Sold'
            AmountSold = 'Amount Sold';
      title 'TruBlend Coffee Maker Sales Report for April';
      footnote;
   run;

The TITLE statement redefines the first title and cancels any additional titles that might have been previously defined. The FOOTNOTE statement cancels any footnotes that might have been previously defined.

The following output shows the report:

Output 25.18   Defining Labels

               TruBlend Coffee Maker Sales Report for April                1

                                      Units
          Sales Rep.       Month       Sold       Amount Sold

          Garcia            04          150         $4,645.50
          Garcia            04        1,715        $53,113.55
          Hollingsworth     04          260         $8,052.20
          Hollingsworth     04          530        $16,414.10
          Jensen            04        1,110        $34,376.70
          Jensen            04          675        $20,904.75
                                    =======    ==============
                                      4,440       $137,506.80

The label Units Sold is split between two lines. The PRINT procedure splits the label to conserve space.

Splitting Labels across Two or More Lines

Sometimes labels are too long to fit on one line, or you might want to split a label across two or more lines. By default, SAS automatically splits labels on the basis of column width. You can use the SPLIT= option to control where the labels are separated into multiple lines.

The SPLIT= option replaces the LABEL option in the PROC PRINT statement. (You do not need to use both SPLIT= and LABEL because SPLIT= implies that PROC PRINT uses labels.)
In the SPLIT= option, you specify an alphanumeric character that indicates where to split labels. To use the SPLIT= option, you need to do the following:

- Define the split character as a part of the PROC PRINT statement.
- Define the labels with a split character in the LABEL statement.

The following PROC PRINT step defines the slash (/) as the split character and includes slashes in the LABEL statement to split the labels Sales Representative, Units Sold, and Amount Sold into two lines each:

   options linesize=80 pageno=1 nodate;

   proc sort data=qtr02;
      by SalesRep;
   run;

   proc print data=qtr02 noobs split='/';
      var SalesRep Month Units AmountSold;
      where Month='04';
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      title 'TruBlend Coffee Maker Sales Report for April';
      label SalesRep   = 'Sales/Representative'
            Units      = 'Units/Sold'
            AmountSold = 'Amount/Sold';
   run;

The following output shows the report:

Output 25.19   Reporting: Splitting Labels into Two Lines

               TruBlend Coffee Maker Sales Report for April                1

          Sales                       Units            Amount
          Representative   Month       Sold              Sold

          Garcia            04          150         $4,645.50
          Garcia            04        1,715        $53,113.55
          Hollingsworth     04          260         $8,052.20
          Hollingsworth     04          530        $16,414.10
          Jensen            04        1,110        $34,376.70
          Jensen            04          675        $20,904.75
                                    =======    ==============
                                      4,440       $137,506.80

Adding Double Spacing

You might want to improve the appearance of a report by adding double spaces between the rows of the report. The following program uses the DOUBLE option in the PROC PRINT statement to double-space the report:

   options linesize=80 pageno=1 nodate;

   proc sort data=qtr02;
      by SalesRep;
   run;

   proc print data=qtr02 noobs split='/' double;
      var SalesRep Month Units AmountSold;
      where Month='04';
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      title 'TruBlend Coffee Maker Sales Report for April';
      label SalesRep   = 'Sales/Representative'
            Units      = 'Units/Sold'
            AmountSold = 'Amount/Sold';
   run;

The following output shows the report:

Output 25.20   Adding Double Spacing

               TruBlend Coffee Maker Sales Report for April                1

          Sales                       Units            Amount
          Representative   Month       Sold              Sold

          Garcia            04          150         $4,645.50

          Garcia            04        1,715        $53,113.55

          Hollingsworth     04          260         $8,052.20

          Hollingsworth     04          530        $16,414.10

          Jensen            04        1,110        $34,376.70

          Jensen            04          675        $20,904.75
                                    =======    ==============
                                      4,440       $137,506.80

Requesting Uniform Column Widths

By default, PROC PRINT uses the width of the formatted variable as the column width. If you do not assign a format to the variable that explicitly specifies a field width, then the column width is the widest value of the variable on that page. This can cause the column widths to vary on different pages of a report.

The WIDTH=UNIFORM option ensures that the columns of data line up from one page to the next. PROC PRINT uses a variable's formatted width or, if no format is assigned, the widest data value as the variable's column width on all pages. Unless you specify this option, PROC PRINT individually constructs each page of output. Each page contains as many variables and observations as possible. As a result, the report might have different numbers of variables or different column widths from one page to the next.

If the sales records for TruBlend Coffee Makers* are sorted by the sales representatives and a report is created without using the WIDTH=UNIFORM option in the PROC PRINT statement, then the columns of values on the first page will not line up with those on the next page. The column shift occurs because of differences in the name length of the sales representatives. PROC PRINT lines up the columns on the first page of the report, allowing enough space for the longest name, Hollingsworth. On the second page the longest name is Jensen, so the columns shift relative to the first page.

* See "Input File and SAS Data Sets for Examples" on page 372 to examine the sales records.

The following example uses the WIDTH= option in the PROC PRINT statement to prevent the shifting of columns:

   options pagesize=66 linesize=80 pageno=1 nodate;

   proc sort data=qtr03;
      by SalesRep;
   run;

   proc print data=qtr03 split='/' width=uniform;
      var SalesRep Month Units AmountSold;
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      title 'TruBlend Coffee Makers 3rd Quarter Sales Report';
      label SalesRep   = 'Sales/Rep.'
            Units      = 'Units/Sold'
            AmountSold = 'Amount/Sold';
   run;

The following output shows the report:

Output 25.21   Reporting: Using Uniform Column Widths

              TruBlend Coffee Makers 3rd Quarter Sales Report              1

                 Sales                       Units            Amount
          Obs    Rep.             Month       Sold              Sold

            1    Garcia            07          250         $7,742.50
            2    Garcia            07           90         $2,787.30
            3    Garcia            07           90         $2,787.30
            4    Garcia            07          265         $8,207.05
            5    Garcia            07        1,250        $38,712.50
            6    Garcia            07           90         $2,787.30
            7    Garcia            07           90         $2,787.30
            8    Garcia            07          465        $14,401.05
            9    Garcia            08          110         $5,445.00
           10    Garcia            08          240         $7,432.80
           11    Garcia            08          198         $6,132.06
           12    Garcia            08        1,198        $37,102.06
           13    Garcia            08          110         $5,445.00
           14    Garcia            08          240         $7,432.80
           15    Garcia            08          198         $6,132.06
           16    Garcia            09          118         $3,654.46
           17    Garcia            09          412        $12,759.64
           18    Garcia            09          100         $3,097.00
           19    Garcia            09        1,118        $34,624.46
           20    Garcia            09          412        $12,759.64
           21    Garcia            09          100         $3,097.00
           22    Hollingsworth     07           60         $2,970.00
           23    Hollingsworth     07           30         $1,485.00
           24    Hollingsworth     07          130         $4,026.10
           25    Hollingsworth     07           60         $2,970.00
           26    Hollingsworth     07          330        $10,220.10
           27    Hollingsworth     08          120         $3,716.40
           28    Hollingsworth     08          230         $7,123.10
           29    Hollingsworth     08          230        $11,385.00
           30    Hollingsworth     08          290         $8,981.30
           31    Hollingsworth     08          330        $10,220.10
           32    Hollingsworth     08           50         $2,475.00
           33    Hollingsworth     09          125         $3,871.25
           34    Hollingsworth     09        1,000        $30,970.00
           35    Hollingsworth     09          125         $3,871.25
           36    Hollingsworth     09          175         $5,419.75
           37    Jensen            07          110         $3,406.70
           38    Jensen            07          110         $3,406.70
           39    Jensen            07          275         $8,516.75
           40    Jensen            07          110         $3,406.70
           41    Jensen            07          110         $3,406.70
           42    Jensen            07          675        $20,904.75
           43    Jensen            08          145         $4,490.65
           44    Jensen            08          453        $14,029.41
           45    Jensen            08          453        $14,029.41
           46    Jensen            08           45         $2,227.50
           47    Jensen            08          145         $4,490.65
           48    Jensen            08          453        $14,029.41
           49    Jensen            08          225        $11,137.50
           50    Jensen            09          254         $7,866.38
           51    Jensen            09          284         $8,795.48
           52    Jensen            09          275        $13,612.50
           53    Jensen            09          876        $27,129.72
           54    Jensen            09          254         $7,866.38
           55    Jensen            09          284         $8,795.48

              TruBlend Coffee Makers 3rd Quarter Sales Report              2

                 Sales                       Units            Amount
          Obs    Rep.             Month       Sold              Sold

           56    Jensen            09          275        $13,612.50
           57    Jensen            09          876        $27,129.72
                                           =======    ==============
                                            17,116       $557,321.62

Making Your Reports Easy to Change

Understanding the SAS Macro Facility

Base SAS includes the macro facility as a tool to customize SAS and to reduce the amount of text you must enter to do common tasks.
The macro facility enables you to assign a name to character strings or groups of SAS programming statements. From that point on, you can work with the names rather than with the text itself. When you use a macro facility name in a SAS program, the macro facility generates SAS statements and commands as needed. The rest of SAS receives those statements and uses them in the same way it uses the ones you enter in the standard manner.

The macro facility enables you to create macro variables to substitute text in SAS programs. One of the major advantages of macro variables is that you can change a value in one place in your program and have the change appear in every reference to it throughout the program. You can substitute text by using automatic macro variables or by using your own macro variables, which you define and assign values to.

Using Automatic Macro Variables

The SAS macro facility includes many automatic macro variables. Some of the values associated with the automatic macro variables depend on your operating environment. You can use automatic macro variables to provide the time, the day of the week, and the date based on your computer's internal clock, as well as other processing information.

To include a second title on a report that displays the text string "Produced on" followed by today's date, add the following TITLE statement to your program:

   title2 "Produced on &SYSDATE9";

Notice the syntax for this statement. First, the ampersand that precedes SYSDATE9 tells the SAS macro facility to replace the reference with its assigned value. In this case, the assigned value is the date the SAS session started and is expressed as ddmmmyyyy, where

   dd     is a two-digit day of the month
   mmm    is the first three letters of the month name
   yyyy   is a four-digit year

Second, the text of the TITLE statement is enclosed in double quotation marks because the SAS macro facility resolves macro variable references in the TITLE statement and the FOOTNOTE statement only if they are in double quotation marks.

The following program, which includes a PROC SORT step and the TITLE statement, demonstrates how to use the SYSDATE9 automatic macro variable:

   options linesize=80 pageno=1 nodate;

   proc sort data=qtr04;
      by SalesRep;
   run;

   proc print data=qtr04 noobs split='/' width=uniform;
      var SalesRep Month Units AmountSold;
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      title1 'TruBlend Coffee Maker Quarterly Sales Report';
      title2 "Produced on &SYSDATE9";
      label SalesRep   = 'Sales/Rep.'
            Units      = 'Units/Sold'
            AmountSold = 'Amount/Sold';
   run;

The following output shows the report:

Output 25.22  Using Automatic Macro Variables

   TruBlend Coffee Maker Quarterly Sales Report               1
   Produced on 30JAN2001

   Sales                    Units           Amount
   Rep.           Month      Sold             Sold
   Garcia          10         250        $7,742.50
   Garcia          10         365       $11,304.05
   Garcia          11         198        $6,132.06
   Garcia          11         120        $3,716.40
   Garcia          12       1,000       $30,970.00
   Hollingsworth   10         530       $16,414.10
   Hollingsworth   10         265        $8,207.05
   Hollingsworth   11       1,230       $38,093.10
   Hollingsworth   11         150        $7,425.00
   Hollingsworth   12         125        $6,187.50
   Hollingsworth   12         175        $5,419.75
   Jensen          10         975       $30,195.75
   Jensen          10          55        $1,703.35
   Jensen          11         453       $14,029.41
   Jensen          11          70        $2,167.90
   Jensen          12         876       $27,129.72
   Jensen          12       1,254       $38,836.38
                          =======   ==============
                            8,091      $255,674.02

Using Your Own Macro Variables

In addition to using automatic macro variables, you can use the %LET statement to define your own macro variables and refer to them with the ampersand prefix.
Defining macro variables at the beginning of your program enables you to change other parts of the program easily. The following example shows how to define two macro variables, Quarter and Year, and how to refer to them in a TITLE statement.

Defining Macro Variables

To use two macro variables that produce flexible report titles, first define the macro variables. The following %LET statements define the two macro variables:

   %let Quarter=Fourth;
   %let Year=2000;

The name of the first macro variable is Quarter and it is assigned the value Fourth. The name of the second macro variable is Year and it is assigned the value 2000. Macro variable names such as these conform to the following rules for SAS names:
- macro variable names are one to 32 characters long
- macro variable names begin with a letter or an underscore
- letters, numbers, and underscores follow the first character.

In these simple situations, do not assign values to macro variables that contain unmatched quotation marks or semicolons. If the values contain leading or trailing blanks, then SAS removes the blanks.

Referring to Macro Variables

To refer to the value of a macro variable, place an ampersand prefix in front of the name of the variable. The following TITLE statement contains references to the values of the macro variables Quarter and Year, which were previously defined in %LET statements:

   title3 "&Quarter Quarter &Year Sales Totals";

The complete program, which includes the two %LET statements and the TITLE3 statement, follows:

   options linesize=80 pageno=1 nodate;

   %let Quarter=Fourth;   /* 1 */
   %let Year=2000;        /* 2 */

   proc sort data=qtr04;
      by SalesRep;
   run;

   proc print data=qtr04 noobs split='/' width=uniform;
      var SalesRep Month Units AmountSold;
      format Units comma7. AmountSold dollar14.2;
      sum Units AmountSold;
      title1 'TruBlend Coffee Maker Quarterly Sales Report';
      title2 "Produced on &SYSDATE9";
      title3 "&Quarter Quarter &Year Sales Totals";   /* 3 */
      label SalesRep   = 'Sales/Rep.'
            Units      = 'Units/Sold'
            AmountSold = 'Amount/Sold';
   run;

The following list corresponds to the numbered items in the preceding program:

1  The %LET statement creates a macro variable with the sales quarter. When an ampersand precedes Quarter, the SAS macro facility knows to replace any reference to &Quarter with the assigned value of Fourth.
2  The %LET statement creates a macro variable with the year. When an ampersand precedes Year, the SAS macro facility knows to replace any reference to &Year with the assigned value of 2000.
3  The text of the TITLE2 and TITLE3 statements is enclosed in double quotation marks so that the SAS macro facility can resolve the macro variable references.

The following output shows the report:

Output 25.23  Using Your Own Macro Variables

   TruBlend Coffee Maker Quarterly Sales Report               1
   Produced on 12JAN2001
   Fourth Quarter 2000 Sales Totals

   Sales                    Units           Amount
   Rep.           Month      Sold             Sold
   Garcia          10         250        $7,742.50
   Garcia          10         365       $11,304.05
   Garcia          11         198        $6,132.06
   Garcia          11         120        $3,716.40
   Garcia          12       1,000       $30,970.00
   Hollingsworth   10         530       $16,414.10
   Hollingsworth   10         265        $8,207.05
   Hollingsworth   11       1,230       $38,093.10
   Hollingsworth   11         150        $7,425.00
   Hollingsworth   12         125        $6,187.50
   Hollingsworth   12         175        $5,419.75
   Jensen          10         975       $30,195.75
   Jensen          10          55        $1,703.35
   Jensen          11         453       $14,029.41
   Jensen          11          70        $2,167.90
   Jensen          12         876       $27,129.72
   Jensen          12       1,254       $38,836.38
                          =======   ==============
                            8,091      $255,674.02

Using macro variables can make your programs easy to modify.
For example, if the previous program contained many references to Quarter and Year, then changes in only three places will produce an entirely different report:
- the two values in the %LET statements
- the data set name in the PROC PRINT statement

Review of SAS Tools

PROC PRINT Statements

PROC PRINT <DATA=SAS-data-set> ;

PROC UNIVARIATE option(s);
   starts the UNIVARIATE procedure. You can specify the following options in the PROC UNIVARIATE statement:

   DATA=SAS-data-set
      names the SAS data set that PROC UNIVARIATE uses. If you omit DATA=, then PROC UNIVARIATE uses the most recently created data set.
   NOPRINT
      suppresses the descriptive statistics that the PROC UNIVARIATE statement creates.

CLASS variable-1<(variable-option(s))> <variable-2<(variable-option(s))>> </ option(s)>;
   specifies up to two variables whose values determine the classification levels for the component histograms. Variables in a CLASS statement are referred to as class variables.

   You can specify the following option(s) in the CLASS statement:

   ORDER=DATA | FORMATTED | FREQ | INTERNAL
      specifies the display order for the class variable values, where
      DATA
         orders values according to their order in the input data set.
      FORMATTED
         orders values by their ascending formatted values. This order depends on your operating environment.
      FREQ
         orders values by descending frequency count so that levels with the most observations are listed first.
      INTERNAL
         orders values by their unformatted values, which yields the same order as PROC SORT. This order depends on your operating environment.

HISTOGRAM <variable(s)> </ option(s)>;
   creates histograms and comparative histograms using high-resolution graphics for the analysis variables that are specified. If you omit variable(s) in the HISTOGRAM statement, then the procedure creates a histogram for each variable that you list in the VAR statement, or for each numeric variable in the DATA= data set if you omit a VAR statement.
   You can specify the following options in the HISTOGRAM statement:

   CGRID=color
      specifies the color for grid lines when a grid displays on the histogram.
   GRID
      specifies to display a grid on the histogram. Grid lines are horizontal lines that are positioned at major tick marks on the vertical axis.
   HOFFSET=value
      specifies the offset in percentage screen units at both ends of the horizontal axis.
   LGRID=linetype
      specifies the line type for the grid when a grid displays on the histogram. The default is a solid line.
   MIDPOINTS=value(s)
      determines the width of the histogram bars as the difference between consecutive midpoints. PROC UNIVARIATE uses the same value(s) for all variables. You must use evenly spaced midpoints that are listed in increasing order.
   VAXIS=value(s)
      specifies tick mark values for the vertical axis. Use evenly spaced values that are listed in increasing order. The first value must be zero and the last value must be greater than or equal to the height of the largest bar. You must scale the values in the same units as the bars.
   VMINOR=n
      specifies the number of minor tick marks between each major tick mark on the vertical axis. PROC UNIVARIATE does not label minor tick marks.
   VSCALE=scale
      specifies the scale of the vertical axis, where scale is
      COUNT
         scales the data in units of the number of observations per data unit.
      PERCENT
         scales the data in units of percentage of observations per data unit.
      PROPORTION
         scales the data in units of proportion of observations per data unit.

INSET <keyword(s)> </ option(s)>;
   places a box or table of summary statistics, called an inset, directly in the histogram. You can specify the following keywords and options in the INSET statement:

   keyword(s)
      specifies one or more keywords that identify the information to display in the inset. PROC UNIVARIATE displays the information in the order that you request the keywords. For a complete list of keywords, see the INSET statement in SAS/GRAPH Software: Reference, Volumes 1 and 2.
   FORMAT=format
      specifies a format for all the values in the inset. If you specify a format for a particular statistic, then this format overrides FORMAT=format.
   HEADER=string
      specifies the heading text, where string cannot exceed 40 characters.
   NOFRAME
      suppresses the frame drawn around the text.
   POSITION=position
      determines the position of the inset. The position is a compass point keyword, a margin keyword, or a pair of coordinates (x, y). The default position is NW, which positions the inset in the upper-left (northwest) corner of the display.

GOPTIONS Statement

GOPTIONS options-list;
   specifies values for graphics options. Graphics options control characteristics of the graph, such as size, colors, type fonts, fill patterns, and symbols. In addition, they affect the settings of device parameters, which are defined in the device entry. Device parameters control such characteristics as the appearance of the display, the type of output that is produced, and the destination of the output.

FORMAT Statement

FORMAT variable format-name;
   enables you to display the value of a variable by using a special pattern that you specify as format-name.

Learning More

PROC CHART
   For complete documentation, see the Base SAS Procedures Guide. In addition to the features that are described in this section, you can use PROC CHART to create star charts, to draw a reference line at a particular value on a bar chart, and to change the symbol that is used to draw charts. You can also create charts based not only on frequency, sum, and mean, but also on cumulative frequency, percent, and cumulative percent.
PROC UNIVARIATE
   For complete documentation, see the Base SAS Procedures Guide.
PROC PLOT
   For a discussion about how to plot the relationship between variables, see Chapter 28, “Plotting the Relationship between Variables,” on page 463. When you are preparing graphics presentations, some data lends itself to charts, while other data is better suited for plots.
SAS formats
   For complete documentation, see SAS Language Reference: Dictionary. Many formats are available with SAS, including fractions, hexadecimal values, roman numerals, social security numbers, date and time values, and numbers written as words.
PROC FORMAT
   For complete documentation about how to create your own formats, see the Base SAS Procedures Guide.
SAS/GRAPH software
   For complete documentation, see SAS/GRAPH Software: Reference, Volumes 1 and 2. If your site has SAS/GRAPH software, then you can use the GCHART procedure to take advantage of the high-resolution graphics capabilities of output devices and produce charts that include color, different fonts, and text.
TITLE and FOOTNOTE statements
   For a discussion about using titles and footnotes in a report, see “Understanding Titles and Footnotes” on page 392.

PART 8
Designing Your Own Output

Chapter 30  Writing Lines to the SAS Log or to an Output File
Chapter 31  Understanding and Customizing SAS Output: The Basics
Chapter 32  Understanding and Customizing SAS Output: The Output Delivery System (ODS)

CHAPTER 30
Writing Lines to the SAS Log or to an Output File

Introduction to Writing Lines to the SAS Log or to an Output File
   Purpose
   Prerequisites
Understanding the PUT Statement
Writing Output without Creating a Data Set
Writing Simple Text
   Writing a Character String
   Writing Variable Values
   Writing on the Same Line More than Once
   Releasing a Held Line
Writing a Report
   Writing to an Output File
   Designing the Report
   Writing Data Values
   Improving the Appearance of Numeric Data Values
   Writing a Value at the Beginning of Each BY Group
   Calculating Totals
   Writing Headings and Footnotes for a One-Page Report
Review of SAS Tools
   Statements
Learning More

Introduction to Writing Lines to the SAS Log or to an Output File

Purpose

In previous sections you learned how to store data values in a SAS data set and to use SAS procedures to produce a report that is based on these data values. In this section, you will learn how to do the following:
- design output by positioning data values and character strings in an output file
- prevent SAS from creating a data set by using the DATA _NULL_ statement
- produce reports by using the DATA step instead of using a procedure
- direct data to an output file by using a FILE statement

Prerequisites

Before proceeding with this section, you should be familiar with the concepts presented in the following sections:
- Chapter 1, “What Is the SAS System?,” on page 3
- Chapter 2, “Introduction to DATA Step Processing,” on page 19

Understanding the PUT Statement

When you create output using the DATA step, you can customize that output by using the PUT statement to write text to the SAS log or to another output file.
The PUT statement has the following form:

   PUT <variable <format>> <'character-string'>;

where

variable
   names the variable that you want to write.
format
   specifies a format to use when you write variable values.
'character-string'
   specifies a string of text to write. Be sure to enclose the string in quotation marks.

Writing Output without Creating a Data Set

In many cases, when you use a DATA step to write a report, you do not need to create an additional data set. When you use the DATA _NULL_ statement, SAS processes the DATA step without writing observations to a data set. Using the DATA _NULL_ statement can increase program efficiency considerably.

The following is an example of a DATA _NULL_ statement:

   data _null_;

The following program uses a PUT statement to write newspaper circulation values to the SAS log. Because the program uses a DATA _NULL_ statement, SAS does not create a data set.

   data _null_;
      length state $ 15;
      input state $ morning_copies evening_copies year;
      put state morning_copies evening_copies year;
      datalines;
   Massachusetts 798.4 984.7 1999
   Massachusetts 834.2 793.6 1998
   Massachusetts 750.3 . 1997
   Alabama . 698.4 1999
   Alabama 463.8 522.0 1998
   Alabama 583.2 234.9 1997
   Alabama . 339.6 1996
   ;

The following output shows the results:

Output 30.1  Writing to the SAS Log

   184  data _null_;
   185     length state $ 15;
   186     input state $ morning_copies evening_copies year;
   187     put state morning_copies evening_copies year;
   188     datalines;

   Massachusetts 798.4 984.7 1999
   Massachusetts 834.2 793.6 1998
   Massachusetts 750.3 . 1997
   Alabama . 698.4 1999
   Alabama 463.8 522 1998
   Alabama 583.2 234.9 1997
   Alabama . 339.6 1996
   196  ;

SAS indicates missing numeric values with a period. Note that the log contains three missing values.
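As a small illustration of the three parts of the PUT statement form (a character string, a variable, and a variable followed by a format), the following sketch may help. The variable names CITY and COPIES and the two data lines are invented for this illustration and are not part of the circulation example.

```sas
/* Sketch only: variable names and data values are invented for illustration. */
data _null_;
   input city $ copies;
   /* 'City: ' and 'Copies: ' are character strings.        */
   /* city is written with list output (no format).         */
   /* copies is written with the 6.1 format.                */
   put 'City: ' city 'Copies: ' copies 6.1;
   datalines;
Boston 798.4
Mobile 522
;
```

Because no destination is named, the lines go to the SAS log. With the 6.1 format, the second value should appear as 522.0 rather than 522.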
Writing Simple Text

Writing a Character String

In its simplest form, the PUT statement writes the character string that you specify to the SAS log, to a procedure output file, or to an external file. If you omit the destination (as in this example), then SAS writes the string to the log. In the following example, SAS executes the PUT statement once during each iteration of the DATA step. When SAS encounters missing values for MORNING_COPIES or EVENING_COPIES, the PUT statement writes a message to the log.

   data _null_;
      length state $ 15;
      infile 'your-input-file';
      input state $ morning_copies evening_copies year;
      if morning_copies=. then put '** Morning Circulation Figures Missing';
      else
      if evening_copies=. then put '** Evening Circulation Figures Missing';
   run;

The following output shows the results:

Output 30.2  Writing a Character String to the SAS Log

   93   data _null_;
   94      length state $ 15;
   95      infile 'your-input-file';
   96      input state $ morning_copies evening_copies year;
   97      if morning_copies =. then put '** Morning Circulation Figures Missing';
   98      else
   99      if evening_copies =. then put '** Evening Circulation Figures Missing';
   100  run;

   NOTE: The infile 'your-input-file' is:
         File Name=file-name,
         Owner Name=xxxxxx,Group Name=xxxx,
         Access Permission=rw-r--r--,
         File Size (bytes)=223

   ** Evening Circulation Figures Missing
   ** Morning Circulation Figures Missing
   ** Morning Circulation Figures Missing
   NOTE: 7 records were read from the infile 'your-input-file'.
         The minimum record length was 30.
         The maximum record length was 31.

Writing Variable Values

Output 30.2 shows that the value for MORNING_COPIES is missing for two observations in the data set, and the value for EVENING_COPIES is missing for one observation. To identify which observations have the missing values, write the value of one or more variables along with the character string.
The following program writes the value of YEAR and STATE, as well as the character string:

   data _null_;
      length state $ 15;
      infile 'your-input-file';
      input state $ morning_copies evening_copies year;
      if morning_copies =. then put
         '** Morning Circulation Figures Missing: ' year state;
      else
      if evening_copies =. then put
         '** Evening Circulation Figures Missing: ' year state;
   run;

Notice that the last character in each of the strings is blank. This is an example of list output. In list output, SAS automatically moves one column to the right after writing a variable value, but not after writing a character string. The simplest way to include the required space is to include it in the character string.

SAS keeps track of its position in the output line with a pointer. Another way to describe the action in this PUT statement is to say that in list output, the pointer moves one column to the right after writing a variable value, but not after writing a character string. In later parts of this section, you will learn ways to move the pointer to control where the next piece of text is written.

The following output shows the results:

Output 30.3  Writing a Character String and Variable Values

   164  data _null_;
   165     length state $ 15;
   166     infile 'your-input-file';
   167     input state $ morning_copies evening_copies year;
   168     if morning_copies =. then put
   169        '** Morning Circulation Figures Missing: ' year state;
   170     else
   171     if evening_copies =. then put
   172        '** Evening Circulation Figures Missing: ' year state;
   173  run;

   NOTE: The infile 'your-input-file' is:
         File Name=file-name,
         Owner Name=xxxxxx,Group Name=xxxx,
         Access Permission=rw-r--r--,
         File Size (bytes)=223

   ** Evening Circulation Figures Missing: 1997 Massachusetts
   ** Morning Circulation Figures Missing: 1999 Alabama
   ** Morning Circulation Figures Missing: 1996 Alabama
   NOTE: 7 records were read from the infile 'your-input-file'.
         The minimum record length was 30.
         The maximum record length was 31.

Writing on the Same Line More than Once

By default, each PUT statement begins on a new line. However, you can write on the same line if you use more than one PUT statement and at least one trailing @ (“at” sign).

The trailing @ is a type of pointer control called a line-hold specifier. Pointer controls are one way to specify where SAS writes text. In the following example, using the trailing @ causes SAS to write the item in the second PUT statement on the same line rather than on a new line. The execution of either PUT statement holds the output line for further writing because each PUT statement has a trailing @. SAS continues to write on that line when a later PUT statement in the same iteration of the DATA step is executed and also when a PUT statement in a later iteration is executed.

   options linesize=80 pagesize=60;

   data _null_;
      length state $ 15;
      infile 'your-input-file';
      input state $ morning_copies evening_copies year;
      if morning_copies =. then put
         '** Morning Tot Missing: ' year state @;
      if evening_copies =. then put
         '** Evening Tot Missing: ' year state @;
   run;

The following output shows the results:

Output 30.4  Writing on the Same Line More than Once

   157  options linesize=80 pagesize=60;
   158
   159  data _null_;
   160     length state $ 15;
   161     infile 'your-input-file';
   162     input state $ morning_copies evening_copies year;
   163     if morning_copies =. then put
   164        '** Morning Tot Missing: ' year state @;
   165     if evening_copies =. then put
   166        '** Evening Tot Missing: ' year state @;
   167  run;

   NOTE: The infile 'your-input-file' is:
         File Name=file-name,
         Owner Name=xxxxxx,Group Name=xxxx,
         Access Permission=rw-r--r--,
         File Size (bytes)=223

   ** Evening Tot Missing: 1997 Massachusetts ** Morning Tot Missing: 1999 Alabama
   ** Morning Tot Missing: 1996 Alabama
   NOTE: 7 records were read from the infile 'your-input-file'.
         The minimum record length was 30.
         The maximum record length was 31.

If the output line were long enough, then SAS would write all three messages about missing data on a single line. Because the line is not long enough, SAS continues writing on the next line. When it determines that an individual data value or character string does not fit on a line, SAS brings the entire item down to the next line. SAS does not split a data value or character string.

Releasing a Held Line

In the following example, the input file has five missing values. One record has missing values for both the MORNING_COPIES and EVENING_COPIES variables. Three other records have missing values for either the MORNING_COPIES or the EVENING_COPIES variable.

To improve the appearance of your report, you can write all the missing variables for each observation on a separate line. When values for the two variables MORNING_COPIES and EVENING_COPIES are missing, two PUT statements write to the same line. When either MORNING_COPIES or EVENING_COPIES is missing, only one PUT statement writes to that line.

SAS determines where to write the output by the presence of the trailing @ sign in the PUT statement and the presence of a null PUT statement that releases the hold on the line. Executing a PUT statement with a trailing @ causes SAS to hold the current output line for further writing, either in the current iteration of the DATA step or in a future iteration. Executing a PUT statement without a trailing @ releases the held line.

To release a line without writing a message, use a null PUT statement:

   put;

A null PUT statement has the same characteristics as other PUT statements: by default, it writes output to a new line, writes what you specify in the statement (nothing in this case), and releases the line when it finishes executing. If a trailing @ is in effect, then the null PUT statement begins on the current line, writes nothing, and releases the line.
The following program shows how to write one or more items to the same line:
- If a value for MORNING_COPIES is missing, then the first PUT statement holds the line in case EVENING_COPIES is missing a value for that observation.
- If a value for EVENING_COPIES is missing, then the next PUT statement writes a message and releases the line.
- If EVENING_COPIES does not have a missing value, but if a message has been written for MORNING_COPIES (MORNING_COPIES=.), then the null PUT statement releases the line.
- If neither EVENING_COPIES nor MORNING_COPIES has missing values, then the line is not released and no PUT statement is executed.

   options linesize=80 pagesize=60;

   data _null_;
      length state $ 15;
      infile 'your-input-file';
      input state $ morning_copies evening_copies year;
      if morning_copies=. then put
         '** Morning Tot Missing: ' year state @;
      if evening_copies=. then put
         '** Evening Tot Missing: ' year state;
      else if morning_copies=. then put;
   run;

The following output shows the results:

Output 30.5  Writing One or More Times to a Line and Releasing the Line

   7    data _null_;
   8       length state $ 15;
   9       infile 'your-input-file';
   10      input state $ morning_copies evening_copies year;
   11      if morning_copies=. then put
   12         '** Morning Tot Missing: ' year state @;
   13      if evening_copies=. then put
   14         '** Evening Tot Missing: ' year state;
   15      else if morning_copies=. then put;
   16   run;

   NOTE: The infile 'your-input-file' is:
         File Name=your-input-file,
         Owner Name=xxxxxx,Group Name=xxxx,
         Access Permission=rw-r--r--,
         File Size (bytes)=223

   ** Evening Tot Missing: 1997 Massachusetts
   ** Morning Tot Missing: 1999 Alabama
   ** Morning Tot Missing: 1998 Alabama ** Evening Tot Missing: 1998 Alabama
   ** Morning Tot Missing: 1996 Alabama
   NOTE: 7 records were read from the infile 'your-input-file'.
         The minimum record length was 30.
         The maximum record length was 31.
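The hold-and-release logic can also be tried without an external input file. The following is a minimal sketch that reads inline data instead; the variable names ID, SCORE1, and SCORE2 and the four data lines are invented for this illustration.

```sas
/* Sketch only: variable names and data values are invented for illustration. */
data _null_;
   input id score1 score2;
   /* Write a message and hold the line in case score2 is also missing. */
   if score1=. then put 'Obs ' id 'score1 missing. ' @;
   /* Write a message (on the held line, if any) and release the line.  */
   if score2=. then put 'Obs ' id 'score2 missing.';
   /* Nothing more to write: the null PUT releases the held line.       */
   else if score1=. then put;
   datalines;
1 10 20
2 . 15
3 . .
4 12 .
;
```

Under these assumptions, observation 1 writes nothing, observation 2 writes one message and is released by the null PUT statement, observation 3 writes both messages on one line, and observation 4 writes only the second message.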
Writing a Report

Writing to an Output File

By default, the PUT statement writes lines of text to the SAS log. However, the SAS log is not usually a good destination for a formal report because it also contains the source statements for the program and messages from SAS.

The simplest destination for a printed report is the SAS output file, which is the same place SAS writes output from procedures. SAS automatically defines various characteristics such as page numbers for the procedure output file, and you can take advantage of them instead of defining all the characteristics yourself.

To route lines to the procedure output file, use the FILE statement. The FILE statement has the following form:

   FILE PRINT <NOTITLES> <FOOTNOTES>;

PRINT
   is a reserved fileref that directs output that is produced by PUT statements to the same print file as the output that is produced by SAS procedures.

   Note: Be sure that the FILE statement precedes the PUT statement in the program code.

FILE statement options
   specify options that you can use to customize output. The report that is produced in this section uses the following options:

   NOTITLES
      eliminates the default title line and makes that line available for writing. By default, the procedure output file contains the title “The SAS System.” Because the report creates another title that is descriptive, you can remove the default title by specifying the NOTITLES option.
   FOOTNOTES
      controls whether currently defined footnotes are written to the report.

   Note: When you use the FILE statement to include footnotes in a report, you must use the FOOTNOTES option in the FILE statement and include a FOOTNOTE statement in your program. The FOOTNOTE statement contains the text of the footnote.

   Note: You can also remove the default title with a null TITLE statement: title;. In this case, SAS writes a line that contains only the date and page number in place of the default title, and the line is not available for writing other text.
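To see the FILE statement options working together, the following minimal sketch routes two lines to the procedure output file, suppresses the default title, and writes a footnote. The footnote text, the variable names, and the data values are assumptions made for this illustration only.

```sas
/* Sketch only: footnote text, variable names, and data are invented for illustration. */
options pagesize=30 linesize=80 pageno=1 nodate;

footnote 'Preliminary Report';      /* supplies the footnote text */

data _null_;
   input state $ copies;
   file print notitles footnotes;   /* FILE precedes PUT */
   put state copies;
   datalines;
Alabama 463.8
Alabama 583.2
;

footnote;                           /* clears the footnote for later steps */
```

The null FOOTNOTE statement at the end is a housekeeping choice so that the footnote does not carry over into later output.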
Designing the Report

After choosing a destination for your report, the next step in producing a report is to decide how you want it to look. You create the design and determine which lines and columns the text will occupy. Planning how you want your final report to look helps you write the necessary PUT statements to produce the report. The rest of the examples in this section show how to modify a program to produce a final report that resembles the one shown here.

      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7--
    1                Morning and Evening Newspaper Circulation
    2
    3       State              Year                     Thousands of Copies
    4                                                   Morning      Evening
    5
    6       Alabama            1984                       256.3        480.5
    7                          1985                       291.5        454.3
    8                          1986                       303.6        454.7
    9                          1987                          .         454.5
   10                                                    ------       ------
   11                          Total for each category    851.4       1844.0
   12                                   Combined total          2695.4
   13
   14
   15       Massachusetts      1984                          .            .
   16                          1985                          .          68.0
   17                          1986                       222.7         68.6
   18                          1987                       224.1         66.7
   19                                                     -----        -----
   20                          Total for each category    446.8        203.3
   21                                   Combined total           650.1
   22
   23
   24
   25
   26
   27
   28
   29
   30                            Preliminary Report
      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7--

Writing Data Values

After you design your report, you can begin to write the program that will create it. The following program shows how to display the data values for the YEAR, MORNING_COPIES, and EVENING_COPIES variables in specific positions.

In a PUT statement, the @ followed by a number is a pointer control, but it is different from the trailing @ described earlier. The @n argument is a column-pointer control. It tells SAS to move to column n. In this example the pointer moves to the specified locations, and the PUT statement writes values at those points using list output. Combining list output with pointer controls is a simple but useful way of writing data values in columns.
   options pagesize=30 linesize=80 pageno=1 nodate;

   data _null_;
      infile 'your-input-file';
      input state $ morning_copies evening_copies year;
      file print notitles;
      put @26 year @53 morning_copies @66 evening_copies;
   run;

The following output shows the results:

Output 30.6  Data Values in Specific Locations in the Output

                            1999                       798.4        984.7
                            1998                       834.2        793.6
                            1997                       750.3        .
                            1999                       .            698.4
                            1998                       463.8        522
                            1997                       583.2        234.9
                            1996                       .            339.6

Improving the Appearance of Numeric Data Values

In the design for your report, all numeric values are aligned on the decimal point (see Output 30.6). To achieve this result, you have to alter the appearance of the numeric data values by using SAS formats. In the input data all values for MORNING_COPIES and EVENING_COPIES contain one decimal place, except in one case where the decimal value is 0. In list output SAS writes values in the simplest way, that is, by omitting the 0s in the decimal portion of a value. In formatted output, you can show one decimal place for every value by associating a format with a variable in the PUT statement. Using a format can also align your output values.

The format that is used in the program is called the w.d format. The w.d format specifies the number of columns to be used for writing the entire value, including the decimal point. It also specifies the number of columns to be used for writing the decimal portion of each value. In this example the format 5.1 causes SAS to use five columns, including one decimal place, for writing each value. Therefore, SAS prints the 0s in the decimal portion as necessary. The format also aligns the periods that SAS uses to indicate missing values with the decimal points.
   options pagesize=30 linesize=80 pageno=1 nodate;

   data _null_;
      infile 'your-input-file';
      input state $ morning_copies evening_copies year;
      file print notitles;
      put @26 year @53 morning_copies 5.1 @66 evening_copies 5.1;
   run;

The following output shows the results:

Output 30.7 Formatted Numeric Output

                         1999                       798.4        984.7
                         1998                       834.2        793.6
                         1997                       750.3           .
                         1999                          .         698.4
                         1998                       463.8        522.0
                         1997                       583.2        234.9
                         1996                          .         339.6

Writing a Value at the Beginning of Each BY Group

The next step in creating your report is to add the name of the state to your output. If you include the name of the state in the PUT statement with other data values, then the state will appear on every line. However, remembering what you want your final report to look like, you need to write the name of the state only for the first observation of a particular state. Performing a task once for a group of observations requires the use of the BY statement for BY-group processing. The BY statement has the following form:

   BY by-variable(s) <NOTSORTED>;

The by-variable names the variable by which the data set is sorted. The optional NOTSORTED option specifies that observations with the same BY value are grouped together but are not necessarily sorted in alphabetical or numerical order.

For BY-group processing,

- ensure that observations come from a SAS data set, not an external file.
- when the data is grouped in BY groups but the groups are not necessarily in alphabetical order, use the NOTSORTED option in the BY statement. For example, use

     by state notsorted;

The following program creates a permanent SAS data set named NEWS.CIRCULATION, and writes the name of the state on the first line of the report for each BY group.
   options pagesize=30 linesize=80 pageno=1 nodate;
   libname news 'SAS-data-library';

   data news.circulation;
      length state $ 15;
      input state $ morning_copies evening_copies year;
      datalines;
   Massachusetts 798.4 984.7 1999
   Massachusetts 834.2 793.6 1998
   Massachusetts 750.3 . 1997
   Alabama . 698.4 1999
   Alabama 463.8 522.0 1998
   Alabama 583.2 234.9 1997
   Alabama . 339.6 1996
   ;

   data _null_;
      set news.circulation;
      by state notsorted;
      file print notitles;
      if first.state then put / @7 state @;
      put @26 year @53 morning_copies 5.1 @66 evening_copies 5.1;
   run;

During the first observation for a given state, a PUT statement writes the name of the state and holds the line for further writing (the year and circulation figures). The next PUT statement writes the year and circulation figures and releases the held line. In observations after the first, only the second PUT statement is processed. It writes the year and circulation figures and releases the line as usual.

The first PUT statement contains a slash (/), a pointer control that moves the pointer to the beginning of the next line. In this example, the PUT statement prepares to write on a new line (the default action). Then the slash moves the pointer to the beginning of the next line. As a result, SAS skips a line before writing the value of STATE. In the output, a blank line separates the data for Massachusetts from the data for Alabama. The output for Massachusetts also begins one line farther down the page than it would have otherwise. (That blank line is used later in the development of the report.)

The following output shows the results:

Output 30.8 Effect of BY-Group Processing

      Massachusetts      1999                       798.4        984.7
                         1998                       834.2        793.6
                         1997                       750.3           .

      Alabama            1999                          .         698.4
                         1998                       463.8        522.0
                         1997                       583.2        234.9
                         1996                          .         339.6

Calculating Totals

The next step is to calculate the total morning circulation figures, total evening circulation figures, and total overall circulation figures for each state.
Sum statements accumulate the totals, and assignment statements start the accumulation at 0 for each state. When the last observation for a given state is being processed, an assignment statement calculates the overall total, and a PUT statement writes the totals and additional descriptive text.

   options pagesize=30 linesize=80 pageno=1 nodate;
   libname news 'SAS-data-library';

   data _null_;
      set news.circulation;
      by state notsorted;
      file print notitles;

         /* Set values of accumulator variables to 0 */
         /* at beginning of each BY group.           */
      if first.state then
         do;
            morning_total=0;
            evening_total=0;
            put / @7 state @;
         end;
      put @26 year @53 morning_copies 5.1 @66 evening_copies 5.1;

         /* Accumulate separate totals for morning and */
         /* evening circulations.                      */
      morning_total+morning_copies;
      evening_total+evening_copies;

         /* Calculate total circulation at the end of */
         /* each BY group.                            */
      if last.state then
         do;
            all_totals=morning_total+evening_total;
            put @52 '------' @65 '------' /
                @26 'Total for each category'
                @52 morning_total 6.1 @65 evening_total 6.1 /
                @35 'Combined total' @59 all_totals 6.1;
         end;
   run;

The following output shows the results:

Output 30.9 Calculating and Writing Totals for Each BY Group

      Massachusetts      1999                       798.4        984.7
                         1998                       834.2        793.6
                         1997                       750.3           .
                                                   ------       ------
                         Total for each category   2382.9       1778.3
                                  Combined total          4161.2

      Alabama            1999                          .         698.4
                         1998                       463.8        522.0
                         1997                       583.2        234.9
                         1996                          .         339.6
                                                   ------       ------
                         Total for each category   1047.0       1794.9
                                  Combined total          2841.9

Notice that Sum statements ignore missing values when they accumulate totals. Also, by default, Sum statements assign the accumulator variables (in this case, MORNING_TOTAL and EVENING_TOTAL) an initial value of 0. Therefore, although the assignment statements in the DO group are executed for the first observation for both states, you need them only for the second state.
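The difference between a Sum statement and an ordinary assignment can be seen in a small sketch. The variable names SUM_TOTAL and PLUS_TOTAL and the three data values are hypothetical and are used only for illustration; the DATA step writes to the SAS log because it has no FILE statement.

```sas
/* Hypothetical three-value series; the second value is missing. */
data _null_;
   input copies;

      /* Sum statement: the accumulator is retained, starts at 0, */
      /* and a missing value of COPIES is ignored.                */
   sum_total+copies;

      /* Ordinary assignment: the variable must be retained and   */
      /* initialized explicitly, and a missing operand makes the  */
      /* result of the + operator missing.                        */
   retain plus_total 0;
   plus_total=plus_total+copies;

   put copies= sum_total= plus_total=;
   datalines;
798.4
.
750.3
;
run;
```

In this sketch SUM_TOTAL should continue to accumulate past the missing value, whereas PLUS_TOTAL becomes missing at the second observation and stays missing. That is why the report program uses Sum statements for the circulation totals.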
Writing Headings and Footnotes for a One-Page Report

The report is complete except for the title lines, column headings, and footnote. Because this is a simple, one-page report, you can write the heading with a PUT statement that is executed only during the first iteration of the DATA step. The automatic variable _N_ counts the number of times the DATA step has iterated or looped, and the PUT statement is executed when the value of _N_ is 1.

The FOOTNOTES option on the FILE statement and the FOOTNOTE statement create the footnote. The following program is complete:

   options pagesize=30 linesize=80 pageno=1 nodate;
   libname news 'SAS-data-library';

   data _null_;
      set news.circulation;
      by state notsorted;
      file print notitles footnotes;
      if _n_=1 then put @16 'Morning and Evening Newspaper Circulation' //
                        @7 'State' @26 'Year' @51 'Thousands of Copies' /
                        @51 'Morning      Evening';
      if first.state then
         do;
            morning_total=0;
            evening_total=0;
            put / @7 state @;
         end;
      put @26 year @53 morning_copies 5.1 @66 evening_copies 5.1;
      morning_total+morning_copies;
      evening_total+evening_copies;
      if last.state then
         do;
            all_totals=morning_total+evening_total;
            put @52 '------' @65 '------' /
                @26 'Total for each category'
                @52 morning_total 6.1 @65 evening_total 6.1 /
                @35 'Combined total' @59 all_totals 6.1;
         end;
      footnote 'Preliminary Report';
   run;

The following output shows the results:

Output 30.10 The Final Report

               Morning and Evening Newspaper Circulation

      State              Year                     Thousands of Copies
                                                  Morning      Evening

      Massachusetts      1999                       798.4        984.7
                         1998                       834.2        793.6
                         1997                       750.3           .
                                                   ------       ------
                         Total for each category   2382.9       1778.3
                                  Combined total          4161.2

      Alabama            1999                          .         698.4
                         1998                       463.8        522.0
                         1997                       583.2        234.9
                         1996                          .         339.6
                                                   ------       ------
                         Total for each category   1047.0       1794.9
                                  Combined total          2841.9

                               Preliminary Report

Notice that a blank line appears between the last line of the heading and the first data for Massachusetts although the PUT statement for the heading does not write a blank line. The line comes from the slash (/) in the PUT statement that writes the value of STATE in the first observation of each BY group.

Executing a PUT statement during the first iteration of the DATA step is a simple way to produce headings, especially when a report is only one page long.

Review of SAS Tools

Statements

BY variable-1 <. . . variable-n> <NOTSORTED>;
   indicates that all observations with common values of the BY variables are grouped together. The NOTSORTED option indicates that the variables are grouped but that the groups are not necessarily in alphabetical or numerical order.

DATA _NULL_;
   specifies that SAS will not create an output data set.

FILE PRINT <NOTITLES> <FOOTNOTES>;
   directs output to the SAS procedure output file. Place the FILE statement before the PUT statements that write to that file. The NOTITLES option suppresses titles that are currently in effect, and makes the lines available for writing other text. The FOOTNOTES option, along with the FOOTNOTE statement, writes a footnote to the file.

PUT;
   by default, begins a new line and releases a previously held line. A PUT statement that does not write any text is known as a null PUT statement.

PUT <variable <format>> <'character-string'>;
   writes lines to the destination that is specified in the FILE statement; if no FILE statement is present, then the PUT statement writes to the SAS log. By default, each PUT statement begins on a new line, writes what is specified, and releases the line. A DATA step can contain any number of PUT statements.
   By default, SAS writes a variable or character-string at the current position in the line.
   SAS automatically moves the pointer one column to the right after writing a variable value but not after writing a character string; that is, SAS places a blank after a variable value but not after a character string. This form of output is called list output. If you place a format after a variable name, then SAS writes the value of the variable beginning at its current position in the line and using the format that you specify. The position of the pointer after a formatted value is the following column; that is, SAS does not automatically skip a column. Using a format in a PUT statement is called formatted output. You can combine list and formatted output in a single PUT statement.

PUT <@n> <variable <format>> <'character-string'> </> <@>;
   writes lines to the destination that is specified in the FILE statement; if no FILE statement is present, then the PUT statement writes to the SAS log. The @n pointer control moves the pointer to column n in the current line. The / moves the pointer to the beginning of a new line. (You can use slashes anywhere in the PUT statement to skip lines.) Multiple slashes skip multiple lines. The trailing @, if present, must be the last item in the PUT statement. Executing a PUT statement with a trailing @ holds the current line for use by a later PUT statement either in the same iteration of the DATA step or a later iteration. Executing a PUT statement without a trailing @ releases a held line.

TITLE;
   specifies title lines for SAS output.

Learning More

Pointer controls
   For more information about pointer controls, see the PUT statement in the Statements section of SAS Language Reference: Dictionary.

Statements
   For more information about the statements that are described in this section, see SAS Language Reference: Dictionary.
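As a recap of the PUT statement forms reviewed in this section, the following sketch combines list output, formatted output, the @n column pointer control, the slash, and the trailing @ in one DATA step. The variable values are hypothetical, and the lines are written to the SAS log because no FILE statement is present.

```sas
data _null_;
      /* Hypothetical values, assigned only for illustration. */
   state='Alabama';
   year=1998;
   copies=522;

      /* A character string followed by list output. No blank is */
      /* written after the string. The trailing @ holds the line. */
   put 'State:' state @;

      /* Continues on the held line: list output at column 26,   */
      /* formatted output (5.1) at column 35. The line is then    */
      /* released because this PUT has no trailing @.             */
   put @26 year @35 copies 5.1;

      /* The slash moves the pointer to the beginning of a new line. */
   put 'First line' / 'Second line';
run;
```

The first two PUT statements should produce a single line in the log, and the third should produce two lines. Removing the trailing @ from the first PUT statement is a quick way to see how a held line differs from a released one.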
CHAPTER 31
Understanding and Customizing SAS Output: The Basics

Introduction to the Basics of Understanding and Customizing SAS Output
   Purpose
   Prerequisites
Understanding Output
   Output from Procedures
   Output from DATA Step Applications
   Output from the Output Delivery System (ODS)
Input SAS Data Set for Examples
Locating Procedure Output
Making Output Informative
   Adding Titles
   Adding Footnotes
   Labeling Variables
   Developing Descriptive Output
Controlling Output Appearance
   Specifying SAS System Options
   Numbering Pages
   Centering Output
   Specifying Page and Line Size
   Writing Date and Time Values
   Choosing Options Selectively
Controlling the Appearance of Pages
   Input Data Set for Examples of Multiple-page Reports
   Writing Centered Title and Column Headings
   Writing Titles and Column Headings in Specific Columns
   Changing a Portion of a Heading
   Controlling Page Divisions
Representing Missing Values
   Recognizing Default Values
   Customizing Output of Missing Values by Using a System Option
   Customizing Output of Missing Values by Using a Procedure
Review of SAS Tools
   Statements
   SAS System Options
Learning More

Introduction to the Basics of Understanding and Customizing SAS Output

Purpose

In this section you will learn to understand your output so that you can enhance its appearance and make it more informative. The section discusses DATA step and PROC step output.
This section describes how to enhance the appearance of your output by doing the following:

- adding titles, column headings, footnotes, and labels
- customizing headings
- changing a portion of a heading
- numbering pages and controlling page divisions
- printing date and time values
- representing missing numeric values with a character

Prerequisites

Before proceeding with this section, you should understand the concepts that are presented in the following sections:

- Chapter 2, "Introduction to DATA Step Processing," on page 19
- Chapter 30, "Writing Lines to the SAS Log or to an Output File," on page 521

Understanding Output

Output from Procedures

When you invoke a SAS procedure, SAS analyzes or processes your data. You can read a SAS data set, compute statistics, print results, or create a new data set. One of the results of executing a SAS procedure is creating procedure output. The destination of procedure output varies with the method of running SAS, the operating environment, and the options that you use. The form and content of the output varies with each procedure. Some procedures, such as the SORT procedure, do not produce printed output.

SAS has numerous procedures that you can use to process your data. For example, you can use the PRINT procedure to print a report that lists the values of each variable in your SAS data set. You can use the MEANS procedure to compute descriptive statistics for variables across all observations and within groups of observations. You can use the UNIVARIATE procedure to produce information on the distribution of numeric variables. For a graphic representation of your data, you can use the CHART procedure. Many other procedures are available through SAS.

Output from DATA Step Applications

Although output is usually generated by a procedure, you can also generate output by using a DATA step application.
Using the DATA step, you can do the following:

- create a SAS data set
- write to an external file
- produce a report

To generate output, you can use the FILE and PUT statements together within the DATA step. Use the FILE statement to identify your current output file. Then use the PUT statement to write lines that contain variable values or text strings to the output file. You can write the values in column, list, or formatted style.

You can use the FILE and PUT statements to target a subset of data. If you have a large data set that includes unnecessary information, this kind of DATA step processing can save time and computer resources. Write your code so that the FILE statement executes before a PUT statement in the current execution of a DATA step. Otherwise, your data will be written to the SAS log.

If you have a SAS data set, you can use the FILE and PUT statements to create an external file that another computer language can process. For example, you can create a SAS data set that lists the test scores for high school students. You can then write that data to an external file and use the file as input to a FORTRAN program that analyzes test scores. The following table lists the variables and the column positions that an existing FORTRAN program expects to find in its input file:

   Variable    Column location
   YEAR        10-13
   TEST        15-25
   GENDER      30
   SCORE       35-37

You can use the FILE and PUT statements in the DATA step to create the file that the FORTRAN program reads:

   data _null_;
      set out.sats1;
      file 'your-output-file';
      put @10 year @15 test @30 gender @35 score;
   run;

Output from the Output Delivery System (ODS)

Beginning with Version 7, procedure output is much more flexible because of the Output Delivery System (ODS). ODS is a method of delivering output in a variety of formats and of making the formatted output easy to access.
Important features of ODS include the following:

- ODS combines raw data with one or more table definitions to produce one or more output objects. When you send these objects to any or all ODS destinations, your output is formatted according to the instructions in the table definition. ODS destinations can produce an output data set, traditional monospace output, output that is formatted for a high-resolution printer, output that is formatted in HyperText Markup Language (HTML), and so on.
- ODS provides table definitions that define the structure of the output from procedures and from the DATA step. You can customize the output by modifying these definitions or by creating your own definitions.
- ODS provides a way for you to choose individual output objects to send to ODS destinations. For example, PROC UNIVARIATE produces five output objects. You can easily create HTML output, an output data set, traditional Listing output, or Printer output from any or all of these output objects. You can send different output objects to different destinations.
- ODS stores a link to each output object in the Results folder in the Results window.

In addition, ODS removes responsibility for formatting output from individual procedures and from the DATA step. The procedure or DATA step supplies raw data and the name of the table definition that contains the formatting instructions; then ODS formats the output. Because formatting is now centralized in ODS, the addition of a new ODS destination does not affect any procedures or the DATA step. As future destinations are added to ODS, they will automatically become available to the DATA step and to all procedures that support ODS.

For more information and examples, see Chapter 32, "Understanding and Customizing SAS Output: The Output Delivery System (ODS)," on page 565.
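The selection of individual output objects that is described above can be sketched as follows. The sketch assumes that the libref ADMIN and the data set ADMIN.SAT_SCORES from the next section are available, that Moments and Quantiles are among the output object names that PROC UNIVARIATE reports (use ODS TRACE to confirm the names in your release), and that 'your-html-file.html' is a placeholder for a file that you can write to.

```sas
   /* Write the name of each output object to the SAS log. */
ods trace on;
proc univariate data=admin.sat_scores;
   var SATscore;
run;
ods trace off;

   /* Send only two of the output objects to the open destinations. */
   /* The selection applies to the next procedure step.             */
ods select Moments Quantiles;
proc univariate data=admin.sat_scores;
   var SATscore;
run;

   /* Route the same output to an HTML file, then close the destination. */
ods html file='your-html-file.html';
proc univariate data=admin.sat_scores;
   var SATscore;
run;
ods html close;
```

The procedure code is identical in all three steps. Only the ODS statements around it change, which reflects the point that formatting and delivery are handled by ODS rather than by the procedure.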
Input SAS Data Set for Examples

The following program creates a SAS data set that contains Scholastic Aptitude Test (SAT) information for university-bound high school seniors from 1972 through 1998. (To view the entire DATA step, see "DATA Step to Create the Data Set SAT_SCORES" on page 714.) The data set in this example is stored in a SAS data library that is referenced by the libref ADMIN. For selected years between 1972 and 1998, the data set shows estimated scores that are based on the total number of students nationwide taking the test. Scores are estimated for male (m) and female (f) students, for both the verbal and math portions of the test.

   options pagesize=60 linesize=80 pageno=1 nodate;
   libname admin 'your-data-library';

   data admin.sat_scores;
      input Test $ Gender $ Year SATscore @@;
      datalines;
   Verbal m 1972 531 Verbal f 1972 529
   Verbal m 1973 523 Verbal f 1973 521
   Verbal m 1974 524 Verbal f 1974 520
   ...more SAS data lines...
   Math m 1996 527 Math f 1996 492
   Math m 1997 530 Math f 1997 494
   Math m 1998 531 Math f 1998 496
   ;

   proc print data=admin.sat_scores;
   run;

The following output shows a partial list of the results:

Output 31.1 The ADMIN.SAT_SCORES Data Set: Partial List of Output

                                 The SAS System                               1

                  Obs     Test     Gender    Year    SATscore

                    1    Verbal      m       1972       531
                    2    Verbal      f       1972       529
                    3    Verbal      m       1973       523
                    4    Verbal      f       1973       521
                    5    Verbal      m       1974       524
                    6    Verbal      f       1974       520
                    7    Verbal      m       1975       515
                    8    Verbal      f       1975       509
                    9    Verbal      m       1976       511
                   10    Verbal      f       1976       508
                   11    Verbal      m       1977       509
                   12    Verbal      f       1977       505
                   13    Verbal      m       1978       511
                   14    Verbal      f       1978       503
                   15    Verbal      m       1979       509
                   16    Verbal      f       1979       501
                   17    Verbal      m       1980       506
                   18    Verbal      f       1980       498
                   19    Verbal      m       1981       508
                   20    Verbal      f       1981       496
                   21    Verbal      m       1982       509
                   22    Verbal      f       1982       499
                   23    Verbal      m       1983       508
                   24    Verbal      f       1983       498
                   25    Verbal      m       1984       511
                   26    Verbal      f       1984       498
                   27    Verbal      m       1985       514
                   28    Verbal      f       1985       503
                   29    Verbal      m       1986       515
                   30    Verbal      f       1986       504

Locating Procedure Output

The destination of your procedure output depends on the method that you use to start, run, and exit SAS. It also depends on your operating environment and on the settings of SAS system options. The following table shows the default destination for each method of operation.

   Method of operation            Destination of procedure output
   windowing environment          OUTPUT and RESULTS windows
   interactive line mode          on the terminal display, as each step executes
   noninteractive SAS programs    depends on the operating environment
   batch jobs                     line printer or disk file

Making Output Informative

Adding Titles

At the top of each page of output, SAS automatically writes the following title:

   The SAS System

You can make output more informative by using the TITLE statement to specify your own title. A TITLE statement writes the title you specify at the top of every page. The form of the TITLE statement is:

   TITLE<n> <'text'>;

where n specifies the relative line that contains the title, and text specifies the text of the title. The value of n can be 1 to 10. If you omit n, SAS assumes a value of 1. Therefore, you can specify TITLE or TITLE1 for the first title line. By default, SAS centers a title.

To add the title 'SAT Scores by Year, 1972-1998' to your output, use the following TITLE statement:

   title 'SAT Scores by Year, 1972-1998';

The TITLE statement is a global statement. This means that within a SAS session, SAS continues to use the most recently created title until you change or eliminate it, even if you generate different output later. You can use the TITLE statement anywhere in your program.

You can specify up to ten titles per page by numbering them in ascending order. If you want to add a subtitle to your previous title, for example, the subtitle 'Separate Statistics by Test Type,' then number your titles by the order in which you want them to appear. To add a blank line between titles, skip a number as you number your TITLE statements.
Your TITLE statements now become

   title1 'SAT Scores by Year, 1972-1998';
   title3 'Separate Statistics by Test Type';

To modify a title line, you change the text in the title and resubmit your program, including all of the TITLE statements. Be aware that a TITLE statement for a given line cancels the previous TITLE statement for that line and for all lines with higher-numbered titles.

To eliminate all titles including the default title, specify

   title;

or

   title1;

The following example shows how to use multiple TITLE statements.

   options linesize=80 pagesize=60 pageno=1 nodate;
   libname admin 'SAS-data-library';

   data report;
      set admin.sat_scores;
      if year ge 1995 then output;
      title1 'SAT Scores by Year, 1995-1998';
      title3 'Separate Statistics by Test Type';
   run;

   proc print data=report;
   run;

The following output shows the results:

Output 31.2 Report Showing Multiple TITLE Statements

                         SAT Scores by Year, 1995-1998                        1

                        Separate Statistics by Test Type

                  Obs     Test     Gender    Year    SATscore

                    1    Verbal      m       1995       505
                    2    Verbal      f       1995       502
                    3    Verbal      m       1996       507
                    4    Verbal      f       1996       503
                    5    Verbal      m       1997       507
                    6    Verbal      f       1997       503
                    7    Verbal      m       1998       509
                    8    Verbal      f       1998       502
                    9    Math        m       1995       525
                   10    Math        f       1995       490
                   11    Math        m       1996       527
                   12    Math        f       1996       492
                   13    Math        m       1997       530
                   14    Math        f       1997       494
                   15    Math        m       1998       531
                   16    Math        f       1998       496

Although the TITLE statement can appear anywhere in your program, you can associate the TITLE statement with a particular procedure step by positioning it in one of the following locations:

- before the step that produces the output
- after the procedure statement but before the next DATA or RUN statement, or the next procedure

Remember that the TITLE statement applies globally until you change or eliminate it.

Adding Footnotes

The FOOTNOTE statement follows the same guidelines as the TITLE statement. The FOOTNOTE statement is a global statement.
This means that within a SAS session, SAS continues to use the most recently created footnote until you change or eliminate it, even if you generate different output later. You can use the FOOTNOTE statement anywhere in your program.

A footnote writes up to ten lines of text at the bottom of the procedure output or DATA step output. The form of the FOOTNOTE statement is:

   FOOTNOTE<n> <'text'>;

where n specifies the relative line to be occupied by the footnote, and text specifies the text of the footnote. The value of n can be 1 to 10. If you omit n, SAS assumes a value of 1.

To add the footnote '1967 and 1970 SAT scores estimated based on total number of people taking the SAT,' specify the following statements anywhere in your program:

   footnote1 '1967 and 1970 SAT scores estimated based on total number';
   footnote2 'of people taking the SAT';

You can specify up to ten lines of footnotes per page by numbering them in ascending order. When you alter the text of one footnote in a series and execute your program again, SAS changes the text of that footnote. However, if you execute your program with numbered FOOTNOTE statements, SAS eliminates all higher-numbered footnotes.

To eliminate all footnotes, specify

   footnote;

or

   footnote1;

The following example shows how to use multiple FOOTNOTE statements.
   options linesize=80 pagesize=30 pageno=1 nodate;
   libname admin 'SAS-data-library';

   data report;
      set admin.sat_scores;
      if year ge 1996 then output;
      title1 'SAT Scores by Year, 1996-1998';
      title3 'Separate Statistics by Test Type';
      footnote1 '1996 through 1998 SAT scores estimated based on total number';
      footnote2 'of people taking the SAT';
   run;

   proc print data=report;
   run;

The following output shows the results:

Output 31.3 Report Showing a Footnote

                         SAT Scores by Year, 1996-1998                        1

                        Separate Statistics by Test Type

                  Obs     Test     Gender    Year    SATscore

                    1    Verbal      m       1996       507
                    2    Verbal      f       1996       503
                    3    Verbal      m       1997       507
                    4    Verbal      f       1997       503
                    5    Verbal      m       1998       509
                    6    Verbal      f       1998       502
                    7    Math        m       1996       527
                    8    Math        f       1996       492
                    9    Math        m       1997       530
                   10    Math        f       1997       494
                   11    Math        m       1998       531
                   12    Math        f       1998       496

          1996 through 1998 SAT scores estimated based on total number
                            of people taking the SAT

Although the FOOTNOTE statement can appear anywhere in your program, you can associate the FOOTNOTE statement with a particular procedure step by positioning it at one of the following locations:

- after the RUN statement for the previous step
- after the procedure statement but before the next DATA or RUN statement, or before the next procedure

Remember that the FOOTNOTE statement applies globally until you change or eliminate it.

Labeling Variables

In procedure output, SAS automatically writes the variables with the names that you specify. However, you can designate a label for some or all of your variables by specifying a LABEL statement either in the DATA step or, with some procedures, in the PROC step of your program. Your label can be up to 256 characters long, including blanks.

For example, to describe the variable SATscore with the phrase 'SAT Score,' specify

   label SATscore='SAT Score';

If you specify the LABEL statement in the DATA step, the label is permanently stored in the data set.
If you specify the LABEL statement in the PROC step, the label is associated with the variable only for the duration of the PROC step. In either case, when a label is assigned, it is written with almost all SAS procedures. The exception is the PRINT procedure. Whether you put the LABEL statement in the DATA step or in the PROC step, with the PRINT procedure you must specify the LABEL option as follows:

   proc print data=report label;
   run;

The following example shows how to use a LABEL statement.

   options linesize=80 pagesize=30 pageno=1 nodate;
   libname admin 'SAS-data-library';

   data report;
      set admin.sat_scores;
      if year ge 1996 then output;
      label Test='Test Type'
            SATscore='SAT Score';
      title1 'SAT Scores by Year, 1996-1998';
      title3 'Separate Statistics by Test Type';
   run;

   proc print data=report label;
   run;

The following output shows the results:

Output 31.4 Variable Labels in SAS Output

                         SAT Scores by Year, 1996-1998                        1

                        Separate Statistics by Test Type

                          Test                          SAT
                  Obs     Type     Gender    Year     Score

                    1    Verbal      m       1996       507
                    2    Verbal      f       1996       503
                    3    Verbal      m       1997       507
                    4    Verbal      f       1997       503
                    5    Verbal      m       1998       509
                    6    Verbal      f       1998       502
                    7    Math        m       1996       527
                    8    Math        f       1996       492
                    9    Math        m       1997       530
                   10    Math        f       1997       494
                   11    Math        m       1998       531
                   12    Math        f       1998       496

Developing Descriptive Output

The following example incorporates the TITLE, LABEL, and FOOTNOTE statements, and produces output.
   options linesize=80 pagesize=40 pageno=1 nodate;
   libname admin 'SAS-data-library';

   proc sort data=admin.satscores;
      by gender;
   run;

   proc means data=admin.satscores maxdec=2 fw=8;
      by gender;
      label SATscore='SAT score';
      title1 'SAT Scores by Year, 1967-1976';
      title3 'Separate Statistics by Test Type';
      footnote1 '1972 and 1976 SAT scores estimated based on the';
      footnote2 'total number of people taking the SAT';
   run;

The following output shows the results:

Output 31.5 Titles, Labels, and Footnotes in SAS Output

                         SAT Scores by Year, 1967-1976                        1

                        Separate Statistics by Test Type

----------------------------------- Gender=f -----------------------------------

                              The MEANS Procedure

   Variable  Label            N        Mean     Std Dev     Minimum     Maximum
   --------------------------------------------------------------------------
   Year                       4     1975.00        2.58     1972.00     1978.00
   SATscore  SAT score        4      515.00       11.75      503.00      529.00
   --------------------------------------------------------------------------

----------------------------------- Gender=m -----------------------------------

   Variable  Label            N        Mean     Std Dev     Minimum     Maximum
   --------------------------------------------------------------------------
   Year                       4     1975.00        2.58     1972.00     1978.00
   SATscore  SAT score        4      519.25        9.95      511.00      531.00
   --------------------------------------------------------------------------

                1972 and 1976 SAT scores estimated based on the
                     total number of people taking the SAT

Controlling Output Appearance

Specifying SAS System Options

You can enhance the appearance of your output by specifying SAS system options on the OPTIONS statement. The changes that result from specifying system options remain in effect for the rest of the job, session, or SAS process, or until you issue another OPTIONS statement to change the options.
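The persistence of system options can be seen in a short sketch. It assumes the data set ADMIN.SAT_SCORES that was created earlier in this section; the specific values 64 and 40 are chosen only for illustration.

```sas
   /* Change two options. The new values stay in effect for every */
   /* later step until another OPTIONS statement changes them.    */
options linesize=64 pagesize=40;

   /* Write the current setting of one option to the SAS log. */
proc options option=linesize;
run;

   /* This step is printed with the 64-column line size set above. */
proc print data=admin.sat_scores;
run;

   /* Restore the line size used by most examples in this section. */
options linesize=80;
```

Because the OPTIONS statement is global, it is often easiest to place it at the top of a program, as the examples in this section do, and to reset any option explicitly when a later step needs a different value.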
You can specify SAS system options through the OPTIONS statement, through the OPTIONS window, at SAS invocation, at the initiation of a SAS process, and in a configuration file. Default option settings can vary among sites. To determine the settings at your site, execute the OPTIONS procedure or browse the OPTIONS window. The OPTIONS statement has the following form: OPTIONS option(s); where option specifies one or more SAS options that you want to change. Note: An OPTIONS statement can appear at any place in a SAS program, except within data lines. 4 Numbering Pages By default, SAS numbers pages of output starting with page 1. However, you can suppress page numbers with the NONUMBER system option. To suppress page numbers, specify the following OPTIONS statement: options nonumber; This option, like all SAS system options, remains in effect for the duration of your session or until you change it. Change the option by specifying options number; You can use the PAGENO= system option to specify a beginning page number for the next page of output that SAS writes. The PAGENO= option enables you to reset page numbering in the middle of a SAS session. For example, the following OPTIONS statement resets the next output page number to 5: options pageno=5; Centering Output By default, SAS centers both the output and output titles. However, you can left-align your output by specifying the following OPTIONS statement: options nocenter; The NOCENTER option remains in effect for the duration of your SAS session or until you change it. Change the option by specifying options center; Specifying Page and Line Size Procedure output is scaled automatically to fit the size of the page and line. The number of lines per page and the number of characters per line of printed output are Understanding and Customizing SAS Output: The Basics 4 Choosing Options Selectively 549 determined by the settings of the PAGESIZE= and LINESIZE= system options. 
The default settings vary from site to site and are further affected by the machine, operating environment, and method of running SAS. For example, when SAS runs in interactive mode, the PAGESIZE= option by default assumes the size of the device that you specify. You can adjust both your page size and line size by resetting the PAGESIZE= and LINESIZE= options. For example, you can specify the following OPTIONS statement:

options pagesize=40 linesize=64;

The PAGESIZE= and LINESIZE= options remain in effect for the duration of your SAS session or until you change them.

Writing Date and Time Values

By default, SAS writes at the top of your output the beginning date and time of the SAS session during which your job executed. This automatic record is especially useful when you execute a program many times. However, you can use the NODATE system option to specify that these values not appear. To do this, specify the following OPTIONS statement:

options nodate;

The NODATE option remains in effect for the duration of your SAS session or until you change it.

Choosing Options Selectively

Choose the system options that you need to meet your specifications. The following program, which uses a conditional IF-THEN statement to subset the data set, includes a number of SAS options. The OPTIONS statement specifies a line size of 64, left-aligns the output, numbers the output pages, and supplies the date that the SAS session was started.
options linesize=64 nocenter number date;
libname admin '/u/lirezn/saslearnV8';

data high_scores;
   set admin.sat_scores;
   if SATscore < 525 then delete;
run;

proc print data=high_scores;
   title 'SAT Scores: 525 and Above';
run;

The following output shows the results:

Output 31.6 Effect of System Options on SAS Output

SAT Scores: 525 and Above                                      1
                               10:59 Wednesday, October 11, 2000

Obs    Test      Gender    Year    SATscore

  1    Verbal      m       1972       531
  2    Verbal      f       1972       529
  3    Math        m       1972       527
  4    Math        m       1973       525
  5    Math        m       1995       525
  6    Math        m       1996       527
  7    Math        m       1997       530
  8    Math        m       1998       531

Controlling the Appearance of Pages

Input Data Set for Examples of Multiple-page Reports

In the sections that follow, you learn how to customize multiple-page reports. The following program creates and prints a SAS data set that contains newspaper circulation figures for morning and evening editions. Each record lists the state, morning circulation figures (in thousands), evening circulation figures (in thousands), and year that the data represents.

data circulation_figures;
   length state $ 15;
   input state $ morning_copies evening_copies year;
   datalines;
Colorado 738.6 210.2 1984
Colorado 742.2 212.3 1985
Colorado 731.7 209.7 1986
Colorado 789.2 155.9 1987
Vermont 623.4 566.1 1984
Vermont 533.1 455.9 1985
Vermont 544.2 566.7 1986
Vermont 322.3 423.8 1987
Alaska 51.0 80.7 1984
Alaska 58.7 78.3 1985
Alaska 59.8 70.9 1986
Alaska 64.3 64.6 1987
Alabama 256.3 480.5 1984
Alabama 291.5 454.3 1985
Alabama 303.6 454.7 1986
Alabama . 454.5 1987
Maine . . 1984
Maine . 68.0 1985
Maine 222.7 68.6 1986
Maine 224.1 66.7 1987
Hawaii 433.5 122.3 1984
Hawaii 455.6 245.1 1985
Hawaii 499.3 355.2 1986
Hawaii 503.2 488.6 1987
;

proc print data=circulation_figures;
run;

The following output shows the results:

Output 31.7 SAS Data Set CIRCULATION_FIGURES

                                 The SAS System                                1

                                        morning_    evening_
                 Obs    state            copies      copies     year

                   1    Colorado          738.6       210.2     1984
                   2    Colorado          742.2       212.3     1985
                   3    Colorado          731.7       209.7     1986
                   4    Colorado          789.2       155.9     1987
                   5    Vermont           623.4       566.1     1984
                   6    Vermont           533.1       455.9     1985
                   7    Vermont           544.2       566.7     1986
                   8    Vermont           322.3       423.8     1987
                   9    Alaska             51.0        80.7     1984
                  10    Alaska             58.7        78.3     1985
                  11    Alaska             59.8        70.9     1986
                  12    Alaska             64.3        64.6     1987
                  13    Alabama           256.3       480.5     1984
                  14    Alabama           291.5       454.3     1985
                  15    Alabama           303.6       454.7     1986

                                 The SAS System                                2

                                        morning_    evening_
                 Obs    state            copies      copies     year

                  16    Alabama              .        454.5     1987
                  17    Maine                .           .      1984
                  18    Maine                .         68.0     1985
                  19    Maine             222.7        68.6     1986
                  20    Maine             224.1        66.7     1987
                  21    Hawaii            433.5       122.3     1984
                  22    Hawaii            455.6       245.1     1985
                  23    Hawaii            499.3       355.2     1986
                  24    Hawaii            503.2       488.6     1987

Writing Centered Title and Column Headings

Producing centered titles with TITLE statements is easy, because centering is the default for the TITLE statement. Producing column headings is not so easy. You must insert the correct number of blanks in the TITLE statements so that the entire title, when centered, causes the text to fall in the correct columns. The following example shows how to write centered lines and column headings. The titles and column headings appear at the top of every page of output.
options linesize=80 pagesize=20 nodate;

data report1;
   infile 'your-data-file';
   input state $ morning_copies evening_copies year;
run;

title 'Morning and Evening Newspaper Circulation';
title2;
title3
   'State              Year                     Thousands of Copies';
title4
   '                                            Morning      Evening';

data _null_;
   set report1;
   by state notsorted;
   file print;
   if first.state then
      do;
         morning_total=0;
         evening_total=0;
         put / @7 state @;
      end;
   put @26 year @53 morning_copies 5.1 @66 evening_copies 5.1;
   morning_total+morning_copies;
   evening_total+evening_copies;
   if last.state then
      do;
         all_totals=morning_total+evening_total;
         put @52 '------' @65 '------' /
             @26 'Total for each category'
             @52 morning_total 6.1 @65 evening_total 6.1 /
             @35 'Combined total' @59 all_totals 6.1;
      end;
run;

The following output shows the results:

Output 31.8 Centered Lines and Column Headings in SAS Output

                   Morning and Evening Newspaper Circulation                   1

      State              Year                     Thousands of Copies
                                                  Morning      Evening

      Colorado           1984                       738.6        210.2
                         1985                       742.2        212.3
                         1986                       731.7        209.7
                         1987                       789.2        155.9
                                                   ------       ------
                         Total for each category   3001.7        788.1
                                  Combined total          3789.8

      Vermont            1984                       623.4        566.1
                         1985                       533.1        455.9
                         1986                       544.2        566.7
                         1987                       322.3        423.8
                                                   ------       ------
                         Total for each category   2023.0       2012.5
                                  Combined total          4035.5

                   Morning and Evening Newspaper Circulation                   2

      State              Year                     Thousands of Copies
                                                  Morning      Evening

      Alaska             1984                        51.0         80.7
                         1985                        58.7         78.3
                         1986                        59.8         70.9
                         1987                        64.3         64.6
                                                   ------       ------
                         Total for each category    233.8        294.5
                                  Combined total           528.3

      Alabama            1984                       256.3        480.5
                         1985                       291.5        454.3
                         1986                       303.6        454.7
                         1987                          .         454.5
                                                   ------       ------
                         Total for each category    851.4       1844.0
                                  Combined total          2695.4

                   Morning and Evening Newspaper Circulation                   3

      State              Year                     Thousands of Copies
                                                  Morning      Evening

      Maine              1984                          .            .
                         1985                          .          68.0
                         1986                       222.7         68.6
                         1987                       224.1         66.7
                                                   ------       ------
                         Total for each category    446.8        203.3
                                  Combined total           650.1

      Hawaii             1984                       433.5        122.3
                         1985                       455.6        245.1
                         1986                       499.3        355.2
                         1987                       503.2        488.6
                                                   ------       ------
                         Total for each category   1891.6       1211.2
                                  Combined total          3102.8

When you create titles and column headings with TITLE statements, consider the following:

- SAS writes page numbers on title lines by default. Therefore, page numbers appear in this report. If you do not want page numbers, specify the NONUMBER system option.
- The PUT statement pointer begins on the first line after the last TITLE statement. SAS does not skip a line before beginning the text as it does with procedure output. In this example, the blank line between the TITLE4 statement and the first line of data for each state is produced by the slash (/) in the PUT statement in the FIRST.STATE group.

Writing Titles and Column Headings in Specific Columns

The easiest way to program headings in specific columns is to use a PUT statement. Instead of calculating the exact number of blanks that are required to make text fall in particular columns, you move the pointer to the appropriate column with pointer controls and write the text. To write headings with a PUT statement, you must execute the PUT statement at the beginning of each page, regardless of the observation that is being processed or the iteration of the DATA step. The FILE statement with the HEADER= option specifies the headings you want to write.

Use the following form of the FILE statement to specify column headings.

FILE PRINT HEADER=label;

PRINT is a reserved fileref that directs output that is produced by any PUT statements to the same print file as the output that is produced by SAS procedures.
The label argument specifies a statement label that identifies a group of SAS statements that execute each time SAS begins a new output page.

The following program uses the HEADER= option of the FILE statement to add a header routine to the DATA step. The routine uses pointer controls in the PUT statement to write the title, skip two lines, and then write column headings in specific locations.

options linesize=80 pagesize=24;

data _null_;
   set circulation_figures;
   by state notsorted;
   file print notitles header=pagetop;                      /* 1 */
   if first.state then
      do;
         morning_total=0;
         evening_total=0;
         put / @7 state @;
      end;
   put @26 year @53 morning_copies 5.1 @66 evening_copies 5.1;
   morning_total+morning_copies;
   evening_total+evening_copies;
   if last.state then
      do;
         all_totals=morning_total+evening_total;
         put @52 '------' @65 '------' /
             @26 'Total for each category'
             @52 morning_total 6.1 @65 evening_total 6.1 /
             @35 'Combined total' @59 all_totals 6.1;
      end;
   return;                                                  /* 2 */

   pagetop:                                                 /* 3 */
      put @16 'Morning and Evening Newspaper Circulation' //
          @7 'State' @26 'Year' @51 'Thousands of Copies' /
          @51 'Morning      Evening';
      return;                                               /* 4 */
run;

The following list corresponds to the numbered items in the preceding program:

1 The PRINT fileref in the FILE statement creates Listing output. The NOTITLES option eliminates title lines so that the lines can be used by the PUT statement. The HEADER= option defines a statement label that points to a group of SAS statements that executes each time SAS begins a new output page. (You can use the HEADER= option only for creating print files.)

2 The RETURN statement that is located before the header routine marks the end of the main part of the DATA step. It causes execution to return to the beginning of the step for another iteration. Without this RETURN statement, the statements in the header routine would be executed during each iteration of the DATA step, as well as at the beginning of each page.

3 The pagetop: label identifies the header routine. Each time SAS begins a new page, execution moves from its current position to the label pagetop: and continues until SAS encounters the RETURN statement. When execution reaches the RETURN statement at the end of the header routine, execution returns to the statement that was being executed when SAS began a new page.

4 The RETURN statement ends the header routine. Execution returns to the statement that was being executed when SAS began a new page.

The following output shows the results:

Output 31.9 Title and Column Headings in Specific Locations

               Morning and Evening Newspaper Circulation

      State              Year                     Thousands of Copies
                                                  Morning      Evening

      Colorado           1984                       738.6        210.2
                         1985                       742.2        212.3
                         1986                       731.7        209.7
                         1987                       789.2        155.9
                                                   ------       ------
                         Total for each category   3001.7        788.1
                                  Combined total          3789.8

      Vermont            1984                       623.4        566.1
                         1985                       533.1        455.9
                         1986                       544.2        566.7
                         1987                       322.3        423.8
                                                   ------       ------
                         Total for each category   2023.0       2012.5
                                  Combined total          4035.5

      Alaska             1984                        51.0         80.7
                         1985                        58.7         78.3
                         1986                        59.8         70.9

               Morning and Evening Newspaper Circulation

      State              Year                     Thousands of Copies
                                                  Morning      Evening
                         1987                        64.3         64.6
                                                   ------       ------
                         Total for each category    233.8        294.5
                                  Combined total           528.3

      Alabama            1984                       256.3        480.5
                         1985                       291.5        454.3
                         1986                       303.6        454.7
                         1987                          .         454.5
                                                   ------       ------
                         Total for each category    851.4       1844.0
                                  Combined total          2695.4

      Maine              1984                          .            .
                         1985                          .          68.0
                         1986                       222.7         68.6
                         1987                       224.1         66.7
                                                   ------       ------
                         Total for each category    446.8        203.3
                                  Combined total           650.1

Changing a Portion of a Heading

You can use variable values to create headings that change on every page. For example, if you eliminate the default page numbers in the procedure output file, you can create your own page numbers as part of the heading.
You can also write the numbers differently from the default method. For example, you can write "Page 1" rather than "1." Page numbers are an example of a heading that changes with each new page.

The following program creates page numbers using a Sum statement and writes the numbers as part of the header routine.

options linesize=80 pagesize=24;

data _null_;
   set circulation_figures;
   by state notsorted;
   file print notitles header=pagetop;
   if first.state then
      do;
         morning_total=0;
         evening_total=0;
         put / @7 state @;
      end;
   put @26 year @53 morning_copies 5.1 @66 evening_copies 5.1;
   morning_total+morning_copies;
   evening_total+evening_copies;
   if last.state then
      do;
         all_totals=morning_total+evening_total;
         put @52 '------' @65 '------' /
             @26 'Total for each category'
             @52 morning_total 6.1 @65 evening_total 6.1 /
             @35 'Combined total' @59 all_totals 6.1;
      end;
   return;

   pagetop:
      pagenum+1;                                            /* 1 */
      put @16 'Morning and Evening Newspaper Circulation'
          @67 'Page ' pagenum //                            /* 2 */
          @7 'State' @26 'Year' @51 'Thousands of Copies' /
          @51 'Morning      Evening';
      return;
run;

The following list corresponds to the numbered items in the preceding program:

1 In this Sum statement, SAS adds the value 1 to the accumulator variable PAGENUM each time a new page begins.

2 The literal Page and the current page number print at the top of each new page.
The following output shows the results:

Output 31.10 Changing a Portion of a Heading

               Morning and Evening Newspaper Circulation          Page 1

      State              Year                     Thousands of Copies
                                                  Morning      Evening

      Colorado           1984                       738.6        210.2
                         1985                       742.2        212.3
                         1986                       731.7        209.7
                         1987                       789.2        155.9
                                                   ------       ------
                         Total for each category   3001.7        788.1
                                  Combined total          3789.8

      Vermont            1984                       623.4        566.1
                         1985                       533.1        455.9
                         1986                       544.2        566.7
                         1987                       322.3        423.8
                                                   ------       ------
                         Total for each category   2023.0       2012.5
                                  Combined total          4035.5

      Alaska             1984                        51.0         80.7
                         1985                        58.7         78.3
                         1986                        59.8         70.9

               Morning and Evening Newspaper Circulation          Page 2

      State              Year                     Thousands of Copies
                                                  Morning      Evening
                         1987                        64.3         64.6
                                                   ------       ------
                         Total for each category    233.8        294.5
                                  Combined total           528.3

      Alabama            1984                       256.3        480.5
                         1985                       291.5        454.3
                         1986                       303.6        454.7
                         1987                          .         454.5
                                                   ------       ------
                         Total for each category    851.4       1844.0
                                  Combined total          2695.4

      Maine              1984                          .            .
                         1985                          .          68.0
                         1986                       222.7         68.6
                         1987                       224.1         66.7
                                                   ------       ------
                         Total for each category    446.8        203.3
                                  Combined total           650.1

Controlling Page Divisions

The report in Output 31.10 automatically split the data for Alaska over two pages. To make attractive page divisions, you need to know that there is sufficient space on a page to print all the data for a particular state before you print any data for it.

First, you must know how many lines are needed to print a group of data. Then you use the LINESLEFT= option in the FILE statement to create a variable whose value is the number of lines remaining on the current page. Before you begin writing a group of data, compare the number of lines that you need to the value of that variable. If more lines are required than are available, use the _PAGE_ pointer control to advance the pointer to the first line of a new page.
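The check described above can be sketched in isolation as follows. This is a minimal illustration, not part of the circulation report: the variable name REMAINING, the group size of five lines, and the loop that generates the lines are all assumptions made for the example.

```sas
options pagesize=24;

data _null_;
   /* REMAINING is a hypothetical variable name. LINESLEFT= keeps it   */
   /* equal to the number of lines left on the current output page.    */
   file print notitles linesleft=remaining;
   do group=1 to 6;
      /* Each group needs 5 lines. If fewer remain, start a new page   */
      /* before writing any part of the group.                         */
      if remaining < 5 then put _page_;
      do line=1 to 5;
         put 'Group ' group 'line ' line;
      end;
   end;
run;
```

With a page size of 24, four groups fit on the first page; the fifth group would otherwise be split, so the test sends it to the top of a new page.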
In your report, the maximum number of lines that you need for any state is eight (four years of circulation data for each state plus four lines for the underline, the totals, and the blank line between states). The following program creates a variable named CKLINES and compares its value to eight at the beginning of each BY group. If the value is less than eight, SAS begins a new page before writing that state.

options pagesize=24;

data _null_;
   set circulation_figures;
   by state notsorted;
   file print notitles header=pagetop linesleft=cklines;
   if first.state then
      do;
         morning_total=0;
         evening_total=0;
         if cklines<8 then put _page_;
         put / @7 state @;
      end;
   put @26 year @53 morning_copies 5.1 @66 evening_copies 5.1;
   morning_total+morning_copies;
   evening_total+evening_copies;
   if last.state then
      do;
         all_totals=morning_total+evening_total;
         put @52 '------' @65 '------' /
             @26 'Total for each category'
             @52 morning_total 6.1 @65 evening_total 6.1 /
             @35 'Combined total' @59 all_totals 6.1;
      end;
   return;

   pagetop:
      pagenum+1;
      put @16 'Morning and Evening Newspaper Circulation'
          @67 'Page ' pagenum //
          @7 'State' @26 'Year' @51 'Thousands of Copies' /
          @51 'Morning      Evening';
      return;
run;

The following output shows the results:

Output 31.11 Output with Specific Page Divisions

               Morning and Evening Newspaper Circulation          Page 1

      State              Year                     Thousands of Copies
                                                  Morning      Evening

      Colorado           1984                       738.6        210.2
                         1985                       742.2        212.3
                         1986                       731.7        209.7
                         1987                       789.2        155.9
                                                   ------       ------
                         Total for each category   3001.7        788.1
                                  Combined total          3789.8

      Vermont            1984                       623.4        566.1
                         1985                       533.1        455.9
                         1986                       544.2        566.7
                         1987                       322.3        423.8
                                                   ------       ------
                         Total for each category   2023.0       2012.5
                                  Combined total          4035.5

               Morning and Evening Newspaper Circulation          Page 2

      State              Year                     Thousands of Copies
                                                  Morning      Evening

      Alaska             1984                        51.0         80.7
                         1985                        58.7         78.3
                         1986                        59.8         70.9
                         1987                        64.3         64.6
                                                   ------       ------
                         Total for each category    233.8        294.5
                                  Combined total           528.3

      Alabama            1984                       256.3        480.5
                         1985                       291.5        454.3
                         1986                       303.6        454.7
                         1987                          .         454.5
                                                   ------       ------
                         Total for each category    851.4       1844.0
                                  Combined total          2695.4

               Morning and Evening Newspaper Circulation          Page 3

      State              Year                     Thousands of Copies
                                                  Morning      Evening

      Maine              1984                          .            .
                         1985                          .          68.0
                         1986                       222.7         68.6
                         1987                       224.1         66.7
                                                   ------       ------
                         Total for each category    446.8        203.3
                                  Combined total           650.1

Representing Missing Values

Recognizing Default Values

In the following example, numeric data for male verbal and math scores is missing for 1972. Character data for gender is missing for math scores in 1975. By default, SAS replaces a missing numeric value with a period, and a missing character value with a blank when it creates the data set.

options pagesize=60 linesize=80 pageno=1 nodate;
libname admin 'SAS-data-library';

data admin.sat_scores2;
   input Test $ 1-8 Gender $ 10 Year 12-15 SATscore 17-19;
   datalines;
verbal   m 1972 .
verbal   f 1972 529
verbal   m 1975 515
verbal   f 1975 509
math     m 1972 .
math     f 1972 489
math       1975 518
math       1975 479
;
run;

proc print data=admin.sat_scores2;
   title 'SAT Scores for Years 1972 and 1975';
run;

The following output shows the results:

Output 31.12 Default Display of Missing Values

                       SAT Scores for Years 1972 and 1975                      1

                 Obs    Test      Gender    Year    SATscore

                  1     verbal      m       1972         .
                  2     verbal      f       1972       529
                  3     verbal      m       1975       515
                  4     verbal      f       1975       509
                  5     math        m       1972         .
                  6     math        f       1972       489
                  7     math                1975       518
                  8     math                1975       479

Customizing Output of Missing Values by Using a System Option

If your data set contains missing numeric values, you can use the MISSING= system option to display the missing values as a single character rather than as the default period. You specify the character you want to use as the value of the MISSING= option. You can specify any single character.
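Because MISSING= is a system option, it stays in effect until you change it. The following sketch, which assumes that the ADMIN libref is already assigned and that ADMIN.SAT_SCORES2 exists as created above, sets a hyphen as the display character (an arbitrary choice for illustration) and then restores the default period.

```sas
/* Display numeric missing values as a hyphen for this report only. */
options missing='-';

proc print data=admin.sat_scores2;
   title 'SAT Scores for Years 1972 and 1975';
run;

/* Restore the default so that later output shows a period again. */
options missing='.';
```

Restoring the default at the end keeps the change from affecting other programs that run later in the same session.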
In the following program, the MISSING= option in the OPTIONS statement causes the PRINT procedure to display the letter M, rather than a period, for each numeric missing value.

options missing='M' pageno=1;
libname admin 'SAS-data-library';

data admin.sat_scores2;
   input Test $ 1-8 Gender $ 10 Year 12-15 SATscore 17-19;
   datalines;
verbal   m 1972
verbal   f 1972 529
verbal   m 1975 515
verbal   f 1975 509
math     m 1972
math     f 1972 489
math       1975 518
math       1975 479
;

proc print data=admin.sat_scores2;
   title 'SAT Scores for Years 1972 and 1975';
run;

The following output shows the results:

Output 31.13 Customized Output of Missing Numeric Values

                       SAT Scores for Years 1972 and 1975                      1

                 Obs    Test      Gender    Year    SATscore

                  1     verbal      m       1972         M
                  2     verbal      f       1972       529
                  3     verbal      m       1975       515
                  4     verbal      f       1975       509
                  5     math        m       1972         M
                  6     math        f       1972       489
                  7     math                1975       518
                  8     math                1975       479

Customizing Output of Missing Values by Using a Procedure

Using the FORMAT procedure is another way to represent missing numeric values. It enables you to customize missing values by formatting them. You first use the FORMAT procedure to define a format, and then use a FORMAT statement in a PROC or DATA step to associate the format with a variable.

The following program uses the FORMAT procedure to define a format, and then uses a FORMAT statement in the PROC step to associate the format with the variable SATscore. Note that you do not follow the format name with a period in the VALUE statement, but a period always accompanies the format name when you use it in a FORMAT statement.
options pageno=1;
libname admin 'SAS-data-library';

proc format;
   value xscore .='score unavailable';
run;

proc print data=admin.sat_scores2;
   format SATscore xscore.;
   title 'SAT Scores for Years 1972 and 1975';
run;

The following output shows the results:

Output 31.14 Numeric Missing Values Replaced by a Format

                       SAT Scores for Years 1972 and 1975                      1

             Obs    Test      Gender    Year    SATscore

              1     verbal      m       1972    score unavailable
              2     verbal      f       1972                  529
              3     verbal      m       1975                  515
              4     verbal      f       1975                  509
              5     math        m       1972    score unavailable
              6     math        f       1972                  489
              7     math                1975                  518
              8     math                1975                  479

Review of SAS Tools

Statements

FILE file-specification;
   identifies an external file that the DATA step uses to write output from a PUT statement.

FILE PRINT <HEADER=label> <LINESLEFT=variable>;
   directs the output that is produced by any PUT statements to the same print file as the output that is produced by SAS procedures. The HEADER= option defines a statement label that identifies a group of SAS statements that you want to execute each time SAS begins a new output page. The LINESLEFT= option defines a variable whose value is the number of lines left on the current page.

FOOTNOTE<n> <'text'>;
   specifies up to ten footnote lines to be printed at the bottom of a page of output. The variable n specifies the relative line to be occupied by the footnote, and text specifies the text of the footnote.

LABEL variable='label';
   associates the variable that you specify with the descriptive text that you specify as the label. Your label can be up to 256 characters long, including blanks. You can use the LABEL statement in either the DATA step or the PROC step.

OPTIONS option(s);
   changes the value of one or more SAS system options.

TITLE<n> <'text'>;
   specifies up to ten title lines to be printed on each page of the procedure output file and other SAS output. The variable n specifies the relative line that contains the title line, and text specifies the text of the title.
SAS System Options

NUMBER|NONUMBER
   controls whether the page number prints on the first title line of each page of output.

PAGENO=n
   resets the page number for the next page of output.

CENTER|NOCENTER
   controls whether SAS procedure output is centered.

PAGESIZE=n
   specifies the number of lines that can be printed per page of output.

LINESIZE=n
   specifies the printer line width for the SAS log and the standard procedure output file used by the DATA step and procedures.

DATE|NODATE
   controls whether the date and time are printed at the top of each page of the SAS log, the standard print file, or any file with the PRINT attribute.

MISSING='character'
   specifies the character to be printed for missing numeric variable values.

Learning More

SAS output
- Chapter 30, "Writing Lines to the SAS Log or to an Output File," on page 521
- Chapter 32, "Understanding and Customizing SAS Output: The Output Delivery System (ODS)," on page 565

CHAPTER 32
Understanding and Customizing SAS Output: The Output Delivery System (ODS)

Introduction to Customizing SAS Output by Using the Output Delivery System
Purpose
Prerequisites
Input Data Set for Examples
Understanding ODS Output Formats and Destinations
Selecting an Output Format
Creating Formatted Output
Creating HTML Output for a Web Browser
Understanding the Four Types of HTML Output Files
Creating HTML Output: The Simplest Case
Creating HTML Output: Linking Results with a Table of Contents
Creating PostScript Output for a High-Resolution Printer
Creating RTF Output for Microsoft Word
Selecting the Output That You Want to Format
Identifying Output
Selecting and Excluding Program Output
Creating a SAS Data Set
Customizing ODS Output
Customizing ODS Output at the Level of a SAS Job
Customizing ODS Output by Using a Template
Storing Links to ODS Output
Review of SAS Tools
ODS Statements
Procedures
Learning More

Introduction to Customizing SAS Output by Using the Output Delivery System

Purpose

The Output Delivery System (ODS) enables you to produce output in a variety of formats, such as:

- an HTML file
- a traditional SAS Listing
- a PostScript file
- an RTF file (for use with Microsoft Word)
- an output data set

In this chapter, you will learn how to create ODS output for the formats that are listed above.

Prerequisites

Before using this chapter, you should be familiar with the concepts presented in:

- Chapter 1, "What Is the SAS System?," on page 3
- Chapter 23, "Directing SAS Output and the SAS Log," on page 349

You should also be familiar with DATA step processing and with creating procedure output.

Input Data Set for Examples

The examples in this chapter are based on data from a college entrance exam called the Scholastic Aptitude Test, or SAT. The data is provided in one input file that contains the average SAT scores of students who entered the university from 1972 to 1998. The input file has the following structure:

Verbal m 1972 531
Verbal f 1972 529
Verbal m 1973 523
Verbal f 1973 521
Math   m 1972 527
Math   f 1972 489
Math   m 1973 525
Math   f 1973 489

The input file contains the following kinds of values:

- type of SAT test
- gender of the student
- year the test was given
- average test score of the entering first-year college class

The following program creates the data set that this chapter uses. (For a complete listing of the input data, see "Data Set SAT_SCORES" on page 714.)

data sat_scores;
   input Test $ Gender $ Year SATscore @@;
   datalines;
Verbal m 1972 531 Verbal f 1972 529
Verbal m 1973 523 Verbal f 1973 521
Verbal m 1974 524 Verbal f 1974 520
...more data lines...
Math m 1996 527 Math f 1996 492
Math m 1997 530 Math f 1997 494
Math m 1998 531 Math f 1998 496
;

Note: The examples use file names that may not be valid in all operating environments.
For information about how your operating environment uses file specifications, see the documentation for your operating environment.

Understanding ODS Output Formats and Destinations

The Output Delivery System (ODS) enables you to produce output in a variety of formats that you can easily access. ODS removes responsibility for formatting output from individual procedures and from the DATA step. The procedure or DATA step supplies the data and the table definition, which contains formatting instructions for the output.

The following figure illustrates the concept of output for SAS Version 8. The data and the table definition form an output object, which creates the type of ODS output that you specified in the table definition.

Figure 32.1 Model of the Production of ODS Output

[Figure: the data and the table definition (formatting instructions) combine to form an output object. ODS sends the output object to the ODS destinations (RTF, Output, Listing, HTML, and Printer), which produce the ODS output: RTF output, SAS data sets, Listing output, HTML output, and high-resolution printer output.]

The following definitions describe the terms in the preceding figure:

data
   Each procedure that supports ODS and each DATA step produces data, which contains the results (numbers and characters) of the step in a form similar to a SAS data set.

table definition
   The table definition is a set of instructions that describes how to format the data. This description includes but is not limited to the following items:
   - the order of the columns
   - text and order of column headings
   - formats for data
   - font sizes and font faces

output object
   ODS combines formatting instructions with the data to produce an output object. The output object, therefore, contains both the results of the procedure or DATA step and information about how to format the results.
   An output object has a name, a label, and a path.

   Note: Although many output objects include formatting instructions, not all of them do. In some cases the output object consists of only the data.

ODS destinations
   An ODS destination specifies a particular type of output. ODS supports a number of destinations, including the following:

   RTF
      produces output that is formatted for use with Microsoft Word.
   Output
      produces a SAS data set.
   Listing
      produces traditional SAS output (monospace format).
   HTML
      produces output that is formatted in Hypertext Markup Language (HTML). You can access the output on the Web with your Web browser.
   Printer
      produces output that is formatted for a high-resolution printer. An example of this type of output is a PostScript file.

ODS output
   ODS output consists of formatted output from any of the ODS destinations.

For detailed information about ODS, see SAS Output Delivery System: User's Guide.

Selecting an Output Format

You select the format for your output by opening and closing ODS destinations in your program. When one or more destinations are open, ODS can send output objects to them and produce formatted output. When a destination is closed, ODS does not send an output object to it and no output is produced.

By default, all programs automatically produce Listing output along with output for other destinations that you specifically open. Therefore, by default, the Listing destination is open, and all other destinations are closed.

To create formatted output, open one or more destinations by using the following ODS statements:

ODS HTML file-specification(s);
ODS OUTPUT data-set-definition;
ODS PRINTER file-specification;
ODS RTF file-specification;

The argument file-specification opens the destination and specifies one or more files to write to. The argument data-set-definition opens the Output destination and enables SAS to create a data set from an output object.
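As a brief sketch of the Output destination, the following program asks ODS to save the MEANS procedure's results as a data set. It assumes that the SAT_SCORES data set from this chapter exists in the WORK library; the data set name WORK.SAT_SUMMARY is hypothetical. Summary is the name of the output object that PROC MEANS creates; you can confirm object names in your own session with the ODS TRACE ON statement.

```sas
/* Open the Output destination and request a data set from the      */
/* output object named Summary. WORK.SAT_SUMMARY is a made-up name. */
ods output Summary=work.sat_summary;

proc means data=sat_scores;
   var SATscore;
   class Test Gender;
run;

/* Close the Output destination once the data set has been created. */
ods output close;

/* Inspect the data set that ODS built from the output object. */
proc print data=work.sat_summary;
run;
```

Because the Listing destination is open by default, the MEANS results also appear as ordinary Listing output.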
To view or print the ODS output that you have selected, you need to close all the destinations that you opened, except for the Listing destination. You can use separate statements to close individual destinations, or use one statement to close all destinations (including the Listing destination). To close ODS destinations, use the following statements:

ODS HTML CLOSE;
ODS OUTPUT CLOSE;
ODS PRINTER CLOSE;
ODS RTF CLOSE;
ODS _ALL_ CLOSE;

Note: The ODS _ALL_ CLOSE statement, which closes all open destinations, is available with SAS Release 8.2 and higher.

In some cases you might not want to create Listing output. Use the ODS LISTING CLOSE; statement at the beginning of your program to close the Listing destination and prevent SAS from producing Listing output. Closing unnecessary destinations conserves system resources.

Note: Because ODS statements are global statements, it is good practice to open the Listing destination at the end of your program. If you execute other programs in your current SAS session, Listing output is then available. To open the Listing destination, use the ODS LISTING; statement at the end of your program.

Creating Formatted Output

Creating HTML Output for a Web Browser

Understanding the Four Types of HTML Output Files

When you use the ODS HTML statement, you can create output that is formatted in HTML. You can browse the output files with Internet Explorer, Netscape, or any other browser that fully supports the HTML 3.2 tag set.

The ODS HTML statement can create four types of HTML files:

- a body file that contains the results of the DATA step or procedure
- a table of contents that links to items in the body file
- a table of pages that links to items in the body file
- a frame file that displays the results of the procedure or DATA step, the table of contents, and the table of pages

The body file is required with all ODS HTML output.
If you do not want to link to your output, then creating a table of contents, a table of pages, and a frame file is not necessary.

Creating HTML Output: The Simplest Case

To produce the simplest kind of HTML output, the only file you need to create is a body file. The following example executes the MEANS procedure and creates an HTML body file and the default Listing file. These files contain summary statistics for the average SAT scores of entering first-year college students. The output is grouped by the CLASS variables Test and Gender.

   options pageno=1 nodate pagesize=30 linesize=78;

   ods html file='summary-results.htm';     /* 1 */

   proc means data=sat_scores fw=8;         /* 2 */
      var SATscore;
      class Test Gender;
      title1 'Average SAT Scores Entering College Classes, 1972-1998*';
      footnote1 '* Recentered Scale for 1987-1995';
   run;

   ods html close;                          /* 3 */

The following list corresponds to the numbered items in the preceding program:

1. The ODS HTML statement opens the HTML destination and creates the body file SUMMARY-RESULTS.HTM.
2. The MEANS procedure produces summary statistics for the average SAT scores of entering first-year college students. The output is grouped by the CLASS variables Test and Gender.
3. The ODS HTML CLOSE statement closes the HTML destination to make output available for viewing.
The following output shows the results in HTML format:

Display 32.1 ODS Output: HTML Format

The following output shows the results in the Listing format:

Output 32.1 ODS Output: Listing Format

        Average SAT Scores Entering College Classes, 1972-1998*         1

                           The MEANS Procedure

                      Analysis Variable : SATscore

                      N
   Test     Gender  Obs    N     Mean   Std Dev   Minimum   Maximum
   ------------------------------------------------------------------
   Math     f        27   27    481.8    7.0057     473.0     496.0
            m        27   27    521.6    4.3175     515.0     531.0
   Verbal   f        27   27    503.0    8.2671     495.0     529.0
            m        27   27    510.5    6.7218     501.0     531.0
   ------------------------------------------------------------------

                     * Recentered Scale for 1987-1995

Creating HTML Output: Linking Results with a Table of Contents

The ODS HTML destination enables you to link to your results from a table of contents and a table of pages. To do this, you need to create the following HTML files: a body file, a frame file, a table of contents, and a table of pages (see “Understanding the Four Types of HTML Output Files” on page 569). When you view the frame file and select a link in the table of contents or the table of pages, the HTML table that contains the selected part of the procedure results appears at the top of your browser.

The following example creates multiple pages of output from the UNIVARIATE procedure. You can access specific output results (tables) from links in the table of contents or the table of pages. The results contain statistics for the average SAT scores of entering first-year college classes. The output is grouped by the value of Gender in the CLASS statement and by the value of Test in the BY statement.
   proc sort data=sat_scores out=sorted_scores;
      by Test;
   run;

   options pageno=1 nodate;

   ods listing close;                         /* 1 */
   ods html file='odshtml-body.htm'           /* 2 */
            contents='odshtml-contents.htm'
            page='odshtml-page.htm'
            frame='odshtml-frame.htm';

   proc univariate data=sorted_scores;        /* 3 */
      var SATscore;
      class Gender;
      by Test;
      title1 'Average SAT Scores Entering College Classes, 1972-1998*';
      footnote1 '* Recentered Scale for 1987-1995';
   run;

   ods html close;                            /* 4 */
   ods listing;                               /* 5 */

The following list corresponds to the numbered items in the preceding program:

1. By default, the Listing destination is open. To conserve resources, the ODS LISTING CLOSE statement closes this destination.
2. The ODS HTML statement opens the HTML destination and creates four types of files:
   - the body file (created with the FILE= option), which contains the formatted data
   - the contents file, which is a table of contents with links to items in the body file
   - the page file, which is a table of pages with links to items in the body file
   - the frame file, which displays the table of contents, the table of pages, and the body file
3. The UNIVARIATE procedure produces statistics for the average SAT scores of entering first-year college students. The output is grouped by the value of Gender in the CLASS statement and the value of Test in the BY statement.
4. The ODS HTML CLOSE statement closes the HTML destination to make output available for viewing.
5. The ODS LISTING statement reopens the Listing destination so that the next program that you run can produce Listing output.
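The four files in the preceding program are written to the current working location. As a sketch of one way to keep them together elsewhere, the ODS HTML statement also accepts a PATH= option that names a directory for all of the files. The directory name below is a hypothetical placeholder and must refer to a directory that already exists. The URL=NONE suboption is used here on the assumption that the links between the files should contain only the file names, so that the set of files can be moved as a unit.

```sas
ods listing close;

/* 'hypothetical-directory' is a placeholder, not a real path. */
ods html path='hypothetical-directory' (url=none)
         file='odshtml-body.htm'
         contents='odshtml-contents.htm'
         page='odshtml-page.htm'
         frame='odshtml-frame.htm';

proc univariate data=sorted_scores;
   var SATscore;
   class Gender;
   by Test;
run;

ods html close;
ods listing;
```

To browse the linked results, open the frame file from that directory, as in the original example.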
The following SAS log shows that four HTML files are created with the ODS HTML statement:

Output 32.2 Partial SAS Log: HTML File Creation

   489  ods listing close;
   490  ods html file='odshtml-body.htm'
   491           contents='odshtml-contents.htm'
   492           page='odshtml-page.htm'
   493           frame='odshtml-frame.htm';
   NOTE: Writing HTML Body file: odshtml-body.htm
   NOTE: Writing HTML Contents file: odshtml-contents.htm
   NOTE: Writing HTML Pages file: odshtml-page.htm
   NOTE: Writing HTML Frames file: odshtml-frame.htm
   494
   495  proc univariate data=sorted_scores;
   496     var SATscore;
   497     class Gender;
   498     by Test;
   499     title1 'Average SAT Scores Entering College Classes, 1972-1998*';
   500     footnote1 '* Recentered Scale for 1987-1995';
   501  run;

The following output shows the frame file, which displays the table of contents (upper left side), the table of pages (lower left side), and the body file (right side).

Display 32.2 View of the HTML Frame File

Both the Table of Contents and the Table of Pages contain links to the results in the body file. If you click a link in the Table of Contents or the Table of Pages, the corresponding results appear at the top of the browser.

Creating PostScript Output for a High-Resolution Printer

You can create output that is formatted for a high-resolution printer if you open the Printer destination. Before you can access the file, however, you must close the Printer destination.

The following example executes the MEANS procedure and creates a PostScript file that contains summary statistics for the average SAT scores of entering first-year college students. The output is grouped by the value of Gender in the CLASS statement and the value of Test in the BY statement.
   proc sort data=sat_scores out=sorted_scores;
      by Test;
   run;

   options pageno=1 nodate;

   ods listing close;                             /* 1 */
   ods printer ps file='odsprinter_output.ps';    /* 2 */

   proc means data=sorted_scores fw=8;            /* 3 */
      var SATscore;
      class Gender;
      by Test;
      title1 'Average SAT Scores Entering College Classes, 1972-1998*';
      footnote1 '* Recentered Scale for 1987-1995';
   run;

   ods printer close;                             /* 4 */
   ods listing;                                   /* 5 */

The following list corresponds to the numbered items in the preceding program:

1. By default, the Listing destination is open. To conserve resources, the program uses the ODS LISTING CLOSE statement to close this destination.
2. The ODS PRINTER statement opens the Printer destination and specifies the file to write to. The PS (PostScript) option ensures that you create a generic PostScript file. If this option is missing, ODS produces output for your current printer, if possible.
3. The MEANS procedure produces summary statistics for the average SAT scores of entering first-year college students. The output is grouped by the value of Gender in the CLASS statement and the value of Test in the BY statement.
4. The ODS PRINTER CLOSE statement closes the Printer destination to make output available for printing.
5. The ODS LISTING statement reopens the Listing destination so that the next program that you run can produce Listing output.

The following output shows the results:

Display 32.3 ODS Output: PostScript Format

Creating RTF Output for Microsoft Word

You can create output that is formatted for use with Microsoft Word if you open the RTF destination. Before you can access the file, you must close the RTF destination.

The following example executes the UNIVARIATE procedure and creates an RTF file that contains summary statistics for the average SAT scores of entering first-year college students. The output is grouped by the CLASS variable Gender.
   ods listing close;                        /* 1 */
   ods rtf file='odsrtf_output.rtf';         /* 2 */

   proc univariate data=sat_scores;          /* 3 */
      var SATscore;
      class Gender;
      title1 'Average SAT Scores Entering College Classes, 1972-1998*';
      footnote1 '* Recentered Scale for 1987-1995';
   run;

   ods rtf close;                            /* 4 */
   ods listing;                              /* 5 */

The following list corresponds to the numbered items in the preceding program:

1. By default, the Listing destination is open. To conserve resources, the ODS LISTING CLOSE statement closes this destination.
2. The ODS RTF statement opens the RTF destination and specifies the file to write to.
3. The UNIVARIATE procedure produces summary statistics for the average SAT scores of entering first-year college students. The output is grouped by the CLASS variable Gender.
4. The ODS RTF CLOSE statement closes the RTF destination to make output available.
5. The ODS LISTING statement reopens the Listing destination so that the next program that you run can produce Listing output.

The following output shows the first page of the RTF output:

Display 32.4 ODS Output: RTF Format

        Average SAT Scores Entering College Classes, 1972–1998*         1

                        The UNIVARIATE Procedure
                          Variable: SATscore
                               Gender=f

                                Moments
   N                          54    Sum Weights                 54
   Mean               492.425926    Sum Observations         26591
   Std Deviation      13.1272464    Variance            172.324598
   Skewness           0.38649931    Kurtosis            0.03082111
   Uncorrected SS       13103231    Corrected SS         9133.2037
   Coeff Variation    2.66588169    Std Error Mean      1.78639197

                      Basic Statistical Measures
         Location                     Variability
   Mean      492.4259     Std Deviation            13.12725
   Median    495.5000     Variance                172.32460
   Mode      473.0000     Range                    56.00000
                          Interquartile Range      20.00000

   NOTE: The mode displayed is the smallest of 4 modes with a count of 4.

                      Tests for Location: Mu0=0
   Test            Statistic         p Value
   Student's t     t   275.6539      Pr > |t|     <.0001
   Sign            M         27      Pr >= |M|    <.0001
   Signed Rank     S      742.5      Pr >= |S|    <.0001

                       Quantiles (Definition 5)
   Quantile        Estimate
   100% Max           529.0
   99%                529.0
   95%                520.0
   90%                505.0
   75% Q3             502.0
   50% Median         495.5

                    * Recentered Scale for 1987–1995

Selecting the Output That You Want to Format

Identifying Output

Program output, in the form of output objects, contains both the results of a procedure or DATA step and information about how to format the results. To select an output object for formatting, you need to know which output objects your program creates. To identify the output objects, use the ODS TRACE statement. The simplest form of the ODS TRACE statement is as follows:

   ODS TRACE ON|OFF;

ODS TRACE determines whether to write to the SAS log a record of each output object that a program creates. The ON option writes the trace record to the log, and the OFF option suppresses the writing of the trace record.

The trace record has the following components:

Name
   is the name of the output object.
Label
   is the label that briefly describes the contents of the output object.
Template
   is the name of the table definition that ODS used to format the output object.
Path
   shows the location of the output object.

In the ODS SELECT statement in your program, you can refer to an output object by name, label, or path.

The following program executes the UNIVARIATE procedure and writes a trace record to the SAS log.

   ods trace on;

   proc univariate data=sat_scores;
      var SATscore;
      class Gender;
      title1 'Average SAT Scores Entering College Classes, 1972-1998*';
      footnote1 '* Recentered Scale for 1987-1995';
   run;

   ods trace off;

The following output shows the results of ODS TRACE. Two sets of output objects are listed because the program uses the class variable Gender to separate male and female results.
The path component of the output objects identifies the female (f) and male (m) objects.

Output 32.3 ODS TRACE Output in the Log

   403  ods trace on;
   404
   405  proc univariate data=sat_scores;
   406     var SATscore;
   407     class Gender;
   408     title1 'Average SAT Scores Entering College Classes, 1972-1998*';
   409     footnote1 '* Recentered Scale for 1987-1995';
   410  run;

   Output Added:
   -------------
   Name:       Moments
   Label:      Moments
   Template:   base.univariate.Moments
   Path:       Univariate.SATscore.f.Moments
   -------------

   Output Added:
   -------------
   Name:       BasicMeasures
   Label:      Basic Measures of Location and Variability
   Template:   base.univariate.Measures
   Path:       Univariate.SATscore.f.BasicMeasures
   -------------

   Output Added:
   -------------
   Name:       TestsForLocation
   Label:      Tests For Location
   Template:   base.univariate.Location
   Path:       Univariate.SATscore.f.TestsForLocation
   -------------

   Output Added:
   -------------
   Name:       Quantiles
   Label:      Quantiles
   Template:   base.univariate.Quantiles
   Path:       Univariate.SATscore.f.Quantiles
   -------------

   Output Added:
   -------------
   Name:       ExtremeObs
   Label:      Extreme Observations
   Template:   base.univariate.ExtObs
   Path:       Univariate.SATscore.f.ExtremeObs
   -------------

   Output Added:
   -------------
   Name:       Moments
   Label:      Moments
   Template:   base.univariate.Moments
   Path:       Univariate.SATscore.m.Moments
   -------------

   Output Added:
   -------------
   Name:       BasicMeasures
   Label:      Basic Measures of Location and Variability
   Template:   base.univariate.Measures
   Path:       Univariate.SATscore.m.BasicMeasures
   -------------

   Output Added:
   -------------
   Name:       TestsForLocation
   Label:      Tests For Location
   Template:   base.univariate.Location
   Path:       Univariate.SATscore.m.TestsForLocation
   -------------

   Output Added:
   -------------
   Name:       Quantiles
   Label:      Quantiles
   Template:   base.univariate.Quantiles
   Path:       Univariate.SATscore.m.Quantiles
   -------------

   Output Added:
   -------------
   Name:       ExtremeObs
   Label:      Extreme Observations
   Template:   base.univariate.ExtObs
   Path:       Univariate.SATscore.m.ExtremeObs
   -------------
   411
   412  ods trace off;

Selecting and Excluding Program Output

For each destination, ODS maintains a selection list or an exclusion list. The selection list is a list of output objects that produce formatted output. The exclusion list is a list of output objects for which no output is produced.

You can select and exclude output objects by specifying the destination in an ODS SELECT or ODS EXCLUDE statement. If you do not specify a destination, ODS sends output to all open destinations. Selection and exclusion lists can be modified and reset at different points in a SAS session, such as at procedure boundaries. If you end each procedure with an explicit QUIT statement, rather than waiting for the next PROC or DATA step to end it for you, the QUIT statement resets the selection list.

To choose one or more output objects and send them to open ODS destinations, use the ODS SELECT statement. The simplest form of the ODS SELECT statement is as follows:

   ODS <ODS-destination> SELECT output-object(s);

The optional argument ODS-destination identifies the output format, and output-object specifies one or more output objects to add to a selection list.

To exclude one or more output objects from being sent to open destinations, use the ODS EXCLUDE statement. The simplest form of the ODS EXCLUDE statement is as follows:

   ODS <ODS-destination> EXCLUDE output-object(s);

The optional argument ODS-destination identifies the output format, and output-object specifies one or more output objects to add to an exclusion list.

The following example executes the UNIVARIATE procedure and creates 10 output objects. The ODS SELECT statement uses the name component in the trace records to select only the BasicMeasures and the TestsForLocation output objects. Because the HTML and Printer destinations are open, ODS creates HTML and Printer output from the output objects.
   options nodate pageno=1;

   ods listing close;
   ods html file='odsselect-body.htm'
            contents='odsselect-contents.htm'
            page='odsselect-page.htm'
            frame='odsselect-frame.htm';
   ods printer file='odsprinter-select.ps';

   ods select BasicMeasures TestsForLocation;

   proc univariate data=sat_scores;
      var SATscore;
      class Gender;
      title1 'Average SAT Scores Entering College Classes, 1972-1998*';
      footnote1 '* Recentered Scale for 1987-1995';
   run;

   ods html close;
   ods printer close;
   ods listing;

The following two displays show the results in Printer format. They show the Basic Statistical Measures and Tests for Location tables based on gender.

Display 32.5 ODS SELECT Statement: Printer Format (females)

        Average SAT Scores Entering College Classes, 1972–1998*         1

                        The UNIVARIATE Procedure
                          Variable: SATscore
                              Gender = f

                      Basic Statistical Measures
         Location                     Variability
   Mean      492.4259     Std Deviation            13.12725
   Median    495.5000     Variance                172.32460
   Mode      473.0000     Range                    56.00000
                          Interquartile Range      20.00000

   NOTE: The mode displayed is the smallest of 4 modes with a count of 4.

                      Tests for Location: Mu0=0
   Test            Statistic         p Value
   Student's t     t   275.6539      Pr > |t|     <.0001
   Sign            M         27      Pr >= |M|    <.0001
   Signed Rank     S      742.5      Pr >= |S|    <.0001

                    * Recentered Scale for 1987–1995

Display 32.6 ODS SELECT Statement: Printer Format (males)

        Average SAT Scores Entering College Classes, 1972–1998*         2

                        The UNIVARIATE Procedure
                          Variable: SATscore
                              Gender = m

                      Basic Statistical Measures
         Location                     Variability
   Mean      516.0185     Std Deviation             7.90865
   Median    516.0000     Variance                 62.54682
   Mode      523.0000     Range                    30.00000
                          Interquartile Range      14.00000

                      Tests for Location: Mu0=0
   Test            Statistic         p Value
   Student's t     t   479.4679      Pr > |t|     <.0001
   Sign            M         27      Pr >= |M|    <.0001
   Signed Rank     S      742.5      Pr >= |S|    <.0001

                    * Recentered Scale for 1987–1995

The following two displays show the results in HTML format. They, too, show the Basic Statistical Measures and Tests for Location tables based on gender.

Display 32.7 ODS SELECT Statement: HTML Format (females)

[The display shows the frame file. The Table of Contents lists, under The Univariate Procedure and SATscore, the entries Basic Measures of Location and Variability and Tests For Location for Gender = f and for Gender = m. The Table of Pages lists Page 1 and Page 2. The body shows the Basic Statistical Measures and Tests for Location tables for Gender = f, with the same values as in Display 32.5.]

Display 32.8 ODS SELECT Statement: HTML Format (males)

Creating a SAS Data Set

ODS enables you to create a SAS data set from an output object. To create a single output data set, use the following form of the ODS OUTPUT statement:

   ODS OUTPUT output-object(s)=SAS-data-set;

The argument output-object specifies one or more output objects to turn into a SAS data set, and SAS-data-set specifies the data set that you want to create.

In the following program, ODS opens the Output destination and creates the SAS data set MYFILE.MEASURES from the output object BasicMeasures. ODS then closes the Output destination.

   libname myfile 'SAS-data-library';

   ods listing close;                             /* 1 */
   ods output BasicMeasures=myfile.measures;      /* 2 */

   proc univariate data=sat_scores;               /* 3 */
      var SATscore;
      class Gender;
   run;

   ods output close;                              /* 4 */
   ods listing;                                   /* 5 */

The following list corresponds to the numbered items in the preceding program:

1. By default, the Listing destination is open. To conserve resources, the ODS LISTING CLOSE statement closes this destination.
2. The ODS OUTPUT statement opens the Output destination and specifies the permanent data set to create from the output object BasicMeasures.
3. The UNIVARIATE procedure produces summary statistics for the average SAT scores of entering first-year college students. The output is grouped by the CLASS variable Gender.
4. The ODS OUTPUT CLOSE statement closes the Output destination.
5. The ODS LISTING statement reopens the default Listing destination so that the next program that you run can produce Listing output.
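Once the data set exists, it can be examined like any other SAS data set. The following sketch assumes that the preceding program has already run in the same SAS session, so that the libref MYFILE is still assigned and the Listing destination is open again. The title text is an illustrative choice and is not part of the original example.

```sas
/* List the variables that ODS created in the data set. */
proc contents data=myfile.measures;
run;

/* Print the observations built from the BasicMeasures output object. */
proc print data=myfile.measures;
   title 'Data Set Created from the BasicMeasures Output Object';
run;

/* Clear the title so that it does not carry over to later steps. */
title;
```

Running PROC CONTENTS first is a convenient way to learn the variable names before you refer to them in later DATA or PROC steps.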
The following SAS log shows that the MYFILE.MEASURES data set was created with the ODS OUTPUT statement:

Output 32.4 Partial SAS Log: SAS Data Set Creation

   404  libname myfile 'SAS-data-library';
   NOTE: Libref MYFILE was successfully assigned as follows:
         Engine:        V8
         Physical Name: path-name
   405  ods listing close;
   406  ods output BasicMeasures=myfile.measures;
   407
   408  proc univariate data=sat_scores;
   409     var SATscore;
   410     class Gender;
   411  run;
   NOTE: The data set MYFILE.MEASURES has 8 observations and 6 variables.

Customizing ODS Output

Customizing ODS Output at the Level of a SAS Job

ODS provides a way for you to customize output at the level of the SAS job. To do this, you use a style definition, which describes how to show such items as color, font face, font size, and so on. The style definition determines the appearance of the output. The fancyprinter style definition is one of several that are available with SAS.

The following example uses the fancyprinter style definition to customize program output. The output consists of two output objects, Moments and BasicMeasures, that the UNIVARIATE procedure creates. The STYLE= option in the ODS PRINTER statement specifies that the program use the fancyprinter style.

   options nodate pageno=1;

   ods listing close;
   ods printer ps file='style_job.ps' style=fancyprinter;

   ods select Moments BasicMeasures;

   proc univariate data=sat_scores;
      var SATscore;
      title 'Average SAT Scores for Entering College Classes, 1972-1982*';
      footnote1 '* Recentered Scale for 1987-1995';
   run;

   ods printer close;
   ods listing;

The following output shows the results:

Display 32.9 Printer Output: Titles, Footnote, and Variables Printed in Italics

For detailed information about style and table definitions, as well as the TEMPLATE procedure, see SAS Output Delivery System: User’s Guide.
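To see which style definitions you can name in the STYLE= option, one approach is to ask the TEMPLATE procedure to list them. This is a minimal sketch: it writes the list to the open destinations, and the names that appear depend on your SAS release and on any template stores that are defined at your site.

```sas
/* List the style definitions that are available in this session. */
proc template;
   list styles;
run;
```

Any style name from the list can then be substituted for fancyprinter in the ODS PRINTER statement of the preceding example.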
Customizing ODS Output by Using a Template

Another way to customize ODS output is by using a template. In ODS, templates are called table definitions. A table definition describes how to format the output. It can determine the order of table headings and footnotes, the order of columns, and the appearance of the output. A table definition can contain one or more columns, headings, or footnotes.

Many procedures that fully support ODS provide table definitions that you can customize. You can also create your own table definition by using the TEMPLATE procedure. The following is a simplified form of the TEMPLATE procedure:

   PROC TEMPLATE;
      DEFINE table-definition;
         HEADER header(s);
         COLUMN column(s);
      END;

The DEFINE statement creates the table definition that serves as the template for writing the output. The HEADER statement specifies the order of the headings, and the COLUMN statement specifies the order of the columns. The arguments in each of these statements point to routines in the program that format the output. The END statement ends the table definition.

The following example shows how to use PROC TEMPLATE to create customized HTML and printer output. In the example, the SAS program creates a customized table definition for the Basic Measures output table from PROC UNIVARIATE. The following customized version shows that
- the “Measures of Variability” section precedes the “Measures of Location” section
- column headings are modified
- statistics are displayed in a bold, italic font with a 7.3 format.
   options nodate nonumber linesize=80 pagesize=60;          /* 1 */

   proc template;                                             /* 2 */
      define table base.univariate.Measures;                  /* 3 */

         header h1 h2 h3;                                     /* 4 */
         column VarMeasure VarValue LocMeasure LocValue;      /* 5 */

         define h1;                                           /* 6 */
            text "Basic Statistical Measures";
            spill_margin=on;
            space=1;
         end;

         define h2;                                           /* 6 */
            text "Measures of Variability";
            start=VarMeasure;
            end=VarValue;
         end;

         define h3;                                           /* 6 */
            text "Measures of Location";
            start=LocMeasure;
            end=LocValue;
         end;

         define LocMeasure;                                   /* 7 */
            print_headers=off;
            glue=2;
            space=3;
            style=rowheader;
         end;

         define LocValue;                                     /* 7 */
            print_headers=off;
            space=5;
            format=7.3;
            style=data{font_style=italic font_weight=bold};
         end;

         define VarMeasure;                                   /* 7 */
            print_headers=off;
            glue=2;
            space=3;
            style=rowheader;
         end;

         define VarValue;                                     /* 7 */
            print_headers=off;
            format=7.3;
            style=data{font_style=italic font_weight=bold};
         end;
      end;                                                    /* 8 */
   run;                                                       /* 9 */

   ods listing close;
   ods html file='scores-body.htm'                            /* 10 */
            contents='scores-contents.htm'
            page='scores-page.htm'
            frame='scores-frame.htm';
   ods printer file='scores.ps';                              /* 11 */

   ods select BasicMeasures;                                  /* 12 */

   title;
   proc univariate data=sorted_scores mu0=3.5;                /* 13 */
      var SATscore;
   run;

   ods html close;                                            /* 14 */
   ods printer close;                                         /* 14 */
   ods listing;                                               /* 15 */

The following list corresponds to the numbered items in the preceding program:

1. All four options affect the Listing output. The NODATE and NONUMBER options affect the Printer output. None of the options affects the HTML output.
2. PROC TEMPLATE begins the procedure for creating a table.
3. The DEFINE statement creates the table definition base.univariate.Measures in SASUSER.
4. The HEADER statement determines the order in which the table definition uses the headings, which are defined later in the program.
5. The COLUMN statement determines the order in which the variables appear. PROC UNIVARIATE names the variables.
6. These DEFINE blocks define the three headings and specify the text to use for each heading. By default, a heading spans all columns. This is the case for H1. H2 spans the variables VarMeasure and VarValue. H3 spans LocMeasure and LocValue.
7. These DEFINE blocks specify characteristics for each of the four variables. They use FORMAT= to specify a format of 7.3 for LocValue and VarValue. They also use STYLE= to specify a bold, italic font for these two variables. The STYLE= option does not affect the Listing output.
8. The END statement ends the table definition.
9. The RUN statement executes the procedure.
10. The ODS HTML statement begins the program that uses the customized table definition. It opens the HTML destination and identifies the files to write to.
11. The ODS PRINTER statement opens the Printer destination and identifies the file to write to.
12. The ODS SELECT statement selects the output object that contains the basic measures.
13. PROC UNIVARIATE produces one object for each variable. It uses the customized table definition to format the data.
14. The ODS statements close the HTML and the Printer destinations.
15. The ODS LISTING statement opens the Listing destination for output.

The following display shows the printer output:

Display 32.10 Customized Printer Output from the TEMPLATE Procedure

                        The UNIVARIATE Procedure
                          Variable: SATscore

                      Basic Statistical Measures

       Measures of Variability           Measures of Location
   Std Deviation           16.025        Mean      504.222
   Variance               256.791        Median    505.000
   Range                   58.000        Mode      503.000
   Interquartile Range     22.000

   NOTE: The mode displayed is the smallest of 3 modes with a count of 5.

The following display shows the HTML output:

Display 32.11 Customized HTML Output from the TEMPLATE Procedure

Storing Links to ODS Output

When you run a procedure that supports ODS, SAS automatically stores a link to each piece of ODS output in the Results folder in the Results window.
It marks the link with an icon that identifies the output destination that created the output.

In the following example, SAS executes the UNIVARIATE procedure and generates Listing, HTML, Printer, and Rich Text Format (RTF) output as well as a SAS data set (Output output). The output contains statistics for the average SAT scores of entering first-year college students. The output is grouped by the CLASS variable Gender.

   ods listing close;
   ods html file='store-links.htm';
   ods printer file='store-links.ps';
   ods rtf file='store-links.rtf';
   ods output basicmeasures=measures;

   proc univariate data=sat_scores;
      var SATscore;
      class Gender;
      title;
   run;

   ods _all_ close;
   ods listing;

PROC UNIVARIATE generates a folder called Univariate in the Results folder. Within this folder is another folder (SATscore) for the variable in the VAR statement. This folder contains two folders (Gender=f and Gender=m), one for each value of the CLASS variable Gender. The Gender=f and Gender=m folders each contain a folder for each output object. Within the folder for each output object is a link to each piece of output. The icon next to the link indicates which ODS destination created the output. In this example, the Moments output was sent to the Listing, HTML, Printer, and RTF destinations. The Basic Measures of Location and Variability output was sent to the Listing, HTML, Printer, RTF, and Output destinations.

The Results folder in the display that follows shows the folders and output objects that the UNIVARIATE procedure creates.

Display 32.12 View of the Results Folder

Review of SAS Tools

ODS Statements

ODS EXCLUDE output-object(s);
   specifies one or more output objects to add to an exclusion list.

ODS HTML HTML-file-specification(s)