Step-by-Step Programming with
Base SAS® Software
The correct bibliographic citation for this manual is as follows: SAS Institute Inc.
2001. Step-by-Step Programming with Base SAS® Software. Cary, NC: SAS Institute Inc.
Step-by-Step Programming with Base SAS® Software
Copyright © 2001 by SAS Institute Inc., Cary, NC, USA.
ISBN 978-1-58025-791-6
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a
retrieval system, or transmitted, in any form or by any means, electronic, mechanical,
photocopying, or otherwise, without the prior written permission of the publisher, SAS
Institute Inc.
For a Web download or e-book: Your use of this publication shall be governed by the
terms established by the vendor at the time you acquire this publication.
U.S. Government Restricted Rights Notice. Use, duplication, or disclosure of this
software and related documentation by the U.S. government is subject to the Agreement
with SAS Institute and the restrictions set forth in FAR 52.227-19 Commercial Computer
Software-Restricted Rights (June 1987).
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
February 2007
SAS® Publishing provides a complete selection of books and electronic products to help
customers use SAS software to its fullest potential. For more information about our
e-books, e-learning products, CDs, and hard-copy books, visit the SAS Publishing Web site
at support.sas.com/pubs or call 1-800-727-3228.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks
or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA
registration.
Other brand and product names are registered trademarks or trademarks of their
respective companies.
Contents
PART 1 Introduction to the SAS System 1
Chapter 1 What Is the SAS System? 3
Introduction to the SAS System 3
Components of Base SAS Software 4
Output Produced by the SAS System 8
Ways to Run SAS Programs 11
Running Programs in the SAS Windowing Environment 13
Review of SAS Tools 15
Learning More 16
PART 2 Getting Your Data into Shape 17
Chapter 2 Introduction to DATA Step Processing 19
Introduction to DATA Step Processing 20
The SAS Data Set: Your Key to the SAS System 20
How the DATA Step Works: A Basic Introduction 26
Supplying Information to Create a SAS Data Set 33
Review of SAS Tools 41
Learning More 41
Chapter 3 Starting with Raw Data: The Basics 43
Introduction to Raw Data 44
Examine the Structure of the Raw Data: Factors to Consider 44
Reading Unaligned Data 44
Reading Data That Is Aligned in Columns 47
Reading Data That Requires Special Instructions 50
Reading Unaligned Data with More Flexibility 53
Mixing Styles of Input 55
Review of SAS Tools 58
Learning More 59
Chapter 4 Starting with Raw Data: Beyond the Basics 61
Introduction to Beyond the Basics with Raw Data 61
Testing a Condition before Creating an Observation 62
Creating Multiple Observations from a Single Record 63
Reading Multiple Records to Create a Single Observation 67
Problem Solving: When an Input Record Unexpectedly Does Not Have Enough Values 74
Review of SAS Tools 77
Learning More 79
Chapter 5 Starting with SAS Data Sets 81
Introduction to Starting with SAS Data Sets 81
Understanding the Basics 82
Input SAS Data Set for Examples 82
Reading Selected Observations 84
Reading Selected Variables 85
Creating More Than One Data Set in a Single DATA Step 89
Using the DROP= and KEEP= Data Set Options for Efficiency 91
Review of SAS Tools 92
Learning More 93
PART 3 Basic Programming 95
Chapter 6 Understanding DATA Step Processing 97
Introduction to DATA Step Processing 97
Input SAS Data Set for Examples 97
Adding Information to a SAS Data Set 98
Defining Enough Storage Space for Variables 103
Conditionally Deleting an Observation 104
Review of SAS Tools 105
Learning More 105
Chapter 7 Working with Numeric Variables 107
Introduction to Working with Numeric Variables 107
About Numeric Variables in SAS 108
Input SAS Data Set for Examples 108
Calculating with Numeric Variables 109
Comparing Numeric Variables 113
Storing Numeric Variables Efficiently 115
Review of SAS Tools 116
Learning More 117
Chapter 8 Working with Character Variables 119
Introduction to Working with Character Variables 119
Input SAS Data Set for Examples 120
Identifying Character Variables and Expressing Character Values 121
Setting the Length of Character Variables 122
Handling Missing Values 124
Creating New Character Values 127
Saving Storage Space by Treating Numbers as Characters 134
Review of SAS Tools 135
Learning More 136
Chapter 9 Acting on Selected Observations 139
Introduction to Acting on Selected Observations 139
Input SAS Data Set for Examples 140
Selecting Observations 141
Constructing Conditions 145
Comparing Characters 152
Review of SAS Tools 156
Learning More 157
Chapter 10 Creating Subsets of Observations 159
Introduction to Creating Subsets of Observations 159
Input SAS Data Set for Examples 160
Selecting Observations for a New SAS Data Set 161
Conditionally Writing Observations to One or More SAS Data Sets 164
Review of SAS Tools 170
Learning More 170
Chapter 11 Working with Grouped or Sorted Observations 173
Introduction to Working with Grouped or Sorted Observations 173
Input SAS Data Set for Examples 174
Working with Grouped Data 175
Working with Sorted Data 181
Review of SAS Tools 185
Learning More 186
Chapter 12 Using More Than One Observation in a Calculation 187
Introduction to Using More Than One Observation in a Calculation 187
Input File and SAS Data Set for Examples 188
Accumulating a Total for an Entire Data Set 189
Obtaining a Total for Each BY Group 191
Writing to Separate Data Sets 193
Using a Value in a Later Observation 196
Review of SAS Tools 199
Learning More 200
Chapter 13 Finding Shortcuts in Programming 201
Introduction to Shortcuts 201
Input File and SAS Data Set 201
Performing More Than One Action in an IF-THEN Statement 202
Performing the Same Action for a Series of Variables 204
Review of SAS Tools 207
Learning More 209
Chapter 14 Working with Dates in the SAS System 211
Introduction to Working with Dates 211
Understanding How SAS Handles Dates 212
Input File and SAS Data Set for Examples 213
Entering Dates 214
Displaying Dates 217
Using Dates in Calculations 221
Using SAS Date Functions 223
Comparing Durations and SAS Date Values 225
Review of SAS Tools 227
Learning More 228
PART 4 Combining SAS Data Sets 231
Chapter 15 Methods of Combining SAS Data Sets 233
Introduction to Combining SAS Data Sets 233
Definition of Concatenating 234
Definition of Interleaving 234
Definition of Merging 235
Definition of Updating 236
Definition of Modifying 237
Comparing Modifying, Merging, and Updating Data Sets 238
Learning More 239
Chapter 16 Concatenating SAS Data Sets 241
Introduction to Concatenating SAS Data Sets 241
Concatenating Data Sets with the SET Statement 242
Concatenating Data Sets Using the APPEND Procedure 255
Choosing between the SET Statement and the APPEND Procedure 259
Review of SAS Tools 260
Learning More 260
Chapter 17 Interleaving SAS Data Sets 263
Introduction to Interleaving SAS Data Sets 263
Understanding BY-Group Processing Concepts 263
Interleaving Data Sets 264
Review of SAS Tools 267
Learning More 267
Chapter 18 Merging SAS Data Sets 269
Introduction to Merging SAS Data Sets 270
Understanding the MERGE Statement 270
One-to-One Merging 270
Match-Merging 276
Choosing between One-to-One Merging and Match-Merging 286
Review of SAS Tools 290
Learning More 290
Chapter 19 Updating SAS Data Sets 293
Introduction to Updating SAS Data Sets 293
Understanding the UPDATE Statement 294
Understanding How to Select BY Variables 294
Updating a Data Set 295
Updating with Incremental Values 300
Understanding the Differences between Updating and Merging 302
Handling Missing Values 305
Review of SAS Tools 308
Learning More 309
Chapter 20 Modifying SAS Data Sets 311
Introduction 311
Input SAS Data Set for Examples 312
Modifying a SAS Data Set: The Simplest Case 313
Modifying a Master Data Set with Observations from a Transaction Data Set 314
Understanding How Duplicate BY Variables Affect File Update 317
Handling Missing Values 319
Review of SAS Tools 320
Learning More 321
Chapter 21 Conditionally Processing Observations from Multiple SAS Data Sets 323
Introduction to Conditional Processing from Multiple SAS Data Sets 323
Input SAS Data Sets for Examples 324
Determining Which Data Set Contributed the Observation 326
Combining Selected Observations from Multiple Data Sets 328
Performing a Calculation Based on the Last Observation 330
Review of SAS Tools 332
Learning More 332
PART 5 Understanding Your SAS Session 333
Chapter 22 Analyzing Your SAS Session with the SAS Log 335
Introduction to Analyzing Your SAS Session with the SAS Log 335
Understanding the SAS Log 336
Locating the SAS Log 337
Understanding the Log Structure 337
Writing to the SAS Log 339
Suppressing Information to the SAS Log 341
Changing the Log’s Appearance 344
Review of SAS Tools 346
Learning More 346
Chapter 23 Directing SAS Output and the SAS Log 349
Introduction to Directing SAS Output and the SAS Log 349
Input File and SAS Data Set for Examples 350
Routing the Output and the SAS Log with PROC PRINTTO 351
Storing the Output and the SAS Log in the SAS Windowing Environment 353
Redefining the Default Destination in a Batch or Noninteractive Environment 354
Review of SAS Tools 355
Learning More 356
Chapter 24 Diagnosing and Avoiding Errors 357
Introduction to Diagnosing and Avoiding Errors 357
Understanding How the SAS Supervisor Checks a Job 357
Understanding How SAS Processes Errors 358
Distinguishing Types of Errors 358
Diagnosing Errors 359
Using a Quality Control Checklist 366
Learning More 366
PART 6 Producing Reports 369
Chapter 25 Producing Detail Reports with the PRINT Procedure 371
Introduction to Producing Detail Reports with the PRINT Procedure 372
Input File and SAS Data Sets for Examples 372
Creating Simple Reports 373
Creating Enhanced Reports 381
Creating Customized Reports 391
Making Your Reports Easy to Change 399
Review of SAS Tools 402
Learning More 405
Chapter 26 Creating Summary Tables with the TABULATE Procedure 407
Introduction to Creating Summary Tables with the TABULATE Procedure 408
Understanding Summary Table Design 408
Understanding the Basics of the TABULATE Procedure 410
Input File and SAS Data Set for Examples 412
Creating Simple Summary Tables 413
Creating More Sophisticated Summary Tables 419
Review of SAS Tools 431
Learning More 433
Chapter 27 Creating Detail and Summary Reports with the REPORT Procedure 435
Introduction to Creating Detail and Summary Reports with the REPORT Procedure 436
Understanding How to Construct a Report 436
Input File and SAS Data Set for Examples 438
Creating Simple Reports 439
Creating More Sophisticated Reports 446
Review of SAS Tools 454
Learning More 458
PART 7 Producing Plots and Charts 461
Chapter 28 Plotting the Relationship between Variables 463
Introduction to Plotting the Relationship between Variables 463
Input File and SAS Data Set for Examples 464
Plotting One Set of Variables 466
Enhancing the Plot 468
Plotting Multiple Sets of Variables 473
Review of SAS Tools 480
Learning More 481
Chapter 29 Producing Charts to Summarize Variables 483
Introduction to Producing Charts to Summarize Variables 484
Understanding the Charting Tools 484
Input File and SAS Data Set for Examples 485
Charting Frequencies with the CHART Procedure 487
Customizing Frequency Charts 494
Creating High-Resolution Histograms 503
Review of SAS Tools 514
Learning More 518
PART 8 Designing Your Own Output 519
Chapter 30 Writing Lines to the SAS Log or to an Output File 521
Introduction to Writing Lines to the SAS Log or to an Output File 521
Understanding the PUT Statement 522
Writing Output without Creating a Data Set 522
Writing Simple Text 523
Writing a Report 528
Review of SAS Tools 535
Learning More 536
Chapter 31 Understanding and Customizing SAS Output: The Basics 537
Introduction to the Basics of Understanding and Customizing SAS Output 538
Understanding Output 538
Input SAS Data Set for Examples 540
Locating Procedure Output 541
Making Output Informative 542
Controlling Output Appearance 548
Controlling the Appearance of Pages 550
Representing Missing Values 561
Review of SAS Tools 563
Learning More 564
Chapter 32 Understanding and Customizing SAS Output: The Output Delivery System (ODS) 565
Introduction to Customizing SAS Output by Using the Output Delivery System 565
Input Data Set for Examples 566
Understanding ODS Output Formats and Destinations 567
Selecting an Output Format 568
Creating Formatted Output 569
Selecting the Output That You Want to Format 577
Customizing ODS Output 585
Storing Links to ODS Output 589
Review of SAS Tools 590
Learning More 592
PART 9 Storing and Managing Data in SAS Files 593
Chapter 33 Understanding SAS Data Libraries 595
Introduction to Understanding SAS Data Libraries 595
What Is a SAS Data Library? 596
Accessing a SAS Data Library 596
Storing Files in a SAS Data Library 598
Referencing SAS Data Sets in a SAS Data Library 599
Review of SAS Tools 601
Learning More 601
Chapter 34 Managing SAS Data Libraries 603
Introduction 603
Choosing Your Tools 603
Understanding the DATASETS Procedure 604
Looking at a PROC DATASETS Session 605
Review of SAS Tools 606
Learning More 606
Chapter 35 Getting Information about Your SAS Data Sets 607
Introduction to Getting Information about Your SAS Data Sets 607
Input Data Library for Examples 608
Requesting a Directory Listing for a SAS Data Library 608
Requesting Contents Information about SAS Data Sets 610
Requesting Contents Information in Different Formats 613
Review of SAS Tools 615
Learning More 615
Chapter 36 Modifying SAS Data Set Names and Variable Attributes 617
Introduction to Modifying SAS Data Set Names and Variable Attributes 617
Input Data Library for Examples 618
Renaming SAS Data Sets 618
Modifying Variable Attributes 619
Review of SAS Tools 626
Learning More 627
Chapter 37 Copying, Moving, and Deleting SAS Data Sets 629
Introduction to Copying, Moving, and Deleting SAS Data Sets 629
Input Data Libraries for Examples 630
Copying SAS Data Sets 630
Copying Specific SAS Data Sets 634
Moving SAS Data Libraries and SAS Data Sets 635
Deleting SAS Data Sets 637
Deleting All Files in a SAS Data Library 639
Review of SAS Tools 640
Learning More 640
PART 10 Understanding Your SAS Environment 641
Chapter 38 Introducing the SAS Environment 643
Introduction to the SAS Environment 644
Starting a SAS Session 645
Selecting a SAS Processing Mode 645
Review of SAS Tools 652
Learning More 654
Chapter 39 Using the SAS Windowing Environment 655
Introduction to Using the SAS Windowing Environment 657
Getting Organized 657
Finding Online Help 660
Using SAS Windowing Environment Command Types 660
Working with SAS Windows 663
Working with Text 667
Working with Files 671
Working with SAS Programs 676
Working with Output 682
Review of SAS Tools 690
Learning More 692
Chapter 40 Customizing the SAS Environment 693
Introduction to Customizing the SAS Environment 694
Customizing Your Current Session 695
Customizing Session-to-Session Settings 698
Customizing the SAS Windowing Environment 702
Review of SAS Tools 707
Learning More 708
PART 11 Appendix 709
Appendix 1 Additional Data Sets 711
Introduction 711
Data Set CITY 712
Raw Data Used for “Understanding Your SAS Session” Section 713
Data Set SAT_SCORES 714
Data Set YEAR_SALES 715
Data Set HIGHLOW 716
Data Set GRADES 717
Data Sets for “Storing and Managing Data in SAS Files” Section 718
Glossary 723
Index 745
PART 1 Introduction to the SAS System
Chapter 1 What Is the SAS System? 3
CHAPTER 1 What Is the SAS System?
Introduction to the SAS System 3
Components of Base SAS Software 4
Overview of Base SAS Software 4
Data Management Facility 4
Programming Language 5
Elements of the SAS Language 5
Rules for SAS Statements 6
Rules for Most SAS Names 6
Special Rules for Variable Names 6
Data Analysis and Reporting Utilities 6
Output Produced by the SAS System 8
Traditional Output 8
Output from the Output Delivery System (ODS) 9
Ways to Run SAS Programs 11
Selecting an Approach 11
SAS Windowing Environment 11
SAS/ASSIST Software 12
Noninteractive Mode 12
Batch Mode 12
Interactive Line Mode 13
Running Programs in the SAS Windowing Environment 13
Review of SAS Tools 15
Statements 15
Procedures 15
Learning More 16
Introduction to the SAS System
SAS is an integrated system of software solutions that enables you to perform the
following tasks:
data entry, retrieval, and management
report writing and graphics design
statistical and mathematical analysis
business forecasting and decision support
operations research and project management
applications development
How you use SAS depends on what you want to accomplish. Some people use many of
the capabilities of the SAS System, and others use only a few.
At the core of the SAS System is Base SAS software, which is the software product
that you will learn to use in this documentation. This section presents an overview of
Base SAS. It introduces the capabilities of Base SAS, addresses methods of running
SAS, and outlines various types of output.
Components of Base SAS Software
Overview of Base SAS Software
Base SAS software contains the following:
a data management facility
a programming language
data analysis and reporting utilities
Learning to use Base SAS enables you to work with these features of SAS. It also
prepares you to learn other SAS products, because all SAS products follow the same
basic rules.
Data Management Facility
SAS organizes data into a rectangular form or table that is called a SAS data set.
The following figure shows a SAS data set. The data describes participants in a
16-week weight program at a health and fitness club. The data for each participant
includes an identification number, name, team name, and weight (in U.S. pounds) at
the beginning and end of the program.
Figure 1.1 Rectangular Form of a SAS Data Set
     IdNumber   Name              Team     StartWeight   EndWeight
1    1023       David Shaw        red      189           165
2    1049       Amelia Serrano    yellow   145           124
3    1219       Alan Nance        red      210           192
4    1246       Ravi Sinha        yellow   194           177
5    1078       Ashley McKnight   red      127           118
(Callouts in the figure label a column as a variable, a row as an observation, and
individual cells as data values.)
In a SAS data set, each row represents information about an individual entity and is
called an observation. Each column represents the same type of information and is
called a variable. Each separate piece of information is a data value. In a SAS data set,
an observation contains all the data values for an entity; a variable contains the same
type of data value for all entities.
To build a SAS data set with Base SAS, you write a program that uses statements in
the SAS programming language. A portion of a SAS program that begins with a DATA
statement and typically creates a SAS data set or a report is called a DATA step.
The following SAS program creates a SAS data set named WEIGHT_CLUB from the
health club data:
/* 1 */ data weight_club;
/* 2 */    input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
/* 3 */    Loss=StartWeight-EndWeight;
/* 4 */    datalines;
1023 David Shaw          red    189 165
1049 Amelia Serrano      yellow 145 124
1219 Alan Nance          red    210 192
1246 Ravi Sinha          yellow 194 177
1078 Ashley McKnight     red    127 118
;
run;
The following list corresponds to the numbered comments in the preceding program.
Items 5 and 6 describe the data lines and the semicolon that follows them, which are
left unmarked in the program:
1. The DATA statement tells SAS to begin building a SAS data set named
WEIGHT_CLUB.
2. The INPUT statement identifies the fields to be read from the input data and
names the SAS variables to be created from them (IdNumber, Name, Team,
StartWeight, and EndWeight). IdNumber is read from columns 1-4 and Name from
columns 6-24, so the data lines must keep these values in those columns.
3. The third statement is an assignment statement. It calculates the weight each
person lost and assigns the result to a new variable, Loss.
4. The DATALINES statement indicates that data lines follow.
5. The data lines follow the DATALINES statement. This approach to processing
raw data is useful when you have only a few lines of data. (Later sections show
ways to access larger amounts of data that are stored in files.)
6. The semicolon signals the end of the raw data, and is a step boundary. It tells
SAS that the preceding statements are ready for execution.
Note: By default, the data set WEIGHT_CLUB is temporary; that is, it exists only
for the current job or session. For information about how to create a permanent SAS
data set, see Chapter 2, “Introduction to DATA Step Processing,” on page 19.
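The difference between a temporary and a permanent data set can be sketched as follows. A one-level name such as WEIGHT_CLUB is stored in the temporary WORK library (the log later in this chapter reports it as WORK.WEIGHT_CLUB), whereas a two-level name stores the data set in a library that you assign. The libref mylib is hypothetical, and the current directory ('.') is used here only as a stand-in so that the sketch has somewhere to write. Substitute a directory that suits your operating environment. The sketch assumes that the preceding program has already created WEIGHT_CLUB in the current session.

```sas
/* Hypothetical libref; '.' (the current directory) is an assumption.   */
/* Replace it with a directory that you can write to.                   */
libname mylib '.';

/* A two-level name (libref.name) writes a permanent copy of the data set. */
data mylib.weight_club;
   set weight_club;   /* reads the temporary data set WORK.WEIGHT_CLUB */
run;
```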
Programming Language
Elements of the SAS Language
The statements that created the data set WEIGHT_CLUB are part of the SAS
programming language. The SAS language contains statements, expressions, functions
and CALL routines, options, formats, and informats – elements that many
programming languages share. However, the way you use the elements of the SAS
language depends on certain programming rules. The most important rules are listed in
the next two sections.
Rules for SAS Statements
The conventions that are shown in the programs in this documentation, such as
indenting of subordinate statements, extra spacing, and blank lines, are for the purpose
of clarity and ease of use. They are not required by SAS. There are only a few rules for
writing SAS statements:
SAS statements end with a semicolon.
You can enter SAS statements in lowercase, uppercase, or a mixture of the two.
You can begin SAS statements in any column of a line and write several
statements on the same line.
You can begin a statement on one line and continue it on another line, but you
cannot split a word between two lines.
Words in SAS statements are separated by blanks or by special characters (such
as the equal sign and the minus sign in the calculation of the Loss variable in the
WEIGHT_CLUB example).
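As an illustration of these rules, the three DATA steps in the following sketch are equivalent. Each one creates a one-observation data set with the same three variables. The data set name LOSS_DEMO is hypothetical and is used only for this illustration.

```sas
/* Form 1: one statement per line, indented (the style used in this documentation). */
data loss_demo;
   StartWeight=189;
   EndWeight=165;
   Loss=StartWeight-EndWeight;
run;

/* Form 2: uppercase, several statements on one line, starting in an arbitrary column. */
      DATA LOSS_DEMO; STARTWEIGHT=189; ENDWEIGHT=165; LOSS=STARTWEIGHT-ENDWEIGHT; RUN;

/* Form 3: a statement continued on a second line, with optional blanks */
/* around the special characters that separate the words.              */
data loss_demo;
   StartWeight = 189;
   EndWeight = 165;
   Loss = StartWeight
          - EndWeight;
run;
```

The only requirement that all three forms share is the semicolon at the end of each statement.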
Rules for Most SAS Names
SAS names are used for SAS data set names, variable names, and other items. The
following rules apply:
A SAS name can contain from one to 32 characters.
The first character must be a letter or an underscore (_).
Subsequent characters must be letters, numbers, or underscores.
Blanks cannot appear in SAS names.
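The naming rules can be illustrated with a short sketch. All of the names below are hypothetical and were chosen only to exercise the rules. The names that break the rules appear inside a comment so that the step still runs.

```sas
data _club2007;            /* valid: begins with an underscore, then letters and digits */
   start_weight_lb = 189;  /* valid: letters and underscores                            */
   Week16 = 165;           /* valid: digits are allowed after the first character       */
run;

/* The following are not valid SAS names under the rules above:
      16week        (begins with a digit)
      start weight  (contains a blank)
      loss-lb       (contains a special character other than an underscore)  */
```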
Special Rules for Variable Names
For variable names only, SAS remembers the combination of uppercase and
lowercase letters that you use when you create the variable name. Internally, the case
of letters does not matter. “CAT,” “cat,” and “Cat” all represent the same variable. But
for presentation purposes, SAS remembers the initial case of each letter and uses it to
represent the variable name when printing it.
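A minimal sketch of this behavior follows. The data set name CASE_DEMO is hypothetical. The step refers to one variable with three different spellings, and the PRINT procedure is expected to show a single column whose heading uses the first spelling, Cat.

```sas
data case_demo;
   Cat = 1;         /* first use of the name: SAS records the spelling Cat */
   CAT = cat + 1;   /* CAT and cat refer to the same variable              */
run;

proc print data=case_demo;   /* only one variable exists in the data set */
run;
```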
Data Analysis and Reporting Utilities
The SAS programming language is both powerful and flexible. You can program any
number of analyses and reports with it. SAS can also simplify programming for you
with its library of built-in programs known as SAS procedures. SAS procedures use
data values from SAS data sets to produce preprogrammed reports, requiring minimal
effort from you.
For example, the following SAS program produces a report that displays the values
of the variables in the SAS data set WEIGHT_CLUB. Weight values are presented in
U.S. pounds.
options linesize=80 pagesize=60 pageno=1 nodate;
proc print data=weight_club;
   title 'Health Club Data';
run;
This procedure, known as the PRINT procedure, displays the variables in a simple,
organized form. The following output shows the results:
Output 1.1 Displaying the Values in a SAS Data Set
                                Health Club Data                               1

         Id                                   Start     End
 Obs   Number   Name               Team       Weight   Weight   Loss

   1    1023    David Shaw         red          189      165     24
   2    1049    Amelia Serrano     yellow       145      124     21
   3    1219    Alan Nance         red          210      192     18
   4    1246    Ravi Sinha         yellow       194      177     17
   5    1078    Ashley McKnight    red          127      118      9
To produce a table showing mean starting weight, ending weight, and weight loss for
each team, use the TABULATE procedure.
options linesize=80 pagesize=60 pageno=1 nodate;
proc tabulate data=weight_club;
   class team;
   var StartWeight EndWeight Loss;
   table team, mean*(StartWeight EndWeight Loss);
   title 'Mean Starting Weight, Ending Weight,';
   title2 'and Weight Loss';
run;
The following output shows the results:
Output 1.2 Table of Mean Values for Each Team
                   Mean Starting Weight, Ending Weight,                        1
                             and Weight Loss

-----------------------------------------------------------
|                  |                 Mean                 |
|                  |--------------------------------------|
|                  |StartWeight | EndWeight  |    Loss    |
|------------------+------------+------------+------------|
|Team              |            |            |            |
|------------------|            |            |            |
|red               |      175.33|      158.33|       17.00|
|------------------+------------+------------+------------|
|yellow            |      169.50|      150.50|       19.00|
-----------------------------------------------------------
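The values in the table can be checked by hand. For the red team, for example, the mean starting weight is (189 + 210 + 127) / 3 = 175.33, and the mean loss is (24 + 18 + 9) / 3 = 17.00. As a cross-check in SAS, the following sketch uses the MEANS procedure, which this section does not otherwise cover, to compute the same statistics for each team. The title text is hypothetical, and the layout of the result differs from the TABULATE table.

```sas
proc means data=weight_club mean maxdec=2;
   class Team;                        /* one set of statistics for each team */
   var StartWeight EndWeight Loss;    /* the analysis variables              */
   title 'Mean Weights by Team';
run;
```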
A portion of a SAS program that begins with a PROC (procedure) statement and ends
with a RUN statement (or is ended by another PROC or DATA statement) is called a
PROC step. Both of the PROC steps that create the previous two outputs comprise the
following elements:
a PROC statement, which includes the word PROC, the name of the procedure you
want to use, and the name of the SAS data set that contains the values. (If you
omit the DATA= option and data set name, the procedure uses the SAS data set
that was most recently created in the program.)
additional statements that give SAS more information about what you want to do,
for example, the CLASS, VAR, TABLE, and TITLE statements.
a RUN statement, which indicates that the preceding group of statements is ready
to be executed.
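The following sketch illustrates two points from the description of a PROC step: a procedure that omits the DATA= option, and a PROC step that is ended by the next PROC statement instead of a RUN statement. The SORT procedure, the output data set name BY_LOSS, and the titles are assumptions introduced for this illustration. The sketch also assumes that WEIGHT_CLUB is the data set most recently created in the session.

```sas
/* No DATA= option: the procedure reads the most recently created data set. */
proc print;
   title 'Most Recently Created Data Set';
run;

/* No RUN statement after PROC SORT: the next PROC statement ends that step. */
proc sort data=weight_club out=by_loss;
   by descending Loss;
proc print data=by_loss;
   title 'Participants Ordered by Weight Loss';
run;
```

Naming the data set explicitly with DATA= is usually easier to maintain, because the program then does not depend on the order in which data sets were created.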
Output Produced by the SAS System
Traditional Output
A SAS program can produce some or all of the following kinds of output:
a SAS data set
contains data values that are stored as a table of observations and variables. It
also stores descriptive information about the data set, such as the names and
arrangement of variables, the number of observations, and the creation date of the
data set. A SAS data set can be temporary or permanent. The examples in this
section create the temporary data set WEIGHT_CLUB.
the SAS log
is a record of the SAS statements that you entered and of messages from SAS
about the execution of your program. It can appear as a file on disk, a display on
your monitor, or a hardcopy listing. The exact appearance of the SAS log varies
according to your operating environment and your site. Output 1.3 shows a typical
SAS log for the program in this section.
a report or simple listing
ranges from a simple listing of data values to a subset of a large data set or a
complex summary report that groups and summarizes data and displays statistics.
The appearance of procedure output varies according to your site and the options
that you specify in the program, but Output 1.1 and Output 1.2 illustrate
typical procedure output. You can also use a DATA step to produce a
completely customized report (see “Creating Customized Reports” on page 391).
other SAS files such as catalogs
contain information that cannot be represented as tables of data values. Examples
of items that can be stored in SAS catalogs include function key settings, letters
that are produced by SAS/FSP software, and displays that are produced by
SAS/GRAPH software.
external files or entries in other databases
can be created and updated by SAS programs. SAS/ACCESS software enables you
to create and update files that are stored in databases such as Oracle.
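Two of these kinds of output can be examined with a short sketch, offered as an illustration and not as part of the original example. The CONTENTS procedure displays the descriptive information that is stored with a data set, and a PUT statement in a DATA step writes lines of your own to the SAS log. The threshold of 20 pounds and the message text are hypothetical.

```sas
/* Descriptive information: variable names, number of observations, creation date. */
proc contents data=weight_club;
run;

/* Write lines to the SAS log without creating a new data set. */
data _null_;
   set weight_club;
   if Loss > 20 then put 'Large loss: ' Name= Loss=;
run;
```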
Output 1.3 Traditional Output: A SAS Log
NOTE: PROCEDURE PRINTTO used:
      real time           0.02 seconds
      cpu time            0.01 seconds

22
23   options pagesize=60 linesize=80 pageno=1 nodate;
24
25   data weight_club;
26      input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
27      Loss=StartWeight-EndWeight;
28      datalines;

NOTE: The data set WORK.WEIGHT_CLUB has 5 observations and 6 variables.
NOTE: DATA statement used:
      real time           0.14 seconds
      cpu time            0.07 seconds

34   ;
35
36
37   proc tabulate data=weight_club;
38      class team;
39      var StartWeight EndWeight Loss;
40      table team, mean*(StartWeight EndWeight Loss);
41      title 'Mean Starting Weight, Ending Weight,';
42      title2 'and Weight Loss';
43   run;

NOTE: There were 5 observations read from the data set WORK.WEIGHT_CLUB.
NOTE: PROCEDURE TABULATE used:
      real time           0.18 seconds
      cpu time            0.09 seconds

44   proc printto; run;
Output from the Output Delivery System (ODS)
The Output Delivery System (ODS) enables you to produce output in a variety of
formats, such as
an HTML file
a traditional SAS Listing (monospace)
a PostScript file
an RTF file (for use with Microsoft Word)
an output data set
The following figure illustrates the concept of output for SAS Version 8.
Figure 1.2 Model of the Production of ODS Output
[Figure: Data + Table Definition (formatting instructions) = Output Object. The output
object is sent to one or more ODS destinations (RTF, Output, Listing, HTML, Printer),
which produce the corresponding ODS output (RTF output, SAS data sets, Listing output,
HTML output, high-resolution printer output).]
The following definitions describe the terms in the preceding figure:
data
Each procedure that supports ODS and each DATA step produces data, which
contains the results (numbers and characters) of the step in a form similar to a
SAS data set.
table definition
The table definition is a set of instructions that describes how to format the data.
This description includes but is not limited to
the order of the columns
text and order of column headings
formats for data
font sizes and font faces
output object
ODS combines formatting instructions with the data to produce an output object.
The output object, therefore, contains both the results of the procedure or DATA
step and information about how to format the results. An output object has a
name, a label, and a path.
Note: Although many output objects include formatting instructions, not all do.
In some cases the output object consists of only the data.
ODS destinations
An ODS destination identifies a specific type of output. ODS supports a number of
destinations, which include the following:
RTF
produces output that is formatted for use with Microsoft Word.
Output
produces a SAS data set.
Listing
produces traditional SAS output (monospace format).
HTML
produces output that is formatted in Hypertext Markup Language (HTML).
You can access the output on the web with your web browser.
Printer
produces output that is formatted for a high-resolution printer. An example
of this type of output is a PostScript file.
ODS output
ODS output consists of formatted output from any of the ODS destinations.
For more information about ODS output, see Chapter 23, “Directing SAS Output and
the SAS Log,” on page 349 and Chapter 32, “Understanding and Customizing SAS
Output: The Output Delivery System (ODS),” on page 565.
For complete information about ODS, see SAS Output Delivery System: User’s Guide.
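As a minimal sketch of how destinations are used, the following program opens the HTML and RTF destinations, runs the PRINT procedure from earlier in this chapter, and then closes both destinations. The file names are hypothetical and are written to the current directory. Whether the Listing destination is also open by default depends on your SAS release and configuration.

```sas
/* Hypothetical output file names, written to the current directory. */
ods html file='weight_club.html';   /* open the HTML destination */
ods rtf  file='weight_club.rtf';    /* open the RTF destination  */

proc print data=weight_club;
   title 'Health Club Data';
run;

ods html close;   /* close the destinations so that the files are completed */
ods rtf close;
```

The same procedure output is sent to every open destination, so one PROC step can produce both files.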
Ways to Run SAS Programs
Selecting an Approach
There are several ways to run SAS programs. They differ in the speed with which
they run, the amount of computer resources that are required, and the amount of
interaction that you have with the program (that is, the kinds of changes you can make
while the program is running).
The examples in this documentation produce the same results, regardless of the way
you run the programs. However, in a few cases, the way that you run a program
determines the appearance of output. The following sections briefly introduce different
ways to run SAS programs.
SAS Windowing Environment
The SAS windowing environment enables you to interact with SAS directly through a
series of windows. You can use these windows to perform common tasks, such as
locating and organizing files, entering and editing programs, reviewing log information,
viewing procedure output, setting options, and more. If needed, you can issue operating
system commands from within this environment. Or, you can suspend the current SAS
windowing environment session, enter operating system commands, and then resume
the SAS windowing environment session at a later time.
Using the SAS windowing environment is a quick and convenient way to program in
SAS. It is especially useful for learning SAS and developing programs on small test
files. Although it uses more computer resources than other techniques, using the SAS
windowing environment can save a lot of program development time.
For more information about the SAS windowing environment, see Chapter 39, “Using
the SAS Windowing Environment,” on page 655.
SAS/ASSIST Software
One important feature of SAS is the availability of SAS/ASSIST software.
SAS/ASSIST provides a point-and-click interface that enables you to select the tasks
that you want to perform. SAS then submits the SAS statements to accomplish those
tasks. You do not need to know how to program in the SAS language in order to use
SAS/ASSIST.
SAS/ASSIST works by submitting SAS statements just like the ones shown earlier in
this section. In that way, it provides a number of features, but it does not represent the
total functionality of SAS software. If you want to perform tasks other than those that
are available in SAS/ASSIST, you need to learn to program in SAS as described in this
documentation.
Noninteractive Mode
In noninteractive mode, you prepare a file that contains SAS statements and any
system statements that are required by your operating environment, and submit the
program. The program runs immediately and occupies your current workstation
session. You cannot continue to work in that session while the program is running,*
and you usually cannot interact with the program.** The log and procedure output go
to prespecified destinations, and you usually do not see them until the program ends.
To modify the program or correct errors, you must edit and resubmit the program.
Noninteractive execution may be faster than batch execution because the computer
system runs the program immediately rather than waiting to schedule your program
among other programs.
Batch Mode
To run a program in batch mode, you prepare a file that contains SAS statements
and any system statements that are required by your operating environment, and then
you submit the program.
You can then work on another task at your workstation. While you are working, the
operating environment schedules your job for execution (along with jobs submitted by
other people) and runs it. When execution is complete, you can look at the log and the
procedure output.
The central feature of batch execution is that it is completely separate from other
activities at your workstation. You do not see the program while it is running, and you
cannot correct errors at the time they occur. The log and procedure output go to
prespecified destinations; you can look at them only after the program has finished
running. To modify the SAS program, you edit the program with the editor that is
supported by your operating environment and submit a new batch job.
When sites charge for computer resources, batch processing is a relatively
inexpensive way to execute programs. It is particularly useful for large programs or
when you need to use your workstation for other tasks while the program is executing.
However, for learning SAS or developing and testing new programs, using batch mode
might not be efficient.
*In a workstation environment, you can switch to another window and continue working.
** Limited ways of interaction are available. You can, for example, use the asterisk (*) option in a %INCLUDE statement in
your program.
Interactive Line Mode
In an interactive line-mode session, you enter one line of a SAS program at a time,
and SAS executes each DATA or PROC step automatically as soon as it recognizes the
end of the step. You usually see procedure output immediately on your display monitor.
Depending on your site’s computer system and on your workstation, you may be able to
scroll backward and forward to see different parts of your log and procedure output, or
you may lose them when they scroll off the top of your screen. There are limited
facilities for modifying programs and correcting errors.
Interactive line-mode sessions use fewer computer resources than a windowing
environment. If you use line mode, you should familiarize yourself with the
%INCLUDE, %LIST, and RUN statements in SAS Language Reference: Dictionary.
Running Programs in the SAS Windowing Environment
You can run most programs in this documentation by using any of the methods that
are described in the previous sections. This documentation uses the SAS windowing
environment (as it appears on Windows and UNIX operating environments) when it is
necessary to show programming within a SAS session. The SAS windowing
environment appears differently depending on the operating environment that you use.
For more information about the SAS windowing environment, see Chapter 39, “Using
the SAS Windowing Environment,” on page 655.
The following example gives a brief overview of a SAS session that uses the SAS
windowing environment. When you invoke SAS, the following windows appear.
Display 1.1 SAS Windowing Environment
The specific window placement, display colors, messages, and some other details vary
according to your site, your monitor, and your operating environment. The window on
the left side of the display is the SAS Explorer window, which you can use to assign and
locate SAS libraries, files, and other items. The window at the top right is the Log
window; it contains the SAS log for the session. The window at the bottom right is the
Program Editor window. This window provides an editor in which you edit your SAS
programs.
To create the program for the health and fitness club, type the statements in the
Program Editor window. You can turn line numbers on or off to facilitate program
creation. The following display shows the beginning of the program.
Display 1.2 Editing a Program in the Program Editor Window
When you fill the Program Editor window, scroll down to continue typing the
program. When you finish editing the program, submit it to SAS and view the output.
(If SAS does not create output, check the SAS log for error messages.)
The following displays show the first and second pages of the Output window.
Display 1.3 The First Page of Output in the Output Window
Display 1.4 The Second Page of Output in the Output Window
After you finish viewing the output, you can return to the Program Editor window to
begin creating a new program.
By default, the output from all submissions remains in the Output window, and all
statements that you submit remain in memory until the end of your session. You can
view the output at any time, and you can recall previously submitted statements for
editing and resubmitting. You can also clear a window of its contents.
All the commands that you use to move through the SAS windowing environment can
be executed as words or as function keys. You can also customize the SAS windowing
environment by determining which windows appear, as well as by assigning commands
to function keys. For more information about customizing the SAS windowing
environment, see Chapter 40, “Customizing the SAS Environment,” on page 693.
Review of SAS Tools
Statements
DATA SAS-data-set;
begins a DATA step and tells SAS to begin creating a SAS data set. SAS-data-set
names the data set that is being created.
%INCLUDE source(s) </<SOURCE2> <S2=length><host-options>>;
brings SAS programming statements, data lines, or both into a current SAS
program.
RUN;
tells SAS to begin executing the preceding group of SAS statements.
For more information, see Statements in SAS Language Reference: Dictionary.
Procedures
PROC procedure <DATA=SAS-data-set>;
begins a PROC step and tells SAS to invoke a particular SAS procedure to process
the SAS data set that is specified in the DATA= option. If you omit the DATA=
option, then the procedure processes the most recently created SAS data set in the
program.
For more information about using procedures, see the Base SAS Procedures Guide.
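The behavior of the DATA= option can be sketched as follows. The data set name and
values are made up for illustration. The first PROC PRINT step omits DATA= and therefore
processes the most recently created data set; the second names the data set explicitly.

```sas
data scores;
   input Name $ Test_1;
   datalines;
Bill 187
Carlos 156
;
run;

/* No DATA= option: the procedure processes the most recently created data set */
proc print;
run;

/* Same result, with the data set named explicitly */
proc print data=scores;
run;
```

Naming the data set explicitly is usually the safer habit, because the result no longer
depends on which step happened to run last.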
Learning More
Basic SAS usage
For an entry-level introduction to basic SAS programming language, see The Little
SAS Book: A Primer, Second Edition.
DATA step
For more information about how to create SAS data sets, see Chapter 2,
“Introduction to DATA Step Processing,” on page 19.
DATA step processing
For more information about DATA step processing, see Chapter 6, “Understanding
DATA Step Processing,” on page 97.
For information about how to easily use the SAS environment, see Getting Started
with the SAS System.
PART 2
Getting Your Data into Shape
Chapter 2: Introduction to DATA Step Processing
Chapter 3: Starting with Raw Data: The Basics
Chapter 4: Starting with Raw Data: Beyond the Basics
Chapter 5: Starting with SAS Data Sets
CHAPTER 2
Introduction to DATA Step Processing
Introduction to DATA Step Processing
Purpose
Prerequisites
The SAS Data Set: Your Key to the SAS System
Understanding the Function of the SAS Data Set
Understanding the Structure of the SAS Data Set
Temporary versus Permanent SAS Data Sets
Creating and Using Temporary SAS Data Sets
Creating and Using Permanent SAS Data Sets
Conventions That Are Used in This Documentation
How the DATA Step Works: A Basic Introduction
Overview of the DATA Step
During the Compile Phase
During the Execution Phase
Example of a DATA Step
The DATA Step
The Statements
The Process
Supplying Information to Create a SAS Data Set
Overview of Creating a SAS Data Set
Telling SAS How to Read the Data: Styles of Input
Reading Dates with Two-Digit and Four-Digit Year Values
Defining Variables in SAS
Indicating the Location of Your Data
Data Locations
Raw Data in the Job Stream
Data in an External File
Data in a SAS Data Set
Data in a DBMS File
Using External Files in Your SAS Job
Identifying an External File Directly
Referencing an External File with a Fileref
Review of SAS Tools
Statements
Learning More
Introduction to DATA Step Processing
Purpose
The DATA step is one of the basic building blocks of SAS programming. It creates
the data sets that are used in a SAS program’s analysis and reporting procedures.
Understanding the basic structure, functioning, and components of the DATA step is
fundamental to learning how to create your own SAS data sets. In this section, you will
learn the following:
what a SAS data set is and why it is needed
how the DATA step works
what information you have to supply to SAS so that it can construct a SAS data
set for you
Prerequisites
You should understand the concepts introduced in Chapter 1, “What Is the SAS
System?,” on page 3 before continuing.
The SAS Data Set: Your Key to the SAS System
Understanding the Function of the SAS Data Set
SAS enables you to solve problems by providing methods to analyze or to process
your data in some way. You need to first get the data into a form that SAS can
recognize and process. After the data is in that form, you can analyze it and generate
reports. The following figure shows this process in the simplest case.
Figure 2.1 From Raw Data to Final Analysis
You begin with raw data, that is, a collection of data that has not yet been processed
by SAS. You use a set of statements known as a DATA step to get your data into a SAS
data set. Then you can further process your data with additional DATA step
programming or with SAS procedures.
In its simplest form, the DATA step can be represented by the three components that
are shown in the following figure.
Figure 2.2 From Raw Data to a SAS Data Set
SAS processes input in the form of raw data and creates a SAS data set.
When you have a SAS data set, you can use it as input to other DATA steps. The
following figure shows the SAS statements that you can use to create a new SAS data
set.
Figure 2.3 Using One SAS Data Set to Create Another
input: existing SAS data set
DATA step statements:
DATA statement;
SET, MERGE, MODIFY, or UPDATE;
more statements;
output: new SAS data set
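A minimal sketch of this pattern, using the SET statement, might look like the following.
The data set names and the shortened records are assumptions made for illustration.

```sas
/* Create a small existing SAS data set */
data weight_club;
   input IdNumber StartWeight EndWeight;
   datalines;
1023 189 165
1049 145 124
;
run;

/* Use the existing data set as input to create a new one */
data weight_loss;
   set weight_club;                  /* reads one observation per iteration */
   Loss = StartWeight - EndWeight;   /* adds a new variable */
run;
```

The second DATA step needs no INPUT statement, because the SET statement reads
observations from a SAS data set rather than records of raw data.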
Understanding the Structure of the SAS Data Set
Think of a SAS data set as a rectangular structure that identifies and stores data.
When your data is in a SAS data set, you can use additional DATA steps for further
processing, or perform many types of analyses with SAS procedures.
The rectangular structure of a SAS data set consists of rows and columns in which
data values are stored. The rows in a SAS data set are called observations, and the
columns are called variables. In a raw data file, the rows are called records and the
columns are called fields. Variables contain the data values for all of the items in an
observation.
For example, the following figure shows a collection of raw data about participants in
a health and fitness club. Each record contains information about one participant.
Figure 2.4 Raw Data from the Health and Fitness Club
The following figure shows how easily the health club records can be translated into
parts of a SAS data set. Each record becomes an observation. In this case, each
observation represents a participant in the program. Each field in the record becomes a
variable. The variables represent each participant’s identification number, name, team
name, and weight at the beginning and end of a 16-week program.
Figure 2.5 How Data Fits into a SAS Data Set
     IdNumber  Name             Team    StartWeight  EndWeight
1    1023      David Shaw       red     189          165
2    1049      Amelia Serrano   yellow  145          124
3    1219      Alan Nance       red     210          192
4    1246      Ravi Sinha       yellow  194          177
5    1078      Ashley McKnight  red     127          118
6    1221      Jim Brown        yellow  220          .
In the figure, each column is a variable, each row is an observation, and each entry is
a data value. The period in the EndWeight column of observation 6 is a missing value.
In a SAS data set, every variable exists for every observation. What if you do not
have all the data for each observation? If the raw data is incomplete because a value for
the numeric variable EndWeight was not recorded for one observation, then this missing
value is represented by a period that serves as a placeholder, as shown in observation 6
in the previous figure. (Missing values for character variables are represented by
blanks. Character and numeric variables are discussed later in this section.) By coding
a value as missing, you can add an observation to the data set for which the data is
incomplete and still retain the rectangular shape necessary for a SAS data set.
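The following sketch shows one way that a program might detect such a missing value.
The data set name, the Status variable, and the shortened records are assumptions made
for illustration; MISSING is a SAS function that tests whether its argument contains a
missing value.

```sas
data club_status;
   input IdNumber StartWeight EndWeight;
   /* The period in the second record is read as a missing numeric value */
   if missing(EndWeight) then Status = 'incomplete';
   else Status = 'complete';
   datalines;
1078 127 118
1221 220 .
;
run;

proc print data=club_status;
run;
```

Both observations are kept in the data set; the second one simply carries a missing
value for EndWeight.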
Along with data values, each SAS data set contains a descriptor portion, as
illustrated in the following figure:
Figure 2.6 Parts of a SAS Data Set
The descriptor portion consists of details that SAS records about a data set, such as
the names and attributes of all the variables, the number of observations in the data
set, and the date and time that the data set was created and updated.
Operating Environment Information: Depending on your operating environment and
the engine used to write the SAS data set, SAS may store additional information about
a SAS data set in its descriptor portion. For more information, refer to the SAS
documentation for your operating environment.
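One way to look at the descriptor portion is the CONTENTS procedure. The sketch below
assumes the SASHELP.CLASS sample data set that is installed with SAS; you can substitute
the name of any SAS data set.

```sas
/* Print descriptor information: variable names and attributes,
   number of observations, creation date, and so on */
proc contents data=sashelp.class;
run;
```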
Temporary versus Permanent SAS Data Sets
Creating and Using Temporary SAS Data Sets
When you use a DATA step to create a SAS data set with a one-level name, you
normally create a temporary SAS data set, one that exists only for the duration of your
current session. SAS places this data set in a SAS data library referred to as WORK.
In most operating environments, all files that SAS stores in the WORK library are
deleted at the end of a session.
The following is an example of a DATA step that creates the temporary data set
WEIGHT_CLUB.
data weight_club;
   input IdNumber Name $ 6-20 Team $ 22-27 StartWeight EndWeight;
   datalines;
1023 David Shaw      red    189 165
1049 Amelia Serrano  yellow 145 124
1219 Alan Nance      red    210 192
1246 Ravi Sinha      yellow 194 177
1078 Ashley McKnight red    127 118
1221 Jim Brown       yellow 220   .
;
run;
The preceding program code refers to the temporary data set as WEIGHT_CLUB. SAS,
however, assigns the first-level name WORK to all temporary data sets, and
refers to the WEIGHT_CLUB data set with its two-level name, WORK.WEIGHT_CLUB.
The following output from the SAS log shows the name of the temporary data set.
Output 2.1 SAS Log: The WORK.WEIGHT_CLUB Temporary Data Set
162 data weight_club;
163 input IdNumber Name $ 6-20 Team $ 22-27 StartWeight EndWeight;
164 datalines;
NOTE: The data set WORK.WEIGHT_CLUB has 6 observations and 5 variables.
Because SAS assigns the first-level name WORK to all SAS data sets that have only
a one-level name, you do not need to use WORK. You can refer to these temporary data
sets with a one-level name, such as WEIGHT_CLUB.
To reference this SAS data set in a later DATA step or in a PROC step, you can use a
one-level name:
proc print data = weight_club;
run;
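As a small illustration, the two PROC PRINT steps below should produce the same report,
because both names refer to the same temporary data set. The shortened DATA step is an
assumption that is included only to make the sketch self-contained.

```sas
data weight_club;
   input IdNumber StartWeight EndWeight;
   datalines;
1023 189 165
1049 145 124
;
run;

/* One-level name: SAS supplies WORK as the first-level name */
proc print data=weight_club;
run;

/* Two-level name: refers to the same temporary data set */
proc print data=work.weight_club;
run;
```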
Creating and Using Permanent SAS Data Sets
To create a permanent SAS data set, you must indicate a SAS data library other than
WORK. (WORK is a reserved libref that SAS automatically assigns to a temporary SAS
data library.) Use a LIBNAME statement to assign a libref to a SAS data library on
your operating environment’s file system. The libref functions as a shorthand way of
referring to a SAS data library. Here is the form of the LIBNAME statement:
LIBNAME libref 'your-data-library';
where
libref
is a shortcut name to where your SAS files are stored. libref must be a valid SAS
name. It must begin with a letter or an underscore, and it can contain uppercase
and lowercase letters, numbers, or underscores. A libref has a maximum length of
8 characters.
your-data-library
must be the physical name for your SAS data library. The physical name is the
name that is recognized by the operating environment.
Operating Environment Information: Additional restrictions can apply to librefs and
physical file names under some operating environments. For more information, refer to
the SAS documentation for your operating environment.
The following is an example of the LIBNAME statement that is used with a DATA
step:
libname saveit 'your-data-library';       /* (1) */
data saveit.weight_club;                  /* (2) */
...more SAS statements...
;
proc print data = saveit.weight_club;     /* (3) */
run;
The following list corresponds to the numbered items:
(1) The LIBNAME statement associates the libref SAVEIT with your-data-library,
where your-data-library is your operating environment’s name for a SAS data
library.
(2) To create a new permanent SAS data set and store it in this SAS data library, you
must use the two-level name SAVEIT.WEIGHT_CLUB in the DATA statement.
(3) To reference this SAS data set in a later DATA step or in a PROC step, you must
use the two-level name SAVEIT.WEIGHT_CLUB in the PROC step.
For more information, see Chapter 33, “Understanding SAS Data Libraries,” on page
595.
Conventions That Are Used in This Documentation
Data sets that are used in examples are usually shown as temporary data sets
specified with a one-level name:
data fitness;
In rare cases in this documentation, data sets are created as permanent SAS data
sets. These data sets are specified with a two-level name, and a LIBNAME statement
precedes each DATA step in which a permanent SAS data set is created:
libname saveit 'your-data-library';
data saveit.weight_club;
How the DATA Step Works: A Basic Introduction
Overview of the DATA Step
The DATA step consists of a group of SAS statements that begins with a DATA
statement. The DATA statement begins the process of building a SAS data set and
names the data set. The statements that make up the DATA step are compiled, and the
syntax is checked. If the syntax is correct, then the statements are executed. In its
simplest form, the DATA step is a loop with an automatic output and return action.
The following figure illustrates the flow of action in a typical DATA step.
Figure 2.7 Flow of Action in a Typical DATA Step
Compile Phase
1. compiles SAS statements (includes syntax checking)
2. creates an input buffer, a program data vector, and descriptor information
Execution Phase
1. begins with a DATA statement (counts iterations)
2. sets variable values to missing in the program data vector
3. data-reading statement: is there a record to read?
NO: closes data set; goes on to the next DATA or PROC step
YES: reads an input record, executes additional executable statements, writes an
observation to the SAS data set, and returns to the beginning of the DATA step
During the Compile Phase
When you submit a DATA step for execution, SAS checks the syntax of the SAS
statements and compiles them, that is, automatically translates the statements into
machine code. SAS further processes the code, and creates the following three items:
input buffer
is a logical area in memory into which SAS reads each record of data from a raw
data file when the program executes. (When SAS reads from a SAS data set,
however, the data is written directly to the program data vector.)
program data vector
is a logical area of memory where SAS builds a data set, one observation at a
time. When a program executes, SAS reads data values from the input buffer or
creates them by executing SAS language statements. SAS assigns the values to the
appropriate variables in the program data vector. From here, SAS writes the
values to a SAS data set as a single observation.
The program data vector also contains two automatic variables, _N_ and
_ERROR_. The _N_ variable counts the number of times the DATA step begins to
iterate. The _ERROR_ variable signals the occurrence of an error caused by the
data during execution. These automatic variables are not written to the output
data set.
descriptor information
is information about each SAS data set, including data set attributes and
variable attributes. SAS creates and maintains the descriptor information.
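The automatic variables can be observed with a PUT statement, which writes to the SAS
log. The following is a minimal sketch; the data set name, the variable names, and the
deliberately invalid second record are assumptions made for illustration.

```sas
data trace;
   input Score;
   /* _N_ is not written to the data set, so copy it to an ordinary
      variable if you want to keep the iteration number */
   Iteration = _N_;
   /* Write the automatic variables and Score to the SAS log */
   put _N_= _ERROR_= Score=;
   datalines;
90
abc
85
;
run;
```

Because the second record contains characters where a number is expected, the log line
for that iteration should show _ERROR_=1 and a missing value for Score.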
During the Execution Phase
All executable statements in the DATA step are executed once for each iteration. If
your input file contains raw data, then SAS reads a record into the input buffer. SAS
then reads the values in the input buffer and assigns the values to the appropriate
variables in the program data vector. SAS also calculates values for variables created
by program statements, and writes these values to the program data vector. When the
program reaches the end of the DATA step, three actions occur by default that make
using the SAS language different from using most other programming languages:
1. SAS writes the current observation from the program data vector to the data set.
2. The program loops back to the top of the DATA step.
3. Variables in the program data vector are reset to missing values.
Note: The following exceptions apply:
Variables that you specify in a RETAIN statement are not reset to missing
values.
The automatic variables _N_ and _ERROR_ are not reset to missing.
For information about the RETAIN statement, see “Using a Value in a Later
Observation” on page 196.
If there is another record to read, then the program executes again. SAS builds the
second observation, and continues until there are no more records to read. The data set
is then closed, and SAS goes on to the next DATA or PROC step.
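The effect of the RETAIN exception can be sketched with a running total. The data set
name, variable names, and values below are assumptions made for illustration.

```sas
data running_total;
   input Loss;
   /* TotalLoss starts at 0 and is not reset to missing between iterations */
   retain TotalLoss 0;
   TotalLoss = TotalLoss + Loss;
   datalines;
24
21
18
;
run;

proc print data=running_total;
run;
```

In this sketch TotalLoss accumulates across observations (24, then 45, then 63). Without
the RETAIN statement, TotalLoss would be reset to missing at the start of each
iteration, and the sum would be missing in every observation.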
Example of a DATA Step
The DATA Step
The following simple DATA step produces a SAS data set from the data collected for
a health and fitness club. As discussed earlier, the input data contains each
participant’s identification number, name, team name, and weight at the beginning and
end of a 16-week weight program:
data weight_club;                                                   /* (1) */
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;     /* (2) */
   Loss = StartWeight - EndWeight;                                  /* (3) */
   /* (4) */
   datalines;
1023 David Shaw          red    189 165
1049 Amelia Serrano      yellow 145 124
1219 Alan Nance          red    210 192
1246 Ravi Sinha          yellow 194 177
1078 Ashley McKnight     red    127 118
1221 Jim Brown           yellow 220   .
1095 Susan Stewart       blue   135 127
1157 Rosa Gomez          green  155 141
1331 Jason Schock        blue   187 172
1067 Kanoko Nagasaka     green  135 122
1251 Richard Rose        blue   181 166
1333 Li-Hwa Lee          green  141 129
1192 Charlene Armstrong  yellow 152 139
1352 Bette Long          green  156 137
1262 Yao Chen            blue   196 180
1087 Kim Sikorski        red    148 135
1124 Adrienne Fink       green  156 142
1197 Lynne Overby        red    138 125
1133 John VanMeter       blue   180 167
1036 Becky Redding       green  135 123
1057 Margie Vanhoy       yellow 146 132
1328 Hisashi Ito         red    155 142
1243 Deanna Hicks        blue   134 122
1177 Holly Choate        red    141 130
1259 Raoul Sanchez       green  189 172
1017 Jennifer Brooks     blue   138 127
1099 Asha Garg           yellow 148 132
1329 Larry Goss          yellow 188 174
;
The Statements
The following list corresponds to the numbered items in the preceding program:
(1) The DATA statement begins the DATA step and names the data set that is being
created.
(2) The INPUT statement creates five variables, indicates how SAS reads the values
from the input buffer, and assigns the values to variables in the program data
vector.
(3) The assignment statement creates an additional variable called Loss, calculates
the value of Loss during each iteration of the DATA step, and writes the value to
the program data vector.
(4) The DATALINES statement marks the beginning of the input data. The single
semicolon marks the end of the input data and the DATA step.
Note: A DATA step that does not contain a DATALINES statement must end
with a RUN statement.
The Process
When you submit a DATA step for execution, SAS automatically compiles the DATA
step and then executes it. At compile time, SAS creates the input buffer, program data
vector, and descriptor information for the data set WEIGHT_CLUB. As the following
figure shows, the program data vector contains the variables that are named in the
INPUT statement, as well as the variable Loss. The values of the _N_ and the
_ERROR_ variables are automatically generated for every DATA step. The _N_
automatic variable represents the number of times that the DATA step has iterated.
The _ERROR_ automatic variable acts like a binary switch whose value is 0 if no errors
exist in the DATA step, or 1 if one or more errors exist. These automatic variables are
not written to the output data set.
All variable values, except _N_ and _ERROR_, are initially set to missing. Note that
missing numeric values are represented by a period, and missing character values are
represented by a blank.
Figure 2.8 Variable Values Initially Set to Missing
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
(empty)
Program Data Vector
IdNumber  Name                Team    StartWeight  EndWeight  Loss
.                                     .            .          .
The syntax is correct, so the DATA step executes. As the following figure illustrates,
the INPUT statement causes SAS to read the first record of raw data into the input
buffer. Then, according to the instructions in the INPUT statement, SAS reads the data
values in the input buffer and assigns them to variables in the program data vector.
Figure 2.9 Values Assigned to Variables by the INPUT Statement
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1023 David Shaw          red    189 165
Program Data Vector
IdNumber  Name                Team    StartWeight  EndWeight  Loss
1023      David Shaw          red     189          165        .
When SAS assigns values to all variables that are listed in the INPUT statement,
SAS executes the next statement in the program:
Loss = StartWeight - EndWeight;
This assignment statement calculates the value for the variable Loss and writes that
value to the program data vector, as the following figure shows.
Figure 2.10 Value Computed and Assigned to the Variable Loss
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1023 David Shaw          red    189 165
Program Data Vector
IdNumber  Name                Team    StartWeight  EndWeight  Loss
1023      David Shaw          red     189          165        24
SAS has now reached the end of the DATA step, and the program automatically does
the following:
writes the first observation to the data set
loops back to the top of the DATA step to begin the next iteration
increments the _N_ automatic variable by 1
resets the _ERROR_ automatic variable to 0
except for _N_ and _ERROR_, sets variable values in the program data vector to
missing values, as the following figure shows
Figure 2.11 Values Set to Missing
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1023 David Shaw          red    189 165
Program Data Vector
IdNumber  Name                Team    StartWeight  EndWeight  Loss
.                                     .            .          .
Execution continues. The INPUT statement looks for another record to read. If there
are no more records, then SAS closes the data set and the system goes on to the next
DATA or PROC step. In this example, however, more records exist and the INPUT
statement reads the second record into the input buffer, as the following figure shows.
Figure 2.12 Second Record Is Read into the Input Buffer
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1049 Amelia Serrano      yellow 145 124
Program Data Vector
IdNumber  Name                Team    StartWeight  EndWeight  Loss
.                                     .            .          .
The following figure shows that SAS assigned values to the variables in the program
data vector and calculated the value for the variable Loss, building the second
observation just as it did the first one.
Figure 2.13 Results of Second Iteration of the DATA Step
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7
1049 Amelia Serrano      yellow 145 124
Program Data Vector
IdNumber  Name                Team    StartWeight  EndWeight  Loss
1049      Amelia Serrano      yellow  145          124        21
This entire process continues until SAS detects the end of the file. The DATA step
iterates as many times as there are records to read. Then SAS closes the data set
WEIGHT_CLUB, and SAS looks for the beginning of the next DATA or PROC step.
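One way to watch this cycle in your own session is to write the program data vector to
the SAS log at the top and at the bottom of the DATA step. The following is a sketch
only; the data set name, the message text, and the shortened records are assumptions
made for illustration.

```sas
data demo;
   /* At the top of each iteration the variables have been reset to missing */
   put 'Top of step:    ' _N_= StartWeight= EndWeight= Loss=;
   input StartWeight EndWeight;
   Loss = StartWeight - EndWeight;
   /* At the bottom the variables hold the values for the current observation */
   put 'Bottom of step: ' _N_= StartWeight= EndWeight= Loss=;
   datalines;
189 165
145 124
;
run;
```

With two records, the log should contain a third "Top of step" line, because the DATA
step begins a third iteration before the INPUT statement finds that there are no more
records to read and the step stops.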
Now that SAS has transformed the collected data from raw data into a SAS data set,
it can be processed by a SAS procedure. The following output, produced with the
PRINT procedure, shows the data set that has just been created.
proc print data=weight_club;
title 'Fitness Center Weight Club';
run;
Output 2.2 PROC PRINT Output of the WEIGHT_CLUB Data Set
                  Fitness Center Weight Club                  1

         Id                               Start     End
Obs  Number  Name                Team    Weight  Weight  Loss
  1    1023  David Shaw          red        189     165    24
  2    1049  Amelia Serrano      yellow     145     124    21
  3    1219  Alan Nance          red        210     192    18
  4    1246  Ravi Sinha          yellow     194     177    17
  5    1078  Ashley McKnight     red        127     118     9
  6    1221  Jim Brown           yellow     220       .     .
  7    1095  Susan Stewart       blue       135     127     8
  8    1157  Rosa Gomez          green      155     141    14
  9    1331  Jason Schock        blue       187     172    15
 10    1067  Kanoko Nagasaka     green      135     122    13
 11    1251  Richard Rose        blue       181     166    15
 12    1333  Li-Hwa Lee          green      141     129    12
 13    1192  Charlene Armstrong  yellow     152     139    13
 14    1352  Bette Long          green      156     137    19
 15    1262  Yao Chen            blue       196     180    16
 16    1087  Kim Sikorski        red        148     135    13
 17    1124  Adrienne Fink       green      156     142    14
 18    1197  Lynne Overby        red        138     125    13
 19    1133  John VanMeter       blue       180     167    13
 20    1036  Becky Redding       green      135     123    12
 21    1057  Margie Vanhoy       yellow     146     132    14
 22    1328  Hisashi Ito         red        155     142    13
 23    1243  Deanna Hicks        blue       134     122    12
 24    1177  Holly Choate        red        141     130    11
 25    1259  Raoul Sanchez       green      189     172    17
 26    1017  Jennifer Brooks     blue       138     127    11
 27    1099  Asha Garg           yellow     148     132    16
 28    1329  Larry Goss          yellow     188     174    14
Supplying Information to Create a SAS Data Set
Overview of Creating a SAS Data Set
You supply SAS with specific information for reading raw data so that you can create
a SAS data set from the raw data. You can use the data set for further processing, data
analysis, or report writing. To process raw data in a DATA step, you must
use an INPUT statement to tell SAS how to read the data
define the variables and indicate whether they are character or numeric
specify the location of the raw data
Telling SAS How to Read the Data: Styles of Input
SAS provides many tools for reading raw data into a SAS data set. These tools
include three basic input styles as well as various format modifiers and pointer controls.
List input is used when each field in the raw data is separated by at least one space
and does not contain embedded spaces. The INPUT statement simply contains a list of
the variable names. List input, however, places numerous restrictions on your data.
These restrictions are discussed in detail in Chapter 3, “Starting with Raw Data: The
Basics,” on page 43. The following example shows list input. Note that there is at least
one blank space between each data value.
data scores;
input Name $ Test_1 Test_2 Test_3;
datalines;
Bill 187 97 103
Carlos 156 76 74
Monique 99 102 129
;
Column input enables you to read the same data if it is located in fixed columns:
data scores;
input Name $ 1-7 Test_1 9-11 Test_2 13-15 Test_3 17-19;
datalines;
Bill    187  97 103
Carlos  156  76  74
Monique  99 102 129
;
Formatted input enables you to supply special instructions in the INPUT statement
for reading data. For example, to read numeric data that contains special symbols, you
need to supply SAS with special instructions so that it can read the data correctly.
These instructions, called informats, are discussed in more detail in Chapter 3,
“Starting with Raw Data: The Basics,” on page 43. In the INPUT statement, you can
specify an informat to be used to read a data value, as in the example that follows:
data total_sales;
input Date mmddyy10. +2 Amount comma5.;
datalines;
09/05/2000  1,382
10/19/2000  1,235
11/30/2000  2,391
;
In this example, the MMDDYY10. informat for the variable Date tells SAS to
interpret the raw data as a month, day, and year, ignoring the slashes. The COMMA5.
informat for the variable Amount tells SAS to interpret the raw data as a number,
ignoring the comma. The +2 is a pointer control that tells SAS where to look for the
next item. For more information about pointer controls, see Chapter 3, “Starting with
Raw Data: The Basics,” on page 43.
SAS also enables you to mix these styles of input as required by the way values are
arranged in the data records. Chapter 3, “Starting with Raw Data: The Basics,” on
page 43 discusses in detail input styles (including their rules and restrictions), as well
as additional data-reading tools.
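A minimal sketch of mixed input follows. It reads one variable with column input, one with formatted input, and one with list input in a single INPUT statement. The data set name, variable names, and data values are hypothetical, and the FORMAT statement is included only so that the date prints in a readable form.

```sas
data sales_mixed;   /* hypothetical data set and variables */
   /* Rep: column input (columns 1-8)                       */
   /* Date: formatted input (+1 skips column 9, then the    */
   /*       MMDDYY10. informat reads columns 10-19)         */
   /* Units: list input (next nonblank value on the record) */
   input Rep $ 1-8 +1 Date mmddyy10. Units;
   format Date mmddyy10.;
   datalines;
Garcia   09/05/2000 12
Lee      10/19/2000 7
;

proc print data=sales_mixed;
run;
```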
Reading Dates with Two-Digit and Four-Digit Year Values
In the previous example, the year values in the dates in the raw data had four digits:
09/05/2000
10/19/2000
11/30/2000
However, SAS is also capable of reading two-digit year values (for example, 09/05/99).
In that case, use the MMDDYY8. informat for the variable Date.
How does SAS know to which century a two-digit year belongs? SAS uses the value
of the YEARCUTOFF= SAS system option. In Version 7 and later of SAS, the default
value of the YEARCUTOFF= option is 1920. This means that two-digit years from 00 to
19 are assumed to be in the twenty-first century, that is, 2000 to 2019. Two-digit years
from 20 to 99 are assumed to be in the twentieth century, that is, 1920 to 1999.
Note: The YEARCUTOFF= option and the default setting may be different at your
site.
To avoid confusion, you should use four-digit year values in your raw data wherever
possible. For more information, see the Dates, Times, and Intervals section of SAS
Language Reference: Concepts.
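The following sketch shows how the YEARCUTOFF= value affects two-digit years. It sets the option explicitly to 1920 so that the result does not depend on the site default. The data set name and date values are hypothetical.

```sas
options yearcutoff=1920;   /* two-digit years fall in 1920-2019 */

data two_digit_years;      /* hypothetical data set name */
   input Date mmddyy8.;
   format Date mmddyy10.;  /* display the year with four digits */
   datalines;
09/05/99
01/15/05
;

proc print data=two_digit_years;
run;
```

With this setting, the year 99 is interpreted as 1999 and the year 05 as 2005.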
Defining Variables in SAS
So far you have seen that the INPUT statement instructs SAS on how to read raw
data lines. At the same time that the INPUT statement provides instructions for
reading data, it defines the variables for the data set that come from the raw data. By
assuming default values for variable attributes, the INPUT statement does much of the
work for you. Later in this documentation, you will learn other statements that enable
you to define variables and assign attributes to variables, but this section and Chapter
3, “Starting with Raw Data: The Basics,” on page 43 concentrate on the use of the
INPUT statement.
SAS variables can have these attributes:
name
type
length
informat
format
label
position in observation
index type
See the SAS Variables section of SAS Language Reference: Concepts for more
information about variable attributes.
In an INPUT statement, you must supply each variable name. Unless you indicate
otherwise (for example, with a dollar sign or a character informat), the type is
assumed to be numeric, and the length is assumed to be eight bytes. The following
INPUT statement creates four numeric variables, each with a length of eight bytes,
without requiring you to specify either type or length. The table summarizes this
information.
input IdNumber Test_1 Test_2 Test_3;
Variable name Type Length
IdNumber numeric 8
Test_1 numeric 8
Test_2 numeric 8
Test_3 numeric 8
The values of numeric variables can contain only numbers. To store values that contain
alphabetic or special characters, you must create a character variable. By following a
variable name in an INPUT statement with a dollar sign ($), you create a character
variable. The default length of a character variable is also eight bytes. The following
statement creates a data set that contains one character variable and four numeric
variables, all with a default length of eight bytes. The table summarizes this
information.
input IdNumber Name $ Test_1 Test_2 Test_3;
Variable name Type Length
IdNumber numeric 8
Name character 8
Test_1 numeric 8
Test_2 numeric 8
Test_3 numeric 8
In addition to specifying the types of variables in the INPUT statement, you can also
specify the lengths of character variables. Character variables can be up to 32,767 bytes
in length. To specify the length of a character variable in an INPUT statement, you
need to supply an informat or use column numbers. For example, following a variable
name in the INPUT statement with the informat $20., or with column specifications
such as 1-20, creates a character variable that is 20 bytes long.
Note that the length of numeric variables is not affected by informats or column
specifications in an INPUT statement. See SAS Language Reference: Concepts for more
information about numeric variables and lengths.
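One way to check these length rules is to create a small data set and inspect it with the CONTENTS procedure. In this sketch the data set name NAME_LENGTHS and the variable FullName are hypothetical. The column range 6-25 should give FullName a length of 20 bytes, while the numeric variables should keep the default length of 8 bytes regardless of their column ranges.

```sas
data name_lengths;   /* hypothetical data set name */
   input Id 1-4 FullName $ 6-25 Score 27-29;
   datalines;
1023 David Shaw           189
1049 Amelia Serrano       145
;

/* Lists each variable with its type and length */
proc contents data=name_lengths;
run;
```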
Two other variable attributes, format and label, affect how variable values and
names are represented when they are printed or displayed. These attributes are
assigned with different statements that you will learn about later.
Indicating the Location of Your Data
Data Locations
To create a SAS data set, you can read data from one of four locations:
raw data in the data (job) stream, that is, following a DATALINES statement
raw data in a file that you specify with an INFILE statement
data from an existing SAS data set
data in a database management system (DBMS) file
Raw Data in the Job Stream
You can place data directly in the job stream with the programming statements that
make up the DATA step. The DATALINES statement tells SAS that raw data follows.
The single semicolon that follows the last line of data marks the end of the data. The
DATALINES statement and data lines must occur last in the DATA step statements:
data weight_club;
input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
Loss = StartWeight - EndWeight;
datalines;
1023 David Shaw         red      189 165
1049 Amelia Serrano     yellow   145 124
1219 Alan Nance         red      210 192
1246 Ravi Sinha         yellow   194 177
1078 Ashley McKnight    red      127 118
;
Data in an External File
If your raw data is already stored in a file, then you do not have to bring that file into
the data stream. Use an INFILE statement to specify the file containing the raw data.
(See “Using External Files in Your SAS Job” on page 38 for details about INFILE, FILE,
and FILENAME statements.) The statements in the code that follows demonstrate the
same example, this time showing that the raw data is stored in an external file:
data weight_club;
infile 'your-input-file';
input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26
EndWeight 28-30;
Loss=StartWeight-EndWeight;
run;
Data in a SAS Data Set
You can also use data that is already stored in a SAS data set as input to a new data
set. To read data from an existing SAS data set, you must specify the existing data set’s
name in one of these statements:
SET statement
MERGE statement
MODIFY statement
UPDATE statement
For example, the statements that follow create a new SAS data set named RED that
adds the variable LossPercent:
data red;
set weight_club;
LossPercent = Loss / StartWeight * 100;
run;
The SET statement indicates that the input data is already in the structure of a SAS
data set and gives the name of the SAS data set to be read. In this example, the SET
statement tells SAS to read the WEIGHT_CLUB data set in the WORK library.
Data in a DBMS File
If you have data that is stored in another vendor’s database management system
(DBMS) files, then you can use SAS/ACCESS software to bring this data into a SAS
data set. SAS/ACCESS software enables you to assign a libref to a library containing
the DBMS file. In this example, the LIBNAME statement declares a libref that points to
a library containing Oracle data. SAS then reads data from an Oracle file into a SAS
data set:
libname dblib oracle user=scott password=tiger path='hrdept_002';
data employees;
set dblib.employees;
run;
See SAS/ACCESS for Relational Databases: Reference for more information about
using SAS/ACCESS software to access DBMS files.
Using External Files in Your SAS Job
Your SAS programs often need to read raw data from a file, or write data or reports
to a file that is not a SAS data set. To use a file that is not a SAS data set in a SAS
program, you need to tell SAS where to find it. You can do the following:
Identify the file directly in the INFILE, FILE, or other SAS statement that uses
the file.
Set up a fileref for the file by using the FILENAME statement, and then use the
fileref in the INFILE, FILE, or other SAS statement.
Use operating environment commands to set up a fileref, and then use the fileref
in the INFILE, FILE, or other SAS statement.
The first two methods are described here. The third method depends on the
operating environment that you use.
Operating Environment Information: For more information, refer to the SAS
documentation for your operating environment.
Identifying an External File Directly
The simplest method for referring to an external file is to use the name of the file in
the INFILE, FILE, or other SAS statement that needs to refer to the file. For example,
if your raw data is stored in a file in your operating environment, and you want to read
the data using a SAS DATA step, you can tell SAS where to find the raw data by
putting the name of the file in the INFILE statement:
data temp;
infile 'your-input-file';
input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26
EndWeight 28-30;
run;
The INFILE statement for this example may appear as follows for various operating
environments:
Table 2.1 Example INFILE Statements for Various Operating Environments
Operating environment   INFILE statement example
z/OS                    infile 'fitness.weight.rawdata(club1)';
CMS                     infile 'club1 weight a';
OpenVMS                 infile '[fitness.weight.rawdata]club1.dat';
UNIX                    infile '/usr/local/fitness/club1.dat';
Windows                 infile 'c:\fitness\club1.dat';
Operating Environment Information: For more information, refer to the SAS
documentation for your operating environment.
Referencing an External File with a Fileref
An alternate method for referencing an external file is to use the FILENAME
statement to set up a fileref for a file. The fileref functions as a shorthand way of
referring to an external file. You then use the fileref in later SAS statements that
reference the file, such as the FILE or INFILE statement. The advantage of this
method is that if the program contains many references to the same external file and
the external filename changes, then the program needs to be modified in only one place,
rather than in every place where the file is referenced.
Here is the form of the FILENAME statement:
FILENAME fileref 'your-input-or-output-file';
The fileref must be a valid SAS name, that is, it must
begin with a letter or an underscore
contain only letters, numbers, or underscores
have no more than 8 characters.
Operating Environment Information: Additional restrictions may apply under some
operating environments. For more information, refer to the SAS documentation for your
operating environment.
For example, you can reference the raw data that is stored in a file in your operating
environment by first using the FILENAME statement to specify the name of the file
and its fileref, and then using the INFILE statement with the same fileref to reference
the file.
filename fitclub 'your-input-file';
data temp;
infile fitclub;
input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;
In this example, the INFILE statement stays the same for all operating
environments. The FILENAME statement, however, can appear differently in different
operating environments, as the following table shows:
Table 2.2 Example FILENAME Statements for Various Operating Environments
Operating environment   FILENAME statement example
z/OS                    filename fitclub 'fitness.weight.rawdata(club1)';
CMS                     filename fitclub 'club1 weight a';
OpenVMS                 filename fitclub '[fitness.weight.rawdata]club1.dat';
UNIX                    filename fitclub '/usr/local/fitness/club1.dat';
Windows                 filename fitclub 'c:\fitness\club1.dat';
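A fileref can identify an output file as well as an input file. The following sketch assumes that the WEIGHT_CLUB data set from the earlier examples already exists. The fileref REPORT and the placeholder file name are hypothetical. The FILE statement directs the lines written by the PUT statement to the external file.

```sas
filename report 'your-output-file';   /* hypothetical fileref and placeholder name */

data _null_;            /* no output data set is needed */
   set weight_club;     /* assumes WEIGHT_CLUB was created earlier */
   file report;         /* PUT now writes to the external file */
   put IdNumber Name StartWeight EndWeight;
run;
```

As with the INFILE examples, only the FILENAME statement would need to change if the output file were renamed or moved.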
If you need to use several files or members from the same directory, partitioned data
set (PDS), or MACLIB, then you can use the FILENAME statement to create a fileref
that identifies the name of the directory, PDS, or MACLIB. Then you can use the fileref
in the INFILE statement and enclose the name of the file, PDS member, or MACLIB
member in parentheses immediately after the fileref, as in this example:
filename fitclub 'directory-or-PDS-or-MACLIB';
data temp;
infile fitclub(club1);
input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;
data temp2;
infile fitclub(club2);
input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;
In this case, the INFILE statements stay the same for all operating environments.
The FILENAME statement, however, can appear differently for different operating
environments, as the following table shows:
Table 2.3 Referencing Directories, PDSs, and MACLIBs in Various Operating Environments
Operating environment   FILENAME statement example
z/OS                    filename fitclub 'fitness.weight.rawdata';
CMS                     filename fitclub 'use1 maclib'; (see note 1)
OpenVMS                 filename fitclub '[fitness.weight.rawdata]';
UNIX                    filename fitclub '/usr/local/fitness';
Windows                 filename fitclub 'c:\fitness';
Note 1: Under CMS, the external file must be a CMS MACLIB, a CMS TXTLIB, or a z/OS PDS.
Review of SAS Tools
Statements
DATA <libref.>SAS-data-set;
tells SAS to begin creating a SAS data set. If you omit the libref, then SAS creates
a temporary SAS data set. (SAS attaches the libref WORK for its internal
processing.) If you give a previously defined libref as the first level of the name,
then SAS stores the data set permanently in the library referenced by the libref. A
SAS program or a portion of a program that begins with a DATA statement and
ends with a RUN statement, another DATA statement, or a PROC statement is
called a DATA step.
FILENAME fileref 'your-input-or-output-file';
associates a fileref with an external file. Enclose the name of the external file in
quotation marks.
INFILE fileref|'your-input-file';
identifies an external file to be read by an INPUT statement. Specify a fileref that
has been assigned with a FILENAME statement or with an appropriate operating
environment command, or specify the actual name of the external file.
INPUT variable <$>;
reads raw data using list input. At least one blank must occur between any two
data values. The $ denotes a character variable.
INPUT variable <$> column-range;
reads raw data that is aligned in columns. The $ denotes a character variable.
INPUT variable informat;
reads raw data using formatted input. An informat supplies special instructions
for reading the data.
LIBNAME libref 'your-SAS-data-library';
associates a libref with a SAS data library. Enclose the name of the library in
quotation marks. SAS locates a permanent SAS data set by matching the libref in
a two-level SAS data set name with the library associated with that libref in a
LIBNAME statement. The rules for creating a SAS data library depend on your
operating environment.
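The following sketch shows how a libref in a two-level name makes a data set permanent. The libref FITNESS and the placeholder library name are hypothetical, and the step assumes that the temporary data set WEIGHT_CLUB from the earlier examples exists in the current session.

```sas
libname fitness 'your-SAS-data-library';   /* hypothetical libref and placeholder name */

/* The two-level name stores the data set in the FITNESS library */
data fitness.weight_club;
   set weight_club;   /* one-level name: the temporary data set */
run;
```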
Learning More
ATTRIBUTE statement
For information about how the ATTRIBUTE statement enables you to assign
attributes to variables, see SAS Language Reference: Dictionary.
DBMS access
This documentation explains how to use SAS for reading files of raw data and SAS
data sets and writing to SAS data sets. However, SAS documentation for
SAS/ACCESS provides complete information about using SAS to read and write
information stored in several types of database management system (DBMS) files.
Informats
For a discussion about informats that you use with dates, see Chapter 14,
“Working with Dates in the SAS System,” on page 211.
Length of variables
For more information about how a variable’s length affects the values you can
store in the variable, see Chapter 7, “Working with Numeric Variables,” on page
107 and Chapter 8, “Working with Character Variables,” on page 119.
LINESIZE= option
For information about how to use the LINESIZE= option in an INFILE statement
to limit how much of each data line the INPUT statement reads, see SAS
Language Reference: Dictionary.
MERGE, MODIFY, or UPDATE statements
In addition to the SET statement, you can read a SAS data set with the MERGE,
MODIFY, or UPDATE statements. For more information, see Chapter 18,
“Merging SAS Data Sets,” on page 269 and Chapter 19, “Updating SAS Data
Sets,” on page 293.
SET statement
For information about the SET statement, see Chapter 5, “Starting with SAS Data
Sets,” on page 81.
USER= SAS system option
You can specify the USER= SAS system option to use one-level names to point to
permanent SAS files. (If you specify USER=WORK, then SAS assumes that files
referenced with one-level names refer to temporary work files.) See the SAS
System Options section in SAS Language Reference: Dictionary for details.
CHAPTER 3
Starting with Raw Data: The Basics
Introduction to Raw Data 44
Purpose 44
Prerequisites 44
Examine the Structure of the Raw Data: Factors to Consider 44
Reading Unaligned Data 44
Understanding List Input 44
Program: Basic List Input 45
Program: When the Data Is Delimited by Characters, Not Blanks 46
List Input: Points to Remember 46
Reading Data That Is Aligned in Columns 47
Understanding Column Input 47
Program: Reading Data Aligned in Columns 47
Understanding Some Advantages of Column Input over Simple List Input 48
Reading Embedded Blanks and Creating Longer Variables 48
Program: Skipping Fields When Reading Data Records 49
Column Input: Points to Remember 50
Reading Data That Requires Special Instructions 50
Understanding Formatted Input 50
Program: Reading Data That Requires Special Instructions 50
Understanding How to Control the Position of the Pointer 52
Formatted Input: Points to Remember 53
Reading Unaligned Data with More Flexibility 53
Understanding How to Make List Input More Flexible 53
Creating Longer Variables and Reading Numeric Data That Contains Special Characters 53
Reading Character Data That Contains Embedded Blanks 54
Mixing Styles of Input 55
An Example of Mixed Input 55
Understanding the Effect of Input Style on Pointer Location 56
Why You Can Get into Trouble by Mixing Input Styles 56
Pointer Location with Column and Formatted Input 56
Pointer Location with List Input 57
Review of SAS Tools 58
Statements 58
Column-Pointer Controls 59
Learning More 59
Introduction to Raw Data
Purpose
To create a SAS data set from raw data, you must examine the data records first to
determine how the data values that you want to read are arranged. Then you can look
at the styles of reading input that are available in the INPUT statement. SAS provides
three basic input styles:
list
column
formatted
You can use these styles individually, in combination with each other, or in conjunction
with various line-hold specifiers, line-pointer controls, and column-pointer controls.
This section demonstrates various ways of using the INPUT statement to turn your raw
data into SAS data sets.
You can enter the data directly in a DATA step or use an existing file of raw data. If
your data is machine readable, then you need to learn how to use the tools that
enable SAS to read it. If your data is not yet entered, then you can choose the input
style that enables you to enter the data most easily.
Prerequisites
You should understand the concepts presented in Chapter 1, “What Is the SAS
System?,” on page 3 and Chapter 2, “Introduction to DATA Step Processing,” on page 19
before continuing.
Examine the Structure of the Raw Data: Factors to Consider
Before you can select the appropriate style of input, examine the structure of the raw
data that you want to read. Consider some of the following factors:
how the data is arranged in the input records (For example, are data fields aligned
in columns or unaligned? Are they separated by blanks or by other characters?)
whether character values contain embedded blanks
whether numeric values contain non-numeric characters such as commas
whether the data contains time or date values
whether each input record contains data for more than one observation
whether data for a single observation is spread over multiple input records
Reading Unaligned Data
Understanding List Input
The simplest form of the INPUT statement uses list input. List input is used to read
data values that are separated by a delimiter character (by default, a blank space). With
list input, SAS reads a data value until it encounters a blank space. SAS then assumes
that the value has ended and assigns the data to the appropriate variable in the program
data vector. SAS continues to scan the record until it reaches a nonblank character again,
and then reads the next data value until it encounters another blank space or the end of
the input record.
Program: Basic List Input
This program uses the health and fitness club data from Chapter 2, “Introduction to
DATA Step Processing,” on page 19 to illustrate a DATA step that uses list input in an
INPUT statement.
data club1;
input IdNumber Name $ Team $ StartWeight EndWeight; /* (3) */
/* (1) */ datalines;
1023 David red 189 165
1049 Amelia yellow 145 124
1219 Alan red 210 192
1246 Ravi yellow 194 177
1078 Ashley red 127 118
1221 Jim yellow 220 .
;
proc print data=club1;
title 'Weight of Club Members';
run;
The following list corresponds to the numbered comments in the preceding program.
Item (2) refers to the data lines, where any added text would be read as data:
(1) The DATALINES statement marks the beginning of the data lines. The semicolon
that follows the data lines marks the end of the data lines and the end of the
DATA step.
(2) Each data value in the raw data record is separated from the next by at least one
blank space. The last record contains a missing value, represented by a period, for
the value of EndWeight.
(3) The variable names in the INPUT statement are specified in exactly the same
order as the fields in the raw data records.
The output that follows shows the resulting data set. The PROC PRINT statement
that follows the DATA step produces this listing.
Output 3.1 Data Set Created with List Input
Weight of Club Members 1
Id Start End
Obs Number Name Team Weight Weight
1 1023 David red 189 165
2 1049 Amelia yellow 145 124
3 1219 Alan red 210 192
4 1246 Ravi yellow 194 177
5 1078 Ashley red 127 118
6 1221 Jim yellow 220 .
Program: When the Data Is Delimited by Characters, Not Blanks
This program also uses the health and fitness club data but notice that here the data
is delimited by a comma instead of a blank space, the default delimiter.
options pagesize=60 linesize=80 pageno=1 nodate;
data club1;
infile datalines dlm=','; /* (2), (3) */
input IdNumber Name $ Team $ StartWeight EndWeight;
datalines;
1023,David,red,189,165
1049,Amelia,yellow,145,124
1219,Alan,red,210,192
1246,Ravi,yellow,194,177
1078,Ashley,red,127,118
1221,Jim,yellow,220,.
;
proc print data=club1;
title 'Weight of Club Members';
run;
The following list corresponds to the numbered items in the preceding program.
Item (1) refers to the data lines, where any added text would be read as data:
(1) These data values are separated by commas instead of blanks.
(2) List input, by default, scans the input records, looking for blank spaces to delimit
each data value. The DLM= option enables list input to recognize a character, here
a comma, as the delimiter.
(3) This example required the DLM= option, which is available only in the INFILE
statement. Usually this statement is used only when the input data resides in an
external file. The DATALINES specification, however, enables you to take
advantage of INFILE statement options when you are reading data records from
the job stream.
Output 3.2 Reading Data Delimited by Commas
Weight of Club Members 1
Id Start End
Obs Number Name Team Weight Weight
1 1023 David red 189 165
2 1049 Amelia yellow 145 124
3 1219 Alan red 210 192
4 1246 Ravi yellow 194 177
5 1078 Ashley red 127 118
6 1221 Jim yellow 220 .
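When the comma-delimited records reside in an external file instead of the job stream, the same DLM= option can be placed on an INFILE statement that names the file. This is a minimal sketch: the file name is the same placeholder used elsewhere in this documentation, and the file is assumed to contain records laid out like the data lines above.

```sas
data club1;
   /* 'your-input-file' is a placeholder for a comma-delimited file */
   infile 'your-input-file' dlm=',';
   input IdNumber Name $ Team $ StartWeight EndWeight;
run;
```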
List Input: Points to Remember
The points to remember when you use list input are:
Use list input when each field is separated by at least one blank space or delimiter.
Specify the fields in the order in which they appear in the records of raw data.
Represent missing values by a placeholder such as a period. (Under the default
behavior, a blank field causes the variable names and values to become
mismatched.)
Character values cannot contain embedded blanks.
The default length of character variables is eight bytes. SAS truncates a longer
value when it writes the value to the program data vector. (To read a character
variable that contains more than eight characters with list input, use a LENGTH
statement. See “Defining Enough Storage Space for Variables” on page 103.)
Data must be in standard character or numeric format (that is, it can be read
without an informat).
Note: List input requires the fewest specifications in the INPUT statement.
However, the restrictions that are placed on the data may require that you learn to use
other styles of input to read your data. For example, column input, which is discussed
in the next section, is less restrictive. This section has introduced only simple list input.
See “Understanding How to Make List Input More Flexible” on page 53 to learn about
modified list input.
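The eight-byte default for character variables can be illustrated with a LENGTH statement, as mentioned in the points above. In this sketch the data set name CLUB_LONG and the name Christopher are hypothetical. Without the LENGTH statement, list input would store only the first eight characters of that name.

```sas
data club_long;        /* hypothetical data set name */
   length Name $ 12;   /* reserve 12 bytes before INPUT defines Name */
   input IdNumber Name $ Team $;
   datalines;
1023 Christopher red
1049 Amelia yellow
;

proc print data=club_long;
run;
```

Because the LENGTH statement appears before the INPUT statement, Name becomes the first variable in the data set. This is a side effect of the sketch, not a requirement of list input.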
Reading Data That Is Aligned in Columns
Understanding Column Input
With column input, data values occupy the same fields within each data record.
When you use column input in the INPUT statement, list the variable names and
specify column positions that identify the location of the corresponding data fields. You
can use column input when your raw data is in fixed columns and does not require the
use of informats to be read.
Program: Reading Data Aligned in Columns
The following program also uses the health and fitness club data, but now two more
data values are missing. The data is aligned in columns and SAS reads the data with
column input:
data club1;
input IdNumber 1-4 Name $ 6-11 Team $ 13-18 StartWeight 20-22
EndWeight 24-26;
datalines;
1023 David  red    189 165
1049 Amelia yellow 145
1219 Alan   red    210 192
1246 Ravi   yellow     177
1078 Ashley red    127 118
1221 Jim    yellow 220
;
proc print data=club1;
title 'Weight Club Members';
run;
The specification that follows each variable name indicates the beginning and ending
columns in which the variable value will be found. Note that with column input you are
not required to indicate missing values with a placeholder such as a period.
The following output shows the resulting data set. Missing numeric values occur
three times in the data set, and are indicated by periods.
Output 3.3 Data Set Created with Column Input
Weight Club Members 1
Id Start End
Obs Number Name Team Weight Weight
1 1023 David red 189 165
2 1049 Amelia yellow 145 .
3 1219 Alan red 210 192
4 1246 Ravi yellow . 177
5 1078 Ashley red 127 118
6 1221 Jim yellow 220 .
Understanding Some Advantages of Column Input over Simple List
Input
Here are several advantages of using column input:
With column input, character variables can contain embedded blanks.
Column input also enables the creation of variables that are longer than eight
bytes. In the preceding example, the variable Name in the data set CLUB1
contains only the members’ first names. By using column input, you can read the
first and last names as a single value. These differences between input styles are
possible for two reasons:
Column input uses the columns that you specify to determine the length of
character variables.
Column input, unlike list input, reads data until it reaches the last specified
column, not until it reaches a blank space.
Column input enables you to skip some data fields when reading records of raw
data. It also enables you to read the data fields in any order and reread some
fields or parts of fields.
Reading Embedded Blanks and Creating Longer Variables
This DATA step uses column input to create a new data set named CLUB2. The
program still uses the health and fitness club weight data. However, the data has been
modified to include members’ first and last names. Now the second data field in each
record of raw data contains an embedded blank and is 18 bytes long.
data club2;
input IdNumber 1-4 Name $ 6-23 Team $ 25-30 StartWeight 32-34
EndWeight 36-38;
datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red    210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red    127 118
1221 Jim Brown          yellow 220
;
proc print data=club2;
title 'Weight Club Members';
run;
The following output shows the resulting data set.
Output 3.4 Data Set Created with Column Input (Embedded Blanks)
Weight Club Members 1
Id Start End
Obs Number Name Team Weight Weight
1 1023 David Shaw red 189 165
2 1049 Amelia Serrano yellow 145 124
3 1219 Alan Nance red 210 192
4 1246 Ravi Sinha yellow 194 177
5 1078 Ashley McKnight red 127 118
6 1221 Jim Brown yellow 220 .
Program: Skipping Fields When Reading Data Records
Column input also enables you to skip over fields or to read the fields in any order.
This example uses column input to read the same health and fitness club data, but it
reads the value for the variable Team first and omits the variable IdNumber altogether.
You can read or reread part of a value when using column input. For example,
because the team names begin with different letters, this program saves storage space
by reading only the first character in the field that contains the team name. Note the
INPUT statement:
data club2;
   input Team $ 25 Name $ 6-23 StartWeight 32-34 EndWeight 36-38;
   datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red    210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red    127 118
1221 Jim Brown          yellow 220
;
proc print data=club2;
   title 'Weight Club Members';
run;
The following output shows the resulting data set. The variable that contains the
identification number is no longer in the data set. Instead, Team is the first variable in
the new data set, and it contains only one character to represent the team value.
Output 3.5 Data Set Created with Column Input (Skipping Fields)
               Weight Club Members               1
                                   Start       End
Obs    Team    Name               Weight    Weight
  1    r       David Shaw            189       165
  2    y       Amelia Serrano        145       124
  3    r       Alan Nance            210       192
  4    y       Ravi Sinha            194       177
  5    r       Ashley McKnight       127       118
  6    y       Jim Brown             220         .
Column Input: Points to Remember
Remember the following rules when you use column input:
Character variables can be up to 32,767 bytes (32KB) in length and are not limited
to the default length of eight bytes.
Character variables can contain embedded blanks.
You can read fields in any order.
A placeholder is not required to indicate a missing data value. A blank field is
read as missing and does not cause other values to be read incorrectly.
You can skip over part of the data in the data record.
You can reread fields or parts of fields.
You can read standard character and numeric data only. Informats are ignored.
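The rule about rereading fields is not demonstrated by the programs above, so here is a minimal sketch of it. It assumes the same health and fitness club records; the data set name CLUB_INITIALS and the variable FirstInitial are hypothetical names introduced only for this illustration. Column 6 is read twice: once as part of Name and once on its own.

```sas
data club_initials;                 /* hypothetical data set name */
   /* Columns 6-23 are read as Name; column 6 is then reread as FirstInitial. */
   input Name $ 6-23 FirstInitial $ 6 Team $ 25-30;
   datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
;
proc print data=club_initials;
run;
```

Because column input moves the pointer to whatever columns you name, the order of the specifications in the INPUT statement does not have to follow the order of the fields in the record.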
Reading Data That Requires Special Instructions
Understanding Formatted Input
Sometimes the INPUT statement requires special instructions to read the data
correctly. For example, SAS can read numeric data that is in special formats such as
binary, packed decimal, or date/time. SAS can also read numeric values that contain
special characters such as commas and currency symbols. In these situations, use
formatted input. Formatted input combines the features of column input with the
ability to read nonstandard numeric or character values. The following values are
examples of data that requires formatted input:
1,262
$55.64
02JAN2003
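As a hedged illustration, the three sample values above could be read in one INPUT statement along the following lines. The data set name INFORMAT_SKETCH, the variable names, and the column layout of the data line are assumptions made for this sketch. The @8 and @16 specifications are column-pointer controls, which are explained later in this section.

```sas
data informat_sketch;               /* hypothetical data set name */
   input Quantity comma5.           /* columns 1-5:   1,262     -> 1262  */
         @8  Price    comma6.       /* columns 8-13:  $55.64    -> 55.64 */
         @16 SaleDate date9.;       /* columns 16-24: 02JAN2003 -> a SAS date value */
   format SaleDate date9.;          /* display the stored date in readable form */
   datalines;
1,262  $55.64  02JAN2003
;
proc print data=informat_sketch;
run;
```

The COMMAw.d informat removes dollar signs as well as commas, which is why it is used here for both Quantity and Price. Without the FORMAT statement, SaleDate would print as the underlying numeric date value.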
Program: Reading Data That Requires Special Instructions
The data in this program includes numeric values that contain a comma, which is an
invalid character for a numeric variable:
data january_sales;
   input Item $ 1-16 Amount comma5.;
   datalines;
trucks          1,382
vans            1,235
sedans          2,391
;
proc print data=january_sales;
   title 'January Sales in Thousands';
run;
The INPUT statement cannot read the values for the variable Amount as valid
numeric values without the additional instructions provided by an informat. The
informat COMMA5. enables the INPUT statement to read and store this data as a
valid numeric value.
The following figure shows that the informat COMMA5. instructs the program to
read five characters of data (the comma counts as part of the length of the data), to
remove the comma from the data, and to write the resulting numeric value to the
program data vector. Note that the name of an informat always ends in a period (.).
Figure 3.1 Reading a Value with an Informat
COMMA5. informat
The following figure shows that the data values are read into the input buffer exactly
as they occur in the raw data records, but they are written to the program data vector
(and then to the data set as an observation) as valid numeric values without any special
characters.
Figure 3.2 Input Value Compared to Variable Value
Input Buffer
----+----1----+----2----+----3
trucks          1,382
Program Data Vector
Item      Amount
trucks    1382
The following output shows the resulting data set. The values for Amount contain
only numbers. Note that the commas are removed.
Output 3.6 Data Set Created with Column and Formatted Input
   January Sales in Thousands     1
Obs    Item            Amount
  1    trucks            1382
  2    vans              1235
  3    sedans            2391
In a report, you might want to include the comma in numeric values to improve
readability. Just as the informat gives instructions on how to read a value and to remove
the comma, a format gives instructions to add characters to variable values in the
output. See “Writing Output without Creating a Data Set” on page 522 for an example.
Understanding How to Control the Position of the Pointer
As the INPUT statement reads data values, it uses an input pointer to keep track of
the position of the data in the input buffer. Column-pointer controls provide additional
control over pointer movement and are especially useful with formatted input.
Column-pointer controls tell how far to advance the pointer before SAS reads the next
value. In this example, SAS reads data lines with a combination of column and
formatted input:
data january_sales;
   input Item $ 1-16 Amount comma5.;
   datalines;
trucks          1,382
vans            1,235
sedans          2,391
;
In the next example, SAS reads data lines by using formatted input with a
column-pointer control:
data january_sales;
   input Item $10. @17 Amount comma5.;
   datalines;
trucks          1,382
vans            1,235
sedans          2,391
;
After SAS reads the first value for the variable Item, the pointer is left in the next
position, column 11. The absolute column-pointer control, @17, then directs the pointer
to move to column 17 in the input buffer. Now, it is in the correct position to read a
value for the variable Amount.
In the following program, the relative column-pointer control, +6, instructs the
pointer to move six columns to the right before SAS reads the next data value.
data january_sales;
   input Item $10. +6 Amount comma5.;
   datalines;
trucks          1,382
vans            1,235
sedans          2,391
;
The data in these two programs is aligned in columns. As with column input, you
instruct the pointer to move from field to field. With column input you use column
specifications; with formatted input you use the length that is specified in the informat
together with pointer controls.
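Because an absolute column-pointer control can move the pointer to any column, formatted input can also read fields in a different order from the one in which they appear in the record. The following is a minimal sketch under that assumption; the data set name JANUARY_SALES_REORDERED is hypothetical.

```sas
data january_sales_reordered;       /* hypothetical data set name */
   /* Jump to column 17 for Amount first, then back to column 1 for Item. */
   input @17 Amount comma5. @1 Item $10.;
   datalines;
trucks          1,382
vans            1,235
sedans          2,391
;
proc print data=january_sales_reordered;
run;
```

In the resulting data set, Amount appears before Item because variables are created in the order in which the INPUT statement names them.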
Formatted Input: Points to Remember
Remember the following rules when you use formatted input:
SAS reads formatted input data until it has read the number of columns that the
informat indicates. This method of reading the data is different from list input,
which reads until a blank space (or other defined delimiter character) is reached.
You can position the pointer to read the next value by using pointer controls.
You can read data stored in nonstandard form such as packed decimal, or data
that contains commas.
You have the flexibility of using informats with all the features of column input, as
described in “Column Input: Points to Remember” on page 50.
Reading Unaligned Data with More Flexibility
Understanding How to Make List Input More Flexible
While list input is the simplest to code, remember that it places restrictions on your
data. By using format modifiers, you can take advantage of the simplicity of list input
without the inconvenience of the usual restrictions. For example, you can use modified
list input to do the following:
Create character variables that are longer than the default length of eight bytes.
Read numeric data with special characters like commas, dashes, and currency
symbols.
Read character data that contains embedded blanks.
Read data values that can be stored as SAS date variables.
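The last item in the list is not shown in the programs that follow, so here is a minimal sketch of it. The data set name JANUARY_SALES_DATED, the variable SaleDate, and the date values in the data lines are made up for this illustration. The colon format modifier lets the DATE9. informat read each date up to the next blank.

```sas
data january_sales_dated;           /* hypothetical data set name */
   /* Each value ends at the next blank; the informat controls how it is read. */
   input Item : $12. SaleDate : date9. Amount : comma5.;
   format SaleDate date9.;          /* display the stored date in readable form */
   datalines;
Trucks 02JAN2003 1,382
Vans 15JAN2003 1,235
SportUtility 28JAN2003 987
;
proc print data=january_sales_dated;
run;
```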
Creating Longer Variables and Reading Numeric Data That Contains
Special Characters
By simply modifying list input with the colon format modifier (:), you can read
character data that contains more than eight characters and numeric data that
contains special characters.
To use the colon format modifier with list input, place the colon between the variable
name and the informat. As in simple list input, at least one blank (or other defined
delimiter character) must separate each value from the next, and character values
cannot contain embedded blanks (or other defined delimiter characters). Consider this
DATA step:
data january_sales;
   input Item : $12. Amount : comma5.;
   datalines;
Trucks 1,382
Vans 1,235
Sedans 2,391
SportUtility 987
;
proc print data=january_sales;
   title 'January Sales in Thousands';
run;
The variable Item has a length of 12, and the variable Amount requires an informat (in
this case, COMMA5.) that removes commas from numbers so that they are read as
valid numeric values. The data values are not aligned in columns as was required in
the last example, which used formatted input to read the data.
The following output shows the resulting data set.
Output 3.7 Data Set Created with Modified List Input (: comma5.)
   January Sales in Thousands     1
Obs    Item            Amount
  1    Trucks            1382
  2    Vans              1235
  3    Sedans            2391
  4    SportUtility       987
Reading Character Data That Contains Embedded Blanks
Because list input uses a blank space to determine where one value ends and the
next one begins, values normally cannot contain blanks. However, with the ampersand
format modifier (&) you can use list input to read data that contains single embedded
blanks. The only restriction is that at least two blanks must divide each value from the
next data value in the record.
To use the ampersand format modifier with list input, place the ampersand between
the variable name and the informat. The following DATA step uses the ampersand
format modifier with list input to create the data set CLUB2. Note that the data is not
in fixed columns; therefore, column input is not appropriate.
data club2;
   input IdNumber Name & $18. Team $ StartWeight EndWeight;
   datalines;
1023 David Shaw  red 189 165
1049 Amelia Serrano  yellow 145 124
1219 Alan Nance  red 210 192
1246 Ravi Sinha  yellow 194 177
1078 Ashley McKnight  red 127 118
1221 Jim Brown  yellow 220 .
;
proc print data=club2;
   title 'Weight Club Members';
run;
The character variable Name, with a length of 18, contains members’ first and last
names separated by one blank space. The data lines must have two blank spaces
between the values for the variable Name and the variable Team for the INPUT
statement to correctly read the data.
The following output shows the resulting data set.
Output 3.8 Data Set Created with Modified List Input (& $18.)
                     Weight Club Members                     1
           Id                                  Start       End
Obs    Number    Name               Team      Weight    Weight
  1      1023    David Shaw         red          189       165
  2      1049    Amelia Serrano     yellow       145       124
  3      1219    Alan Nance         red          210       192
  4      1246    Ravi Sinha         yellow       194       177
  5      1078    Ashley McKnight    red          127       118
  6      1221    Jim Brown          yellow       220         .
Mixing Styles of Input
An Example of Mixed Input
When you begin an INPUT statement in a particular style (list, column, or
formatted), you are not restricted to using that style alone. You can mix input styles in
a single INPUT statement as long as you mix them in a way that appropriately
describes the raw data records. For example, this DATA step uses all three input styles:
data club1;
   input IdNumber                  /* 1 */
         Name $18.                 /* 2 */
         Team $ 25-30              /* 3 */
         StartWeight EndWeight;    /* 1 */
   datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red    210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red    127 118
1221 Jim Brown          yellow 220   .
;
proc print data=club1;
   title 'Weight Club Members';
run;
The following list corresponds to the numbered items in the preceding program:
1. The variables IdNumber, StartWeight, and EndWeight are read with list input.
2. The variable Name is read with formatted input.
3. The variable Team is read with column input.
The following output demonstrates that the data is read correctly.
Output 3.9 Data Set Created with Mixed Styles of Input
                     Weight Club Members                     1
           Id                                  Start       End
Obs    Number    Name               Team      Weight    Weight
  1      1023    David Shaw         red          189       165
  2      1049    Amelia Serrano     yellow       145       124
  3      1219    Alan Nance         red          210       192
  4      1246    Ravi Sinha         yellow       194       177
  5      1078    Ashley McKnight    red          127       118
  6      1221    Jim Brown          yellow       220         .
Understanding the Effect of Input Style on Pointer Location
Why You Can Get into Trouble by Mixing Input Styles
CAUTION:
When you mix styles of input in a single INPUT statement, you can get unexpected results
if you do not understand where the input pointer is positioned after SAS reads a value in
the input buffer. As the INPUT statement reads data values from the record in the
input buffer, it uses a pointer to keep track of its position. Read the following
sections so that you understand how the pointer movement differs between input
styles before mixing multiple input styles in a single INPUT statement.
Pointer Location with Column and Formatted Input
With column and formatted input, you supply the instructions that determine the
exact pointer location. With column input, SAS reads the columns that you specify in
the INPUT statement. With formatted input, SAS reads the exact length that you
specify with the informat. In both cases, the pointer moves as far as you instruct it and
stops. The pointer is left in the column that immediately follows the last column that is
read.
Here are two examples of input followed by an explanation of the pointer location.
The first DATA step shows column input:
data scores;
   input Team $ 1-6 Score 12-13;
   datalines;
red        59
blue       95
yellow     63
green      76
;
The second DATA step uses the same data to show formatted input:
data scores;
   input Team $6. +5 Score 2.;
   datalines;
red        59
blue       95
yellow     63
green      76
;
The following figure shows that the pointer is located in column 7 after the first
value is read with either of the two previous INPUT statements.
Figure 3.3 Pointer Position: Column and Formatted Input
----+----1----+----2
red        59
      ^ pointer in column 7
Unlike list input, column and formatted input rely totally on your instructions to
move the pointer and read the value for the second variable, Score. Column input uses
column specifications to move the pointer to each data field. Formatted input uses
informats and pointer controls to control the position of the pointer.
This INPUT statement uses column input with the column specifications 12-13 to
move the pointer to column 12 and read the value for the variable Score:
input Team $ 1-6 Score 12-13;
This INPUT statement uses formatted input with the +5 column-pointer control to
move the pointer to column 12. Then the value for the variable Score is read with the 2.
numeric informat.
input Team $6. +5 Score 2.;
Without the use of a pointer control, which moves the pointer to the column where the
value begins, this INPUT statement would attempt to read the value for Score in
columns 7 and 8, which are blank.
Pointer Location with List Input
List input, on the other hand, uses a scanning method to determine the pointer
location. With list input, the pointer reads until a blank is reached and then stops in
the next column. To read the next variable value, the pointer moves automatically to
the first nonblank column, discarding any leading blanks it encounters. Here is the
same data that is read with list input:
data scores;
   input Team $ Score;
   datalines;
red        59
blue       95
yellow     63
green      76
;
The following figure shows that the pointer is located in column 5 after the value red
is read. Because Score, the next variable, is read with list input, the pointer scans for
the next nonblank space before it begins to read a value for Score. Unlike column and
formatted input, you do not have to explicitly move the pointer to the beginning of the
next field in list input.
Figure 3.4 Pointer Position: List Input
----+----1----+----2
red        59
    ^ pointer in column 5
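The difference in pointer location is what causes trouble when styles are mixed carelessly. The following sketch, which uses the hypothetical data set names SCORES_MIXED and SCORES_FIXED, reads Team with list input and Score with formatted input. In the first step the pointer stops one column past the blank that ends the team name, so the 2. informat is applied to blank columns and Score would be expected to be missing. The second step adds an absolute column-pointer control to place the pointer at the start of the score.

```sas
data scores_mixed;                  /* hypothetical data set name */
   /* Team uses list input, so the pointer stops one column past the blank
      that ends the value, not at column 12 where the scores begin. */
   input Team $ Score 2.;           /* Score is read from the wrong columns */
   datalines;
red        59
blue       95
;

data scores_fixed;                  /* hypothetical data set name */
   /* An absolute pointer control puts the pointer where the score starts. */
   input Team $ @12 Score 2.;
   datalines;
red        59
blue       95
;
```

Reading Score with list input (input Team $ Score;) is another way to avoid the problem, because list input scans ahead to the next nonblank column on its own.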
Review of SAS Tools
Statements
DATALINES;
indicates that data lines immediately follow the DATALINES statement. A
semicolon in the line that immediately follows the last data line indicates the end
of the data and causes the DATA step to compile and execute.
INFILE DATALINES DLM='character';
identifies the source of the input records as data lines in the job stream rather
than as an external file. When your program contains the input data, the data
lines directly follow the DATALINES statement. Because you can specify
DATALINES in the INFILE statement, you can take advantage of many
data-reading options that are available only through the INFILE statement.
The DLM= option specifies the character that is used to separate data values in
the input records. By default, a blank space denotes the end of a data value. This
option is useful when you want to use list input to read data records in which a
character other than a blank separates data values.
INPUT variable <&> <$>;
reads the input data record using list input. The & (ampersand format modifier)
enables character values to contain embedded blanks. When you use the
ampersand format modifier, two blanks are required to signal the end of a data
value. The $ indicates a character variable.
INPUT variable start-column <-end-column>;
reads the input data record using column input. You can omit end-column if the
data is only 1 byte long. This style of input enables you to skip columns of data
that you want to omit.
INPUT variable :informat;
INPUT variable &informat;
read the input data record using modified list input. The : (colon format modifier)
instructs SAS to use the informat that follows to read the data value. The &
(ampersand format modifier) instructs SAS to use the informat that follows to read
the data value. When you use the ampersand format modifier, two blanks are
required to signal the end of a data value.
INPUT <pointer-control> variable informat;
reads raw data using formatted input. The informat supplies special instructions
to read the data. You can also use a pointer-control to direct SAS to start reading
at a particular column.
The syntax given above for the three styles of input shows only one variable.
Subsequent variables in the INPUT statement may or may not be described in the
same input style as the first one. You may use any of the three styles of input (list,
column, and formatted) in a single INPUT statement.
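The INFILE DATALINES statement with the DLM= option is listed above but is not used in any program in this section. As a minimal sketch, assuming comma-separated versions of the health and fitness club records and the hypothetical data set name CLUB_DELIMITED, it could be used like this:

```sas
data club_delimited;                /* hypothetical data set name */
   infile datalines dlm=',';        /* a comma, not a blank, separates values */
   input IdNumber Name $ Team $ StartWeight EndWeight;
   datalines;
1023,David,red,189,165
1049,Amelia,yellow,145,124
;
proc print data=club_delimited;
run;
```

The INFILE statement must come before the INPUT statement so that the delimiter is in effect when the first record is read.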
Column-Pointer Controls
@n
moves the pointer to the nth column in the input buffer.
+n
moves the pointer forward n columns in the input buffer.
/
moves the pointer to the next line in the input buffer.
#n
moves the pointer to the nth line in the input buffer.
Learning More
Advanced features
For some more advanced data-reading features, see Chapter 4, “Starting with Raw
Data: Beyond the Basics,” on page 61.
Character-delimited data
For more information about reading data that is delimited by a character other
than a blank space, see the DELIMITER= option in the INFILE statement in SAS
Language Reference: Dictionary.
Pointer controls
For a complete discussion and listing of column-pointer controls, line-pointer
controls, and line-hold specifiers, see SAS Language Reference: Dictionary.
Types of input
For more information about the INPUT statement, see SAS Language Reference:
Dictionary.
CHAPTER 4
Starting with Raw Data: Beyond the Basics
Introduction to Beyond the Basics with Raw Data
Purpose
Prerequisites
Testing a Condition before Creating an Observation
Creating Multiple Observations from a Single Record
Using the Double Trailing @ Line-Hold Specifier
Understanding How the Double Trailing @ Affects DATA Step Execution
Reading Multiple Records to Create a Single Observation
How the Data Records Are Structured
Method 1: Using Multiple Input Statements
Method 2: Using the / Line-Pointer Control
Reading Variables from Multiple Records in Any Order
Understanding How the #n Line-Pointer Control Affects DATA Step Execution
Problem Solving: When an Input Record Unexpectedly Does Not Have Enough Values
Understanding the Default Behavior
Methods of Control: Your Options
Four Options: FLOWOVER, STOPOVER, MISSOVER, and TRUNCOVER
Understanding the MISSOVER Option
Understanding the TRUNCOVER Option
Review of SAS Tools
Column-Pointer Controls
Line-Hold Specifiers
Statements
Learning More
Introduction to Beyond the Basics with Raw Data
Purpose
To create a SAS data set from raw data, you often need more than the most basic
features. In this section, you will learn advanced features for reading raw data that
include the following:
how to understand and then control what happens when a value is unexpectedly
missing in an input record
how to read a record more than once so that you may test a condition before
taking action on the current record
how to create multiple observations from a single input record
how to read multiple records to create a single observation
Prerequisites
You should understand the concepts presented in Chapter 1, “What Is the SAS
System?,” on page 3 and Chapter 2, “Introduction to DATA Step Processing,” on page 19
before continuing.
Testing a Condition before Creating an Observation
Sometimes you need to read a record, and hold that record in the input buffer while
you test for a specified condition before a decision can be made about further
processing. As an example, the ability to hold a record so that you can read from it
again, if necessary, is useful when you need to test for a condition before SAS creates an
observation from a data record. To do this, you can use the trailing at-sign (@).
For example, to create a SAS data set that is a subset of a larger group of records,
you might need to test for a condition to decide if a particular record will be used to
create an observation. The trailing at-sign placed before the semicolon at the end of an
INPUT statement instructs SAS to hold the current data line in the input buffer. This
makes the data line available for a subsequent INPUT statement. Otherwise, the next
INPUT statement causes SAS to read a new record into the input buffer.
You can set up the process to read each record twice by following these steps:
1. Use an INPUT statement to read a portion of the record.
2. Use a trailing @ at the end of the INPUT statement to hold the record in the input
   buffer for the execution of the next INPUT statement.
3. Use an IF statement on the portion that is read in to test for a condition.
4. If the condition is met, use another INPUT statement to read the remainder of the
   record to create an observation.
5. If the condition is not met, the record is released and control passes back to the
   top of the DATA step.
To read from a record twice, you must prevent SAS from automatically placing a new
record into the input buffer when the next INPUT statement executes. Use of a trailing
@ in the first INPUT statement serves this purpose. The trailing @ is one of two
line-hold specifiers that enable you to hold a record in the input buffer for further
processing.
For example, the health and fitness club data contains information about all
members. This DATA step creates a SAS data set that contains only members of the
red team:
data red_team;
   input Team $ 13-18 @;                                   /* 1 */
   if Team='red';                                          /* 2 */
   input IdNumber 1-4 StartWeight 20-22 EndWeight 24-26;   /* 3 */
   datalines;
1023 David  red    189 165
1049 Amelia yellow 145 124
1219 Alan   red    210 192
1246 Ravi   yellow 194 177
1078 Ashley red    127 118
1221 Jim    yellow 220   .
;                                                          /* 4 */
proc print data=red_team;
   title 'Red Team';
run;
In this DATA step, these actions occur:
1. The INPUT statement reads a record into the input buffer, reads a data value
   from columns 13 through 18, and assigns that value to the variable Team in the
   program data vector. The single trailing @ holds the record in the input buffer.
2. The IF statement enables the current iteration of the DATA step to continue only
   when the value for Team is red. When the value is not red, the current iteration
   stops and SAS returns to the top of the DATA step, resets values in the program
   data vector to missing, and releases the held record from the input buffer.
3. The INPUT statement executes only when the value of Team is red. It reads the
   remaining data values from the record held in the input buffer and assigns values
   to the variables IdNumber, StartWeight, and EndWeight.
4. The record is released from the input buffer when the program returns to the top
   of the DATA step.
The following output shows the resulting data set:
Output 4.1 Subset Data Set Created with Trailing @
                Red Team                1
                   Id     Start       End
Obs    Team    Number    Weight    Weight
  1    red       1023       189       165
  2    red       1219       210       192
  3    red       1078       127       118
Creating Multiple Observations from a Single Record
Using the Double Trailing @ Line-Hold Specifier
Sometimes you may need to create multiple observations from a single record of raw
data. One way to tell SAS how to read such a record is to use the other line-hold
specifier, the double trailing at-sign (@@ or “double trailing @”). The double trailing @
not only prevents SAS from reading a new record into the input buffer when a new
INPUT statement is encountered, but it also prevents the record from being released
when the program returns to the top of the DATA step. (Remember that the trailing @
does not hold a record in the input buffer across iterations of the DATA step.)
For example, this DATA step uses the double trailing @ in the INPUT statement:
data body_fat;
   input Gender $ PercentFat @@;
   datalines;
m 13.3 f 22
m 22 f 23.2
m 16 m 12
;
proc print data=body_fat;
   title 'Results of Body Fat Testing';
run;
The following output shows the resulting data set:
Output 4.2 Data Set Created with Double Trailing @
Results of Body Fat Testing     1
                 Percent
Obs    Gender        Fat
  1    m            13.3
  2    f            22.0
  3    m            22.0
  4    f            23.2
  5    m            16.0
  6    m            12.0
Understanding How the Double Trailing @ Affects DATA Step Execution
To understand how the data records in the previous example were read, look at the
data lines that were used in the previous DATA step:
m 13.3 f 22
m 22 f 23.2
m 16 m 12
Each record contains the raw data for two observations instead of one. Consider this
example in terms of the flow of the DATA step, as explained in Chapter 2, “Introduction
to DATA Step Processing,” on page 19.
Ordinarily, when SAS reaches the end of the DATA step, it returns to the top of the
program and begins the next iteration, executing until there are no more records to
read. Each time it returns to the top of the DATA step and executes the INPUT
statement, it automatically reads a new record into the input buffer. Without a
line-hold specifier, therefore, the second set of data values in each record (f 22,
f 23.2, and m 12 in the following lines) would never be read:
m 13.3 f 22
m 22 f 23.2
m 16 m 12
To allow the second set of data values in each record to be read, the double trailing @
tells SAS to hold the record in the input buffer. Each record is held in the input buffer
until the end of the record is reached. The program does not automatically place the
next record into the input buffer each time the INPUT statement is executed, and the
current record is not automatically released when it returns to the top of the DATA
step. As a result, the pointer location is maintained on the current record which
enables the program to read each value in that record. Each time the DATA step
completes an iteration, an observation is written to the data set.
The next five figures demonstrate what happens in the input buffer when a double
trailing @ appears in the INPUT statement, as in this example:
input Gender $ PercentFat @@;
The first figure shows that all values in the program data vector are set to missing.
The INPUT statement reads the first record into the input buffer. The program begins
to read values from the current pointer location, which is the beginning of the input
buffer.
Figure 4.1 First Iteration: First Record Is Read
Input Buffer
----+----1----+----2
m 13.3 f 22
Program Data Vector
Gender    PercentFat
          .
The following figure shows that the value m is written to the program data vector.
When the pointer reaches the blank space that follows 13.3, the complete value for the
variable PercentFat has been read. The pointer stops in the next column, and the value
13.3 is written to the program data vector.
Figure 4.2 First Observation Is Created
Input Buffer
----+----1----+----2
m 13.3 f 22
Program Data Vector
Gender    PercentFat
m         13.3
There are no other variables in the INPUT statement and no more statements in the
DATA step, so three actions take place:
1. The first observation is written to the data set.
2. The DATA step begins its next iteration.
3. The values in the program data vector are set to missing.
The following figure shows the current position of the pointer. SAS is ready to read
the next piece of data in the same record.
Figure 4.3 Second Iteration: First Record Remains in the Input Buffer
Input Buffer
----+----1----+----2
m 13.3 f 22
Program Data Vector
Gender    PercentFat
          .
The following figure shows that the INPUT statement reads the next two values from
the input buffer and writes them to the program data vector.
Figure 4.4 Second Observation Is Created
Input Buffer
----+----1----+----2
m 13.3 f 22
Program Data Vector
Gender    PercentFat
f         22
When the DATA step completes the second iteration, the values in the program data
vector are written to the data set as the second observation. Then the DATA step
begins its third iteration. Values in the program data vector are set to missing, and the
INPUT statement executes. The pointer, which is now at column 13 (two columns to the
right of the last data value that was read), continues reading. Because this is list input,
the pointer scans for the next nonblank character to begin reading the next value.
When the pointer reaches the end of the input buffer and fails to find a nonblank
character, SAS reads a new record into the input buffer.
The final figure shows that values for the third observation are read from the
beginning of the second record.
Figure 4.5 Third Iteration: Second Record Is Read into the Input Buffer
Input Buffer
----+----1----+----2
m 22 f 23.2
Program Data Vector
Gender    PercentFat
          .
The process continues until SAS reads all the records. The resulting SAS data set
contains six observations instead of three.
Note: Although this program successfully reads all of the data in the input records,
SAS writes a message to the log noting that the program had to go to a new line.
Reading Multiple Records to Create a Single Observation
How the Data Records Are Structured
An earlier example (see “Reading Character Data That Contains Embedded Blanks”
on page 54) shows data in which all the values for an observation are contained in a
single record of raw data:
1023 David Shaw red 189 165
This INPUT statement reads all the data values arranged across a single record:
input IdNumber 1-4 Name $ 6-23 Team $ StartWeight EndWeight;
Now, consider the opposite situation: when information for a single observation is not
contained in a single record of raw data but is scattered across several records. For
example, the health and fitness club data could be constructed in such a way that the
information about a single member is spread across several records instead of in a
single record:
1023 David Shaw
red
189 165
Method 1: Using Multiple Input Statements
Multiple INPUT statements, one for each record, can read each record into a single
observation, as in this example:
input IdNumber 1-4 Name $ 6-23;
input Team $ 1-6;
input StartWeight 1-3 EndWeight 5-7;
To understand how to use multiple INPUT statements, consider what happens as a
DATA step executes. Remember that one record is read into the input buffer
automatically as each INPUT statement is encountered during each iteration. SAS
reads the data values from the input buffer and writes them to the program data vector
as variable values. At the end of the DATA step, all the variable values in the program
data vector are written automatically as a single observation.
This example uses multiple INPUT statements in a DATA step to read only selected
data fields and create a data set containing only the variables IdNumber, StartWeight,
and EndWeight.
data club2;
input IdNumber 1-4; /* (1) */
input; /* (2) */
input StartWeight 1-3 EndWeight 5-7; /* (3) */
datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220 .
;
proc print data=club2;
title 'Weight Club Members';
run;
The following list corresponds to the numbered items in the preceding program:
(1) The first INPUT statement reads only one data field in the first record and
    assigns a value to the variable IdNumber.
(2) The second INPUT statement, without arguments, is a null INPUT statement
    that reads the second record into the input buffer. However, it does not assign a
    value to a variable.
(3) The third INPUT statement reads the third record into the input buffer and
    assigns values to the variables StartWeight and EndWeight.
The following output shows the resulting data set:
Output 4.3 Data Set Created with Multiple INPUT Statements
         Weight Club Members          1

           Id     Start       End
Obs    Number    Weight    Weight
  1      1023       189       165
  2      1049       145       124
  3      1219       210       192
  4      1246       194       177
  5      1078       127       118
  6      1221       220         .
Method 2: Using the / Line-Pointer Control
Writing a separate INPUT statement for each record is not the only way to create a
single observation. You can write a single INPUT statement and use the slash (/)
line-pointer control. The slash line-pointer control forces a new record into the input
buffer and positions the pointer at the beginning of that record.
This example uses only one INPUT statement to read multiple records:
data club2;
input IdNumber 1-4 / / StartWeight 1-3 EndWeight 5-7;
datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220 .
;
proc print data=club2;
title 'Weight Club Members';
run;
The / line-pointer control appears exactly where a new INPUT statement begins in
the previous example (see “Method 1: Using Multiple Input Statements” on page 67).
The sequence of events in the input buffer and the program data vector as this DATA
step executes is identical to the previous example in method 1. The / is the signal to
read a new record into the input buffer, which happens automatically when the DATA
step encounters a new INPUT statement. The preceding example shows two slashes
(/ /), indicating that SAS skips a record. SAS reads the first record, skips the second
record, and reads the third record.
The following output shows the resulting data set:
Output 4.4 Data Set Created with the / Line-Pointer Control
         Weight Club Members          1

           Id     Start       End
Obs    Number    Weight    Weight
  1      1023       189       165
  2      1049       145       124
  3      1219       210       192
  4      1246       194       177
  5      1078       127       118
  6      1221       220         .
Reading Variables from Multiple Records in Any Order
You can also read multiple records to create a single observation by pointing to a
specific record in a set of input records with the #n line-pointer control. As you saw in
the last section, the advantage of using the / line-pointer control over multiple INPUT
statements is that it requires fewer statements. However, using the #n line-pointer
control enables you to read the variables in any order, no matter which record contains
the data values. It is also useful if you want to skip data lines.
This example uses one INPUT statement to read multiple data lines in a different
order:
data club2;
input #2 Team $ 1-6 #1 Name $ 6-23 IdNumber 1-4
#3 StartWeight 1-3 EndWeight 5-7;
datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220 .
;
proc print data=club2;
title 'Weight Club Members';
run;
The following output shows the resulting data set:
Output 4.5 Data Set Created with the #n Line-Pointer Control
                       Weight Club Members                       1

                                           Id     Start       End
Obs    Team      Name                  Number    Weight    Weight
  1    red       David Shaw              1023       189       165
  2    yellow    Amelia Serrano          1049       145       124
  3    red       Alan Nance              1219       210       192
  4    yellow    Ravi Sinha              1246       194       177
  5    red       Ashley McKnight         1078       127       118
  6    yellow    Jim Brown               1221       220         .
The order of the observations is the same as in the raw records (shown in the section
“Reading Variables from Multiple Records in Any Order” on page 70). However, the
order of the variables in the data set differs from the order of the variables in the raw
input data records. This occurs because the order of the variables in the INPUT
statement determines their order in the resulting data set.
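The #n line-pointer control can also skip a record simply by never pointing to it. The following sketch is an illustration rather than one of the manual's own examples: the data set name CLUB_IDS and the title text are hypothetical, and the data lines are the first two members from the preceding program. Because the highest line number in the INPUT statement is 3, each group of three records is still treated as one observation, even though no value is read from record 2:

```sas
/* Hypothetical data set name; data lines come from the preceding example. */
data club_ids;
   input #1 IdNumber 1-4
         #3 StartWeight 1-3 EndWeight 5-7;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;

proc print data=club_ids;
   title 'Weight Club Members: Record 2 Skipped';
run;
```

The resulting data set should contain two observations with the variables IdNumber, StartWeight, and EndWeight.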
Understanding How the #n Line-Pointer Control Affects DATA Step Execution
To understand the importance of the #n line-pointer control, remember the sequence
of events in the DATA steps that demonstrate the / line-pointer control and multiple
INPUT statements. Each record is read into the input buffer sequentially. The data is
read, and then a / or a new INPUT statement causes the program to read the next
record into the input buffer. It is impossible for the program to read a value from the
first record after a value from the second record is read because the data in the first
record is no longer available in the input buffer.
To solve this problem, use the #n line-pointer control. The #n line-pointer control
signals the program to create a multiple-line input buffer so that all the data for a
single observation is available while the observation is being built in the program data
vector. The #n line-pointer control also identifies the record in which data for each
variable appears. To use the #n line-pointer control, the raw data must have the same
number of records for each observation; for example, it cannot have three records for
one observation and two for the next.
When the program compiles and builds the input buffer, it looks at the INPUT
statement and creates an input buffer with as many lines as are necessary to contain
the number of records it needs to read for a single observation. In this example, the
highest number of records specified is three, so the input buffer is built to contain three
records at one time. The following figures demonstrate the flow of the DATA step in
this example.
This figure shows that the values are set to missing in the program data vector and
that the INPUT statement reads the first three records into the input buffer.
Figure 4.6 Three Records Are Read into the Input Buffer as a Single Observation
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1023 David Shaw
----+----1----+----2----+----3----+----4----+----5----+----6
red
----+----1----+----2----+----3----+----4----+----5----+----6
189 165
Program Data Vector
Team      Name                  IdNumber    StartWeight    EndWeight
                                       .              .            .
The INPUT statement for this example is as follows:
input #2 Team $ 1-6
#1 Name $ 6-23 IdNumber 1-4
#3 StartWeight 1-3 EndWeight 5-7;
The first variable is preceded by #2 to indicate that the value in the second record is
assigned to the variable Team. The following figure shows that the pointer advances to
the second line in the input buffer, reads the value, and writes it to the program data
vector.
Figure 4.7 Reading from the Second Record First
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1023 David Shaw
----+----1----+----2----+----3----+----4----+----5----+----6
red
----+----1----+----2----+----3----+----4----+----5----+----6
189 165
Program Data Vector
Team      Name                  IdNumber    StartWeight    EndWeight
red                                    .              .            .
The following figure shows that the pointer then moves to the sixth column in the first
record, reads a value, and assigns it to the variable Name in the program data vector.
It then moves to the first column to read the ID number, and assigns it to the variable
IdNumber.
Figure 4.8 Reading from the First Record
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1023 David Shaw
----+----1----+----2----+----3----+----4----+----5----+----6
red
----+----1----+----2----+----3----+----4----+----5----+----6
189 165
Program Data Vector
Team      Name                  IdNumber    StartWeight    EndWeight
red       David Shaw                1023              .            .
The following figure shows that the process continues with the pointer moving to the
third record in the first observation. Values are read and assigned to StartWeight and
EndWeight, the last variable that is listed.
Figure 4.9 Reading from the Third Record
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1023 David Shaw
----+----1----+----2----+----3----+----4----+----5----+----6
red
----+----1----+----2----+----3----+----4----+----5----+----6
189 165
Program Data Vector
Team      Name                  IdNumber    StartWeight    EndWeight
red       David Shaw                1023            189          165
When the bottom of the DATA step is reached, variable values in the program data
vector are written as an observation to the data set. The DATA step returns to the top,
and values in the program data vector are set to missing. The INPUT statement
executes again. The final figure shows that the next three records are read into the
input buffer, ready to create the second observation.
Figure 4.10 Reading the Next Three Records into the Input Buffer
Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6
1049 Amelia Serrano
----+----1----+----2----+----3----+----4----+----5----+----6
yellow
----+----1----+----2----+----3----+----4----+----5----+----6
145 124
Program Data Vector
Team      Name                  IdNumber    StartWeight    EndWeight
                                       .              .            .
Problem Solving: When an Input Record Unexpectedly Does Not Have Enough Values
Understanding the Default Behavior
When a DATA step reads raw data from an external file, problems can occur when
SAS encounters the end of an input line before reading in data for all variables
specified in the INPUT statement. This problem can occur when reading variable-length
records, records that contain missing values, or both.
The following is an example of an external file that contains variable-length records:
----+----1----+----2
22
333
4444
55555
This DATA step uses the numeric informat 5. to read a single field in each record of
raw data and to assign values to the variable TestNumber:
data numbers;
infile 'your-external-file';
input TestNumber 5.;
run;
proc print data=numbers;
title 'Test DATA Step';
run;
The DATA step reads the first value (22). Because the value is shorter than the 5
characters expected by the informat, the DATA step attempts to finish filling the value
with the next record (333). This value is entered into the PDV and becomes the value of
the TestNumber variable for the first observation. The DATA step then goes to the next
record, but encounters the same problem because the value (4444) is shorter than the
value that is expected by the informat. Again, the DATA step goes to the next record,
reads the value (55555), and assigns that value to the TestNumber variable for the
second observation.
The following output shows the results. After this program runs, the SAS log
contains a note to indicate the places where SAS went to the next record to search for
data values.
Output 4.6 Reading Raw Data Past the End of a Line: Default Behavior
   Test DATA Step    1

         Test
Obs    Number
  1       333
  2     55555
Methods of Control: Your Options
Four Options: FLOWOVER, STOPOVER, MISSOVER, and TRUNCOVER
To control how SAS behaves after it attempts to read past the end of a data line, you
can use the following options in the INFILE statement:
infile 'your-external-file' flowover;
   is the default behavior. The DATA step simply reads the next record into the input
   buffer, attempting to find values to assign to the rest of the variable names in the
   INPUT statement.
infile 'your-external-file' stopover;
   causes the DATA step to stop processing if an INPUT statement reaches the end of
   the current record without finding values for all variables in the statement. Use
   this option if you expect all of the data in the external file to conform to a given
   standard and if you want the DATA step to stop when it encounters a data record
   that does not conform to the standard.
infile 'your-external-file' missover;
   prevents the DATA step from going to the next line if it does not find values in the
   current record for all of the variables in the INPUT statement. Instead, the DATA
   step assigns a missing value for all variables that do not have values.
infile 'your-external-file' truncover;
   causes the DATA step to assign the raw data value to the variable even if the
   value is shorter than expected by the INPUT statement. If, when the DATA step
   encounters the end of an input record, there are variables without values, the
   variables are assigned missing values for that observation.
You can also use these options even when your data lines are in the program itself,
that is, when they follow the DATALINES statement. Simply use datalines instead of
a reference to an external file to indicate that the data records are in the DATA step
itself:
infile datalines flowover;
infile datalines stopover;
infile datalines missover;
infile datalines truncover;
Note: The examples in this section show the use of the MISSOVER and
TRUNCOVER options with formatted input. You can also use these options with list
input and column input.
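As a minimal sketch of the MISSOVER option with list input and in-stream data, consider the following DATA step. The data set name SCORES, the variable names, and the data values are invented for illustration and do not come from the health and fitness club data. The second and third data lines deliberately contain fewer values than the INPUT statement names:

```sas
/* Hypothetical data: some lines have fewer values than the INPUT statement expects. */
data scores;
   infile datalines missover;
   input Name $ Test1 Test2 Test3;
   datalines;
Ann 90 85 88
Bob 72 80
Cy 95
;

proc print data=scores;
   title 'List Input with MISSOVER';
run;
```

With MISSOVER in effect, SAS should create three observations, one for each data line: Test3 is missing for Bob, and both Test2 and Test3 are missing for Cy. Without the option, the default FLOWOVER behavior would send the INPUT statement to the next data line in search of the missing values.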
Understanding the MISSOVER Option
The MISSOVER option prevents the DATA step from going to the next line if it does
not find values in the current record for all of the variables in the INPUT statement.
Instead, the DATA step assigns a missing value for all variables that do not have
complete values according to any specified informats. The input file contains the
following raw data:
----+----1----+----2
22
333
4444
55555
The following example uses the MISSOVER option:
data numbers;
infile 'your-external-file' missover;
input TestNumber 5.;
run;
proc print data=numbers;
title 'Test DATA Step';
run;
Output 4.7 Output from the MISSOVER Option
   Test DATA Step    1

         Test
Obs    Number
  1         .
  2         .
  3         .
  4     55555
Because the fourth record is the only one whose value matches the informat, it is the
only record whose value is assigned to the TestNumber variable. The other observations
receive missing values. This result is probably not the desired outcome for this
example, but the MISSOVER option can sometimes be valuable. For an example, see
“Updating a Data Set” on page 295.
Note: If there is a blank line at the end of the last record, the DATA step attempts
to load another record into the input buffer. Because there are no more records, the
MISSOVER option instructs the DATA step to assign missing values to all variables,
and an extra observation is added to the data set. To prevent this situation from
occurring, make sure that your input data does not have a blank line at the end of the
last record.
Understanding the TRUNCOVER Option
The TRUNCOVER option causes the DATA step to assign the raw data value to the
variable even if the value is shorter than the length that is expected by the INPUT
statement. If, when the DATA step encounters the end of an input record, there are
variables without values, the variables are assigned missing values for that observation.
The following example demonstrates the use of the TRUNCOVER option:
data numbers;
infile 'your-external-file' truncover;
input TestNumber 5.;
run;
proc print data=numbers;
title 'Test DATA Step';
run;
Output 4.8 Output from the TRUNCOVER Option
   Test DATA Step    1

         Test
Obs    Number
  1        22
  2       333
  3      4444
  4     55555
This result shows that all of the values were assigned to the TestNumber variable,
despite the fact that three of them did not match the informat. For another example
using the TRUNCOVER option, see “Input SAS Data Set for Examples” on page 140.
Review of SAS Tools
Column-Pointer and Line-Pointer Controls
@n
moves the pointer to the nth column in the input buffer.
+n
moves the pointer forward n columns in the input buffer.
/
moves the pointer to the next line in the input buffer.
#n
moves the pointer to the nth line in the input buffer.
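The pointer controls listed above can be combined in a single INPUT statement. The following sketch assumes the three-record layout used earlier in this section; the data set name CLUB_WEIGHTS is hypothetical. Here @1 positions the pointer at column 1, the two slashes skip the second record and advance to the third, and +1 moves the pointer past the blank in column 4 before EndWeight is read:

```sas
/* Hypothetical data set name; data lines follow the three-record layout shown earlier. */
data club_weights;
   input @1 IdNumber 4.
         / /
         @1 StartWeight 3. +1 EndWeight 3.;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;

proc print data=club_weights;
   title 'Pointer Controls in One INPUT Statement';
run;
```

The result should be two observations, each with IdNumber, StartWeight, and EndWeight.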
Line-Hold Specifiers
@
(trailing @) prevents SAS from automatically reading a new data record into the
input buffer when a new INPUT statement is executed within the same iteration
of the DATA step. When used, the trailing @ must be the last item in the INPUT
statement.
@@
(double trailing @) prevents SAS from automatically reading a new data record
into the input buffer when the next INPUT statement is executed, even if the
DATA step returns to the top for another iteration. When used, the double trailing
@ must be the last item in the INPUT statement.
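The two line-hold specifiers can be sketched as follows. Both DATA steps are illustrations: the data set names RED_TEAM and BODY_FAT and the data values are assumptions made for this sketch. In the first step, the trailing @ holds each record so that a second INPUT statement can continue reading it, but only for members of the red team. In the second step, the double trailing @ holds the record across iterations so that each data line supplies more than one observation:

```sas
/* Trailing @: hold the record, test a value, then keep reading the same record. */
data red_team;
   input Team $ @;
   if Team = 'red';
   input StartWeight EndWeight;
   datalines;
red 189 165
yellow 145 124
red 210 192
;

/* Double trailing @: hold the record across iterations of the DATA step. */
data body_fat;
   input Gender $ PercentFat @@;
   datalines;
m 13.3 f 22
m 22 f 23.2
;
```

RED_TEAM should contain two observations (the two red records), and BODY_FAT should contain four observations, two from each data line.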
Statements
DATALINES;
indicates that data lines immediately follow. A semicolon in the line that
immediately follows the last data line indicates the end of the data and causes the
DATA step to compile and execute.
INFILE fileref <FLOWOVER | STOPOVER | MISSOVER | TRUNCOVER>;
INFILE ’external-file’ <FLOWOVER | STOPOVER | MISSOVER | TRUNCOVER>;
identifies an external file to be read by an INPUT statement. Specify a fileref that
has been assigned with a FILENAME statement or with an appropriate operating
environment command. Or you can specify the actual name of the external file.
These options give you control over how SAS behaves if the end of a data record
is encountered before all of the variables are assigned values. You can use these
options with list, modified list, formatted, and column input.
FLOWOVER
is the default behavior. It causes the DATA step to look in the next record if
the end of the current record is encountered before all of the variables are
assigned values.
MISSOVER
causes the DATA step to assign missing values to any variables that do not
have values when the end of a data record is encountered. The DATA step
continues processing.
STOPOVER
causes the DATA step to stop execution immediately and write a note to the
SAS log.
TRUNCOVER
causes the DATA step to assign values to variables, even if the values are
shorter than expected by the INPUT statement, and to assign missing values
to any variables that do not have values when the end of a record is
encountered.
INPUT variable <&> <$>;
reads the input data record using list input. The & (ampersand format modifier)
allows character values to contain embedded blanks. When you use the
ampersand format modifier, two blanks are required to signal the end of a data
value. The $ indicates a character variable.
INPUT variable start-column<-end-column>;
reads the input data record using column input. You can omit end-column if the
data is only 1 byte long. This style of input enables you to skip columns of data
that you want to omit.
INPUT variable :informat;
INPUT variable &informat;
reads the input data record using modified list input. The : (colon format modifier)
instructs SAS to use the informat that follows to read the data value. The &
(ampersand format modifier) instructs SAS to use the informat that follows to read
the data value. When you use the ampersand format modifier, two blanks are
required to signal the end of a data value.
INPUT <pointer-control> variable informat;
reads raw data using formatted input. The informat supplies special instructions
to read the data. You can also use a pointer-control to direct SAS to start reading
at a particular column.
The syntax given above for the three styles of input shows only one variable.
Subsequent variables in the INPUT statement may or may not be described in the
same input style as the first one. You may use any of the three styles of input (list,
column, and formatted) in a single INPUT statement.
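As a sketch of mixing input styles, the following INPUT statement reads IdNumber with column input, Name with formatted input (preceded by the @6 column-pointer control), and the remaining variables with list input. The data set name CLUB_MIXED is hypothetical, and the data lines assume that the name occupies columns 6 through 23:

```sas
/* Hypothetical data set name; Name is assumed to occupy columns 6-23. */
data club_mixed;
   input IdNumber 1-4 @6 Name $18. Team $ StartWeight EndWeight;
   datalines;
1023 David Shaw         red 189 165
1049 Amelia Serrano     yellow 145 124
;

proc print data=club_mixed;
   title 'Column, Formatted, and List Input Combined';
run;
```

After the formatted read of Name leaves the pointer at column 24, list input scans forward for the next nonblank character, so Team, StartWeight, and EndWeight do not need to start in fixed columns.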
Learning More
Handling missing data values
For complete details about the FLOWOVER, STOPOVER, MISSOVER, and
TRUNCOVER options in the INFILE statement, see SAS Language Reference:
Dictionary.
Reading multiple input records
For a complete discussion and listing of line-pointer controls and line-hold
specifiers, see SAS Language Reference: Dictionary.
Testing a condition
For more information about performing conditional processing with the IF
statement, see Chapter 9, “Acting on Selected Observations,” on page 139 and
Chapter 10, “Creating Subsets of Observations,” on page 159.
CHAPTER 5
Starting with SAS Data Sets
Introduction to Starting with SAS Data Sets
Purpose
Prerequisites
Understanding the Basics
Input SAS Data Set for Examples
Reading Selected Observations
Reading Selected Variables
Overview of Reading Selected Variables
Keeping Selected Variables
Dropping Selected Variables
Choosing between Data Set Options and Statements
Choosing between the DROP= and KEEP= Data Set Option
Creating More Than One Data Set in a Single DATA Step
Using the DROP= and KEEP= Data Set Options for Efficiency
Review of SAS Tools
Data Set Options
Procedures
Statements
Learning More
Introduction to Starting with SAS Data Sets
Purpose
In this section, you will learn how to do the following:
   display information about a SAS data set
   create a new SAS data set from an existing SAS data set rather than creating it
      from raw data records
Reading a SAS data set in a DATA step is simpler than reading raw data because the
work of describing the data to SAS has already been done.
Prerequisites
You should understand the concepts presented in Chapter 1, “What Is the SAS
System?,” on page 3 and Chapter 2, “Introduction to DATA Step Processing,” on page 19
before continuing with this section.
Understanding the Basics
When you use a SAS data set as input into a DATA step, the description of the data
set is available to SAS. In your DATA step, use a SET, MERGE, MODIFY, or UPDATE
statement to read the SAS data set. Use SAS programming statements to process the
data and create an output SAS data set.
In a DATA step, you can create a new data set that is a subset of the original data
set. For example, if you have a large data set of personnel data, you might want to look
at a subset of observations that meet certain conditions, such as observations for
employees hired after a certain date. Alternatively, you might want to see all
observations but only a few variables, such as the number of years of education or years
of service to the company.
When you use existing SAS data sets, or subsets created from them, you can make more
efficient use of computer resources than when you reread raw data or process all of a
large data set. Reading fewer variables means that SAS
creates a smaller program data vector, and reading fewer observations means that
fewer iterations of the DATA step occur. Reading data directly from a SAS data set is
more efficient than reading the raw data again, because the work of describing and
converting the data has already been done.
One way of looking at a SAS data set is to produce a listing of the data in a SAS data
set by using the PRINT procedure. Another way to look at a SAS data set is to display
information that describes its structure rather than its data values. To display
information about the structure of a data set, use the DATASETS procedure with the
CONTENTS statement. If you need to work with a SAS data set that is unfamiliar to
you, the CONTENTS statement in the DATASETS procedure displays valuable
information such as the name, type, and length of all the variables in the data set. An
example that shows the CONTENTS statement in the DATASETS procedure is shown
in “Input SAS Data Set for Examples” on page 82.
Input SAS Data Set for Examples
The examples in this section use a SAS data set named CITY, which contains
information about expenditures for a small city. It reports total city expenditures for
the years 1980 through 2000 and divides the expenses into two major categories:
services and administration. (To see the program that creates the CITY data set, see
“DATA Step to Create the Data Set CITY” on page 712.)
The following example uses the DATASETS procedure with the NOLIST option to
display the structure of the CITY data set. The NOLIST option prevents the DATASETS procedure
from listing other data sets that are also located in the WORK library:
proc datasets library=work nolist;
contents data=city;
run;
Output 5.1 The Structure of CITY as Shown by PROC DATASETS
The SAS System 1
The DATASETS Procedure
Data Set Name: WORK.CITY Observations: 21 (1)
Member Type: DATA Variables: 10 (1)
Engine: V8 Indexes: 0
Created: 9:54 Wednesday, October 6, 1999 Observation Length: 80
Last Modified: 9:54 Wednesday, October 6, 1999 Deleted Observations: 0
Protection: Compressed: NO
Data Set Type: Sorted: NO
Label:
-----Engine/Host Dependent Information----- (2)
Data Set Page Size: 8192
Number of Data Set Pages: 1
First Data Page: 1
Max Obs per Page: 101
Obs in First Data Page: 21
Number of Data Set Repairs: 0
File Name: /usr/tmp/code_editor_saswork/SAS_
work63ED00006E98/city.sas7bdat
Release Created: 8.0001M0
Host Created: HP-UX
Inode Number: 62403
Access Permission: rw-r--r--
Owner Name: abcdef
File Size (bytes): 16384
-----Alphabetic List of Variables and Attributes-----
(3) # Variable Type Len Pos (4) Label
----------------------------------------------------------------------------
5 AdminLabor Num 8 32 Administration: Labor
6 AdminSupplies Num 8 40 Administration: Supplies
9 AdminTotal Num 8 64 Administration: Total
7 AdminUtilities Num 8 48 Administration: Utilities
3 ServicesFire Num 8 16 Services: Fire
2 ServicesPolice Num 8 8 Services: Police
8 ServicesTotal Num 8 56 Services: Total
4 ServicesWater_Sewer Num 8 24 Services: Water & Sewer
10 Total Num 8 72 Total Outlays
1 Year Num 8 0
The following list corresponds to the numbered items in the previous SAS output:
(1) The Observations and the Variables fields identify the number of observations and
    the number of variables.
(2) The Engine/Host Dependent Information section lists detailed information about
    the data set. This information is generated by the engine, which is the mechanism
    for reading from and writing to files.
    Operating Environment Information: The output in this section may differ,
    depending on your operating environment. For more information, refer to the SAS
    documentation for your operating environment.
(3) The Alphabetic List of Variables and Attributes lists the name, type, length, and
    position of each variable.
(4) The Label lists the format, informat, and label for each variable, if they exist.
Reading Selected Observations
If you are interested in only part of a large data set, you can use data set options to
create a subset of your data. Data set options specify which observations you want the
new data set to include. In Chapter 10, “Creating Subsets of Observations,” on page
159 you learn how to use the subsetting IF statement to create a subset of a large SAS
data set. In this section, you learn how to use the FIRSTOBS= and OBS= data set
options to create subsets of a larger data set.
For example, you might not want to read the observations at the beginning of the
data set. You can use the FIRSTOBS= data set option to define which observation
should be the first one that is processed. For the data set CITY, this example creates a
data set that excludes observations that contain data prior to 1991 by specifying
FIRSTOBS=12. As a result, SAS does not read the first 11 observations, which contain
data prior to 1991. (To see the program that creates the CITY data set, see “DATA Step
to Create the Data Set CITY” on page 712.)
The following program creates the data set CITY2, which contains the same number
of variables but fewer observations than CITY.
data city2;
set city(firstobs=12);
run;
proc print;
title 'City Expenditures';
title2 '1991 - 2000';
run;
The following output shows the results:
Output 5.2 Subsetting a Data Set by Observations
                                        City Expenditures                                        1
                                           1991 - 2000

                               Services
           Services  Services    Water_     Admin     Admin     Admin  Services     Admin
Obs  Year    Police      Fire     Sewer     Labor  Supplies Utilities     Total     Total     Total
  1  1991      2195      1002       643       256        24        55      3840       335      4175
  2  1992      2204       964       692       256        28        70      3860       354      4214
  3  1993      2175      1144       735       241        19        83      4054       343      4397
  4  1994      2556      1341       813       238        25        97      4710       360      5070
  5  1995      2026      1380       868       226        24        97      4274       347      4621
  6  1996      2526      1454       946       317        13        89      4926       419      5345
  7  1997      2027      1486      1043       226         .        82      4556         .         .
  8  1998      2037      1667      1152       244        20        88      4856       352      5208
  9  1999      2852      1834      1318       270        23        74      6004       367      6371
 10  2000      2787      1701      1317       307        26        66      5805       399      6204
You can also specify the last observation you want to include in a new data set with
the OBS= data set option. For example, the next program creates a SAS data set
containing only the observations for 1989 (the 10th observation) through 1994 (the 15th
observation).
data city3;
set city (firstobs=10 obs=15);
run;
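Note that OBS= names the number of the last observation to read, not a count of observations. The following sketch illustrates this with a stand-in data set, because the program that creates CITY appears elsewhere: the data set YEARS is hypothetical and contains only a Year variable for 1980 through 2000, so that observation numbers line up with those in CITY:

```sas
/* Hypothetical stand-in for CITY: one observation per year, 1980 through 2000. */
data years;
   do Year = 1980 to 2000;
      output;
   end;
run;

/* Read observations 10 through 15 only. */
data subset;
   set years (firstobs=10 obs=15);
run;

proc print data=subset;
   title 'Observations 10 through 15';
run;
```

SUBSET should contain six observations, for the years 1989 through 1994, which matches the range described for CITY3.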
Reading Selected Variables
Overview of Reading Selected Variables
You can create a subset of a larger data set not only by excluding observations but
also by specifying which variables you want the new data set to contain. In a DATA
step you can use the SET statement and the KEEP= or DROP= data set options (or the
DROP and KEEP statements) to create a subset from a larger data set by specifying
which variables you want the new data set to include.
Keeping Selected Variables
This example uses the KEEP= data set option in the SET statement to read only the
variables that represent the services-related expenditures of the data set CITY.
data services;
set city (keep=Year ServicesTotal ServicesPolice ServicesFire
ServicesWater_Sewer);
run;
proc print data=services;
title 'City Services-Related Expenditures';
run;
The following output shows the resulting data set. Note that the data set SERVICES
contains only those variables that are specified in the KEEP= option.
Output 5.3 Selecting Variables with the KEEP= Option
        City Services-Related Expenditures         1

                               Services
           Services  Services    Water_  Services
Obs  Year    Police      Fire     Sewer     Total
  1  1980      2819      1120       422      4361
  2  1981      2477      1160       500      4137
  3  1982      2028      1061       510      3599
  4  1983      2754       893       540      4187
  5  1984      2195       963       541      3699
  6  1985      1877       926       535      3338
  7  1986      1727      1111       535      3373
  8  1987      1532      1220       519      3271
  9  1988      1448      1156       577      3181
 10  1989      1500      1076       606      3182
 11  1990      1934       969       646      3549
 12  1991      2195      1002       643      3840
 13  1992      2204       964       692      3860
 14  1993      2175      1144       735      4054
 15  1994      2556      1341       813      4710
 16  1995      2026      1380       868      4274
 17  1996      2526      1454       946      4926
 18  1997      2027      1486      1043      4556
 19  1998      2037      1667      1152      4856
 20  1999      2852      1834      1318      6004
 21  2000      2787      1701      1317      5805
The following example uses the KEEP statement instead of the KEEP= data set
option to read all of the variables from the CITY data set. The KEEP statement creates
a new data set (SERVICES) that contains only the variables listed in the KEEP
statement. The following program gives results that are identical to those in the
previous example:
data services;
set city;
keep Year ServicesTotal ServicesPolice ServicesFire
ServicesWater_Sewer;
run;
The following example uses the KEEP= data set option in the DATA statement, which
has the same effect as the KEEP statement. All of the variables are read into the
program data vector, but only the specified variables are written to the SERVICES data set:
data services (keep=Year ServicesTotal ServicesPolice ServicesFire
ServicesWater_Sewer);
set city;
run;
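To check that the KEEP statement and the KEEP= option in the DATA statement write the
same variables, you can build both versions and compare them. The following is a minimal
sketch, not part of the original example: CITYDEMO is a hypothetical three-observation
stand-in for CITY, and KEEP_STMT and KEEP_OPT are illustrative output names.

```sas
/* Hypothetical stand-in for CITY with only a few of its variables */
data citydemo;
   input Year ServicesTotal ServicesPolice AdminTotal;
   datalines;
1980 4361 2819 552
1981 4137 2477 289
1982 3599 2028 377
;
run;

/* Version 1: KEEP statement inside the DATA step */
data keep_stmt;
   set citydemo;
   keep Year ServicesTotal ServicesPolice;
run;

/* Version 2: KEEP= data set option in the DATA statement */
data keep_opt (keep=Year ServicesTotal ServicesPolice);
   set citydemo;
run;

/* Compare the two results variable by variable and value by value */
proc compare base=keep_stmt compare=keep_opt;
run;
```

If the two approaches are equivalent, as the text describes, PROC COMPARE should report no
unequal values and no variables that exist in only one of the two data sets.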
Dropping Selected Variables
Use the DROP= option to create a subset of a larger data set when you want to
specify which variables are being excluded rather than which ones are being included.
The following DATA step reads all of the variables from the data set CITY except for
those that are specified with the DROP= option, and then creates a data set named
SERVICES2:
data services2;
set city (drop=Total AdminTotal AdminLabor AdminSupplies
AdminUtilities);
run;
proc print data=services2;
title 'City Services-Related Expenditures';
run;
The following output shows the resulting data set:
Output 5.4 Excluding Variables with the DROP= Option
City Services-Related Expenditures 1
Obs Year ServicesPolice ServicesFire ServicesWater_Sewer ServicesTotal
1 1980 2819 1120 422 4361
2 1981 2477 1160 500 4137
3 1982 2028 1061 510 3599
4 1983 2754 893 540 4187
5 1984 2195 963 541 3699
6 1985 1877 926 535 3338
7 1986 1727 1111 535 3373
8 1987 1532 1220 519 3271
9 1988 1448 1156 577 3181
10 1989 1500 1076 606 3182
11 1990 1934 969 646 3549
12 1991 2195 1002 643 3840
13 1992 2204 964 692 3860
14 1993 2175 1144 735 4054
15 1994 2556 1341 813 4710
16 1995 2026 1380 868 4274
17 1996 2526 1454 946 4926
18 1997 2027 1486 1043 4556
19 1998 2037 1667 1152 4856
20 1999 2852 1834 1318 6004
21 2000 2787 1701 1317 5805
The following example uses the DROP statement instead of the DROP= data set option
to read all of the variables from the CITY data set and to exclude the variables that are
listed in the DROP statement from being written to the new data set. The results are
identical to those in the previous example:
data services2;
set city;
drop Total AdminTotal AdminLabor AdminSupplies AdminUtilities;
run;
proc print data=services2;
run;
Choosing between Data Set Options and Statements
When you create only one data set in the DATA step, the data set options to drop and
keep variables have the same effect on the output data set as the statements to drop
and keep variables. When you want to control which variables are read into the
program data vector, using the data set options in the statement (such as a SET
statement) that reads the SAS data set is generally more efficient than using the
statements. Later topics in this section show you how to use the data set options in
some cases where the statements will not work.
Choosing between the DROP= and KEEP= Data Set Options
In a simple case, you might decide to use the DROP= or KEEP= option, depending on
which method enables you to specify fewer variables. If you work with large jobs that
read data sets, and you expect that variables might be added between the times your
batch jobs run, you may want to use the KEEP= option to specify which variables are
included in the subset data set.
The following figure shows two data sets named SMALL. They have different
contents because the new variable F was added to data set BIG before the DATA step
ran on Tuesday. The DATA step uses the DROP= option to keep variables D and E from
being written to the output data set. The result is that the data sets contain different
contents: the second SMALL data set has an extra variable, F. If the DATA step used
the KEEP= option to specify A, B, and C, then both of the SMALL data sets would have
the same variables (A, B, and C). The addition of variable F to the original data set BIG
would have no effect on the creation of the SMALL data set.
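Figure 5.1 Using the DROP= Option
[Figure: the same DATA step runs at two different times.
data small;
set big(drop=d e);
run;
First run: BIG contains variables A, B, C, D, and E; SMALL contains A, B, and C.
Second run, after F is added: BIG contains A, B, C, D, E, and F; SMALL contains A, B, C, and F.]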
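The scenario in Figure 5.1 can be reproduced with a small test. The following sketch
assumes a hypothetical two-observation version of BIG that already contains the new
variable F; the output names SMALL_DROP and SMALL_KEEP are illustrative only.

```sas
/* Hypothetical BIG data set after variable F has been added */
data big;
   input A B C D E F;
   datalines;
1 2 3 4 5 6
7 8 9 10 11 12
;
run;

/* DROP= excludes only D and E, so the new variable F is carried along */
data small_drop;
   set big(drop=D E);
run;

/* KEEP= names A, B, and C explicitly, so F never reaches the output */
data small_keep;
   set big(keep=A B C);
run;

proc print data=small_drop;
   title 'SMALL Built with DROP=: A, B, C, and F';
run;

proc print data=small_keep;
   title 'SMALL Built with KEEP=: A, B, and C Only';
run;
```

Listing the variables to keep makes the result independent of later additions to the input
data set, which is the reason the text suggests KEEP= for recurring batch jobs.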
Creating More Than One Data Set in a Single DATA Step
You can use a single DATA step to create more than one data set at a time. You can
create data sets with different contents by using the KEEP= or DROP= data set
options. For example, the following DATA step creates two SAS data sets: SERVICES
contains variables that show services-related expenditures, and ADMIN contains
variables that represent the administration-related expenditures. Use the KEEP=
option after each data set name in the DATA statement to determine which variables
are written to each SAS data set being created.
data services(keep=ServicesTotal ServicesPolice ServicesFire
ServicesWater_Sewer)
admin(keep=AdminTotal AdminLabor AdminSupplies
AdminUtilities);
set city;
run;
proc print data=services;
title 'City Expenditures: Services';
run;
proc print data=admin;
title 'City Expenditures: Administration';
run;
The following output shows both data sets. Note that each data set contains only the
variables that are specified with the KEEP= option after its name in the DATA
statement.
Output 5.5 Creating Two Data Sets in One DATA Step
City Expenditures: Services 1
Obs ServicesPolice ServicesFire ServicesWater_Sewer ServicesTotal
1 2819 1120 422 4361
2 2477 1160 500 4137
3 2028 1061 510 3599
4 2754 893 540 4187
5 2195 963 541 3699
6 1877 926 535 3338
7 1727 1111 535 3373
8 1532 1220 519 3271
9 1448 1156 577 3181
10 1500 1076 606 3182
11 1934 969 646 3549
12 2195 1002 643 3840
13 2204 964 692 3860
14 2175 1144 735 4054
15 2556 1341 813 4710
16 2026 1380 868 4274
17 2526 1454 946 4926
18 2027 1486 1043 4556
19 2037 1667 1152 4856
20 2852 1834 1318 6004
21 2787 1701 1317 5805
City Expenditures: Administration 2
Obs AdminLabor AdminSupplies AdminUtilities AdminTotal
1 391 63 98 552
2 172 47 70 289
3 269 29 79 377
4 227 21 67 315
5 214 21 59 294
6 198 16 80 294
7 213 27 70 310
8 195 11 69 275
9 225 12 58 295
10 235 19 62 316
11 266 11 63 340
12 256 24 55 335
13 256 28 70 354
14 241 19 83 343
15 238 25 97 360
16 226 24 97 347
17 317 13 89 419
18 226 . 82 .
19 244 20 88 352
20 270 23 74 367
21 307 26 66 399
Note: In this case, using the KEEP= data set option is necessary, because when you
use the KEEP statement, all data sets that are created in the DATA step contain the
same variables.
Using the DROP= and KEEP= Data Set Options for Efficiency
The DROP= and KEEP= data set options are valid in both the DATA statement and
the SET statement. However, you can write a more efficient DATA step if you
understand the consequences of using these options in the DATA statement rather than
the SET statement.
In the DATA statement, these options affect which variables SAS writes from the
program data vector to the resulting SAS data set. In the SET statement, these options
determine which variables SAS reads from the input SAS data set. Therefore, they
determine how the program data vector is built.
When you specify the DROP= or KEEP= option in the SET statement, SAS does not
read the excluded variables into the program data vector. If you work with a large data
set (perhaps one containing thousands or millions of observations), you can construct a
more efficient DATA step by not reading unneeded variables from the input data set.
Note also that if you use a variable from the input data set to perform a calculation,
the variable must be read into the program data vector. If you do not want that
variable to appear in the new data set, however, use the DROP= option in the DATA
statement to exclude it.
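As a sketch of this pattern, the following DATA step reads a variable only because a
calculation needs it and then excludes it from the output. The data set CITYDEMO, the
values of Total (computed here as the sum of the two subtotals), and the new variable
ServicesShare are assumptions made for illustration.

```sas
/* Hypothetical stand-in for CITY; Total is an illustrative sum */
data citydemo;
   input Year ServicesTotal AdminTotal Total;
   datalines;
1980 4361 552 4913
1981 4137 289 4426
1982 3599 377 3976
;
run;

/* KEEP= in the SET statement: only the variables that are needed are read. */
/* DROP= in the DATA statement: Total is used in the calculation but is     */
/* not written to the new data set.                                          */
data shares (drop=Total);
   set citydemo (keep=Year ServicesTotal Total);
   ServicesShare = ServicesTotal / Total;
run;

proc print data=shares;
   title 'Services Share of Total Expenditures';
run;
```

AdminTotal is never read into the program data vector, and Total is available for the
division but absent from SHARES.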
The following DATA step creates the same two data sets as the DATA step in the
previous example, but it does not read the variable Total into the program data vector.
Compare the SET statement here to the one in “Creating More Than One Data Set in a
Single DATA Step” on page 89.
data services (keep=ServicesTotal ServicesPolice ServicesFire
ServicesWater_Sewer)
admin (keep=AdminTotal AdminLabor AdminSupplies
AdminUtilities);
set city(drop=Total);
run;
proc print data=services;
title 'City Expenditures: Services';
run;
proc print data=admin;
title 'City Expenditures: Administration';
run;
In contrast with previous examples, the data set options in this example appear in
both the DATA and SET statements. In the SET statement, the DROP= option
determines which variables are omitted from the program data vector. In the DATA
statement, the KEEP= option controls which variables are written from the program
data vector to each data set being created.
Note: Using a DROP or KEEP statement is comparable to using a DROP= or
KEEP= option in the DATA statement. All variables are included in the program data
vector; they are excluded when the observation is written from the program data vector
to the new data set. When you create more than one data set in a single DATA step,
using the data set options enables you to drop or keep different variables in each of the
new data sets. A DROP or KEEP statement, on the other hand, affects all of the data
sets that are created.
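The difference described in this note can be seen by creating two data sets both ways.
This is a minimal sketch under stated assumptions: CITYDEMO is a hypothetical stand-in
for CITY, and the four output data set names are illustrative.

```sas
/* Hypothetical stand-in for CITY */
data citydemo;
   input Year ServicesTotal AdminTotal;
   datalines;
1980 4361 552
1981 4137 289
;
run;

/* DROP statement: Year is excluded from BOTH output data sets */
data services_a admin_a;
   set citydemo;
   drop Year;
run;

/* DROP= data set option: each output data set excludes its own variables */
data services_b (drop=AdminTotal)
     admin_b    (drop=ServicesTotal);
   set citydemo;
run;

proc print data=services_a;
   title 'DROP Statement: Same Variables in Every Output Data Set';
run;

proc print data=services_b;
   title 'DROP= Option: Variables Chosen per Data Set';
run;
```

SERVICES_A and ADMIN_A end up with the same two variables, whereas SERVICES_B and ADMIN_B
each contain Year plus their own total.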
Review of SAS Tools
Data Set Options
DROP=variable(s)
specifies the variables to be excluded.
Used in the SET statement, DROP= specifies the variables that are not to be
read from the existing SAS data set into the program data vector. Used in the
DATA statement, DROP= specifies the variables to be excluded from the data set
that is being created.
FIRSTOBS=n
specifies the first observation to be read from the SAS data set that you specify in
the SET statement.
KEEP=variable(s)
specifies the variables to be included.
Used in the SET statement, KEEP= specifies the variables to be read from the
existing SAS data set into the program data vector. Used in the DATA statement,
KEEP= specifies which variables in the program data vector are to be written to
the data set being created.
OBS=n
specifies the last observation to be read from the SAS data set that you specify in
the SET statement.
Procedures
PROC DATASETS <LIBRARY=SAS-data-library>;
CONTENTS <DATA=SAS-data-set>;
describes the structure of a SAS data set, including the name, type, and length of
all variables in the data set.
Statements
DATA SAS-data-set<(data-set-options)>;
begins a DATA step and names the SAS data set or data sets that are being
created. You can specify the DROP= or KEEP= data set options in parentheses
after each data set name to control which variables are written to the output data
set from the program data vector.
DROP variable(s);
specifies the variables to be excluded from the data set that is being created. See
also the DROP= data set option.
KEEP variable(s);
specifies the variables to be written to the data set that is being created. See also
the KEEP= data set option.
SET SAS-data-set<(data-set-options)>;
reads observations from a SAS data set rather than records of raw data. You can
specify the DROP= or KEEP= data set options in parentheses after a data set
name to control which variables are read into the program data vector from the
input data set.
Learning More
Creating SAS data sets
For a general discussion about creating SAS data sets from other SAS data sets by
merging, concatenating, interleaving, and updating, see Chapter 15, “Methods of
Combining SAS Data Sets,” on page 233.
Data set options
See the “Data Set Options” section of SAS Language Reference: Dictionary, and
the SAS documentation for your operating environment.
DROP and KEEP statements
See the “Statements” section of SAS Language Reference: Dictionary.
Engines
See SAS Language Reference: Concepts.
Subsetting IF statement
You can use the subsetting IF statement and conditional (IF-THEN) logic when
creating a new SAS data set from an existing one. For more information, see
Chapter 9, “Acting on Selected Observations,” on page 139 and Chapter 10,
“Creating Subsets of Observations,” on page 159.
PART 3
Basic Programming
Chapter 6 Understanding DATA Step Processing
Chapter 7 Working with Numeric Variables
Chapter 8 Working with Character Variables
Chapter 9 Acting on Selected Observations
Chapter 10 Creating Subsets of Observations
Chapter 11 Working with Grouped or Sorted Observations
Chapter 12 Using More Than One Observation in a Calculation
Chapter 13 Finding Shortcuts in Programming
Chapter 14 Working with Dates in the SAS System
CHAPTER 6
Understanding DATA Step Processing
Introduction to DATA Step Processing
Purpose
Prerequisites
Input SAS Data Set for Examples
Adding Information to a SAS Data Set
Understanding the Assignment Statement
Making Uniform Changes to Data by Creating a Variable
Adding Information to Some Observations but Not Others
Making Uniform Changes to Data Without Creating Variables
Using Variables Efficiently
Defining Enough Storage Space for Variables
Conditionally Deleting an Observation
Review of SAS Tools
Statements
Learning More
Introduction to DATA Step Processing
Purpose
To add, modify, and delete information in a SAS data set, you use a DATA step. In
this section, you will learn how the DATA step works, the general form of the
statements, and some programming techniques.
Prerequisites
You should understand the concepts presented in Chapter 2, “Introduction to DATA
Step Processing,” on page 19 and Chapter 3, “Starting with Raw Data: The Basics,” on
page 43 before proceeding with this section.
Input SAS Data Set for Examples
Tradewinds Travel Inc. has an external file that they use to manipulate and store
data about their tours. The external file contains the following information:
(1)     (2) (3) (4) (5)
France   8  793 575 Major
Spain   10  805 510 Hispania
India   10    . 489 Royal
Peru     7  722 590 Mundial
The numbered fields represent
(1) the name of the country toured
(2) the number of nights on the tour
(3) the airfare in US dollars
(4) the cost of the land package in US dollars
(5) the name of the company that offers the tour
Notice that the cost of the airfare for the tour to India has a missing value, which is
indicated by a period.
The following DATA step creates a permanent SAS data set named
MYLIB.INTERNATIONALTOURS:
options pagesize=60 linesize=80 pageno=1 nodate;
/* Replace the quoted placeholders with your own library location and file name */
libname mylib 'permanent-data-library';
data mylib.internationaltours;
infile 'input-file';
input Country $ Nights AirCost LandCost Vendor $;
run;
proc print data=mylib.internationaltours;
title 'Data Set MYLIB.INTERNATIONALTOURS';
run;
The PROC PRINT statement that follows the DATA step produces this display of the
MYLIB.INTERNATIONALTOURS data set:
Output 6.1 Creating a Permanent SAS Data Set
Data Set MYLIB.INTERNATIONALTOURS 1
Obs Country Nights AirCost LandCost Vendor
1 France 8 793 575 Major
2 Spain 10 805 510 Hispania
3 India 10 . 489 Royal
4 Peru 7 722 590 Mundial
Adding Information to a SAS Data Set
Understanding the Assignment Statement
One of the most common reasons for using program statements in the DATA step is
to produce new information from the original information or to change the information
read by the INPUT or SET/MERGE/MODIFY/UPDATE statement. How do you add
information to observations with a DATA step?
The basic method of adding information to a SAS data set is to create a new variable
in a DATA step with an assignment statement. An assignment statement has the form:
variable=expression;
The variable receives the new information; the expression creates the new
information. You specify the calculation necessary to produce the information and write
the calculation as the expression. When the expression contains character data, you
must enclose the data in quotation marks. SAS evaluates the expression and stores the
new information in the variable that you name. Remember that even if you need to add
the information to only one or two observations out of many, SAS creates that variable
for all observations. The SAS data set that is being created must have a value, even if
it is a missing value, for every variable in every observation.
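As a brief illustration of both kinds of expression, the following sketch creates one
numeric variable and one character variable with assignment statements. The data set
name TOURDEMO and the variables TotalCost and Currency are hypothetical; the input
values are taken from the first two tours shown earlier.

```sas
data tourdemo;
   input Country $ Nights AirCost LandCost;
   /* Numeric expression: the result is stored in a new numeric variable */
   TotalCost = AirCost + LandCost;
   /* Character data in an expression must be enclosed in quotation marks */
   Currency = 'USD';
   datalines;
France 8 793 575
Spain 10 805 510
;
run;

proc print data=tourdemo;
   title 'Variables Created with Assignment Statements';
run;
```

Both new variables exist in every observation of TOURDEMO, because SAS carries out the
assignment statements once for each observation that it reads.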
Making Uniform Changes to Data by Creating a Variable
Sometimes you want to make a particular change to every observation. For example,
at Tradewinds Travel the airfare must be increased for every tour by $10 because of a
new tax. One way to do this is to write an assignment statement that creates a new
variable that calculates the new airfare:
NewAirCost = AirCost+10;
This statement directs SAS to read the value of AirCost, add 10 to it, and assign the
result to the new variable, NewAirCost.
When this assignment statement is included in a DATA step, the DATA step looks
like this:
options pagesize=60 linesize=80 pageno=1 nodate;
data newair;
set mylib.internationaltours;
NewAirCost = AirCost + 10;
run;
proc print data=newair;
var Country AirCost NewAirCost;
title 'Increasing the Air Fare by $10 for All Tours';
run;
Note: In this example, the VAR statement in the PROC PRINT step determines
which variables are displayed in the output.
The following output shows the resulting SAS data set, NEWAIR:
Output 6.2 Adding Information to All Observations by Using a New Variable
Increasing the Air Fare by $10 for All Tours 1
Obs Country AirCost NewAirCost (1)
1   France      793        803
2   Spain       805        815
3   India         .          . (2)
4   Peru        722        732
Notice in this data set that
(1) because SAS carries out each statement in the DATA step for every observation,
NewAirCost is calculated during each iteration of the DATA step.
(2) the observation for India contains a missing value for AirCost; SAS therefore
assigns a missing value to NewAirCost for that observation.
The SAS data set has information in every observation and every variable.
Adding Information to Some Observations but Not Others
Often you need to add information to some observations but not to others. For
example, some tour operators award bonus points to travel agencies for scheduling
particular tours. Two companies, Hispania and Mundial, are offering bonus points this
year.
IF-THEN/ELSE statements can cause assignment statements to be carried out only
when a condition is met. In the following DATA step, the IF statements check the value
of the variable Vendor. If the value is either Hispania or Mundial, information about
the bonus points is added to those observations.
options pagesize=60 linesize=80 pageno=1 nodate;
data bonus;
set mylib.internationaltours;
if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
else if Vendor = 'Mundial' then BonusPoints = 'Yes';
run;
proc print data=bonus;
var Country Vendor BonusPoints;
title1 'Adding Information to Observations for';
title2 'Vendors Who Award Bonus Points';
run;
The following output displays the results:
Output 6.3 Specifying Values for Specific Observations by Using a New Variable
Adding Information to Observations for 1
Vendors Who Award Bonus Points
Obs Country Vendor   BonusPoints
1   France  Major                     (1)
2   Spain   Hispania For 10+ people   (2)
3   India   Royal                     (1)
4   Peru    Mundial  Yes
The new variable BonusPoints has the following information:
(1) In the two observations that are not assigned a value for BonusPoints, SAS
assigns a missing value, represented by a blank in this case, to indicate the
absence of a character value.
(2) The first value that SAS encounters for BonusPoints contains 14 characters;
therefore, SAS sets aside 14 bytes of storage in each observation for BonusPoints,
regardless of the length of the value for that observation.
Making Uniform Changes to Data Without Creating Variables
Sometimes you want to change the value of existing variables without adding new
variables. For example, in one DATA step a new variable, NewAirCost, was created to
contain the value of the airfare plus the new $10 tax:
NewAirCost = AirCost + 10;
You can also decide to change the value of an existing variable rather than create a
new variable. Following the example, AirCost is changed as follows:
AirCost = AirCost + 10;
SAS processes this statement just as it does other assignment statements. It
evaluates the expression on the right side of the equal sign and assigns the result to the
variable on the left side of the equal sign. The fact that the same variable appears on
the right and left sides of the equal sign does not matter. SAS evaluates the expression
on the right side of the equal sign before looking at the variable on the left side.
The following program contains the new assignment statement:
options pagesize=60 linesize=80 pageno=1 nodate;
data newair2;
set mylib.internationaltours;
AirCost = AirCost + 10;
run;
proc print data=newair2;
var Country AirCost;
title 'Adding Tax to the Air Cost Without Adding a New Variable';
run;
The following output displays the results:
Output 6.4 Changing the Information in a Variable
Adding Tax to the Air Cost Without Adding a New Variable 1
Obs Country AirCost
1 France 803
2 Spain 815
3 India .
4 Peru 732
When you change the kind of information that a variable contains, you change the
meaning of that variable. In this case, you are changing the meaning of AirCost from
airfare without tax to airfare with tax. If you remember the current meaning and if you
know that you do not need the original information, then changing a variable’s values is
useful. However, for many programmers, having separate variables is easier than
recalling one variable whose definition changes.
Using Variables Efficiently
Variables that contain information that applies to only one or two observations use
more storage space than necessary. When possible, create fewer variables that apply to
more observations in the data set, and allow the different values in different
observations to supply the information.
For example, the Major company offers discounts, not bonus points, for groups of 30
or more people. An inefficient program would create separate variables for bonus points
and discounts, as follows:
/* inefficient use of variables */
options pagesize=60 linesize=80 pageno=1 nodate;
data tourinfo;
set mylib.internationaltours;
if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
else if Vendor = 'Mundial' then BonusPoints = 'Yes';
else if Vendor = 'Major' then Discount = 'For 30+ people';
run;
proc print data=tourinfo;
var Country Vendor BonusPoints Discount;
title 'Information About Vendors';
run;
The following output displays the results:
Output 6.5 Inefficient: Using Variables That Scatter Information Across Multiple Variables
Information About Vendors 1
Obs Country Vendor   BonusPoints    Discount
1   France  Major                   For 30+ people
2   Spain   Hispania For 10+ people
3   India   Royal
4   Peru    Mundial  Yes
As you can see, storage space is used inefficiently. Both BonusPoints and Discount
have a significant number of missing values.
With a little planning, you can make the SAS data set much more efficient. In the
following DATA step, the variable Remarks contains information about bonus points,
discounts, and any other special features of any tour.
/* efficient use of variables */
options pagesize=60 linesize=80 pageno=1 nodate;
data newinfo;
set mylib.internationaltours;
if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
else if Vendor = 'Mundial' then Remarks = 'Bonus points';
else if Vendor = 'Major' then Remarks = 'Discount: 30+ people';
run;
proc print data=newinfo;
var Country Vendor Remarks;
title 'Information About Vendors';
run;
The following output displays a more efficient use of variables:
Output 6.6 Efficient: Using Variables to Contain Maximum Information
Information About Vendors 1
Obs Country Vendor Remarks
1 France Major Discount: 30+ people
2 Spain Hispania Bonus for 10+ people
3 India Royal
4 Peru Mundial Bonus points
Remarks has fewer missing values and contains all the information that is used by
BonusPoints and Discount in the inefficient example. Using variables efficiently can
save storage space and optimize your SAS data set.
Defining Enough Storage Space for Variables
The first time that a value is assigned to a character variable, SAS allocates as many
bytes of storage space for the variable as there are characters in the first value assigned
to it. At times, you may need to specify the amount of storage space that a variable requires.
For example, as shown in the preceding example, the variable Remarks contains
miscellaneous information about tours:
if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
In this assignment statement, SAS allocates 20 bytes of storage space for Remarks because
there are 20 characters in the first value assigned to it. Because the longest value might
not be the first one assigned, you can specify a more appropriate length for the variable
before the first value is assigned to it:
length Remarks $ 30;
This statement, called a LENGTH statement, applies to the entire data set. It
defines the number of bytes of storage that is used for the variable Remarks in every
observation. SAS uses the LENGTH statement during compilation, not when it is
processing statements on individual observations. The following DATA step shows the
use of the LENGTH statement:
options pagesize=60 linesize=80 pageno=1 nodate;
data newlength;
set mylib.internationaltours;
length Remarks $ 30;
if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
else if Vendor = 'Mundial' then Remarks = 'Bonus points';
else if Vendor = 'Major' then Remarks = 'Discount for 30+ people';
run;
proc print data=newlength;
var Country Vendor Remarks;
title 'Information About Vendors';
run;
The following output displays the NEWLENGTH data set:
Output 6.7 Using a LENGTH Statement
Information About Vendors 1
Obs Country Vendor Remarks
1 France Major Discount for 30+ people
2 Spain Hispania Bonus for 10+ people
3 India Royal
4 Peru Mundial Bonus points
Because the LENGTH statement affects variable storage, not the spacing of columns
in printed output, the Remarks variable appears the same in Output 6.6 and Output
6.7. To see the effect of the LENGTH statement on variable storage by using the
DATASETS procedure, see Chapter 35, “Getting Information about Your SAS Data
Sets,” on page 607.
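A short experiment can make the storage effect visible without waiting for the later
chapter. The sketch below is an assumption-laden illustration, not the manual's own
example: VENDORS, NOLENGTH, and WITHLENGTH are hypothetical data set names, and the
shorter value is deliberately assigned first so that the longer one is cut off.

```sas
/* Hypothetical vendor list; only the Vendor variable is needed here */
data vendors;
   input Vendor $;
   datalines;
Mundial
Hispania
;
run;

/* Without LENGTH: the first value assigned ('Yes') fixes the length at 3 bytes, */
/* so 'For 10+ people' is truncated to 'For'                                     */
data nolength;
   set vendors;
   if Vendor = 'Mundial' then BonusPoints = 'Yes';
   else if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
run;

/* With LENGTH: 14 bytes are reserved before any value is assigned */
data withlength;
   set vendors;
   length BonusPoints $ 14;
   if Vendor = 'Mundial' then BonusPoints = 'Yes';
   else if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
run;

/* The CONTENTS statement lists the stored length of each variable */
proc datasets library=work nolist;
   contents data=nolength;
   contents data=withlength;
run;
quit;
```

The CONTENTS listings should show a length of 3 for BonusPoints in NOLENGTH and 14 in
WITHLENGTH, and printing NOLENGTH shows the truncated value for Hispania.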
Conditionally Deleting an Observation
If you do not want SAS to write an observation from the program data vector to the output
data set when a condition is met, use the DELETE statement in the DATA step. For example, if the tour to
Peru has been discontinued, it is no longer necessary to include the observation for
Peru in the data set that is being created. The following example uses the DELETE
statement to prevent SAS from writing that observation to the output data set:
options pagesize=60 linesize=80 pageno=1 nodate;
data subset;
set mylib.internationaltours;
if Country = 'Peru' then delete;
run;
proc print data=subset;
title 'Omitting a Discontinued Tour';
run;
The following output displays the results:
Output 6.8 Deleting an Observation
Omitting a Discontinued Tour 1
Obs Country Nights AirCost LandCost Vendor
1 France 8 793 575 Major
2 Spain 10 805 510 Hispania
3 India 10 . 489 Royal
The observation for Peru has been deleted from the data set.
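The same technique works with a numeric condition. The following sketch, which assumes a
hypothetical copy of the tour data named TOURSDEMO and an illustrative output name
PRICEDTOURS, deletes any observation whose airfare is missing.

```sas
/* Hypothetical in-line copy of the tour data shown earlier */
data toursdemo;
   input Country $ Nights AirCost LandCost Vendor $;
   datalines;
France 8 793 575 Major
Spain 10 805 510 Hispania
India 10 . 489 Royal
Peru 7 722 590 Mundial
;
run;

/* A period represents a missing numeric value in the comparison */
data pricedtours;
   set toursdemo;
   if AirCost = . then delete;
run;

proc print data=pricedtours;
   title 'Tours with a Known Airfare';
run;
```

Only the observation for India has a missing airfare, so PRICEDTOURS contains the
observations for France, Spain, and Peru.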
Review of SAS Tools
Statements
DELETE;
prevents SAS from writing a particular observation to the output data set. It
usually appears as part of an IF-THEN/ELSE statement.
IF condition THEN action; ELSE action;
tests whether the condition is true. When the condition is true, the THEN
statement specifies the action to take. When the condition is false, the ELSE
statement provides an alternative action. The action can be one or more
statements, including assignment statements.
LENGTH variable <$> length;
assigns the number of bytes of storage (length) for a variable. Include a dollar sign
($) if the variable is character. The LENGTH statement must appear before the
first use of the variable.
variable=expression;
is an assignment statement. It causes SAS to evaluate the expression on the right
side of the equal sign and assign the result to the variable on the left. You must
select the name of the variable and create the proper expression for calculating its
value. The same variable name can appear on the left and right sides of the equal
sign because SAS evaluates the right side before assigning the result to the
variable on the left side.
Learning More
Character variables
For information about expressions involving alphabetic and special characters as
well as numbers, see Chapter 8, “Working with Character Variables,” on page 119.
DATA step
For general DATA step information, see Chapter 2, “Introduction to DATA Step
Processing,” on page 19. Complete information about the DATA step can be found
in the “DATA Step Concepts” section of SAS Language Reference: Concepts.
IF-THEN/ELSE statements
The IF-THEN/ELSE statements are discussed in Chapter 9, “Acting on Selected
Observations,” on page 139.
LENGTH statement
Additional information about the LENGTH statement can be found in Chapter 7,
“Working with Numeric Variables,” on page 107 and Chapter 8, “Working with
Character Variables,” on page 119. To see the effect of the LENGTH statement
on variable storage by using the DATASETS procedure, see Chapter 35, “Getting
Information about Your SAS Data Sets,” on page 607.
Missing values
For more information about missing values, see Chapter 7, “Working with
Numeric Variables,” on page 107 and Chapter 8, “Working with Character
Variables,” on page 119.
Numeric variables
Information about working with numeric variables and expressions can be found
in Chapter 7, “Working with Numeric Variables,” on page 107.
SAS statements
For complete reference information about the IF-THEN/ELSE, LENGTH,
DELETE, assignment, and comment statements, see SAS Language Reference:
Dictionary.
CHAPTER 7
Working with Numeric Variables
Introduction to Working with Numeric Variables
Purpose
Prerequisites
About Numeric Variables in SAS
Input SAS Data Set for Examples
Calculating with Numeric Variables
Using Arithmetic Operators in Assignment Statements
Understanding Numeric Expressions and Assignment Statements
Understanding How SAS Handles Missing Values
Why SAS Assigns Missing Values
Rules for Missing Values
Propagating Missing Values
Calculating Numbers Using SAS Functions
Rounding Values
Calculating a Cost When There Are Missing Values
Combining Functions
Comparing Numeric Variables
Storing Numeric Variables Efficiently
Review of SAS Tools
Functions
Statements
Learning More
Introduction to Working with Numeric Variables
Purpose
In this section, you will learn the following:
how to perform arithmetic calculations in SAS using arithmetic operators and the
SAS functions ROUND and SUM
how to compare numeric variables using logical operators
how to store numeric variables efficiently when disk space is limited
Prerequisites
Before proceeding with this section, you should understand the concepts presented in
the following topics:
Part 1, “Introduction to the SAS System”
Part 2, “Getting Your Data into Shape”
Chapter 6, “Understanding DATA Step Processing,” on page 97
About Numeric Variables in SAS
A numeric variable is a variable whose values are numbers.
Note: SAS uses double-precision floating point representation for calculations and,
by default, for storing numeric variables in SAS data sets.
SAS accepts numbers in many forms, such as scientific notation and hexadecimal. For
more information, see the discussion on the types of numbers that SAS can read from
data lines in SAS Language Reference: Concepts. For simplicity, this documentation
concentrates on numbers in standard representation, as shown here:
1254
336.05
-243
You can use SAS to perform all kinds of mathematical operations. To perform a
calculation in a DATA step, you can write an assignment statement in which the
expression contains arithmetic operators, SAS functions, or a combination of the two.
To compare numeric variables, you can write an IF-THEN/ELSE statement using
logical operators. For more information on numeric functions, see the discussion in the
“Functions and CALL Routines” section in