Training Manual

Data Analysis using SAS

Sujai Das

NIRJAFT, 12 Regent Park, Kolkata - 700040

1. Introduction

SAS (Statistical Analysis System) is a comprehensive software system for statistical analysis, spreadsheet work, data creation, graphics and related tasks. It has a layered, multivendor architecture: regardless of differences in hardware and operating system, SAS applications look the same and produce the same results. The SAS System has three components: Host, Portable Applications and Data. The Host component provides all the required interfaces between the SAS System and the operating environment, the functionality and applications reside in the Portable component, and the user supplies the Data. This course deals with the part of the software used for statistical analysis of data.

 

Windows of SAS

1. Program Editor : All the instructions are given here.

2. Log : Displays SAS statements submitted for execution and messages

3. Output : Gives the output generated

 

Rules for SAS Statements

1. A SAS program communicates with the computer through SAS statements.

2. Each statement in a SAS program must end with a semicolon (;).

3. Each program must end with a RUN statement.

4. Statements can start in any column.

5. Upper case letters, lower case letters or a combination of the two may be used.
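
The rules above can be seen in a minimal program. This is only an illustrative sketch: the data set name DEMO, the variable names and the values are made up for this example and do not come from the manual.

```sas
/* Statements may start in any column and may be typed in any letter case. */
data demo;
   /* every statement ends with a semicolon */
   input id height;
   /* data lines follow the CARDS statement and end with a lone semicolon */
   cards;
1 68
2 61
;
      PROC PRINT DATA=demo;
RUN;
```

The mixed indentation and mixed case are deliberate, to show that SAS accepts both.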

 

Basic Sections of SAS Program

1. DATA section

2. CARDS section

3. PROCEDURE section

 

Data Section

We shall discuss some facts regarding data before we give the syntax for this section.

 

Data value: A single unit of information, such as the name of the species to which a tree belongs or the height of one tree.

Variable: A set of values that describe a specific data characteristic, e.g. the diameters of all trees in a group. A variable name must begin with a letter or underscore. Older releases of SAS limit names to 8 characters; SAS Version 7 and later allow up to 32. Variables are of two types:

Character Variable: Its values are combinations of letters, digits and special characters or symbols.

Numeric Variable: Its values are numbers, with or without a decimal point and with an optional + or - sign.

Observation: A set of data values for the same item, e.g. all measurements on one tree.

The data section starts with a DATA statement of the form

DATA NAME; /* NAME has to be supplied by the user */

 

Input Statements

The INPUT statement is part of the data section. It gives the SAS System the names of the variables and, for formatted data, their formats.

 

List Directed Input

• Data are read in the order in which the variables are given in the INPUT statement.

• Data values are separated by one or more spaces.

• Missing values are represented by a period (.).

• Names of character variables are followed by $ (dollar sign) in the INPUT statement.

 

Example

DATA A;

INPUT ID SEX $ AGE HEIGHT WEIGHT;

CARDS;

1 M 23 68 155

2 F . 61 102

3 M 55 70 202

;

 

Column Input

The columns occupied by each variable can be given in the INPUT statement, for example:

INPUT ID 1-3 SEX $ 4 HEIGHT 5-6 WEIGHT 7-11;

CARDS;

001M68155.5

002F61   99

003M53 33.5

;

 

Alternatively, the starting column of each variable can be given with a column pointer (@), followed by an informat that states its length:

INPUT @1 ID 3.

@4 SEX $1.

@9 AGE 2.

@11 HEIGHT 2.

@16 V_DATE MMDDYY6.

;
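
A complete DATA step built around this INPUT statement might look as follows. This is a sketch only: the data set name VISITS and the single data line are hypothetical, laid out so that each value sits in the columns named by the pointers (the six digits in columns 16-21 are meant as month, day and year).

```sas
/* Hypothetical data set and data line, aligned to the column pointers */
DATA VISITS;
   INPUT @1  ID     3.
         @4  SEX    $1.
         @9  AGE    2.
         @11 HEIGHT 2.
         @16 V_DATE MMDDYY6.;
   /* show the stored date value in a readable form */
   FORMAT V_DATE DATE9.;
   CARDS;
001M    2368   102399
;
RUN;

PROC PRINT DATA=VISITS;
RUN;
```

Without the FORMAT statement the date would be printed as the number of days since 1 January 1960, which is how SAS stores dates.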

Reading More than One Line per Observation

The line pointer (#) tells SAS which line of the observation to read from:

INPUT #1 ID 1-3 AGE 5-6 HEIGHT 10-11

#2 SBP 5-7 DBP 8-10;

CARDS;

001 56   72

    140 80

;

 

Reading the Variable More than Once

Suppose the ID variable occupies six columns and its last two columns hold the state code. The same columns can then be read twice, for example:

INPUT @1 ID 6. @5 STATE 2.;

or

INPUT ID 1-6 STATE 5-6;
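
As a small worked sketch of reading the same columns twice, the DATA step below uses two made-up ID values; the data set name IDS is also hypothetical. With these lines, STATE should come out as 23 and 7 while ID keeps all six digits.

```sas
/* Hypothetical IDs: the last two digits double as the state code */
DATA IDS;
   INPUT ID 1-6 STATE 5-6;
   CARDS;
100123
200307
;
RUN;

PROC PRINT DATA=IDS;
RUN;
```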

 

Formatted Lists

DATA B;

INPUT ID @1 (X1-X2) (1.)

@4 (Y1-Y2) (3.);

CARDS;

11 563789

22 567987

;

PROC PRINT; RUN;

 

Output

Obs. ID x1 x2 y1 y2

1 11 1 1 563 789

2 22 2 2 567 987

 

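DATA C;

INPUT X Y Z @; /* a single trailing @ holds the line only within the current iteration of the DATA step, so each line gives one observation */

CARDS;

1 1 1 2 2 2 5 5 5 6 6 6

1 2 3 4 5 6 3 3 3 4 4 4

;

PROC PRINT; RUN;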

 

Output

Obs. X Y Z

1 1 1 1

2 1 2 3

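DATA D;

INPUT X Y Z @@; /* a double trailing @@ holds the line across iterations, so each line here gives four observations */

CARDS;

1 1 1 2 2 2 5 5 5 6 6 6

1 2 3 4 5 6 3 3 3 4 4 4

;

PROC PRINT; RUN;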

 

Output:

Obs. X Y Z

1 1 1 1

2 2 2 2

3 5 5 5

4 6 6 6

5 1 2 3

6 4 5 6

7 3 3 3

8 4 4 4

 

SAS System Can Read and Write Data Files

A. Simple ASCII files are read with the INFILE and INPUT statements.

B. Output data files are written with the FILE and PUT statements.

Creation of SAS Data Set

DATA EX1;

INPUT GROUP $ X Y Z;

CARDS;

T1 12 17 19

T2 23 56 45

T3 19 28 12

T4 22 23 36

T5 34 23 56

;

Creation of SAS File From An External (ASCII) File

DATA EX2;

INFILE 'B:MYDATA';

INPUT GROUP $ X Y Z;

RUN;

OR

FILENAME ABC 'B:MYDATA';

DATA EX2A;

INFILE ABC;

INPUT GROUP $ X Y Z;

RUN;

 

Creation of A SAS Data Set and An Output ASCII File Using an External File

FILENAME IN 'C:MYDATA';

FILENAME OUT 'A:NEWDATA';

DATA EX3;

INFILE IN;

FILE OUT;

INPUT GROUP $ X Y Z;

TOTAL = SUM(X, Y, Z); /* SUM adds its arguments and ignores missing values */

PUT GROUP $ 1-10 @12 (X Y Z TOTAL) (5.);

RUN;

The above program reads the raw data file 'C:MYDATA', creates a new variable TOTAL and writes the output to the file 'A:NEWDATA'.

 

Creation of SAS File from an External (*.csv) File

data EX4;

/* give the exact path of the file; the file should not have column headings */

infile 'C:\Users\Admn\Desktop\sscnars.csv' dlm=',';

/* if the first row holds the names of the columns, add firstobs=2 to the infile statement so that data are read from row 2 onwards */

/* give the variables in the order in which they appear in the file */

input sn loc $ year season $ crop $ rep trt gyield syield return kcal;

biomass = gyield + syield; /* generates a new variable */

run;

proc print data=EX4;

run;

 

Note: To create a SAS file from a *.txt file, change csv to txt in the file name and set the delimiter to the one used in the file.
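
As a sketch of the note above, a tab-delimited text file might be read as follows. The file name sscnars.txt, the data set name EX5 and the presence of a heading row (hence firstobs=2) are assumptions made for illustration.

```sas
data EX5;
   /* hypothetical file; '09'x is the hexadecimal code of the tab character */
   infile 'C:\Users\Admn\Desktop\sscnars.txt' dlm='09'x firstobs=2;
   input sn loc $ year season $ crop $ rep trt gyield syield return kcal;
run;

proc print data=EX5;
run;
```

For a file separated by another character, only the dlm= value changes, e.g. dlm=';' for semicolons.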

 

Creation of SAS File from an External (*.xls) File

Note: It is always better to copy the names of the variables as a comment line before PROC IMPORT.

/* names of the variables in the Excel file, provided the first row contains variable names */

proc import datafile = 'C:\Users\Desktop\DATA_EXERCISE\descriptive_stats.xls' /* give the exact path of the file */

out = descriptive_stats replace; /* give the output data set name */

run;

proc print;

run;

If we want to make some transformations, then we may use the following statements:

data a1;

set descriptive_stats;

x = fs45+fw;

run;

 

Here PROC IMPORT imports data from an Excel spreadsheet into SAS. The DATAFILE= option gives the location of the file, and the OUT= option names the SAS data set created by the import procedure. PROC PRINT is used to view the contents of the SAS data set descriptive_stats. Running the above code gives the same output as shown above because the same data are used.
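
PROC IMPORT can also be told the file type and the handling of the heading row explicitly. The sketch below assumes an .xls workbook whose first row holds the variable names and an installation with SAS/ACCESS Interface to PC Files; neither assumption is stated in the manual.

```sas
proc import datafile = 'C:\Users\Desktop\DATA_EXERCISE\descriptive_stats.xls'
   out = descriptive_stats
   dbms = xls      /* assumption: the file is an .xls workbook */
   replace;
   getnames = yes; /* take variable names from the first row */
run;

/* list the imported variables and their types */
proc contents data = descriptive_stats;
run;
```

Checking the result with PROC CONTENTS before analysis shows whether any column was imported as character when numeric was expected.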

 

Creating a Permanent SAS Data Set

LIBNAME XYZ 'C:\SASDATA';

DATA XYZ.EXAMPLE;

INPUT GROUP $ X Y Z;

CARDS;

.....

.....

.....

;

RUN;

 

This program reads data following the cards statement and creates a permanent SAS data set in a subdirectory named SASDATA on the C: drive.

 

Using Permanent SAS File

LIBNAME XYZ 'C:\SASDATA';

PROC MEANS DATA=XYZ.EXAMPLE;

RUN;
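
To check what a permanent library holds before using it, PROC CONTENTS can be run against the libref. This is a sketch that assumes the library and data set created above exist.

```sas
LIBNAME XYZ 'C:\SASDATA';

/* list the members of the library without the per-variable detail */
PROC CONTENTS DATA=XYZ._ALL_ NODS;
RUN;

/* show the variables stored in one permanent data set */
PROC CONTENTS DATA=XYZ.EXAMPLE;
RUN;
```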

 

TITLES

Up to 10 titles can be placed at the top of the output using TITLE statements (TITLE, or TITLE1 through TITLE10) in the procedure.

PROC PRINT;

TITLE 'HEIGHT-DIA STUDY';

TITLE3 '1999 STATISTICS';

RUN;

 

Comments can be added to a SAS program using

/* COMMENTS */

FOOTNOTES

Up to 10 footnotes can be placed at the bottom of the output.

PROC PRINT DATA=DIAHT;

FOOTNOTE '1999';

FOOTNOTE5 'STUDY RESULTS';

RUN;

 

For obtaining output as an RTF file, place the procedures whose output is required between the following two statements:

ods rtf file='xyz.rtf' style=journal;

ods rtf close;

For obtaining output as a PDF/HTML file, replace rtf with pdf or html in the above statements. If the output is wanted in continuous format, without a page break between procedures, then we may use

ods rtf file='xyz.rtf' style=journal bodytitle startpage=no;
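
A sketch of how the ODS statements wrap the procedures is given below. It assumes the data set EX1 created earlier in this section is available; any procedures could take the place of the two shown.

```sas
/* open the RTF destination; output from the steps below goes to xyz.rtf */
ods rtf file='xyz.rtf' style=journal bodytitle startpage=no;

proc means data=EX1;
   var x y z;
run;

proc print data=EX1;
run;

/* closing the destination completes the file */
ods rtf close;
```

The file is not complete until the destination is closed, so the closing statement should not be left out.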

LABELLING THE VARIABLES

DATA dose;

TITLE 'yield with factors N P K';

INPUT N P K Yield;

LABEL N = "Nitrogen";

LABEL P = "Phosphorus";

LABEL K = "Potassium";

CARDS;

...

...

...

;

PROC PRINT LABEL; /* the LABEL option makes PROC PRINT show the labels instead of the variable names */

RUN;

The line size of the output can be set with the OPTIONS statement. For example, for a line size (number of columns in a line) of 72, place OPTIONS LINESIZE=72; at the beginning of the program.
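
A minimal sketch of the placement, assuming the data set EX1 created earlier is available:

```sas
/* applies to output produced after this statement */
OPTIONS LINESIZE=72;

PROC PRINT DATA=EX1;
RUN;
```
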

 

2. Statistical Procedure

SAS/STAT has many capabilities using different procedures with many options. There are a total of 73 PROCS in SAS/STAT. SAS/STAT is capable of performing a wide range of statistical analysis that includes:

1. Elementary / Basic Statistics

2. Graphs/Plots

3. Regression and Correlation Analysis

4. Analysis of Variance

5. Experimental Data Analysis

6. Multivariate Analysis

7. Principal Component Analysis

8. Discriminant Analysis

9. Cluster Analysis

10. Survey Data Analysis

11. Mixed model analysis

12. Variance Components Estimation

13. Probit Analysis and many more…

A brief on SAS/STAT Procedures is available at http://support.sas.com/rnd/app/da/stat/procedures/Procedures.html

 

Example 2.1: To Calculate the Means and Standard Deviation

DATA TESTMEAN;

INPUT GROUP $ X Y Z;

CARDS;

CONTROL 12 17 19

TREAT1 23 25 29

TREAT2 19 18 16

TREAT3 22 24 29

CONTROL 13 16 17

TREAT1 20 24 28

TREAT2 16 19 15

 

TREAT3 24 26 30

CONTROL 14 19 21

TREAT1 23 25 29

TREAT2 18 19 17

TREAT3 23 25 30

;

PROC MEANS; VAR X Y Z; RUN;

 

The default output displays the number of observations, mean, standard deviation, minimum value and maximum value of each variable listed. The required statistics can be chosen through the options of PROC MEANS. For example, if we require the mean, standard deviation, median, coefficient of variation, coefficient of skewness and coefficient of kurtosis, then we can write

PROC MEANS mean std median cv skewness kurtosis;

VAR X Y Z;

RUN;

By default the statistics are printed with several decimal places. The desired number of decimal places can be set with the option maxdec=. For example, for an output with three decimal places, we may write

PROC MEANS mean std median cv skewness kurtosis maxdec=3;

VAR X Y Z;

RUN;

 

To obtain means group-wise, first sort the data by group using

PROC SORT; BY GROUP; RUN;

and then use

PROC MEANS; VAR X Y Z; BY GROUP; RUN;

Alternatively, we may use the CLASS statement, which does not require sorted data:

PROC MEANS; CLASS GROUP; VAR X Y Z; RUN;

Descriptive statistics for a given data set can also be obtained with PROC SUMMARY. In the above example, if one wants the mean, standard deviation, coefficient of variation, coefficient of skewness and kurtosis, then one may use the following:

PROC SUMMARY PRINT MEAN STD CV SKEWNESS KURTOSIS;

CLASS GROUP;

VAR X Y Z;

RUN;

 

Many statistical procedures assume that the data are normally distributed. PROC UNIVARIATE with the NORMAL option may be used to test the normality of the data.

 

PROC UNIVARIATE NORMAL; VAR X Y Z;

RUN;

 

If different plots are required, then one may use:

PROC UNIVARIATE DATA=TESTMEAN NORMAL PLOT; /* PLOT option displays stem-and-leaf plot, box plot and normal probability plot */

VAR X Y Z;

BY GROUP; /* creates side-by-side box plots group-wise; the data must first be sorted on the BY variable */

HISTOGRAM / KERNEL NORMAL; /* displays kernel density along with normal curve */

PROBPLOT; /* plots probability plot */

QQPLOT X / NORMAL SQUARE; /* plots quantile-quantile (Q-Q) plot */

CDFPLOT X / NORMAL; /* plots CDF plot */

/* PPPLOT plots a P-P plot, which compares the empirical cumulative distribution function (ecdf) of a variable with a specified theoretical cumulative distribution function. The beta, exponential, gamma, lognormal, normal and Weibull distributions are available. */

PPPLOT X / NORMAL;

RUN;

 

Example 2.2: To Create Frequency Tables

DATA TESTFREQ;

INPUT AGE $ ECG CHD $ CAT $ WT; CARDS;

<55 0 YES YES 1

<55 0 YES YES 17

<55 0 NO YES 7

<55 1 YES NO 257

<55 1 YES YES 3

<55 1 YES NO 7

<55 1 NO YES 1

55+ 0 YES YES 9

55+ 0 YES NO 15

55+ 0 NO YES 30

55+ 1 NO NO 107

55+ 1 YES YES 14

55+ 1 YES NO 5

55+ 1 NO YES 44

55+ 1 NO NO 27

;

PROC FREQ DATA=TESTFREQ;

TABLES AGE*ECG / MISSING CHISQ;

TABLES AGE*CAT / LIST;

RUN;
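
The manual does not say what the variable WT holds. If it is the number of subjects represented by each data line, which is an assumption here, a WEIGHT statement is needed so that the tables count subjects rather than data lines. A sketch under that assumption:

```sas
PROC FREQ DATA=TESTFREQ;
   /* assumption: WT is the count of subjects in each row */
   WEIGHT WT;
   TABLES AGE*ECG / CHISQ;
RUN;
```

Without the WEIGHT statement each data line counts as a single observation.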

 

SCATTER PLOT

PROC PLOT DATA=DIAHT;

PLOT HT*DIA='*'; /* HT = vertical axis, DIA = horizontal axis */

RUN;

 

CHART

PROC CHART DATA = DIAHT; VBAR HT;

RUN;

 

PROC CHART DATA = DIAHT; HBAR DIA;

RUN;

 

PROC CHART DATA = DIAHT; PIE HT;

RUN;
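
The plotting and charting steps above use a data set DIAHT with the variables HT and DIA, which the manual does not list. The sketch below builds a small stand-in so that the steps can be tried; the values are invented purely for illustration and are not real measurements.

```sas
/* Hypothetical tree data: DIA = diameter, HT = height (made-up values) */
DATA DIAHT;
   INPUT DIA HT;
   CARDS;
12.5 8.2
15.0 9.1
18.3 10.4
21.7 11.0
25.2 12.6
;
RUN;

PROC PLOT DATA=DIAHT;
   PLOT HT*DIA='*';
RUN;
QUIT;
```

PROC PLOT stays active after RUN, so the QUIT statement is added to end it.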

 

Example 2.3: To Create A Permanent SAS DATASET and use that for Regression

LIBNAME FILEX 'C:SASRPLIB';

DATA FILEX.RP;

INPUT X1-X5;

CARDS;

1 0 0 0 5.2

.75 .25 0 0 7.2

.75 0 .25 0 5.8

.5 .25 .25 0 6.3

.75 0 0 .25 5.5

.5 0 .25 .25 5.7

.5 .25 0 .25 5.8

.25 .25 .25 .25 5.7

;

RUN;

LIBNAME FILEX 'C:SASRPLIB';

PROC REG DATA=FILEX.RP;

MODEL X5 = X1 X2 / P;

MODEL X5 = X1 X2 X3 X4 / SELECTION=STEPWISE;

TEST: TEST X1-X2=0;

RUN;

 

 

Various other commonly used PROC Statements are PROC ANOVA, PROC GLM; PROC CORR; PROC NESTED; PROC
