Training Manual for Data Analysis using SAS, Sujai Das
- Author: Sujai Das
The Type III and IV sums of squares differ from the Type II sums of squares in the sense that the coefficients on the higher order interaction or nested effects that contain the effects in question are also adjusted so as to satisfy either the orthogonality condition (Type III) or the equitable distribution property (Type IV).
The coefficients on these effects are no longer functions of the nij and, consequently, are the same for all designs with the same general form of estimable functions. If there are no empty cells (no nij = 0), both conditions can be satisfied at the same time and the Type III and Type IV sums of squares are equal. The hypothesis being tested is then the same as when the data are balanced.
When there are empty cells, the hypotheses being tested by the Type III and Type IV sums of squares may differ. The Type III criterion of orthogonality reproduces the same hypotheses one obtains if effects are assumed to add to zero. When there are empty cells this is modified to “the effects that are present are assumed to add to zero”. The Type IV hypotheses utilize balanced subsets of non-empty cells and may not be unique. For illustration, consider a 2x3 factorial in which the terms are added to the model in the order A, B, AB. The various types of sums of squares can then be explained as follows:
Effect          Type I          Type II         Type III and Type IV
General Mean    R(μ)            R(μ)
A               R(A/μ)          R(A/μ,B)        R(A/μ,B,AB)
B               R(B/μ,A)        R(B/μ,A)        R(B/μ,A,AB)
A*B             R(A*B/μ,A,B)    R(A*B/μ,A,B)    R(A*B/μ,A,B)
R(A/μ) is the sum of squares for A adjusted for μ, and so on.
Thus, in brief, the four sets of sums of squares Type I, II, III and IV can be thought of respectively as sequential, each-after-all-others, Σ-restrictions and hypotheses.
There is a relationship between the four types of sums of squares and four types of data structures: balanced and orthogonal, unbalanced and orthogonal, unbalanced and non-orthogonal (all cells filled), and unbalanced and non-orthogonal (empty cells). For illustration, let nij denote the number of observations in level i of factor A and level j of factor B. The following table explains the relationship between data structures and types of sums of squares for two-way classified data.
                               Data Structure Type
Effect    1                2                    3                               4
          Equal nij        Proportionate nij    Disproportionate non-zero nij   Empty Cell
A         I=II=III=IV      I=II, III=IV         III=IV                          -
B         I=II=III=IV      I=II, III=IV         I=II, III=IV                    I=II
A*B       I=II=III=IV      I=II=III=IV          I=II=III=IV                     I=II=III=IV
In general,
I=II=III=IV (balanced data);
II=III=IV (no interaction models);
I=II, III=IV (orthogonal data);
III=IV (all cells filled data).
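The four types of sums of squares can be requested together in PROC GLM so that the equalities listed above can be checked on a given data set. The following is only a minimal sketch: the data set name FACTORIAL and the variables A, B and Y are hypothetical and are not taken from the manual.

```sas
/* Hypothetical data set FACTORIAL: class variables A (2 levels) and   */
/* B (3 levels), response Y. Terms enter in the order A, B, A*B.       */
proc glm data=factorial;
   class a b;
   /* SS1-SS4 request Type I, II, III and IV sums of squares */
   model y = a b a*b / ss1 ss2 ss3 ss4;
run;
quit;
```

With balanced data all four tables should agree; with unequal or empty cells the differences described above become visible.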
Proper error terms: In general, F-tests of hypotheses in ANOVA use the residual mean square as the error term. In some situations, however, other terms are to be used as error terms. For such situations PROC GLM provides the TEST statement, which is identical to the TEST statement available in PROC ANOVA. PROC GLM also allows specification of appropriate error terms in the MEANS, LSMEANS and CONTRAST statements. To illustrate, consider a split plot experiment on yield in which irrigation (IRRIG) treatments are applied to main plots and cultivars (CULT) to subplots. The data so obtained can be analysed using the following statements.
data splitplot;
input REP IRRIG CULT YIELD;
cards;
. . .
. . .
. . .
;
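PROC print; run;
PROC GLM;
class rep irrig cult;
model yield = rep irrig rep*irrig cult irrig*cult;
/* test irrigation against error (A), i.e. rep*irrig */
test h = irrig e = rep*irrig;
/* coefficients 1 -1 compare the first two irrigation levels */
contrast 'IRRIG1 vs IRRIG2' irrig 1 -1 / e = rep*irrig;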
run;
Here the irrigation effects are tested using error (A), which is the rep*irrig term, as specified through e = rep*irrig in both the TEST statement and the CONTRAST statement.
It may be noted here that PROC GLM can be used to perform analysis of covariance as well. For analysis of covariance, the covariate should be included in the MODEL statement without being listed in the CLASS statement.
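An analysis of covariance along these lines might be sketched as follows. The data set ANCOVA_DATA, the treatment factor TRT and the covariate X are hypothetical names introduced only for illustration.

```sas
/* Hypothetical data set ANCOVA_DATA: treatment TRT, covariate X, response YIELD */
proc glm data=ancova_data;
   class trt;                        /* only the factor is a CLASS variable   */
   model yield = trt x / solution;   /* X is not in CLASS, so it is a covariate */
   lsmeans trt / stderr pdiff;       /* treatment means adjusted for X        */
run;
quit;
```

Because X does not appear in the CLASS statement, PROC GLM fits a single regression slope for it, and the LSMEANS statement reports treatment means adjusted to a common value of the covariate.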
PROC RSREG fits the parameters of a complete quadratic response surface and analyses the fitted surface to determine the factor levels of optimum response and performs a ridge analysis to search for the region of optimum response.
PROC RSREG <options>;
MODEL responses = independents / <options>;
RIDGE <options>;
WEIGHT variable;
ID variables;
BY variables;
run;
The PROC RSREG and model statements are required. The BY, ID, MODEL, RIDGE, and WEIGHT statements are described after the PROC RSREG statement below and can appear in any order.
The PROC RSREG statement invokes the procedure and the following options are allowed with PROC RSREG:
DATA = SAS-data-set : specifies the data to be analysed.
NOPRINT : suppresses all printed results when only the output data set is required.
OUT = SAS-data-set : creates an output data set.
The model statement without any options transforms the independent variables to coded data. By default, PROC RSREG codes each independent variable linearly by subtracting the average of its highest and lowest values from the original value and dividing by half of their difference. Canonical and ridge analyses are performed on the model fitted to the coded data. The important options available with the model statement are:
NOCODE : analyses the original (uncoded) data.
ACTUAL : specifies the actual values from the input data set.
COVAR = n : declares that the first n variables on the independent side of the model are simple linear regressors (covariates) rather than factors in the quadratic response surface.
LACKFIT : performs a lack of fit test. For this the repeated observations must appear together.
NOANOVA : suppresses the printing of the analysis of variance and parameter estimates from the model fit.
NOOPTIMAL (NOOPT) : suppresses the printing of the canonical analysis for the quadratic response surface.
NOPRINT : suppresses both the ANOVA and the canonical analysis.
PREDICT : specifies the values predicted by the model.
RESIDUAL : specifies the residuals.
A RIDGE statement computes the ridge of the optimum response. The following important options are available with the RIDGE statement:
MAX : computes the ridge of maximum response.
MIN : computes the ridge of minimum response.
At least one of the two options must be specified.
NOPRINT : suppresses printing of the ridge analysis when only an output data set is required.
OUTR = SAS-data-set : creates an output data set containing the computed optimum ridge.
RADIUS = coded-radii : gives the distances from the ridge starting point at which to compute the optimum.
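The statements and options described above might be combined as in the following sketch. The data set RSM_DATA and the variables N, P and YIELD are hypothetical; the data are sorted by the independent variables first so that repeated observations appear together, as the LACKFIT option requires.

```sas
/* Hypothetical data set RSM_DATA: factors N and P, response YIELD */
proc sort data=rsm_data;
   by n p;                  /* keeps replicated design points together */
run;

proc rsreg data=rsm_data out=rsm_out;
   /* full quadratic surface in N and P, fitted to coded data by default */
   model yield = n p / lackfit predict residual;
   ridge max;               /* ridge of maximum response */
run;
```

Adding NOCODE to the MODEL options would fit the surface to the original rather than the coded values.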
PROC REG is the primary SAS procedure for performing the computations for a statistical analysis of data based on a linear regression model. The basic statements for performing such an analysis are
PROC REG;
MODEL list of dependent variables = list of independent variables / model options;
RUN;
PROC REG with a MODEL statement without any options gives the ANOVA, root mean square error, R-square, adjusted R-square, coefficient of variation, etc.
The options under model statement are
P: It gives predicted values corresponding to each observation in the data set. The estimated standard errors are also given by using this option.
CLM: It yields upper and lower 95% confidence limits for the mean of subpopulation corresponding to specific values of the independent variables.
CLI : It yields a prediction interval for a single unit to be drawn at random from a subpopulation.
STB: It gives standardized regression coefficients.
XPX, I: These print the matrices used in regression computations.
NOINT: This option forces the regression response to pass through the origin. With this option the total sum of squares is uncorrected and hence the R-square statistic is generally much larger than that for models with an intercept.
However, if a no-intercept model is to be fitted with the corrected total sum of squares, so that the usual definitions of statistics such as R-square and MSE are retained, the statement RESTRICT intercept = 0; may be used after the model statement.
The ‘R’ option under the model statement gives residuals, studentized residuals and Cook’s D statistic.
The ‘INFLUENCE’ option under the model statement is used for detection of outliers in the data and provides residuals, studentized residuals, diagonal elements of the HAT MATRIX, COVRATIO, DFFITS, DFBETAS, etc.
For detecting multicollinearity in the data, the options ‘VIF’ (variance inflation factors) and ‘COLLINOINT’ or ‘COLLIN’ may be used.
Besides these, options for weighted regression, output data sets, specification error, heterogeneous variances, etc. are available under PROC REG.
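Several of the MODEL options listed above can be requested in one run, as in this minimal sketch. The data set REGDATA and the variables Y, X1, X2 and X3 are hypothetical.

```sas
/* Hypothetical data set REGDATA: response Y, regressors X1, X2, X3 */
proc reg data=regdata;
   model y = x1 x2 x3 / p clm cli   /* predicted values and 95% limits     */
                        r influence /* residual and influence diagnostics  */
                        vif collin  /* multicollinearity diagnostics       */
                        stb;        /* standardized coefficients           */
run;
quit;
```

In practice only the options needed for the question at hand would be kept, since each one adds its own block of output.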
PROC PRINCOMP can be utilized to perform the principal component analysis.
Multiple model statements are permitted in PROC REG unlike PROC ANOVA and PROC GLM. A model statement can contain several dependent variables.
The statement model y1 y2 y3 y4 = x1 x2 x3 x4 x5; performs four separate regression analyses of the variables y1, y2, y3 and y4 on the set of variables x1, x2, x3, x4, x5.
Polynomial models can be fitted by using independent variables in the model as x1=x, x2=x**2, x3=x**3, and so on, depending upon the order of the polynomial to be fitted. From a variable, several other variables can be generated in a DATA step before the model statement, and the transformed variables can be used in the model statement. For example, LY = LOG(Y) and LX = LOG(X) give logarithms of Y and X respectively to the base e, and LOG10(Y) and LOG10(X) give logarithms of Y and X respectively to the base 10.
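The generation of polynomial and logarithmic terms might look like the following sketch. The input data set RAW, the output data set POLY and the model labels CUBIC and LOGLOG are hypothetical names chosen for illustration.

```sas
/* Hypothetical input data set RAW with variables X and Y (both positive) */
data poly;
   set raw;
   x1 = x;
   x2 = x**2;        /* quadratic term        */
   x3 = x**3;        /* cubic term            */
   ly = log(y);      /* logarithm to base e   */
   lx = log(x);
   l10y = log10(y);  /* logarithm to base 10  */
run;

proc reg data=poly;
   cubic:  model y  = x1 x2 x3;   /* third-order polynomial in X */
   loglog: model ly = lx;         /* log-log model               */
run;
quit;
```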
A TEST statement after the model statement can be utilized to test hypotheses on individual parameters or on any linear function(s) of the parameters.
For example, to test the equality of the coefficients of x1 and x2 in the regression model y = β0 + β1 x1 + β2 x2, the following statement may be used:
TEST1: TEST x1 - x2 = 0;
The general form is
Label: TEST equation <, ..., equation>;
The fitted model can be changed by using a separate model statement or by using the DELETE variables; or ADD variables; statements.
PROC REG provides two types of sums of squares, obtained by the SS1 and SS2 options under the model statement. Type I SS are sequential sums of squares and Type II SS are partial sums of squares; the two are the same for the variable that is fitted last in the model.
For most applications, the desired test for a single parameter is based on the Type II sum of squares, which is equivalent to the t-test for the parameter estimate. The Type I sums of squares, however, are useful if there is a need for a specific sequencing of tests on individual coefficients, as in polynomial models.
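Both types of sums of squares, together with a labelled test of a linear hypothesis, can be requested as in this sketch. The data set REGDATA, the variables Y, X1 and X2, and the label EQUAL_B are hypothetical.

```sas
/* Hypothetical data set REGDATA: response Y, regressors X1 and X2 */
proc reg data=regdata;
   model y = x1 x2 / ss1 ss2;      /* sequential and partial sums of squares */
   equal_b: test x1 - x2 = 0;      /* H0: coefficients of X1 and X2 are equal */
run;
quit;
```

Since X2 is fitted last here, its Type I and Type II sums of squares should coincide, while those for X1 will generally differ unless X1 and X2 are uncorrelated.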
PROC ANOVA and PROC GLM are general purpose procedures that can be used for a broad range of data classifications. In contrast, PROC NESTED is a specialized procedure that is useful only for nested classifications. It provides estimates of the components of variance using the analysis of variance method of estimation. The CLASS statement in PROC NESTED has a broader purpose than it does in PROC ANOVA and PROC GLM; it encompasses the purpose of the MODEL statement as well. But the data must be sorted appropriately. For example, suppose microbial counts are made in a laboratory study whose objective is to assess the sources of variation in the number of microbes. For this study n1 packages of the test material are purchased and n2 samples are drawn from each package, i.e. samples are nested within packages. Suppose a logarithmic transformation is to be used for the microbial counts. The proper SAS statements are:
PROC SORT;
BY package sample;
run;
PROC NESTED;
CLASS package sample;
VAR logcount;
run;
The corresponding PROC GLM statements are
PROC GLM;
CLASS package sample;
MODEL logcount = package sample(package);
run;
The F-statistic in the basic PROC GLM output is not necessarily correct when effects are random. To get the correct error term, a RANDOM statement listing all random effects in the model is used together with its TEST option. For fixed effect models, however, the same arguments for proper error terms hold as in PROC GLM and PROC ANOVA. For the analysis of data using a linear mixed effects model, PROC MIXED of SAS should be used. The best linear unbiased predictors of the random effects can be obtained by using the option ‘s’ in the RANDOM statement, and the solutions for the fixed effects by using the same option in the MODEL statement.
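For the microbial count example, the RANDOM statement in PROC GLM and the corresponding PROC MIXED analysis might be sketched as follows. The data set name MICROBE is hypothetical, and the sketch assumes that more than one count is recorded per sample so that a residual term remains.

```sas
/* Hypothetical data set MICROBE: PACKAGE, SAMPLE, LOGCOUNT;           */
/* assumes several counts per sample within each package.              */
proc glm data=microbe;
   class package sample;
   model logcount = package sample(package);
   random package sample(package) / test;   /* F-tests with proper error terms */
run;
quit;

proc mixed data=microbe;
   class package sample;
   model logcount = / solution;                 /* fixed part: intercept only   */
   random package sample(package) / solution;   /* predictors of random effects */
run;
```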
PROCEDURES FOR SURVEY DATA ANALYSIS
The PROC SURVEYMEANS procedure produces estimates of population means and totals from sample survey data. You can use PROC SURVEYMEANS to compute the following statistics:
estimates of population means, with corresponding standard errors and t tests
estimates of population totals, with corresponding standard deviations and t tests
estimates of proportions for categorical variables, with standard errors and t tests
ratio estimates of population means and proportions, and their standard errors
confidence limits for population means, totals, and proportions
data summary information
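A call that produces several of the statistics listed above might be sketched as follows. The data set SURVEY and the variables REGION, VILLAGE, SAMPWT and INCOME are hypothetical, and a stratified cluster design is assumed purely for illustration.

```sas
/* Hypothetical data set SURVEY from a stratified cluster sample */
proc surveymeans data=survey mean sum clm;
   strata region;      /* design strata                  */
   cluster village;    /* primary sampling units         */
   weight sampwt;      /* sampling weights               */
   var income;         /* analysis variable              */
run;
```

The keywords MEAN, SUM and CLM request the estimated mean, the estimated total and confidence limits for the mean; other statistics can be requested by adding the corresponding keywords to the PROC statement.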
PROC SURVEYFREQ procedure produces one-way to n-way frequency and crosstabulation tables from sample survey data. These tables include estimates of population totals, population proportions (overall proportions, and also row and column proportions), and corresponding standard errors. Confidence limits, coefficients of variation, and design effects are also available. The