# How does JMP compare to R?

## Analysis of variance and Tukey's test in SAS, R and JMP - the scaling of the explanatory variables is of enormous importance

Transcript

An analysis of variance and Tukey's test are performed, and it is shown how much importance the scaling of the explanatory variable, i.e. the factor in the model, has. In SAS, the evaluation is generally determined by the choice of procedure: the factor variable of the analysis-of-variance model must appear in the CLASS statement and can be of either character or numeric type. For the evaluation with R using the function aov, the explanatory variable must be declared as a factor; if it is not, and an analysis of variance is attempted anyway, an error message is issued when the Tukey test is carried out. For the analysis of variance and Tukey's test in JMP, the factor may be numeric, but its modeling type must be nominal or ordinal. If it is continuous, a regression analysis is carried out instead of the analysis of variance, and multiple tests are then issued without any warning or error message.

Keywords: scaling, analysis of variance, Tukey test, SAS, R, JMP

## 1 Preliminary remarks and objectives

It is basic knowledge that the properties of the variables, especially their scaling (character; numeric: continuous, ordinal or nominal), are decisive for the choice of statistical analysis. If this is forgotten or overlooked when using software, it can lead to incorrect evaluations and conclusions. A simple analysis of variance and Tukey's test are to be carried out in SAS, R and JMP with the data from Example 1 [1]. The scaling of the variable group is of particular importance. The data are:

```
group   y
1      15
1      17
1      19
2      17
2      20
2      23
3      22
3      25
3      27
3      30
```
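Before turning to the three packages, the one-way ANOVA quantities for these data can be verified by hand. The following Python sketch is not part of the paper; it simply computes the sums of squares and the F value that each package should report.

```python
# Hand computation of the one-way ANOVA quantities for Example 1
# (a verification sketch, independent of SAS, R and JMP).
groups = {1: [15, 17, 19], 2: [17, 20, 23], 3: [22, 25, 27, 30]}

all_y = [y for ys in groups.values() for y in ys]
n = len(all_y)                      # 10 observations
grand_mean = sum(all_y) / n         # 21.5

# between-group (model) and within-group (error) sums of squares
ss_between = sum(len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
                 for ys in groups.values())
ss_within = sum((y - sum(ys) / len(ys)) ** 2
                for ys in groups.values() for y in ys)

df_model, df_error = len(groups) - 1, n - len(groups)   # 2 and 7
f_value = (ss_between / df_model) / (ss_within / df_error)

print(round(ss_between, 1), round(ss_within, 1), round(f_value, 4))
# prints: 148.5 60.0 8.6625
```

These are exactly the values that appear in the analysis-of-variance tables of all three packages below.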

E. Moll, D. Gabriel

## 2 Analysis of variance and Tukey's test in SAS, R and JMP

### 2.1 SAS 9.4

SAS provides several procedures that are used depending on the model properties and on the objective of the analysis of variance and the Tukey test. Graphics are generated automatically with the help of the ODS GRAPHICS statement. The variable group, the factor in the analysis-of-variance model, is numeric. The following program provides the results for the Tukey test at the significance level α = 5%.

```
DATA example1;
  INPUT group y;
  CARDS;
1 15
1 17
1 19
2 17
2 20
2 23
3 22
3 25
3 27
3 30
;
ODS GRAPHICS ON / reset = all imagefmt = emf;
PROC GLM DATA = example1;
  CLASS group;
  MODEL y = group / ss3;
  LSMEANS group / adjust = tukey cl;
RUN;
ODS GRAPHICS OFF;
```

The text output of the results is limited here to the analysis-of-variance table, the mean comparisons and the confidence intervals; two graphics are created.

```
Dependent Variable: y

                           Sum of
Source            DF      Squares    Mean Square   F Value   Pr > F
Model              2     148.5000        74.2500      8.66   0.0128
Error              7      60.0000         8.5714
Corrected Total    9     208.5000

Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer

                    LSMEAN
group   y LSMEAN    Number
1        17.0000         1
2        20.0000         2
3        26.0000         3
```

Poster

The output continues with the matrix of Tukey-Kramer adjusted p-values (Pr > |t| for H0: LSMean(i) = LSMean(j)); only the comparison of groups 1 and 3 is significant at the 5% level (p = 0.0122). The confidence limits for the LS means and the simultaneous confidence limits for their differences are:

```
group    y LSMEAN    95% Confidence Limits
1         17.0000      13.0031    20.9969
2         20.0000      16.0031    23.9969
3         26.0000      22.5385    29.4615

Least Squares Means for Effect group
                 Difference       Simultaneous 95%
                    Between       Confidence Limits for
i   j                 Means       LSMean(i) - LSMean(j)
1   2               -3.0000       -10.0398     4.0398
1   3               -9.0000       -15.5852    -2.4148
2   3               -6.0000       -12.5852     0.5852
```

The first graphic (Fig. 1) illustrates the position of the mean values for each group.

Figure 1: Mean values for each group

the second (Fig. 2) the significance decisions for the differences.

Figure 2: Significance decisions for the differences

In SAS, the evaluation method is generally determined by the choice of procedure. Through the MODEL statement in conjunction with the CLASS statement, a variable (in the example, the variable group) becomes a factor in the analysis-of-variance model. For this reason the variable group can be of either character or numeric type. If a (numeric) variable appears on the right-hand side of the MODEL statement but is missing from the CLASS statement, it becomes a covariate in the linear model. If it is the only variable in the model, it becomes the regressor, and a regression analysis is calculated instead of the analysis of variance:

```
Dependent Variable: y

                           Sum of
Source            DF      Squares    Mean Square   F Value   Pr > F
Model              1     143.8043       143.8043     17.78   0.0029
Error              8      64.6957         8.0870
Corrected Total    9     208.5000
```

```
R-Square    Coeff Var    Root MSE    y Mean
  0.6897      13.2268      2.8438   21.5000

Source   DF   Type III SS   Mean Square   F Value   Pr > F
group     1      143.8043      143.8043     17.78   0.0029

                          Standard
Parameter    Estimate        Error   t Value   Pr > |t|
Intercept     11.9130       2.4449      4.87     0.0012
group          4.5652       1.0826      4.22     0.0029
```

The mean comparison formulated in the LSMEANS statement is not carried out; the error message indicates that the classification variable is missing.

### 2.2 R

#### The variable group as a factor (function aov)

The tasks of the SAS procedures are taken on by special function calls in R; several packages are available for multiple mean comparisons. To use the aov function, the variable group in the example is read in as a factor.

```r
example1 <- data.frame(group = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)),
                       y = c(15, 17, 19, 17, 20, 23, 22, 25, 27, 30))
mod <- aov(y ~ group, data = example1)
summary(mod)
```

```
            Df Sum Sq Mean Sq F value Pr(>F)
group        2  148.5   74.25   8.662 0.0128 *
Residuals    7   60.0    8.57
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
TukeyHSD(mod)
```

```
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = y ~ group, data = example1)

$group
    diff      lwr      upr   p adj
2-1    3  -4.0398  10.0398
3-1    9   2.4148  15.5852  0.0122
3-2    6  -0.5852  12.5852
```

The adjusted p-values of the comparisons 2-1 and 3-2 lie above 0.05; their simultaneous confidence intervals contain zero.
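The simultaneous intervals can be checked by hand. The Python sketch below is not from the paper; it applies the Tukey-Kramer construction with the critical value 2.94498 (the studentized-range quantile divided by the square root of 2 for 3 groups and 7 error df, as reported in the JMP output later in the paper).

```python
import math

# Hand check of the Tukey-Kramer simultaneous 95% intervals for the
# three pairwise mean differences (a verification sketch).
means = {1: 17.0, 2: 20.0, 3: 26.0}
sizes = {1: 3, 2: 3, 3: 4}
mse, crit = 60.0 / 7, 2.94498        # error mean square, critical value

intervals = {}
for i, j in [(2, 1), (3, 1), (3, 2)]:
    diff = means[i] - means[j]
    # Kramer adjustment for unequal group sizes
    se = math.sqrt(mse * (1 / sizes[i] + 1 / sizes[j]))
    half = crit * se
    intervals[f"{i}-{j}"] = (round(diff - half, 4), round(diff + half, 4))

print(intervals["3-1"])   # (2.4148, 15.5852): excludes 0, significant
print(intervals["2-1"])   # (-4.0398, 10.0398): contains 0, not significant
```

Only the interval for 3-1 excludes zero, in agreement with the TukeyHSD output.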

```r
plot(TukeyHSD(mod))
```

This plot instruction creates an illustration (Fig. 3) of the significance decisions.

Figure 3: Significance decisions for the differences

In addition, the Tukey test can be carried out with the HSD.test function from the agricolae package.

```r
library(agricolae)
out <- HSD.test(mod, "group")
out
```

The object out contains the components $statistics (with Mean 21.5 and MSerror 8.5714, among others), $parameters (with Df 7, ntr 3, the StudentizedRange quantile, alpha and the test name Tukey), $means (mean, standard deviation, replication, minimum and maximum per group) and $comparison (NULL).
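HSD.test summarizes the pairwise decisions by letters: groups that share a letter do not differ significantly. Independently of agricolae (this is my own sketch of the idea, not the package's actual implementation), the display can be derived from the pairwise decisions; here the significance pattern established above (only 3 vs 1 significant) is hard-coded.

```python
# Sketch of the sweep that produces a letter display: walk the means in
# descending order; each maximal run of groups whose pairwise differences
# are all non-significant gets one shared letter.
order = [3, 2, 1]                        # groups sorted by descending mean
significant = {frozenset({3, 1})}        # only 3 vs 1 differs (from above)

runs, start = [], 0
for end in range(len(order)):
    # shrink the run until every pair inside it is non-significant
    while any(frozenset({order[k], order[end]}) in significant
              for k in range(start, end)):
        start += 1
    runs.append((start, end))

# keep only maximal runs and assign letters a, b, c, ...
maximal = [r for r in runs
           if not any(o[0] <= r[0] and r[1] <= o[1] and o != r for o in runs)]
letters = {g: "" for g in order}
for letter, (s, e) in zip("abc", maximal):
    for g in order[s:e + 1]:
        letters[g] += letter
print(letters)   # {3: 'a', 2: 'ab', 1: 'b'}
```

This reproduces the pattern a / ab / b shown in the $groups component.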

The grouping of the means by letters is reported in the component $groups:

```
$groups
  trt means  M
1   3    26  a
2   2    20 ab
3   1    17  b
```

```r
bar.group(out$groups, ylim = c(0, 45), density = 4, border = "blue")
```

This instruction provides a bar chart of the mean values with the significance decisions indicated by letters (Fig. 4).

Figure 4: Significance decisions using letters

#### The variable group as a numeric variable (function aov)

If the variable group is declared numerically as group2 and the function aov is used as before, a different analysis-of-variance table is output, and the Tukey test ends with an error message.

```r
example1$group2 <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
mod2 <- aov(y ~ group2, data = example1)
summary(mod2)
```

```
            Df Sum Sq Mean Sq F value  Pr(>F)
group2       1  143.8  143.80   17.78 0.00293 **
Residuals    8   64.7    8.09
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
TukeyHSD(mod2)
```

```
Error in TukeyHSD.aov(mod2) : no factors in the fitted model
In addition: Warning message:
In replications(paste("~", xx), data = mf) : non-factors ignored: group2
```

The error message states that group2, the numerically declared variable group, is not recognized as a factor for the Tukey test. It remains to clarify what kind of analysis has actually been calculated. The function lm is used for this:

```r
mod2 <- lm(y ~ group2, data = example1)
summary(mod2)
```

```
Residuals:
    Min      1Q  Median      3Q     Max
-4.0435 -1.3696 -0.0435  1.8152  4.3913

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.9130     2.4449    4.87   0.0012 **
group2        4.5652     1.0826    4.22   0.0029 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.844 on 8 degrees of freedom
Multiple R-squared:  0.6897,	Adjusted R-squared:  0.6509
F-statistic: 17.78 on 1 and 8 DF,  p-value: 0.002927
```

```r
anova(mod2)
```

```
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value   Pr(>F)
group2     1 143.804 143.804  17.782 0.002927 **
Residuals  8  64.696   8.087
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

This analysis-of-variance table is the same as the one above for group2: it is the fit of a linear regression model. As with the SAS GLM procedure without the CLASS statement, the analysis performed is therefore a regression analysis, and as in SAS, the estimates of the regression coefficients of the linear function y = f(group2) are calculated. For the task at hand, the calculation of the simple analysis of variance and the Tukey test, this means that the explanatory variable group must be declared as a factor.
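The fit that aov and lm compute for the numeric variable can be verified by hand, and the same sketch shows where the differing degrees of freedom come from. This Python check is not from the paper; the treatment (dummy) coding mirrors R's default contr.treatment.

```python
# Hand verification of the regression fit for the numeric variable, and
# of the model df under factor vs numeric coding (a sketch).
x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
y = [15, 17, 19, 17, 20, 23, 22, 25, 27, 30]
n = len(x)
mx, my = sum(x) / n, sum(y) / n                             # 2.1, 21.5

sxx = sum((xi - mx) ** 2 for xi in x)                       # 6.9
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))    # 31.5
b1 = sxy / sxx                                              # slope 4.5652
b0 = my - b1 * mx                                           # intercept 11.9130
ss_model = sxy ** 2 / sxx                                   # 143.8043, 1 df

# a 3-level factor enters the design matrix as two dummy columns
# (2 model df); the numeric variable enters as one column (1 model df)
X_factor = [[1, int(g == 2), int(g == 3)] for g in x]       # treatment coding
X_numeric = [[1, g] for g in x]

print(round(b1, 4), round(b0, 4), round(ss_model, 4),
      len(X_factor[0]) - 1, len(X_numeric[0]) - 1)
# prints: 4.5652 11.913 143.8043 2 1
```

The model sum of squares 143.8043 on 1 df is exactly the entry of the regression ANOVA tables in SAS and R above, as opposed to 148.5 on 2 df for the factor model.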

### 2.3 JMP

#### The variable group as a nominal or ordinal variable

When the file of the example is opened or the data are entered directly, the modeling types of the variables are defined automatically: y: continuous, group: continuous. For the analysis of variance and Tukey's test, the numeric variable group must be nominal or ordinal, i.e. the modeling type of the variable group must be changed!

Analysis of variance and multiple mean comparisons are reached via the pull-down menu Analyze > Fit Model. The variable to be evaluated and the effects in the model are assigned there: Y: y and Add: group. The standard output is extensive and contains several graphics; only the analysis-of-variance table is reproduced here.

```
Analysis of Variance
Source     DF   Sum of Squares   Mean Square    F Ratio
Model       2         148.5000       74.2500     8.6625
Error       7          60.0000        8.5714   Prob > F
C. Total    9         208.5000                  0.0128*
```

The red triangle leading to the multiple mean comparisons is in front of the heading Response y. Choosing Estimates leads to the entry Multiple Comparisons. In a separate window, the type of estimator (Least Squares Means Estimates), the effect (preset: group) and the test procedure All Pairwise Comparisons, Tukey HSD can be selected.

```
Multiple Comparisons
Estimates
group   Estimate   Std Error   DF   t Ratio   Prob>|t|   Lower 95%   Upper 95%
1        17.0000      1.6903    7     10.06    <.0001*     13.0031     20.9969
2        20.0000      1.6903    7     11.83    <.0001*     16.0031     23.9969
3        26.0000      1.4639    7     17.76    <.0001*     22.5385     29.4615

Tukey HSD All Pairwise Comparisons
Quantile = 2.94498, Adjusted DF = 7.0, Adjustment = Tukey-Kramer

All Pairwise Differences
group  -group   Difference   Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
1       2          -3.0000      2.3905     -1.25               -10.0398      4.0398
1       3          -9.0000      2.2361     -4.02    0.0122*    -15.5852     -2.4148
2       3          -6.0000      2.2361     -2.68               -12.5852      0.5852
```
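The Estimates table can be checked by hand: each LS mean is the group mean, its standard error is the square root of MSE divided by the group size, and the t ratio is the estimate divided by its standard error. A verification sketch in Python (not from the paper):

```python
import math

# Hand check of the JMP Estimates table: SE = sqrt(MSE / n_i),
# t Ratio = LS mean / SE (a verification sketch).
mse = 60.0 / 7                                   # error mean square 8.5714
for mean, n_i in [(17.0, 3), (20.0, 3), (26.0, 4)]:
    se = math.sqrt(mse / n_i)
    print(round(se, 4), round(mean / se, 2))
# prints: 1.6903 10.06 / 1.6903 11.83 / 1.4639 17.76 (one pair per line)
```

These standard errors and t ratios match the Estimates table above.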

The graphical representation of the significance decisions (Fig. 5) is included.

Figure 5: Significance decisions for the pairwise comparisons

#### The variable group as a continuous variable

If one forgets to switch the numeric variable group, the factor in the model, from continuous to nominal or ordinal, a regression analysis results. Multiple mean comparisons can still be carried out, and no warning or error message is issued. However, these mean comparisons do not correspond to the objective, because they are based on the regression model. To illustrate this, the same analysis as above is repeated with the continuous variable group. The analysis-of-variance table is the same as in the SAS output when the variable group is not in the CLASS statement, and as in R for the numeric variable group2.

```
Analysis of Variance
Source     DF   Sum of Squares   Mean Square    F Ratio
Model       1         143.8043      143.8043    17.7823
Error       8          64.6957        8.0870   Prob > F
C. Total    9         208.5000                  0.0029*
```

Among other things, the estimated regression coefficients are also output:

```
Parameter Estimates
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept    11.9130      2.4449      4.87    0.0012*
group         4.5652      1.0826      4.22    0.0029*
```

If one now also selects Estimates and Multiple Comparisons via the red triangle in front of the heading Response y, a Tukey test can be obtained here as well. Its results differ from the Tukey test of the analysis of variance:

```
Multiple Comparisons
Estimates
group   Estimate   Std Error   DF   t Ratio   Prob>|t|   Lower 95%   Upper 95%
1        16.4783      1.4923    8     11.04    <.0001*     13.0370     19.9195
2        21.0435      0.9058    8     23.23    <.0001*     18.9548     23.1322
3        25.6087      1.3259    8     19.31    <.0001*     22.5511     28.6662

Tukey HSD All Pairwise Comparisons
Quantile = 2.85742, Adjusted DF = 8.0, Adjustment = Tukey-Kramer

All Pairwise Differences
group  -group   Difference   Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
1       2          -4.5652      1.0826     -4.22    0.0073*     -7.6587     -1.4718
1       3          -9.1304      2.1652     -4.22    0.0073*    -15.3173     -2.9435
2       3          -4.5652      1.0826     -4.22    0.0073*     -7.6587     -1.4718
```

The graphical representation of the significance decisions (Fig. 6) differs from the one above; the decisions appear shifted relative to Fig. 5. The cause is quickly found: it lies in the variances and degrees of freedom.

Figure 6: Significance decisions for the pairwise comparisons in the regression analysis
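The striking feature of this table, all three t ratios equal to 4.22, can be explained by hand: under the regression model the three group "estimates" lie on a straight line, so every pairwise difference is a multiple of the slope, and difference and standard error scale by the same factor. A Python sketch (not from the paper) that recomputes the slope and its standard error from the example data:

```python
# Under the continuous (regression) model every pairwise difference of
# fitted group values is a multiple of the slope, so all t ratios
# coincide: there is really only one test (a verification sketch).
x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
y = [15, 17, 19, 17, 20, 23, 22, 25, 27, 30]
n, mx, my = len(x), 2.1, 21.5

sxx = sum((xi - mx) ** 2 for xi in x)                        # 6.9
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))     # 31.5
b1 = sxy / sxx                                               # slope 4.5652
ss_error = sum((yi - my) ** 2 for yi in y) - sxy ** 2 / sxx  # 64.6957
se_b1 = (ss_error / (n - 2) / sxx) ** 0.5                    # 1.0826

t_ratios = []
for i, j in [(1, 2), (1, 3), (2, 3)]:
    diff = (i - j) * b1            # difference of fitted values
    se = abs(i - j) * se_b1        # its standard error scales identically
    t_ratios.append(round(diff / se, 2))
print(t_ratios)   # prints: [-4.22, -4.22, -4.22]
```

There is therefore only one underlying hypothesis test (slope = 0), dressed up as three pairwise comparisons.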

If the scaling of the variables that are to become factors in the model is not taken into account in JMP, a Tukey test can be carried out with a continuous variable group without any hint that something is wrong. These results are based on the regression analysis and do not correspond to the actual goal, the analysis of variance with the factor group.

## 3 Conclusions

The scaling of the variables is essential for the calculation of statistical measures and the execution of statistical analyses. The simple analysis of variance and the Tukey test were calculated as examples. In SAS, the choice of procedure and the listing of the variables in the CLASS and MODEL statements generally determine the analysis, so that e.g. the variable group becomes a factor; it makes no difference whether this factor is of character or numeric type. For the analysis of variance with R using the function aov with a subsequent Tukey test, the variable group must be declared as a factor; the role of this variable in the analysis-of-variance model is then unambiguous. In JMP, the automatically assigned modeling types of the variables must be checked. It is easiest if the variable that is to become a factor is of character type. If it is numeric and is automatically recognized as continuous, this modeling type must be changed to nominal or ordinal if the variable is to act as the factor in the analysis of variance and Tukey's test. If the variable intended as the factor remains continuously scaled, a regression analysis is carried out instead, and the mean comparisons based on it miss the actual goal of the analysis.

## Literature

[1] Schumacher, E. (2004): Comparison of more than two parameters. In: Moll, E., J. Gröger, M. Liesebach, P.E. Rudolph, T. Stauber and M. Ziller (Eds.): Introduction to Biometry, Issue 3, 2nd Edition, ISBN