Stepwise regression


In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure.[1][1][1] Usually, this takes the form of a sequence of F-tests, but other techniques are possible, such as t-tests, adjusted R-squared, the Akaike information criterion, the Bayesian information criterion, Mallows' Cp, or the false discovery rate.

[Figure: Stepwise.jpg]
In this example from engineering, necessity and sufficiency are usually determined by F-tests. For additional consideration, when planning an experiment, computer simulation, or scientific survey to collect data for this model, one must keep in mind the number of parameters, P, to estimate and adjust the sample size accordingly. For K variables, P = 1 (Start) + K (Stage I) + (K^2 - K)/2 (Stage II) + 3K (Stage III) = 0.5K^2 + 3.5K + 1. For K < 17, an efficient design of experiments exists for this type of model: a Box-Behnken design,[1] augmented with positive and negative axial points of length min(2, sqrt(int(1.5 + K/4))), plus point(s) at the origin. There are more efficient designs, requiring fewer runs, even for K > 16.
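
The parameter count and axial length quoted above can be checked numerically. The following is a minimal sketch; the function names `parameter_count` and `axial_length` are hypothetical and chosen only for illustration.

```python
import math

def parameter_count(k):
    """P = 1 (Start) + K (Stage I) + (K^2 - K)/2 (Stage II) + 3K (Stage III)."""
    # k*k - k is always even, so integer division is exact.
    return 1 + k + (k * k - k) // 2 + 3 * k

def axial_length(k):
    """Axial point length min(2, sqrt(int(1.5 + K/4))) as quoted in the text."""
    return min(2.0, math.sqrt(int(1.5 + k / 4)))

for k in (3, 5, 10, 16):
    # The stage-by-stage sum should agree with the closed form 0.5K^2 + 3.5K + 1.
    closed_form = 0.5 * k * k + 3.5 * k + 1
    print(k, parameter_count(k), closed_form, axial_length(k))
```

For example, K = 3 gives 1 + 3 + 3 + 9 = 16 parameters, which matches 0.5(9) + 3.5(3) + 1 = 16.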

The main approaches are:

a) Forward selection, which involves starting with no variables in the model, trying out the variables one by one and including them if they are 'statistically significant'.

b) Backward selection, which involves starting with all candidate variables and testing them one by one for statistical significance, deleting any that are not significant.

c) Methods that are a combination of the above, testing at each stage for variables to be included or excluded.
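
As a rough illustration of approach (a), the sketch below runs forward selection with a partial F-test on plain Python lists. It is one possible reading of the description, not a reference implementation: the names `rss` and `forward_select` are hypothetical, and the F-to-enter threshold of 4.0 is an assumed, illustrative cutoff rather than a value taken from the text.

```python
import random

def rss(columns, y):
    """Residual sum of squares from an OLS fit of y on an intercept plus `columns`."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y]; assumes X'X is non-singular.
    A = [[sum(row[r] * row[c] for row in X) for c in range(p)]
         + [sum(row[r] * yi for row, yi in zip(X, y))] for r in range(p)]
    for k in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                factor = A[r][k] / A[k][k]
                A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def forward_select(candidates, y, f_to_enter=4.0):
    """candidates: dict of name -> list of values. Returns names in order of entry."""
    n = len(y)
    selected = []
    current = rss([], y)  # intercept-only model
    while len(selected) < len(candidates):
        best_name, best_f, best_rss = None, f_to_enter, None
        for name in candidates:
            if name in selected:
                continue
            trial = rss([candidates[v] for v in selected + [name]], y)
            df = n - len(selected) - 2  # residual df: n minus intercept and slopes
            f = (current - trial) / (trial / df)  # partial F for adding `name`
            if f > best_f:
                best_name, best_f, best_rss = name, f, trial
        if best_name is None:  # no remaining candidate clears the threshold
            break
        selected.append(best_name)
        current = best_rss
    return selected

# Synthetic data (an assumption for the demo): y depends on x1 and x2 only.
rng = random.Random(0)
data = {name: [rng.gauss(0, 1) for _ in range(40)] for name in ("x1", "x2", "x3")}
y = [3 * a - 2 * b + rng.gauss(0, 0.5) for a, b in zip(data["x1"], data["x2"])]
print(forward_select(data, y))
```

Backward selection, approach (b), would start from the full model and repeatedly drop the variable with the smallest partial F until every remaining variable exceeds a removal threshold.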

A widely used algorithm was proposed by Efroymson (1960).[1] It is an automatic procedure for statistical model selection in cases where there are a large number of potential explanatory variables and no underlying theory on which to base the model selection. The procedure is used primarily in regression analysis, though the basic approach is applicable in many forms of model selection. It is a variation on forward selection: at each stage in the process, after a new variable is added, a test is made to check whether some variables can be deleted without appreciably increasing the residual sum of squares (RSS). The procedure terminates when the selection criterion is (locally) maximized, or when the available improvement falls below some critical value.
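
The add-then-check-for-deletion loop described above can be sketched as follows. This follows the prose description only and is not claimed to reproduce Efroymson's original algorithm; the names `rss` and `stepwise`, the thresholds `f_enter=4.0` and `f_remove=3.9`, and the `max_steps` guard against cycling are all assumptions made for the illustration.

```python
import random

def rss(columns, y):
    """Residual sum of squares from an OLS fit of y on an intercept plus `columns`."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y]; assumes X'X is non-singular.
    A = [[sum(row[r] * row[c] for row in X) for c in range(p)]
         + [sum(row[r] * yi for row, yi in zip(X, y))] for r in range(p)]
    for k in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                factor = A[r][k] / A[k][k]
                A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def stepwise(candidates, y, f_enter=4.0, f_remove=3.9, max_steps=50):
    """Forward selection with a deletion check after each addition."""
    n = len(y)
    cols = lambda names: [candidates[v] for v in names]
    selected = []
    for _ in range(max_steps):
        # Forward step: add the candidate with the largest partial F above f_enter.
        current = rss(cols(selected), y)
        best, best_f = None, f_enter
        for name in candidates:
            if name not in selected:
                trial = rss(cols(selected + [name]), y)
                f = (current - trial) / (trial / (n - len(selected) - 2))
                if f > best_f:
                    best, best_f = name, f
        if best is None:  # available improvement is below the critical value
            break
        selected.append(best)
        # Backward check: drop variables whose removal barely increases the RSS.
        while len(selected) > 1:
            full = rss(cols(selected), y)
            df = n - len(selected) - 1
            worst, worst_f = None, f_remove
            for name in selected:
                reduced = rss(cols([v for v in selected if v != name]), y)
                f = (reduced - full) / (full / df)
                if f < worst_f:
                    worst, worst_f = name, f
            if worst is None:
                break
            selected.remove(worst)
    return selected

# Synthetic data (an assumption for the demo): y depends on x1 and x2 only.
rng = random.Random(0)
data = {name: [rng.gauss(0, 1) for _ in range(40)] for name in ("x1", "x2", "x3")}
y = [3 * a - 2 * b + rng.gauss(0, 0.5) for a, b in zip(data["x1"], data["x2"])]
print(stepwise(data, y))
```

Keeping `f_remove` slightly below `f_enter` means a variable that has just entered cannot be removed in the same step, which is one simple way to reduce the risk of the procedure oscillating.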

Stepwise regression procedures are used in data mining, but are controversial. Several points of criticism have been made.

1. A sequence of F-tests is often used to control the inclusion or exclusion of variables, but these are carried out on the same data and so there will be problems of multiple comparisons for which many correction criteria have been developed.

2. It is difficult to interpret the p-values associated with these tests, since each is conditional on the previous tests of inclusion and exclusion (see "dependent tests" in false discovery rate).

3. The tests themselves are biased, since they are based on the same data (Rencher and Pun, 1980; Copas, 1983).[1][1] Wilkinson and Dallal (1981)[1] computed percentage points of the multiple correlation coefficient by simulation and showed that a final regression obtained by forward selection, said by the F-procedure to be significant at 0.1%, was in fact only significant at 5%.
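
The effect behind these criticisms can be seen in a small simulation, sketched here under stated assumptions. The response and all candidate predictors are pure noise, the best single predictor is chosen as in the first step of forward selection, and its F statistic is compared with the nominal 5% critical value of F(1, 28), taken here as approximately 4.196 from standard tables. The function names and the sample size, candidate count, and trial count are illustrative choices, and this is not the simulation of Wilkinson and Dallal.

```python
import random

def best_f_statistic(n, m, rng):
    """Largest single-predictor F statistic among m pure-noise candidates."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    y_mean = sum(y) / n
    syy = sum((v - y_mean) ** 2 for v in y)
    best = 0.0
    for _ in range(m):
        x = [rng.gauss(0, 1) for _ in range(n)]
        x_mean = sum(x) / n
        sxx = sum((v - x_mean) ** 2 for v in x)
        sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
        r2 = sxy * sxy / (sxx * syy)
        # For simple regression, F = r^2 (n - 2) / (1 - r^2).
        best = max(best, r2 * (n - 2) / (1 - r2))
    return best

def false_positive_rate(n=30, m=10, trials=2000, critical=4.196, seed=0):
    """Share of trials in which the selected noise variable looks 'significant'."""
    rng = random.Random(seed)
    hits = sum(best_f_statistic(n, m, rng) > critical for _ in range(trials))
    return hits / trials

print(false_positive_rate())
```

With 10 independent noise candidates, the chance that at least one clears a nominal 5% test is 1 - 0.95^10, about 0.40, so the simulated rate should land far above the nominal 0.05.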

Critics regard the procedure as a paradigmatic example of data dredging, with intense computation often being an inadequate substitute for subject-area expertise.
