Abstract
A new technique is devised to mitigate the errors-in-variables bias in linear regression. The procedure mimics two-stage least squares: an auxiliary regression generates a better-behaved predictor variable, which is then substituted for the error-prone variable in the first-stage model. The performance of the algorithm is tested by simulation and regression analyses. Simulations suggest that the algorithm efficiently captures the additive error term used to contaminate the artificial variables. The regressions lend further support to the simulations, clearly showing that the compact genetic algorithm-based estimate of the true but unobserved regressor yields considerably better results. These conclusions are robust across different sample sizes and different variance structures imposed on both the measurement error and the regression disturbances.
Notes
1. More generally, the CGA is also a member of the Estimation of Distribution Algorithms (EDAs). EDAs can be viewed as GAs whose search space is described by a single probability vector instead of an entire population.
2. In particular, Frisch [Citation12] makes use of the analysis framework he develops to discuss, among others, the EIVs problem.
3. See, for example, Greene [Citation1].
4. Reviewing the downward bias in estimating the coefficient on the return to schooling, Card [Citation28] reports that the attenuation in the coefficient estimates ranges from about 25% to 33%.
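The attenuation bias discussed in this note can be illustrated with a short simulation. The sketch below (in Python, for illustration only; the paper's own simulations use R) contaminates a regressor with additive measurement error and shows that the OLS slope shrinks toward zero by the classical factor var(X)/(var(X)+var(u)); the variance values chosen here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n, beta1 = 100_000, 1.0

# True regressor, additive measurement error, and the error-prone observed regressor
x_true = rng.normal(0.0, 1.0, n)
u = rng.normal(0.0, 0.7, n)          # measurement error (hypothetical variance)
x_obs = x_true + u
y = beta1 * x_true + rng.normal(0.0, 1.0, n)

def ols_slope(x, y):
    """Simple-regression OLS slope via the covariance formula."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Classical attenuation factor: var(X) / (var(X) + var(u)) = 1 / 1.49 ~= 0.671
lam = 1.0 / (1.0 + 0.7**2)
print(round(ols_slope(x_obs, y), 2))   # close to lam * beta1, i.e. biased toward zero
print(round(ols_slope(x_true, y), 2))  # close to the true beta1 = 1.0
```

Replacing `x_obs` with a cleaner proxy, as the CGA-based procedure aims to do, moves the estimated slope back toward the true value.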
5. Furthermore, the IV itself may turn out to be correlated not only with the endogenous regressor but also with the dependent variable, in which case the instrument can no longer be viewed as exogenous.
6. In passing, note that the term ‘probability vector’ used therein does not refer to a conventional probability vector, whose individual elements must sum to 1. The vector used here is instead a vector of probabilities in which each element gives the probability that the corresponding gene of the chromosome takes the value 1.
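The role of this vector of per-gene probabilities can be made concrete with a minimal compact GA sketch. This is a generic textbook-style cGA on a toy one-max objective, not the paper's own implementation (which is available in the `eive` R package); the population size and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cga(fitness, n_genes, pop_size=50, iters=2000):
    """Compact GA: a single vector of per-gene probabilities replaces a population."""
    p = np.full(n_genes, 0.5)  # p[i] = probability that gene i equals 1
    for _ in range(iters):
        # Sample two candidate chromosomes from the probability vector
        a = (rng.random(n_genes) < p).astype(int)
        b = (rng.random(n_genes) < p).astype(int)
        winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
        # Nudge probabilities toward the winner wherever the two candidates disagree
        p = np.clip(p + (winner - loser) / pop_size, 0.0, 1.0)
    return (p > 0.5).astype(int)

# Toy objective (one-max): the cGA should drive every probability toward 1
best = cga(lambda c: c.sum(), n_genes=20)
print(best.sum())
```

Because only the probability vector is stored and updated, the memory footprint is independent of the notional population size.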
7. As noted previously, the CGA is but one member of a broader family of Evolutionary Algorithms, the Estimation of Distribution Algorithms (EDAs) [Citation23]. We take this point into account in subsequent analyses and verify our results with further checks.
8. Henceforth, N(μ, σ) refers to the normal distribution, where μ denotes the mean and σ the standard deviation.
9. Simulations are performed on a PC with an Intel Core i5 2.27 GHz CPU and 4 GB of RAM. Average CPU times are 0.32, 0.39 and 0.54 s for n=30, 50 and 100, respectively. We used the R software to run the simulations. The package eive (version 2.1) implements our algorithm and can be downloaded from the CRAN repository.
10. We thank the referee for his suggestion to report the relative bias and the relative MSE.
11. The relative bias and relative MSE also help in interpreting the accuracy of the results obtained from the CGA-based technique. Indeed, if either the rBias or the rMSE is less than unity, this translates into a weaker performance of the estimator with respect to OLS, and vice versa.
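The relative metrics in this note can be sketched as simple ratios over simulation replications. The orientation assumed below (OLS in the numerator, so that values below 1 indicate a weaker CGA-based estimator, consistent with the note's reading) and the replication values are illustrative assumptions, not the paper's reported numbers.

```python
import numpy as np

def relative_metrics(est_cga, est_ols, beta_true):
    """Relative bias and relative MSE of the CGA-based estimator vs. OLS.

    Assumed orientation: OLS in the numerator, so a ratio below 1 means the
    CGA-based estimator performs worse than OLS, and above 1 means better.
    """
    est_cga, est_ols = np.asarray(est_cga, float), np.asarray(est_ols, float)
    bias = lambda e: abs(e.mean() - beta_true)
    mse = lambda e: np.mean((e - beta_true) ** 2)
    return bias(est_ols) / bias(est_cga), mse(est_ols) / mse(est_cga)

# Hypothetical replications with true beta = 1: CGA estimates centered near 1,
# OLS estimates attenuated toward zero
rb, rm = relative_metrics([0.97, 1.02, 0.99], [0.70, 0.74, 0.68], 1.0)
print(rb > 1, rm > 1)  # True True
```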
12. Moreover, even if the regressions have weaker explanatory power in some cases, the CGA still provides estimates with smaller variances than the MC method. We have not reported the variance values in our tables to keep them readable.
13. We thank the referee for pointing this issue out.
14. We have also performed these additional simulations using other configurations of the two error terms, as provided in Tables –. The results did not change considerably.
15. This conjecture draws mainly on the case where one exploits ‘distributional knowledge’ about the error-prone variable. In particular, if it is known that X is not normal, this knowledge can be used as if an ‘IV’ were available and, consequently, a linear estimator for β1 can be derived (see, for instance, [Citation6, pp. 72–73]).
16. For simplicity, we assume that X1 and X2 are independent.
17. In particular, the correlation between the CGA-generated variable and the clean X is as high as 97%.
18. The p-value of the F-test is 0.0016.
19. Note that we do not know the real value of β2, and we assume that the obtained estimate is an unbiased estimator of β2.
20. We acknowledge the referee's proposition to use a different nomenclature for the R2.
21. These values correspond, obviously, to those reported in Panel A of Table .
22. Concerning IV regressions, for instance, finding good instruments remains a difficult task.