Original Articles

Reducing errors-in-variables bias in linear regression using compact genetic algorithms

Pages 3216-3235 | Received 13 Mar 2014, Accepted 31 Aug 2014, Published online: 23 Sep 2014
 

Abstract

A new technique is devised to mitigate the errors-in-variables bias in linear regression. The procedure mimics a two-stage least squares procedure in which an auxiliary regression generates a better-behaved predictor variable. The generated variable is then used as a substitute for the error-prone variable in the first-stage model. The performance of the algorithm is tested by simulation and regression analyses. Simulations suggest that the algorithm efficiently captures the additive error term used to contaminate the artificial variables. The regressions lend further support to the simulations: they show that the compact genetic algorithm-based estimate of the true but unobserved regressor yields considerably better results. These conclusions are robust across different sample sizes and different variance structures imposed on both the measurement error and the regression disturbances.
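The errors-in-variables bias that the abstract refers to can be illustrated with a small simulation. The sketch below is not the authors' procedure or their simulation design: the function names, the parameter values (`beta1 = 2`, unit standard deviations) and the seed are illustrative assumptions. It only shows the standard attenuation result that, with classical additive measurement error, the OLS slope on the contaminated regressor shrinks toward β1·σx²/(σx² + σu²).

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (single regressor plus intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate_attenuation(n=100, reps=500, beta0=1.0, beta1=2.0,
                         sd_x=1.0, sd_u=1.0, sd_eps=1.0, seed=0):
    """Return the average OLS slope using the clean and the noisy regressor.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    clean_slopes, noisy_slopes = [], []
    for _ in range(reps):
        # True (unobserved) regressor and the outcome it generates.
        x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sd_eps) for xi in x_true]
        # Observed regressor: true value plus additive measurement error.
        x_obs = [xi + rng.gauss(0.0, sd_u) for xi in x_true]
        clean_slopes.append(ols_slope(x_true, y))
        noisy_slopes.append(ols_slope(x_obs, y))
    return sum(clean_slopes) / reps, sum(noisy_slopes) / reps


if __name__ == "__main__":
    mean_clean, mean_noisy = simulate_attenuation()
    print("mean slope, clean X:", round(mean_clean, 3))
    print("mean slope, noisy X:", round(mean_noisy, 3))
```

With equal variances for the true regressor and the measurement error, the reliability ratio is 0.5, so the average slope on the noisy regressor should sit near half of `beta1`, while the slope on the clean regressor stays close to `beta1`.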

AMS Subject Classification:

Notes

1. More generally, the CGA is also a member of the Estimation of Distribution Algorithms (EDAs). EDAs can be viewed as GAs whose search space is described by a single probability vector instead of an entire population.

2. In particular, Frisch [Citation12] makes use of the analysis framework he develops to discuss, among others, the EIVs problem.

3. See, for example, Greene [Citation1].

4. Reviewing the downward bias in estimating the coefficient on the return to schooling, Card [Citation28] reports that the attenuation in the coefficient estimates ranges around 25–33%.

5. Furthermore, it can also turn out that the IV itself may be correlated not only with the endogenous regressor, but also with the dependent variable, in which case the instrument will no longer be viewed as exogenous.

6. In passing, note that the term ‘probability vector’ used here does not refer to a conventional probability vector, whose elements must sum to 1. Rather, it is a vector of probabilities in which each element gives the likelihood that the corresponding gene of the chromosome takes the value 1.

7. As noted previously, the CGA is but one member of a broader family of Evolutionary Algorithms called ‘estimation of distribution algorithms’ (EDAs) [Citation23]. Later, we return to this point and check our results with further analyses.

8. Henceforth, N(μ,σ) refers to the normal distribution where μ denotes the mean and σ the standard deviation.

9. Simulations were performed on a PC with an Intel Core i5 2.27 GHz CPU and 4 GB of RAM. Average CPU times are 0.32, 0.39 and 0.54 s for n = 30, 50 and 100, respectively. We used the R software to run the simulations. The package eive (version 2.1) implements our algorithm and can be downloaded from the CRAN repositories.

10. We thank the referee for his suggestion to report the relative bias and the relative MSE.

11. The relative bias and relative MSE also help in interpreting the accuracy of the results obtained from the CGA-based technique. Indeed, if either the rBias or the rMSE is less than unity, this translates into a weaker performance of the estimator relative to OLS, and vice versa.

12. Moreover, even if the regressions have weaker explanatory power in some cases, the CGA still provides estimates with smaller variances than the MC method. We have not reported the variance values in our tables to keep them readable.

13. We thank the referee for pointing this issue out.

14. We have also performed these additional simulations using other configurations of the two error terms as provided in Tables . The results have not been altered considerably.

15. This conjecture mainly draws on the case where one makes use of ‘distributional knowledge’ about the error-prone variable. In particular, if it is known that X is not normal, this knowledge can be used as if an ‘IV’ were available and, consequently, to derive a linear estimator for β1 (see, for instance, [Citation6, pp. 72–73]).

16. For simplicity, we assume that X1 and X2 are independent.

17. In particular, the correlation between $\hat{X}_{CGA}$ and the clean X is up to 97%.

18. The p-value of the F-test is 0.0016.

19. Note that we do not know the real value of β2 and we assume that $\hat{\beta}_2$ is an unbiased estimator of β2.

20. We acknowledge the referee's suggestion to use a different nomenclature for the $R^2$.

21. These values correspond, obviously, to those reported in Panel A of Table .

22. Concerning the IV regressions, for instance, finding good instruments still remains a difficult task.
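The ‘probability vector’ described in note 6 can be made concrete with a minimal sketch of a generic compact genetic algorithm. This is not the authors' implementation and does not reproduce the eive package: the function names, the toy fitness (counting ones in the chromosome), the virtual population size and the seed are all illustrative assumptions. Each entry of the vector is the probability that the corresponding gene equals 1, so the entries need not sum to 1.

```python
import random


def sample_chromosome(prob_vector, rng):
    """Draw a binary chromosome; gene i equals 1 with probability prob_vector[i]."""
    return [1 if rng.random() < p else 0 for p in prob_vector]


def compact_ga(fitness, n_bits, pop_size=100, max_iters=20000, seed=0):
    """Generic compact GA sketch (illustrative, not the paper's algorithm).

    The probability vector is stored as integer counts out of pop_size so
    that convergence to exactly 0 or 1 can be tested without rounding error.
    """
    rng = random.Random(seed)
    counts = [pop_size // 2] * n_bits  # each probability starts at 0.5
    for _ in range(max_iters):
        probs = [c / pop_size for c in counts]
        a = sample_chromosome(probs, rng)
        b = sample_chromosome(probs, rng)
        winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
        # Shift each probability by 1/pop_size toward the winner's gene
        # wherever the two competitors disagree.
        for i in range(n_bits):
            if winner[i] != loser[i]:
                if winner[i] == 1:
                    counts[i] = min(pop_size, counts[i] + 1)
                else:
                    counts[i] = max(0, counts[i] - 1)
        # Stop once every gene's probability has reached 0 or 1.
        if all(c == 0 or c == pop_size for c in counts):
            break
    return [c / pop_size for c in counts]


if __name__ == "__main__":
    # Toy objective (an assumption for illustration): number of ones.
    final_probs = compact_ga(fitness=sum, n_bits=16)
    print(final_probs)
    print("sum of entries:", sum(final_probs))
```

Storing counts rather than floating-point probabilities is a design choice of this sketch; it keeps the stopping test exact. In the paper's setting the fitness would be tied to the regression problem, which this toy objective does not attempt to model.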
