Play It Again: Teaching Statistics With Monte Carlo Simulation

ABSTRACT

Monte Carlo simulations (MCSs) provide important information about statistical phenomena that would be impossible to assess otherwise. This article introduces MCS methods and their applications to research and statistical pedagogy using a novel software package for the R Project for Statistical Computing constructed to lessen the often steep learning curve when organizing simulation code. A primary goal of this article is to demonstrate how well-suited MCS designs are to classroom demonstrations, and how they provide a hands-on method for students to become acquainted with complex statistical concepts. In this article, essential programming aspects for writing MCS code in R are overviewed, multiple applied examples with relevant code are provided, and the benefits of using a generate–analyze–summarize coding structure over the typical “for-loop” strategy are discussed.

Notes

1. As a refresher, the CLT is typically taught as follows: “For any population with mean μ and standard deviation σ, the distribution of sample means for sample size n will have a mean of μ and a standard deviation of σ/√n and will approach a normal distribution as n approaches infinity” (Gravetter and Wallnau 2012, p. 205).
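The statement in Note 1 can be checked empirically. The following is a minimal base R sketch, assuming an exponential population with rate 1 (so that μ = 1 and σ = 1). The sample size, number of replications, and seed are illustrative choices and do not come from the article.

```r
set.seed(1234)   # arbitrary seed, used only so the run is repeatable
n <- 30          # sample size per replication (assumed)
R <- 10000       # number of replications (assumed)

# Draw R samples of size n from an exponential population (mu = 1, sigma = 1)
# and keep the mean of each sample
sample_means <- replicate(R, mean(rexp(n, rate = 1)))

c(mean      = mean(sample_means),  # should be close to mu = 1
  sd        = sd(sample_means),    # should be close to sigma / sqrt(n)
  theory_sd = 1 / sqrt(n))

hist(sample_means)  # roughly bell-shaped despite the skewed population
```

Rerunning with a larger n should shrink the spread of the sample means and make the histogram look more symmetric.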

2. See Wickham (2014) and Matloff (2011) for introductions to programming in R, and Fox and Weisberg (2010) and Teetor (2011) to learn about conducting basic statistical analyses. For a detailed introduction to programming simulation studies in R that does not use the functions presented in this article, see Jones, Maillardet, and Robinson (2009).

3. When calling apply(), the second argument refers to the margin to apply the function to. Passing a 1 here indicates that the function should be applied to each row, while a 2 pertains to each column.
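As a small illustration of the margin argument described in Note 3, the sketch below applies mean() to the rows and then to the columns of a toy matrix. The matrix itself is an arbitrary example.

```r
# A small 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2, ncol = 3)

row_means <- apply(m, 1, mean)  # margin 1: one result per row
col_means <- apply(m, 2, mean)  # margin 2: one result per column

row_means  # 3 4
col_means  # 1.5 3.5 5.5
```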

4. As a reminder, the italicized R in and the remainder of the text refers to the number of replications in the current study.

5. If one is starting a new simulation, an alternative way to set up the project is to use the convenience function SimFunctions('my-simulation'), with the desired simulation name placed within the quotation marks. Once this has been run in the R console, two R script files will be automatically generated (one for the design and execution elements, and the other for the generate, analyze, and summarize functions) and placed in the current working directory.

6. A data.frame is analogous to a matrix with information stored in and referenced to by rows and columns. However, in R, matrices have the restriction that every cell must be the same data type (e.g., numeric); data.frame objects do not have this property, so the first column may be numeric and the second text, for example. They are most often used to store rectangular datasets, with each row pertaining to a case and each column a variable.
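The difference described in Note 6 can be seen directly in base R. In this sketch the column names id and group are hypothetical, and stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version.

```r
# Mixing numbers and text in a matrix coerces every cell to character
m <- cbind(id = 1:3, group = c("a", "b", "c"))
typeof(m)          # "character"

# A data.frame keeps a separate type for each column
d <- data.frame(id = 1:3, group = c("a", "b", "c"),
                stringsAsFactors = FALSE)
sapply(d, class)   # id is "integer", group is "character"
```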

7. The creation of a Design object with more than one factor using expand.grid() will be demonstrated in a subsequent example.
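For orientation, a fully crossed design with two factors might be built with expand.grid() as sketched below. The factor names sample_size and sigma and their levels are hypothetical and are not taken from the article's examples.

```r
# Hypothetical two-factor design: every sample size is crossed with every
# population standard deviation (names and levels are illustrative only)
Design <- expand.grid(sample_size = c(30, 60, 120),
                      sigma = c(1, 2))

Design         # one row per simulation condition
nrow(Design)   # 3 levels x 2 levels = 6 conditions
```

Each row of the resulting data.frame then describes one condition to be simulated.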

8. An optional argument fixed_objects is also present, which passes an object (usually a list) that contains additional user-defined objects that should remain fixed across conditions. However, this is not commonly used, and is not employed in any of the following examples. For further details regarding this input, please refer to the help file by running ?runSimulation.

9. See the probability distributions task view on CRAN at https://cran.r-project.org/web/views/Distributions.html.

10. Alternatively, it is possible to use the built-in Attach() function in SimDesign to allow all variables in the condition object to be accessed directly. However, this approach is a matter of preference, and is therefore not demonstrated in the in-text examples.

11. If the vector is not named, then SimDesign will report an error in the console stating that the results are ambiguous. These are the types of built-in safety features that we believe are of paramount importance in developing good MCS coding practices for students and researchers alike.

12. It is important to note that, to make a simulation exactly reproducible, one can either pass a vector of seeds to the seed argument in runSimulation() or save the internal .Random.seed states to external text files by passing save_seeds = TRUE.

13. The resulting data.frame object was converted to a LaTeX table using the xtable package.

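14. It is easy to verify this equation of bias from the definition $E(\tilde{\psi} - \psi) = E(\tilde{\psi}) - \psi = \frac{1}{R}\sum_{r=1}^{R}\hat{\psi}_r - \psi = \frac{1}{R}\sum_{r=1}^{R}(\hat{\psi}_r - \psi)$. However, bias is presented in this form to show the connection to RMSE.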
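The equivalence of the two forms of bias in Note 14 can also be confirmed numerically. The sketch below assumes the parameter is a population mean, ψ = 5, estimated by the sample mean, and all numeric settings are illustrative. RMSE is computed from its usual definition, the square root of the mean squared deviation from ψ, which is an assumption about the form intended in the main text.

```r
set.seed(42)     # arbitrary seed
psi <- 5         # true parameter value (assumed)
R   <- 1000      # replications (assumed)
n   <- 25        # sample size (assumed)

# One estimate of psi per replication
psi_hat <- replicate(R, mean(rnorm(n, mean = psi, sd = 2)))

bias_a <- mean(psi_hat) - psi            # mean of the estimates, minus psi
bias_b <- mean(psi_hat - psi)            # mean of the deviations from psi
rmse   <- sqrt(mean((psi_hat - psi)^2))  # same deviations, squared and averaged

all.equal(bias_a, bias_b)  # TRUE: the two forms agree
```

Written in the second form, bias and RMSE differ only in whether the deviations are squared before averaging.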

15. An alternate approach for calculating these summary statistics is to use the more elegant functions provided in the plyr, dplyr, or data.table packages.

16. As well, comparing power rates becomes more difficult when Type I error rates do not perform optimally. Therefore, power rates generally cannot be interpreted independent of their respective Type I error rates when the p-values are deemed too liberal or conservative.

17. It is often easier to use ECR() on individual samples, and to return a value of 1 or 0 indicating whether or not the population parameter was contained in the CI. Then, in the summarise function, the mean of these 1s and 0s can be found to represent the estimated ECR. However, ECR() can also be used directly in the summarise step given the obtained CIs. It is simply a matter of preference regarding how the user wants these values to be organized.
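The per-sample logic in Note 17 can be sketched in base R without calling ECR() itself. The example assumes a normal population with mean 10 and a 95% t-based interval from t.test(); these settings are illustrative only.

```r
set.seed(2024)   # arbitrary seed
mu <- 10         # population mean the interval should cover (assumed)
n  <- 50         # sample size (assumed)
R  <- 2000       # replications (assumed)

# For each replication, record 1 if the 95% CI contains mu and 0 otherwise
covered <- replicate(R, {
  x  <- rnorm(n, mean = mu, sd = 3)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  as.integer(ci[1] <= mu && mu <= ci[2])
})

mean(covered)    # estimated coverage rate, expected to be near 0.95
```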

18. This is usual for simulations involving iterative methods, such as those using maximum-likelihood estimation. In these situations, nonconvergent models are common, especially when testing smaller sample sizes, and it is important to resample the required datasets to maintain an equivalent number of simulations across design cells. In these cases, one should report not only the simulations that converged to a proper solution, but also how many models failed to converge.
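The redraw-and-count idea in Note 18 can be sketched with a plain loop. This is not how SimDesign implements it internally. It is only an illustration, assuming a small-sample logistic regression fitted with glm() in which any warning, error, or non-converged fit is treated as a failure.

```r
set.seed(99)     # arbitrary seed
R <- 200         # required number of usable replications (assumed)
n <- 15          # small sample size, chosen to make failures likely
estimates <- numeric(R)
n_failed  <- 0

for (r in seq_len(R)) {
  repeat {
    x <- rnorm(n)
    y <- rbinom(n, size = 1, prob = plogis(2 * x))
    # Treat any warning or error (e.g., separation) as a failed fit
    fit <- tryCatch(glm(y ~ x, family = binomial),
                    warning = function(w) NULL,
                    error   = function(e) NULL)
    if (!is.null(fit) && fit$converged) break
    n_failed <- n_failed + 1   # count the discarded dataset, then redraw
  }
  estimates[r] <- coef(fit)[["x"]]
}

length(estimates)  # exactly R usable replications
n_failed           # discarded datasets, to be reported alongside the results
```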

19. It should be noted that there are other packages available for conducting MCSs, both inside and outside of R. However, SimDesign is the only available package that explicitly features the Generate–Analyze–Summarize foundation, as well as the previously mentioned convenience features. Further, this package can be used to assess any statistical phenomenon, while other options (such as online applets) are almost necessarily hardcoded to show only a particular finding.