Journal of Statistics and Data Science Education Volume 30, 2022 - Issue 3: Teaching Reproducibility

Open access

A Journey from Wild to Textbook Data to Reproducibly Refresh the Wages Data from the National Longitudinal Survey of Youth Database

Dewi Amaliah, Department of Econometrics and Business Statistics, Monash University, Clayton, Australia (corresponding author)

https://orcid.org/0000-0001-6903-6237

Dianne Cook, Department of Econometrics and Business Statistics, Monash University, Clayton, Australia

https://orcid.org/0000-0002-3813-7155

Emi Tanaka, Department of Econometrics and Business Statistics, Monash University, Clayton, Australia

https://orcid.org/0000-0002-1455-259X

Kate Hyde, Department of Econometrics and Business Statistics, Monash University, Clayton, Australia

Nicholas Tierney, Geospatial Health and Development, Telethon Kids Institute, Nedlands, Australia

https://orcid.org/0000-0003-1460-8722

Pages 289-303 | Published online: 26 Jul 2022

https://doi.org/10.1080/26939169.2022.2094300

Figures & data

Fig. 1 Documented steps taken to select variables of interest and download the raw data.

Table 1 Frequency table of the age at the start of the survey in the NLSY79 cohort in the extracted data.

Table 2 Contingency table for sex and race for the extracted NLSY79 demographic data.

Fig. 2 Longitudinal profiles of wages for a random sample of 36 individuals in the pre-cleaned data. There is considerable variation in wages. Some individuals (2799, 11041, 11146) are only measured for a short period. Some individuals (8296, 9962) possibly have errors in their wages in some years, as suggested by the extreme fluctuations.

Fig. 3 Summary plots to check the data after the tidying stage: (A) longitudinal profiles of wages for all individuals 1979–2018, (B) boxplots of the minimum, median, and maximum wages of each individual, and (C) one individual (id = 39) with an unusual wage relative to their years of data. The plots reveal that some hourly wage values are implausible and that some individuals have extremely unusual wages in some years, so more cleaning is necessary to treat these extreme values.

Fig. 4 Comparison between the original (black dots) and the corrected (solid gray) mean hourly wage for the same sample of individuals as shown in Figure 2. A robust linear model prediction was used to identify and correct mean hourly wage values. The extreme spikes, corresponding to implausible wages, have been replaced with values more similar to wages in neighboring years for individuals 8296 and 9962, but otherwise the profiles have not changed.
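
The idea behind the Fig. 4 caption (fit a robust line to each individual's wage profile, flag points that sit far from it, and replace them with the prediction) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: a Theil-Sen line stands in for the robust linear model, the flagging rule (residual larger than `k` robust standard deviations) and the value of `k` are assumptions, and the function names and the toy wage profile are hypothetical.

```python
from statistics import median


def theil_sen_fit(years, wages):
    """Fit a robust line: slope is the median of all pairwise slopes (Theil-Sen)."""
    slopes = [
        (wages[j] - wages[i]) / (years[j] - years[i])
        for i in range(len(years))
        for j in range(i + 1, len(years))
        if years[j] != years[i]
    ]
    slope = median(slopes)
    intercept = median(w - slope * y for y, w in zip(years, wages))
    return slope, intercept


def correct_spikes(years, wages, k=5.0):
    """Replace wages far from the robust line with the fitted value.

    k is an assumed threshold, expressed in MAD-based robust standard deviations.
    """
    slope, intercept = theil_sen_fit(years, wages)
    fitted = [intercept + slope * y for y in years]
    residuals = [w - f for w, f in zip(wages, fitted)]
    centre = median(residuals)
    # 1.4826 * MAD estimates the standard deviation for normally distributed errors.
    scale = 1.4826 * median(abs(r - centre) for r in residuals)
    if scale == 0:
        # No spread to judge against, so leave the profile unchanged.
        return list(wages)
    return [
        f if abs(r - centre) > k * scale else w
        for w, f, r in zip(wages, fitted, residuals)
    ]


# Hypothetical profile for one individual, with an implausible spike in 1993.
years = [1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997]
wages = [8.0, 8.6, 9.0, 250.0, 10.1, 10.4, 11.1, 11.5]
corrected = correct_spikes(years, wages)
print(corrected)
```

In this toy profile only the 1993 spike is replaced by the value predicted from the neighboring years; all other wages are returned unchanged, which mirrors the behavior described in the caption.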

Fig. 5 Remake of the summary plots for the fully processed data, which suggests the data is now in a reasonable state: (A) longitudinal profiles of wages for all individuals 1979–2018, (B) boxplots of the minimum, median, and maximum wages of each individual, and (C) one individual with an unusual wage relative to their years of data.

Fig. 6 The stages of data cleaning from the raw data to the three datasets contained in yowie. “# of individuals” means the number of respondents included in each stage, while “# of observations” means the number of rows in the data. The color represents the stage of data cleaning in the statistical value chain (van der Loo and de Jonge 2021). Pink, blue, and green represent the raw, input, and valid data, respectively.

Fig. 7 Comparison of original and refreshed data: (A) highest grade completed, (B) experience, and (C) log wages. Some difference in wages would be expected because the refreshed data is not inflation-adjusted, but the two sets are reasonably similar.

van der Loo, M. P. J., and de Jonge, E. (2021), “Data Validation Infrastructure for R,” Journal of Statistical Software, 97, 1–31. DOI: 10.18637/jss.v097.i10.

Supplemental material

supplementary_materials.zip

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the supplementary materials.
