NAICS Code Prediction Using Supervised Methods

Pages 58-66 | Received 17 Jun 2021, Accepted 20 Jan 2022, Published online: 04 Apr 2022
 

Abstract

When compiling industry statistics or selecting businesses for further study, researchers often rely on North American Industry Classification System (NAICS) codes. However, these codes are self-reported on tax forms, and reporting an incorrect code or leaving the field blank has no tax consequences, so the reported codes are often unusable. The IRS's Statistics of Income (SOI) program validates NAICS codes for businesses in the statistical samples used to produce official tax statistics for various filing populations, including sole proprietorships (those filing Form 1040 Schedule C) and corporations (those filing Forms 1120). In this article, we leverage these samples to explore ways to improve NAICS code reporting for all filers in the relevant populations. For sole proprietorships, we overcame several record linkage complications to combine data from SOI samples with other administrative data. Using the SOI-validated NAICS codes as ground truth, we trained classification-tree-based models (randomForest) to predict NAICS industry sector from other tax return data, including text descriptions, both for businesses that initially reported a valid NAICS code and for those that did not. For both sole proprietorships and corporations, we were able to improve slightly on the accuracy of valid self-reported industry sectors and to correctly identify the sector for over half of the businesses with no informative reported NAICS code.

Acknowledgments

The authors thank Barry Johnson, Director of SOI, for the data and support, and Mike Strudler, SOI Individual and Tax Exempt Branch, for invaluable assistance with SOI's treatment of multiple Schedule Cs. The authors also thank Dr. Karl Branting, our panel discussant at JSM 2020, for helpful feedback. Portions of this paper appeared in the Proceedings of the 2020 Joint Statistical Meetings.

Disclosure Statement

The authors have no personal or financial stakes in the results of this study.

Notes

1 Correspondence, Mike Strudler, SOI, 2019/10/10

2 Internal Revenue Manual 3.11.3.12 (1-1-2016 revision)

3 Further description of multiclass MCC may be found on ‘The RK Page’, rth.dk/resources/rk/introduction/index.html.
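
For readers unfamiliar with the multiclass MCC mentioned in note 3, the following is a minimal sketch of the standard confusion-matrix form of the statistic, MCC = (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2)), where s is the number of records, c the number classified correctly, t_k the number truly in class k, and p_k the number predicted as class k. It is not the authors' code: the function name `multiclass_mcc`, the convention of returning 0.0 when the denominator vanishes, and the three-sector confusion matrix are all assumptions made for illustration.

```python
from math import sqrt


def multiclass_mcc(confusion):
    """Multiclass Matthews correlation coefficient from a square confusion matrix.

    confusion[i][j] is the number of records whose true class is i and whose
    predicted class is j. (Hypothetical helper, for illustration only.)
    """
    k = len(confusion)
    s = sum(sum(row) for row in confusion)             # total records
    c = sum(confusion[i][i] for i in range(k))         # correctly classified
    t = [sum(confusion[i]) for i in range(k)]          # true count per class (row sums)
    p = [sum(confusion[i][j] for i in range(k))        # predicted count per class
         for j in range(k)]                            # (column sums)

    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = sqrt((s * s - sum(pk * pk for pk in p))
                       * (s * s - sum(tk * tk for tk in t)))
    # Assumed convention: report 0.0 when the statistic is undefined,
    # e.g. when every record is assigned to a single class.
    return numerator / denominator if denominator else 0.0


# Made-up counts for three hypothetical industry sectors (not data from the article).
example = [
    [50, 5, 5],
    [10, 30, 10],
    [5, 5, 40],
]
print(round(multiclass_mcc(example), 3))
```

A value of 1 indicates perfect agreement between predicted and validated sectors, and a value near 0 indicates agreement no better than chance; unlike raw accuracy, the statistic accounts for how unevenly records are spread across classes.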