HDDA: DataSifter: statistical obfuscation of electronic health records and other sensitive datasets

Pages 249-271 | Received 01 Aug 2018, Accepted 03 Nov 2018, Published online: 11 Nov 2018
 

ABSTRACT

No practical and effective mechanisms currently exist for sharing high-dimensional data that include sensitive information, in fields such as health, financial intelligence, or socioeconomics, without either compromising the utility of the data or exposing private personal or secure organizational information. Excessive scrambling or encoding of the information makes it less useful for modelling or analytical processing. Insufficient preprocessing may compromise sensitive information and introduce a substantial risk of re-identification of individuals by various stratification techniques. To address this problem, we developed a novel statistical obfuscation method (DataSifter) for on-the-fly de-identification of structured and unstructured sensitive high-dimensional data such as clinical data from electronic health records (EHR). DataSifter provides complete administrative control over the balance between the risk of data re-identification and the preservation of the data information. Simulation results suggest that DataSifter can provide privacy protection while maintaining data utility for different types of outcomes of interest. The application of DataSifter to a large autism dataset provides a realistic demonstration of its promise for practical applications.
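The trade-off between re-identification risk and data utility described above can be illustrated with a toy example. The sketch below is not the DataSifter algorithm, whose details are not given in this abstract. It substitutes a much simpler technique, within-column value swapping, and the names `obfuscate`, `level`, `fraction_changed` and `pearson` are hypothetical, introduced only for illustration. A single `level` parameter between 0 and 1 plays the role of the administrative control: higher values alter more records but weaken the relationships between variables.

```python
import math
import random


def obfuscate(records, level, seed=0):
    """Toy obfuscation (NOT DataSifter): within each column, shuffle the
    values of a randomly chosen fraction `level` of the rows.
    Assumes `records` is a non-empty list of equal-length rows."""
    level = min(max(level, 0.0), 1.0)       # clamp to [0, 1]
    rng = random.Random(seed)               # seeded for reproducibility
    out = [list(r) for r in records]        # copy; input is left untouched
    n = len(out)
    k = round(level * n)                    # number of rows touched per column
    for col in range(len(out[0])):
        rows = rng.sample(range(n), k)      # rows whose values are swapped
        values = [out[i][col] for i in rows]
        rng.shuffle(values)
        for i, v in zip(rows, values):
            out[i][col] = v
    return out


def fraction_changed(original, obfuscated):
    """Share of records that differ from their original (a crude privacy proxy)."""
    changed = sum(1 for a, b in zip(original, obfuscated) if list(a) != list(b))
    return changed / len(original)


def pearson(xs, ys):
    """Pearson correlation, used here as a crude utility proxy."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs)
    sy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sx * sy)


# Synthetic two-column data with a perfect linear relationship.
data = [(i, 2 * i + 1) for i in range(50)]
for level in (0.0, 0.5, 1.0):
    sifted = obfuscate(data, level)
    r = pearson([row[0] for row in sifted], [row[1] for row in sifted])
    print(level, fraction_changed(data, sifted), round(r, 3))
```

Because values are only permuted within a column, each column's marginal distribution is preserved exactly at every level, while the correlation between columns generally degrades as `level` increases. This is one simple way to see why excessive scrambling reduces analytical value.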

Acknowledgements

The authors are deeply indebted to the journal reviewers and editors for their insightful comments and constructive critiques. Many colleagues at the Statistics Online Computational Resource (SOCR), Big Data Discovery Science (BDDS) and the Michigan Institute for Data Science provided valuable input. The DataSifter technology is patented (62/540,184, dated 08/02/2017).

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This research was partially funded by the National Science Foundation (NSF grants 1734853, 1636840, 1416953, 0716055 and 1023115), the National Institutes of Health (NIH grants P20 NR015331, U54 EB020406, P50 NS091856, P30 DK089503, P30AG053760, UL1TR002240), the Elsie Andresen Fiske Research Fund, and the Michigan Institute for Data Science.
