
Big data in sleep medicine: prospects and pitfalls in phenotyping

Pages 11-29 | Published online: 16 Feb 2017

Abstract

Clinical polysomnography (PSG) databases are a rich resource in the era of “big data” analytics. We explore the uses and potential pitfalls of clinical data mining of PSG using statistical principles and analysis of clinical data from our sleep center. We performed retrospective analysis of self-reported and objective PSG data from adults who underwent overnight PSG (diagnostic tests, n=1835). Self-reported symptoms overlapped markedly between the two most common categories, insomnia and sleep apnea, with the majority reporting symptoms of both disorders. Standard clinical metrics routinely reported on objective data were analyzed for basic properties (missing values, distributions), pairwise correlations, and descriptive phenotyping. Of 41 continuous variables, including clinical and PSG derived, none passed testing for normality. Objective findings of sleep apnea and periodic limb movements were common, with 51% having an apnea–hypopnea index (AHI) >5 per hour and 25% having a leg movement index >15 per hour. Different visualization methods are shown for common variables to explore population distributions. Phenotyping methods based on clinical databases are discussed for sleep architecture, sleep apnea, and insomnia. Inferential pitfalls are discussed using the current dataset and case examples from the literature. The increasing availability of clinical databases for large-scale analytics holds important promise in sleep medicine, especially as it becomes increasingly important to demonstrate the utility of clinical testing methods in management of sleep disorders. Awareness of the strengths, as well as caution regarding the limitations, will maximize the productive use of big data analytics in sleep medicine.

Introduction

Polysomnography (PSG) offers a wealth of physiological information, informing clinical decision-making and clinical research. Large sleep-related datasets are increasingly available for public analysis; examples include the National Sleep Research Resource (NSRR),Citation1 PhysioNet (www.physionet.org), and the Montreal Archive of Sleep Studies (MASS),Citation2 and even consumer-facing efforts are underway.Citation3 As “big data” analysis efforts gain momentum, it is increasingly important to understand not only the potential benefits but also the potential pitfalls of PSG phenotyping. In an era when in-laboratory PSG is increasingly restricted, enhanced signal processing and big data analytics could justify resource allocation by informing individual- and population-level insights.

The goals of sleep phenotyping span basic and clinical investigations as well as genotype–phenotype associations, especially as academic centers are increasingly banking bio-samples. Advanced knowledge about normal and pathologic sleep physiology might be derived from studying the relationship between sleep-disordered breathing events and heart rate variability, or about how electromyography (EMG) dynamics in rapid eye movement (REM) sleep vary depending on the presence of different medications and disease states. Big data insights might link indices of fragmentation to comorbidities or predict response to treatment of obstructive sleep apnea (OSA).

The opportunity for advanced analysis and phenotyping from the rich data obtained in routine clinical practice cannot be overstated. However, the allure of big data should not distract from the potential risks associated with even basic statistics and inferential efforts. Numerous cautionary statistical articlesCitation4Citation12 and even entire monographsCitation13Citation15 have been published in recent decades highlighting the existence (and persistence) of common statistical misconceptions and pitfalls in basic and clinical research contexts. Large datasets do not mitigate these risks and in fact may present further challenges. We explore various kinds of PSG data in this framework, including insights and pitfalls from the existing literature, and through empirical analysis of a large dataset of diagnostic PSGs from our center. With the growing capacity for large-scale analytics, recognizing the strengths and limitations of phenotyping will help maximize the utility of large database resources.

Methods

The Partners Institutional Review Board (IRB) approved retrospective analysis of our database without requiring additional consent for use of the clinically acquired data (IRB number: 2009P000758). We selected diagnostic PSGs from adults in our database from 2011 to 2015, yielding n=1835 studies in our dataset. We did not have any exclusion criteria.

PSG was performed and scored according to the American Academy of Sleep Medicine (AASM) practice standards. Channels included six electroencephalogram leads, bilateral electrooculogram, submentalis EMG, nasal thermistor, oronasal airflow, snore vibration sensor, single-lead electrocardiogram (ECG), chest and abdomen effort belts, pulse oximetry, and bilateral anterior tibialis EMG.

From presleep questionnaires, we analyzed self-reported symptoms associated with sleep apnea, insomnia, restless legs, and narcolepsy. OSA symptoms included checkboxes spanning symptoms of sleep apnea, such as snoring, gasping, and witnessed apnea. Insomnia symptoms included checkboxes for difficulty with sleep onset (30–60 minutes or >60 minutes of sleep-onset latency), sleep maintenance (>3 awakenings per night), and insomnia as the reason for the PSG. Although at the time our intake form included “waking earlier than desired”, we found clinically that this question yields many false positives (e.g., work requiring earlier waking than desired), and thus we did not include it in the current analysis. Restless leg syndrome (RLS) symptoms included checkboxes for legs feeling uncomfortable, feeling better with movement, and feeling worse at night. Narcolepsy symptoms included checkboxes for perisleep hallucinations, sleep paralysis, and cataplexy. The intake form was designed to provide basic plain-language symptom inventories, but it has not been independently validated against clinical diagnosis by sleep physician evaluation. We did not include standardized questionnaires for each of the many subcomponents, to strike a balance between information that assists providers in interpreting PSG data and the burden on patients. Because the majority (>70%) of patients undergoing PSG in our center are direct referrals (they have not seen a sleep specialist in our division before the PSG), we did not have clinical interview-based validation of the symptom reporting.

Statistical analyses were performed using GraphPad Prism 6 software (GraphPad Software Inc., La Jolla, CA, USA).

Results and discussion

In the following sections, we combine the analysis of simulated data with the analysis of a large sample (n=1835) of diagnostic PSG data from our center to illustrate important considerations in analyzing big data in sleep medicine. We can consider four basic categories of information that support clinical phenotyping derived from PSG databases (Figure 1). In each category, methods of cleaning and analysis are implemented, which we discuss in the following sections. Inferential analysis and insights can also be obtained by combining information across categories. For example, correlation and regression analysis can be performed on variables within or between categories, and more complex predictive analytics can be applied using methods of supervised machine learning. Unsupervised learning, also known as clustering, can be applied for discovery of novel phenotypes.

Figure 1 Analysis hierarchy.

Notes: Categories of sleep data obtained from or associated with clinical PSG recordings. Each requires core processes of cleaning, analysis, and plotting. Combining information between categories can provide further insights, such as linking scored events (e.g., PLMS) and physiology (ECG changes), or using stage annotations to calculate transition frequencies as an adjunct to stage percentage.
Abbreviations: PLMS, periodic limb movements of sleep; PSG, polysomnography; ECG, electrocardiogram.

Standard PSG metrics and data types

The standard metrics in most clinical PSG reports are readily accessible in sleep databases without requiring off-line extra processing. These include basic demographics and summary statistics of PSG scoring, such as stage percentages, total sleep time (TST), efficiency, and apnea–hypopnea index (AHI). The importance of standardization in human scoring and basic metrics has been emphasized, especially for multicenter trials and data repositories involving PSG.Citation16 Clinical metrics should be assessed in several steps to prepare for large-scale analytics.

PSG scoring annotations

PSG annotations include technician-scored labels for sleep–wake stage and various events (arousals, limb movements, breathing events) often with time stamps. These data can be exported for off-line processing and/or combining with other sources of clinical information. Aligning these files with exported time series data allows stage- or event-specific analysis of physiological signals.

Event label errors include errors of omission and of commission, and they are best assessed by manual rescoring. Some scoring errors may have indirect effects: for example, failure to score an epoch of wake that interrupted a block of REM has the dual effect of missing an awakening and inflating the measured REM bout duration.

Annotation data are also commonly used for inter-rater reliability analysis, or in groupwise comparisons of technician- or center-level differences in scoring. Because inter-rater reliability for various scoring tasks tends to be in the 80%–85% range,Citation17 this sets a theoretical ceiling for performance of automated algorithms.

PSG time series data

Each channel of a standard PSG is a time series, to which a number of signal processing techniques can be applied to extract information. Initial preprocessing can involve detection and removal of periods with prominent muscle artifact, or removal of ECG signal contaminating electroencephalography (EEG) channels, or de-trending if slow drift is present.

Spectral analysis of the EEG is commonly performed using the fast Fourier transform (FFT) algorithm, which is applied using a moving window to produce an image of spectral characteristics as they change over time. However, the FFT alone provides noisy estimates of the underlying spectral characteristics of the data; thus, it is common to apply spectral and temporal smoothing to improve the estimates. The multi-taper spectral analysis method optimizes the trade-off between retaining fine details (spectral and temporal resolution) while still reducing noise (variance reduction).Citation18 EEG time series analysis is common in research settings but has not enjoyed similar clinical applications in routine practice, although some clinical acquisition software includes basic frequency analysis options.
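As a rough illustration of the moving-window approach, the sketch below computes a windowed-FFT spectrogram of a single EEG channel and summarizes conventional frequency bands. It assumes a NumPy array sampled at a hypothetical rate and uses a plain Hann-windowed FFT via SciPy rather than the multi-taper method described above, which would require a dedicated implementation; the signal values and band edges here are purely illustrative.

```python
# Sketch: moving-window spectral analysis of a single EEG channel.
# Assumes `eeg` is a 1-D NumPy array sampled at `fs` Hz (hypothetical values);
# a plain windowed FFT via SciPy is used here, not the multi-taper method
# cited in the text.
import numpy as np
from scipy.signal import spectrogram

fs = 200.0                             # sampling rate in Hz (assumed)
eeg = np.random.randn(int(fs * 600))   # placeholder: 10 minutes of simulated signal

# 30-second windows with 50% overlap, matching epoch-scale resolution
freqs, times, power = spectrogram(
    eeg, fs=fs, window="hann",
    nperseg=int(30 * fs), noverlap=int(15 * fs), scaling="density"
)

# Summarize conventional EEG bands per window (band edges are conventional,
# not taken from the article)
bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12), "sigma": (12, 16)}
band_power = {
    name: power[(freqs >= lo) & (freqs < hi), :].sum(axis=0)
    for name, (lo, hi) in bands.items()
}
print({name: float(vals.mean()) for name, vals in band_power.items()})
```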

The ECG time series has been the subject of extensive analysis of heart rate variability,Citation19 as well as point-process modeling variants.Citation20 Another method of ECG analysis, known as cardiopulmonary coupling (CPC), has been suggested to provide an important window into sleep quality beyond that observed in EEG-defined states. Whereas stable non-REM (NREM) sleep is characterized by stable breathing and “high-frequency” coupling (HFC) at the respiratory frequency, processes that disrupt sleep tend to increase low-frequency coupling (LFC). Of particular interest is that treatment-emergent central apnea (and clinical failure of continuous positive airway pressure [CPAP]) was predicted by the degree of narrow band LFC.Citation21 Whereas obstructive apnea is characterized by the broadband LFC phenotype, chemoreceptor-driven sleep apnea (e.g., central apnea) is associated with narrow band coupling.

Self-reported clinical information

Patients undergoing in-laboratory PSG are often asked to self-report symptoms, medical problems, and medications on a questionnaire form. Our center uses a custom form as a basic symptom and history screening tool, which includes the Epworth Sleepiness Scale (ESS), as well as checkboxes and boxes for free-text responses. When self-reporting methods are used, the data require manual or semiautomated review and cleaning before analysis is possible. If medications are listed as free text, spelling errors or nonstandard terminology (e.g., “sleeping pill”) require reconciliation. Internal inconsistencies also require attention, such as listing multiple antihypertensive medications but not listing hypertension in the medical history. We found ~75% concordance between selecting hypertension from a checkbox list of medical problems and listing an antihypertensive agent (data not shown). Such a discrepancy could be a simple omission, or the patient may be on treatment and thus no longer feel that they have the disorder.

Combining information categories to inform phenotyping

Using simple combinations of existing metrics, or more involved extractions from the clinical scoring (annotation files), additional data for phenotyping can be generated beyond that which may be available in the acquisition software system. For example, event-related signal analysis, manual scoring annotations, and temporally associated time series data can be combined to explore phenotypes. Several examples of event-specific metrics have been reported, with potential clinical relevance. Chervin et alCitation22,Citation23 analyzed respiratory event-linked EEG changes to sub-phenotype OSA patients and found a stronger relation with sleepiness by this advanced analysis compared to the usual AHI value. EEG analysis of alpha power and spindle activity has been used to predict arousal response to auditory stimulation delivered during sleep,Citation24,Citation25 reflecting possible biomarkers of sleep fragility. Additional work investigating arousals and autonomic features highlights opportunities to stratify episodic physiological events during sleep that are not currently distinguished in routine scoring.Citation26Citation32

Database-driven sleep phenotyping

Symptom heterogeneity

We used a convenience sample of n=1835 individuals who underwent diagnostic PSG in our laboratory. In this dataset, symptom combinations were common. Figure 2 illustrates the overlap between self-reported OSA symptoms, insomnia symptoms, and leg-related symptoms in the cohort. The majority of individuals reported symptoms from more than one of these categories, with less than one-third reporting from only one category. Within the group reporting OSA symptoms, isolated snoring was present in over half, with nearly as many reporting a combination of snoring and either gasping arousals or witnessed apneas (Figure S1A). Among those with insomnia symptoms, difficulty with sleep maintenance was the most common isolated symptom, while about half reported more than one insomnia symptom or checked “insomnia” from a list of reasons for the study along with at least one insomnia symptom (Figure S1B). In rare cases, insomnia was listed as the reason for study by the patient, but no insomnia symptoms were checked. Among those with leg symptoms, about one-quarter reported all three symptoms consistent with RLS (uncomfortable sensation while awake, worse at night, better with movement; Figure S2A). Narcolepsy symptoms were the least common. Isolated reporting of only one of the three cardinal REM-related phenomena was more common than any combination of two or all three (Figure S2B).

Figure 2 Symptom overlap reported by adults undergoing diagnostic PSG.

Notes: OSA symptoms and insomnia symptoms commonly coexisted (solid outline with yellow fill and dotted outline with blue fill, respectively). Leg symptoms (either RLS or PLMS) were also commonly present (dashed circle with red fill). The areas of the shapes approximate the n-value (sample size) for each category: OSA only =210; insomnia only =253; legs only =32; OSA and insomnia =584; legs and insomnia =166; OSA and legs =89; OSA and legs and insomnia =454.
Abbreviations: OSA, obstructive sleep apnea; PLMS, periodic limb movements of sleep; PSG, polysomnography; RLS, restless leg syndrome.

Sleep–wake architecture and fragmentation

Sleep–wake stages are most commonly reported as the number of minutes, and relative percentage of wake, REM, and N1–3. Stage percentage during PSG may be noted in clinical interpretations, and there are normative data available across the lifetime.Citation33 In some settings, such coarse descriptive metrics may be useful. For example, when commenting on the presence or severity of sleep apnea, one might consider the potential for underestimation if the night happened to contain little or no REM, as OSA is often more pronounced in REM sleep (i.e., REM dominant). A relative increase in N3 percentage may suggest rebound sleep after recent deprivation. Certain medications may alter stage composition (e.g., commonly by reducing REM or N3).Citation34,Citation35 Thus, stage percentage might combine with other categories of information, such as the clinical history, rather than provide a basis for actionable clinical recommendations in isolation.

A somewhat more granular approach to sleep architecture is to quantify sleep fragmentation, for example, via sleep efficiency, increased time spent in N1 (often because of frequent arousals), or reductions in REM or N3 that may occur indirectly.Citation36,Citation37 Consider the use of sleep efficiency to describe two patients with very different hypnograms but similar efficiency values. Because efficiency does not distinguish between different patterns of wake after sleep onset (WASO), it runs the risk of lumping together quite different patterns of fragmentation.Citation38 Figure 3 shows two PSGs with similar sleep efficiency, but which differ by fivefold in the number of transitions to the wake state. In fact, the PSG with the greater frequency of wake transitions (Figure 3B) actually has a slightly higher efficiency than the PSG with fewer but longer wake bouts (Figure 3A; 82% versus 78%, respectively). The reasons behind these patterns, the potential clinical impact, and the therapeutic considerations may be quite distinct. Figure 3C–G shows the distribution of several metrics in a cohort of n=100 individuals with sleep efficiency values of 79.5%–80.5%. These broadly distributed values are a reminder that a sleep efficiency of “80%” can not only be achieved with distinct patterns of wake but can also be associated with diverse patterns of other potential contributors to (or markers of) fragmentation.

Figure 3 Sleep efficiency: a limited view of heterogeneous sleep physiology.

Notes: (A and B) Hypnograms from different patients. (A) Sleep efficiency of 78.3%, associated with nine transitions to wake (age 23, male). (B) Sleep efficiency of 82%, associated with 46 transitions to wake (age 42, male). The Y-axis indicates scored stage; the time bar indicates 1 hour. (C–G) Distributions for n=100 individuals with essentially the same sleep efficiency values (79.5%–80.5%) who nonetheless differ widely across other factors that potentially contribute to or signify sleep fragmentation (age, AHI, PLMI, # W, and N1%).
Abbreviations: R, rapid eye movement sleep; N1–3, non-rapid eye movement stages 1–3; AHI, apnea–hypopnea index; PLMI, periodic limb movement index; # W, number of wake transitions.

Stage percentage is also an insensitive measure of fragmentation. Consider two patients each with 120 minutes of stage REM within an 8-hour TST: one patient could have four REM blocks each lasting 30 continuous minutes, and the other could have four blocks interrupted by frequent brief transitions to wake or N1 (for example, due to untreated OSA). Each patient’s summary report could indicate REM as 25% of TST, yet the two patterns of consolidation versus fragmentation are quite different and might imply distinct pathophysiology. This example illustrates that stage percentage does not capture fragmentation phenotypes, even those associated with a known cause of fragmentation such as OSA. Several alternatives to stage percentage have been proposed, including bout duration histograms,Citation39 bout duration survival analysis,Citation40 and others.Citation41Citation44 Some have used multi-exponential transition models,Citation45 while others have used power-law approachesCitation46 to describe these skewed patterns. Even stage transition rates have proven useful where percentages showed no discriminatory value.Citation47,Citation48 Which of these is the “best” descriptor of the distribution of sleep–wake stages remains open to discussion, although even statistically principled model selection methods may not distinguish true from alternative functions in simulation studies.Citation49
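As an illustration of how bout-level metrics expose fragmentation that stage percentage hides, the sketch below collapses a per-epoch hypnogram into bouts and compares two simulated nights with identical REM percentage but very different consolidation. The epoch length, stage codes, and toy hypnograms are assumptions made for illustration, not the scoring conventions of any particular system.

```python
# Sketch: stage percentage vs fragmentation metrics from an epoch-by-epoch hypnogram.
# Assumes 30-second epochs and stage labels "W", "N1", "N2", "N3", "R" (hypothetical coding).
from itertools import groupby

EPOCH_SEC = 30

def bouts(hypnogram):
    """Collapse consecutive identical epochs into (stage, duration_in_minutes) bouts."""
    return [(stage, sum(1 for _ in run) * EPOCH_SEC / 60.0)
            for stage, run in groupby(hypnogram)]

def fragmentation_summary(hypnogram):
    b = bouts(hypnogram)
    rem_minutes = sum(d for s, d in b if s == "R")
    tst_minutes = sum(d for s, d in b if s != "W")
    return {
        "REM_percent_of_TST": 100.0 * rem_minutes / tst_minutes if tst_minutes else 0.0,
        "transitions_to_wake": sum(1 for s, _ in b[1:] if s == "W"),  # W bouts after the first bout
        "REM_bout_durations_min": [d for s, d in b if s == "R"],
    }

# Two toy nights: identical REM percentage (25% of TST), very different consolidation
consolidated = (["N2"] * 180 + ["R"] * 60) * 4                       # four 30-min REM blocks
fragmented = (["N2"] * 180 + (["R"] * 10 + ["W"] * 1) * 6) * 4       # same REM total, 24 brief wakes
print(fragmentation_summary(consolidated))
print(fragmentation_summary(fragmented))
```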

If stage percentage cannot distinguish individuals with no OSA from those with severe OSA (despite the obvious fragmentation seen visually), then it might be even less sensitive for comparing groups or evaluating interventions expected to have less dramatic impact on sleep. For example, in a study of yoga in healthy adults, distribution analysis revealed stage differences not evident by percentage analysis.Citation50 Likewise, stage percentage does not appear to distinguish patients with versus without misperception of TST, whereas differences were evident when stages were examined using bout distribution methods.Citation51 One wonders how often initial analysis of stage percentage reveals little or no group differences, and deeper analysis of fragmentation is simply not pursued.

Phenotyping sleep apnea

The summary metric most commonly acted upon in clinical practice is the AHI, which is used to define the presence and severity of OSA. This event rate has become the cornerstone of diagnosis, a threshold index for insurance coverage of therapy, and a metric for inclusion and outcome in research trials. However, the OSA phenotype is much more heterogeneous than differences in AHI values might suggest, even if we put aside desaturation thresholds for scoring hypopneas,Citation52Citation55 the potential for night-to-night variability,Citation56Citation61 and other anatomical and physiological contributors.Citation62Citation64 OSA phenotypes can be described by extracting further details from routine PSG. For example, the severity of OSA often depends on sleep stage and on body position, although a single night of PSG recording may not contain sufficient time in the different combinations of stage and position to make this determination.Citation65 Figure 4A illustrates an example of severe hypoxia to <60% during REM in a highly REM-dominant case, despite categorization as normal (AHI, 4.7) when the event rate is calculated over the full night. Figure 4B illustrates a strongly supine-dominant case, with a normal AHI while sleeping in the lateral position and a very severe AHI while sleeping supine. The full-night AHI is the weighted average of these extremes, which happened to be 19.2 on this night. Had the person slept supine the whole time, or lateral the whole time (or if positions were not recorded), then very different conclusions about the presence and severity of OSA would likely be drawn. In this case, it is also interesting that REM dominance could not be assessed, as only lateral REM was seen and no apnea was present while lateral.
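The weighted-average arithmetic can be made explicit with a minimal sketch. The position-specific AHIs below are taken from the case described in Figure 4B, but the hours spent in each position are assumed for illustration; with different position times, the same position-specific AHIs would yield a very different full-night AHI.

```python
# Sketch: the full-night AHI as a time-weighted average of position-specific AHIs.
# The position-specific AHIs mirror the supine-dominant case in the text
# (supine AHI ~62, non-supine AHI ~0.9); the hours in each position are hypothetical.
supine_hours, nonsupine_hours = 2.0, 4.5
supine_events = 62 * supine_hours          # events scored while supine
nonsupine_events = 0.9 * nonsupine_hours   # events scored while non-supine

supine_ahi = supine_events / supine_hours
nonsupine_ahi = nonsupine_events / nonsupine_hours
full_night_ahi = (supine_events + nonsupine_events) / (supine_hours + nonsupine_hours)

print(f"supine AHI={supine_ahi:.1f}, non-supine AHI={nonsupine_ahi:.1f}, "
      f"full-night AHI={full_night_ahi:.1f}")
# With these assumed times, the full-night AHI is about 19.7, i.e. the time-weighted
# average (62*2.0 + 0.9*4.5) / 6.5; other position times give very different results.
```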

Figure 4 AHI: a limited view of heterogeneous sleep physiology.

Notes: (A and B) Hypnograms from different patients. (A) A case of REM-dominant obstruction with a prominent hypoventilation pattern, resulting in a normal 4% AHI value (4.7 per hour) but a severe oxygen nadir of 57% (age 66; female; BMI, 35). (B) A case of supine-dominant sleep apnea, with a full-night AHI in the moderate range (19.7 per hour), resulting from the weighted average of a supine AHI of 62 and a non-supine AHI of 0.9 (age 74; female; BMI, 21). (C–G) The distribution of n=100 individuals with similar 4% AHI values (30–35 per hour) who differ widely across other factors that shape the clinical phenotype and potentially therapy choices (age, BMI, supine AHI, PLMI, and CAI).
Abbreviations: R, rapid eye movement sleep; N1–3, non-rapid eye movement stages 1–3; AHI, apnea–hypopnea index; BMI, body mass index; CAI, central apnea index; PLMI, periodic limb movement index; REM, rapid eye movement.

Which AHI is most relevant depends on several factors. For example, in a study of airway anatomy while supine in a scanner, the supine AHI might be most informative even if the individual never sleeps supine in the home. By contrast, in a study of clinical outcomes, the real-world AHI experienced by the patient is paramount: if the patient sleeps exclusively non-supine (and this can be demonstrated), then the lateral AHI is the relevant “phenotype” for that individual.

Characterizing supine dominance also has direct implications for clinical care. Patients with strong supine dominance may benefit by pursuing positional therapy. Much work exists in this area,Citation66 and devices to assist in position therapy exist in the consumer and prescriptionCitation67 spaces. Device-assisted therapy is important, especially because patients’ self-report of body position during sleep carries substantial uncertainty.Citation68 By contrast, REM dominance does not as easily translate into clinical care recommendations for therapy, but REM-dominant OSA has been increasingly linked to hypertension,Citation69 and thus might impact treatment motivation. Insufficient evidence exists regarding REM-suppressing agents as pharmacological therapy for OSA.Citation70

The heterogeneity in clinical features is apparent when examining a distinct set from our database with AHI in a very small range, 30–35 per hour (n=100). In this group of very tightly clustered “severe” AHI cases, the age, body mass index (BMI), supine AHI, periodic limb movement index (PLMI), and central apnea index (Figure 4C–G) are each quite broadly dispersed. In addition, the distributions do not visually suggest obvious cutoffs or subgroups. In each case, clinical decisions might differ depending on where in the dispersion an individual resides in each category, and across categories; similar clinical sub-phenotypes have been discussed recently.Citation64 For example, an AHI of 30 in a slender 25-year-old with no periodic limb movements of sleep (PLMS) and a high supine AHI might have different treatment options or preferences (not to mention risks and outcomes) than an older obese patient with comorbid PLMS and an increased central component. Clinically and in many research settings, severity categories span much larger AHI ranges and are thus likely to harbor even more heterogeneity across these and other potentially important phenotypic axes (medications, alcohol, airway anatomy, etc).

Phenotyping insomnia

Insomnia is clinically defined entirely by self-reported symptoms. While research efforts impose cutoffs for sleep latency or WASO as inclusion criteria, in clinical practice the emphasis is on the severity of the complaint and the self-reported impact on daytime function rather than on numerical requirements of sleep parameters. Even in research settings, it can be challenging to demonstrate objective impact on daytime function,Citation71 a reminder that chronic insomnia is not phenomenologically equivalent to experimental sleep restriction in healthy adults and the ensuing performance decrements. There is growing interest in using objective measures to study insomnia, with respect to the hyperarousal pathophysiology,Citation72 as well as recent work indicating that it is the combination of insomnia and objective short sleep duration on PSG that is specifically associated with incident medical and psychiatric risk.Citation73,Citation74

Despite the clinical reliance on self-report, extensive work highlights the challenges associated with the subjective experience of insomnia. As an example, the seemingly simple question of sleep duration, which forms the basis of nearly all epidemiological sleep research, depends on demographics,Citation75 comorbid psychiatric disorders,Citation76,Citation77 and comorbid sleep disorders.Citation78,Citation79 In addition, within-individual analysis reveals the striking observation that both whenCitation80 and howCitation81 sleep–wake durations are queried impact patient responses. The prospect of internal inconsistency across query methods remains an important yet unresolved issue.

Prolonged sleep latency is a common complaint, and although it may seem a straightforward metric, it carries special challenges for understanding insomnia and specifically the misperception phenotype. Objective sleep latency measurement requires an operational definition, for which there is no gold standard. Although prior literature considered behavioral (non-EEG) approaches to identify sleep onset,Citation82,Citation83 clinical reporting usually defines sleep onset as either the first epoch of any sleep or the first instance of a consolidated bout (e.g., 10 epochs) of uninterrupted sleep. Different definitions impact the calculations and therefore the experimental results. Consider a patient with delayed sleep phase syndrome who exhibits a 2-hour latency but whose subsequent sleep is well consolidated, compared to an individual who spends the first 2 hours in fragmented brief transitions between wake and sleep, perhaps due to pain, and thus also has a 2-hour latency to persistent sleep. It is difficult to rationalize lumping these together under a definition of latency to persistent sleep (both are 2 hours).

For studies of misperception, the subjective sleep latency is compared to some definition of objective sleep onset; clearly, the definition of objective onset may impact the resulting calculation. We recently introduced a novel metric of sleep onset misperception that obviates the need to define objective sleep onset.Citation51 The fundamental goal of quantifying sleep onset misperception is to capture how much sleep was misinterpreted as wake, and thus we calculated the total sleep duration occurring between lights out and subjective sleep onset. This also addresses a potential confound of assuming that onset misperception and TST misperception are independent. Consider a patient with 8 hours of objective sleep and a 1-minute onset latency, who subjectively reports a 4-hour onset latency and 4 hours of TST. Typically, this person would be labeled as having both onset and total sleep misperception (4 hours each). However, if they anchored their total sleep estimate to their own sleep latency estimate of 4 hours, then their 4-hour total sleep estimate is an accurate estimate of the TST occurring after they believed they fell asleep. We recently showed that a substantial portion of patients would be reclassified if their misperception phenotype is based on the sleep occurring during the subjective latency and the “corrected” total sleep misperception.Citation51 Big datasets may allow further evaluation of misperception phenotype(s), which have not enjoyed consistent predictors in the prior literature.Citation77
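A minimal sketch of the idea, assuming an epoch-by-epoch hypnogram that starts at lights out and a patient-reported subjective latency: total the objective sleep occurring before the subjective onset. The variable names and the worked example are hypothetical illustrations of the concept rather than the exact implementation used in the cited work.

```python
# Sketch: sleep occurring during the subjective sleep-onset latency.
# Assumes a hypnogram of 30-second epochs starting at lights-out, with "W" = wake,
# and a patient-reported subjective latency in minutes. Names and numbers are
# hypothetical illustrations, not the authors' code or data.
EPOCH_SEC = 30

def sleep_during_subjective_latency(hypnogram, subjective_latency_min):
    n_epochs = int(subjective_latency_min * 60 / EPOCH_SEC)
    window = hypnogram[:n_epochs]                       # epochs before subjective onset
    sleep_epochs = sum(1 for stage in window if stage != "W")
    return sleep_epochs * EPOCH_SEC / 60.0              # minutes of sleep "misperceived" as wake

# Example: objective onset after ~1 minute, but the patient reports a 4-hour latency.
hypnogram = ["W"] * 2 + ["N2"] * 958                    # ~8 hours in bed, nearly all asleep
print(sleep_during_subjective_latency(hypnogram, subjective_latency_min=240))  # ~239 minutes
```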

Analysis and inference

Missing and erroneous database entries

Routine clinical data can easily be arranged in tabular format to facilitate an initial data evaluation. When these metrics are exported into spreadsheets with columns of features (and each row is one patient’s data), some straightforward cleaning methods can be implemented (Figure S3). Minimum, maximum, and counting commands can identify columns with missing data (e.g., count if empty), improperly formatted data (e.g., count if text is present), or implausible values (e.g., count if outside a limit value). In some cases, such outliers or errors would be missed in routine plotting such as bar plots with standard deviations (SDs), or even box and whisker plots, depending on whether outliers are plotted and how the axis ranges are chosen.
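For tabular exports, the same checks can be scripted. The sketch below shows pandas equivalents of the spreadsheet commands described above; the file name, column names, and plausibility limits are hypothetical and would need to be adapted to a given export.

```python
# Sketch: pandas equivalents of the spreadsheet-style checks described above.
# The file name, column names ("age", "bmi", "ess", "ahi"), and limits are hypothetical.
import pandas as pd

df = pd.read_csv("psg_metrics.csv")

# Count missing entries per column ("count if empty")
print(df.isna().sum())

# Flag columns where numeric fields contain stray text ("count if text is present")
numeric_cols = ["age", "bmi", "ess", "ahi"]
for col in numeric_cols:
    coerced = pd.to_numeric(df[col], errors="coerce")
    print(col, "non-numeric entries:", int(coerced.isna().sum() - df[col].isna().sum()))

# Implausible values ("count if outside limit value"); limits are illustrative only
limits = {"age": (18, 110), "bmi": (12, 90), "ess": (0, 24), "ahi": (0, 150)}
for col, (lo, hi) in limits.items():
    vals = pd.to_numeric(df[col], errors="coerce")
    print(col, "out of range:", int(((vals < lo) | (vals > hi)).sum()))
```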

Several reasons for missing values are possible, including corrupted data (data were collected but were no longer accessible), collected but not recorded (paper copies fail to transfer to electronic database), and not collected. Some qualitative assessment of the distribution of variables from individuals who are missing at least one other entry in a data matrix can be useful.Citation84 Specific decisions regarding how to handle missing data points or error values are best handled by prespecifying a plan, which could involve excluding subjects or imputing missing values; more advanced discussions are available.Citation85

In some cases, missing values occur for appropriate reasons and the absence can be informative. For example, the REM AHI cannot be calculated if no REM sleep was observed during a PSG. Likewise, position dependence of OSA cannot be calculated for PSGs in which only supine position was observed. In these cases, removing such subjects might be favored over imputation.

Determining whether a given value is erroneous (versus a biologically plausible outlier) may depend on certain clues, such as “impossible” values or known placeholders for missing data used by an acquisition software system. For example, negative values where only positive values are possible (e.g., age) are easily identified as both errors and outliers. Likewise, when letters are present instead of numbers, or when the value is out of range (e.g., an ESS value of >24), these are also easily identified. In some examples (Figure S3), simply plotting the data identifies outliers. Spreadsheet software such as Excel can easily show the maximum and minimum values for inspection of implausible values. As an example of errors not readily detected by the abovementioned methods, in our database the BMI and ESS are manually entered in adjacent fields, such that an out-of-range ESS value prompted inspection of the BMI as well, and in some cases viewing the original data showed that the two values had been interchanged (in one such instance, the erroneous BMI value of 18 was plausible and so would not have been flagged as an outlier).

In some cases, we may still wish to exclude plausible data from analysis. Examples relate to stage- and position-specific metrics, wherein the amount of time spent in the condition of interest is the “sampling” problem, rather than the number of subjects. We may wish, for instance, to exclude people with minimal time in REM or minimal time spent sleeping supine, not just those with zero time, when calculating OSA dominance ratios or oxygen nadirs in REM. For the REM AHI, the value could be artificially high (e.g., a single apnea in one epoch of REM) or artificially low (insufficient time spent in REM to manifest obstructions).

Distributions and plotting

Evaluating the distribution of individual variables can inform multiple aspects of analysis and inference. The most basic reason to understand the distribution is to decide what kind of statistical approach is most appropriate, such as whether continuous data are normally distributed or skewed in some manner, in which case data transformations to make the data distribution approximately normal (e.g., logarithmic transformation of positive-valued data) or nonparametric analysis methods may be preferred. Moreover, like plotting the raw data, evaluating the distribution using one of several techniques can also inform the approach to outliers, or the possibility of interesting biological heterogeneity. For example, multimodal distributions may imply that the population contains different sub-phenotypes that might be worthy of further investigation.

In our cohort, none of the variables passed statistical testing for normality, similar to prior work using the Sleep Heart Health Study database.Citation84 Of note, large samples may be highly “powered” to reject the null hypothesis of a normal distribution, even when the distribution appears nearly normal. Conversely, small samples are more likely to pass tests of normality, even when drawn from a distribution known to be non-normal.Citation39 Indeed, when we under-sampled the current dataset, the probability of passing tests for normality increased (Figure S4). Non-normal data can be handled either by nonparametric methods or by transformations that render the data approximately normal. The challenge is as much biological as statistical: non-normal distributions may themselves carry phenotyping implications.
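The sample-size dependence of normality testing can be reproduced with a simple subsampling experiment. The sketch below applies the D’Agostino–Pearson test (as implemented in SciPy) to random subsets of a simulated long-tailed variable; it mimics the analysis described above but does not use the actual clinical data.

```python
# Sketch: how sample size affects normality testing (D'Agostino–Pearson),
# mimicking the under-sampling experiment described in the text with a
# simulated skewed variable rather than the clinical cohort.
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=2.0, sigma=0.8, size=1835)   # stand-in for an AHI-like variable

for n in (20, 50, 200, 1835):
    passes, trials = 0, 200
    for _ in range(trials):
        sample = rng.choice(skewed, size=n, replace=False)
        _, p = normaltest(sample)
        passes += p > 0.05           # "passing" = failing to reject normality
    print(f"n={n:5d}: {passes}/{trials} subsamples pass the normality test")
```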

The method of plotting can impact the viewer’s impression of the data. Bar graphs with the mean and SD or standard error of the mean (SEM) are commonly used, but these routine methods risk inadvertently concealing potentially important information. Figure 5 illustrates different plotting methods for groups of simulated data from known distributions. In the case of bar plots with SEM, casual inspection might give a false sense of reduced variance of the actual observations in the dataset (Figure 5B). This happens because the SEM is obtained by dividing the SD by the square root of the sample size, which makes the error bars smaller. The SEM thus does not reflect variance in the data, but rather the precision of the estimate of the mean value – one should not conflate the two.

Figure 5 Plotting views of three common distributions.

Notes: Each row contains one plotting method for three distributions (G, LT, and BI). Each column contains a simulated sample size of n=10 (left) or n=30 (right). The individual points shown as dot plots (A) are given for visual comparison with the more common views (B–D). Bar plots with SEM (B) appear quite similar across the simulated distributions. Similarly, when plotting a bar with SD (C), there is little suggestion that three different distributions are shown. When plotted using box and whiskers (D), we observe clues of non-Gaussian distributions; here the box is the 25th–75th percentile, the line is the median, + is the mean, and the whiskers are the 2.5th–97.5th percentile. In each graph, the Y-axis is in a.u.
Abbreviations: a.u., arbitrary units; BI, bimodal; G, Gaussian; LT, long tail; SD, standard deviation; SEM, standard error of the mean.

The SD, in contrast, reflects the dispersion in the data and does not diminish with increasing sample size as the SEM does. However, the SD may still be misleading in a bar graph when it is constructed from data with a non-normal distribution. Because the SD is by convention shown as symmetric bars around the mean (regardless of the actual underlying data distribution), viewers may be left with the potentially false impression of symmetric spread around the mean simply because of the display convention (Figure 5C). Sometimes the only clue in a bar graph that the population is skewed is that the SD value is greater than the mean value for a dataset that cannot take on negative values, which implies a long tail (i.e., non-normal). This is common, for example, in known skewed distributions such as AHI or sleep latency, where a value of, say, 15±18 would be interpreted as evidence of a long-tailed non-normal distribution. Asymmetries and skew are visually evident in box and whisker plots (Figure 5D). However, even the box and whisker method can “hide” the distribution for unusually structured data, such as bimodal distributions, which would be phenotypically important to recognize.

Other techniques for visually assessing structure in populations include frequency histograms and cumulative distribution functions (CDFs). Histograms are used in Figure 6 to illustrate the distribution of TST, sleep efficiency, and the number of ≥30-second awakenings. The choice of bin size for histogram plots should consider the trade-off between granularity of the variable of interest and sample size per bin. Too many bins cause the counts to be either 0 or 1 in each bin, or to vary randomly from bin to bin due to sampling noise, and therefore offer little visual insight. Too few bins cause the underlying distribution to be overly smoothed. Histogram views can reveal outliers, suggest heterogeneity of the population, or inform the selection of cutoff values (e.g., a “valley” between two modes within the data suggests two populations). In contrast, the histograms shown in Figure 6 do not have clear “valleys” on visual inspection.
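A short plotting sketch makes the same point with simulated data: the bar-with-SEM view conceals the long tail that the box plot and histograms reveal, and the histogram's appearance depends on the bin count. The simulated values stand in for a skewed PSG metric such as the AHI and are not drawn from the clinical cohort.

```python
# Sketch: the same skewed variable plotted as a bar with SEM, a box plot,
# and histograms with different bin counts (simulated data, not the clinical cohort).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
ahi_like = rng.lognormal(mean=2.3, sigma=0.9, size=1835)   # long-tailed stand-in

fig, axes = plt.subplots(1, 4, figsize=(12, 3))

sem = ahi_like.std(ddof=1) / np.sqrt(ahi_like.size)
axes[0].bar([0], [ahi_like.mean()], yerr=[sem])
axes[0].set_title("bar + SEM")

axes[1].boxplot(ahi_like, whis=(2.5, 97.5), showmeans=True)
axes[1].set_title("box and whisker")

axes[2].hist(ahi_like, bins=5)          # too few bins: over-smoothed
axes[2].set_title("5 bins")

axes[3].hist(ahi_like, bins=40)         # finer view of the long tail
axes[3].set_title("40 bins")

plt.tight_layout()
plt.show()
```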

Figure 6 Distributions of common PSG metrics.

Notes: Each panel contains a frequency histogram of n=1835 PSGs for TST (A), efficiency (B), number of wake transitions (C), and PLMI (D). In each panel, the inset graph contains three plots of the same distribution: box and whisker, bar with SD, and bar with SEM. The clearly skewed distributions are largely hidden by the bar plots, compared to the histogram and box and whisker views.
Abbreviations: PLMI, periodic limb movement index; PSG, polysomnography; SEM, standard error of the mean; SD, standard deviation; TST, total sleep time.

CDF plots can also be informative, especially when comparing groups, or when the metric of interest is a threshold imposed upon a continuous variable. Unlike histograms, CDFs do not require specification of a bin size; however, their visualization may be less intuitive. Figure 7 shows CDF plots for different sleep apnea metrics, such as position dependence of the AHI (Figure 7A) and of the central apnea index (Figure 7B). Figure 7C shows the distribution of time spent in different body positions during sleep. Threshold values can be evaluated visually, such as the portion of the population with at least 50% of the night spent supine (Figure 7C; ~60%), or who had a supine AHI value >30 (Figure 7A; ~20%).
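An empirical CDF can be computed directly from sorted values without any binning, and threshold questions such as the proportion above a given cutoff can be read numerically. The sketch below uses simulated values as a stand-in for supine AHI; the threshold of 30 per hour mirrors the example above.

```python
# Sketch: an empirical CDF needs no bin choice, and threshold questions can be
# read off directly (simulated supine-AHI-like values, not the clinical cohort).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
sup_ahi = rng.lognormal(mean=2.5, sigma=1.0, size=1835)

x = np.sort(sup_ahi)
cdf = np.arange(1, x.size + 1) / x.size      # cumulative fraction at or below each value

plt.step(x, cdf, where="post")
plt.xlabel("supine AHI (events/hour)")
plt.ylabel("cumulative fraction")
plt.show()

# Fraction of the simulated population with supine AHI > 30, read numerically
print(float((sup_ahi > 30).mean()))
```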

Figure 7 Distribution of common sleep apnea PSG metrics.

Notes: (A) CDFs for AHI, Sup AHI, and NonSup AHI; the Sup AHI values were higher as indicated by the slower rise of the CDF. (B) CDFs for CAI, which also showed some degree of position dependence. (C) A CDF of the percentage of sleep time spent in the four cardinal body positions; left and right showed similar distributions, while prone was not commonly observed.
Abbreviations: AHI, apnea–hypopnea index; CAI, central apnea index; CDFs, cumulative distribution functions; NonSup, non-supine; PSG, polysomnography; Sup, supine.

Correlation analysis

One of the powerful approaches enabled by large datasets is investigating correlations between variables. Nonparametric (Spearman) correlation was performed between AHI and BMI, which are well known to be positively correlated. Taking the full cohort, the unadjusted Spearman’s R-value is ~0.25. Figure 8A shows the distribution of R-values for AHI versus BMI obtained when repeatedly analyzing randomly selected smaller subsets of the cohort. For these subgroups, the range of R-values is larger for smaller sample sizes. In other words, smaller samples of the large cohort (n=1835) show a much larger range of correlation values than the value obtained from the whole set (~0.25). This variation includes more extreme R-values, including frankly negative correlations in some cases for subsets of size n=10 and n=20. Similar patterns are observed with another pair of parameters that showed a positive correlation in the large cohort (age and PLMI; Figure 8B).
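The subsampling experiment is straightforward to reproduce in outline. The sketch below simulates two weakly correlated variables standing in for BMI and AHI, then tabulates the range of Spearman R-values across random subsets of varying size; the simulated relationship is an assumption chosen only to give a full-cohort correlation of roughly the magnitude reported above.

```python
# Sketch: distribution of Spearman R-values from random subsets of a cohort,
# mirroring the subsampling analysis described (simulated BMI/AHI-like data with
# a modest true correlation, not the actual clinical values).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
n_total = 1835
bmi = rng.normal(30, 6, n_total)
ahi = np.exp(0.05 * bmi + rng.normal(0, 1.0, n_total))   # weak positive relation

rho_full, _ = spearmanr(bmi, ahi)
print("full cohort R:", round(rho_full, 3))

for n in (10, 20, 50, 100):
    rs = []
    for _ in range(500):
        idx = rng.choice(n_total, size=n, replace=False)
        rho, _ = spearmanr(bmi[idx], ahi[idx])
        rs.append(rho)
    rs = np.array(rs)
    print(f"n={n:4d}: R range {rs.min():+.2f} to {rs.max():+.2f}")
```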

Figure 8 Sample size impacts correlations calculated pairwise from PSG metrics.

Notes: (A–C) Box and whisker plots of Spearman’s R-values obtained from random subsets of the full cohort (n=10, 20, 50, 100) for pairwise correlations: AHI with BMI (A), PLMI with age (B), and REM percentage with random numbers (C). Note the more extreme range of R-values (positive and negative) observed with smaller samples, and the wide range of observations even when correlating a PSG metric with uncorrelated random numbers. In (D), Spearman’s R-values were calculated for all pairwise correlations across 40 continuous variables (age, BMI, AHI, etc), but only those reaching a P-value <0.05 are shown. The resulting dot plots are shown across a range of sample sizes from n=20 up to the full cohort. Note that the range of significant R-values is strongly related to sample size, with only very large values reaching significance when the random subsets were small.
Abbreviations: AHI, apnea–hypopnea index; BMI, body mass index; PLMI, periodic limb movement index; PSG, polysomnography; REM, rapid eye movement.

To emphasize the risk that under-sampled (small) data can produce spurious findings in correlation analysis, we also show the R-values obtained when pairing REM% from the cohort with a vector of random numbers (Figure 8C). This plot clearly shows that significant correlations, even with convincingly large R-values, can occur with random data. These plots illustrate the concept that extreme values of statistical estimates (such as a correlation coefficient or a mean) are more common in under-sampled data. Most investigators reflexively think of “power” in the sense that lack of statistical significance when a true difference exists (type 2 error) could be a symptom of insufficient sample size. However, small sample sizes also harbor false-positive risk (type 1 error).

Large datasets can mitigate the false-positive risks associated with small numbers, except when the dataset is parsed into ever smaller subsets through increasingly specific data-mining queries. Across all 41 continuous variables in the database, the threshold R-value for meeting significance was quite small when calculated from the entire cohort. The threshold for obtaining a significant R-value increases as progressively smaller subsets of the cohort are considered. Figure 8D illustrates how the “power” of the Spearman correlation calculated between any two pairs of variables decreases as the sample size decreases. In other words, when the entire cohort is considered, even quite small R-values in pairwise correlations meet significance criteria, because the large size essentially provides power to detect small correlations as significant. By contrast, small samples provide insufficient power to detect small correlations as statistically significant, and thus only large R-values meet significance criteria. This latter issue creates an interesting conundrum: because only large R-values can be significant when small datasets are considered, any significant correlation (whether true or false in reality) will necessarily have a compelling-appearing R-value, which may overestimate the true R-value for that pair of variables had a larger sample been utilized.
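The dependence of the significance threshold on sample size can be approximated directly. The sketch below uses the usual t-distribution approximation for the critical correlation coefficient at two-sided P<0.05; it is an approximation (particularly for small samples and for Spearman's R) rather than an exact threshold, but it illustrates why only large R-values can reach significance in small subsets.

```python
# Sketch: approximate |R| needed for two-sided P<0.05 as a function of sample size,
# using the t-approximation for correlation coefficients (an approximation, not an
# exact permutation-based threshold for Spearman's R).
import numpy as np
from scipy.stats import t

for n in (20, 50, 100, 500, 1835):
    t_crit = t.ppf(0.975, df=n - 2)
    r_crit = t_crit / np.sqrt(t_crit**2 + n - 2)
    print(f"n={n:5d}: |R| must exceed ~{r_crit:.3f} to reach P<0.05")
```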

Sometimes, we may have prior information to help mitigate false inferences. Given the strong known relation between BMI and AHI, insignificant or paradoxical (negative) associations can be interpreted as likely false findings. However, in other cases we may not have the benefit of strong prior knowledge, making the assessment of new data more challenging.

Inferential pitfalls of small and large sample sizes

We have seen through empirical analyses and simulations that a spectrum of information and pitfalls are possible when working with large datasets. We now turn to illustrative examples from the literature in which large datasets may not be as explanatory as they appear. While there are many examples from which to choose, these example situations are representative of some key challenges.

Situation 1: when big data are still under-sampled

A recent study of more than 50 million pregnancies in the US sought to correlate adverse maternal and baby outcomes with OSA.Citation86 The study used billing codes from a massive registry to assign case labels for OSA. By this method, the prevalence of OSA in pregnancy was 3 per 10,000, approximately 100-fold lower than expected in this demographic. The discrepancy raises the possibility that the OSA coding is not just underestimating prevalence, but may also be biased, for example, toward the most severe or most symptomatic cases. If so, the implications of any results based on these data greatly shrink in scope, as they would apply only to the most severe cases of OSA or the most vulnerable or symptomatic individuals.

Situation 2: when big data explain little

While large sample sizes increase power, an associated risk involves the potential to be “overpowered” to detect very small correlations or group differences. We can observe this in the effect of modafinil on the sleep latency of shift workers,Citation87 or in the relation of OSA severity with ESS.Citation88 A striking example of a very large sample supporting very small effects can be found in the analysis of mood rhythms detected in word analysis of more than 500 million tweets.Citation89 Relative change in day length was significantly related to positive affect, with an R-value of 1.2×10⁻³, suggesting that the rhythm explained a fraction of a percent of affect fluctuations. These observations highlight the well-known but under-practiced mandate to focus scientific investigations on determining effect sizes and causal relationships and on establishing practical or clinical relevance, rather than on the simplistic binary questions at the heart of statistical significance testing.Citation6

Situation 3: when big data are misinterpreted

The largest study of home sleep testing to date was recently published, with the stated goal of determining whether home testing was being used clinically in accordance with AASM standards.Citation90,Citation91 The sample size of nearly 200,000 home tests is orders of magnitude larger than any prior home testing report. The authors concluded, based on a high posttest probability of OSA (~80%), that testing was indeed in line with AASM guidelines. However, the AASM recommends that the pretest probability (not the posttest probability) should be >80% for at least moderate OSA (AHI >15). Bayes’ theorem tells us that the pretest probability of AHI >15 was <10% in the published cohort (and 50% if an AHI >5 threshold is used),Citation92 and thus the data actually support the opposite conclusion to that reached by the authors: home testing for OSA is being used too liberally, and not in line with AASM guidance.

Another recent articleCitation93 used administrative data from more than 2000 patients to derive a screening algorithm for OSA cases, but failed to recognize, by Bayes’ theorem, that the algorithm’s sensitivity and specificity were indicative of chance performance. One always needs to consider both sensitivity and specificity when evaluating any test. We use a simple calculation, the “rule of 100”, which can avoid this statistical fallacy: if the sensitivity and specificity of a test add to 100%, the probability of disease is unchanged by the result of the test (i.e., chance performance).
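The “rule of 100” follows directly from Bayes’ theorem for a binary test. In the sketch below, when sensitivity and specificity sum to 100%, the posttest probability equals the pretest probability regardless of the result; the numerical values are illustrative and are not taken from the cited studies.

```python
# Sketch: Bayes' theorem for a binary screening test, illustrating the "rule of 100":
# when sensitivity + specificity = 100%, the posttest probability equals the pretest
# probability (chance performance). Numbers are illustrative, not from the cited studies.
def posttest_probability(pretest, sensitivity, specificity):
    p_pos_given_disease = sensitivity
    p_pos_given_no_disease = 1.0 - specificity
    p_pos = pretest * p_pos_given_disease + (1 - pretest) * p_pos_given_no_disease
    return pretest * p_pos_given_disease / p_pos

# Sensitivity + specificity = 100%: a positive result does not change the probability
print(posttest_probability(pretest=0.30, sensitivity=0.60, specificity=0.40))  # 0.30

# An informative test (sensitivity + specificity > 100%) does shift the probability
print(posttest_probability(pretest=0.30, sensitivity=0.80, specificity=0.70))  # ~0.53
```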

Conclusion

Clinical databases have important strengths that can support big data research goals. Clinical data contain diversity and heterogeneity that may be specifically excluded in clinical trial databases, which are often designed to reduce sources of variability that can be detrimental to power calculations and outcome testing. Clinical databases are more likely to reflect “real-world” variation in clinical phenotypes. This can be important for testing whether predictive algorithms can generalize across a diversity of clinical phenotypes. In addition, heterogeneous sets may be more amenable to clustering and other exploratory methods that allow discovery of new phenotypes that can be explored in subsequent prospective studies. From a resource utilization standpoint, clinical databases are a natural extension of already acquired data supporting patient care, which allows valuable and limited resources to be applied at the analysis phase.

Despite these advantages, certain limitations must be recognized. Academic centers may have different referral biases, for example, being enriched for complicated cases. Although most clinical laboratories have standardized physiological recording protocols, the collection of self-reported clinical information may not be standardized. Variation across recording and scoring technologists may contribute heterogeneity despite quality efforts required in accredited laboratories. Centralized scoring common to large clinical trials may not be practical for clinical databases.

Large sleep datasets offer the opportunity to pursue complex phenotyping exploration, and to detect scientifically or clinically interesting differences or patterns in health and disease. Despite the clear advantages, analysis of big data in sleep medicine also carries risks. Understanding common pitfalls can help mitigate the risks, whether one is conducting the analysis or reviewing publications involving big data. Ideally, what is learned from population-level big data efforts can then inform individual clinical care decisions. In an era when insurance restrictions are driving at-home limited channel alternatives, these efforts will be critical to elaborate and justify the current and possibly more advanced future use of PSG for clinical care. The era of big data in sleep medicine is poised to provide unprecedented insights, especially as it coincides with massive shifts in reimbursement and availability of laboratory-based PSG.

Supplementary materials

Figure S1 Overlap of symptoms associated with sleep apnea and insomnia.

Notes: (A) Venn diagram of symptoms related to sleep apnea: snoring (solid line, blue fill), gasping arousals (dashed line, red fill), and witnessed apnea (dotted line, yellow fill). The n-value (sample size) for each category: snoring only =656; snoring and witnessed apnea =337; gasping and snoring and witnessed apnea =166; snoring and gasping =70; gasping only =34; witnessed apnea and gasping =13; witnessed apnea only =61. (B) Venn diagram of insomnia symptoms: onset (dashed line, blue fill), maintenance (solid line, yellow fill), and listing insomnia as the reason for PSG (dotted line, red fill). The n-values for each category: onset only =148; onset and maintenance =331; maintenance only =521; maintenance and listing insomnia as the reason for PSG =121; listing insomnia as the reason for PSG but no other symptoms were indicated =10; onset and maintenance insomnia and listing insomnia as the reason for PSG =288; onset and listing insomnia as the reason for PSG =38.

Abbreviation: PSG, polysomnography.


Figure S2 Overlap of symptoms associated with restless legs and with narcolepsy.

Notes: (A) Venn diagram of symptoms related to restless legs: uncomfortable sensation in the legs (solid line, blue fill), better with movement (w/mov’t) (dotted line, red fill), and worse at night (dashed line, yellow fill). The n-values (sample sizes) for each category: uncomfortable sensation alone =94; uncomfortable and better with movement =72; better with movement alone =26; better with movement and worse at night =33; uncomfortable and better with movement and worse at night =90; worse at night alone =42; uncomfortable and worse at night =49. (B) Venn diagram of narcolepsy symptoms: peri-sleep hallucinations (dashed line, blue fill), sleep paralysis (dotted line, red fill), and cataplexy (solid line, yellow fill). The n-values for each category: hallucinations alone =77; hallucinations and cataplexy =11; cataplexy alone =82; hallucinations and sleep paralysis =21; hallucinations and cataplexy and sleep paralysis =17; cataplexy and sleep paralysis =30; sleep paralysis alone =59.


Figure S3 Assessing missing data.

Notes: (A) An example of missing data, outliers, and data reversals, indicated by gray shading (subject code MGH24: the BMI and ESS scores are switched, but the erroneous value is implausible only in the ESS column). Column statistics (maximum, minimum, count of text entries, and count of missing cells) can help flag potentially anomalous data. (B) The age variable from (A) shown as a bar plot with SD, a box-and-whisker plot, and a dot plot; the outlier is not evident in the bar plot with SD. (C) The BMI variable from (A); similarly, the outlier is not evident in the bar plot with SD, and no display hints at the switch with ESS because the erroneous value was plausible. In the Sex column in (A), 0= female and 1= male.

Abbreviations: BMI, body mass index; ESS, Epworth Sleepiness Scale; m, male; Max, maximum; Min, minimum; SD, standard deviation; Subj, subject; TST, total sleep time.
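The screening steps illustrated in Figure S3A (column summaries plus plausibility checks) are straightforward to automate. The sketch below, in Python/pandas, uses hypothetical column names and illustrative plausibility ranges; as in the MGH24 example, range checks catch values that are impossible for a given variable (e.g., ESS above 24) but cannot detect a swapped value that happens to fall within a plausible range.

```python
import numpy as np
import pandas as pd

# Hypothetical extract of a clinical table; field names and ranges are illustrative.
df = pd.DataFrame({
    "Age": [54, 61, 47, 199, 38],             # 199 is an implausible outlier
    "BMI": [31.2, np.nan, 27.5, 9.0, 24.8],   # 9.0 may be an ESS value entered as BMI
    "ESS": [8, 12, np.nan, 41.5, 6],          # ESS is bounded 0-24, so 41.5 is impossible
})

# Column statistics analogous to Figure S3A: min, max, non-missing and missing counts.
summary = df.agg(["min", "max", "count"]).T
summary["n_missing"] = df.isna().sum()
print(summary)

# Simple plausibility ranges flag impossible values, but not plausible swaps.
plausible = {"Age": (18, 110), "BMI": (12, 80), "ESS": (0, 24)}
for col, (lo, hi) in plausible.items():
    bad_rows = df.index[(df[col] < lo) | (df[col] > hi)].tolist()
    if bad_rows:
        print(f"{col}: implausible values in rows {bad_rows}")
```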


Figure S4 Normality testing results vary by sample size.

Notes: The listed variables were tested for normality (D’Agostino–Pearson test); “−” indicates that the variable failed the test and “+” indicates that it passed. Columns correspond to random subsets of the full dataset at increasing sample sizes; no variable passed when the full sample of 1,800 was used.

Abbreviations: AHI, apnea–hypopnea index; AHI NonSup, apnea–hypopnea index in non-supine; AHI Sup, apnea–hypopnea index in supine; BMI, body mass index; CAI, central apnea index; CAI NonSup, central apnea index in non-supine; CAI Sup, central apnea index in supine; ESS, Epworth Sleepiness Scale; LPS, latency to persistent sleep; mean HR, mean heart rate; Min O2 NR, minimum oxygen in non-REM; N1–3, non-REM stages 1–3; Min O2 REM, minimum oxygen in REM; PLMI, periodic limb movement index; REM, rapid eye movement; Spont AI, spontaneous arousal index; Sup%, supine percentage; TST, total sleep time; #W≥30s, number of wakes ≥30 seconds.
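As one illustration of how sample size drives these results, the sketch below applies the D’Agostino–Pearson omnibus test (available as scipy.stats.normaltest) to random subsets of increasing size drawn from a skewed synthetic variable; the gamma-distributed stand-in for AHI is an assumption for demonstration only, not the clinical distribution itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Skewed synthetic stand-in for a PSG metric such as AHI (illustrative assumption).
ahi = rng.gamma(shape=1.5, scale=10.0, size=1800)

# D'Agostino-Pearson test on random subsets of increasing size. Small subsets
# often "pass" (p > 0.05) simply because the test is underpowered, while the
# same skewed variable reliably fails at the full sample size.
for n in (20, 50, 100, 500, 1800):
    subset = rng.choice(ahi, size=n, replace=False)
    stat, p = stats.normaltest(subset)
    print(f"n={n:5d}  p={p:.4f}  {'passes' if p > 0.05 else 'fails'} normality")
```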


Disclosure

Dr Matt T Bianchi has received funding from the Department of Neurology, Massachusetts General Hospital, the Center for Integration of Medicine and Innovative Technology, the Milton Family Foundation, the MGH-MIT Grand Challenge, and the American Sleep Medicine Foundation. He has a pending patent on a sleep wearable device, has received research funding from MC10 Inc and Insomnisolv Inc, has a consulting agreement with McKesson Health and International Flavors and Fragrances, serves as a medical monitor for Pfizer, and has provided expert testimony in sleep medicine. Dr M Brandon Westover receives funding from NIH-NINDS (1K23NS090900), the Rappaport Foundation, and the Andrew David Heitman Neuroendovascular Research Fund. The authors report no other conflicts of interest in this work.
