![MathJax Logo](/templates/jsp/_style2/_tandf/pb2/images/math-jax.gif)
Abstract
Incidence vs. Cumulative Cases (ICC) curves are introduced and shown to provide a simple framework for parameter identification in the case of the most elementary epidemiological model, consisting of susceptible, infected, and removed compartments. This novel methodology is used to estimate the basic reproduction ratio of recent outbreaks, including those associated with the ongoing COVID-19 pandemic.
1. Overview
This article introduces the concept of ICC (Incidence vs. Cumulative Cases) curves, which are nonlinear transformations of the traditional epidemiological (EPI) curves (plots of disease incidence versus time). The main message of the present work is that describing an outbreak in terms of its associated ICC curve is not only natural from a dynamical point of view, but also advantageous for parameter identification and forecasting purposes. Specifically, we provide a method for the practical identification of epidemiological parameters of the SIR (Susceptible – Infected – Removed) compartmental model, directly from its associated ICC curve. Because the result is robust to systematic under-reporting, this approach gives a simple way to estimate the basic reproduction ratio of a disease from reported incidence data. The present analysis also provides a theoretical justification for the forecasting methodology put forward in [Citation11], which was shown to lead to useful estimates of the characteristics (expected duration, final number of cases, and peak intensity) of ongoing outbreaks.
We develop the concept of ICC curves in the context of the classical SIR model [Citation8, Citation9], which is the simplest compartmental model that captures the basic phenomenon of disease transmission: before recovering (or being removed through disease-induced death), infected individuals transmit the disease to susceptible individuals, who in turn become infected. The process repeats itself until the epidemic has followed its course and no additional infections occur. Since many outbreaks have a time scale that is much shorter than the scale at which the total population N changes, a common simplification is to assume that N is constant. It is then natural to define the time-dependent quantities
and
as the expected sizes of the susceptible (S), infected (I), and recovered (R) compartments relative to N, so that
The resulting SIR model is then given by
where β is the contact rate of the disease and γ is its recovery rate. Scaling time by
turns the above equations into a one-parameter (
) model, where
is the basic reproduction ratio [Citation3, Citation4, Citation17] of the modelled epidemic. The disease spreads if
and dies out when
It is also clear from the right-hand side of the second SIR equation that for
the proportion of infected individuals grows as a function of time as long as
In a general compartmental model, an outbreak is a trajectory that starts in the vicinity of the disease-free equilibrium and ends at the endemic equilibrium, when the number of infected individuals approaches zero. We define an ICC curve, where ICC stands for Incidence vs. Cumulative Cases, as a representation of the dynamics associated with an outbreak, in the plane of coordinates C (cumulative cases) and (incidence). The following theorem, established in Section 2, gives an exact formulation of the ICC curve of the SIR model.
Theorem 1.1
For a total susceptible population of size N and each initial condition , the ICC curve of the SIR model is given by
(1)
(1) where
and
the final number of cases, is the positive solution of the transcendental equation
(2)
(2)
Details of the derivation of this theorem are given in Section 2.1. Its significance is illustrated in Figure , which shows a numerical simulation of the SIR model and a plot of the corresponding ICC curve for . Instead of considering the time dependence of S, I, and R as shown in the left panel of Figure , the course of the outbreak is captured by a single curve: a plot of incidence versus cumulative cases. As explained in Section 2.1, knowledge of the ICC curve is sufficient to reconstruct the temporal behaviour of S, I, and R. In other words, the ICC curve contains the entire information associated with any outbreak described by the SIR model.
Figure 1. (Color online). Simulation of the SIR model showing the time-dependence of the S, I, and R compartments (left), for . The corresponding ICC curve is shown on the right. The exact expression given by Equation (Equation1
(1)
(1) ) is plotted as a solid black line; the numerically estimated ICC curve is shown as a thick cyan line.
![Figure 1. (Color online). Simulation of the SIR model showing the time-dependence of the S, I, and R compartments (left), for R0=2. The corresponding ICC curve is shown on the right. The exact expression given by Equation (Equation1(1) Gκ,N(C)=β(C+NR0ln(1−CN)−NR0ln(κ))(1−CN),(1) ) is plotted as a solid black line; the numerically estimated ICC curve is shown as a thick cyan line.](/cms/asset/97f6ffee-ff3e-4f0f-b973-578f2e678a88/tjbd_a_1912419_f0001_oc.jpg)
An important motivation for studying ICC curves is that they can easily be fitted to case data. Specifically, we show in Section 3.1 that, for fixed N, there exists a single set of parameters that minimizes the root mean square error between (noisy) simulated outbreak data and the ICC curve
given by Equation (Equation1
(1)
(1) ). Additionally, a numerical investigation indicates that such an estimation is robust to noise. As a consequence, the present approach provides a method to estimate the parameters of the SIR model, thereby making this model practically identifiable.
The rest of this manuscript is organized as follows. Section 2 presents the derivation of Equations (Equation1(1)
(1) ) and (Equation2
(2)
(2) ), and discusses the robustness of ICC curves to systemic under- (or over-) reporting. Section 3 explains how model parameters may be retrieved from ICC curves and assesses the robustness of the proposed procedure to reporting noise. Section 4 briefly discusses parameter estimation as an outbreak unfolds. Section 5 shows a few examples of application to real data, for outbreaks of gastroenteritis and COVID-19. Future extensions and applications of the methodology proposed in this article are reviewed in Section 6.
2. ICC curves of the SIR model
2.1. Proof of Theorem 1.1
Epidemiologists use the epidemiological (EPI) curve, a plot of incidence as a function of time, to visualize the course of an epidemic. In the SIR model, incidence is defined as
where C = R + I is the cumulative number of cases at time t. Note that C includes those who have recovered (and thus were a case at some point in time), as well as those who are currently infected. To derive Equation (Equation1
(1)
(1) ), write
which can be integrated to give
where
is a constant that depends on initial conditions. Typically,
(see e.g. [Citation1]) since for an outbreak in a naive population we normally have
and thus
However here we do not make this assumption and keep the parameter κ in the following discussion. With
and
we obtain
and consequently
Moving back to dimensional variables gives the expression in Equation (Equation1
(1)
(1) ), i.e.
.
An approximate expression of the SIR ICC curve, assuming and C/N sufficiently small was given in [Citation11]. Since
on
, for a given initial condition
, the part of the ICC curve traced out as the outbreak unfolds will be such that
. However, the ICC curve is well defined for C in the range
, which is the definition we adopt here. Even though the time variable does not appear in the ICC curve, the value of κ is not arbitrary:
is the value of C when R = 0, where R represents the removed compartment.
ICC curves are particularly useful for forecasting the course of an outbreak because they contain information on quantities that are of interest to public health practitioners, namely the final number of cases () and peak incidence (
). The latter is simply the maximum of
; the former may be found as follows. Since C is the cumulative number of cases, it is a monotonic function of time and its derivative,
must remain non-negative. For C/N small,
where
By continuity,
will remain non-negative on the interval
where
is the unique positive solution of Equation (Equation2
(2)
(2) ). By writing Equation (Equation2
(2)
(2) ) as
where the right-hand-side is the sum of two non-negative monotonic functions of
, one increasing and the other decreasing, and since
one can see that such a solution exists and is unique. Asymptotically, when the disease has followed its course and all infections have occurred, I = 0 and
Then,
and the conservation law N = S + I + R becomes
In other words,
defined in Equation (Equation2
(2)
(2) ) is simply the final (asymptotic) number of cases associated with the outbreak. This concludes the proof of Theorem 1.1.
To see that the entire dynamics of an outbreak in the SIR model is captured by the associated ICC curve, note that knowledge of leads to knowledge of
In other words, the time course of S, I, and R may be obtained from the solution of the differential equation
prescribed by the ICC curve, with appropriate initial conditions (which in turn define the value of κ).
2.2. Robustness to under- or over-reporting
The quantity N appearing in Equation (Equation1(1)
(1) ) refers to the total population taken into account in the model, which is conserved by the dynamics. In case of over- or under-reporting, the observed incidence expectation
is a fraction of the actual incidence
, so that
where we assume that α is constant or varying slowly enough for its derivative to be negligible. The ICC curve estimated from case report data will then be a graph of
as a function of
We thus have
where
. Therefore, the ICC curve in the presence of over-reporting (
) or under-reporting (
) has the same functional form, with the same parameter values, as the true ICC curve, provided
, C, and N are rescaled by the factor α. For the same reason,
is also independent of the value of α.
While systematic under- or over-reporting does not affect the ICC curve provided the size of each compartment is appropriately rescaled, the reproduction ratio depends on N through the ratio , as can be seen from Equation (Equation2
(2)
(2) ), in which setting
leads to
(3)
(3) Finally, rescaling time amounts to rescaling β and γ, and consequently the ICC curve.
3. Parameter identification
Structural identifiability (see e.g. [Citation14] for a review) relates to the ability of uniquely identifying model parameters from knowledge of the model output. In this context, N is assumed to be known since it represents the population of the system to which the model is applied. Given one can calculate
From Equation (Equation1
(1)
(1) ), it is clear that knowledge of
and
uniquely determines β, γ, and κ. Indeed, if two sets of parameters,
and
, led to the same output
(and thus to the same values of its derivative
), we would have
for all values of
, which leads to
,
, and
Therefore, all of the parameters of the SIR model, including initial conditions, are structurally identifiable from knowledge of
the cumulative number of cases, and N, the size of the susceptible population. Similar results were obtained in [Citation5, Citation6, Citation18]. If N is unknown, setting
for all values of
, additionally leads to
, making the size of the population N a structurally identifiable parameter as well.
Whereas structural identifiability depends on the nature of the model itself, practical identifiability relates to parameter identifiability in the presence of measurement noise, as well as (hopefully small) deviations between model and reality (see e.g. [Citation14]). This question may be addressed numerically by simulating an exact solution of the model, adding noise to its output, and estimating the model parameters that best fit the perturbed output. Two recent articles [Citation16, Citation18] discuss the practical identifiability of a variety of compartmental models and reach different conclusions regarding parameter identification of the SEIR model from cumulative data. In [Citation18], the cumulative data is multiplied by where ϵ is normally distributed with zero mean, whereas in [Citation16], Poisson-distributed noise of mean equal to the current incidence value [Citation2] is used as a substitute for the incidence data. Whether or not parameters are identifiable in practice therefore depends on the type of observational noise that affects surveillance reports and thus the EPI curve.
Adding independently distributed noise to the incidence rather than the cumulative data is more realistic since errors on C are considered to accumulate and not be independent [Citation10]. On the one hand, assuming Poisson-type of uncertainty is natural for data that represent counts, but the effect of replacing incidence data by a Poisson random variable of same mean will lead to significant relative deviations only if N is small (a few hundreds). For larger values of N and thus of the relative change
scales like
Indeed,
With Stirling's formula, we obtain
If it is believed that the diminishing relative influence of the Poisson-distributed noise for large values of N is not realistic for a specific epidemic, then multiplying
by
where ϵ is normally distributed with mean zero, circumvents this issue. We consider both types of noise below.
To test practical identifiability, we simulate the SIR model and generate a discrete time series for the cumulative number of cases of the simulated outbreak. This time series is used as a reference and represents the unperturbed (or true) data. We then add noise to the associated discrete incidence data (defined as the difference of cumulative values between consecutive time points), following the procedure described in [Citation2], with the additional realistic constraint that the final number of cases is the same for all noisy realizations of the same outbreak. Two examples of simulated outbreaks generated in this fashion are shown in Figure for different values of (associated with different orders of magnitude of the incidence
) and different discretization time steps
. For
, we use
. For
, we use
As shown in Figure , this latter choice leads to a discretized ICC curve with very few points in regions of large incidence, and simulates a situation where incidence data is of poorer frequency. Each time series, consisting of noisy incidence data, is used to calculate an associated time series for the corresponding cumulative data. The resulting set of points is plotted in the (C,
) plane, and defines an empirical ICC curve. In both panels of Figure , C and
are estimated from discrete incidence data as described in Equation (Equation5
(5)
(5) ) below.
Figure 2. (Color online). Empirical ICC curves for two hypothetical outbreaks described by the SIR model with Poisson-distributed reporting noise. Incidence information is assumed to be available every , with
for
and
for
. In each panel, stars correspond to the simulated outbreak data plotted in the (C,
) plane. Data points without added noise (circles) and the exact ICC curve (solid line) are also shown.
![Figure 2. (Color online). Empirical ICC curves for two hypothetical outbreaks described by the SIR model with Poisson-distributed reporting noise. Incidence information is assumed to be available every δt, with δt=1 for R0=2 and δt=2 for R0=10.5. In each panel, stars correspond to the simulated outbreak data plotted in the (C, I) plane. Data points without added noise (circles) and the exact ICC curve (solid line) are also shown.](/cms/asset/0f08ab6d-c1d2-454e-9280-499c0794ca10/tjbd_a_1912419_f0002_oc.jpg)
3.1. Fitting ICC curves to data
Given a simulated time series for the number of cases of an outbreak in a population of size N, we find the parameters of the ICC curve that best fits the data by least square minimization. This approach is different from parameter identification methods that have been proposed in the literature until now. Indeed, traditional methods aim to find the best combination of model parameters consistent with observed time series; they therefore require numerical integration of the compartmental model whose parameters need to be identified, in order to estimate and then minimize an associated cost function (see for instance [Citation2, Citation16, Citation18]). Instead, the approach we put forward here consists in finding model parameters by directly fitting the relevant ICC curve to the data. We claim that this is a more robust approach because the functional that needs to be minimized has a unique extremizer in parameter space, which can be calculated exactly from the data points.
Given a time series of discrete incidences or of discrete cumulative cases
recorded every
units of time, we introduce a cost function that measures how much the data points depart from the ICC curve of the SIR model parametrized by β, γ, and
This function,
, is defined by
(4)
(4) where
and
(5)
(5) A Maple [Citation13] calculation shows that the cost function
has a unique critical point, whose expression is given in Appendix. For outbreaks that start in a naive population, the formulas are hugely simplified by setting p = 0. The resulting cost functional then has a unique minimum, whose value may easily be calculated. With the following notation,
we have (after setting p = 0),
Setting the right-hand sides to zero and solving the resulting linear system for β and γ leads to the following expression for the unique global minimizer of
when p = 0:
where
In what follows, we use the expressions given in Appendix, which can easily be coded up. We checked that for time series generated from an SIR model with
, the simplified expressions above provide good estimates of the actual values of β and γ that were used to generate the reference outbreak. Importantly, we now have a means of associating a single set of values for the parameters β, γ, and
to each outbreak time series.
3.2. Parameter identification in the presence of noise
We now use the expressions found in Section 3.1 to assess the robustness of the proposed parameter identification method to noise. As discussed above, the SIR model will be declared practically identifiable provided good estimates of the parameters may be obtained from noisy data. We therefore simulate 100,000 noisy time series from a ‘reference’ numerical integration of the SIR model, use the formulas of Appendix to calculate ,
, and
for each time series, and plot histograms of the estimated values of β and γ. We consider the two types of noise discussed above, both affecting the reported discrete incidence
In Case A, we assume that at each time point
the reported value of
has a Poisson distribution of mean
In Case B, it is assumed that the reported value of
at each time point is of the form
where
is the discrete incidence of the reference outbreak, ϵ is the noise amplitude, and
is standard normal.
3.2.1. Case A: Poisson-distributed incidence
Figure shows the distributions (in the form of properly scaled histograms) of
and
obtained from 100,000 noisy realization from a reference simulation of an SIR model with parameters
and
(
). Figure shows similar plots for
and
, so that
. In each case, the size of the total population is N = 10, 000 individuals. The corresponding sample means and standard deviations are displayed in Table . These results indicate that the respective means of
and
provide very good estimates of the parameters β and γ. Accuracy is not as good for the larger value of
, whose distribution shows a fairly long tail. This is to be expected because the actual value of γ in this case is close to zero. Moreover, the number of data points away from the end points of the outbreak is much smaller for
than for
(see Figure ), leading to a potential loss of information.
Figure 3. (Color online). Numerically estimated distributions of the parameter values ,
, and
for 100,000 outbreaks corresponding to noisy (Poisson-distributed) realizations of a reference outbreak with parameters
and
(
). Corresponding means and standard deviations are displayed in Table .
![Figure 3. (Color online). Numerically estimated distributions of the parameter values βc, γc, and R0=βc/γc for 100,000 outbreaks corresponding to noisy (Poisson-distributed) realizations of a reference outbreak with parameters β=0.5 and γ=0.25 (R0=2). Corresponding means and standard deviations are displayed in Table 1.](/cms/asset/2bd53187-cb69-4aa4-9c8b-06200530e281/tjbd_a_1912419_f0003_oc.jpg)
Figure 4. (Color online). Same as Figure , but with parameters and
(
). Corresponding means and standard deviations are displayed in Table .
![Figure 4. (Color online). Same as Figure 3, but with parameters β=0.5 and R0=10.5 (γ≃0.0476). Corresponding means and standard deviations are displayed in Table 1.](/cms/asset/5b3cc2ad-2192-45dc-b75a-ad612d290ac2/tjbd_a_1912419_f0004_oc.jpg)
Table 1. Empirical mean (μ) and standard deviation (σ) of estimated parameter values in the presence of Poisson-distributed noise, for two values of .
3.2.2. Case B: normally distributed noise
At each time point, we now multiply the reference incidence by , where s is standard normal and the amplitude ϵ takes on one of three values: 0.05, 0.15, or 0.25. This type of noise is less realistic for incidence reports, whose variability is expected to be well described by a Poisson distribution, as in Case A above. We however perform the present tests to illustrate the robustness of the parameter estimation method introduced in this article.
For , the parameter estimation procedure is robust at all values of ϵ considered. Means and standard deviations of
,
, and
are displayed in Table . The corresponding distributions are shown in the top row of Figure . As expected, estimates of β and γ are not as accurate for simulations with
, but return mean values for β and γ that are close to the exact values. The distributions of
however show very long tails (see bottom row of Figure ), due to values of
close to zero. In such a case, estimating
as the ratio
however gives acceptable values, equal to 9.96, 9.19, and 7.53, for ϵ equal to 0.05, 0.15, and 0.25 respectively.
Figure 5. (Color online). Similar to Figures (top row) and (bottom row) but for normally distributed noise of variable amplitude, set at 0.05, 0.15, or 0.25. Corresponding means and standard deviations are displayed in Table .
![Figure 5. (Color online). Similar to Figures 3 (top row) and 4 (bottom row) but for normally distributed noise of variable amplitude, set at 0.05, 0.15, or 0.25. Corresponding means and standard deviations are displayed in Table 2.](/cms/asset/e4131e5d-1026-49a0-a40f-35ac022ad9c1/tjbd_a_1912419_f0005_oc.jpg)
Table 2. Empirical mean (μ) and standard deviation (σ) of estimated parameter values in the presence of normally distributed noise of mean zero and amplitude ϵ, for two values of .
In summary, the above numerical simulations indicate that the ICC-based parameter estimation method presented here provides good results in the presence of reporting noise. Not surprisingly, the quality of the estimates depends on the rate at which incidence is sampled and, in the case of normally distributed noise, on the amplitude of the noise.
4. Parameter identification as the outbreak unfolds
In principle, the formulas for and
given in Appendix may be used with any number of points M, as long as the data points represent the entire ICC curve. It is natural to ask how the method performs when applied to a developing outbreak. To test this, we simulate 50,000 possible outbreaks using the same reference simulations as before, and with the two different types of noise introduced above to represent reporting error. We then follow the evolution of the distributions of the estimated parameters as time increases. For
, the outbreak has run its course after 60 units of time, and we include up to
data points, with consecutive values separated by 1 unit of time (which could correspond to a day or a week depending on the disease). For
, we pick
, with data points separated by 2 units of time. Figure shows estimated values of β, γ, and
for the two types of reporting noise. In both cases, estimates of β and γ are close to their actual values near t = 20, which is before the peak of the outbreak (see left panel of Figure ). The value of
converges more slowly (circles in the bottom row of Figure ) but the ratio of the estimated values of β and γ (stars) is close to the actual value of
near t = 12, fairly early in the outbreak.
Figure 6. (Color online). Estimates of β, γ and as the outbreak unfolds for Poisson-distributed noise (left) and normally distributed noise of size 0.15 (right), in a situation where
. Estimates of
show the evolution of
(circles) and
(stars), where
indicates sample mean. Error bars correspond to one standard deviation on each side of the mean.
![Figure 6. (Color online). Estimates of β, γ and R0 as the outbreak unfolds for Poisson-distributed noise (left) and normally distributed noise of size 0.15 (right), in a situation where R0=2. Estimates of R0 show the evolution of <β/γ> (circles) and <β>/<γ> (stars), where <⋅> indicates sample mean. Error bars correspond to one standard deviation on each side of the mean.](/cms/asset/d4a85080-45df-4181-970d-4e25366eaf8f/tjbd_a_1912419_f0006_oc.jpg)
Similar results for , together with plots showing the evolution of the distributions of
,
, and
, are provided as supplementary information. Parameter estimates for β and γ are close to their true values near t = 22, which is near the peak of the outbreak, with the ratio
giving a reasonable estimate of
starting at t = 18, even in the presence of large uncertainty.
5. Application to outbreak data
So far, we have assumed the size of the susceptible population, N, to be known. In practice, the value of N that should be used to fit the ICC curve to outbreak data is often unknown since it may reflect under-reporting (as discussed in Section 2.2) or spatial effects (existence of localized clusters to which the spread is restricted) associated with the outbreak. Moreover, it could be argued that the SIR model is too simplistic to represent real-life outbreaks. The examples of [Citation11] and those discussed below however show there is merit to the present approach.
In what follows, we fit ICC curves to surveillance reports by picking a range of values of N for which the root mean square error (RMSE) between the ICC curve and the data is within 2% of the minimum RMSE value. The data points used in the evaluation of the RMSE are found from the reported cumulative data after smoothing and interpolation. This latter procedure estimates missing incidence values and removes negative incidence reports, if any. The resulting time series is expected to be a better approximation of the ‘true’ (i.e. as close as possible to the output of a compartmental model) evolution of the cumulative data, and is used for the parameter estimation procedure discussed in this article. Our first example is a gastroenteritis outbreak in a nursing home in Mallorca, Spain, and is therefore spatially localized. The other examples are estimations of the basic reproduction ratio for COVID-19 outbreaks in Hubei Province (China), the Republic of Korea, and France, based on data available until 15 April 2020.
5.1. Gastroenteritis outbreak in mallorca, Spain
We use the same data set as in [Citation11], estimated from the epidemiological curve provided in [Citation12]. We fit the ICC curve to the interpolated data (i.e. find ,
, and
) for a range of values of N, identify the value
of N that best fits the data (i.e. for which the RMSE is minimum), and select a range of values of N near
that give a RMSE within 2% of its minimum. Then, for each value of N in the selected range, we apply the parameter estimation procedure of Section 3, with 10,000 oubreak realizations, obtained from adding Poisson-distributed noise to the smoothed and interpolated data. Figure shows the resulting parameter distributions (top panel;
) and a plot of the ICC curve for the optimal parameter values (bottom panel; N = 120,
, and
). The value of
for the ICC curve displayed is 1.52 and corresponds to an attack rate of 59.8%. This is near the upper boundary of the 95% confidence interval (38.5–61.3%) for the overall attack rate, but well within the 95% confidence interval of the attack rate among nursing home residents (42.1–72.2%) reported in [Citation12].
Figure 7. (Color online). Parameter estimates for a gastroenteritis outbreak in a nursing home in Mallorca, Spain [Citation12]. The top panel shows histograms (scaled to represent probability distributions) of ,
, and
, for
. The bottom panel shows the ICC curve for the optimal parameter values, corresponding to a value of
.
![Figure 7. (Color online). Parameter estimates for a gastroenteritis outbreak in a nursing home in Mallorca, Spain [Citation12]. The top panel shows histograms (scaled to represent probability distributions) of βc, γc, and R0, for N∈[93,186]. The bottom panel shows the ICC curve for the optimal parameter values, corresponding to a value of R0=1.52.](/cms/asset/f38680a9-c65e-4a34-b07e-e299f24ed33e/tjbd_a_1912419_f0007_oc.jpg)
5.2. COVID-19
Figure shows ICC curves for the first waves of COVID-19 outbreaks in a few countries. The data was obtained from the World Health Organization daily situation reports [Citation19] as well as from the Wuhan Health Municipal Commission [Citation21]. For Hubei Province, only laboratory confirmed cases were used. Consequently, daily increment values for 17–19 February 2020 (days 65–67 in the data set, when China reported combined laboratory and clinically confirmed cases for a brief period of time) were set to half of the reported number of confirmed cases; the number of cumulative cases was then obtained by summing daily reports of laboratory confirmed cases. For the Republic of Korea, only the first 58 points in the data set were used, since they correspond to the first wave of the outbreak. The third example is for a country (France), where the outbreak was not over as of 15 April 2020. For each of these examples, we show the ICC curve for the set of parameters that best fits the interpolated data. The optimal values of N are N = 51, 160 for Hubei Province, N = 10, 282 for the Republic of Korea, and N = 161, 142 for France. The corresponding optimal basic reproduction ratio values are for Hubei Province,
for the Republic of Korea, and
for France. A histogram of estimated values of
, scaled to represent a probability distribution function, is shown in the right panel of each row of Figure . These were obtained as described above in the case of the outbreak in Mallorca, except that no Poisson-distributed noise was added to the data. This is because incidence values are generally large and that, as discussed in Section 3, the relative effect of the added noise is minimal, leading to a level of variability of
that is insignificant compared to the effect of changing N. Values of
in the 2–3 range are consistent with early estimates of the basic reproduction ratio of COVID-19 published in the literature [Citation20, Citation22].
Figure 8. (Color online). Optimal ICC curves and estimated distributions of values for early COVID-19 outbreaks in Hubei Province (top panel), the Republic of Korea (middle panel), and France (bottom panel).
![Figure 8. (Color online). Optimal ICC curves and estimated distributions of R0 values for early COVID-19 outbreaks in Hubei Province (top panel), the Republic of Korea (middle panel), and France (bottom panel).](/cms/asset/91b47201-31e9-4463-b544-9d5de6a5e180/tjbd_a_1912419_f0008_oc.jpg)
The time course of the COVID-19 outbreak in Hubei Province, as described by the SIR model with optimal parameters identified by the present method (,
(
), and N = 51, 160) is shown in Figure , together with the reported data. The solid curves are obtained by integration of the ICC curve, with initial conditions corresponding to day 10 of the outbreak, and alignment with the data at day 45. Similar plots for the other outbreaks discussed in this section are provided as supplementary information. The good visual agreement between simulation and data indicates that the SIR model, and therefore the ICC curve approach presented in this manuscript, are able to capture the overall dynamics of real-life outbreaks.
Figure 9. (Color online). Cumulative cases (left) and incidence (right) as functions of time for the COVID-19 outbreak in Hubei Province, China. The solid curves represent the predictions of the optimal ICC curve and the open circles are the data points. Time is measured in days from 14 December 2019 (day 0).
![Figure 9. (Color online). Cumulative cases (left) and incidence (right) as functions of time for the COVID-19 outbreak in Hubei Province, China. The solid curves represent the predictions of the optimal ICC curve and the open circles are the data points. Time is measured in days from 14 December 2019 (day 0).](/cms/asset/de136fb1-d24b-4759-ab72-b6ae1d198868/tjbd_a_1912419_f0009_oc.jpg)
6. Conclusions
This article introduces a parameter estimation method for disease outbreaks that bypasses the numerical integration of epidemiological models. The approach relies on the use of ICC curves, also introduced here. ICC curves relate incidence to the cumulative number of cases C and may be viewed as nonlinear transformations of the traditional epidemiological curves, in which the time variable is replaced by C, a monotonically increasing but nonlinear function of time. For each single-wave outbreak, the ICC curve has a simple graph that crosses the horizontal axis near the origin and at the final value of C.
The formulas presented in Section 2, which extend the parabolic approximation suggested in [Citation11], are exact for the SIR model. The numerical experiments of Sections 3 and 4 show the method is robust to noise and may be used for parameter estimation as an outbreak unfolds, with the understanding that reasonably accurate estimates can only be reached shortly before, or after, incidence peaks. This is not a shortcoming of the present approach and was also observed in [Citation18] when fitting synthetic prevalence time series.
Because of the simplicity of Equation (Equation1(1)
(1) ), and the existence of a unique set of parameters that best fits any collection of data points, parameter distributions may be generated directly from epidemiological data. In the case of the SIR model, this methodology presents a fast and novel alternative to more traditional parameter estimation strategies, such as particle Makov Chain Monte Carlo methods (PMCMC; see [Citation7] for a review). This may be beneficial in pandemic situations where epidemiological estimates need to be updated daily and in many locations simultaneously. For instance, recent work on COVID-19 data from Wuhan suggests that an SIR model can better capture the information contained in case reports than an SEIR model [Citation15]. For more complex compartmental models, distributions generated by the present approach may be used as priors for PMCMC estimations of the contact and recovery rates of a disease.
When applying the proposed method to surveillance data, an estimate of N may not always be available. The examples of Section 5 suggest that this parameter may be identified by optimizing the fit between the ICC curve and a smoothed and interpolated version of the reported data. However, any variability in the selected value of N will be associated with variability in estimates of the basic reproduction number . Indeed, for an outbreak that has completed its course, any good fits of the reported data by an ICC curve will produce similar values of
, the final number of cases for the outbreak (see for instance the plots of Figure ). Because of Equation (Equation3
(3)
(3) ), if
is known, the estimated value of
is only a function of N. This uncertainty is inherent to the SIR model and cannot be avoided. The existence of a value
of N that best fits the data is encouraging, but more robust estimates are likely to be obtained if additional information is available. In the absence thereof, a range of values of N close to the optimal value
should be used to produce credible intervals for the basic reproduction ratio
.
As previously mentioned, empirical (see [Citation11]) and numerical observations by the author suggest ICC curves of shape similar to Equation (Equation1(1)
(1) ) are ubiquitous in outbreak data and in compartmental epidemiological models. It would therefore be interesting to explore methods which, like the discussion of Section 2, lead to exact formulations of ICC curves. In particular, knowledge of how model parameters affect the shape of ICC curves could provide simple means to visualize the consequences of mitigation effects. Combined with data assimilation, ICC curves may thus present a convenient paradigm for forecasting the course of an outbreak.
Supplemental Material
Download PDF (821.4 KB)Disclosure statement
No potential conflict of interest was reported by the author(s).
Data availability statement
For codes and data sets used in this study, please see https://jocelinelega.github.io/EpiGro/.
References
- N.F. Britton, Essential Mathematical Biology, Springer Undergraduate Mathematics Series, 3rd ed., 2005, pp. 90–93. Available at https://www.springer.com/us/book/9781852335366.
- G. Chowell, Fitting dynamic models to epidemic outbreaks with quantified uncertainty: A primer for parameter uncertainty, identifiability, and forecasts, Infect. Dis. Model. 2 (2017), pp. 379–398. Available at https://doi.org/https://doi.org/10.1016/j.idm.2017.08.001.
- O. Diekmann, J.A.P. Heesterbeek and J.A.J. Metz, On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations, J. Math. Biol. 28 (1990), pp. 365–382. Available at https://doi.org/https://doi.org/10.1007/BF00178324.
- O. Diekmann, J.A.P. Heesterbeek and M.G. Roberts, The construction of next-generation matrices for compartmental epidemic models, J. R. Soc. Interface 7 (2010), pp. 873–885. Available at https://doi.org/https://doi.org/10.1098/rsif.2009.0386.
- M.C. Eisenberg, S.L. Robertson and J.H. Tien, Identifiability and estimation of multiple transmission pathways in cholera and waterborne diseases, J. Theor. Biol. 324 (2013), pp. 84–102. Available at https://doi.org/https://doi.org/10.1016/j.jtbi.2012.12.021.
- N. Evans, L. White, M. Chapman, K. Godfrey and M. Chappell, The structural identifiability of the susceptible infected recovered model with seasonal forcing, Math. Biosci. 194 (2005), pp. 175–197. Available at https://doi.org/https://doi.org/10.1016/j.mbs.2004.10.011.
- A. Endo, E. van Leeuwenb and M. Baguelin, Introduction to particle Markov-chain monte carlo for disease dynamics modellers, Epidemics 29 (2019), p. 100363. Available at https://doi.org/https://doi.org/10.1016/j.epidem.2019.100363.
- H.W. Hethcote, The mathematics of infectious diseases, SIAM Rev. 42 (2000), pp. 599–653. Available at https://doi.org/https://doi.org/10.1137/S0036144500371907.
- W.O. Kermack and A.G. McKendrick, A contribution to the mathematical theory of epidemics, Proc. R. Soc. Lond. A115 (1927), pp. 700–721. Available at https://doi.org/https://doi.org/10.1098/rspa.1927.0118.
- A.A. King, M. Domenech de Celles, F.M.G. Magpantay and P. Rohani, Avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to ebola, Proc. R. Soc. B. 282 (2015), p. 20150347. Available at https://doi.org/https://doi.org/10.1098/rspb.2015.0347.
- J. Lega and H.E. Brown, Data-driven outbreak forecasting with a simple nonlinear growth model, Epidemics 17 (2016), pp. 19–26. Available at https://doi.org/https://doi.org/10.1016/j.epidem.2016.10.002.
- M.A. Luque Fernández, A. Galmés Truyols, D. Herrera Guibert, G. Arbona Cerdá and F. Sancho Gayá, Cohort study of an outbreak of viral gastroenteritis in a nursing home for elderly, Majorca, Spain, February 2008, Euro Surveill. 13(51) (2008), pii=19070. Available at https://www.eurosurveillance.org/content/https://doi.org/10.2807/ese.13.51.19070-en.
- Maple. Maplesoft, a division of Waterloo Maple Inc., Waterloo, Ontario, 2018.
- H. Miao, X. Xia, A.S. Perelson and H. Wu, On identifiability of nonlinear ODE models and applications in viral dynamics, SIAM Rev. 53 (2011), pp. 3–39. Available at https://doi.org/https://doi.org/10.1137/090757009.
- W.C. Roda, M.B. Varughese, D. Han and M.Y. Li, Why is it difficult to accurately predict the COVID-19 epidemic?, Infect. Dis. Model. 5 (2020), pp. 271–291. Available at https://doi.org/https://doi.org/10.1016/j.idm.2020.03.001.
- K. Roosa and G. Chowell, Assessing parameter identifiability in compartmental dynamic models using a computational approach: application to infectious disease transmission models, Theor. Biol. Medical Model. 16 (2019), pp. 1.Available at https://doi.org/https://doi.org/10.1186/s12976-018-0097-6.
- P. van den Driessche and J. Watmough, Reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission, Math Biosci 180 (2002), pp. 29–48. Available at https://doi.org/https://doi.org/10.1016/S0025-5564(02)00108-6.
- N. Tuncer and T.T. Le, Structural and practical identifiability analysis of outbreak models, Math. Biosci. 299 (2018), pp. 1–18. Available at https://doi.org/https://doi.org/10.1016/j.mbs.2018.02.004.
- WHO. Coronavirus disease (COVID-2019) situation reports. Available at https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports.
- J.T. Wu, K. Leung and G.M. Leung, Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study, Lancet 395 (2020), pp. 689–697. Available at https://doi.org/https://doi.org/10.1016/S0140-6736(20)30260-9.
- Wuhan Health Municipal Commission. Available at http://wjw.wuhan.gov.cn/.
- See also the MIDAS Network dashboard, which lists estimates of R0 in a range of preprints. Available at https://midasnetwork.us/covid-19/.
Appendix. Critical parameter values
The function defined in Equation (Equation4
(4)
(4) ) has a unique extremizer
given by the following expressions:
where
and we have used the notation