Biostatistics & Epidemiology Volume 7, 2023 - Issue 1

Open access

Research Article

Public transportation network scan for rapid surveillance

Yuta Tanoue: (a) Institute for Business and Finance, Waseda University, Japan (corresponding author)
https://orcid.org/0000-0002-0005-0468

Daisuke Yoneoka: (b) Infectious Disease Surveillance Center, National Institute of Infectious Diseases, Tokyo, Japan; (c) Department of Health Policy and Management, School of Medicine, Keio University, Tokyo, Japan; (d) Department of Global Health Policy, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan; (e) Tokyo Foundation for Policy Research, Tokyo, Japan
https://orcid.org/0000-0002-3525-5092

Takayuki Kawashima: (f) Department of Mathematical and Computing Science, Tokyo Institute of Technology, Japan

Shinya Uryu: (g) Center for Environmental Biology and Ecosystem Studies, National Institute for Environmental Studies (NIES), Japan

Shuhei Nomura: (c) Department of Health Policy and Management, School of Medicine, Keio University, Tokyo, Japan; (d) Department of Global Health Policy, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan; (e) Tokyo Foundation for Policy Research, Tokyo, Japan
https://orcid.org/0000-0002-2963-7297

Akifumi Eguchi: (h) Department of Sustainable Health Science, Center for Preventive Medical Sciences, Chiba University, Japan

Koji Makiyama: (i) HOXO-M Inc., Tokyo, Japan

Kentaro Matsuura: (j) Department of Management Science, Graduate School of Engineering, Tokyo University of Science, Japan

Article: e2069458 | Received 04 Nov 2021, Accepted 02 Apr 2022, Published online: 02 Jun 2022

https://doi.org/10.1080/24709360.2022.2069458

Abstract

As people move around using public transportation networks such as trains and airplanes, emerging infectious diseases are expected to spread along these networks. The scan statistics approach has been frequently applied to identify high-risk locations, and the results are widely used for making clinical decisions in a timely manner. However, existing scan statistics are not optimally designed for modeling such spread and might not work effectively in emergency situations where computational time is critical. We propose a new scan statistics approach for the public transportation network, called PTNS (Public Transportation Network Scan). PTNS utilizes the available network structure to construct candidate potential clusters, and thus it works well especially in situations where public transportation is the main medium of infection spread. Further, it is designed for rapid surveillance. Lastly, PTNS is generalized to detect space-time clusters by customizing the iteration used to create potential clusters. Using simulation data generated with a real railway network, we show that PTNS outperforms the conventional methods, including the Circular- and Flex-scan approaches, in terms of detection performance, while the computational time remains feasible.

Keywords:

Scan statistics
Spatio-Temporal analysis
rapid surveillance
public Transportation
covid-19

1. Introduction

1.1. Previous studies and settings

Detecting high-risk geographical areas, or hotspots, of emerging infectious diseases such as COVID-19 is an important task in which estimates must be provided both accurately and rapidly. To address it, the scan statistics approach has been frequently applied to rapidly identify hotspots in the area of interest, and the results are widely used for making clinical decisions on medical resource allocation, prioritization, and interventions in a timely manner. In addition, the scan statistics approach has been applied not only in the field of infectious diseases but also in medical imaging, parasitology, forestry, cancer epidemiology, and astronomy [Citation1–9].

Consider a network $G = (V, E)$ with vertices $V$ and edges $E$, and counts $c_{i}$ for each vertex $s_{i} \in V$. The scan statistics approach tries to detect a subset $S \subset V$ in which the number or rate of interest is larger than in other subsets of $V$. To detect a subset $S$ with a significantly higher number or rate of interest, the scan statistics approach takes the following three steps: (1) construct candidate potential clusters $Z_{j}$ $(j = 1, \dots, M)$, each of which is a set of $s_{i}$, for a given region, network, or area; (2) calculate a (likelihood-based) summary statistic $D (Z_{j})$ for each candidate; and (3) identify the most likely cluster (MLC), which has the highest $D (Z_{j})$.

For example, if we consider a train network, each $s_{i}$ represents a station. That is, $s_{1}, s_{2}, \dots, s_{m}$ represent Shinjuku, Yoyogi, Tokyo, and so on. Further, $Z_{j}$ represents a set of stations $\{s_{i}\}$. Saying that $Z_{j}$ is a hotspot means that the stations belonging to $Z_{j}$ have a higher infection rate than the other stations.

Various types of summary statistics for Step (2) have been proposed [Citation10–13]. For example, Neil et al. proposed a Bayesian version of the summary statistic [Citation14]. Further, Gangnon and Clayton proposed weighted likelihood-based summary statistics [Citation15].

In short, the scan statistics approach counts the number of events $c_{i}$ observed in each fixed area $s_{i}$ $(i = 1, \dots, m)$ within the whole area of interest and then applies the three steps described above.

1.2. How to construct potential clusters of scan statistics and space-time scan statistics for rapid surveillance

Here, we explain how to construct the candidate potential clusters $\{Z_{j}\}_{j = 1}^{M}$. Ideally, all possible clusters should be constructed as potential clusters, that is, $M = 2^{m}$ combinations. However, this is difficult in practice because $M$ increases exponentially as $m$ increases, and the computational complexity becomes enormous.

To tackle this, various methods for potential cluster construction have been proposed. To construct potential clusters within a feasible time, these methods restrict the shape of potential clusters to certain types. Naus [Citation16] and Loader [Citation17] proposed rectangle-shaped potential clusters. Openshaw et al. [Citation18], Besag and Newell [Citation19], and Kulldorff et al. [Citation10] proposed circular potential clusters with a varying radius. Similarly, Christiansen et al. [Citation20] and Kulldorff et al. [Citation21] proposed elliptic potential clusters. Owing to the restriction on the shape, the computational time remains reasonable. Among these methods, the one proposed by Kulldorff is the best known and most widely used [Citation10]. The software SaTScan (https://www.satscan.org/) is frequently used in practical studies such as Azage et al. [Citation22], Coleman et al. [Citation23], and Sherman et al. [Citation24].

As pointed out by Tango and Takahashi [Citation25], although the above-mentioned methods, which restrict the shape of potential clusters, are useful and widely used in practice (e.g. Christiansen et al. [Citation20]), some clusters have irregular shapes, and the existing methods cannot cover such irregularly shaped clusters.

To address this, Tango and Takahashi proposed methods that can construct non-circular, flexibly shaped potential clusters [Citation25]. The details can be found elsewhere [Citation12, Citation13, Citation25–30]. These methods are known to work well, especially when the number of potential clusters is small and the computational time therefore remains in a reasonable range. However, as prior studies pointed out [Citation30–32], these methods become infeasible because of the long computational time as the number of potential clusters grows. In addition, in the field of computer science, there are similar attempts, such as the GraphScan approach [Citation33, Citation34], that use external information to restrict the search space for potential clusters.

These methods construct potential clusters from a spatial perspective, but several methods that try to extend the potential cluster from a space-time perspective have also been proposed [Citation35, Citation36].

1.3. Cluster detection on public transportation network

In today's cities, there are many means of transportation, such as cars, trains, planes, and buses, each forming its own network structure. As people move around on these networks, infection is expected to spread along them. Airborne, droplet, and contact transmission are all likely to occur on public transportation, so infectious diseases transmitted by these routes are likely to spread on the networks.

Although existing methods could theoretically be applied to network data, they are not optimally designed for modeling the spread of infection on a network and might not work effectively for detecting clusters along the network within a limited computational time. The circular approach might not be suitable for detecting long consecutive clusters: when many infected persons move along a long train line, the cluster takes a linearly extended shape along the line, and the Circular approach cannot capture it because of its non-circular shape. In addition, although flexible approaches can construct potential clusters of arbitrary shape, and can thus contain any cluster along the network as a subset, their computational burden is heavy, and they are too slow to be used in rapid surveillance. To capture the real-time spread of infection, we need a new method that constructs potential clusters suitable for detecting long consecutive clusters along a network of interest in a timely manner. Note that we assume each $s_{i}$ is a node on a given network, representing a station, bus stop, or car park, instead of a geographical area as in conventional scan statistics.

1.4. Contribution

To address the above issues, we propose a new scan statistics approach for the public transportation network, called PTNS (Public transportation network scan), for rapid surveillance of infectious diseases. This paper's contribution is four-fold.

Computational burden of our method is small compared with conventional scan methods. The number of iterations is linear in the number of nodes in the networks (Proposition 2.2).
Our method can detect long hotspots which are difficult to detect in the conventional methods.
Our method can be generalized to detect both space and space-time hotspots.
For the long hotspot detection, the result shows that our method outperforms the conventional methods in terms of accuracy, sensitivity, and positive predicted value.

The remainder of the article is organized as follows. In Section 2, we review the basic idea of the scan statistics approach and introduce our new scan-based method to detect high-risk clusters along an available network structure, and then extend it to detect space-time clusters. To demonstrate that the proposed method outperforms the conventional approaches in terms of accuracy, sensitivity, and positive predicted value, the results using simulation data generated with a real railway network in Japan are presented in Section 3. Finally, Section 4 contains a discussion and our conclusion.

2. Methods

Here, we describe the proposed method. First, we briefly explain the basic idea of scan statistics for count data. We follow the notation used in [Citation37]. Let $G$ be a network (also called a graph) that consists of $m$ nodes. $s_{i}$ $(i = 1, \dots, m)$ denotes a node on the network, e.g. a railway station, bus stop, or car park. $G$ is the set of $s_{i}$, i.e. $G := \{s_{1}, \dots, s_{m}\}$. $b_{i}$ and $c_{i}$ are the expected and observed numbers of infected people at node $s_{i}$ for $i = 1, \dots, m$, respectively. Generally, $b_{i}$ is assumed to be given or is calculated from past observed data as $b_{i} = n_{i} \frac{\sum_{i = 1}^{m} c_{i, past}}{\sum_{i = 1}^{m} n_{i}},$ where $n_{i}$ represents, for example, the number of residents around $s_{i}$ or the number of users of $s_{i}$; $n_{i}$ can be tailored to the problem at hand. Then, $Z_{j}$ $(j = 1, \dots, M)$ denote potential clusters, each consisting of multiple nodes, e.g. $Z_{j} := \{s_{1}, s_{2}, s_{3}\}$. The aim of scan statistics is to compare potential clusters and identify the $Z_{j}$ that has a significantly higher risk under hypothesis testing.
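The baseline formula for $b_{i}$ can be illustrated with a short sketch. The function name `expected_counts` and the toy numbers are assumptions introduced only for illustration; they do not come from the study.

```python
def expected_counts(n, c_past):
    """Expected counts b_i = n_i * sum(c_past) / sum(n).

    n      : population-like weights n_i (residents or station users)
    c_past : past observed counts c_{i,past}, one per node
    """
    total_n = sum(n)
    if total_n <= 0:
        raise ValueError("sum of n_i must be positive")
    overall_rate = sum(c_past) / total_n  # pooled past rate over all nodes
    return [n_i * overall_rate for n_i in n]


# Toy example (hypothetical numbers): three nodes.
n = [1000, 3000, 6000]
c_past = [2, 5, 13]
b = expected_counts(n, c_past)
```

By construction, the expected counts sum to the total of the past counts, so each node receives a share of the past total in proportion to $n_{i}$.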

2.1. Scan statistic for count data

Here we describe the specific scan statistics used in the hypothesis testing. We assume that $c_{i}$ follows a Poisson distribution with mean parameter $q b_{i}$:

$$c_{i} \sim Po (q b_{i}), \tag{1}$$

where $Po (a)$ denotes the Poisson distribution with mean parameter $a$ and $q$ represents the infection rate. Under the null hypothesis $H_{0}$, we assume that $c_{i}$ is generated by $Po (q_{all} b_{i})$ for all $s_{i} \in G$, where $q_{all}$ is a constant rate common to all $s_{i} \in G$. Under the alternative hypothesis $H_{1} (Z_{j})$, we assume that $c_{i}$ is generated by $Po (q_{in} b_{i})$ for all $s_{i} \in Z_{j}$ and $c_{k}$ is generated by $Po (q_{out} b_{k})$ for all $s_{k} \in G - Z_{j}$, where $q_{in}$ and $q_{out}$ are constants with $q_{in} > q_{out}$. In this setting, the following test statistic is known [Citation10]:

$$D (Z_{j}) := \left(\frac{C_{in}}{B_{in}}\right)^{C_{in}} \left(\frac{C_{all} - C_{in}}{B_{all} - B_{in}}\right)^{C_{all} - C_{in}} \quad \text{if } \frac{C_{in}}{B_{in}} > \frac{C_{all} - C_{in}}{B_{all} - B_{in}}, \tag{2}$$

and $D (Z_{j}) := 1$ otherwise, where $C_{in} = \sum_{s_{i} \in Z_{j}} c_{i}$, $B_{in} = \sum_{s_{i} \in Z_{j}} b_{i}$, $C_{all} = \sum_{s_{i} \in G} c_{i}$, and $B_{all} = \sum_{s_{i} \in G} b_{i}$. When $C_{all} = B_{all}$, the condition $\frac{C_{in}}{B_{in}} > \frac{C_{all} - C_{in}}{B_{all} - B_{in}}$ reduces to $C_{in} > B_{in}$ [Citation31].

The above scan statistic takes into account both the potential cluster and its outside. In contrast, Neil et al. (2005) focus only on the potential cluster, in the following sense [Citation37]. Under the null hypothesis $H_{0}$, we assume that $c_{i}$ is generated by $Po (b_{i})$ for all $s_{i} \in G$. Under the alternative hypothesis $H_{1} (Z_{j})$, we assume that $c_{i}$ is generated by $Po (q b_{i})$ for all $s_{i} \in Z_{j}$ and $c_{k}$ is generated by $Po (b_{k})$ for all $s_{k} \in G - Z_{j}$, where $q$ is a constant with $q > 1$. The test statistic $D (Z_{j})$ is then

$$D (Z_{j}) := \left(\frac{C_{in}}{B_{in}}\right)^{C_{in}} \exp (B_{in} - C_{in}) \quad \text{if } C_{in} > B_{in}, \tag{3}$$

and $D (Z_{j}) := 1$ otherwise. The following proposition relates these two scan statistics.

Proposition 2.1

Under $C_{all} = B_{all}$ and $C_{all} \to \infty$, Equations (2) and (3) are equivalent.

Proof.

From Equation (2),

$$\begin{aligned} D (Z_{j}) & = \left(\frac{C_{in}}{B_{in}}\right)^{C_{in}} \left(\frac{C_{all} - C_{in}}{B_{all} - B_{in}}\right)^{C_{all} - C_{in}} \\ & = \left(\frac{C_{in}}{B_{in}}\right)^{C_{in}} \left(1 + \frac{B_{in} - C_{in}}{B_{all} - B_{in}}\right)^{C_{all} - C_{in}} \\ & = \left(\frac{C_{in}}{B_{in}}\right)^{C_{in}} \left\{\left(1 + \frac{B_{in} - C_{in}}{C_{all} - B_{in}}\right)^{\frac{C_{all} - C_{in}}{B_{in} - C_{in}}}\right\}^{B_{in} - C_{in}} . \end{aligned}$$

Then, we have

$$D (Z_{j}) \to \left(\frac{C_{in}}{B_{in}}\right)^{C_{in}} \exp (B_{in} - C_{in}) \quad \text{as } C_{all} \to \infty . \qquad ■$$

With these statistics, we can detect clusters whose count of interest is higher than expected. For all $M$ potential clusters, we calculate $\{D (Z_{j})\}_{j = 1}^{M}$. Then, the potential cluster with the highest $D (Z_{j})$ is selected as the MLC. In what follows, we confine our interest to the efficient construction of the set of potential clusters $\{Z_{j}\}_{j = 1}^{M}$ for rapid surveillance of infectious diseases.
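The following minimal sketch shows how the two statistics and the MLC selection could be computed. It works with $\log D (Z_{j})$ to avoid numerical overflow, so $D = 1$ corresponds to a score of 0. Eq. (3) is applied when the observed count exceeds the expected count, in line with the reduced condition $C_{in} > B_{in}$ stated for Eq. (2). The function names and the toy data are assumptions made for illustration.

```python
import math


def log_d_kulldorff(c_in, b_in, c_all, b_all):
    """log of the statistic in Eq. (2); returns 0.0 (D = 1) otherwise."""
    c_out, b_out = c_all - c_in, b_all - b_in
    if b_in <= 0 or b_out <= 0:
        return 0.0  # degenerate cluster (empty or covering everything)
    if c_in / b_in <= c_out / b_out:
        return 0.0  # rate inside is not higher than outside
    log_d = c_in * math.log(c_in / b_in)
    if c_out > 0:  # the outside term is 1 when c_out == 0
        log_d += c_out * math.log(c_out / b_out)
    return log_d


def log_d_expectation(c_in, b_in):
    """log of the statistic in Eq. (3); returns 0.0 (D = 1) otherwise."""
    if b_in <= 0 or c_in <= b_in:
        return 0.0
    return c_in * math.log(c_in / b_in) + b_in - c_in


def most_likely_cluster(clusters, c, b):
    """Return the potential cluster with the highest log D under Eq. (2)."""
    c_all, b_all = sum(c), sum(b)
    best, best_score = None, float("-inf")
    for z in clusters:
        c_in = sum(c[i] for i in z)
        b_in = sum(b[i] for i in z)
        score = log_d_kulldorff(c_in, b_in, c_all, b_all)
        if score > best_score:
            best, best_score = z, score
    return best, best_score


# Toy example (hypothetical counts): nodes 1 and 2 are elevated.
c = [10, 30, 28, 9, 11]
b = [15, 15, 15, 15, 15]
candidates = [(0,), (1,), (1, 2), (2, 3), (0, 1, 2)]
mlc, score = most_likely_cluster(candidates, c, b)
```

In this toy example the candidate `(1, 2)` receives the highest score, since it contains exactly the two nodes with elevated counts.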

2.2. Proposed algorithm: PTNS

Let $| S |$, $K$, and $A_{i}$ be the size of a set $S$, the maximum size of a potential cluster, and the set of nodes adjacent to $s_{i}$ on a given network or in a given geographical area, respectively. Define the distance between two nodes $s_{i}$ and $s_{j}$ as $d (s_{i}, s_{j})$.

There are several possible distance measures: for example, the Euclidean distance between two geographical locations or a network-connectivity-based distance such as the passenger volume between two stations. The travel time between two stations can also be used as the distance measure. We then construct the list of potential clusters $\{Z_{i k}\}_{i = 1, \dots, m, k = 1, \dots, K}$ by Algorithm 1.

Lastly, the constructed $Z_{i k}$ with the highest $D (Z_{i k})$ defined in Equation (2) is selected as the MLC. The set $\{Z_{i k}\}_{i = 1, \dots, m, k = 1, \dots, K}$ corresponds to the set of potential clusters $\{Z_{j}\}$ introduced above. The following proposition, which follows directly from Algorithm 1, characterizes the computational efficiency of PTNS, measured by the number of internal iterations, in comparison with the conventional scan statistics approaches:

Proposition 2.2

PTNS requires $O (m K)$ iterations to construct potential clusters of at most $K$ nodes centered at each of the $m$ nodes. Thus, the number of iterations is linear in the potential cluster size $K$. The same order applies to the Circular-scan statistic developed by [Citation10]: i.e. the Circular-scan algorithm requires $O (m K)$ iterations. On the other hand, the Flex-scan algorithm developed by [Citation25] requires $O (m 2^{K})$ iterations to construct potential clusters of at most $K$ nodes centered at each of the $m$ nodes. Thus, its number of iterations is exponential in the potential cluster size $K$.

It is worth noting that, although PTNS has the same iteration order $O (m K)$ as the Circular-scan, PTNS can capture clusters that lie along the network structure and is thus able to identify clusters that have a longer diameter on a given network $G$, as shown in the simulation data analysis section. Figure 1 illustrates examples of characteristic clusters detected by each algorithm. It shows that the Circular-scan approach tends to detect a circular cluster centered at one station, and that Flex-scan tends to produce a cluster with a smaller diameter on the network because its clusters tend to become compact in one place with many short branches. In contrast, PTNS tends to detect a cluster with a larger diameter on the network because it expands the set of potential clusters along the network structure. Therefore, PTNS can work well in situations where the railway is the major medium of infection spread.
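Algorithm 1 itself is not reproduced in this text, so the following is only a rough sketch of a path-growing construction that is consistent with the description above: starting from each node, the cluster grows one adjacent node at a time, up to $K$ nodes, which gives at most $m K$ potential clusters. The rule for choosing the next node (the unvisited neighbor farthest from the starting node under $d$) is an assumption made for illustration and may differ from the authors' algorithm. The name `grow_path_clusters` and the toy line network are likewise hypothetical.

```python
def grow_path_clusters(adjacency, dist, k_max):
    """Grow one path-shaped family of clusters from every start node.

    adjacency : dict mapping node -> list of adjacent nodes (A_i)
    dist      : function d(u, v) returning a distance between two nodes
    k_max     : maximum cluster size K
    Returns a list of tuples; each tuple is one potential cluster.
    """
    clusters = []
    for start in adjacency:
        path = [start]
        clusters.append(tuple(path))  # cluster of size 1
        current = start
        while len(path) < k_max:
            candidates = [v for v in adjacency[current] if v not in path]
            if not candidates:
                break  # dead end: no unvisited neighbor to move to
            # ASSUMPTION: move to the neighbor farthest from the start node.
            current = max(candidates, key=lambda v: dist(start, v))
            path.append(current)
            clusters.append(tuple(path))  # cluster of size len(path)
    return clusters


# Toy line network A-B-C-D-E with positions 0..4 (hypothetical data).
position = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
clusters = grow_path_clusters(
    adjacency, lambda u, v: abs(position[u] - position[v]), k_max=3
)
```

Because each start node contributes at most $K$ clusters and growth stops at dead ends, the number of clusters never exceeds $m K$. The sketch also grows in one direction only and never returns to the starting node.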

Figure 1. Examples of detected clusters by each algorithm: Proposed (Top), Circular-scan (Bottom left), and Flex-scan (Bottom right).

2.3. Extension of PTNS to space-time potential clusters

We have explained how to construct spatial potential clusters for spatial scanning, but the idea can be easily extended to a space-time perspective [Citation38–41]. Here we assume that only the nodal attribute (i.e. $c_{i}$) is time-varying and that the rest of the network structure is time-invariant.

Let $T$ and $Z_{j, t}$ $(j = 1, \dots, M, t = 1, \dots, T)$ be the number of time points and the $j$th potential cluster at time $t$, respectively. Assume that the set of nodes included in $Z_{j, t}$ is the same at every time point. We construct a list of space-time potential clusters between times $t_{s}$ and $t_{e}$ ($t_{s} \leq t_{e}$), denoted by $\{Z_{j, t_{s} : t_{e}}\}_{j = 1, \dots, M, t_{s} = 1, \dots, T, t_{e} = 1, \dots, T, t_{s} \leq t_{e}}$, as follows (Algorithm 2).

In addition, this extension can be applied to other algorithms. In the following simulation study, we use it to construct space-time potential clusters from the spatial potential clusters generated by the other algorithms as well.
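The index set of the space-time potential clusters, $j = 1, \dots, M$ and $1 \leq t_{s} \leq t_{e} \leq T$, suggests a simple enumeration: every spatial potential cluster is paired with every time window. The sketch below follows that reading. The function name and the tuple representation `(cluster, t_s, t_e)` are assumptions for illustration, and the sketch works with any spatial cluster generator.

```python
def space_time_clusters(spatial_clusters, n_times):
    """Pair each spatial cluster with every window t_s..t_e (t_s <= t_e).

    spatial_clusters : iterable of node collections (node set fixed over time)
    n_times          : number of time points T (times are 1, ..., T)
    Returns a list of (cluster, t_s, t_e) tuples.
    """
    result = []
    for z in spatial_clusters:
        for t_s in range(1, n_times + 1):
            for t_e in range(t_s, n_times + 1):
                result.append((z, t_s, t_e))
    return result


# Toy example: two spatial clusters and T = 3 time points.
st_clusters = space_time_clusters([("A", "B"), ("B", "C", "D")], n_times=3)
```

Each spatial cluster yields $T (T + 1) / 2$ windows, so the space-time list contains $M T (T + 1) / 2$ entries.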

3. Simulation data analysis result

3.1. Settings

In this section, we examine the performance of PTNS using simulation data generated with the Japan Railway (JR) network in Tokyo. Figure 2 shows the train station network. The black lines indicate train lines, and the gray circles indicate 800-m balls centered at each station.

Figure 2. JR network in Tokyo: black line is JR railway line, and gray circle is 800-m ball centered at each station.

The JR network in Tokyo includes 142 train stations and 22 lines. In each simulation, we generate one true space-time cluster, which is a set of stations over a time window, and check how well our algorithm and the conventional algorithms, including the Circular- and Flex-scan approaches, identify the true cluster. When constructing potential clusters with PTNS, Circular-scan, and Flex-scan, we need to decide which distance measure to use. In this study, we use the geographical Euclidean distance. As mentioned in the previous section, other distances, such as passenger volume and travel time between two stations, can also be used. We denote the $i$th station at time $t$ by $s_{i, t}$ $(i = 1, \dots, 142, t = 1, \dots, T)$.

The procedure to prepare for the true clusters is as follows:

Select $n$ as the size of a true cluster $H$
Select a line randomly
Select $n$ sequential stations on the line randomly and set these $n$ stations as the true cluster $H$
Select the number of time points $T$ from $\{1, 2, 3, 4, 5\}$ randomly
Select $t_{s}$ as the start time of the true cluster from $\{1, \dots, T\}$ randomly
Select $t_{e}$ as the end time of the true cluster from $\{t_{s}, \dots, T\}$ randomly
Extend $H$ to $H_{t_{s} : t_{e}}$, which represents a true space-time cluster
For each station at time $t$ with $s_{i, t} \in H_{t_{s} : t_{e}}$, generate the number of patients $c_{i, t} \sim Po(200)$
For each station at time $t$ with $s_{i, t} \notin H_{t_{s} : t_{e}}$, generate the number of patients $c_{i, t} \sim Po(100)$

Here, $Po (a)$ is the Poisson distribution with mean parameter $a$. We assume that there is only one true cluster in the JR network, and its size varies from 5 to 20 ($n = 5, \dots, 20$). We perform 100 Monte Carlo simulations for each $n$.
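The data-generating procedure above can be sketched as follows. This is an illustrative sketch on a toy network, not the JR network used in the study. The helper names, the toy lines, and the restriction to lines with at least $n$ stations are assumptions. The Poisson sampler uses Knuth's multiplication method only to keep the example free of external dependencies.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_cluster(lines, n, rng, max_t=5, mean_in=200, mean_out=100):
    """Generate one true space-time cluster and the Poisson counts.

    lines : list of station lists, each ordered along one line
    n     : size of the true cluster H
    """
    eligible = [ln for ln in lines if len(ln) >= n]  # ASSUMPTION: long enough
    line = rng.choice(eligible)
    first = rng.randrange(len(line) - n + 1)
    true_cluster = set(line[first:first + n])  # n sequential stations
    n_times = rng.randint(1, max_t)            # T from {1, ..., 5}
    t_s = rng.randint(1, n_times)              # start time
    t_e = rng.randint(t_s, n_times)            # end time, t_s <= t_e
    stations = sorted({s for ln in lines for s in ln})
    counts = {}
    for s in stations:
        for t in range(1, n_times + 1):
            inside = s in true_cluster and t_s <= t <= t_e
            counts[(s, t)] = sample_poisson(mean_in if inside else mean_out, rng)
    return true_cluster, (t_s, t_e), n_times, counts


# Toy network with two lines sharing station "C" (hypothetical data).
toy_lines = [["A", "B", "C", "D", "E", "F"], ["G", "C", "H", "I", "J"]]
rng = random.Random(0)
H, window, T, counts = simulate_cluster(toy_lines, n=5, rng=rng)
```

Repeating the call 100 times for each $n$ would mirror the Monte Carlo design described above.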

3.2. Evaluation metrics

For the MLCs detected by each method in each setting, we calculate the accuracy, sensitivity, and positive predicted value (PPV) to compare the performance. The evaluation metrics are defined as follows.

Let $H$ and $M$ represent the set of stations in one true cluster and in one MLC, respectively.

We define accuracy as $Accuracy = \begin{cases} 1 & (H = M) \\ 0 & (\text{otherwise}), \end{cases}$ which measures the exact detection power of each algorithm. An algorithm with high accuracy detects the true cluster exactly more often than an algorithm with low accuracy.

Sensitivity is defined as the proportion of stations detected correctly among the stations in the true cluster: $Sensitivity = | M \cap H | / | H | .$ PPV is defined as $PPV = | M \cap H | / | M | .$ PPV is the proportion of stations detected correctly among the stations in the MLC. These metrics are often used in the studies of scan statistics [Citation42, Citation43].
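These three metrics translate directly into set operations. The sketch below is illustrative; the function name and the handling of an empty MLC (PPV set to 0) are assumptions.

```python
def evaluate(true_cluster, detected):
    """Accuracy, sensitivity, and PPV for one true cluster H and one MLC M."""
    h, m = set(true_cluster), set(detected)
    if not h:
        raise ValueError("the true cluster must not be empty")
    overlap = len(h & m)  # |M ∩ H|
    return {
        "accuracy": 1 if h == m else 0,
        "sensitivity": overlap / len(h),
        "ppv": overlap / len(m) if m else 0.0,  # ASSUMPTION: 0 for empty MLC
    }


# Toy example: the MLC recovers two of four true stations plus one extra.
metrics = evaluate({"S1", "S2", "S3", "S4"}, {"S3", "S4", "S5"})
```

Here the sensitivity is 2/4 and the PPV is 2/3, while the accuracy is 0 because the two sets are not identical.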

3.3. Computational efficiency of PTNS

The number of potential clusters created by each algorithm is shown in Table 1. For each $K$, each algorithm provides a set of potential clusters. Note that as the number of potential clusters increases, the computational time becomes longer.

Table 1. Number of potential clusters created in each algorithm.

The number of potential clusters generated by the proposed method is about 1.35 and 285.87 times smaller than those of the Circular- and Flex-scan algorithms, respectively, resulting in a shorter computational time. In our experimental environment (CPU: Intel Xeon (R) Gold 6242, 2.8 GHz; memory: 384 GB), creating the potential clusters takes 0.91, 0.10, and 31.55 seconds for the proposed, Circular-, and Flex-scan methods, respectively, in the case of $K = 20$. As stated in Proposition 2.2, the number of potential clusters generated by the proposed and Circular-scan methods increases linearly. On the other hand, the number of potential clusters generated by the Flex-scan method increases exponentially.

Note that, in addition to the creation of potential clusters, $D (Z_{j})$ must be calculated for every potential cluster, which requires as many iterations as the creation of the potential clusters. This means that the total computational time of the whole procedure, including the creation of potential clusters in Algorithm 1 or 2 and the calculation of $D (Z_{j})$, is expected to scale with the number of potential clusters, which is about 1.35 times larger in Circular-scan and 285.87 times larger in Flex-scan than in PTNS.

This result implies that the proposed method can provide the estimated high-risk area without spending much time, with a computational burden similar to that of Circular-scan.

3.4. Performance results

Figure 3 shows the results of 100 simulations for each true cluster size $n$ $(n = 5, \dots, 20)$. In terms of accuracy, when $n$ is small, Flex-scan is superior to the proposed method and Circular-scan (n = 5: Proposed = 0.47, Flex-scan = 0.84, Circular = 0.17). In contrast, as $n$ increases, the proposed method outperforms Flex-scan, and the accuracy of Flex-scan converges to almost 0, the same as that of Circular-scan ($n = 20$: Proposed = 0.09, Flex-scan = 0.00, Circular = 0.00). A similar trend is observed for sensitivity: as $n$ increases, the proposed method outperforms the conventional methods (n = 5: Proposed = 0.85 (0.20), Flex-scan = 0.96 (0.09), and Circular = 0.79 (0.22); n = 20: Proposed = 0.81 (0.13), Flex-scan = 0.63 (0.10), and Circular = 0.57 (0.14)). In terms of PPV, when $n$ is small, Flex-scan performs slightly better than the proposed method and Circular-scan (n = 5: Proposed = 0.91 (0.19), Flex-scan = 0.99 (0.04), and Circular = 0.84 (0.20)). However, the PPV of the proposed method is nearly as high as that of Flex-scan, especially when $n > 10$ (n = 20: Proposed = 0.98 (0.03), Flex-scan = 1.00 (0.00), and Circular = 0.81 (0.16)).

Figure 3. Simulation results: Accuracy, Sensitivity, Positive predicted value (PPV) and calculation time (second) of proposed (PTNS), Circular-, and Flex-scan approaches.

These results imply that, when the true cluster is large, the proposed method PTNS can work better than the conventional methods, especially in terms of accuracy and sensitivity.

4. Conclusion and discussion

In this study, we proposed a new scan statistics method, PTNS, which uses the information of the public transportation network to construct potential clusters that stretch along the network structure. PTNS is tailored for the rapid surveillance of emerging infectious diseases such as COVID-19. That is, the goal of this study is to propose a fast yet accurate algorithm for detecting high-risk areas of infectious diseases in emergency situations where the computational time needed to provide results is critical, and our method is expected to be used in settings where high-risk areas must be rapidly identified and shut down immediately. Given this goal, we showed that PTNS outperforms the conventional methods, including Circular-scan (fast but not accurate) and Flex-scan (slow but accurate), using simulation data generated with a real railway network: PTNS succeeds in identifying true high-risk clusters with better performance than the other methods while the computational burden remains in a preferable range. In particular, the results show that PTNS is superior to the Circular- and Flex-scan approaches in terms of the accuracy and sensitivity of detecting true clusters when the true cluster is large.

A noteworthy limitation of PTNS is that it depends heavily on the available network structure. For example, the iterations in PTNS will stop if the node is on a loop. Another limitation is that PTNS does not contain a mechanism to return to the original node, and it tends to grow only in one direction along the network; thus, it may not work well in networks with complex branching. In the fields of change detection in network structure and surveillance, there are various studies related to the identification of hotspots. These studies include applications such as social network detection [Citation44], financial market analysis [Citation45], network traffic monitoring [Citation46], and detection of natural disasters [Citation47]. These studies try to detect changes in summary information on a network by using field-specific information. When considering epidemics of infectious diseases, human mobility is a crucial factor for understanding disease transmission [Citation48, Citation49]. That is, utilizing the information on passenger volume or travel time between two stations mentioned in Section 2 is important. In addition, PTNS utilizes only a simple distance metric between nodes. However, the network structure contains additional information that could be used as $d (\cdot, \cdot)$ in Algorithm 1 instead of the simple distance metric [Citation50]: e.g. cumulative edge weight, centrality, hierarchical structure, and similarity between sub-networks. For example, the shortest path between two nodes might work well here because humans, and the associated infection, are assumed to move to their destination node along the shortest path to save travel time [Citation51]. In addition, a public transportation system in the real world may be a mixture of several networks such as buses, trains, and airplanes. To model such a mixture network, one idea is to consider the union of the available networks (i.e. $G_{1} \cup G_{2} \cup \dots \cup G_{M} = G_{union}$, where each $G_{m}$ $(m = 1, \dots, M)$ is one public transportation network structure) and define an appropriate $d (\cdot, \cdot)$ on $G_{union}$.

Disclosure statement

Mr Kentaro Matsuura reports personal fees from Chugai Pharmaceutical Co., Ltd., outside the submitted work.

Data availability statement

In this study, we do not use real data.

Additional information

Funding

This work was partially supported by KAKENHI Grant-in-Aid for Young Scientists (21K17292), KAKENHI Grant-in-Aid for Research Activity Start-up (19K24340), Daiwa Securities Health Foundation, a grant from the Ministry of Education, Culture, Sports, Science and Technology of Japan (21H03203), and Precursory Research for Embryonic Science and Technology from the Japan Science and Technology Agency (JPMJPR21RC).

Notes on contributors

Yuta Tanoue

Dr Yuta Tanoue is an Assistant Professor in the Institute for Business and Finance, Waseda University, Tokyo, Japan. His research interests include data science for finance and management.

Daisuke Yoneoka

Dr Daisuke Yoneoka is Chief of the Epidemiology and Statistics unit at the Infectious Disease Surveillance Center, National Institute of Infectious Diseases. His research interests include statistics and its applications in related fields.

Takayuki Kawashima

Dr Takayuki Kawashima is an Assistant Professor in the Department of Mathematical and Computing Science at Tokyo Institute of Technology, Tokyo, Japan. His research interests include mathematical statistics and its applications in related fields.

Shinya Uryu

Mr Shinya Uryu is an Engineer at the National Institute for Environmental Studies (NIES). He is passionate about data science, management and statistics. His favorite open-source language is R.

Shuhei Nomura

Dr Shuhei Nomura is an Associate Professor in the Department of Health Policy and Management, School of Medicine, Keio University, Tokyo, Japan, and an Assistant Professor in the Department of Global Health Policy, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. His major research interests include global burden of disease, global health policy, biostatistics, and epidemiology.

Akifumi Eguchi

Dr Akifumi Eguchi is an Assistant Professor in the Department of Sustainable Health Science, Center for Preventive Medical Sciences, Chiba University, Chiba, Japan. His research interests include statistics and its applications in related fields, as well as environmental impact assessment.

Koji Makiyama

Mr Koji Makiyama is an Engineer at HOXO-M Inc., Tokyo, Japan. His research interests include data science and biostatistics.

Kentaro Matsuura

Mr Kentaro Matsuura is an Engineer at Chugai Pharmaceutical Co., Ltd., Tokyo, Japan. He is also a student in the Department of Management Science, Graduate School of Engineering, Tokyo University of Science, Tokyo, Japan. His research interests include data science and biostatistics.

References

Kulldorff M, Feuer EJ, Miller BA, et al. Breast cancer clusters in the northeast United States: a geographic analysis. Am J Epidemiol. 1997;146(2):161–170.
Fukuda Y, Umezaki M, Nakamura K, et al. Variations in societal characteristics of spatial disease clusters: examples of colon, lung and breast cancer in Japan. Int J Health Geogr. 2005;4(1):16.
Malleson N, Andresen MA. Spatio-temporal crime hotspots and the ambient population. Crime Sci. 2015;4(1):1–8.
Dahly D, Gilthorpe M. P1-16 a latent class analysis of socioeconomic status and obesity in young adults from Cebu, Philippines. J Epidemiol Community Health. 2011;65(Suppl 1):A71–A71.
Tuia D, Ratle F, Lasaponara R, et al. Scan statistics analysis of forest fire clusters. Commun Nonlinear Sci Numer Simul. 2008;13(8):1689–1694.
Yoshida M, Naya Y, Miyashita Y. Anatomical organization of forward fiber projections from area TE to perirhinal neurons representing visual long-term memory in monkeys. Proc Natl Acad Sci. 2003;100(7):4257–4262.
Enemark HL, Ahrens P, Juel CD, et al. Molecular characterization of Danish Cryptosporidium parvum isolates. Parasitology. 2002;125(4):331.
Coulston JW, Riitters KH. Geographic analysis of forest health indicators using spatial scan statistics. Environ Manage. 2003;31(6):764–773.
de La Fuente Marcos R, de La Fuente Marcos C. From star complexes to the field: open cluster families. Astrophys J. 2008;672(1):342–351.
Kulldorff M. A spatial scan statistic. Commun Stat-Theor Meth. 1997;26(6):1481–1496.
Neill DB, Cooper GF. A multivariate Bayesian scan statistic for early event detection and characterization. Mach Learn. 2010;79(3):261–282.
Shiode S. Street-level spatial scan statistic and STAC for analysing street crime concentrations. Trans GIS. 2011;15(3):365–383.
Shiode S, Shiode N. A network-based scan statistic for detecting the exact location and extent of hotspots along urban streets. Comput Environ Urban Syst. 2020;83:101500.
Neill D, Moore A, Cooper G. A Bayesian spatial scan statistic. Adv Neural Inf Process Syst. 2005;18:1003–1010.
Gangnon RE, Clayton MK. A weighted average likelihood ratio test for spatial clustering of disease. Stat Med. 2001;20(19):2977–2987.
Naus JL. Clustering of random points in two dimensions. Biometrika. 1965;52(1–2):263–266.
Loader CR. Large-deviation approximations to the distribution of scan statistics. Adv Appl Probab. 1991;23(4):751–771.
Openshaw S, Charlton M, Wymer C, et al. A mark 1 geographical analysis machine for the automated analysis of point data sets. Int J Geogr Inf Syst. 1987;1(4):335–358.
Besag J, Newell J. The detection of clusters in rare diseases. J R Stat Soc: Ser A (Stat Soc). 1991;154(1):143–155.
Christiansen LE, Andersen JS, Wegener HC, et al. Spatial scan statistics using elliptic windows. J Agric Biol Environ Stat. 2006;11(4):411–424.
Kulldorff M, Huang L, Pickle L, et al. An elliptic spatial scan statistic. Stat Med. 2006;25(22):3929–3943.
Azage M, Kumie A, Worku A, et al. Childhood diarrhea exhibits spatiotemporal variation in northwest Ethiopia: a SaTScan spatial statistical analysis. PLoS ONE. 2015;10(12):e0144690.
Coleman M, Coleman M, Mabuza AM, et al. Using the SaTScan method to detect local malaria clusters for guiding malaria control programmes. Malar J. 2009;8(1):68.
Sherman RL, Henry KA, Tannenbaum SL, et al. Peer reviewed: applying spatial analysis tools in public health: an example using SaTScan to detect geographic targets for colorectal cancer screening interventions. Prev Chronic Dis. 2014;11:
Tango T, Takahashi K. A flexibly shaped spatial scan statistic for detecting clusters. Int J Health Geogr. 2005;4(1):421.
Patil GP, Taillie C. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environ Ecol Stat. 2004;11(2):183–197.
Duczmal L, Assuncao R. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Comput Stat Data Anal. 2004;45(2):269–286.
Shiode S, Shiode N, Block R, et al. Space-time characteristics of micro-scale crime occurrences: an application of a network-based space-time search window technique for crime incidents in Chicago. Int J Geogr Inf Sci. 2015;29(5):697–719.
Quick M, Law J. Exploring hotspots of drug offences in Toronto: a comparison of four local spatial cluster detection methods. Can J Criminol Crim Justice. 2013;55(2):215–238.
Torabi M, Rosychuk RJ. An examination of five spatial disease clustering methodologies for the identification of childhood cancer clusters in Alberta, Canada. Spat Spatiotemporal Epidemiol. 2011;2(4):321–330.
Tango T, Takahashi K. A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters. Stat Med. 2012;31(30):4207–4218.
Rashidi P, Wang T, Skidmore A, et al. Spatial and spatiotemporal clustering methods for detecting elephant poaching hotspots. Ecol Modell. 2015;297:180–186.
Cadena J, Chen F, Vullikanti A. Near-optimal and practical algorithms for graph scan statistics with connectivity constraints. ACM Trans Knowl Discov Data (TKDD). 2019;13(2):1–33.
Speakman S, McFowland III E, Neill DB. Scalable detection of anomalous patterns with connectivity constraints. J Comput Graph Stat. 2015;24(4):1014–1033.
Ishioka F, Kurihara K, Suito H, et al. Detection of hotspots for three-dimensional spatial data and its application to environmental pollution data. J Environ Sci Sustain Soc. 2007;1:15–24.
Mennis J, Guo D. Spatial data mining and geographic knowledge discovery – an introduction. Comput Environ Urban Syst. 2009;33(6):403–408.
Neill DB, Moore AW, Sabhnani M, et al. Detection of emerging space-time clusters. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining; 2005. p. 218–227.
Rao H, Shi X, Zhang X. Using the Kulldorff's scan statistical analysis to detect spatio-temporal clusters of tuberculosis in Qinghai province, China, 2009–2016. BMC Infect Dis. 2017;17(1):1–11.
Butt UM, Letchmunan S, Hassan FH, et al. Spatio-temporal crime hotspot detection and prediction: a systematic literature review. IEEE Access. 2020;8:166553–166574.
Tang J-H, Tseng T-J, Chan T-C. Detecting spatio-temporal hotspots of scarlet fever in Taiwan with spatio-temporal Gi* statistic. PLoS ONE. 2019;14(4):e0215434.
Agarwal S, Yadav L, Thakur MK. Circular and cylindrical hotspots detection for spatial and spatio-temporal data. 2019 Twelfth International Conference on Contemporary Computing (IC3); IEEE; 2019. p. 1–5.
Jung I, Kulldorff M, Klassen AC. A spatial scan statistic for ordinal data. Stat Med. 2007;26(7):1594–1607.
Jung I, Ali M. Spatial scan statistics for matched case-control data. PLoS ONE. 2019;14(8):e0221225.
Yu R, He X, Liu Y. GLAD: group anomaly detection in social media analysis. ACM Trans Knowl Discov Data (TKDD). 2015;10(2):1–22.
Durante D, Dunson DB. Bayesian dynamic financial networks with time-varying predictors. Stat Probab Lett. 2014;93:19–26.
Sun J, Tao D, Faloutsos C. Beyond streams and graphs: dynamic tensor analysis. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining; 2006. p. 374–383.
Cheng H, Tan P-N, Potter C, et al. A robust graph-based algorithm for detection and characterization of anomalies in noisy multivariate time series. 2008 IEEE International Conference on Data Mining Workshops; IEEE; 2008. p. 349–358.
Nomura S, Tanoue Y, Yoneoka D, et al. Mobility patterns in different age groups in Japan during the COVID-19 pandemic: a small area time series analysis through March 2021. J Urban Health. 2021;98(5):635–641. https://doi.org/10.1007/s11524-021-00566-7.
Nomura S, Yoneoka D, Tanoue Y, et al. Time to reconsider diverse ways of working in Japan to promote social distancing measures against the COVID-19. J Urban Health. 2020;97(4):457–460.
Brandes U. Network analysis: methodological foundations. Vol. 3418, New York: Springer Science & Business Media; 2005.
Kolaczyk ED, Csárdi G. Statistical analysis of network data with R. Vol. 65, New York: Springer; 2014.
