![MathJax Logo](/templates/jsp/_style2/_tandf/pb2/images/math-jax.gif)
ABSTRACT
Spectral clustering is a popular method for community detection in network graphs: starting from a matrix representation of the graph, the nodes are clustered on a low-dimensional projection obtained from a truncated spectral decomposition of the matrix. Estimating correctly the number of communities and the dimension of the reduced latent space is critical for good performance of spectral clustering algorithms. Furthermore, many real-world graphs, such as enterprise computer networks studied in cyber-security applications, often display heterogeneous within-community degree distributions. Such heterogeneous degree distributions are usually not well captured by standard spectral clustering algorithms. In this article, a novel spectral clustering algorithm is proposed for community detection under the degree-corrected stochastic blockmodel. The proposed method is based on a transformation of the spectral embedding to spherical coordinates, and a novel modeling assumption in the transformed space. The method allows for simultaneous and automated selection of the number of communities and the latent dimension for spectral embeddings of graphs with uneven node degrees. Results show improved performance over competing methods in representing computer networks.
1 Introduction
Network data are commonly observed in a variety of scientific fields, representing, for example, interactions between neurons in the brain in biology, or connections between computers in communication technologies. A fundamental problem in the statistical analysis of networks is the task of finding groups of similar nodes, known as community detection. Spectral clustering methods (Ng, Jordan, and Weiss Citation2001; von Luxburg Citation2007) provide one of the most popular approaches for the community detection task. Such techniques essentially consist of two steps: (i) spectrally embedding the graph adjacency matrix, or some transformation thereof, into a low-dimensional space, and (ii) apply a clustering algorithm, usually Gaussian mixture modeling (GMM) or k-means, in the low-dimensional space.
Spectral clustering algorithms can be used to obtain estimates of the community structure under a variety of classical network models. The traditional model for community detection is the stochastic blockmodel (SBM, Holland, Laskey, and Leinhardt Citation1983), each node in the network is assigned to one of K communities, and the probability of a connection between two nodes only depends on their community memberships. Asymptotic theory suggests that embeddings arising from SBMs can be modeled using Gaussian mixture models (GMMs) (Rubin-Delanchy et al. Citation2017). This article mainly concerns the degree-corrected stochastic blockmodel (DCSBM, Karrer and Newman Citation2011), which extends the SBM, allowing for heterogeneity in the within-community degree-distribution. In DCSBMs, the probability of a connection depends on the community memberships, but is adjusted by node-specific degree-correction parameters. However, unlike the SBM, spectral embeddings under the DCSBM do not adhere to a GMM, since the communities are represented by rays.
In principle, DCSBMs appear to be particularly suitable for modeling graphs arising from cyber-security applications, and in particular computer network flow data representing summaries of connections between Internet Protocol (IP) addresses, since machines within the same organization tend to have different levels of activity depending on their purpose. Furthermore, in computer networks, the need for degree-correction seems most obvious when nodes are observed for different amounts of time, so that their connection probabilities scale with their “total time on test.” For example, if a new node enters the network, it would be beneficial to identify its community, despite having very few connections.
The suitability of DCSBMs for community detection in cyber-security, our application of interest, is demonstrated in , where the within-community out-degree distributions arising from two simulated bipartite SBM () and DCSBM (), are compared to the out-degree distribution of a real computer network (). A detailed description of the simulation is given in Section 6.2. The shape of the degree-distribution of the computer network resembles the simulated DCSBM much more closely than the SBM, suggesting that a degree-correction is required for correctly estimating the communities.
Fig. 1 Histogram of within-community degree distributions from three bipartite networks with size , obtained from (a) a simulation of a SBM, (b) a simulation of a DCSBM, and (c) a real-world computer network (ICL2, see Section 6.2).
![Fig. 1 Histogram of within-community degree distributions from three bipartite networks with size 439×60,635, obtained from (a) a simulation of a SBM, (b) a simulation of a DCSBM, and (c) a real-world computer network (ICL2, see Section 6.2).](/cms/asset/bdaf66fb-d14e-4ea8-b187-8d2301d66cac/utch_a_2008503_f0001_c.jpg)
This article makes two main contributions. First, a novel spectral clustering algorithm for community detection under the DCSBM is proposed, based on a transformation of the embedding. Second, the proposed methodology is incorporated within a model selection framework for d, the embedding dimension, and K, the number of communities, providing a joint estimation method for those two parameters and the communities. The proposed method is shown to be competitive on simulated and real-world computer network data.
The article is structured as follows: Section 2 describes the research question and related literature, followed by a preliminary discussion of spectral embedding techniques. Our proposed model is presented in Section 3. Parameter estimation and model selection are then discussed in Section 4. The proposed model is validated in Section 5, and results on simulated and real world computer network data are presented in Section 6.
2 Background and Motivation
A network can be expressed as a graph , consisting of a set of nodes V of cardinality n, and a set of edges
representing the pairs of nodes which have interacted. The graph is summarized by its adjacency matrix
, where
for
, with
. If
, the graph is undirected, implying that A is symmetric; otherwise the graph is directed.
The degree-corrected stochastic blockmodel (DCSBM, Karrer and Newman Citation2011) is a popular model for community detection in graphs. For K communities, the nodes are divided into blocks by random assignment of community membership indicators , with community probabilities
. Furthermore, each node is assigned a degree-correction parameter
. Each entry of the adjacency matrix is then independently modeled as
(1)
(1) where
is a K × K matrix of probabilities such that
is a baseline probability for a node from community k interacting with a node from community
.
Beyond DCSBMs, random dot product graphs (RDPG, Young and Scheinerman Citation2007) represent a wider and more flexible class of models for network data. Each node is assigned a d-dimensional latent position such that
for
. The probability of a link between i and j is then determined as
The latent positions can be arranged in a matrix
such that
. For a positive definite block connectivity probability matrix B, DCSBMs can be expressed as RDPGs. If each community is assigned an uncorrected position
, such that
, the DCSBM is obtained by setting
, conditional on the communities
and degree-correction parameters
.
This article is primarily concerned with a novel technique for estimating the underlying node communities given an adjacency matrix, under a RDPG interpretation of the DCSBM. A joint estimation method is proposed for the community structure , the number of communities K, and the latent dimension d of the latent positions.
2.1 Related Literature, Shortcomings, and Proposed Solutions
Community detection based on DCSBMs is an active field of research. Zhao, Levina, and Zhu (Citation2012) presented a theory for assessing consistency under the DCSBM. Amini et al. (Citation2013) used a pseudo-likelihood approach, providing consistency results for the estimators. Peng and Carvalho (Citation2016) frame the DCSBM in a Bayesian setting, using a logistic regression formulation with node correction terms. Chen, Li, and Xu (Citation2018) proposed a convexified modularity maximization approach. Gao et al. (Citation2018) obtained a minimax risk result for community detection in DCSBMs and propose a two-step clustering algorithm based on k-medians.
Spectral clustering methods have emerged as one of the most popular approaches for community detection under the DCSBM (Lei and Rinaldo Citation2015; Gulikers, Lelarge, and Massoulié Citation2017). A common technique uses k-means on the normalized rows of the embedding (Qin and Rohe Citation2013) obtained from the spectral decomposition of the regularized Laplacian matrix (Chaudhuri, Chung, and Tsiatas Citation2012). The row-normalization of the embedding is a well-established approach for spectral clustering (Ng, Jordan, and Weiss Citation2001) under the DCSBM, but the normalized rows live in a d – 1 dimensional manifold, so it is not fully appropriate to fit a model for d-dimensional clusters to such an embedding. Alternative methods include the SCORE algorithm of Jin (Citation2015), which proposes to use k-means on an embedding scaled by the leading eigenvector, showing that the effect of degree heterogeneity can be largely removed.
This article proposes a novel methodology for spectral clustering under the DCSBM, interpreted as a special case of RDPG: the d-dimensional spectral embedding is reduced to a set of d – 1 directions, or angles, changing from a Cartesian coordinate system to a spherical system. This choice of transformation is carefully motivated by asymptotic theoretical properties of the embeddings arising from DCSBMs.
Additionally, many estimation methods commonly require the number of communities K to be known. Furthermore, spectral clustering methods require the specification of the embedding dimension d, and clustering is usually carried out after selecting this parameter. This sequential approach is suboptimal, since the clustering configuration and the number of communities K would ideally be estimated jointly with d. In practice, selecting d and K is a difficult task. Sanna Passino and Heard (Citation2020) and Yang et al. (Citation2020) independently proposed an automatic model selection framework for both the number of communities K and the dimension d of the latent node positions in SBMs, interpreted as RDPGs. In this work, the methodology is extended to DCSBMs, providing an algorithm for practitioners.
2.2 Spectral Embedding
Given a network adjacency matrix, spectral embedding methods provide estimates of the latent positions X in RDPGs, from decompositions of the adjacency matrix or its Laplacian. This article will mainly discuss the adjacency spectral embedding, defined below.
Definition 1
(Adjacency spectral embedding). Consider a symmetric adjacency matrix and a positive integer
. The adjacency spectral embedding (ASE) of A in
is
, where
is a diagonal d × d matrix containing on the main diagonal the absolute value of the top-d eigenvalues of A in magnitude, in decreasing order, and
is a n × d matrix containing corresponding orthonormal eigenvectors.
For a directed graph, the RDPG model assumes each node has two latent positions
, such that
. The corresponding spectral embedding uses singular value decomposition. Bipartite graphs can be interpreted as a special case of directed graphs, and therefore are also spectrally embedded using the same procedure.
Definition 2
(Directed adjacency spectral embedding). Consider an adjacency matrix , not necessarily symmetric, and a positive integer
. The directed adjacency spectral embedding (DASE) of A in
is jointly given by
and
, where
is a diagonal d × d matrix containing on the main diagonal the top-d singular values of A in magnitude, in decreasing order, and
and
are n × d matrices containing corresponding orthonormal left and right singular vectors respectively.
2.3 Asymptotic Properties of Spectral Embedding of DCSBMs
An important result in the RDPG literature establishes the rows of the ASE as consistent estimators of the latent positions (Sussman, Tang, and Priebe Citation2014). Furthermore, central limit theorems (ASE-CLTs; Athreya et al. Citation2016; Rubin-Delanchy et al. Citation2017; Tang and Priebe Citation2018) provide strong justification for estimation of the latent positions via ASE.
When ASE is applied to DCSBMs, the asymptotic theory (see, e.g., Rubin-Delanchy et al. Citation2017) predicts that each community is represented as a ray from the origin in the embedding space. An example is given in , which shows the two-dimensional ASE for a simulated DCSBM with n = 1000 nodes and K = 4 communities.
Fig. 2 Scatterplots of the two-dimensional ASE of a simulated DCSBM with (a) K = 4, and (b) K = 2. also highlights the true and estimated latent position for six nodes, with the corresponding 50%, 75%, and 90% contours from the ASE-CLT, and the estimated latent positions for
from simulated DCSBM adjacency matrices
.
![Fig. 2 Scatterplots of the two-dimensional ASE of a simulated DCSBM with (a) K = 4, and (b) K = 2. Figure 2(b) also highlights the true and estimated latent position for six nodes, with the corresponding 50%, 75%, and 90% contours from the ASE-CLT, and the estimated latent positions x̂1,l for x1 from simulated DCSBM adjacency matrices Al, l=1,…,1000.](/cms/asset/2bd232e8-5ae9-46f5-85fb-58535d0d0ff9/utch_a_2008503_f0002_c.jpg)
To describe each community, let be the underlying latent position for a node in community k. Further, let
be the ASE estimator of
, obtained from a graph with n nodes. For d fixed and known, the ASE-CLT (e.g., Athreya et al. Citation2016), applied to DCSBMs, establishes that
(2)
(2) where
,
is an orthogonal matrix,
is a d × d community-specific covariance matrix depending on ρ, and
is the CDF of a d-dimensional Gaussian distribution with zero mean and covariance
. Jones and Rubin-Delanchy (Citation2020) extended the result (2) to the DASE (see Definition 2). In simpler terms, (2) implies that, for n large, the estimated latent position
is normally distributed about
, after a suitable orthogonal transformation has been applied to the embedding (accounting both for the ambiguity in the choice of eigenvectors or singular vectors within the spectral embedding procedure, and the latent position identifiability in the RDPG). The theorem is exemplified by , which displays the two-dimensional ASE of a simulated DCSBM with n = 1000 and K = 2 equally probable communities, with
,
, and
. For 6 nodes, the plot highlights the true and estimated latent positions, and corresponding theoretical Gaussian contours from the ASE-CLT. Additionally, the simulation of the adjacency matrix is repeated 1000 times using the same true underlying latent positions as the first simulation. Then, the ASE-estimated latent position for the first node is plotted for each simulation, along with the Gaussian contours estimated from the 1000 estimates. The empirical contours (dashed lines) remarkably correspond to the theoretical ASE-CLT contours (solid lines) around the true latent position
. Also, shows that, within the same community, the true latent positions all have the same spherical coordinates, or angle to the origin, whereas their corresponding ASE estimates are distributed around the line of the true latent positions, forming a community-specific ray. The ASE-CLT (2) also establishes that estimated latent positions tend to asymptotically concentrate increasingly tightly around the rays connecting the origin and the unnormalized latent positions
. Therefore, intuitively motivates the novel modeling choice proposed in this article: estimating the node communities from the spherical coordinates, or angles, obtained from the ASE. The use of alternative coordinate systems for network analysis has been previously shown to have beneficial properties (see, e.g., Krioukov et al. Citation2010; Braun and Bonfrer Citation2011; McCormick and Zheng Citation2015; Alanis-Lobato, Mier, and Andrade-Navarro Citation2016). Furthermore, a central limit theorem for the spherical coordinates of the latent positions is proved for d = 2 in the supplementary material, further establishing the suitable properties of such a transformation of the embedding.
One of the main characteristics of the proposed methodology will be to allow for an initial misspecification of the parameter d, choosing an m-dimensional embedding with , and then recovering the correct latent dimension by proposing a discriminative model for the extended embedding. In the remainder of the article, the notation
denotes the first d columns of
, and
denotes the m – d remaining columns. Similarly,
represents the first d components
of the vector
, and
the last m – d components
. Also, the row-normalized embedding is denoted as
, where
. Importantly, the parameter m is always assumed to be fixed.
3 Modeling a Transformation of DCSBM Embeddings
Consider an m-dimensional vector . The m Cartesian coordinates
can be converted in m – 1 spherical coordinates
on the unit m-sphere using a mapping
such that
, where:
(3)
(3)
(4)
(4) where
is the Euclidean norm.
Consider an -dimensional adjacency embedding
and define its transformation
, where
. The symbols
and
will denote respectively the first d columns of the matrix and d elements of the vector, and
and
will represent the remaining m – d components.
In this article, a model is proposed for the transformed embeddings . Suppose a latent space dimension d, K communities, and latent community assignments
. The transformed coordinates
are assumed to be generated independently from community-specific m-dimensional multivariate normal distributions:
(5)
(5) where
,
, represents a community-specific mean angle,
is a m-dimensional vector of ones,
is a d × d full covariance matrix, and
is a vector of positive variances. The model in (5) could be also completed in a Bayesian framework using the same prior distributions chosen in Sanna Passino and Heard (Citation2020). For fixed d and K, consider a set of mixing proportions
such that
, where
and
. After marginalizing out
, the likelihood function is
(6)
(6)
where
is the density of a q-dimensional normal distribution with mean
and variance
. Note that, normally, a wrapped normal distribution would be preferred for circular data. On the other hand, in the context of DCSBMs, it is known that
arises from a transformation of the embedding
, and the form of the transformation can be used to inform the modeling decisions. The arccosine function is monotonically decreasing, and communities will tend to have similar values in
for
, see (4). If two points in
reach the extremes 0 and
, then they are unlikely to belong to the same community. Therefore, wrapped distributions do not apply to this context.
The only case that could cause concern is , where the transformation (3) is not monotonic, but has a discontinuity at 0. The first column
of the ASE corresponds to the scaled leading eigenvector of the adjacency matrix, and therefore its elements have all the same sign by the Perron–Frobenius theorem for non-negative matrices (see, e.g., Meyer Citation2000). The theorem makes the transformation
in (3) monotonic, since one of the two conditions in the equation for θ1 is satisfied by all the values in
.
The rationale behind the model assumptions in (5) is to use the method of normalization to the unit circle (Qin and Rohe Citation2013) but assume normality on the spherical coordinates, not on their Cartesian counterparts, as discussed in the comments to . The transformed initial components are assumed to have unconstrained mean vector
, and a positive-definite community-specific d × d covariance matrix
.
In contrast to the structured model for , the remaining m – d dimension of the embedding are modeled as noise, using similar constraints on
to those imposed in Sanna Passino and Heard (Citation2020) and Yang et al. (Citation2020): the mean of the distribution is a
-dimensional vector centered at
, and the covariance is a diagonal matrix
with positive diagonal entries. The mean value of π reflects the assumption in Sanna Passino and Heard (Citation2020) and Yang et al. (Citation2020) of clusters centered at zero: for j > d,
is expected to be near 0, which makes the transformed coordinate center fluctuate around π. Importantly, the assumption of cluster-specific variances on
implies that d does not have the simple interpretation of being the number of dimensions relevant for clustering. This fundamentally differentiates the proposed modeling framework from traditional variable selection methods within clustering (see, e.g., Raftery and Dean Citation2006). The parameter d + 1 in this model represents the dimension of the latent positions that generate the network, or the rank of the block connectivity matrix B.
It must be further remarked that the model (5) concerns the distribution of the spherical coordinates of the ASE estimator of the DCSBM latent positions, and it is not a generative model for DCSBMs. If used as a generative model, paired with a distributional assumption on the degree-correction parameters, the model (5) would generate a noisy DCSBM, where the underlying latent positions are scattered around the rays, and not perfectly on the rays as a traditional DCSBM (see ). Considering the ASE-CLT (2), the estimation procedure proposed in this work would still be applicable also to such a noisy DCSBM.
Finally, note that the method relies on an initial choice of the embedding dimension . The parameter should be large enough to avoid potential issues with the case m < d. Choosing m is arguably easier than choosing d, and in principle one could pick m = n. As a rule of thumb, the parameter could be chosen as the third or fourth elbow based on the criterion of Zhu and Ghodsi (Citation2006). Note that, for
is invariant to the choice of m: embeddings
calculated from m-dimensional and
-dimensional embeddings, with
, are identical. This is not the case for the row-normalized embedding: embeddings
obtained from m-dimensional and
-dimensional embeddings are in general not equal.
4 Model Selection and Parameter Estimation
For selection of the number of communities K and latent dimension d, a classical approach of model comparison via information criteria is adopted, already used by Yang et al. (Citation2020) in the context of simultaneous model selection in SBMs. Suppose that the maximum likelihood estimate (MLE) of the model parameters for fixed d and K is . The Bayesian information criterion (BIC) is defined as follows:
(7)
(7)
The first term of the BIC (7) is the negative log-likelihood (6), whereas the second term is a penalty. The estimates of d and K will correspond to the pair that minimizes (7), obtained using grid search: . The latent dimension d has range
, whereas
. In practice, it is convenient to fix a maximum number of clusters
in the grid search procedure, such that
.
From (7), it follows that the MLE of the Gaussian model parameters is required for each pair (d, K). The expectation–maximization (EM, Dempster, Laird, and Rubin Citation1977) algorithm is typically used for problems involving likelihood maximization in model based clustering (e.g., Fraley and Raftery Citation2002). Finding the MLE for (6) only requires a simple modification of the standard algorithm for GMMs, adding constraints on the means and covariances of the last m – d components. Under the assumption that the model for the spherical coordinates is correctly specified, then the same framework described in Theorem 1 in Yang et al. (Citation2020) could be used to obtain theoretical guarantees on the estimates.
Given the maximum likelihood estimates for (d, K), the communities are estimated as(8)
(8) where a
-component GMM is fitted to the
-dimensional embedding, and the second Gaussian term in the likelihood (6), accounting for the last m – d components of the embedding, is removed to reduce the bias-variance tradeoff (Yang et al. Citation2020).
For a given pair (d, K), fast estimation and convergence can be achieved by initializing the EM algorithm with the MLE of a GMM fitted on the initial d components of the m-dimensional embedding. This approach for initialization will be followed in Section 6.2. The full procedure is summarized in Algorithm 1. It must be remarked that the grid search procedure for estimating (d, K) from (7) requires EM-algorithms. This is computationally intensive, especially if the number of nodes is large, or m and
are large. Therefore, the methodology for estimation of (d, K) is not scalable to very large graphs.
Algorithm 1:
Estimation of the latent dimension d, number of communities K, and community allocations , given a graph adjacency matrix A.
Input: adjacency matrix , maximum number of communities
.
Result: estimated community allocations and estimated latent dimensionality
.
1 select m as the third or fourth elbow of the scree plot (Zhu and Ghodsi Citation2006),
2 calculate the m-dimensional ASE (or DASE for directed graphs),
3 transform the embedding into spherical coordinates
,
4 for do
5 for do
6 calculate MLE of model (5) using the EM algorithm,
7 calculate using (7),
8 end
9 end
10 obtain the estimate for the pair (d, K),
11 fit a -component GMM to
, and estimate the communities
using (8).
5 Empirical Model Validation
In order to validate the modeling approach in (5), a simulation study has been carried out. Additional results on model validation are also reported in the supplementary material.
5.1 Gaussian Mixture Modeling of DCSBM Embeddings
First, a simple simulation is used to show that k-means or Gaussian mixtures are not appropriate for modeling an embedding under the DCSBM, even when the row-normalized embedding
is used. The ASE was obtained from a simulated DCSBM with
nodes, K = 2 and d = 2, with an equal number of nodes allocated to each group, and
, and
, corrected by parameters
. Since the community allocations
are known a priori in the simulation, it is possible to evaluate the community-specific distributions. The results are plotted in , which shows the scatterplot and histograms of the marginal distributions for the two-dimensional adjacency spectral embedding
and its normalized version
, and for its transformation to spherical coordinates
, all labelled by community. From , it is clear that applying k-means or Gaussian mixture modeling on
is suboptimal, since the joint distribution or cluster-specific marginal distributions are not normally distributed, as predicted by the ASE-CLT (2). shows that row-normalization is beneficial, since the marginal distributions for
are normally distributed within each community, but it is not appropriate for modeling the joint distribution and for at least one of the two marginal distributions for
. On the other hand, the transformation
in visually meets the assumption of normality. This visual impression is confirmed in the Supplementary Material, which describes a CLT which strongly supports the normality in
for the two-dimensional case.
5.2 Structure of the Likelihood
In order to validate the conjecture on the model likelihood proposed in (5), a simulation study has been carried out. N = 1000 DCSBMs with n = 2000 nodes have been simulated, fixing K = 3, and pre-allocating the nodes to equal-sized clusters. The community-specific latent positions used in the simulation are , and
, resulting in the block-probability matrix
(9)
(9)
The matrix is positive-definite and has full rank 3, implying that d = 2 in the embedding . For each of the N simulations, the link probabilities are corrected using the degree-correction parameters ρi sampled from a
distribution, and the adjacency matrices A are obtained using (1). For each of the simulated graphs ASE is calculated for a large value of m. The results are summarized in .
Fig. 4 Boxplots for simulations of a degree-corrected stochastic blockmodel with
nodes, K = 3, equal number of nodes allocated to each group, and B described in (9), corrected by parameters ρi sampled from a
distribution.
![Fig. 4 Boxplots for N=1000 simulations of a degree-corrected stochastic blockmodel with n=2000 nodes, K = 3, equal number of nodes allocated to each group, and B described in (9), corrected by parameters ρi sampled from a Uniform(0,1) distribution.](/cms/asset/5bc687f8-08b1-4953-ad04-b6e66f7f0cc6/utch_a_2008503_f0004_c.jpg)
The true underlying cluster allocations are known in the simulation, and can be used to validate the model assumptions. In this section, the community-specific mean and covariance matrices obtained from the embedding of the sampled graph will be denoted as
and
. shows the boxplots of the community-specific estimated means for the first two components of
. The mean values show minimal variation across the different simulations, and are clearly different from 0 and differ between clusters. In , boxplots for the community-specific estimated variances in
are plotted. Again, it seems that having cluster-specific variances is sensible, and at least one of the covariances is significantly different from 0, as expected from the theory. On the other hand, shows the boxplots for the community-specific estimated means in
, which are all centered at π. Hence, the assumption on the mean structure in (5) seems to be justified.
A potentially more controversial modeling choice is the community-specific variance for the last m – d components of the embedding. shows the estimated variances for each community on different dimensions exceeding 2. It is clear that the variance is different across different communities. This consideration is reinforced by , which shows the boxplots of Kolmogorov–Smirnov (KS) scores obtained from fitting community-specific Gaussian distributions on the dimensions exceeding d, compared to the KS score for a Gaussian fit on all the estimated latent positions for the corresponding dimension. Clearly, a correct modeling approach must use community-specific variances. This implies that residual cluster information is present also in the last m – d dimensions of the embedding. Finally, plots the boxplots of estimated covariances between and
. Consistent with the model in (5), the correlations are scattered around 0, and the assumption of independence between those components seems reasonable. This is confirmed by the plot of the average estimated covariance matrix for the first community, in .
5.3 Normality of the Spherical Coordinates
The most important comparison is to establish whether the embedding is better suited to GMM than the row-normalized embedding
traditionally used in the literature. To make this comparison, the p-values for the two Mardia tests for multivariate normality (Mardia Citation1970) have been calculated for each community-specific distribution, for each simulated graph. The tests are based on multivariate extensions of skewness and kurtosis: assume a sequence of random vectors
, and define the sample mean and sample covariance as
and
, respectively. Mardia (Citation1970) defined two test statistics for multivariate skewness and kurtosis
(10)
(10)
(11)
(11)
(12)
(12)
Under the null hypothesis of multivariate normality, and
in distribution for
(Mardia Citation1970). Given observed values of TS and TK, p-values pS and pK can be calculated from the asymptotic distribution.
Under the same setup as the simulation in Section 5.2, p-values are calculated for the two Mardia tests applied for each community on the spherical embedding and the row-normalized embedding
. Then, binomial sign tests for paired observations are calculated on the differences between the p-values obtained from
, and those obtained from
, separately for pS and pK, under the null hypothesis that those are sampled from the same distribution. The alternative hypothesis is that the distribution of the p-values obtained from
is stochastically larger than the corresponding distribution for
. The p-value of the sign test is
for both skewness and kurtosis, confirming the impression in that the transformation (4) to
tends to Gaussianize the embeddings
and
.
6 Applications and Results
In this section, the model selection procedure is assessed on simulated DCSBMs and real-world bipartite graphs obtained from the network flow data collected at Imperial College London. The DCSBM for bipartite graphs is a simple extension of the undirected model, and has a similar RDPG-structure to the bipartite stochastic co-blockmodel (ScBM, Rohe, Qin, and Yu Citation2016). In bipartite graphs, the nodes are divided into two nonoverlapping groups V and such that
. Such networks are conveniently represented by rectangular adjacency matrices
, where
and
. Suppose the nodes in V and
respectively belong to K and
communities, with respective community allocations
and
. Also, suppose for each of the two sets of communities there are latent positions
, and
, such that
. This gives the link probability
(13)
(13) where
and
are degree correction parameters for each of the nodes in V and
. From A, it is possible to obtain embeddings
and
using the DASE in Section 2.2, and cluster the two embeddings jointly or separately. In this work, the quality of the clustering is evaluated using the adjusted Rand index (ARI, Hubert and Arabie Citation1985). Higher values of the ARI correspond to better clustering performance, reaching a maximum of 1 for perfect agreement between the estimated clustering and the true labels.
6.1 Synthetic Networks
The performance of the model selection procedure described in Section 4 is evaluated on simulated DCSBMs. N = 250 undirected graphs with n = 1000 and were simulated, randomly selecting B from
and sampling the degree correction parameters from
. The nodes were allocated to communities of equal size. For each of the graphs, the models of Sanna Passino and Heard (Citation2020) and Yang et al. (Citation2020) are applied to the ASE
and its row-normalized version
for m = 10, selecting the estimates of d and K using BIC. The value m = 10 is usually approximately equal in the simulations to the third elbow of the scree plot using the criterion of Zhu and Ghodsi (Citation2006), considering a total of 25 eigenvalues or singular values. Also, the model in (5) is fitted to
, estimating d and K using the selection procedure in Section 4, with
. The results of the simulations are reported in .
Table 1 Estimated performance for N = 250 simulated DCSBMs and bipartite DCScBMs.
A similar simulation has been repeated for bipartite DCScBMs. N = 250 graphs with n = 1000 and =1500 were generated, setting
, communities of equal size,
, and
. The results are reported in .
The table shows that the transformed embedding sometimes has a slightly inferior performance when estimating the correct value of the latent dimension d (see ), but outperforms the alternative methodologies significantly in the ability to estimate the number of communities K. In particular, the Gaussian mixture model is not well suited to either the standard embedding nor the row-normalized
, and the distortion caused by the degree-corrections and row-normalization does not allow correct estimation of K. This problem is alleviated when the spherical coordinates estimator is used. The improvement is reflected in a significant difference in the clustering performance, demonstrated by the average ARI scores for the three different procedures. The table also shows that estimates of d based on the model of Sanna Passino and Heard (Citation2020) and Yang et al. (Citation2020) on
and
seem to be slightly more accurate than alternative methods on the DCScBM. It might be therefore tempting to construct a hybrid model that uses
for the top-d embeddings and
for the remaining components, and proceed to select the most appropriate d under such a joint model. Unfortunately, model comparison via BIC is not possible in that setting.
The simulation is also repeated for different values of n, evaluating the asymptotic behavior of the proposed community detection procedure. The results are plotted in , demonstrating that the performance of the spherical coordinates estimator improves when the number of nodes in the graph increases. This is consistent with the CLT presented in the Supplementary Material. On the other hand, the results obtained from the alternative estimators degrade with n, providing further evidence that the proposed model (5) appears to be more appropriate for community detection under the DCSBM.
Fig. 5 Estimated performance for N = 250 simulated DCSBMs and DCScBMs, for . For bipartite DCScBMs,
.
![Fig. 5 Estimated performance for N = 250 simulated DCSBMs and DCScBMs, for n∈{100,200,500,1000,2000}. For bipartite DCScBMs, n′∈{150,300,750,1500,3000}.](/cms/asset/336cea74-44f7-4dba-b7a3-69f6164376a3/utch_a_2008503_f0005_c.jpg)
Also, the boxplots for the paired differences between ARIs are plotted in . The clustering based on consistently outperforms
and
(see ). The difference can be quantitatively evaluated using binomial sign tests for paired observations, similarly to Section 5.3. For both undirected and bipartite graphs, the p-values of the sign tests are
, overwhelmingly suggesting that the clustering based on
is superior to the competing methodologies. Furthermore, the difference increases when the number of nodes increases (see ). Overall, the simulations suggest that the proposed spectral clustering procedure for estimation of the DCSBM, based on the spherical coordinates of the ASE estimator of the latent positions, appears to outperform competing estimators, including spectral clustering on the row-normalized ASE estimator.
6.2 Imperial College Network Flow Data
The model proposed in this work has been specifically developed for clustering networks obtained from the network flow data collected at Imperial College London (ICL). Finding communities of machines with similar behavior is important in network monitoring and intrusion detection (Neil et al. Citation2013). In this application, the edges relate to all the HTTP (port 80) and HTTPS (port 443) connections observed from machines hosted in computer labs in different departments at ICL in January 2020. The source nodes V are computers hosted in college laboratories, and the destination nodes are internet servers. Intuitively, computers in a laboratory tend to have a heterogeneous degree distribution of their web connections because they are not used uniformly: a computer located closed to the entrance of the laboratory might be used more than a machine at the back. Therefore, the total time of activity is different across machines in the same community, which suggests that the DCSBM is an appropriate modeling choice. This is confirmed by the within-community degree distributions in (Section 1), which compares one of the Imperial College networks (ICL2, see ) with a simulated SBM and DCSBM (see ). displays the within-community out-degree distribution of a simulated ScBM, see (13), with
, equal-sized communities, and block-community matrix
, with
,
, and
if
. displays the same distribution for a simulated DCScBM with block-community matrix
, corrected sampling
.
Three real-world computer networks are considered here, and corresponding summary statistics are presented in , where ZG denotes the position of the
th elbow of the scree plot according to the method of Zhu and Ghodsi (Citation2006). For the source nodes, a known underlying community structure is given by the department to which each machine belongs. ICL1 corresponds to machines hosted in the departments of Physics, Electrical Engineering, and Earth Science, whereas computers in Chemistry, Civil Engineering, Mathematics, and Medicine are considered in ICL2. For ICL3, the departments are Aeronautical Engineering, Civil Engineering, Electrical Engineering, Mathematics, and Physics. Students use the computer laboratories for tutorials and classes, and therefore some variation might be expected in the activities of different machines across different departments.
Table 2 Summary statistics for the Imperial College London computer networks.
shows the scatterplots of the leading two dimensions of the m-dimensional embeddings ,
and
for m = 30 for ICL2, showing that the clustering task is particularly difficult in this network, and there is not much separation between the communities. Despite this, the transformation to spherical coordinates (see ) appears to make the communities more Gaussian-shaped, as opposed to the standard and row-normalized DASE, where the within-community embeddings are curved (see ), and therefore not amenable to Gaussian mixture modeling. This impression is quantitatively confirmed by the Mardia tests for multivariate normality on each community (see Section 5.3): the difference of the log-p-values obtained from the spherical coordinates transformation and the standard DASE is on average
for the kurtosis and
for the skewness, further demonstrating that the proposed transformation tends to Gaussianize the embedding.
For each of the three ICL network graphs, the DASE has been calculated for m = 30 and m = 50, and the model in Sanna Passino and Heard (Citation2020) and Yang et al. (Citation2020) has been fitted on the source embeddings and the row-normalized version
, whereas the model (5) is used on the transformation
, setting
. The resulting estimated values of d and K, obtained using the minimum BIC, are reported in . In order to reduce the sensitivity to initialization, the model was fitted ten times for each pair (d, K), and the parameter estimates corresponding to the minimum BIC were retained. The choice of
corresponds to values around the third and fourth elbows in the scree plot of singular values, according to the criterion of Zhu and Ghodsi (Citation2006) (see ).
Table 3 Estimates of (d, K) and ARIs for the embeddings and
for
and alternative methodologies.
Comparing for the two different values of m, it seems that the estimates of K based on are closer to the underlying true number of communities. In particular, K = 3 for ICL1, K = 4 for ICL2, and K = 5 for ICL3, based on the number of departments.
Based on the estimates in , the estimated community allocations zi were obtained using (8), and the adjusted Rand index was calculated using the department as labels. For further comparisons, the results were compared to other popular community detection methods: the Louvain algorithm (Blondel et al. Citation2008) adjusted for bipartite graphs (Dugué and Perez Citation2015), and the hierarchical Louvain (HLouvain) and Paris methods (Bonald et al. Citation2018), all in their default implementation for bipartite graphs in the python library scikit-network. The results are reported in . Clearly, clustering on the embedding outperforms the alternatives, including
and
, in all the three networks. In some cases, the improvement is substantial, for example in ICL2, where the proposed method reaches the score 0.938, corresponding to only 9 misclassified nodes out of 439, particularly remarkable considering the lack of separation of the clusters in .
The results were confirmed by binomial paired sign tests on the difference between ARI scores for N = 25 iterations of the community detection algorithms, which returned p-values in favour of the spherical coordinates estimator over the alternative methodologies.
In terms of network structure, the results could be interpreted as follows: the computers connect to a set of shared college-wide web servers (e.g., the virtual learning environment and the library services), and to services specific to the discipline carried out in each department. Furthermore, each machine has different levels of activity, which leads to heterogeneous within-community degree distributions. The SBM, corresponding to , only clusters the nodes based on their degree, whereas the DCSBM is able to uncover the departmental structure, in particular when the spherical coordinates estimator
is used. Under the assumption that the networks were generated under a DCSBM, the results confirm that the estimate based on spherical coordinates proposed in this work appears to produce superior results when compared to the standard or row-normalized ASE.
7 Conclusion
In this article, a novel method for spectral clustering under the degree-corrected stochastic blockmodel has been proposed. The model is based on a transformation to spherical coordinates of the commonly used adjacency spectral embedding. Such a transformation seems more suited to Gaussian mixture modeling than the row-normalized embedding, which is widely used in the literature for spectral clustering. The proposed methodology is then incorporated within a simultaneous model selection scheme that allows the model dimension d and the number of communities K to be determined. The optimal values of d and K are chosen using the popular Bayesian information criterion. The framework also extends simply to include directed and bipartite graphs. Results on synthetic data and real-world computer networks show superior performance over competing methods.
Supplemental Material
Download Zip (225.1 KB)Acknowledgments
The authors thank the editor, associate editor, and two referees for their valuable comments, that greatly improved the quality of the article. FSP acknowledges funding from EPSRC.
Supplementary Material
The supplementary material for this article describes a central limit theorem for the spherical coordinates of the latent positions for d = 2, and additional simulation studies for model validation. The first author’s Github page (https://github.com/fraspass/dcsbm) contains python code to reproduce the results and simulations presented in this article.
Additional information
Funding
References
- Alanis-Lobato, G., P. Mier, and M. A. Andrade-Navarro (2016), “Efficient Embedding of Complex Networks to Hyperbolic Space Via Their Laplacian,” Scientific Reports, 6, 30108. DOI: https://doi.org/10.1038/srep30108.
- Amini, A. A., Chen, A., Bickel, P. J., and Levina, E. (2013), “Pseudo-Likelihood Methods for Community Detection in Large Sparse Networks,” Annals of Statistics, 41, 2097–2122.
- Athreya, A., Priebe, C. E., Tang, M., Lyzinski, V., Marchette, D. J., and Sussman, D. L. (2016), “A Limit Theorem for Scaled Eigenvectors of Random Dot Product Graphs,” Sankhya A, 78, 1–18. DOI: https://doi.org/10.1007/s13171-015-0071-x.
- Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008), “Fast Unfolding of Communities in Large Networks,” Journal of Statistical Mechanics: Theory and Experiment, P10008. DOI: https://doi.org/10.1088/1742-5468/2008/10/P10008.
- Bonald, T., Charpentier, B., Galland, A., and Hollocou, A. (2018), “Hierarchical Graph Clustering Based on Node Pair Sampling,” in Proceedings of the 14th International Workshop on Mining and Learning With Graphs (MLG).
- Braun, M., and Bonfrer, A. (2011), “Scalable Inference of Customer Similarities From Interactions Data Using Dirichlet Processes,” Marketing Science, 30, 513–531. DOI: https://doi.org/10.1287/mksc.1110.0640.
- Chaudhuri, K., Chung, F., and Tsiatas, A. (2012), “Spectral Clustering of Graphs With General Degrees in the Extended Planted Partition Model,” in Proceedings of the 25th Annual Conference on Learning Theory, pp. 35.1–35.23.
- Chen, Y., Li, X., and Xu, J. (2018), “Convexified Modularity Maximization for Degree-Corrected Stochastic Block Models,” Annals of Statistics, 46, 1573–1602.
- Dempster, A., Laird, N., and Rubin, D. (1977), “Maximum Likelihood From Incomplete Data Via The EM Algorithm,” Journal of the Royal Statistical Society, Series B, 39, 1–38.
- Dugué, N., and Perez, A. (2015), “Directed Louvain: Maximizing Modularity in Directed Networks,” Technical Report hal-01231784, Université d’Orléans.
- Fraley, C., and Raftery, A. E. (2002), “Model-Based Clustering, Discriminant Analysis, and Density Estimation,” Journal of the American Statistical Association, 97, 611–631. DOI: https://doi.org/10.1198/016214502760047131.
- Gao, C., Ma, Z., Zhang, A. Y., and Zhou, H. H. (2018), “Community Detection in Degree-Corrected Block Models,” Annals of Statistics, 46, 2153–2185.
- Gulikers, L., Lelarge, M., and Massoulié, L. (2017), “A Spectral Method for Community Detection in Moderately Sparse Degree-Corrected Stochastic Block Models,” Advances in Applied Probability, 49, 686–721. DOI: https://doi.org/10.1017/apr.2017.18.
- Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983), “Stochastic Blockmodels: First Steps,” Social Networks 5, 109–137. DOI: https://doi.org/10.1016/0378-8733(83)90021-7.
- Hubert, L., and Arabie, P. (1985), “Comparing Partitions,” Journal of Classification, 2, 193–218. DOI: https://doi.org/10.1007/BF01908075.
- Jin, J. (2015), “Fast Community Detection by SCORE,” Annals of Statistics, 43, 57–89.
- Jones, A., and Rubin-Delanchy, P. (2020), “The Multilayer Random Dot Product Graph,” arXiv e-prints.
- Karrer, B., and Newman, M. E. J. (2011), “Stochastic Blockmodels and Community Structure in Networks,” Physical Review E, 83. DOI: https://doi.org/10.1103/PhysRevE.83.016107.
- Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., and Boguñá, M. (2010), “Hyperbolic Geometry of Complex Networks,” Physical Review E, 82. DOI: https://doi.org/10.1103/PhysRevE.82.036106.
- Lei, J., and Rinaldo, A. (2015), “Consistency of Spectral Clustering in Stochastic Block Models,” Annals of Statistics, 43, 215–237.
- Mardia, K. V. (1970), “Measures of Multivariate Skewness and Kurtosis With Applications,” Biometrika, 57, 519–530. DOI: https://doi.org/10.1093/biomet/57.3.519.
- McCormick, T. H., and Zheng, T. (2015), “Latent Surface Models for Networks Using Aggregated Relational Data,” Journal of the American Statistical Association, 110, 1684–1695. DOI: https://doi.org/10.1080/01621459.2014.991395.
- Meyer, C. D. (2000), Matrix Analysis and Applied Linear Algebra, Philadelphia, PA: SIAM.
- Neil, J., Hash, C., Brugh, A., Fisk, M., and Storlie, C. B. (2013), “Scan Statistics for the Online Detection of Locally Anomalous Subgraphs,” Technometrics, 55, 403–414. DOI: https://doi.org/10.1080/00401706.2013.822830.
- Ng, A. Y., Jordan, M. I., and Weiss, Y. (2001), “On Spectral Clustering: Analysis and an Algorithm,” in Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, pp. 849–856.
- Peng, L., and Carvalho, L. (2016), “Bayesian Degree-Corrected Stochastic Blockmodels for Community Detection,” Electronic Journal of Statistics, 10, 2746–2779. DOI: https://doi.org/10.1214/16-EJS1163.
- Qin, T., and Rohe, K. (2013), “Regularized Spectral Clustering Under the Degree-Corrected Stochastic Blockmodel,” in Proceedings of the 26th International Conference on Neural Information Processing Systems, pp. 3120–3128.
- Raftery, A. E., and Dean, N. (2006), “Variable Selection for Model-Based Clustering,” Journal of the American Statistical Association, 101, 168–178. DOI: https://doi.org/10.1198/016214506000000113.
- Rohe, K., Qin, T., and Yu, B. (2016), “Co-Clustering Directed Graphs to Discover Asymmetries and Directional Communities,” Proceedings of the National Academy of Sciences, 113, 12679–12684. DOI: https://doi.org/10.1073/pnas.1525793113.
- Rubin-Delanchy, P., Cape, J., Tang, M., and Priebe, C. E. (2017), “A Statistical Interpretation of Spectral Embedding: The Generalised Random Dot Product Graph.” arXiv e-prints.
- Sanna Passino, F., and Heard, N. A. (2020), “Bayesian Estimation of the Latent Dimension and Communities in Stochastic Blockmodels,” Statistics and Computing, 30, 1291–1307. DOI: https://doi.org/10.1007/s11222-020-09946-6.
- Sussman, D. L., Tang, M., and Priebe, C. E. (2014), “Consistent Latent Position Estimation and Vertex Classification for Random Dot Product Graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 48–57. DOI: https://doi.org/10.1109/TPAMI.2013.135.
- Tang, M., and Priebe, C. E. (2018), “Limit Theorems for Eigenvectors of the Normalized Laplacian for Random Graphs,” Annals of Statistics, 46, 2360–2415.
- von Luxburg, U. (2007), “A Tutorial on Spectral Clustering,” Statistics and Computing, 17. DOI: https://doi.org/10.1007/s11222-007-9033-z.
- Yang, C., Priebe, C. E., Park, Y., and Marchette, D. J. (2020), “Simultaneous Dimensionality and Complexity Model Selection for Spectral Graph Clustering,” Journal of Computational and Graphical Statistics, 30, 422–441. DOI: https://doi.org/10.1080/10618600.2020.1824870.
- Young, S. J., and Scheinerman, E. R. (2007), “Random Dot Product Graph Models for Social Networks,” in Algorithms and Models for the Web-Graph, pp. 138–149.
- Zhao, Y., Levina, E., and Zhu, J. (2012), “Consistency of Community Detection in Networks Under Degree-Corrected Stochastic Block Models,” Annals of Statistics, 40, 2266–2292.
- Zhu, M., and Ghodsi, A. (2006), “Automatic Dimensionality Selection From the Scree Plot Via the Use of Profile Likelihood,” Computational Statistics & Data Analysis, 51, 918–930.