Abstract
In many applications of network analysis, it is important to distinguish between observed and unobserved factors affecting network structure. We show that a network model with discrete unobserved link heterogeneity and binary (or discrete) covariates corresponds to a stochastic blockmodel (SBM). We develop a spectral estimator for the effect of covariates on link probabilities, exploiting the correspondence between SBMs and generalized random dot product graphs (GRDPG). We show that computing our estimator is much faster than standard variational expectation–maximization algorithms and scales well for large networks. Monte Carlo experiments suggest that the estimator performs well under different data generating processes. Our application to Facebook data shows evidence of homophily in gender, role, and campus residence, while allowing us to discover unobserved communities. Finally, we establish asymptotic normality of our estimators.
Acknowledgments
We are grateful to Cong Mu and Jipeng Zhang for excellent research assistance. We thank the editor, associate editor and two referees, Avanti Athreya, Eric Auerbach, Federico Bandi, Stephane Bonhomme, Youngser Park and Eleonora Patacchini for comments and suggestions.
Funding
Figure: Screeplots (upper left and center left), estimated latent positions (right, only 2 of 4 dimensions per plot), and estimated latent positions (bottom left, up to orthogonal transformation) for n = 2000.
Notes
1 Recent advances use further approximations and parallelization to improve computational efficiency (Roy, Atchade, and Michailidis 2019; Vu, Hunter, and Schweinberger 2013). We do not pursue such extensions in this article.
2 It must be noted that the support of F is a subset of ℝ^d such that x⊤ I_{p,q} y ∈ [0, 1] for all x, y in the support.
3 Alternatively, we can think of a network where the latent vectors are drawn from a discrete mixture with mass centered at the block-specific latent positions, as in Equation (4).
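For reference, in the standard representation of a K-block SBM as a (G)RDPG (e.g., Athreya et al. 2018), the latent-position distribution is a point-mass mixture. The symbols ν_k and π_k below (block latent positions and block proportions) are our illustrative notation, not necessarily the article's:

```latex
F \;=\; \sum_{k=1}^{K} \pi_k \, \delta_{\nu_k},
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1 .
```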
4 The extension to multiple binary or discrete covariates is shown later in the article.
5 In the Bernoulli case, the variable is a shifted Bernoulli, taking its two values with complementary probabilities.
6 A possible alternative is to exploit the overidentification of β in the block probability matrix to develop a minimum distance estimator. We do not pursue this direction here, as we focus on a fully spectral estimator.
7 The notation means that for any real constant a > 0 there exists an integer n_0 such that the stated bound holds with probability at least 1 − n^(−a) for every integer n ≥ n_0.
8 We do not compare the spectral estimator to the variational EM estimator, because the latter is too slow for a Monte Carlo with 1000 repetitions, even after parallelizing the execution.
9 The time of estimation reported in the table includes the following steps: (a) compute the ASE from the adjacency matrix; (b) compute the matrix of latent positions; (c) cluster the latent positions to recover blocks; (d) compute the estimated block probability matrix; (e) cluster its diagonal entries to recover the unobservable block structure; (f) estimate β using the information on the block structure and the entries of that matrix; (g) compute the simple mean and the weighted mean of the estimated β according to (41) and (43). The simulation takes slightly longer because we also need to generate the data and the adjacency matrices for the Monte Carlo. Code is available on GitHub.
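Steps (a)–(c) of this pipeline can be sketched in a few lines. The SBM parameters and all names below are toy choices for illustration, and the details (eigendecomposition-based ASE, k-means clustering) are one standard implementation, not necessarily the exact one used in the article:

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulate a small 2-block SBM (toy parameters for illustration).
n, d = 200, 2
z = rng.integers(0, 2, size=n)            # true block labels
B = np.array([[0.5, 0.1],
              [0.1, 0.4]])                # block probability matrix
P = B[z][:, z]                            # edge probabilities
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric, no self-loops

# (a) Adjacency spectral embedding: top-d eigenpairs, scaled by sqrt(|eigenvalue|).
vals, vecs = eigsh(A, k=d, which="LM")
Xhat = vecs * np.sqrt(np.abs(vals))       # (b) estimated latent positions

# (c) Cluster the estimated latent positions to recover blocks.
labels = KMeans(n_clusters=d, n_init=10, random_state=0).fit_predict(Xhat)
```

On this toy graph the recovered labels typically match the true blocks up to a relabeling of the two communities.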
10 The entire dataset is available at https://archive.org/details/oxford-2005-facebook-matrix.
11 Roles include students, faculty, staff, and alumni. We focus on students because they were the main users of the platform in 2005.
12 This is a standard procedure in the literature on SBMs (Athreya et al. 2018; Abbe 2018; Roy, Atchade, and Michailidis 2019).
13 Before we proceed to estimation, we regularize the adjacency matrix using the standard method proposed by Le, Levina, and Vershynin (2017). This regularization step avoids numerical issues in the spectral decomposition arising from substantial node-degree heterogeneity.
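As a rough sketch of adjacency-matrix regularization: the snippet below uses the simpler "add a constant" variant familiar from regularized spectral clustering, not the exact degree-trimming procedure of Le, Levina, and Vershynin; the default choice of τ as the average degree is our assumption:

```python
import numpy as np

def regularize_adjacency(A, tau=None):
    """Add a constant tau/n to every entry of the adjacency matrix.

    A simple regularization sketch (assumed variant, not the exact
    method cited in the article); the default tau is the average degree.
    """
    n = A.shape[0]
    if tau is None:
        tau = A.sum() / n      # average degree
    return A + tau / n

# Tiny path graph on 3 nodes as a usage example.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_reg = regularize_adjacency(A)
```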
14 Multiple methods exist for selecting the embedding dimension in practice, and this remains an active research topic. In the context of networks, choosing a dimension smaller than the true d introduces bias in the estimated latent positions, while choosing a dimension larger than the true d increases their variance. Given this tradeoff, we prefer to err on the side of overestimating d. Specifically, we choose one plus the location of the first elbow in the screeplot.
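The elbow rule in this footnote can be illustrated with a simplified profile-likelihood elbow detector in the spirit of Zhu and Ghodsi (2006): the sorted eigenvalues are split into two Gaussian groups sharing a pooled variance, and the split that maximizes the likelihood marks the elbow. The function and the toy eigenvalues are ours, for illustration only:

```python
import numpy as np

def profile_likelihood_elbow(eigvals):
    """Locate a screeplot elbow via a simplified profile likelihood.

    Eigenvalue magnitudes are modeled as two Gaussian groups with a
    pooled variance; the split minimizing that variance (equivalently,
    maximizing the profile log-likelihood) is returned as the elbow.
    """
    x = np.sort(np.abs(np.asarray(eigvals)))[::-1]   # descending magnitudes
    n = len(x)
    best_q, best_ll = 1, -np.inf
    for q in range(1, n):
        mu1, mu2 = x[:q].mean(), x[q:].mean()
        # pooled MLE variance of the two groups
        var = (((x[:q] - mu1) ** 2).sum() + ((x[q:] - mu2) ** 2).sum()) / n
        var = max(var, 1e-12)
        # Gaussian log-likelihood evaluated at the pooled MLE variance
        ll = -0.5 * n * np.log(2 * np.pi * var) - 0.5 * n
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q

# Toy screeplot with a clear elbow after the third eigenvalue.
vals = [10.0, 9.5, 9.0, 1.2, 1.1, 1.0, 0.9]
d_hat = profile_likelihood_elbow(vals) + 1   # "one plus the first elbow"
```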