
Spectral Estimation of Large Stochastic Blockmodels with Discrete Nodal Covariates

Pages 1364-1376 | Published online: 15 Nov 2022
 

Abstract

In many applications of network analysis, it is important to distinguish between observed and unobserved factors affecting network structure. We show that a network model with discrete unobserved link heterogeneity and binary (or discrete) covariates corresponds to a stochastic blockmodel (SBM). We develop a spectral estimator for the effect of covariates on link probabilities, exploiting the correspondence of SBMs and generalized random dot product graphs (GRDPG). We show that computing our estimator is much faster than standard variational expectation–maximization algorithms and scales well for large networks. Monte Carlo experiments suggest that the estimator performs well under different data generating processes. Our application to Facebook data shows evidence of homophily in gender, role, and campus residence, while allowing us to discover unobserved communities. Finally, we establish asymptotic normality of our estimators.

Acknowledgments

We are grateful to Cong Mu and Jipeng Zhang for excellent research assistance. We thank the editor, associate editor and two referees, Avanti Athreya, Eric Auerbach, Federico Bandi, Stephane Bonhomme, Youngser Park and Eleonora Patacchini for comments and suggestions.

Figure: Screeplots (upper left and center left); estimated latent positions Ŷ (right, only two of four dimensions per plot); and estimated latent positions X̂, that is, p̂ and q̂ (bottom left, up to orthogonal transformation), for n = 2000.

Notes

1 Recent advances use further approximations and parallelization to improve computational efficiency (Roy, Atchade, and Michailidis 2019; Vu, Hunter, and Schweinberger 2013). We do not pursue such extensions in this article.

2 It must be noted that the support $\mathcal{X}_d$ of $F$ is a subset of $\mathbb{R}^d$ such that $x^{\top} I_{d_1,d_2}\, y \in [0,1]$ for all $x, y \in \mathcal{X}_d$.
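For readers unfamiliar with the GRDPG notation, $I_{d_1,d_2}$ is the diagonal signature matrix with $d_1$ ones followed by $d_2$ minus ones (so $d = d_1 + d_2$), and the indefinite inner product in this footnote is

$$I_{d_1,d_2} = \operatorname{diag}(\underbrace{1,\dots,1}_{d_1},\underbrace{-1,\dots,-1}_{d_2}), \qquad x^{\top} I_{d_1,d_2}\, y = \sum_{r=1}^{d_1} x_r y_r - \sum_{r=d_1+1}^{d} x_r y_r.$$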

3 Alternatively, we can think of a network where the vectors $X_i$ are drawn from a discrete mixture with point masses at $\nu_1, \dots, \nu_K$, that is,

$$X_i \sim \pi_1 \delta_{\nu_1} + \pi_2 \delta_{\nu_2} + \cdots + \pi_K \delta_{\nu_K}. \qquad (4)$$
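A minimal sketch of drawing latent positions from the K-point mixture in (4); the block proportions and positions used below are purely illustrative, not values from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    K = 3                                   # number of blocks
    pi = np.array([0.5, 0.3, 0.2])          # mixing proportions pi_1, ..., pi_K
    nu = np.array([[0.8, 0.1],              # latent position nu_k of block k (illustrative values)
                   [0.4, 0.5],
                   [0.1, 0.9]])

    n = 2000
    z = rng.choice(K, size=n, p=pi)         # block label of node i
    X = nu[z]                               # X_i = nu_{z_i}: a draw from sum_k pi_k * delta_{nu_k}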

4 The extension to multiple binary or discrete covariates is shown later in the article.

5 In the Bernoulli case, $E_{ij}$ is a shifted Bernoulli variable, taking the value $E_{ij} = 1 - P_{ij}$ with probability $P_{ij}$ and $E_{ij} = -P_{ij}$ with probability $1 - P_{ij}$.
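As a quick check (a standard calculation, not stated in the footnote itself), this shift centers the variable:

$$\mathbb{E}[E_{ij}] = P_{ij}(1 - P_{ij}) + (1 - P_{ij})(-P_{ij}) = 0, \qquad \operatorname{Var}(E_{ij}) = P_{ij}(1 - P_{ij}).$$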

6 A possible alternative is to exploit the overidentification of β in the matrix θ̂Z to develop a minimum distance estimator. We do not pursue this direction here, as we focus on a fully spectral estimator.

7 The notation $n\rho_n = \omega(\sqrt{n})$ means that for any real constant $a > 0$ there exists an integer $n_0 \ge 1$ such that $n\rho_n > a\sqrt{n}$ for every integer $n \ge n_0$.

8 We do not compare the spectral estimator to the variational EM estimator, because the latter is too slow for a Monte Carlo with 1000 repetitions, even after parallelizing the execution.

9 The estimation time reported in the table includes the following steps: (a) compute the ASE from the adjacency matrix; (b) compute the matrix of latent positions; (c) cluster latent positions to recover blocks; (d) compute matrix B̂Z; (e) cluster diagonal entries of matrix B̂Z to recover the unobservable block structure; (f) estimate β using the information on the block structure and the entries of matrix B̂Z; (g) compute the simple mean and weighted mean of the estimated β̂ according to (41) and (43). The simulation takes slightly longer because we also need to generate the data and the adjacency matrices for the Monte Carlo. Code is available on GitHub.
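A minimal sketch of steps (a)–(d), assuming Python with NumPy and scikit-learn; the Gaussian-mixture clustering and the block-density estimate of the connection-probability matrix are illustrative choices rather than the authors' released code, and steps (e)–(g), which use Equations (41) and (43), are omitted.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def spectral_pipeline(A, d, n_blocks, seed=0):
        """A: symmetric 0/1 adjacency matrix with zero diagonal; d: embedding dimension."""
        # (a)-(b) Adjacency spectral embedding: top-d eigenpairs by magnitude,
        # scaled by sqrt(|eigenvalue|) (the GRDPG allows negative eigenvalues).
        vals, vecs = np.linalg.eigh(A)
        idx = np.argsort(np.abs(vals))[::-1][:d]
        Xhat = vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

        # (c) Cluster the estimated latent positions to recover block labels.
        labels = GaussianMixture(n_components=n_blocks, random_state=seed).fit_predict(Xhat)

        # (d) Estimate the block connection-probability matrix by block-wise edge densities.
        B = np.zeros((n_blocks, n_blocks))
        for k in range(n_blocks):
            for l in range(n_blocks):
                sub = A[np.ix_(labels == k, labels == l)]
                if k == l:
                    m = sub.shape[0]
                    B[k, l] = sub.sum() / max(m * (m - 1), 1)
                else:
                    B[k, l] = sub.mean() if sub.size else 0.0
        return Xhat, labels, B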

10 The entire dataset is available at https://archive.org/details/oxford-2005-facebook-matrix.

11 Roles include students, faculty, staff, and alumni. We focus on students because they were the main users of the platform in 2005.

12 This is a standard procedure in the literature on SBMs (Athreya et al. 2018; Abbe 2018; Roy, Atchade, and Michailidis 2019).

13 Before we proceed to estimation, we regularize the adjacency matrix using the standard method proposed by Le, Levina, and Vershynin (2017). This regularization step avoids numerical issues in the spectral decomposition arising from substantial node degree heterogeneity.

14 Multiple methods exist for selecting the embedding dimension in practice, and this remains a topic of current research. In the context of networks, choosing a dimension smaller than the true d will introduce bias in the estimated latent positions; on the other hand, using an embedding dimension larger than the true d will increase the variance of the estimated latent positions. In this tradeoff we prefer to err on the side of overestimating d. Specifically, we choose the value one plus the location of the first elbow in the screeplot.
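A minimal sketch of the "first elbow plus one" rule, assuming a Zhu–Ghodsi-style profile-likelihood criterion for locating the elbow (the footnote does not say how the elbow is found, so that choice is an assumption).

    import numpy as np

    def first_elbow(svals):
        """1-based location of the first elbow in a decreasing scree of singular values,
        chosen by a two-group normal model with common variance (profile likelihood)."""
        s = np.sort(np.asarray(svals, dtype=float))[::-1]
        p = len(s)
        best_q, best_ll = 1, -np.inf
        for q in range(1, p):
            s1, s2 = s[:q], s[q:]
            var = (np.sum((s1 - s1.mean()) ** 2) + np.sum((s2 - s2.mean()) ** 2)) / p
            var = max(var, 1e-12)
            ll = -0.5 * p * np.log(2 * np.pi * var) - 0.5 * p  # profile log-likelihood at split q
            if ll > best_ll:
                best_q, best_ll = q, ll
        return best_q

    # Embedding dimension for the ASE: one plus the location of the first elbow.
    # d_hat = first_elbow(np.linalg.svd(A, compute_uv=False)) + 1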

Additional information

Funding

Funding from the Institute of Data Intensive Engineering and Science (IDIES) at Johns Hopkins University and NSF grant SES-1951005 is gratefully acknowledged. Joshua Cape also gratefully acknowledges support from NSF grant DMS-1902755.
