
Spectral Estimation of Large Stochastic Blockmodels with Discrete Nodal Covariates


Abstract

In many applications of network analysis, it is important to distinguish between observed and unobserved factors affecting network structure. We show that a network model with discrete unobserved link heterogeneity and binary (or discrete) covariates corresponds to a stochastic blockmodel (SBM). We develop a spectral estimator for the effect of covariates on link probabilities, exploiting the correspondence of SBMs and generalized random dot product graphs (GRDPG). We show that computing our estimator is much faster than standard variational expectation–maximization algorithms and scales well for large networks. Monte Carlo experiments suggest that the estimator performs well under different data generating processes. Our application to Facebook data shows evidence of homophily in gender, role and campus-residence, while allowing us to discover unobserved communities. Finally, we establish asymptotic normality of our estimators.

Acknowledgments

We are grateful to Cong Mu and Jipeng Zhang for excellent research assistance. We thank the editor, associate editor and two referees, Avanti Athreya, Eric Auerbach, Federico Bandi, Stephane Bonhomme, Youngser Park and Eleonora Patacchini for comments and suggestions.

Figure: Screeplots (upper left and center left); estimated latent positions Ŷ (right, only 2 of the 4 dimensions shown per plot); and estimated latent positions X̂, that is, p̂ and q̂ (bottom left, up to orthogonal transformation), for n = 2000.

Notes

1 Recent advances use further approximations and parallelization to improve computational efficiency (Roy, Atchade, and Michailidis 2019; Vu, Hunter, and Schweinberger 2013). We do not pursue such extensions in this article.

2 It must be noted that the support $\mathcal{X}_d$ of $F$ is a subset of $\mathbb{R}^d$ such that $x^{\top} I_{d_1,d_2}\, y \in [0,1]$ for all $x, y \in \mathcal{X}_d$.
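
As a small illustration (not part of the original note), take $d_1 = d_2 = 1$, so that

$$I_{1,1} = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \qquad x^{\top} I_{1,1}\, y = x_1 y_1 - x_2 y_2,$$

and the support restriction requires exactly that these indefinite inner products lie in $[0,1]$, so that they can serve as link probabilities.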

3 Alternatively, we can think of a network where the vectors $X_i$ are drawn from a discrete mixture with point masses at $\nu_1, \dots, \nu_K$, that is,

$$X_i \sim \pi_1\,\delta_{\nu_1} + \pi_2\,\delta_{\nu_2} + \cdots + \pi_K\,\delta_{\nu_K}. \quad (4)$$
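
A minimal simulation sketch of this discrete mixture follows (illustrative only; the values of K, π, and ν below are assumptions, not taken from the article):

import numpy as np

rng = np.random.default_rng(0)
n, K, d = 2000, 3, 2                       # nodes, mixture components, latent dimension
pi = np.array([0.5, 0.3, 0.2])             # mixture weights pi_1, ..., pi_K
nu = rng.uniform(0.1, 0.7, size=(K, d))    # point masses nu_1, ..., nu_K
labels = rng.choice(K, size=n, p=pi)       # component of each node
X = nu[labels]                             # X_i = nu_k with probability pi_k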

4 The extension to multiple binary or discrete covariates is shown later in the article.

5 In the Bernoulli case, $E_{ij}$ is a shifted Bernoulli variable, taking the value $E_{ij} = 1 - P_{ij}$ with probability $P_{ij}$ and $E_{ij} = -P_{ij}$ with probability $1 - P_{ij}$.
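
This shift simply centers the Bernoulli noise at zero:

$$\mathbb{E}[E_{ij}] = (1 - P_{ij})\,P_{ij} + (-P_{ij})(1 - P_{ij}) = 0.$$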

6 A possible alternative is to exploit the overidentification of β in the matrix θ̂Z to develop a minimum distance estimator. We do not pursue this direction here, as we focus on a fully spectral estimator.

7 The notation $n\rho_n = \omega(\sqrt{n})$ means that for any real constant $a > 0$ there exists an integer $n_0 \geq 1$ such that $\rho_n > a/\sqrt{n}$ for every integer $n \geq n_0$.
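
Under this reading of the condition, for example, a sparsity sequence $\rho_n = n^{-1/4}$ qualifies, since

$$n\rho_n = n^{3/4} \quad \text{and} \quad \frac{n^{3/4}}{\sqrt{n}} = n^{1/4} \to \infty,$$

so the expected degree $n\rho_n$ grows faster than $\sqrt{n}$.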

8 We do not compare the spectral estimator to the variational EM estimator, because the latter is too slow for a Monte Carlo with 1000 repetitions, even after parallelizing the execution.

9 The estimation time reported in the table includes the following steps: (a) compute the ASE from the adjacency matrix; (b) compute the matrix of latent positions; (c) cluster the latent positions to recover blocks; (d) compute the matrix B̂Z; (e) cluster the diagonal entries of B̂Z to recover the unobservable block structure; (f) estimate β using the information on the block structure and the entries of B̂Z; (g) compute the simple mean and weighted mean of the estimated β̂ according to (41) and (43). The simulation takes slightly longer because we also need to generate the data and the adjacency matrices for the Monte Carlo. Code is available on GitHub.
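
For concreteness, the following is a minimal sketch of steps (a) through (c) on a toy two-block SBM. It is not the authors' released code: the parameters of the toy graph, the embedding dimension, and the use of scikit-learn's KMeans for the clustering step are all illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def adjacency_spectral_embedding(A, d):
    # (a)-(b) ASE: scale the top-d eigenvectors (by eigenvalue magnitude)
    # of the adjacency matrix to obtain estimated latent positions.
    eigvals, eigvecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(eigvals))[::-1][:d]
    return eigvecs[:, idx] * np.sqrt(np.abs(eigvals[idx]))

# Toy two-block SBM (illustrative parameters only).
rng = np.random.default_rng(1)
z = rng.choice(2, size=200)                                   # true block labels
P = np.where(z[:, None] == z[None, :], 0.30, 0.05)            # within/between-block probabilities
A = np.triu(rng.binomial(1, P), 1)
A = A + A.T                                                   # symmetric adjacency, no self-loops

Xhat = adjacency_spectral_embedding(A, d=2)                   # estimated latent positions
blocks = KMeans(n_clusters=2, n_init=10).fit_predict(Xhat)    # (c) cluster to recover blocks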

10 The entire dataset is available at https://archive.org/details/oxford-2005-facebook-matrix.

11 Roles include students, faculty, staff, and alumni. We focus on students because they were the main users of the platform in 2005.

12 This is a standard procedure in the literature on SBMs (Athreya et al. 2018; Abbe 2018; Roy, Atchade, and Michailidis 2019).

13 Before we proceed to estimation, we regularize the adjacency matrix using the standard method proposed by Le, Levina, and Vershynin (2017). This regularization step avoids numerical issues in the spectral decomposition arising from substantial node degree heterogeneity.
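
As a rough illustration of degree regularization, here is a simplified trimming variant that zeroes out nodes with unusually high degree; it is not necessarily the exact procedure of Le, Levina, and Vershynin (2017), and the threshold of twice the average degree is an assumption.

import numpy as np

def trim_high_degree(A, c=2.0):
    # Zero out rows and columns of nodes whose degree exceeds c times the
    # average degree, a simple form of degree regularization.
    deg = A.sum(axis=1)
    heavy = deg > c * deg.mean()
    A_reg = A.copy()
    A_reg[heavy, :] = 0
    A_reg[:, heavy] = 0
    return A_reg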

14 Multiple methods exist for selecting the embedding dimension in practice, and this remains a topic of current research. In the context of networks, choosing a dimension smaller than the true d will introduce bias in the estimated latent positions; on the other hand, using an embedding dimension larger than the true d will increase the variance of the estimated latent positions. In this tradeoff we prefer to err on the side of overestimating d. Specifically, we choose the value one plus the location of the first elbow in the screeplot.
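
A crude stand-in for this rule is sketched below, using the largest consecutive gap among the leading singular values as the "elbow"; the article's actual elbow detection may differ (e.g., a profile-likelihood criterion), and max_dim is an assumed cap.

import numpy as np

def choose_embedding_dimension(A, max_dim=20):
    # Locate the largest drop among the leading singular values of A and
    # add one, following the "first elbow plus one" rule described in the note.
    svals = np.linalg.svd(A, compute_uv=False)[:max_dim]
    gaps = svals[:-1] - svals[1:]
    elbow = int(np.argmax(gaps)) + 1   # number of singular values kept before the largest drop
    return elbow + 1                   # err on the side of overestimating d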

Additional information

Funding

Funding from the Institute of Data Intensive Engineering and Science (IDIES) at Johns Hopkins University and NSF grant SES-1951005 is gratefully acknowledged. Joshua Cape also gratefully acknowledges support from NSF grant DMS-1902755.
