Abstract
In many applications of network analysis, it is important to distinguish between observed and unobserved factors affecting network structure. We show that a network model with discrete unobserved link heterogeneity and binary (or discrete) covariates corresponds to a stochastic blockmodel (SBM). We develop a spectral estimator for the effect of covariates on link probabilities, exploiting the correspondence between SBMs and generalized random dot product graphs (GRDPG). We show that computing our estimator is much faster than standard variational expectation–maximization algorithms and scales well for large networks. Monte Carlo experiments suggest that the estimator performs well under different data generating processes. Our application to Facebook data shows evidence of homophily in gender, role, and campus residence, while allowing us to discover unobserved communities. Finally, we establish asymptotic normality of our estimators.
Acknowledgments
We are grateful to Cong Mu and Jipeng Zhang for excellent research assistance. We thank the editor, associate editor and two referees, Avanti Athreya, Eric Auerbach, Federico Bandi, Stephane Bonhomme, Youngser Park and Eleonora Patacchini for comments and suggestions.
Funding
Figure: Screeplots (upper left and center left), estimated latent positions (right, only 2 of 4 dimensions per plot), and estimated latent positions (bottom left, up to orthogonal transformation) for n = 2000.
Notes
1 Recent advances use further approximations and parallelization to improve computational efficiency (Roy, Atchade, and Michailidis 2019; Vu, Hunter, and Schweinberger 2013). We do not pursue such extensions in this article.
2 It must be noted that the support of F is a subset of ℝ^d such that x⊤ I_{p,q} y ∈ [0, 1] for all x, y in the support.
3 Alternatively, we can think of a network where the latent vectors are drawn from a discrete mixture with mass centered at the block-specific latent positions, as in Equation (4).
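For reference, in the standard representation of a K-block SBM as a (G)RDPG (e.g., Athreya et al. 2018), the latent-position distribution is a point-mass mixture. The symbols ν_k and π_k below (block latent positions and block proportions) are our illustrative notation, not necessarily the article's:

```latex
F \;=\; \sum_{k=1}^{K} \pi_k \, \delta_{\nu_k},
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1 .
```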
4 The extension to multiple binary or discrete covariates is shown later in the article.
5 In the Bernoulli case, the variable is a shifted Bernoulli, taking its two values with complementary probabilities.
6 A possible alternative is to exploit the overidentification of β in the block probability matrix to develop a minimum distance estimator. We do not pursue this direction here, as we focus on a fully spectral estimator.
7 The notation means that for any real constant a > 0 there exists an integer n_0 such that the stated bound holds with probability at least 1 − n^(−a) for every integer n ≥ n_0.
8 We do not compare the spectral estimator to the variational EM estimator, because the latter is too slow for a Monte Carlo with 1000 repetitions, even after parallelizing the execution.
9 The time of estimation reported in the table includes the following steps: (a) compute the ASE from the adjacency matrix; (b) compute the matrix of latent positions; (c) cluster the latent positions to recover blocks; (d) compute the estimated block probability matrix; (e) cluster its diagonal entries to recover the unobservable block structure; (f) estimate β using the information on the block structure and the entries of that matrix; (g) compute the simple mean and the weighted mean of the estimated β according to (41) and (43). The simulation takes slightly longer because we also need to generate the data and the adjacency matrices for the Monte Carlo. Code is available on GitHub.
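Steps (a)–(c) of this pipeline can be sketched in a few lines. The SBM parameters and all names below are toy choices for illustration, and the details (eigendecomposition-based ASE, k-means clustering) are one standard implementation, not necessarily the exact one used in the article:

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulate a small 2-block SBM (toy parameters for illustration).
n, d = 200, 2
z = rng.integers(0, 2, size=n)            # true block labels
B = np.array([[0.5, 0.1],
              [0.1, 0.4]])                # block probability matrix
P = B[z][:, z]                            # edge probabilities
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric, no self-loops

# (a) Adjacency spectral embedding: top-d eigenpairs, scaled by sqrt(|eigenvalue|).
vals, vecs = eigsh(A, k=d, which="LM")
Xhat = vecs * np.sqrt(np.abs(vals))       # (b) estimated latent positions

# (c) Cluster the estimated latent positions to recover blocks.
labels = KMeans(n_clusters=d, n_init=10, random_state=0).fit_predict(Xhat)
```

On this toy graph the recovered labels typically match the true blocks up to a relabeling of the two communities.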
10 The entire dataset is available at https://archive.org/details/oxford-2005-facebook-matrix.
11 Roles include students, faculty, staff, and alumni. We focus on students because they were the main users of the platform in 2005.
12 This is a standard procedure in the literature on SBMs (Athreya et al. 2018; Abbe 2018; Roy, Atchade, and Michailidis 2019).
13 Before we proceed to estimation, we regularize the adjacency matrix using the standard method proposed by Le, Levina, and Vershynin (2017). This regularization step avoids numerical issues in the spectral decomposition arising from substantial node-degree heterogeneity.
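As a rough sketch of adjacency-matrix regularization: the snippet below uses the simpler "add a constant" variant familiar from regularized spectral clustering, not the exact degree-trimming procedure of Le, Levina, and Vershynin; the default choice of τ as the average degree is our assumption:

```python
import numpy as np

def regularize_adjacency(A, tau=None):
    """Add a constant tau/n to every entry of the adjacency matrix.

    A simple regularization sketch (assumed variant, not the exact
    method cited in the article); the default tau is the average degree.
    """
    n = A.shape[0]
    if tau is None:
        tau = A.sum() / n      # average degree
    return A + tau / n

# Tiny path graph on 3 nodes as a usage example.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_reg = regularize_adjacency(A)
```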
14 Multiple methods exist for selecting the embedding dimension in practice, and this remains an active research topic. In the context of networks, choosing a dimension smaller than the true d introduces bias in the estimated latent positions, while choosing a dimension larger than the true d increases their variance. Given this tradeoff, we prefer to err on the side of overestimating d. Specifically, we choose one plus the location of the first elbow in the screeplot.
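The elbow rule in this footnote can be illustrated with a simplified profile-likelihood elbow detector in the spirit of Zhu and Ghodsi (2006): the sorted eigenvalues are split into two Gaussian groups sharing a pooled variance, and the split that maximizes the likelihood marks the elbow. The function and the toy eigenvalues are ours, for illustration only:

```python
import numpy as np

def profile_likelihood_elbow(eigvals):
    """Locate a screeplot elbow via a simplified profile likelihood.

    Eigenvalue magnitudes are modeled as two Gaussian groups with a
    pooled variance; the split minimizing that variance (equivalently,
    maximizing the profile log-likelihood) is returned as the elbow.
    """
    x = np.sort(np.abs(np.asarray(eigvals)))[::-1]   # descending magnitudes
    n = len(x)
    best_q, best_ll = 1, -np.inf
    for q in range(1, n):
        mu1, mu2 = x[:q].mean(), x[q:].mean()
        # pooled MLE variance of the two groups
        var = (((x[:q] - mu1) ** 2).sum() + ((x[q:] - mu2) ** 2).sum()) / n
        var = max(var, 1e-12)
        # Gaussian log-likelihood evaluated at the pooled MLE variance
        ll = -0.5 * n * np.log(2 * np.pi * var) - 0.5 * n
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q

# Toy screeplot with a clear elbow after the third eigenvalue.
vals = [10.0, 9.5, 9.0, 1.2, 1.1, 1.0, 0.9]
d_hat = profile_likelihood_elbow(vals) + 1   # "one plus the first elbow"
```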