
Spectral Estimation of Large Stochastic Blockmodels with Discrete Nodal Covariates


Abstract

In many applications of network analysis, it is important to distinguish between observed and unobserved factors affecting network structure. We show that a network model with discrete unobserved link heterogeneity and binary (or discrete) covariates corresponds to a stochastic blockmodel (SBM). We develop a spectral estimator for the effect of covariates on link probabilities, exploiting the correspondence of SBMs and generalized random dot product graphs (GRDPG). We show that computing our estimator is much faster than standard variational expectation–maximization algorithms and scales well for large networks. Monte Carlo experiments suggest that the estimator performs well under different data generating processes. Our application to Facebook data shows evidence of homophily in gender, role and campus-residence, while allowing us to discover unobserved communities. Finally, we establish asymptotic normality of our estimators.

Acknowledgments

We are grateful to Cong Mu and Jipeng Zhang for excellent research assistance. We thank the editor, associate editor and two referees, Avanti Athreya, Eric Auerbach, Federico Bandi, Stephane Bonhomme, Youngser Park and Eleonora Patacchini for comments and suggestions.

Figure: Screeplots (upper left and center left); estimated latent positions Ŷ (right, only 2 of the 4 dimensions shown per plot); and estimated latent positions X̂, that is, p̂ and q̂ (bottom left, up to orthogonal transformation), for n = 2000.

Notes

1 Recent advances use further approximations and parallelization to improve computational efficiency (Roy, Atchade, and Michailidis 2019; Vu, Hunter, and Schweinberger 2013). We do not pursue such extensions in this article.

2 It must be noted that the support $\mathcal{X}_d$ of $F$ is a subset of $\mathbb{R}^d$ such that $x^{\top} I_{d_1,d_2}\, y \in [0,1]$ for all $x, y \in \mathcal{X}_d$.
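
As a small illustration (not part of the original note), take $d_1 = d_2 = 1$, so that

$$I_{1,1} = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \qquad x^{\top} I_{1,1}\, y = x_1 y_1 - x_2 y_2,$$

and the support restriction requires exactly that these indefinite inner products lie in $[0,1]$, so that they can serve as link probabilities.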

3 Alternatively, we can think of a network where the vectors $X_i$ are drawn from a discrete mixture with point masses at $\nu_1, \dots, \nu_K$, that is,

$$X_i \sim \pi_1\,\delta_{\nu_1} + \pi_2\,\delta_{\nu_2} + \cdots + \pi_K\,\delta_{\nu_K}. \quad (4)$$
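
A minimal simulation sketch of this discrete mixture follows (illustrative only; the values of K, π, and ν below are assumptions, not taken from the article):

import numpy as np

rng = np.random.default_rng(0)
n, K, d = 2000, 3, 2                       # nodes, mixture components, latent dimension
pi = np.array([0.5, 0.3, 0.2])             # mixture weights pi_1, ..., pi_K
nu = rng.uniform(0.1, 0.7, size=(K, d))    # point masses nu_1, ..., nu_K
labels = rng.choice(K, size=n, p=pi)       # component of each node
X = nu[labels]                             # X_i = nu_k with probability pi_k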

4 The extension to multiple binary or discrete covariates is shown later in the article.

5 In the Bernoulli case, $E_{ij}$ is a shifted Bernoulli variable, taking the value $E_{ij} = 1 - P_{ij}$ with probability $P_{ij}$ and $E_{ij} = -P_{ij}$ with probability $1 - P_{ij}$.
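
This shift simply centers the Bernoulli noise at zero:

$$\mathbb{E}[E_{ij}] = (1 - P_{ij})\,P_{ij} + (-P_{ij})(1 - P_{ij}) = 0.$$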

6 A possible alternative is to exploit the overidentification of β in the matrix θ̂Z to develop a minimum distance estimator. We do not pursue this direction here, as we focus on a fully spectral estimator.

7 The notation $n\rho_n = \omega(\sqrt{n})$ means that for any real constant $a > 0$ there exists an integer $n_0 \geq 1$ such that $\rho_n > a/\sqrt{n}$ for every integer $n \geq n_0$.
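
Under this reading of the condition, for example, a sparsity sequence $\rho_n = n^{-1/4}$ qualifies, since

$$n\rho_n = n^{3/4} \quad \text{and} \quad \frac{n^{3/4}}{\sqrt{n}} = n^{1/4} \to \infty,$$

so the expected degree $n\rho_n$ grows faster than $\sqrt{n}$.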

8 We do not compare the spectral estimator to the variational EM estimator, because the latter is too slow for a Monte Carlo with 1000 repetitions, even after parallelizing the execution.

9 The estimation time reported in the table includes the following steps: (a) compute the ASE from the adjacency matrix; (b) compute the matrix of latent positions; (c) cluster the latent positions to recover blocks; (d) compute the matrix B̂Z; (e) cluster the diagonal entries of B̂Z to recover the unobservable block structure; (f) estimate β using the information on the block structure and the entries of B̂Z; (g) compute the simple mean and weighted mean of the estimated β̂ according to (41) and (43). The simulation takes slightly longer because we also need to generate the data and the adjacency matrices for the Monte Carlo. Code is available on GitHub.
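
For concreteness, the following is a minimal sketch of steps (a) through (c) on a toy two-block SBM. It is not the authors' released code: the parameters of the toy graph, the embedding dimension, and the use of scikit-learn's KMeans for the clustering step are all illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def adjacency_spectral_embedding(A, d):
    # (a)-(b) ASE: scale the top-d eigenvectors (by eigenvalue magnitude)
    # of the adjacency matrix to obtain estimated latent positions.
    eigvals, eigvecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(eigvals))[::-1][:d]
    return eigvecs[:, idx] * np.sqrt(np.abs(eigvals[idx]))

# Toy two-block SBM (illustrative parameters only).
rng = np.random.default_rng(1)
z = rng.choice(2, size=200)                                   # true block labels
P = np.where(z[:, None] == z[None, :], 0.30, 0.05)            # within/between-block probabilities
A = np.triu(rng.binomial(1, P), 1)
A = A + A.T                                                   # symmetric adjacency, no self-loops

Xhat = adjacency_spectral_embedding(A, d=2)                   # estimated latent positions
blocks = KMeans(n_clusters=2, n_init=10).fit_predict(Xhat)    # (c) cluster to recover blocks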

10 The entire dataset is available at https://archive.org/details/oxford-2005-facebook-matrix.

11 Roles include students, faculty, staff, and alumni. We focus on students because they were the main users of the platform in 2005.

12 This is a standard procedure in the literature on SBMs (Athreya et al. 2018; Abbe 2018; Roy, Atchade, and Michailidis 2019).

13 Before we proceed to estimation, we regularize the adjacency matrix using the standard method proposed by Le, Levina, and Vershynin (2017). This regularization step avoids numerical issues in the spectral decomposition arising from substantial node degree heterogeneity.
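
As a rough illustration of degree regularization, here is a simplified trimming variant that zeroes out nodes with unusually high degree; it is not necessarily the exact procedure of Le, Levina, and Vershynin (2017), and the threshold of twice the average degree is an assumption.

import numpy as np

def trim_high_degree(A, c=2.0):
    # Zero out rows and columns of nodes whose degree exceeds c times the
    # average degree, a simple form of degree regularization.
    deg = A.sum(axis=1)
    heavy = deg > c * deg.mean()
    A_reg = A.copy()
    A_reg[heavy, :] = 0
    A_reg[:, heavy] = 0
    return A_reg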

14 Multiple methods exist for selecting the embedding dimension in practice, and this remains a topic of current research. In the context of networks, choosing a dimension smaller than the true d will introduce bias in the estimated latent positions; on the other hand, using an embedding dimension larger than the true d will increase the variance of the estimated latent positions. In this tradeoff we prefer to err on the side of overestimating d. Specifically, we choose the value one plus the location of the first elbow in the screeplot.
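
A crude stand-in for this rule is sketched below, using the largest consecutive gap among the leading singular values as the "elbow"; the article's actual elbow detection may differ (e.g., a profile-likelihood criterion), and max_dim is an assumed cap.

import numpy as np

def choose_embedding_dimension(A, max_dim=20):
    # Locate the largest drop among the leading singular values of A and
    # add one, following the "first elbow plus one" rule described in the note.
    svals = np.linalg.svd(A, compute_uv=False)[:max_dim]
    gaps = svals[:-1] - svals[1:]
    elbow = int(np.argmax(gaps)) + 1   # number of singular values kept before the largest drop
    return elbow + 1                   # err on the side of overestimating d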

Additional information

Funding

Funding from the Institute of Data Intensive Engineering and Science (IDIES) at Johns Hopkins University and NSF grant SES-1951005 is gratefully acknowledged. Joshua Cape also gratefully acknowledges support from NSF grant DMS-1902755.
