Abstract
A central challenge in network analysis remains the scarcity of reliable information on existing connection structures. This work explores theoretical and empirical methods of inferring directed networks from node attributes and from functions of these attributes that are computed for connected nodes. We discuss the conditions under which an underlying connection structure can be (probabilistically) recovered, and propose a Bayesian recovery algorithm. In an empirical application, we test the algorithm on data from the European School Survey Project on Alcohol and Other Drugs.
Acknowledgments
Thanks to the ESPAD team for letting us use the data. The United Kingdom ESPAD study was conducted by Professor Martin Plant and Dr Patrick Miller of the University of the West of England, Bristol (UWE). It was mainly funded by the Wates Foundation and UWE. Additional support was provided by the Joseph Rowntree Foundation, the Oakdale Trust, Butcombe Brewery Ltd, Dr. George Carey, the Jack Goldhill Charitable Trust, RJ Lass Charities Ltd, and the North British Distillery Company Ltd. Thanks also to seminar participants at Queen's University Belfast for their useful comments.
Notes
1. "Generically" means that the event for which the claim does not hold has probability zero.
2. The Frobenius norm of an N × T matrix A with entries $a_{it}$ is defined as $\|A\| := \sqrt{\sum_{i=1}^{N} \sum_{t=1}^{T} a_{it}^{2}}$. For example, if A is an adjacency matrix, $\|A\|^{2}$ is the number of directed links in A.
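As a small illustration of the adjacency-matrix remark in note 2, the following Python sketch checks that the squared Frobenius norm of a 0/1 adjacency matrix equals its number of directed links. The 3-node network and the function name `frobenius_norm` are hypothetical and chosen for illustration only; they do not come from the paper.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a matrix given as a list of rows."""
    return math.sqrt(sum(entry * entry for row in matrix for entry in row))


# Hypothetical directed network: adjacency[i][j] == 1 means a link from node i to node j.
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]

# For 0/1 entries each squared entry equals the entry itself,
# so the squared norm counts the directed links.
link_count = sum(sum(row) for row in adjacency)
squared_norm = frobenius_norm(adjacency) ** 2
```

The identity relies on every entry being 0 or 1; for a weighted adjacency matrix the squared norm is the sum of squared weights instead.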
Note. We report, respectively, estimates for exact and for noisy reports and theoretical approximations (in parentheses, reported for T ≲ ln N only). We assume that a noisy report is normally distributed as $\mathcal{N}(m_i x_t, \mathrm{Var}(m_i X_t))$, where $m_i$ and $x_t$ are realizations of the random processes for the i-neighborhood and the t-attribute, respectively.
Note. Rows 1–4 contain, respectively, the school identifier, the number of pupils in the class, the total/average number of links and the number of symmetric links and the p-value of the independence test. Row 5 reports the averages over 10 simulation experiments of the overlap ratios for the data sets (x, y = mx), (x, y = mx + ϵ) and (), respectively, where x () contains the reported (generated) attribute values, and the connection matrix m was randomly drawn from the same prior as used in the network recovery. Rows 6–7 contain the variance estimate and the R-squared goodness-of-fit statistic.
Mean values of 15 attributes for schools in Table 4:
Variances of 15 attributes for schools in Table 4:
3. In other words, the report of agent i on attribute t is distributed as .
4. We minimized (21) over the set of weighted adjacency matrices $M_i = \{\{m_{ij}\},\ j = 1, \dots, N : m_{ij} \in\ \ \&\ m_{ij} \neq m_{ik} > 0\}$.
5. By definition, cannot be any other constant over the entire support of the pdf .