Original Articles

Global Network Inference from Ego Network Samples: Testing a Simulation Approach

Pages 125-162 | Published online: 31 Mar 2015
 

Abstract

Network sampling poses a radical idea: that it is possible to measure global network structure without the full population coverage assumed in most network studies. Network sampling is only useful, however, if a researcher can produce accurate global network estimates. This article explores the practicality of network inference, focusing on the approach introduced in Smith (2012). The method uses sampled ego network data and simulation techniques to make inferences about the global features of the true, unknown network. The validity check here includes more difficult scenarios than previous tests, including those that go beyond the initial scope conditions of the method. I examine networks with a skewed degree distribution and surveys that limit the number of social ties a respondent can list. For each network/survey combination, I take a random ego network sample, run the simulation method, and compare the results to the true values (using measures of connectivity and cohesion). I also test the method on local measures of network structure. The results, on the whole, are encouraging. The method produces good estimates even in cases where the degree distribution is skewed and the survey is strongly restricted. I also find that it is better not to truncate the survey if possible. If the survey must be restricted, the researcher would do well to infer the missing data rather than use the raw data naively.

ACKNOWLEDGMENTS

The author would like to thank Miller McPherson, James Moody, Peter Mucha, Robin Gauthier, and the network working group at Duke for comments on earlier drafts and presentations of this work.

Notes

1. It is important to note that the method is only appropriate for well-defined populations with a sampling frame. The population of interest is thus assumed to be nonhidden (e.g., not female sex workers or drug injectors), and the size of the population is assumed to be known. The method also assumes that the relationship of interest is symmetric, so that if i nominates j, then j nominates i.

2. It can be quite tedious to describe the demographic characteristics of many alters along many demographic dimensions.

3. One can estimate the strength of homophily because one knows the characteristics of both the respondents and the respondents’ alters.
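A minimal sketch of the idea in this note, assuming each ego network is stored as a pair of the respondent's attribute value and the list of values reported for the alters. The data layout and the function name `same_category_share` are hypothetical, and the share of matching ties is only one simple descriptive summary of homophily, not necessarily the estimator used in the article.

```python
def same_category_share(ego_networks):
    """Share of ego-alter ties in which ego and alter have the same attribute value.

    ego_networks: list of (ego_value, [alter_values]) pairs (hypothetical layout).
    """
    matches = 0
    ties = 0
    for ego_value, alter_values in ego_networks:
        for alter_value in alter_values:
            ties += 1
            if alter_value == ego_value:
                matches += 1
    if ties == 0:
        raise ValueError("no ego-alter ties in the sample")
    return matches / ties


# Three illustrative respondents; the last one lists no alters.
sample = [("A", ["A", "A", "B"]), ("B", ["B", "A"]), ("A", [])]
share = same_category_share(sample)  # 3 matching ties out of 5
```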

4. This is largely because ego network data provide biased estimates for many typical triadic measures, such as global transitivity, defined as the proportion of two-step paths where there is also a one-step path (Soffer & Vazquez, 2005; Bansal, Khandelwal, & Meyers, 2009). Thus, for our top-right respondent, there is one tie out of a possible six.
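The definition of global transitivity in this note can be sketched as follows for an undirected network stored as a dictionary of neighbour sets. The function names are hypothetical. The second helper reproduces the "one tie out of a possible six" arithmetic under the assumption that six possible alter-alter ties correspond to four alters, since 4 × 3 / 2 = 6.

```python
from itertools import combinations


def global_transitivity(adj):
    """Proportion of two-step paths a-center-b that are closed by a tie a-b.

    adj: dict mapping each node to the set of its neighbours (undirected).
    """
    two_paths = 0
    closed = 0
    for center, neighbours in adj.items():
        for a, b in combinations(neighbours, 2):
            two_paths += 1
            if b in adj[a]:
                closed += 1
    return closed / two_paths if two_paths else 0.0


def alter_alter_density(n_alters, observed_ties):
    """Observed alter-alter ties divided by the number of possible alter pairs."""
    possible = n_alters * (n_alters - 1) // 2
    return observed_ties / possible if possible else 0.0


# A triangle (1, 2, 3) with one pendant node (4) attached to node 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
toy_transitivity = global_transitivity(toy)  # 3 closed out of 5 two-step paths
one_of_six = alter_alter_density(4, 1)
```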

5. This initial simulation can be done within an ERGM framework or using a stub-based algorithm (Newman, Strogatz, & Watts, 2001; Viger, Latapy, & Wang, 2005).
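As a rough illustration of what a stub-based algorithm does, the sketch below gives each node one "stub" per tie in its target degree, shuffles the stubs, and pairs them off, restarting whenever a self-loop or duplicate tie appears. This is a plain rejection-style stub matching written for illustration. It is not the specific algorithm of the cited papers, and the function name and the `max_tries` parameter are hypothetical.

```python
import random


def stub_matching(degrees, rng, max_tries=1000):
    """Return a set of undirected ties realising the given degree sequence.

    degrees: list where degrees[i] is the target degree of node i.
    rng: a random.Random instance, passed in for reproducibility.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("the sum of degrees must be even")
    stubs = [node for node, degree in enumerate(degrees) for _ in range(degree)]
    for _ in range(max_tries):
        rng.shuffle(stubs)
        edges = set()
        simple = True
        for i in range(0, len(stubs), 2):
            u, v = stubs[i], stubs[i + 1]
            edge = (min(u, v), max(u, v))
            if u == v or edge in edges:  # self-loop or duplicate tie: restart
                simple = False
                break
            edges.add(edge)
        if simple:
            return edges
    raise RuntimeError("no simple network found within max_tries")


target = [2, 2, 1, 1, 2]
edges = stub_matching(target, random.Random(0))
```

Restarting on every collision is easy to reason about but becomes slow for skewed degree sequences, which is one reason more refined stub-based procedures exist.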

6. See Pattison, Robins, Snijders, and Wang (2013) for approaches that do not require the size of the network to be known.

7. It is important to note that the number of people in the simulated network is larger than the number of sampled respondents. This means that a sampled respondent may be seeded multiple times in the simulated network. This is unlikely to cause problems, however, as the network ties are probabilistically determined; thus, nodes in the simulated network with the exact same set of characteristics need not be tied together. Put differently, there is no definitional reason that a node seeded multiple times must be tied to herself. Any nodes with similar characteristics will have a high probability of being tied together. More substantively, it may be the case that many people have the same race and grade in a school, in which case the simulation does not deviate far from the empirical setting. Future work could, however, explicitly deal with this duplication of nodes by modeling how the characteristics go together, rather than simply drawing them from the data themselves.

8. The method calculates a starting value by estimating a dyad-independent ERGM on the ego networks. It is also possible to use terms other than GWESP.

9. A good fit means that the ego network configurations in the simulated network are found at the same rate as in the sampled data.

10. Note that the simulations here are not part of the inferential process; rather, they are used to generate the networks about which inference is made.

11. Specifically, because this is a hypothetical survey, we do not have the respondent's report on alter race, gender, and the alter–alter ties. I thus use information from the actual network as a (perhaps idealistic) proxy for what a respondent would report about the characteristics of their alters. Similarly, the number of alters is the respondent's degree in the true network.

12. Note that even though the top degree is larger in the high skew network, homophily is still quite important in organizing the social structure of the school.

13. Even in the low skew network there could be differences across survey conditions, as many individuals have more than 10 friends. Across all survey conditions, I assume that alter demographic information is only recorded for five alters.

14. Past work has shown that the process of gaining and losing ties will often yield a Poisson distribution for the total number of ties (McPherson, 2009). A negative binomial distribution offers a more flexible form for fitting the data but is still theoretically close to a Poisson distribution, making it an ideal option.

15. The best parameters will generate the observed degree distribution once we collapse all of the values above the truncated value (say 10) into the truncated value. Thus, the model should generate a distribution with the right proportions in each value, including the right proportion above the truncated value.
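One way to read this note is as a censored likelihood: every model probability at or above the truncation point is collapsed into a single category before it is compared with the reported degrees. The sketch below does this for a negative binomial distribution and picks parameters by a crude grid search. The (r, p) parameterisation, the grid search, and all function names are assumptions made for illustration. The note does not say how the article's fitting is actually carried out.

```python
import math


def nb_pmf(k, r, p):
    """Negative binomial probability of k, with size r > 0 and 0 < p < 1."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def collapsed_pmf(r, p, cap):
    """Probabilities for 0..cap, where the last entry holds all mass at or above cap."""
    probs = [nb_pmf(k, r, p) for k in range(cap)]
    probs.append(max(1.0 - sum(probs), 0.0))
    return probs


def collapsed_loglik(reported, r, p, cap):
    """Log-likelihood of reported degrees, each of which is at most cap."""
    probs = collapsed_pmf(r, p, cap)
    return sum(math.log(max(probs[min(k, cap)], 1e-300)) for k in reported)


def grid_fit(reported, cap, r_grid, p_grid):
    """Return the (r, p) pair on the grid with the highest collapsed log-likelihood."""
    candidates = [(r, p) for r in r_grid for p in p_grid]
    return max(candidates, key=lambda rp: collapsed_loglik(reported, rp[0], rp[1], cap))


# Illustrative reported degrees from a survey capped at 10 nominations.
reported = [10, 10, 10, 7, 4, 10, 9, 3, 10, 6]
r_grid = [0.5, 1.0, 2.0, 4.0, 8.0]
p_grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best_r, best_p = grid_fit(reported, 10, r_grid, p_grid)
```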

16. The simulated value is not allowed to fall below the truncated amount. In this way, respondents cannot have values below the number of alters they listed.
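The constraint in this note can be sketched as a draw from the fitted distribution conditional on being at least the truncation point. The example below assumes a negative binomial with hypothetical parameters `r` and `p`, and uses a simple inverse-CDF walk over the upper tail. The function names and the `max_degree` safety bound are illustrative, not taken from the article.

```python
import math
import random


def nb_pmf(k, r, p):
    """Negative binomial probability of k, with size r > 0 and 0 < p < 1."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def draw_at_least(cap, r, p, rng, max_degree=10_000):
    """Draw a degree from the distribution conditional on degree >= cap."""
    tail_mass = 1.0 - sum(nb_pmf(k, r, p) for k in range(cap))
    target = rng.random() * tail_mass
    cumulative = 0.0
    for k in range(cap, max_degree + 1):
        cumulative += nb_pmf(k, r, p)
        if cumulative >= target:
            return k
    return max_degree  # safety bound, reached only through rounding error


def impute_degrees(reported, cap, r, p, rng):
    """Replace degrees reported at the cap with conditional draws; keep the rest."""
    return [draw_at_least(cap, r, p, rng) if d >= cap else d for d in reported]


reported = [10, 3, 10, 7, 10]
imputed = impute_degrees(reported, cap=10, r=2.0, p=0.15, rng=random.Random(1))
```

Respondents who reported fewer ties than the cap keep their reported value, so only the truncated responses are simulated.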

17. The simulated values are restricted to the range of the categorical response.

18. It is important to note, however, that under no truncation the researcher response is not varied, as there is nothing to try to “fill in.”

19. The high degree nodes approach 75 ties, well beyond the number found in strong tie networks for which the method was initially designed.
