165
Views
1
CrossRef citations to date
0
Altmetric
Original Articles

Estimating Sizes of Social Networks via Biased Sampling

, , &
Pages 335-359 | Published online: 15 Sep 2014
 

Abstract

This article presents algorithms for estimating the number of users in online social networks. Although such networks sometimes publish such statistics, there are good reasons to validate their reports. The proposed schemes can also estimate the cardinality of network subpopulations. Because this information is seldom voluntarily divulged, such algorithms must operate only by interacting with the social networks’ public Applications Programming Interfaces (APIs). No other external information can be assumed. Due to obvious traffic and privacy concerns, the number of such interactions is severely limited. We therefore focus on minimizing the number of API interactions needed for producing good-sized estimates.

We adopt the standard abstraction of social networks as undirected graphs and perform random walk-based node sampling. By counting the number of collisions or nonunique nodes in the sample, we produce a size estimate. Then we show analytically that the estimate error vanishes with high probability for fewer samples than those required by prior-art algorithms. Moreover, although provably correct for any graph, our algorithms excel when applied to social network-like graphs. The proposed algorithms were evaluated on synthetic and real social networks such as Facebook, IMDB, and DBLP. Our experiments corroborate the theoretical results and demonstrate the effectiveness of the algorithms.

Notes

Moreover, the published statistics do not provide an estimate for connected sub-graphs, e.g., 20-30 year olds living in the US.

Note that online social networks’ public APIs provide lists of connected users for every user. Thus, acting like a neighbor list representation of the graph.

Sampling uniformly from this table is possible in our setting because random walks on graphs sample edges uniformly.

Other names for this method or closely related ones, include capture-recapture, capture-mark-recapture, mark-recapture, and mark-release-recapture.

We make a more general statement later in this paper.

In fact, the authors try to compare between two different search services but their approach is suitable for this task as well.

Compared to the definitions in [CitationKatzir et al. 11] here R includes an extra −r term. This is due to examining instead of . This is numerically insignificant because Ψ1Ψ−1 is typically Θ(r2) but it makes the analysis slightly simpler.

Since the random walk sampling is performed on the full graph, which is assumed to be connected, the algorithm is agnostic to the subgraph connectivity property.

The DBLP database can be found at http://dblp.uni-trier.de/xml/.

The IMDB database can be found at ftp://ftp.fu-berlin.de/pub/misc/movies/database/.

The Facebook uniformly sampled crawl can be found at http://odysseas.calit2.uci. edu/research/.

Log in via your institution

Log in to Taylor & Francis Online

PDF download + Online access
  • 48 hours access to article PDF & online version
  • Article PDF can be downloaded
  • Article PDF can be printed
USD 61.00 Add to cart
* Local tax will be added as applicable

Related Research

People also read lists articles that other readers of this article have read.

Recommended articles lists articles that we recommend and is powered by our AI driven recommendation engine.

Cited by lists all citing articles based on Crossref citations.
Articles with the Crossref icon will open in a new tab.