Abstract
This article presents algorithms for estimating the number of users in online social networks. Although such networks sometimes publish such statistics, there are good reasons to validate their reports. The proposed schemes can also estimate the cardinality of network subpopulations. Because this information is seldom voluntarily divulged, such algorithms must operate only by interacting with the social networks’ public Applications Programming Interfaces (APIs). No other external information can be assumed. Due to obvious traffic and privacy concerns, the number of such interactions is severely limited. We therefore focus on minimizing the number of API interactions needed for producing good-sized estimates.
We adopt the standard abstraction of social networks as undirected graphs and perform random walk-based node sampling. By counting the number of collisions or nonunique nodes in the sample, we produce a size estimate. Then we show analytically that the estimate error vanishes with high probability for fewer samples than those required by prior-art algorithms. Moreover, although provably correct for any graph, our algorithms excel when applied to social network-like graphs. The proposed algorithms were evaluated on synthetic and real social networks such as Facebook, IMDB, and DBLP. Our experiments corroborate the theoretical results and demonstrate the effectiveness of the algorithms.
Notes
Moreover, the published statistics do not provide an estimate for connected sub-graphs, e.g., 20-30 year olds living in the US.
Note that online social networks’ public APIs provide lists of connected users for every user. Thus, acting like a neighbor list representation of the graph.
Sampling uniformly from this table is possible in our setting because random walks on graphs sample edges uniformly.
Other names for this method or closely related ones, include capture-recapture, capture-mark-recapture, mark-recapture, and mark-release-recapture.
We make a more general statement later in this paper.
In fact, the authors try to compare between two different search services but their approach is suitable for this task as well.
Compared to the definitions in [CitationKatzir et al. 11] here R includes an extra −r term. This is due to examining instead of
. This is numerically insignificant because Ψ1Ψ−1 is typically Θ(r2) but it makes the analysis slightly simpler.
Since the random walk sampling is performed on the full graph, which is assumed to be connected, the algorithm is agnostic to the subgraph connectivity property.
The DBLP database can be found at http://dblp.uni-trier.de/xml/.
The IMDB database can be found at ftp://ftp.fu-berlin.de/pub/misc/movies/database/.
The Facebook uniformly sampled crawl can be found at http://odysseas.calit2.uci. edu/research/.