Abstract
The top-down guidance of visual attention is one of the main factors allowing humans to effectively process vast amounts of incoming visual information. Nevertheless, we still lack a full understanding of the visual, semantic, and memory processes governing visual attention. In this paper, we present a computational model of visual search capable of predicting the most likely positions of target objects. The model does not require a separate training phase, but learns likely target positions in an incremental fashion based on a memory of previous fixations. We evaluate the model on two search tasks and show that it outperforms saliency alone and comes close to the maximal performance of the Contextual Guidance Model (CGM; Torralba, Oliva, Castelhano, & Henderson, 2006; Ehinger, Hidalgo-Sotelo, Torralba, & Oliva, 2009), even though our model does not perform scene recognition or compute global image statistics. The search performance of our model can be further improved by combining it with the CGM.
Acknowledgments
The support of the European Research Council under award number 203427 “Synchronous Linguistic and Visual Processing” is gratefully acknowledged.
We would like to thank Moreno I. Coco for sharing his data, and for numerous comments and suggestions regarding this work. We are also grateful to the authors of Ehinger, Hidalgo-Sotelo, Torralba, and Oliva (2009) and Torralba, Oliva, Castelhano, and Henderson (2006) for sharing the image corpora and eye-tracking data used in their studies.
A preliminary version of the study reported in this paper has been published as Dziemianko, Keller, and Coco (2011).
Notes
1This approach does not require a separate training phase, which makes it more adaptable to different data sets, tasks, and experimental conditions.
2The histograms of target positions (see below) suggest that the distribution of target locations is slightly bimodal, so a modest improvement may result from employing a mixture of Gaussians instead of a single Gaussian.
3For the visual search data, the mean size of an object is 0.93° visual angle horizontally and 1.92° visual angle vertically. For the visual count data, the mean size is 1.77° horizontally and 3.90° vertically.
4Thresholding works by selecting the points with the highest model values until the threshold is reached. For example, a threshold of 10% on a saliency map means that we select the points with the highest saliency until we have selected 10% of the image. We then count how many of the fixations fall within these 10%. If we select 100% of the image, we trivially predict all fixations correctly.
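The thresholding procedure described in this note could be sketched as follows. This is a hypothetical illustration, not the authors' implementation; the saliency map, fixation coordinates, and function name are made up for the example.

```python
import numpy as np

def fixations_predicted(model_map, fixations, threshold=0.10):
    """Fraction of fixations landing in the top `threshold` fraction
    of the map's pixels, selected in order of decreasing model value."""
    n_selected = int(round(threshold * model_map.size))
    # Flat indices of the n_selected highest-valued pixels.
    top_indices = np.argsort(model_map, axis=None)[::-1][:n_selected]
    selected = np.zeros(model_map.size, dtype=bool)
    selected[top_indices] = True
    selected = selected.reshape(model_map.shape)
    hits = sum(selected[row, col] for (row, col) in fixations)
    return hits / len(fixations)

rng = np.random.default_rng(0)
saliency = rng.random((60, 80))             # toy 60x80 saliency map
fixations = [(10, 20), (30, 40), (55, 70)]  # toy (row, col) fixations

# At a threshold of 100%, every pixel is selected, so all fixations
# are trivially predicted, as the note observes.
print(fixations_predicted(saliency, fixations, threshold=1.0))  # → 1.0
```

At intermediate thresholds the returned fraction traces out the performance curve used to compare models: a better model concentrates fixations within a smaller selected area.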
5Ehinger et al. (2009) designed their stimuli as follows: “For the target-present images, targets were spatially distributed across the image periphery (target locations ranged from 2.7° to 13° from the screen centre; median eccentricity was 8.6°), and were located in each quadrant of the screen with approximately equal frequency” (p. 950). The fact that the authors deliberately placed the target at the screen periphery explains the bimodality of horizontal positions in (bottom panel). There is only a weak bimodality in vertical positions in (bottom panel), which is probably due to the fact that their target objects (which were always pedestrians) show a central bias vertically, which presumably counteracts the peripheral bias in the stimulus design.