Forthcoming Special Issue on: Visual Search and Selective Attention

Modelling attention control using a convolutional neural network designed after the ventral visual pathway

Pages 416-434 | Received 04 Mar 2019, Accepted 12 Aug 2019, Published online: 05 Sep 2019

ABSTRACT

We recently proposed that attention control uses object-category representations consisting of category-consistent features (CCFs), those features occurring frequently and consistently across a category’s exemplars [Yu, C.-P., Maxfield, J. T., & Zelinsky, G. J. (2016). Searching for category-consistent features: A computational approach to understanding visual category representation. Psychological Science, 27(6), 870–884]. Here we used a Convolutional Neural Network (CNN) designed after the primate ventral stream (VsNet) to extract CCFs for 68 object categories spanning a three-level category hierarchy, and evaluated VsNet against the gaze behaviour of people searching for the same categorical targets. We also compared its success in predicting attention control to that of two other CNNs differing in their degree and type of brain inspiration. VsNet not only replicated previous reports of stronger attention guidance to subordinate-level targets but, with its powerful CNN-CCFs, predicted attention control to individual target categories. Moreover, VsNet outperformed the other CNN models tested despite those models having more trainable convolutional filters. We conclude that CCFs extracted from a brain-inspired CNN can predict goal-directed attention control.
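The CCF idea lends itself to a compact illustration. The sketch below is not the authors’ pipeline, only a minimal reading of the selection criterion: given CNN filter responses to many exemplars of one category, rank features by a signal-to-noise ratio (mean response across exemplars divided by its standard deviation across exemplars) and keep the top-ranked ones. The function name, the SNR cutoff, and the NumPy-only setting are illustrative assumptions.

    import numpy as np

    def category_consistent_features(responses, n_keep=None, snr_floor=1.0):
        """Select candidate CCFs from CNN filter responses.

        responses : (n_exemplars, n_features) array of activations, one row
                    per exemplar image of a single object category.
        Returns indices of features that respond both strongly (high mean)
        and consistently (low variability) across the category's exemplars.
        """
        mean = responses.mean(axis=0)
        std = responses.std(axis=0) + 1e-8       # guard against division by zero
        snr = mean / std                          # high mean, low variability
        order = np.argsort(snr)[::-1]             # strongest candidates first
        if n_keep is not None:
            return order[:n_keep]
        return order[snr[order] >= snr_floor]

    # Illustrative use: 50 exemplars, 512 convolutional features.
    rng = np.random.default_rng(0)
    activations = np.abs(rng.normal(size=(50, 512)))
    print(category_consistent_features(activations, n_keep=20))

Ranking by a mean-to-variability ratio is one way to operationalize “frequent and consistent”; the published method may rank or threshold features differently.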

Acknowledgements

Invaluable feedback was provided by the members of the Computer Vision Lab and the Eye Cog Lab at Stony Brook University, and by Dr. Talia Konkle and the members of the Harvard Vision Sciences Lab.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes

1 In referring to “attention control” we draw a distinction between an “attention target”, which we define as the high-level semantic representation specifying an immediate behavioural or cognitive goal (e.g., a designated target category in a search task), and “target features”, which we define as the lower-level visual features representing the attention target in a perceptual input. It therefore follows that “attention control” is the goal-specific biasing of lower-level target features for the purpose of controlling an interaction with the attention target (e.g., directing the fovea to the location of a target goal in a visual input), and a measure of attention control is one that evaluates the success or efficiency in achieving this goal (e.g., the time required to align the fovea with the target). Understanding both the attention target and the target features is essential to understanding attention control and goal-directed behaviour. There could be a top-down attention target reflecting a desire to find a Pekin duck in a pond, but this visual goal cannot be realized unless the features of Pekin ducks have been learned by the visual system and are therefore available for top-down biasing; a toy sketch of such biasing appears below.
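To make the biasing notion concrete, here is a toy priority-map computation, a simplification rather than the paper’s model: each feature map computed from an image is weighted by how diagnostic that feature is of the attention target, and the weighted maps are summed into a priority map whose peak would be the predicted first fixation. All names and the linear weighting scheme are illustrative assumptions.

    import numpy as np

    def priority_map(feature_maps, target_weights):
        """Toy top-down biasing: combine feature maps into a priority map.

        feature_maps   : (n_features, H, W) filter responses to the image.
        target_weights : (n_features,) importance of each feature for the
                         current attention target (e.g., a target category).
        Returns an (H, W) priority map; its argmax is the predicted
        fixation location.
        """
        return (target_weights[:, None, None] * feature_maps).sum(axis=0)

    rng = np.random.default_rng(1)
    maps = rng.random((8, 32, 32))    # 8 feature maps over a 32 x 32 image
    weights = rng.random(8)           # target-specific feature weights
    prio = priority_map(maps, weights)
    y, x = np.unravel_index(prio.argmax(), prio.shape)
    print(f"predicted fixation at ({y}, {x})")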

2 Note that our model quantification treats two filters as equivalent in complexity despite their having different numbers of free parameters. Quantifying model complexity in terms of free parameters is arguably not meaningful with respect to higher-level perceptual and cognitive behaviours, such as categorical search. It essentially places more weight on the number of connections comprising a representation than on the representation itself. A 100 × 100 filter covers more area (and has 9900 more parameters) than a 10 × 10 filter, but the larger filter is not likely to be 10× more predictive than the smaller filter; the information coded by the two simply differs in scale and type. Moreover, applying such a quantification scheme to the visual system would imply that the responses of neurons having large receptive fields are more predictive than the responses of neurons having smaller receptive fields, when the opposite is arguably more likely to be true. Our quantification approach draws an equivalency between filters (the units at each layer) and features (or feature maps, as these are derived by convolution with a filter), which is how complexity has historically been conceptualized in the attention literature; the arithmetic behind the two tallies is spelled out below.
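A few lines make the contrast between the two complexity tallies explicit. This is only a worked version of the footnote’s arithmetic; the variable names are illustrative.

    # Parameter counting weights a filter by its spatial extent...
    params_small = 10 * 10        # 10 x 10 filter: 100 free parameters
    params_large = 100 * 100      # 100 x 100 filter: 10,000 free parameters
    print(params_large - params_small)   # 9900 more parameters, as above
    print(params_large // params_small)  # 100x the parameter count

    # ...whereas the filter-based measure used here counts each filter
    # (equivalently, each feature map) as one unit of complexity.
    complexity_small = 1
    complexity_large = 1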

3 We thank Aude Oliva for this term, which she used in a personal communication at the Vision Sciences Society annual meeting (May 2017); see also Bau, Zhou, Khosla, Oliva, & Torralba, 2017, for a similar sentiment.
