
Multi-Cue Gate-Shift Networks for Mouse Behavior Recognition

Article: 2151680 | Received 23 Jun 2022, Accepted 18 Nov 2022, Published online: 02 Dec 2022

References

  • Burgos-Artizzu, X. P., P. Dollár, D. Lin, D. J. Anderson, and P. Perona (2012). Social behavior recognition in continuous video. In 2012 IEEE conference on computer vision and pattern recognition, 3964–3981. Providence, Rhode Island: IEEE.
  • Carreira, J., and A. Zisserman (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, USA, 6299–308.
  • Chaaraoui, A., J. Padilla-Lopez, and F. Florez-Revuelta (2013). Fusion of skeletal and silhouette-based features for human action recognition with rgb-d devices. In Proceedings of the IEEE international conference on computer vision workshops, Sydney, Australia, 91–97.
  • Chen, C., R. Jafari, and N. Kehtarnavaz. 2014. Improving human action recognition using fusion of depth camera and inertial sensors. IEEE Transactions on Human-Machine Systems 45 (1):51–61. doi:10.1109/THMS.2014.2362520.
  • Dollár, P., V. Rabaud, G. Cottrell, and S. Belongie (2005). Behavior recognition via sparse spatio-temporal features. In 2005 IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, 65–72. Beijing, China: IEEE.
  • Feichtenhofer, C. (2020). X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Seattle, USA, 203–13.
  • Feichtenhofer, C., H. Fan, J. Malik, and K. He (2019). Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, Seoul, Korea, 6202–11.
  • Feichtenhofer, C., A. Pinz, and R. P. Wildes (2017). Spatiotemporal residual networks for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, USA, 7445–54. doi:10.1109/CVPR.2017.787.
  • Feichtenhofer, C., A. Pinz, and A. Zisserman (2016). Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, USA, 1933–41.
  • Hara, K., H. Kataoka, and Y. Satoh (2018). Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, USA, 6546–55.
  • He, K., X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, USA, 770–78.
  • Ioffe, S., and C. Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 448–56. Lille, France: PMLR.
  • Jhuang, H., E. Garrote, X. Yu, V. Khilnani, T. Poggio, A. D. Steele, and T. Serre. 2010. Automated home-cage behavioural phenotyping of mice. Nature Communications 1 (1):1–10. doi:10.1038/ncomms1064.
  • Jiang, Z., D. Crookes, B. D. Green, Y. Zhao, H. Ma, L. Li, S. Zhang, D. Tao, and H. Zhou. 2018. Context-aware mouse behavior recognition using hidden Markov models. IEEE Transactions on Image Processing 28 (3):1133–48. doi:10.1109/TIP.2018.2875335.
  • Karpathy, A., G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, Columbus, USA, 1725–32.
  • Kramida, G., Y. Aloimonos, C. M. Parameshwara, C. Fermüller, N. A. Francis, and P. Kanold (2016). Automated mouse behavior recognition using vgg features and lstm networks. In Proceedings of the Visual Observation and Analysis of Vertebrate and Insect Behavior Workshop (VAIB), Cancun, Mexico, 1–3.
  • Li, Y., B. Ji, X. Shi, J. Zhang, B. Kang, and L. Wang (2020). Tea: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Seattle, USA, 909–18.
  • Li, M., H. Leung, and H. P. Shum (2016). Human action recognition via skeletal and depth based feature fusion. In Proceedings of the 9th international conference on motion in Games, Burlingame, California, 123–32.
  • Liu, S., D. Huang, and Y. Wang. 2019. Learning spatial fusion for single-shot object detection. arXiv e-prints arXiv:1911.09516.
  • Liu, Z., L. Wang, W. Wu, C. Qian, and T. Lu (2021). Tam: Temporal adaptive module for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, Montreal, Canada, 13708–18.
  • Li, H., Z. Wu, A. Shrivastava, and L. S. Davis (2021). 2d or not 2d? adaptive 3d convolution selection for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Virtual, 6155–64.
  • Nguyen, N. G., D. Phan, F. R. Lumbanraja, M. R. Faisal, B. Abapihi, B. Purnama, M. K. Delimayanti, K. R. Mahmudah, M. Kubo, and K. Satou. 2019. Applying deep learning models to mouse behavior recognition. Journal of Biomedical Science and Engineering 12 (02):183–96. doi:10.4236/jbise.2019.122012.
  • Qiu, Z., T. Yao, and T. Mei (2017). Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE international conference on computer vision, Venice, Italy, 5533–41.
  • Sanchez-Riera, J., K.-L. Hua, Y.-S. Hsiao, T. Lim, S. C. Hidayati, and W.-H. Cheng. 2016. A comparative study of data fusion for rgb-d based visual recognition. Pattern Recognition Letters 73:1–6. doi:10.1016/j.patrec.2015.12.006.
  • Shi, L., Y. Zhang, J. Cheng, and H. Lu (2019). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, USA, 12026–35.
  • Simonyan, K., and A. Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems 27.
  • Stroud, J., D. Ross, C. Sun, J. Deng, and R. Sukthankar (2020). D3d: Distilled 3d networks for video action recognition. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, Snowmass Village, USA, 625–34.
  • Sudhakaran, S., S. Escalera, and O. Lanz (2020). Gate-shift networks for video action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Seattle, USA, 1102–11.
  • Tran, D., L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, Santiago, Chile, 4489–97.
  • Tran, D., H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, USA, 6450–59.
  • Wang, Y., W. Huang, F. Sun, T. Xu, Y. Rong, and J. Huang. 2020. Deep multimodal fusion by channel exchanging. Advances in Neural Information Processing Systems 33:4835–45.
  • Wang, L., Y. Qiao, and X. Tang (2015). Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, USA, 4305–14.
  • Wang, Z., Q. She, and A. Smolic (2021). Action-net: Multipath excitation for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Virtual, 13214–23.
  • Wang, L., Z. Tong, B. Ji, and G. Wu (2021). Tdn: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Virtual, 1895–904.
  • Wang, G., K. Wang, and L. Lin (2019). Adaptively connected neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, USA, 1781–90.
  • Wang, L., Y. Xiong, Z. Wang, and Y. Qiao. 2015. Towards good practices for very deep two-stream convnets. arXiv e-prints arXiv:1507.02159.
  • Wang, L., Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, 20–36. Amsterdam, Netherlands: Springer.
  • Wu, Z., H. Li, C. Xiong, Y.-G. Jiang, and L. S. Davis. 2020. A dynamic frame selection framework for fast video recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence PP (99).
  • Xie, S., C. Sun, J. Huang, Z. Tu, and K. Murphy (2018). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 305–21.
  • Zhang, Z., Y. Yang, and Z. Wu (2019). Social behavior recognition in mouse video using agent embedding and lstm modelling. In Chinese conference on pattern recognition and computer vision (PRCV), 530–41. Xi'an, China: Springer.
  • Zhang, C., Y. Zou, G. Chen, and L. Gan. 2020. Pan: Towards fast action recognition via learning persistence of appearance. arXiv e-prints arXiv:2008.03462.
  • Zhu, X., Y. Xiong, J. Dai, L. Yuan, and Y. Wei (2017). Deep feature flow for video recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, USA, 2349–58.