Research Article

Hindustani raga and singer classification using 2D and 3D pose estimation from video recordings

Received 15 Oct 2021, Accepted 10 Mar 2024, Published online: 03 Apr 2024

References

  • Al Ghamdi, M., Zhang, L., & Gotoh, Y. (2012). Spatio-temporal SIFT and its application to human action classification. In A. Fusiello, V. Murino, & R. Cucchiara (Eds.), Computer vision – ECCV 2012: Workshops and demonstrations. Lecture Notes in Computer Science (Vol. 7583, pp. 301–310). Springer. https://doi.org/10.1007/978-3-642-33863-2_30
  • Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., & Sheikh, Y. (2021). OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1), 172–186. https://doi.org/10.1109/TPAMI.2019.2929257
  • Clarke, A., Weinzierl, M., & Li, J. (2021). Pose estimation for raga (v1.0.1). Zenodo. https://doi.org/10.5281/zenodo.5526676
  • Clayton, M. (2007a). Observing entrainment in music performance: Video-based observational analysis of Indian musicians’ tanpura playing and beat marking. Musicae Scientiae, 11(1), 27–59. https://doi.org/10.1177/102986490701100102
  • Clayton, M. (2007b). Time, gesture and attention in a Khyāl performance. Asian Music, 38(2), 71–96. https://doi.org/10.1353/amu.2007.0032
  • Clayton, M., Jakubowski, K., & Eerola, T. (2019). Interpersonal entrainment in Indian instrumental music performance: Synchronization and movement coordination relate to tempo, dynamics, metrical and cadential structure. Musicae Scientiae, 23(3), 304–331. https://doi.org/10.1177/1029864919844809
  • Clayton, M., Jakubowski, K., Eerola, T., Keller, P. E., Camurri, A., Volpe, G., & Alborno, P. (2020). Interpersonal entrainment in music performance: Theory, method and model. Music Perception, 38(2), 136–194. https://doi.org/10.1525/mp.2020.38.2.136
  • Clayton, M., Leante, L., & Tarsitani, S. (2021a, May 14). North Indian raga performance. OSF. https://doi.org/10.17605/OSF.IO/NKJGZ
  • Clayton, M., Li, J., Clarke, A. R., Weinzierl, M., Leante, L., & Tarsitani, S. (2021b, October 14). Hindustani raga and singer classification using pose estimation. OSF. https://doi.org/10.17605/OSF.IO/T5BWA
  • Clayton, M., Rao, P., Shikarpur, N., Roychowdhury, S., & Li, J. (2022). Raga classification from vocal performances using multimodal analysis. In Proceedings of the 23rd International Society for Music Information Retrieval Conference, Bengaluru, India. https://dap-lab.github.io/multimodal-raga-supplementary/
  • Dahl, S., Bevilacqua, F., Bresin, R., Clayton, M., Leante, L., Poggi, I., & Rasamimanana, N. (2009). Gestures in performance. In R. I. Godøy & M. Leman (Eds.), Musical gestures: Sound, movement, and meaning (pp. 36–68). Routledge.
  • Fatone, G. A., Clayton, M., Leante, L., & Rahaim, M. (2011). Imagery, melody and gesture in cross-cultural perspective. In A. Gritten & E. King (Eds.), New perspectives on music and gesture (pp. 203–220). Ashgate.
  • Godøy, R. I., & Leman, M. (Eds.). (2009). Musical gestures: Sound, movement, and meaning. Routledge.
  • Goldin-Meadow, S. (2003). Hearing gesture: How our hands help us think. Harvard University Press.
  • Gritten, A., & King, E. (Eds.). (2011). New perspectives on music and gesture. Ashgate.
  • Jakubowski, K., Eerola, T., Alborno, P., Volpe, G., Camurri, A., & Clayton, M. (2017). Extracting coarse body movements from video in music performance: A comparison of automated computer vision techniques with motion capture data. Frontiers in Digital Humanities, 4, 9. https://doi.org/10.3389/fdigh.2017.00009
  • Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231. https://doi.org/10.1109/TPAMI.2012.59
  • Kendon, A. (2004). Gesture: Visible action as utterance. Cambridge University Press.
  • Leante, L. (2009). The lotus and the king: Imagery, gesture and meaning in a Hindustani Rāg. Ethnomusicology Forum, 18(2), 185–206. https://doi.org/10.1080/17411910903141874
  • Leante, L. (2013a). Gesture and imagery in music performance: Perspectives from North Indian classical music. In T. Shephard & A. Leonard (Eds.), The Routledge companion to music and visual culture (pp. 145–152). Routledge.
  • Leante, L. (2013b). Imagery, movement and listeners’ construction of meaning in North Indian classical music. In M. Clayton, B. Dueck, & L. Leante (Eds.), Experience and meaning in music performance (pp. 161–187). Oxford University Press.
  • Leante, L. (2018). The cuckoo’s song: Imagery and movement in monsoon ragas. In I. Rajamani, M. Pernau, & K. R. Butler Schofield (Eds.), Monsoon feelings: A history of emotions in the rain (pp. 255–290). Niyogi Books.
  • Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 661–670). https://doi.org/10.1145/2623330.2623612
  • Liu, M., Liu, H., & Chen, C. (2017). Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68, 346–362. https://doi.org/10.1016/j.patcog.2017.02.030
  • Liu, Z., Zhang, H., Chen, Z., Wang, Z., & Ouyang, W. (2020). Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 143–152). https://openaccess.thecvf.com/content_CVPR_2020/papers/Liu_Disentangling_and_Unifying_Graph_Convolutions_for_Skeleton-Based_Action_Recognition_CVPR_2020_paper.pdf
  • van der Maaten, L., & Hinton, G. E. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
  • McNeill, D. (1992). Hand and mind: What gestures reveal about thought. University of Chicago Press.
  • McNeill, D. (2005). Gesture and thought. University of Chicago Press.
  • Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Sridhar, S., Pons-Moll, G., & Theobalt, C. (2018). Single-shot multi-person 3D pose estimation from monocular RGB. In 2018 International Conference on 3D Vision (pp. 120–130). https://arxiv.org/abs/1712.03453v3
  • Moran, N. (2013). Social co-regulation and communication in North Indian duo performances. In M. Clayton, B. Dueck, & L. Leante (Eds.), Experience and meaning in music performance (pp. 64–94). Oxford University Press.
  • Paschalidou, S., & Clayton, M. (2015). Towards a sound-gesture analysis in Hindustani Dhrupad vocal music: Effort and raga space. In International Conference on the Multimodal Experience of Music (ICMEM), Sheffield. https://www.researchgate.net/publication/312029966_Towards_a_sound-gesture_analysis_in_Hindustani_Dhrupad_vocal_music_effort_and_raga_space
  • Paschalidou, S., Eerola, T., & Clayton, M. (2016). Voice and movement as predictors of gesture types and physical effort in virtual object interactions of classical Indian singing. In Proceedings of the 3rd International Symposium on Movement and Computing (MOCO '16) (Article 45, pp. 1–2). Association for Computing Machinery. https://doi.org/10.1145/2948910.2948914
  • Pearson, L. (2013). Gesture and the sonic event in Karnatak music. Empirical Musicology Review, 8(1), 2–14. https://doi.org/10.18061/emr.v8i1.3918
  • Pearson, L., & Pouw, W. (2022). Gesture–vocal coupling in Karnatak music performance: A neuro-bodily distributed aesthetic entanglement. Annals of the New York Academy of Sciences, 1515(1), 219–236. https://doi.org/10.1111/nyas.14806
  • Potempski, F., Sabo, A., & Patterson, K. K. (2021). Technical note: Quantifying music-dance synchrony with the application of a deep learning-based 2D pose estimator. bioRxiv 2020.10.09.333617. https://doi.org/10.1101/2020.10.09.333617
  • Rahaim, M. (2012). Musicking bodies: Gesture and voice in Hindustani music. Wesleyan University Press.
  • Savitzky, A., & Golay, M. J. E. (1964). Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8), 1627–1639. https://doi.org/10.1021/ac60214a047
  • Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. arXiv preprint. https://arxiv.org/abs/1406.2199
  • Stehman, S. V. (1997). Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, 62(1), 77–89. https://doi.org/10.1016/S0034-4257(97)00083-7
  • Tao, Y., & Papadias, D. (2006). Maintaining sliding window skylines on data streams. IEEE Transactions on Knowledge and Data Engineering, 18(3), 377–391. https://doi.org/10.1109/TKDE.2006.48
  • Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4489–4497). https://doi.org/10.1109/ICCV.2015.510
  • Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-18). https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17135
  • Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., & Zheng, N. (2017). View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2117–2126). https://arxiv.org/abs/1703.08274