
A review of silhouette extraction algorithms for use within visual hull pipelines

Guido Ascenso, Moi Hoon Yap, Thomas Allen, Simon S. Choppin & Carl Payton

Pages 649-670 | Received 09 Dec 2019, Accepted 28 Jun 2020, Published online: 17 Jul 2020

ABSTRACT

Markerless motion capture would permit the study of human biomechanics in environments where marker-based systems are impractical, e.g. outdoors or underwater. The visual hull tool may enable such data to be recorded, but it requires the accurate detection of the silhouette of the object in multiple camera views. This paper reviews the top-performing algorithms available to date for silhouette extraction, with the visual hull in mind as the downstream application; the rationale is that higher-quality silhouettes would lead to higher-quality visual hulls, and consequently better measurement of movement. This paper is the first attempt in the literature to compare silhouette extraction algorithms that belong to different fields of Computer Vision, namely background subtraction, semantic segmentation, and multi-view segmentation. It was found that several algorithms exist that would be substantial improvements over the silhouette extraction algorithms traditionally used in visual hull pipelines. In particular, FgSegNet v2 (a background subtraction algorithm), DeepLabv3+ JFT (a semantic segmentation algorithm), and Djelouah 2013 (a multi-view segmentation algorithm) are the most accurate and promising methods for the extraction of silhouettes from 2D images to date, and could seamlessly be integrated within a visual hull pipeline for studies of human movement or biomechanics.
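Although this review contains no code, the shape-from-silhouette reconstruction the abstract refers to can be sketched in a few lines. The following voxel-carving snippet is a minimal illustration under stated assumptions (known 3x4 projection matrices and boolean silhouette masks); all function and variable names are ours, not the authors' pipeline.

```python
import numpy as np

def visual_hull(silhouettes, projections, grid_points):
    """Minimal voxel-carving sketch: keep only the 3D points whose
    projection falls inside the silhouette in *every* camera view.

    silhouettes : list of HxW boolean masks (True = inside the object)
    projections : list of 3x4 camera projection matrices
    grid_points : Nx3 array of candidate voxel centres
    """
    homog = np.hstack([grid_points, np.ones((len(grid_points), 1))])  # Nx4
    inside = np.ones(len(grid_points), dtype=bool)
    for mask, P in zip(silhouettes, projections):
        uvw = homog @ P.T                          # Nx3 homogeneous pixel coords
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        h, w = mask.shape
        valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(grid_points), dtype=bool)
        hit[valid] = mask[v[valid], u[valid]]
        inside &= hit                              # carve away voxels missed by this view
    return grid_points[inside]
```

Because every voxel must survive the test in all views, errors in any single silhouette propagate directly into the reconstruction, which is the rationale the abstract gives for seeking higher-quality silhouette extraction.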

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1. A silhouette is defined as the outer contour of an object. However, most algorithms discussed in this paper give as an output a mask of the object (i.e. a silhouette and the area enclosed by it). This is not an issue for the purposes of shape-from-silhouette tools, as a silhouette or an object mask will behave similarly when used as inputs for the visual hull reconstruction.
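The interchangeability the note describes can be illustrated with OpenCV (version 4 API assumed): extracting the outer contour of a mask and re-filling it recovers the original region, so either representation can feed the reconstruction. A minimal sketch on a toy mask:

```python
import cv2
import numpy as np

# Toy binary mask (filled disc); in practice this would come from a
# background-subtraction or segmentation algorithm.
mask = np.zeros((200, 200), dtype=np.uint8)
cv2.circle(mask, (100, 100), 60, 255, thickness=-1)

# The silhouette in the strict sense is only the outer contour of the mask.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
silhouette = np.zeros_like(mask)
cv2.drawContours(silhouette, contours, -1, 255, thickness=1)

# Re-filling the contour recovers the original mask, which is why the two
# behave similarly as inputs to a visual hull reconstruction.
refilled = np.zeros_like(mask)
cv2.drawContours(refilled, contours, -1, 255, thickness=cv2.FILLED)
print(np.array_equal(refilled, mask))
```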

2. Methods that rely on depth data (RGB-D) will not be covered here, as they cannot be easily applied to the visual hull, which is the main focus of this review paper. For a detailed review of RGB-D methods, please refer to (Camplani et al. Citation2017).

3. In this context, the term ‘traditional’ refers to algorithms that do not employ deep learning.

4. For a thorough review of traditional methods for background subtraction, please refer to the review by Bouwmans et al. (Bouwmans Citation2014).

5. The 200 frames extracted from each video act as a prior from which the model learns the distribution of the frames present in the video. In other words, the model assumes that the contents of the first 200 frames are representative ‘enough’ (where ‘enough’ cannot be easily defined mathematically) of the contents of all frames in the video. In cases where this is not true (for example, videos whose scenery changes significantly, such as a camera that starts in a forest and ends underwater), the model’s ability to generalise from the first 200 frames will be lower. However, this problem can be circumvented by selecting the 200 frames so that all ‘phases’ of the video are represented; in the toy example above, that would mean that some of the 200 frames would come from when the camera was in the forest, and some from when it was underwater (a selection strategy sketched below).
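One simple way to realise this selection, sketched here with OpenCV (the function name and interface are our assumption, not part of any reviewed algorithm), is to spread the training frames evenly over the whole video rather than taking the first 200:

```python
import cv2
import numpy as np

def sample_training_frames(video_path, n_frames=200):
    """Spread the training frames uniformly over the whole video, so that
    every 'phase' of the scene (e.g. forest and underwater) is represented,
    rather than taking only the first n_frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```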

6. Only the ground truth for the frame being analysed is provided to the network during training.

7. The distinction between a GAN and a cGAN is that the generator of a GAN is shown only random noise during the first stages of training, whereas the generator of a cGAN is trained by using the random noise to modify a real input example.
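A minimal PyTorch sketch of this difference in generator inputs; the toy architecture is illustrative only, and conditioning by channel-wise concatenation is one common scheme, not necessarily the one used by the algorithms reviewed:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Toy generator; the layer sizes are illustrative assumptions."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, x):
        return self.net(x)

z = torch.randn(1, 1, 64, 64)      # random noise
image = torch.rand(1, 3, 64, 64)   # a real conditioning example

# Plain GAN: the generator sees random noise alone.
gan_G = Generator(in_channels=1)
fake = gan_G(z)

# cGAN: the noise is combined with a real input example, so the generator
# learns to modify that example rather than synthesise from noise alone.
cgan_G = Generator(in_channels=4)
fake_conditioned = cgan_G(torch.cat([z, image], dim=1))
```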

8. In an encoder-decoder network, the encoder module gradually reduces the spatial dimension and captures higher semantic information, while the decoder module gradually recovers the spatial information and brings the output back to the original size of the input. This kind of network is explained in more detail in Section 3.2.
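A minimal PyTorch sketch of such an encoder-decoder, with illustrative layer sizes (the channel counts and depths are assumptions for the example, not a reviewed architecture):

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Minimal encoder-decoder: the encoder halves the spatial size twice
    while increasing channel depth (capturing higher semantic information);
    the decoder upsamples back to the input resolution."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # H/2 x W/2
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())  # H/4 x W/4
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # H/2
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))              # H x W

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.rand(1, 3, 128, 128)
mask_logits = EncoderDecoder()(x)   # shape: (1, 1, 128, 128), same size as input
```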

9. For more details on this, please refer to (Garcia-Garcia et al. Citation2017).

10. Dilated convolutions are described in detail in Section 3.2.1.

11. The baseline category contains a mixture of mild challenges that belong to the other categories, and therefore is the easiest category for algorithms to analyse.

13. The version of DeepLabv3 pre-trained on the JFT-300M dataset takes the name ‘DeepLabv3-JFT’, while the version pre-trained on the MS-COCO dataset is simply called ‘DeepLabv3’.

14. Information is progressively encoded as the layers get deeper. The first layers therefore contain information that has undergone little encoding, which is consequently described as low-level. An example of a low-level feature is an edge map of the original image, which requires little encoding to obtain.
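For instance, the edge map mentioned in the note can be obtained with a single filtering step, with no deep encoding of the image content; a minimal OpenCV sketch on a synthetic test image:

```python
import cv2
import numpy as np

# Synthetic grayscale test image (a bright square on a dark background).
image = np.zeros((200, 200), dtype=np.uint8)
cv2.rectangle(image, (60, 60), (140, 140), 255, thickness=-1)

# A Sobel gradient magnitude is a typical low-level feature: one shallow
# filtering operation, as opposed to features built up over many layers.
gx = cv2.Sobel(image, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(image, cv2.CV_64F, 0, 1, ksize=3)
edge_map = np.sqrt(gx**2 + gy**2)
```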

15. This review focuses on methods for silhouette extraction that can be applied to markerless motion capture. Because such a silhouette extraction algorithm would only have to deal with a single object in the image (i.e. the human subject being recorded), instance segmentation algorithms, which focus on the segmentation of multiple objects of the same class, were not considered in this review.

16. Although this dataset is more pertinent to algorithms designed for self-driving cars or similar applications, we include it in this review paper because it features human beings as one of its object classes, and because it is such a widely recognised benchmark dataset that many readers will be familiar with it.

Additional information

Notes on contributors

Guido Ascenso

Guido Ascenso is a PhD student at Manchester Metropolitan University. The topic of his PhD is the development of a markerless motion capture system for the biomechanical analysis of swimmers. He obtained an MSc in Sports Biomechanics from Robert Gordon University, Aberdeen, UK, in 2016. His main area of interest is the application of deep learning techniques to biomechanical problems. 

Moi Hoon Yap

Moi Hoon Yap is Reader (Associate Professor) in Computer Vision at Manchester Metropolitan University and a Royal Society Industry Fellow with Image Metrics Ltd. She received her Ph.D. in Computer Science from Loughborough University in 2009. After her Ph.D., she worked as a Postdoctoral Research Assistant (April 2009 - October 2011) in the Centre for Visual Computing at the University of Bradford. She serves as an Associate Editor for the Journal of Open Research Software and as a reviewer for IEEE transactions/journals (Image Processing, Multimedia, Cybernetics, and Biomedical and Health Informatics). Her research expertise is in computer vision and deep learning.

Thomas Allen

Thomas Allen is a Senior Lecturer in Mechanical Engineering at Manchester Metropolitan University. He is a member of the International Sports Engineering Association and Associate Editor for their journal Sports Engineering. His main area of research is the application of experimental mechanics, finite element modelling, and advanced materials to sports equipment and protective clothing. He serves on the BSI Committee PH/3/11 Protective Equipment For Sports Players.

Simon S. Choppin

Simon Choppin is a Senior Research Fellow in the Centre for Sports Engineering Research at Sheffield Hallam University. He obtained a PhD in the analysis of elite tennis player movement from The University of Sheffield in 2008 and an MEng in Mechanical Engineering with Mathematics from The University of Nottingham in 2004. His main research interest is the 3D analysis of human shape for applications in sports and health, and he is a member of the IEEE 3D Body Processing group.

Carl Payton

Carl Payton holds the position of Reader in Biomechanics in the Musculoskeletal Science and Sports Medicine Research Centre at Manchester Metropolitan University, UK. He received his Ph.D in sports biomechanics from the same institution in 1999. His main research area is swimming biomechanics with a particular focus on the use of three-dimensional motion analysis to enhance the performance of elite swimmers.
