Bioacoustics
The International Journal of Animal Sound and its Recording
Research Article

Extracting dolphin whistles in complex acoustic scenarios: a case study in the Bay of Biscay

Received 08 Jun 2023, Accepted 25 Mar 2024, Published online: 15 May 2024

ABSTRACT

Accurate whistle contour extraction is crucial in many dolphin behavioural studies. Traditionally, whistle contour extraction involves a first step of finding whistle candidates by peak-level detection in the time-frequency domain, followed by a determination of when peaks are close enough to each other to be part of the same whistle contour. In complex scenarios, such as those with a large number of individuals vocalising simultaneously or those with a sudden increase in background noise, peak-level detection may not provide a number of accurate whistle candidates that is large enough to extract the whistle contour or to disambiguate individual whistles when they cross one another. In these adverse scenarios, a different approach, based on the pyknogram representation, can produce a more accurate detection of whistle candidates and evenly distributed candidates throughout the duration of the whistle. This work compares the peak-level extraction approach of the spectrogram with the point-density extraction approach of the pyknogram. We propose a technique that combines estimates of the central frequency and bandwidth to extract whistle candidates in adverse scenarios. The method has been successfully used for the vocalisation extraction of dolphins in the Bay of Biscay (Spain) using a database of more than 2000 dolphin whistles.

1. Introduction

Over the last few years, several studies have worked on methods for extracting marine mammal calls (whistles, moans, and other tonal sounds) when performing passive acoustic monitoring. In the majority of cases, this has been achieved by automatically tracking these animal sounds in the time-frequency plane following the contour ridges or by peak-level detection. Some approaches found in the literature include: detecting all local maxima and fitting a curve through the peaks at successive time slices (Mallawaarachchi et al. 2008; Mellinger et al. 2011); using image processing techniques to track the spectral ridges (Kershenbaum and Roch 2013); using particle filters to find estimates of the posterior distribution (that of the estimated contour given the spectral peaks) (White and Hadley 2008; Roch et al. 2011); using an adaptive notch filter to minimise the output by placing notches at the whistle peaks (Johansson and White 2011); and using the probability hypothesis density filter (Gruden and White 2020) as an approximation to the optimal Bayesian filter. These approaches entail, either explicitly or implicitly, two stages: detecting the whistle candidates and then extracting the whistle by joining the candidates using different criteria. In this work, we use ‘candidates’ to refer to the set of peaks, pixels, high-density regions, or any other set of detected points that might potentially be part of a whistle contour. In all of the methods mentioned above, the whistle extraction stage works well when there is an accurate and even distribution of the candidates. However, these approaches behave very differently when some candidates are not properly detected. This may happen under adverse conditions such as low Signal-to-Noise Ratio (SNR) whistles (Mallawaarachchi et al. 2008), quick changes in the noise floor, or overlapping whistles from several animals vocalising simultaneously (Roch et al. 2011).

In 1996, Potamianos and Maragos (1996) introduced a new time-frequency representation named the pyknogram (from the Greek word ‘pykno’ = dense). This representation proved to be especially appropriate for the extraction of formants in speech signals. The technique clearly displays the formant position and bandwidth with high- and low-density regions (see Figure 1 for an example of how the pyknogram compares to the spectrogram).

Figure 1. (a) Spectrogram computed using a Hamming window of length 10.7 ms (which results in a 93.5 Hz frequency bin resolution) and (b) pyknogram computed using a filter bank of 1 kHz bandwidth with 50% frequency overlap (which results in a 500 Hz frequency bin resolution) of an underwater recording containing multiple dolphin whistles.

Conceptually speaking, the pyknogram was devised to exploit the fact that speech production can be approximated by a sum of AM-FM models, one for each formant. It makes use of non-linear methods, such as the Teager-Kaiser energy operator, to track the instantaneous frequency of each component. More recent works have shown how the pyknogram can help in different speech-related problems: speaker verification (Vijayan et al. 2016); speaker verification in overlapped scenarios, in which it achieved a relative 20% improvement across different signal-to-interference ratios (Shokouhi and Hansen 2017); and identification of segments of overlapping speech in co-channel recordings (Yousefi et al. 2018). In all of these works, the pyknogram behaved consistently well in challenging scenarios.

Cetacean whistles and moans can also be approximated using AM-FM models, and thus the pyknogram might work well for the contour extraction of these tonal sounds. In a related line of work, C. Ioana (Cornel et al. 2010) showed how, in spite of crossings or noise interference, extracting the instantaneous frequency and phase provided superior accuracy when following time-frequency variations. This is further evidence that a thorough study of the pyknogram, which is also based on the instantaneous frequency, might provide some benefits when trying to extract whistle contours in complex acoustic scenarios.

In this work, we evaluate how the pyknogram compares to the spectrogram when extracting tonal information or whistle candidates that could later be used for whistle contour extraction. As a case study, this has been applied to extract the whistles of bottlenose dolphins (Tursiops truncatus), striped dolphins (Stenella coeruleoalba) and common dolphins (Delphinus delphis) in the Bay of Biscay (on the northern coast of Spain). The rest of this work is structured as follows. In Subsection 2.1, we formulate the pyknogram equations to be used in discrete passive acoustic monitoring recordings. We also propose a new paradigm for whistle candidate detection based on the pyknogram representation. This new technique is later compared with a well-known whistle candidate detection method based on the spectrogram, which is summarised in Subsection 2.2. In Subsection 2.3, we explain how the whistle candidates obtained using the spectrogram and pyknogram are used to extract the whistle contours using the GM-PHD method (Gruden and White 2016). The results include assessing the accuracy of both techniques under adverse simulated scenarios (Subsections 3.1 and 3.2) as well as in a set of challenging real-world recordings from an acoustic campaign conducted in the Bay of Biscay (Subsection 3.3). The selected recordings were taken in an area with a high density of marine mammals, and they contain multiple bottlenose, striped, and common dolphins vocalising simultaneously as well as other interfering noises. We conclude the work in Section 4 by discussing the possibilities and limitations of the pyknogram as an alternative to the spectrogram for whistle extraction in passive acoustic monitoring.

2. Methods

2.1. Pyknogram-based whistle candidate detection

Let us assume a signal $x(t)$ composed of $N$ whistles $w_k(t)$, $k=\{1,2,\ldots,N\}$, with an arbitrary amount of additive pink noise $\eta(t)$, as shown in Equation (1):

$$x(t)=\sum_{k=1}^{N} w_k(t)+\eta(t). \tag{1}$$

We used pink noise because its power spectral density decreases proportionally to the inverse of the frequency, as happens with underwater ambient noise. Each one of the whistles can be approximated with an AM-FM model, as seen in Equation (2):

$$w_k(t)=a_k(t)\cos\left(2\pi\int_{0}^{t} f_k(\tau)\,d\tau+\theta_k\right), \tag{2}$$

where $a_k(t)$, $f_k(t)$, and $\theta_k$ are, respectively, the instantaneous amplitude, the instantaneous frequency, and the initial phase of whistle $k$. The equations for computing the pyknogram for a discrete signal $x(n)$, of time index $n$, can be easily obtained by discretisation of the equations presented in (Potamianos and Maragos 1996) with a sampling frequency $f_s=1/T_s$, $T_s$ being the sampling period. In the following steps, we assume that the total number of samples of the discretised $x(n)$ is equal to $MQ$, where $M$ is the number of samples of the analysis frame and $Q$ is an integer value.

  1. Use a Gabor filter bank (Gabor 1946) to decompose the broadband signal $x(n)$ into a collection of narrowband signals $x_i(n)$. We can create a discrete version of the Gabor filter by sampling the continuous version. The impulse response is thus a discrete Gaussian-modulated sinusoid given by Equation (3):

    $$h_i(n)=\exp\left(-\alpha^2 (n/f_s)^2\right)\cos\left(2\pi\nu_i n/f_s\right), \tag{3}$$

    where $\nu_i$, $i=[1,2,\ldots,I]$, is the centre frequency of the filter, with $I$ being the total number of bands and $\alpha$ being the bandwidth parameter (the effective rms bandwidth is approximated by $BW_{Gabor}=\alpha/\sqrt{2\pi}$ (Maragos et al. 2002)). Although there are different alternatives for separating the broadband signal into narrowband signals, Gabor’s method is the simplest one (Hsu et al. 2011) and provides accurate instantaneous amplitude and frequency estimates (Delprat et al. 1992), even when compared with some other abrupt filter techniques.

    In this work, a bandwidth of $BW_{Gabor}=1$ kHz for each Gabor filter was used. The filter bank covered from 3000 to 22,000 Hz with a bandwidth overlapping factor of $\Delta W=50\%$. The frequencies covered by the filter bank were selected in accordance with the frequency range of the three dolphin species studied. The overlapping factor $\Delta W$ was empirically chosen as a trade-off between computational complexity and a frequency resolution that was high enough to allow the whistle contour to be reconstructed.

  2. Estimate the Instantaneous Amplitude (IA) envelope $a_i(n)$ and the Instantaneous Frequency (IF) $f_i(n)$ in each one of the filter bank bands. To do this, the Hilbert Transform Demodulation (HTD) was used. Even though the HTD has a higher computational complexity than other techniques, such as the Energy Separation Algorithm (ESA), the HTD provides smoother bandwidth estimates (Potamianos and Maragos 1996) and therefore less variance. The process to obtain the IA and the IF involves computing the Hilbert transform $H[\cdot]$ of each one of the narrowband signals, as shown in Equations (4), (5), and (6):

    $$a_i(n)=\sqrt{x_i(n)^2+\left(H[x_i(n)]\right)^2} \tag{4}$$
    $$\theta_i(n)=\tan^{-1}\left(\frac{H[x_i(n)]}{x_i(n)}\right) \tag{5}$$
    $$f_i(n)=\frac{f_s}{2\pi}\,\delta_1[\theta_i](n), \tag{6}$$

    where $\delta_1[\theta_i](n)$ is the central difference estimate of $\theta_i$, which is computed as follows:

    $$\delta_1[\theta_i](n)=\left(\theta_i(n+1)-\theta_i(n-1)\right)/2. \tag{7}$$

  3. Obtain the short-time estimates of the central frequency $F_i(n_0)$ and the bandwidth $B_i^2(n_0)$ of a whistle candidate. In order to obtain a more natural estimate, weighted moment estimates of the IF and IA were used, as in (Potamianos and Maragos 1996):

    $$F_i(n_0)=\frac{\sum_{n=n_0}^{n_0+M-1} f_i(n)\,a_i(n)^2}{\sum_{n=n_0}^{n_0+M-1} a_i(n)^2} \tag{8}$$
    $$B_i^2(n_0)=\frac{\sum_{n=n_0}^{n_0+M-1}\left[\left(\delta_1[a_i](n)/2\pi\right)^2+\left(f_i(n)-F_i(n_0)\right)^2 a_i(n)^2\right]}{\sum_{n=n_0}^{n_0+M-1} a_i(n)^2}, \tag{9}$$

    where $\delta_1[a_i](n)$ is the central difference estimate of $a_i$, which is computed as in Equation (7). The values $n_0$ and $M$ are the start sample and the number of samples of the analysis frame, respectively. For example, for a 50% overlap, the time frames will start at $n_0=[0, M/2, M, \ldots, (Q-1)\times M]$.
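
Steps 2 and 3 can be illustrated with a minimal sketch for a single narrowband signal. This is not the authors' implementation: the function names are hypothetical, the analytic signal is obtained with a naive DFT (adequate only for short frames), and the phase difference is taken on wrapped one-sample increments, which is an assumption added here to avoid 2π jumps. The bandwidth term follows Equation (9) as written above.

```python
import cmath
import math


def analytic_signal(x):
    """Return x + j*H[x] using a naive O(N^2) DFT (illustration only)."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(1, n):
        if n % 2 == 0 and k == n // 2:
            continue  # the Nyquist bin is kept unchanged
        # Double the positive frequencies and zero the negative ones.
        spectrum[k] = 2 * spectrum[k] if k < (n + 1) // 2 else 0.0
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def demodulate(x, fs):
    """Instantaneous amplitude (Eq. 4) and frequency (Eqs. 5-7) of x."""
    z = analytic_signal(x)
    ia = [abs(s) for s in z]
    # Wrapped phase increments theta(n+1) - theta(n); assumption of this sketch.
    steps = [cmath.phase(z[n + 1] * z[n].conjugate()) for n in range(len(z) - 1)]
    inst_f = []
    for n in range(len(z)):
        prev_step = steps[max(n - 1, 0)]
        next_step = steps[min(n, len(steps) - 1)]
        # Central difference (Eq. 7), one-sided at the two end samples.
        inst_f.append(fs / (2 * math.pi) * (prev_step + next_step) / 2)
    return ia, inst_f


def short_time_moments(ia, inst_f, n0, m):
    """Weighted central frequency (Eq. 8) and bandwidth (Eq. 9) of one frame."""
    frame = range(n0, n0 + m)
    energy = sum(ia[n] ** 2 for n in frame)
    centre = sum(inst_f[n] * ia[n] ** 2 for n in frame) / energy

    def d_ia(n):
        lo, hi = max(n - 1, 0), min(n + 1, len(ia) - 1)
        return (ia[hi] - ia[lo]) / (hi - lo)

    bw2 = sum((d_ia(n) / (2 * math.pi)) ** 2 + (inst_f[n] - centre) ** 2 * ia[n] ** 2
              for n in frame) / energy
    return centre, bw2
```

For a pure 1 kHz tone sampled at 8 kHz, `short_time_moments` returns a central frequency of about 1000 Hz and a bandwidth close to zero, which is the behaviour expected of a narrow whistle component. A frame with zero energy is not handled in this sketch.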

Each one of the estimates of the central frequency $F_i(n_0)$ at time frame $n_0$ and band $f_i$ is a pyknogram point. The scatter plot of $F_i(n_0)$ is known as the pyknogram representation (see Panel (b) of Figure 1).
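
The bands that index the pyknogram points come from the Gabor filter bank of step 1, whose layout can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the band edges 3000 Hz and 22,000 Hz are assumed to be centre frequencies, the rms bandwidth relation is taken as $BW_{Gabor}=\alpha/\sqrt{2\pi}$, and the truncation tolerance of the Gaussian envelope is an arbitrary choice.

```python
import math


def gabor_centres(f_lo=3000.0, f_hi=22000.0, bw=1000.0, overlap_pct=50.0):
    """Centre frequencies spaced by bw * (1 - overlap), i.e. 500 Hz here."""
    step = bw * (1 - overlap_pct / 100)
    count = int(round((f_hi - f_lo) / step)) + 1
    return [f_lo + i * step for i in range(count)]


def gabor_impulse_response(nu, fs, bw=1000.0, tol=1e-6):
    """Sampled Gabor filter of Eq. (3), truncated where the envelope drops below tol."""
    alpha = bw * math.sqrt(2 * math.pi)  # assumes BW = alpha / sqrt(2*pi)
    half_len = math.ceil(fs * math.sqrt(-math.log(tol)) / alpha)
    return [math.exp(-(alpha * n / fs) ** 2) * math.cos(2 * math.pi * nu * n / fs)
            for n in range(-half_len, half_len + 1)]
```

Under these assumptions the bank has 39 bands, and each narrowband signal $x_i(n)$ would be obtained by convolving $x(n)$ with the corresponding impulse response.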

Whistle candidates can be extracted from the pyknogram by taking into account that the presence of a dolphin whistle (or any other signal resembling an AM-FM component) introduces high-density regions, i.e. the points are close to each other in the frequency domain (see Figure 1). The absence of whistles or the presence of other broadband sounds does not alter the natural distribution of the central frequency estimates, producing a plot density that is defined by the separation between the bands of the Gabor filter bank. If the pyknogram is interpreted as a 2D point cloud, we can adapt many of the point cloud denoising algorithms to extract only those points that are associated with whistle fragments. For the sake of simplicity, we use the Parzen window technique (Parzen 1962) to obtain the probability density function of the pyknogram point separation in the frequency domain. This way, we can calculate the density of each pyknogram point using Equation (10):

$$d(F_i(n_0))=\frac{1}{Ih}\sum_{j=1}^{I} K\left(\frac{F_i(n_0)-F_j(n_0)}{h}\right), \tag{10}$$

where $K(\cdot)$ is a kernel function. In this work, we used a rectangular kernel, $K(p)=1$ for $|p|\le 1$ and $0$ otherwise. Other kernel functions, such as a Gaussian kernel, can be used, yielding a slightly larger number of detected whistle candidates. However, the uniform kernel is more robust to noise and provides a number of whistle candidates that is large enough to track their contours in most situations. Equation (10) computes the frequency separation between $F_i(n_0)$ and the rest of the pyknogram points $F_j(n_0)$ and normalises it by the bandwidth $h$. In this work, $h=250$ Hz. This is one-half of the Gabor filter separation between consecutive bands and is obtained as:

$$h=\frac{BW_{Gabor}}{2}\left(1-\frac{\Delta W}{100}\right). \tag{11}$$

As a result, $d(F_i(n_0))$ is proportional to the number of points that are separated in frequency by less than $h$ Hz. Therefore, if $P=\{p_j\in\mathbb{R}^2\}$ with $j=\{1,2,\ldots,I\times Q\}$ is the set of all estimated pyknogram central frequency points, where each $p_j=(n_0T_s, F_i(n_0))$, we can obtain all of the whistle density points $W_D$ that are whistle candidates using $W_D=\{p_j\in P \mid d(F_i(n_0))>1/(Ih)\}$. Panel (a) of Figure 2 shows the whistle candidates $W_D$ detected for the sound clip analysed in Figure 1. We used a temporal window of 10.7 ms ($M=10.7\times10^{-3}\,f_s$) and 50% overlap in the computation of the central frequency $F_i(n_0)$ and the bandwidth $B_i^2(n_0)$.
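
The density rule of Equations (10) and (11) can be sketched for the central frequency estimates of a single time frame. The function names are hypothetical and the input frequencies in the usage note are made-up values, not data from the recordings.

```python
def parzen_bandwidth(bw_gabor=1000.0, overlap_pct=50.0):
    """Eq. (11): half of the separation between consecutive Gabor bands."""
    return bw_gabor / 2 * (1 - overlap_pct / 100)


def point_density(freqs, h):
    """Eq. (10) with a rectangular kernel, for the I estimates of one frame."""
    bands = len(freqs)
    return [sum(1 for fj in freqs if abs(fi - fj) <= h) / (bands * h)
            for fi in freqs]


def dense_candidates(freqs, h):
    """Keep points whose density exceeds 1/(I*h), i.e. with at least one neighbour."""
    floor = 1 / (len(freqs) * h)
    return [f for f, d in zip(freqs, point_density(freqs, h)) if d > floor]
```

With `h = parzen_bandwidth()` (250 Hz), the illustrative frame `[3000, 3500, 4000, 4100, 4150, 5000]` keeps only `4000`, `4100` and `4150`: the remaining points sit at the natural 500 Hz band spacing and have no neighbour within `h`.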

Figure 2. An example of whistle candidate extraction in real dolphin whistles by method: (a) 1247 candidates for point cloud denoising ($W_D$); (b) 5132 candidates for bandwidth threshold ($W_B$); (c) 1159 candidates for PWCD ($W$); and (d) 487 candidates for binarised spectrogram. The candidate count was done for the whole pyknogram representation. The red circles indicate areas to focus on in order to see the dispersion of the extracted whistle candidates.

A different way of selecting the whistle candidates from the pyknogram is to keep only those central frequency estimates $F_i(n_0)$ that have an associated bandwidth $B_i^2(n_0)$ that is lower than a given bandwidth threshold $BW_h$. We call those candidates $W_B=\{p\in P \mid B_i^2(n_0)<BW_h\}$. This technique produces slightly sharper whistle candidate estimates (compare the continuous-line red circle region in Panels (a) and (b) of Figure 2). However, empirical tests on real signals have shown that in order to have a number of whistle candidates throughout the duration of the whistle that is comparable to the one obtained with the previous technique, we need to use a high bandwidth threshold. As an example, Panel (b) of Figure 2 was obtained using $BW_h=BW_{Gabor}/4=250$ Hz, which is used in the remainder of this work. The result when using $W_B$ is a large number of random candidates that do not belong to real whistles (see the number of candidates in Panel (b) of Figure 2, which, in this example, is equal to 5132).

The final method proposed here for whistle candidate detection is the intersection of the candidates obtained using the two techniques, $W=\{W_D\cap W_B\}$. We have named this method Pyknogram-based Whistle Candidate Detection (PWCD). It combines the advantages of the two techniques, providing an accurate and even distribution of the whistle candidates while at the same time providing sharp whistle estimates (see Panel (c) of Figure 2).
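
A minimal sketch of the intersection of the two selection rules is given below. It is an illustration under stated assumptions: the function name and the tuple layout are hypothetical, the density rule is applied frame by frame with a rectangular kernel, and the bandwidth estimate is compared directly against the threshold as in the definition of the bandwidth-based candidates.

```python
from collections import defaultdict


def pwcd_candidates(points, h=250.0, bw_threshold=250.0):
    """points: iterable of (frame_time, centre_freq, bandwidth_estimate)."""
    by_frame = defaultdict(list)
    for t, f, b in points:
        by_frame[t].append((f, b))
    selected = []
    for t in sorted(by_frame):
        frame = by_frame[t]
        for f, b in frame:
            neighbours = sum(1 for g, _ in frame if abs(f - g) <= h)
            in_density_set = neighbours > 1       # density above 1/(I*h)
            in_bandwidth_set = b < bw_threshold   # narrow enough to be tonal
            if in_density_set and in_bandwidth_set:
                selected.append((t, f))
    return selected
```

A point survives only if it both lies in a high-density region and has a small bandwidth estimate, so isolated narrow points and dense but wide points are discarded.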

2.2. Spectrogram-based whistle candidate detection

The proposed PWCD technique was compared with a traditional spectrogram peak-based whistle candidate detection method. Of all of the different spectrogram-based extraction techniques, we chose to compare it with the one described in Gillespie et al. (2013). This technique is included in PamGuard, a popular software package developed to automatically identify vocalisations of marine mammals, which has been used many times as a benchmark. In that work, D. Gillespie proposed a six-step process for whistle detection and tracking: click removal, spectrogram calculation, spectrogram noise removal (median filter, average subtraction, and Smoothing Kernel (SK)), 2D thresholding, connection of regions, and separation of crossing whistles. We only implemented the first four steps of the process for the comparison since those are the ones that are specifically related to the detection of whistle peaks, or candidates as they are named here. The first four steps in Gillespie et al. (2013) did not fully optimise the candidate detection step (similarly to what happens with the PWCD technique), since subsequent whistle contour tracking stages would deal with a moderate false alarm rate of candidates at this intermediate stage. When computing the spectrogram (as we did with the PWCD), we used the same temporal window length (10.7 ms) and the 50% overlap that was used in Gillespie et al. (2013). The spectrogram was computed using a Hamming window.

Using the aforementioned technique, we analysed the sound clip used in Figure 1. The whistle peaks detected between 3 and 22 kHz are shown as a scatter plot in Panel (d) of Figure 2. One of the first things that can be observed in this particular example is that, when using $W_D$, the whistle candidates detected have less spectral resolution than those of the spectrogram (compare Panels (a) and (d) of Figure 2). However, in some weak regions of the whistle with a fast sweep rate (compare the dashed-line red circle), $W_D$ gives more uniform candidates and an overall higher number than the spectrogram does. This might be due to several factors that affect the spectrogram-based candidate detection: first, the click removal step described in Gillespie et al. (2013) might also remove whistle candidates that follow a path close to a vertical slope; second, the candidate extraction in the spectrogram relies on a thresholding process that sometimes fails to extract candidates in the lower intensity parts of a whistle. Decreasing the spectrogram threshold should provide a general increase in whistle point candidates that subsequent whistle extraction and tracking stages will refine.

2.3. Spectrogram- and pyknogram-based whistle extraction

Whistle candidates detected using the PWCD and the spectrogram were used by the Gaussian mixture probability hypothesis density (GM-PHD) filter for whistle extraction as described in Gruden and White (2016). The aim was to compare how the different whistle candidates behaved when used by the same whistle extraction algorithm. Different metrics were computed for the PWCD and the spectrogram before and after the whistle was extracted.

The GM-PHD algorithm, which was downloaded from Gruden (2022), worked with the same settings used here: 10.7 ms window size, 50% overlap, and a time increment of 5.35 ms (see Sections 2.1 and 2.2). Similarly to what is done for extracting whistles with spectrogram candidates using the GM-PHD, the PWCD candidates were used to fit a quadratic polynomial and the maximum was obtained by using the fitted polynomial.

In this work, whistle extraction metrics were calculated only for real whistles (Section 3.3) and not for simulated whistles (Sections 3.1 and 3.2).

2.4. Metrics for comparing the performance of the pyknogram and the spectrogram approaches

In order to systematically study the number of whistle candidates that each technique succeeds in recovering (recall) when compared with the theoretical instantaneous frequencies, as well as the errors committed in the process (precision), we need to define how these metrics are computed. The main idea is summarised in Figure 3.

Figure 3. Precision and recall of whistle candidates can be obtained by looking for the true positives, the false positives, and the false negatives. FD is the maximum frequency deviation from the ground truth whistle contour.

The frequency resolution of the PWCD is connected to the bandwidth of the Gabor filter bank $BW_{Gabor}$ and its overlap $\Delta W$, whereas the frequency resolution of the spectrogram is connected to the time window length. As a result, for the same sampling frequency, the two techniques do not provide the same number of whistle candidates per Hertz. Additionally, the number of candidates in both techniques is very likely to vary (and not always in a similar way) due to many factors such as whistle slope, noise, bandwidth, etc. To achieve a fair comparison, we merged adjacent candidates within each time frame and counted that merged group of candidates as 1. We considered the group of candidates to be a valid whistle match if it fell within a frequency deviation (FD) of ±350 Hz of the theoretical instantaneous frequency (the whistle contour) (Roch et al. 2011).
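
One possible reading of this counting scheme is sketched below. The details are assumptions made for illustration: the function names are hypothetical, "adjacent" is interpreted as a frequency gap of at most `gap` Hz (a made-up parameter), each merged group is represented by its mean frequency, and precision and recall are counted per group and per ground-truth contour point, respectively.

```python
def merge_adjacent(freqs, gap):
    """Merge candidates of one frame whose spacing is at most gap; return group means."""
    groups = []
    for f in sorted(freqs):
        if groups and f - groups[-1][-1] <= gap:
            groups[-1].append(f)
        else:
            groups.append([f])
    return [sum(g) / len(g) for g in groups]


def precision_recall(candidates, truth, fd=350.0, gap=500.0):
    """candidates, truth: dicts mapping frame index -> list of frequencies in Hz."""
    matched_groups = total_groups = matched_truth = total_truth = 0
    for frame in set(candidates) | set(truth):
        groups = merge_adjacent(candidates.get(frame, []), gap)
        contour = truth.get(frame, [])
        total_groups += len(groups)
        total_truth += len(contour)
        # A group is a true positive if it lies within fd of some contour point.
        matched_groups += sum(1 for g in groups
                              if any(abs(g - f) <= fd for f in contour))
        # A contour point is recovered if some group lies within fd of it.
        matched_truth += sum(1 for f in contour
                             if any(abs(g - f) <= fd for g in groups))
    precision = matched_groups / total_groups if total_groups else 0.0
    recall = matched_truth / total_truth if total_truth else 0.0
    return precision, recall
```

Counting each merged group once keeps a method with finer frequency sampling from being rewarded or penalised simply for emitting more points per Hertz.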

3. Analysis and results

3.1. Whistle candidate detection in simulated overlapping whistles

One of the advantages of the pyknogram, and hence of the proposed PWCD method, is its ability to perform well in different overlapping sound scenarios. In order to study this, we performed simulations with two synthetic whistles crossing with different slopes. Consider the sum of two noisy whistles modelled as described in Equations (1) and (2). The two whistles have instantaneous frequencies that vary linearly as given by $f_1(t)=f_0-\Delta f\,\frac{T-2t}{2T}$ and $f_2(t)=f_0+\Delta f\,\frac{T-2t}{2T}$ with $T=0.25$ s. The instantaneous amplitude of both whistles is constant and equal to one ($a_1(t)=a_2(t)=1$), and the initial phase is randomly distributed in the range $[0,2\pi]$. The two synthetic whistles have an SNR of −6 dB in the whistle band due to the added pink noise, $\eta(t)$. In this study, the SNR was computed as the ratio of whistle power to additive noise power in the bandwidth of interest (3–22 kHz). Figure 4 shows an example of whistle candidate detection for simulated crossing whistles, with $f_0=12$ kHz and $\Delta f=9$ kHz. In this situation, the PWCD achieves lower precision than the spectrogram. However, it is capable of retrieving more whistle candidates (higher recall) than the spectrogram does.
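
The two crossing whistles can be sketched as follows, assuming the linear instantaneous frequencies described above (the second mirrored about $f_0$). The function names are hypothetical; the pink noise and the random initial phases of the simulation are deliberately left out, so this is only the noise-free part of the signal.

```python
import math


def crossing_if(t, f0=12000.0, delta_f=9000.0, T=0.25):
    """Instantaneous frequencies (Hz) of the two whistles at time t (s)."""
    offset = delta_f * (T - 2 * t) / (2 * T)
    return f0 - offset, f0 + offset


def crossing_whistles(fs=192000, f0=12000.0, delta_f=9000.0, T=0.25,
                      theta1=0.0, theta2=0.0):
    """Noise-free sum of two unit-amplitude AM-FM whistles (Eqs. 1 and 2)."""
    samples = []
    for n in range(int(T * fs)):
        t = n / fs
        # Closed-form integral of the linear sweep term from 0 to t:
        # delta_f * (T*t - t^2) / (2*T).
        sweep = delta_f * (T * t - t * t) / (2 * T)
        samples.append(math.cos(2 * math.pi * (f0 * t - sweep) + theta1)
                       + math.cos(2 * math.pi * (f0 * t + sweep) + theta2))
    return samples
```

With the default values, the whistles start at 7.5 kHz and 16.5 kHz, cross at 12 kHz at $t=T/2$, and swap frequencies at $t=T$, which corresponds to a sweep rate of 36 kHz/s.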

Figure 4. An example of detecting whistle candidates in simulated crossing whistles using the spectrogram and the PWCD techniques. Top panels: $\Delta f=2$ kHz. Bottom panels: $\Delta f=10$ kHz.

With the aim of studying how both techniques perform for different $\Delta f$, we performed 500 Monte Carlo runs with $f_0=12$ kHz and $\Delta f$ varying from 2 kHz to 9 kHz. The maximum sweep rate, 9 kHz/0.25 s = 36 kHz/s, is realistic compared with what is measured for some species, such as the common dolphin (33.5 kHz/s) (Gannier et al. 2010). The SNR was kept constant at −6 dB. The precision and recall rates of the detected candidates were computed as described, within ±350 Hz of the theoretical instantaneous frequency. The left panel of Figure 5 shows how the spectrogram achieves higher precision than the PWCD does. The recall rate of both techniques is shown in the middle panel of Figure 5. Although there is a considerable difference in the recall metric for small $\Delta f$ (almost 100% for the binarised spectrogram and around 87% for the PWCD), this difference is reduced as $\Delta f$ increases. Thus, for high $\Delta f$, the PWCD shows higher recall than the spectrogram does (76% for the binarised spectrogram vs. 91% for the PWCD). With respect to recall, the PWCD shows a more stable behaviour when $\Delta f$ changes than the spectrogram does.

Figure 5. Evolution of the precision, recall, and F2-measure for the PWCD and the spectrogram-based whistle candidate detection in crossing whistles as Δf increases. The results were obtained for 500 Monte Carlo runs with a SNR = −6 dB.

We used the $F_\beta$-measure, computed as $F_\beta=(1+\beta^2)\,\frac{\text{precision}\cdot\text{recall}}{(\beta^2\cdot\text{precision})+\text{recall}}$, as a single-score metric summarising both precision and recall (Christen et al. 2023). We specifically used the $F_2$-measure, which gives more weight to recall and less to precision. This measure was selected because some of the false positives can be easily reduced at a later stage by looking for candidates that can be connected with preceding or subsequent candidates. However, there is always an extra difficulty in recovering the whistle contour if the number of false negatives becomes too high. The $F_2$-measure as $\Delta f$ increases is shown in the right panel of Figure 5. The figure illustrates that the behaviour for large $\Delta f$ is better for the PWCD than it is for the binarised spectrogram. However, for small $\Delta f$, the spectrogram gives better results.
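
The effect of choosing $\beta=2$ can be checked with a few lines. The function name is hypothetical and the precision and recall values in the usage note are made-up numbers chosen only to show the asymmetry.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # convention used in this sketch when both rates are zero
    return (1 + beta ** 2) * precision * recall / denom
```

For instance, `f_beta(0.5, 1.0)` is about 0.83 whereas `f_beta(1.0, 0.5)` is about 0.56: with $\beta=2$, losing recall costs more than losing the same amount of precision.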

We also computed the $F_2$-measure of the spectrogram and the PWCD methods for different SNR and different $\Delta f$. The results are shown in Table 1. The table shows that, for $\Delta f=2$ kHz, the spectrogram produces candidates with higher $F_2$-measures than the PWCD. However, for $\Delta f=9$ kHz, the PWCD produces higher $F_2$-measures than the spectrogram. For $\Delta f=5$ kHz and very low SNR (SNR = −5 dB and SNR = −6 dB), the PWCD outperforms the spectrogram. For SNR = −4 dB and SNR = −3 dB, the situation changes and the spectrogram produces better candidates than the PWCD. It is important to highlight that these results are obtained before any whistle extraction or tracking stage, which will reduce the false positives in both methods.

Table 1. F2-measure computed over 500 Monte Carlo runs for simulated crossing whistles when the Signal to Noise Ratio (SNR) changes. The simulations are obtained for three different Δf values.

3.2. Whistle candidate detection in simulated sudden increases of the noise floor

Cetacean recordings often contain unexpected acoustic events that may lead to sudden rises in the noise floor: increases in wind velocity, rainfall, and anthropogenic sources are some examples (Roch et al. 2011). These noise floor changes produce a considerable number of false positives in many of the spectrogram peak-based whistle extraction techniques. In order to see how the proposed PWCD works in this situation, we performed a variation of the previously described simulation. As before, whistles were simulated using AM-FM components, but this time the whistle register was divided into two parts at a random time instant. The sudden changes were obtained by increasing/decreasing the SNR of the first and second parts by a factor of $\pm\Delta SNR$. The left panel of Figure 6 shows an example where the SNR is increased by 1.5 dB at $t=0.1$ s for crossing whistles with $\Delta f=9$ kHz. The middle and right panels show the whistle candidates detected by the PWCD and the binarised spectrogram, respectively. In Figure 6, the PWCD provided a larger number of candidates than the spectrogram did, with considerably good precision.

Figure 6. Example of the spectrogram and the PWCD technique to extract whistle candidates when a sudden change of $\Delta SNR=1.5$ dB in the noise floor occurs. The sudden change occurs at 0.1 s and is marked with a vertical red dashed line in the temporal representation of the signal.

As before, we performed 500 Monte Carlo runs. However, this time we changed the $\Delta SNR$ from 0 to 3 dB in order to study the behaviour of the two techniques. In each one of the runs, the slope of the two crossing whistles was randomly changed ($\Delta f$ was uniformly distributed between 3000 and 9000 Hz). The results are shown in Figure 7. It can be concluded that although the number of properly detected candidates (precision rate) was higher for the spectrogram technique than it was for the PWCD, the number of possible candidates extracted (recall rate) was lower. The overall behaviour of the PWCD was slightly better than that of the spectrogram.

Figure 7. Evolution of the precision, recall, and F2-measure for the PWCD and the spectrogram-based whistle candidate detection when ΔSNR changes. The results were obtained for 500 Monte Carlo runs with a SNR = −6 dB and Δf randomly changing between 3000–9000 Hz.

With regard to the number of candidates obtained for simulated signals, the PWCD has some advantages over the binarised spectrogram with respect to whistle slope and changes in noise floor levels. It is important to remember that all of the candidates from the PWCD and the spectrogram-based technique will be fed into a tracking algorithm at a later stage. It is after this stage that the real precision and recall curves should be evaluated, as done in Subsection 3.3. Nevertheless, taking into account that the exact same tracking algorithm will be used later, a prior study of the precision and recall helps to determine the scenarios where one of the techniques might potentially work better than the other.

3.3. Whistle candidate detection and whistle extraction in real scenarios

With the aim of testing the performance of the proposed PWCD technique, different complex acoustic scenarios were selected. All of the scenarios come from the recordings of an acoustic campaign that was conducted in the Bay of Biscay on 20 June 2019 as part of the RAGES EU project. The location corresponds to an area of high marine mammal density, which is shown in Figure 8. The recording site is within close range (6 km) of a gas platform, named Gaviota, where sudden noises occur during its operation. The signals were acquired with the SAMARUC passive acoustic monitoring device (Universitat Politècnica de València) (Lara et al. 2019, 2020), equipped with a Cetacean Research hydrophone (C57) and a sampling frequency $f_s=192$ kHz. The hydrophone depth was 414 metres. Although there was no visual confirmation, habitat-based density models of cetacean species (Camilo et al. 2018) along with signature whistles allowed us to identify bottlenose dolphins, striped dolphins, and common dolphins as the main species vocalising in the recordings. The selected scenarios shown in Figure 9 contain vocalisations of the aforementioned species and can be described as: (1) overlapped whistles coming from many striped and common dolphins vocalising simultaneously; (2) isolated whistles with low SNR mainly due to a sudden increase in background noise, (2a) for anthropogenic noise and (2b) for ambient noise; and (3) a combination of isolated and multiple overlapped whistles in the presence of echolocation clicks.

Figure 8. Approximate location of the RAGES deployment and location of the recordings used (marked with a star on the map).

Figure 9. Three examples of how the proposed technique behaves in the three described scenarios: left-column scenario (1), middle-column scenario (2a), and right-column scenario (3). The green points correspond to extracted whistle candidates that match the expert annotation (true positives); the red points indicate extracted whistle candidates that do not match the expert annotation (false positives).

Figure 9 shows an example of what whistle extraction looks like in the three selected scenarios (previously described) using the spectrogram and the PWCD. The figure shows that, while maintaining good precision, the number of whistle frequency components extracted is higher for the PWCD than it is for the binarised spectrogram (higher recall). Visual comparison shows that the components are more uniformly distributed over the whistle contour in the PWCD than they are in the binarised spectrogram. At later stages, this should benefit the process of tracking the whistle contour or disambiguating individual whistles when they cross one another.

3.3.1. Ground truth and results

To establish how well the two methods perform when extracting the whistle candidates, we need to compare their output with the whistle contours extracted by a trained analyst (ground truth information). For that purpose, and similarly to what was done in (Roch et al. Citation2011), we created custom software in MATLAB that allows the bioacoustics data analyst to interactively specify the whistle contours by clicking on a few whistle points. A cubic spline interpolation through the selected points was displayed so that the analyst could check that the manually annotated whistle matched the spectrogram contour (instantaneous frequency). This process was repeated for every whistle in the scenario dataset. Even though a huge effort was made to record accurate ground truth information, there are always some errors and missed whistle fragments. However, the metrics previously used in the simulations were designed so that these errors affect the spectrogram technique and the PWCD technique in a very similar way. We analysed over two thousand whistles in the three proposed scenarios. The metrics obtained in each of the scenarios are the same ones already used for the simulations (precision, recall, and F2-measure). We computed the metrics for the spectrogram-based and pyknogram-based candidates (Table 2). Given that the candidates were also used for whistle extraction using the GM-PHD, Table 2 shows the metrics with the SK and without the SK (¬SK). Note that the peak candidate extraction using Gillespie’s method, as implemented in the GM-PHD (Gruden Citation2022), did not use the SK step. Finally, we computed the metrics after the whistle extraction using the GM-PHD from the candidates without the SK (Table 3).
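As a reference for how the three metrics relate to each other, the following minimal Python sketch computes precision, recall, and the F2-measure from counts of matched and unmatched candidates. It assumes the standard F-beta definition with β = 2, which weights recall more heavily than precision. The function names and the example counts are hypothetical and are not taken from the study.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta = 2 gives the F2-measure."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def candidate_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F2) from candidate counts.

    tp: candidates that match the analyst annotation
    fp: candidates that do not match the annotation
    fn: annotated whistle points with no matching candidate
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall, f_beta(precision, recall)


# Hypothetical counts, for illustration only.
precision, recall, f2 = candidate_metrics(tp=60, fp=40, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} F2={f2:.3f}")
```

Because β = 2 places more weight on recall, a detector that finds more true whistle points can obtain a higher F2-measure even when its precision is somewhat lower.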

Table 2. Precision, recall, and F2-measure metrics for whistle candidate detection using the spectrogram (with and without the smoothing kernel: SK and ¬SK, respectively) and the pyknogram in all the scenarios.

Table 3. Precision, recall, and F2-measure results after the GM-PHD whistle contour extraction in the three scenarios.

3.3.2. Discussion on candidate detection and whistle extraction in real scenarios

The analysis of the candidate detection metrics (Table 2) shows that, with the SK, the PWCD achieved better recall than the spectrogram. The combined F2-measure was also higher for the PWCD than for the spectrogram. However, the precision was always higher for the spectrogram than it was for the PWCD. This is in agreement with the results obtained in the simulations, where the pyknogram provided better recall and F2-measure when the noise floor increased suddenly and whistles overlapped with high Δf.

When the SK was not used (¬SK column), recall was higher for the spectrogram than it was for the PWCD, and precision was lower for the spectrogram compared to that of the PWCD. The PWCD achieved a better F2-measure than the spectrogram for all of the scenarios, except the scenario of multiple overlapping whistles (1).

In summary, the overall whistle candidate detection performance (F2-measure) in the Bay of Biscay recordings was 6.6 percentage points higher for the PWCD than for the spectrogram with the SK, and 5.7 percentage points higher than for the spectrogram without the SK (62.4% vs 55.8% and 56.7%, respectively). Although this is far below the 20% improvement that some authors have reported for the pyknogram when extracting tonal components in speech, this modest improvement might be worthwhile in some challenging scenarios.
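The gains quoted above are absolute differences between F2-measures, that is, percentage points. The short Python check below reproduces them from the three overall values given in the text and also prints the corresponding relative gains for comparison. The variable names are illustrative.

```python
# Overall F2-measures (in %) as quoted in the text.
f2_pwcd = 62.4
f2_spectrogram = {"SK": 55.8, "no SK": 56.7}

gains = {}
for setting, f2_spec in f2_spectrogram.items():
    absolute = f2_pwcd - f2_spec          # difference in percentage points
    relative = 100.0 * absolute / f2_spec  # gain relative to the spectrogram
    gains[setting] = (absolute, relative)
    print(f"{setting}: +{absolute:.1f} points ({relative:.1f}% relative)")
```

Stating the gain in percentage points avoids confusing it with the larger relative improvement over the spectrogram baseline.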

Once the whistle extraction using the GM-PHD method was done (Table 3), the results changed completely. The GM-PHD was able to discard spectrogram candidates that did not belong to real whistles, increasing the precision at the cost of a reduction in the recall. Something similar happened in some scenarios for the PWCD: an increase in the precision and a decrease in the recall. However, the GM-PHD performed the extraction task better for the spectrogram than it did for the PWCD. The F2-measure after whistle extraction was higher for the spectrogram than it was for the PWCD in all the scenarios except scenario (2a). This makes sense if we take into account that the variances of the GM-PHD and the system noise covariance matrix were optimised to work with the spectrogram frequency resolution (93.75 Hz for a 10.7 ms window). The PWCD, on the other hand, had a frequency resolution of 500 Hz, given by the Gabor filterbank.
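The spectrogram settings quoted above can be cross-checked against the sampling frequency given in Subsection 3.3. The sketch below assumes that the stated frequency resolution is the FFT bin spacing fs/N; this is an assumption made here for illustration rather than something stated in the text, and the variable names are illustrative.

```python
FS_HZ = 192_000            # sampling frequency from Subsection 3.3
SPEC_RES_HZ = 93.75        # spectrogram frequency resolution
GABOR_RES_HZ = 500.0       # PWCD (Gabor filterbank) frequency resolution

# Assumption: resolution equals the FFT bin spacing fs / N.
n_fft = FS_HZ / SPEC_RES_HZ               # implied FFT length in samples
window_ms = 1000.0 * n_fft / FS_HZ        # implied window duration in ms
resolution_ratio = GABOR_RES_HZ / SPEC_RES_HZ

print(f"FFT length: {n_fft:.0f} samples")
print(f"Window duration: {window_ms:.2f} ms")
print(f"Gabor / spectrogram resolution: {resolution_ratio:.2f}")
```

Under this assumption, the implied window of 2048 samples lasts about 10.67 ms, which is consistent with the 10.7 ms quoted, and the PWCD frequency resolution is roughly five times coarser than the one the GM-PHD parameters were tuned for.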

4. Conclusions

We have presented an alternative method for whistle candidate detection based on the pyknogram. The technique, named PWCD, has been shown to have a better combined F2-measure than the spectrogram in some challenging scenarios, such as multiple overlapping whistles and regions with anthropogenic noise. This behaviour is due to the fact that the density distribution of the pyknogram points is less affected by the presence of broadband noise and sudden increases in the noise floor. Monte Carlo simulations were done to illustrate this behaviour (Subsections 3.1 and 3.2).

The PWCD has some additional advantages over the spectrogram that may be attractive in some situations. First, time and frequency resolution in the PWCD can be controlled separately by the window length and the bandwidth of the Gabor filters, respectively. This can be useful for the analysis of certain short cetacean calls. Second, the PWCD is capable of extracting a larger number of whistle regions in high-slope crossing whistles than the spectrogram.

The application of the proposed PWCD to a real dataset containing more than 2000 ground-truth annotated whistles demonstrated that, before the whistle extraction stage, the candidates obtained with the PWCD technique outperformed those obtained with the spectrogram. In the best scenario, the PWCD technique obtained an F2-measure of 67% compared to 58.9% for the spectrogram, an increase of slightly over 8 percentage points. The overall F2-measure for the combined scenarios also increased by approximately 6.6 percentage points when using the PWCD instead of the spectrogram.

The results changed when the candidates were used to extract the contour using the GM-PHD method, and the spectrogram outperformed the PWCD in most of the scenarios. The final accuracy (F2-measure) of the whistle extraction was 57.4% for the PWCD and 64.9% for the spectrogram. Even though the GM-PHD implementation used was specifically tuned to work with the spectrogram settings, in the scenario of isolated whistles with low SNR, the PWCD was able to obtain slightly better accuracy than the spectrogram (48.2% vs 46.3%). Candidates obtained using the PWCD might therefore have some potential to achieve better whistle tracking in adverse scenarios when paired with appropriate whistle extraction techniques. Being able to extract whistle contours in noisy scenarios is an important research line that may help to develop automatic alert systems. As a result, a thorough study of the different whistle extraction techniques and their adaptation to the proposed PWCD technique is a future line of work.

Acknowledgements

This work was partially supported by the following: LIFE PortSounds (grant agreement LIFE20-ENV/ES/000387), the European Commission – DG ENV (grant agreement 110661/2018/794607/SUB/ENV.C2), the Spanish Institute of Oceanography (IEO), AZTI, and the Oceanografic of Valencia. We would like to thank the crew of the ships who participated in the deployment and recovery of the acoustic campaign.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The work was supported by the Directorate General for Environment (DG ENV) [110661/2018/794607/SUB/ENV.C2]; European Climate, Infrastructure and Environment Executive Agency (CINEA) [LIFE20-ENV/ES/000387].

References

  • Camilo S, Gerrodette T, Louzao M, Valeiras J, García S, Cerviño S, Pierce GJ, Santos MB. 2018. Assessing the environmental status of the short-beaked common dolphin (Delphinus delphis) in north-western Spanish waters using abundance trends and safe removal limits. Prog Oceanogr. 166:66–75. doi: 10.1016/j.pocean.2017.08.006.
  • Christen P, Hand DJ, Kirielle N. 2023. A review of the f-measure: its history, properties, criticism, and alternatives. ACM Comput Surv. 56(3):1–24. doi: 10.1145/3606367.
  • Cornel I, Cédric G, Yann S, Jerome IM. 2010. Analysis of underwater mammal vocalisations using time–frequency-phase tracker. Appl Acoust. 71(11):1070–1080. doi: 10.1016/j.apacoust.2010.04.009.
  • Delprat N, Escudie B, Guillemain P, Kronland-Martinet R, Tchamitchian P, Torresani B. 1992. Asymptotic wavelet and Gabor analysis: extraction of instantaneous frequencies. IEEE Trans Inf Theory. 38(2):644–664. doi: 10.1109/18.119728.
  • Gabor D. 1946. Theory of communication. J Electr Eng. 93(3):429–457. doi: 10.1049/ji-3-2.1946.0076.
  • Gannier A, Fuchs S, Quebre P, Oswald JN. 2010. Performance of a contour-based classification method for whistles of mediterranean delphinids. Appl Acoust. 71(11):1063–1069. doi: 10.1016/j.apacoust.2010.05.019.
  • Gillespie D, Caillat M, Gordon J, White P. 2013. Automatic detection and classification of odontocete whistles. J Acoust Soc Am. 134(3):2427. doi: 10.1121/1.4816555.
  • Gruden P. 2016. GM-PHD whistle detector [software]. GitHub repository (commit: 4e105cd); [accessed 2022 Jul 16]. https://github.com/PinaGruden/GMPHD_whistle_contour_tracking
  • Gruden P, White PR. 2016. Automated tracking of dolphin whistles using gaussian mixture probability hypothesis density filters. J Acoust Soc Am. 140(3):1981–1991. doi: 10.1121/1.4962980.
  • Gruden P, White PR. 2020. Automated extraction of dolphin whistles - a sequential Monte Carlo probability hypothesis density approach. J Acoust Soc Am. 148(5):3014–3026. doi: 10.1121/10.0002257.
  • Hsu MK, Sheu JC, Hsue C. 2011. Overcoming the negative frequencies- instantaneous frequency and amplitude estimation using osculating circle method. J Mar Sci Technol. 19(5): doi: 10.51400/2709-6998.2165.
  • Johansson AT, White PR. 2011. An adaptive filter-based method for robust, automatic detection and frequency estimation of whistles. J Acoust Soc Am. 130(2):893–903. doi: 10.1121/1.3609117.
  • Kershenbaum A, Roch MA. 2013. An image processing based paradigm for the extraction of tonal sounds in cetacean communications. J Acoust Soc Am. 134(6):4435. doi: 10.1121/1.4828821.
  • Lara G, Bou-Cabo M, Esteban JA, Espinosa V, Miralles R. 2019. Design and application of a passive acoustic monitoring system in the Spanish implementation of the marine strategy framework directive. 6th International Electronic Conference on Sensors and Applications; Nov 15–30; MDPI. p. 1–7.
  • Lara G, Bou-Cabo M, Esteban JA, Espinosa V, Miralles R. 2020. New insights into the design and application of a passive acoustic monitoring system for the assessment of the good environmental status in Spanish marine waters. Sensors. 20(5353):1–13. doi: 10.3390/s20185353.
  • Mallawaarachchi A, Ong SH, Chitre M, Taylor E. 2008. Spectrogram denoising and automated extraction of the fundamental frequency variation of dolphin whistles. J Acoust Soc Am. 124(2):1159–1170. doi: 10.1121/1.2945711.
  • Maragos P, Loupas T, Pitsikalis V. 2002. On improving doppler ultrasound spectroscopy with multiband instantaneous energy separation. Proceedings Int’l Conf. DSP-2002; Jul 1–3; Santorini, Greece.
  • Mellinger DK, Martin SW, Morrissey RP, Thomas L, Yosco JJ. 2011. A method for detecting whistles, moans, and other frequency contour sounds. J Acoust Soc Am. 129(6):4055–4061. doi: 10.1121/1.3531926.
  • Parzen E. 1962. On estimation of a probability density function and mode. Ann Math Stat. 33(3):1065–1076. doi: 10.1214/aoms/1177704472.
  • Potamianos A, Maragos P. 1996. Speech formant frequency and bandwidth tracking using multiband energy demodulation. J Acoust Soc Am. 99(6):3795–3806. doi: 10.1121/1.414997.
  • Roch MA, Scott Brandes T, Patel B, Barkley Y, Baumann-Pickering S, Soldevilla MS. 2011. Automated extraction of odontocete whistle contours. J Acoust Soc Am. 130(4):2212–2223. doi: 10.1121/1.3624821.
  • Shokouhi N, Hansen JHL. 2017. Teager–Kaiser energy operators for overlapped speech detection. IEEE/ACM Trans Audio, Speech, Language Process. 25(5):1035–1047. doi: 10.1109/TASLP.2017.2678684.
  • Vijayan K, Raghavendra Reddy P, Sri Rama Murty K. 2016. Significance of analytic phase of speech signals in speaker verification. Speech Commun. 81:54–71. doi: 10.1016/j.specom.2016.02.005.
  • White P, Hadley M. 2008. Introduction to particle filters for tracking applications in the passive acoustic monitoring of cetaceans. Canadian Acoustics. 36(1):146–152. https://jcaa.caa-aca.ca/index.php/jcaa/article/view/2004
  • Yousefi M, Shokouhi N, Hansen JHL. 2018. Assessing speaker engagement in 2-person debates: overlap detection in United States presidential debates. Proceedings of the Interspeech 2018; Hyderabad, India. p. 2117–2121.