Research Article

Cloud removal using scattering model and evaluation via semi-realistic simulation

Pages 2799-2825 | Received 16 Nov 2022, Accepted 18 Apr 2023, Published online: 12 May 2023

ABSTRACT

Cloud removal is an essential task in remote sensing data analysis. Because the imaging sensors are far from the ground, part of the area of interest is often covered by cloud, and the atmosphere in between adds a constant haze layer to the acquired images. To recover the ground image, we propose to use a scattering model for temporal sequences of images of a scene within the framework of low-rank and sparse models. We further develop a variant that is much faster and yet more accurate. To measure the performance of different methods objectively, we develop a semi-realistic simulation method that produces cloud cover so that various methods can be analysed quantitatively. This enables detailed study of many aspects of cloud removal algorithms, including verifying the effectiveness of the proposed models against the state of the art (including deep learning models) and addressing the long-standing problem of determining regularization parameters. A theoretical analysis of the range of the sparsity regularization parameter is provided and verified numerically.

1. Introduction

As imaging sensors are deployed kilometres above the ground, cloud often appears in the acquired images, which complicates data analysis tasks. It is desirable to remove the cloud entirely and recover the clean ground scene, which gives rise to the cloud removal task. Due to the diversity of remote sensing imagery, cloud removal methods have to align with the characteristics of the sensors, for example, multiple channels or a single band. Meanwhile, the platform is a decisive factor in the design of the algorithm, for example, through computation and power consumption restrictions. Furthermore, the analysis tasks that follow cloud removal have some influence as well. So one has to consider all possible contributing factors in the modelling process.

Our data are single-band satellite images of the same scene sampled at different time points, subject to light to moderate random cloud coverage over various regions. The aim is to recover images without cloud, i.e. clear images revealing the ground scene, so that subsequent analysis such as object detection and tracking can be performed reliably. Therefore fidelity is the most important factor to be considered; in other words, the recovered images must be as close as possible to the truth, not just 'visually fit' (plausible-looking from afar). Unfortunately, there is no objective assessment except visual checking, and one of the goals of this paper is to fill this gap.

We focus on non-deep-learning-based methods for cloud removal, although recent deep learning methods were used as contenders in our empirical studies subject to code availability (e.g. Zheng, Liu, and Wang Citation2021; Sarukkai et al. Citation2020). The reason is that the fidelity of the recovered images is a concern for deep-learning-based methods. The workflow of these methods consists of two steps. The first is to identify cloud-covered areas and remove them. The second is to apply generative models to fill in the removed pixels. Generative adversarial network (GAN) based models are a popular choice for image completion. However, the working mechanism of GANs and their variants relies heavily on the training data, whose distribution is modelled by transforming a specified random distribution, e.g. a uniform distribution or a multivariate Gaussian. Essentially, a GAN is a kind of density estimator. The question then is: what if the scene the satellite sampled never appears in the training data? A GAN will certainly generate something for the missing areas but cannot stretch outside its modelled distribution. Therefore we consider other alternatives, for example, temporal mosaicing (Guo et al. Citation2016, Citation2017). Although enforcing spatial smoothness is its most time-consuming component, fidelity is assured in that no 'alien pixels' are inserted into the images as GAN-based methods do. Another possibility is matrix completion for filling missing pixels, for example Wen et al. (Citation2018) and its later development Zhang et al. (Citation2019). The main model behind these methods is low-rank robust principal component analysis (Emmanuel et al. Citation2011), which emerged from a long line of work on robust PCA (RPCA) (De la Torre and Black Citation2001; Junbin Gao, Guo, and Guo Citation2009) that improves the robustness of classic linear PCA by reducing its sensitivity to outliers.
The elegance of RPCA compared to its peers lies in the simplicity of its formulation as well as its theoretical guarantee for the recovery of the low-rank signal and sparse noise. Applying RPCA implies that the observed images are the sum of low-rank ground images and sparse cloud cover images (images with cloud only, without the ground scene). Such an arrangement makes sense assuming that the ground scene changes little after excluding misalignment and geometric distortion, and that clouds cover only a small portion of the scene. The low-rank condition on the ground component determines how missing pixels are filled, and hence RPCA has better interpretability than GAN methods.

It seems that the aforementioned two-step workflow could be consolidated into a single step using RPCA. Nonetheless the two-step strategy was still adopted (e.g. in Wen et al. Citation2018; Zhang et al. Citation2019), for no obvious reason, in which RPCA is only used for cloud identification and a low-rank matrix completion follows after the cloud-affected areas are masked out. Two questions remain though. Firstly, where is the atmosphere modelled in the image data? The atmosphere is reflected as a thin haze layer in the acquired images, which may not be negligible. Secondly, is the simple additive model in RPCA really the right description of the physics? Apparently not. The most realistic model so far for satellite images is the so-called atmosphere scattering model (Narasimhan and Nayar Citation2002). Therefore one should build the atmospheric effect into the model for cloud removal and ground scene recovery.

2. Models considering atmosphere effects

We first describe RPCA-based methods here in the setting of imagery applications. Let $I_i \in \mathbb{R}^{d_1 \times d_2}$ be the $i$-th sampled image of size $d_1 \times d_2$, $i = 1, \ldots, n$; $D = [\mathrm{vec}(I_1), \ldots, \mathrm{vec}(I_n)]$, where $\mathrm{vec}(X)$ is the vectorization of matrix $X$ into a column vector, and hence $D \in \mathbb{R}^{d \times n}$ ($d = d_1 d_2$). The RPCA model used in Wen et al. (Citation2018) and Zhang et al. (Citation2019) is the following,

(1) $\min_{L,C} \|L\|_* + \lambda \|C\|_1$
s.t. $D = L + C$

where $\|X\|_*$ is the nuclear norm of $X$, i.e. the sum of all singular values of $X$, which is the convex envelope of the matrix rank; $\|X\|_1$ is the $\ell_1$ norm of $X$; $L$ is the initial recovered ground images; and $C$ is the cloud cover images, both the same size as $D$. $\lambda$ is the regularization parameter, usually fixed to $1/\sqrt{d}$ as recommended in Emmanuel et al. (Citation2011). By incorporating group sparsity (defined by super-pixels) into (1), Zhang et al. (Citation2019) claim slightly better performance. After solving (1), both methods proceed to matrix completion with the mask derived from $C$ as follows

(2) $\min_{B,S} \|B\|_* + \alpha \|S_\Omega\|_1 + \beta \|S_{\bar\Omega}\|_1$
s.t. $D = B + S$

where $\Omega$ is the mask matrix of size $d \times n$ with 0's for masked-out elements and 1's for the others, $\bar\Omega$ is the negated version of $\Omega$, i.e. with 0's and 1's flipped, and $S_\Omega$ is the projection of $S$ on $\Omega$, i.e. with the elements indicated by 0's in $\Omega$ masked out. The $ij$-th element of the mask matrix is $[\Omega]_{ij} = 1$ if $[C]_{ij} > \gamma\,\sigma(\mathrm{vec}(C))$ and $[\Omega]_{ij} = 0$ otherwise, where $\sigma(v)$ is the standard deviation of $v$ and $\gamma \in [0,1]$ is a pre-set ratio. $B$ is the final recovered ground images, which are supposed to be cloud free, and $S$ is the noise. In implementation, $\gamma = 0.8$, $\alpha = 0.1/\sqrt{d}$ and $\beta = 1$. Both problems are convex with two blocks of variables. There are many gradient projection based solvers for them under the ADMM (Alternating Direction Method of Multipliers) framework (Boyd and Vandenberghe Citation2004). They all work reasonably well for moderate image sizes, for example, $d_1 = d_2 = 1024$ and $n = 7$.
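As a concrete illustration, the RPCA model (1) can be solved by a standard inexact augmented Lagrangian / ADMM scheme alternating singular value thresholding for $L$ and soft thresholding for $C$. The NumPy sketch below is a minimal implementation under stated assumptions (zero multiplier initialization and the usual $\mu$ schedule), not the exact solver of the cited works; it also builds the mask $\Omega$ used by the follow-up matrix completion step:

```python
import numpy as np

def rpca_admm(D, lam=None, tol=1e-6, max_iter=500):
    """Sketch of an inexact-ALM solver for (1): min ||L||_* + lam*||C||_1 s.t. D = L + C."""
    d, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(d)                    # default recommended value
    norm_D = np.linalg.norm(D)                    # Frobenius norm for the stopping rule
    mu = 1.25 / np.linalg.norm(D, 2)              # spectral-norm-based initial penalty
    rho, mu_max = 1.5, mu * 1e7
    Y = np.zeros_like(D)                          # Lagrange multipliers
    L = np.zeros_like(D)
    C = np.zeros_like(D)
    for _ in range(max_iter):
        # L-update: singular value thresholding of (D - C + Y/mu)
        U, s, Vt = np.linalg.svd(D - C + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # C-update: element-wise soft thresholding
        G = D - L + Y / mu
        C = np.sign(G) * np.maximum(np.abs(G) - lam / mu, 0.0)
        R = D - L - C                             # constraint residual
        Y = Y + mu * R
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(R) / norm_D < tol:
            break
    return L, C

def cloud_mask(C, gamma=0.8):
    """Mask per the text: [Omega]_ij = 1 if [C]_ij > gamma * std(vec(C)), else 0."""
    return (C > gamma * np.std(C.ravel())).astype(float)
```

The mask can then drive the completion step (2) by switching the penalty weight between $\Omega$ and $\bar\Omega$.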

The critical step is (1), where the cloud cover $C$ is supposed to be separated. Note that the decomposition of the observed data $D = L + C$ reflects the basic model assumption. As mentioned earlier, this departs from reality by ignoring the atmosphere effect. So instead of the simple additive model we propose to use atmosphere scattering (Narasimhan and Nayar Citation2002), $D = L \circ (1 - C) + C$, in the modelling, and hence optimize the following

(3) $\min_{L,C} \|L\|_* + \lambda \|C\|_1$
s.t. $D = L \circ (1 - C) + C$
$[L]_{ij} \in [0,1], \; [C]_{ij} \in [0,1]$

where $X \circ Y$ is the element-wise product of matrices $X$ and $Y$ of the same size. In the above formulation, it is assumed that the pixels in the observed images are rescaled to $[0,1]$, which is easily done by dividing by the maximum digital number of the sensor, not the maximum of the observed values. Note that (3) is no longer a convex problem as the equality condition is not affine. It is much more difficult to solve on its own, let alone with the boxed conditions clamping the elements of both $L$ and $C$ within $[0,1]$. Nonetheless, there are still some strategies for the optimization. For example, introducing a dummy variable $X$ to untangle the interaction between $L$ and $C$,

(4) $\min_{L,C,X} \|L\|_* + \lambda \|C\|_1$
s.t. $D = X \circ (1 - C) + C$
$L = X$
$[L]_{ij} \in [0,1], \; [C]_{ij} \in [0,1], \; [X]_{ij} \in [0,1]$

and proceeding with the normal ADMM. However, we observed that this does not converge well enough to be practically useful. Instead, we employ linearization using the primal accelerated proximal gradient method (Kei Pong et al. Citation2010) for its ease in handling the entangled nuclear norm optimization and its stability. The Lagrangian of (3) with proximity is

(5) $\mathcal{L} = \|L\|_* + \lambda \|C\|_1 + \langle Y, D - L \circ (1 - C) - C \rangle$
$\quad + \frac{\mu}{2} \|D - L \circ (1 - C) - C\|_F^2$

leading to

(6) $\mathcal{L} = \|L\|_* + \lambda \|C\|_1 + \frac{\mu}{2} \left\|D - L \circ (1 - C) - C + \frac{Y}{\mu}\right\|_F^2$

by ignoring constants, where $\|X\|_F$ is the Frobenius norm of $X$, $Y \in \mathbb{R}^{d \times n}$ holds the Lagrange multipliers for the equality condition and $\mu > 0$ is the proximity coefficient. Note that (5) and (6) are in proximal form, and the boxed conditions in (3) are ignored at this stage; they will be handled later by feasibility projection after updating all unknowns. Alternating minimization w.r.t. $L$ and $C$ is adopted here. Apparently minimizing $\mathcal{L}$ with respect to $L$ is difficult due to the term $L \circ (1 - C)$, although there is not much trouble for $C$. The subgradients are shown below.

(7) $\partial_C \mathcal{L} = \lambda\, \partial\|C\|_1 - \mu \left(D - L \circ (1 - C) - C + \frac{Y}{\mu}\right) \circ (1 - L)$
(8) $\partial_L \mathcal{L} = \partial\|L\|_* - \mu \left(D - L \circ (1 - C) - C + \frac{Y}{\mu}\right) \circ (1 - C)$

where $\partial\|X\|_1$ and $\partial\|X\|_*$ are the subgradients of the $\ell_1$ norm and the nuclear norm respectively. The stationary point of (7) gives the closed-form solution for $C$

(9) $[C]_{ij} = \begin{cases} 0, & b_{ij} \in [-\lambda, \lambda] \\ \dfrac{\lambda - |b_{ij}|}{a_{ij}}\, \mathrm{sign}(b_{ij}), & \text{otherwise} \end{cases}$

where $a_{ij} = \mu [(L - 1) \circ (L - 1)]_{ij}$, $b_{ij} = \mu [(D - L + \frac{Y}{\mu}) \circ (L - 1)]_{ij}$ and $\mathrm{sign}(x)$ is the sign function of $x$, which takes $1$ when $x > 0$ and $-1$ when $x < 0$. This is straightforward soft thresholding for $\ell_1$ norm minimization. The only difference is that the regularization is not global but local, or adaptive, as the regularization parameter $\lambda$ is rescaled by each $1/a_{ij}$ as shown in (9). However, there is no closed-form solution for $\partial_L \mathcal{L} = 0$, because singular value thresholding (SVT) (Cai, Candès, and Shen Citation2010) only works for the following general form

$\tau\, \partial\|X\|_* + X - A = 0$

where $\tau > 0$ is an arbitrary scalar (normally a regularization parameter) and $A$ is a matrix of the same size as $X$. The solution $X$ to the above is $X = \mathcal{S}_\tau(A)$, where $\mathcal{S}_\tau(A)$ is the so-called SVT operator defined as

(10) $\mathcal{S}_\tau(A) = U\, \mathrm{diag}\big(\max(\sigma_1 - \tau, 0), \ldots, \max(\sigma_n - \tau, 0)\big)\, V^T$

where $U$ and $V$ are from the SVD of $A$, i.e. $A = U \Sigma V^T$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$.
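For reference, the SVT operator (10) is a few lines of NumPy (a minimal sketch):

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding operator S_tau(A) from (10):
    shrink every singular value of A by tau and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Applied to a diagonal matrix with nonnegative entries, this reduces to element-wise shrinkage of the diagonal.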

To work around it, we linearize the smooth part in (6)

(11) $\frac{\mu}{2} \left\|D - L \circ (1 - C) - C + \frac{Y}{\mu}\right\|_F^2 \equiv f_s(L)$

by a first-order Taylor expansion with a proximal term w.r.t. $L_k$, the $k$-th value of $L$ in the iterative optimization for (6), and optimize over $L$ while holding the other variables constant as

$\min_L \|L\|_* + \left\langle \nabla f_s(L)\big|_{L_k}, L - L_k \right\rangle + \frac{p}{2} \|L - L_k\|_F^2.$

In the above, $\nabla f_s(L)\big|_{L_k} = -\mu \left(D - L_k \circ (1 - C) - C + \frac{Y}{\mu}\right) \circ (1 - C)$. $p$ is the Lipschitz constant of $f_s(L)$, which is the operator norm of $\langle \nabla f_s(L)|_{L_k}, \cdot \rangle$, the linear map sending a matrix of the same size as $L$ to $\mathbb{R}$:

$\langle \nabla f_s(L)|_{L_k}, \cdot \rangle : \mathbb{R}^{d \times n} \to \mathbb{R}$
$A \mapsto \langle \nabla f_s(L)|_{L_k}, A \rangle$

It is straightforward to see that $p = 1$ due to the boxed conditions on $C$ and $L$. This leads to

(12) $\min_L \|L\|_* + \frac{p}{2} \left\|L - L_k + \nabla f_s(L)\big|_{L_k} / p\right\|_F^2$

The above linear approximation is very convenient as the interaction between $L$ and $C$ has been removed, and therefore (12) has a closed-form solution using SVT. We apply Nesterov acceleration to speed up the process, which is proven to be convergent for (12) with carefully chosen optimization parameters (Nesterov Citation2003). This iterative procedure for $L$ has to be embedded into the optimization for (6), and hence there are two loops in the entire algorithm. The detailed optimization algorithm for solving (3) is listed in Alg. 1. Note that the boxed conditions are satisfied by clamping in Alg. 1, which is the feasibility projection commonly used in many implementations (Jun Liu and Jieping Citation2009). We call the model in (3) and its realization in Alg. 1 the atmosphere cloud removal model, or ATM for short.

Algorithm 1 Solving the atmosphere scattering model for cloud removal in (3)
Require: $D$, $\lambda$, $\epsilon$
1: $Y = D/\lambda$, $L = 0$, $C = 0$
2: $\rho = 1.5$, $\mu = 1.25/\|D\|$, $\mu_{\max} = \mu \cdot 10^7$
3: while $\|D - C - (1 - C) \circ L\|_F / \|D\|_F > \epsilon$ do
4:  update $C$ using (9)
5:  $C(C < 0) = 0$, $C(C > 1) = 1$ ▷ Feasibility projection for $C$
6:  $\theta = \bar\theta = 1$, $\bar L = L$, $p = 1$
7:  $\bar f = 0$, $f = \infty$, $k_{\max} = 100$
8:  while $k < k_{\max}$ and $|f - \bar f| > 10^{-3}$ do
9:   $W = L + (\theta/\bar\theta - \theta)(L - \bar L)$
10:   $\bar L = L$, $L = \mathcal{S}_{1/(\mu p)}(W)$
11:   $\bar\theta = \theta$, $\theta = \frac{\sqrt{\theta^4 + 4\theta^2} - \theta^2}{2}$
12:   $\bar f = f$, $f = \frac{\mu}{2}\|D - C - (1 - C) \circ L\|_F^2 + \|L\|_*$
13:  end while
14:  $L(L < 0) = 0$, $L(L > 1) = 1$ ▷ Feasibility projection for $L$
15:  $Y = Y + \mu(D - C - (1 - C) \circ L)$
16:  $\mu = \min(\mu\rho, \mu_{\max})$
17: end while
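Transcribed into NumPy, the double-loop structure of Alg. 1 looks roughly as follows. This is a hedged sketch: the inner accelerated step applies SVT to a gradient step from the extrapolated point $W$, and the exact form of the $\theta$/$W$ updates and the multiplier step are assumptions based on standard Nesterov/ADMM schemes rather than a verified reproduction of the authors' code.

```python
import numpy as np

def _svt(A, tau):
    """Singular value thresholding operator."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def atm(D, lam, eps=1e-6, max_outer=100, max_inner=100):
    """Sketch of Alg. 1 (ATM) for (3): D = L∘(1-C) + C with box constraints."""
    Y = D / lam
    L = np.zeros_like(D)
    C = np.zeros_like(D)
    rho = 1.5
    mu = 1.25 / np.linalg.norm(D, 2)
    mu_max = mu * 1e7
    norm_D = np.linalg.norm(D)
    for _ in range(max_outer):
        # C-update: adaptive soft thresholding (9), then feasibility projection
        a = mu * (L - 1.0) ** 2
        b = mu * (D - L + Y / mu) * (L - 1.0)
        C = np.where(np.abs(b) <= lam, 0.0,
                     (lam - np.abs(b)) / np.maximum(a, 1e-12) * np.sign(b))
        C = np.clip(C, 0.0, 1.0)
        # inner loop: accelerated proximal gradient for L on the linearized (6)
        theta = theta_bar = 1.0
        L_bar = L
        p = 1.0
        f_bar, f = 0.0, np.inf
        for _ in range(max_inner):
            if abs(f - f_bar) <= 1e-3:
                break
            W = L + (theta / theta_bar - theta) * (L - L_bar)
            grad = -mu * (D - W * (1 - C) - C + Y / mu) * (1 - C)
            L_bar = L
            L = _svt(W - grad / (mu * p), 1.0 / (mu * p))
            theta_bar = theta
            theta = (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2) / 2
            f_bar = f
            f = (mu / 2 * np.linalg.norm(D - C - (1 - C) * L) ** 2
                 + np.linalg.svd(L, compute_uv=False).sum())
        L = np.clip(L, 0.0, 1.0)       # feasibility projection for L
        R = D - C - (1 - C) * L
        Y = Y + mu * R
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(R) / norm_D < eps:
            break
    return L, C
```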

Due to the double iterative procedure for solving $L$ in ATM, it is expected to be slow. However, the recovered cloud component $C$ is closer to reality than that obtained from RPCA, as shown in Figure 1, where the source images are from the GaoFen-4 satellite captured over the same scene at seven time points. GF-4 is the first Chinese geosynchronous-orbit remote sensing satellite and is equipped with one array camera with a resolution of 50 m for the panchromatic band and 400 m for the MWIR band. The test panchromatic data were captured on 20 August 2016, and the geographic coordinates are 18°09′34″–18°37′27″ north latitude and 108°56′30″–109°48′28″ east longitude (Hainan Province, China). Since a single-field GF-4 satellite image is 10,240 × 10,240 pixels, we cropped the west part of Hainan Province with lots of clouds as the test data.

Figure 1. Cloud detection for GaoFen4 images of the same scene at seven time points. (a) original GF4 images, (b) ATM detected clouds, (c) RPCA detected clouds. Note that all the images are in vector form, including the ones in later sections. Please zoom in to see details in the PDF file of this paper.


It is clear that the ATM-detected clouds are much brighter than those detected by RPCA thanks to its detailed atmosphere model, at the cost of a much higher computational load (see the timing results in Section 3.3). This motivates us to reduce its computational cost while maintaining model capacity. The key is to disentangle the interaction between $L$ and $C$ that breaks the convexity. Let us take a closer look at the core of the ATM model in (3), i.e. $D = L \circ (1 - C) + C$. We decompose $L$ as $L = (1 - P) \circ L + P \circ L$ for $P \in [0,1]$ and $P \geq C$, where $\geq$ means element-wise comparison, i.e. $[P]_{ij} \geq [C]_{ij}$. We proceed using this decomposition

$D = L + C - L \circ C = (1 - P) \circ L + C + (P - C) \circ L.$

Under the choice of $P$, $N \triangleq (P - C) \circ L \geq 0$. The above can be written as

(13) $D = \tilde{L} + C + N$

where $\tilde{L} = (1 - P) \circ L$. We can easily write out an optimization problem equivalent to (3) using (13) with many coupling conditions, which complicate the optimization. However, if we drop some coupling conditions via relaxation and approximation, it becomes much easier to solve, and the coupled problem is still a special case of the relaxed version. So we optimize the following

(14) $\min_{L,C,N} \|L\|_* + \lambda \|C\|_1 + \beta \|N\|_F^2$
s.t. $D = L + C + N$
$[L]_{ij} \in [0,1], \; [C]_{ij} \in [0,1], \; [N]_{ij} \in [0,1]$

Note that in the above $L$ replaces $\tilde{L}$, which is an approximation. We highlight that this is a relaxed version of (3) with its own interpretation: $N$ acts as a thin haze layer accounting for the atmosphere.

In (14) the values in $N$ are controlled by the Frobenius norm. It is well known that the Frobenius norm does not encourage sparsity but compresses the values towards zero uniformly. Depending on the value of $\beta$, $C + N$ can reach so-called $\alpha$-sparsity (Yonina and Kutyniok Citation2012), i.e. sparsity beyond a value $\alpha$. Note that we fix $\beta = 1$ throughout this paper.

Equation (14) is significantly easier to solve than Equation (3), being convex with no interaction terms in the low-rank component. Although a direct generalization of ADMM to more than two blocks of variables, as in (14), may not converge, as discussed in Chen et al. (Citation2014) with hand-crafted counterexamples, in many other applications and in the vast number of experiments we carried out, the optimization converged quite quickly. The detailed optimization algorithm is listed in Alg. 2. We call the model in (14) together with its optimization algorithm in Alg. 2 the alternative ATM, or aATM for short.

Algorithm 2 Solving (14)

Require: $D$, $\lambda$, $\beta$, $\epsilon$
1: $Y = D/\lambda$, $L = 0$, $C = 0$, $N = 0$
2: $\rho = 1.5$, $\mu = 1.25/\|D\|$, $\mu_{\max} = \mu \cdot 10^7$
3: while $\|D - C - L - N\|_F / \|D\|_F > \epsilon$ do
4:  $[C]_{ij} = \mathrm{sign}([C]_{ij}) \max(|[C]_{ij}| - \lambda/\mu, 0)$
5:  $C(C < 0) = 0$, $C(C > 1) = 1$ ▷ Feasibility projection for $C$
6:  $L = \mathcal{S}_{1/\mu}(D - C - N + \frac{Y}{\mu})$
7:  $L(L < 0) = 0$, $L(L > 1) = 1$ ▷ Feasibility projection for $L$
8:  $N = \frac{\mu}{\beta + \mu}(D - L - C + \frac{Y}{\mu})$
9:  $N(N < 0) = 0$, $N(N > 1) = 1$ ▷ Feasibility projection for $N$ (not strictly necessary)
10:  $Y = Y + \mu(D - C - L - N)$
11:  $\mu = \min(\mu\rho, \mu_{\max})$
12: end while
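As a minimal sketch, the single-loop aATM updates of Alg. 2 translate to NumPy as below. The soft threshold in the $C$-step is applied to the ADMM residual $D - L - N + Y/\mu$, which is an assumption about the intended update, and the multiplier step is likewise reconstructed from standard ADMM practice.

```python
import numpy as np

def aatm(D, lam, beta=1.0, eps=1e-6, max_iter=500):
    """Sketch of Alg. 2 (aATM) for (14): D = L + C + N with box constraints."""
    Y = D / lam
    L = np.zeros_like(D)
    C = np.zeros_like(D)
    N = np.zeros_like(D)
    rho = 1.5
    mu = 1.25 / np.linalg.norm(D, 2)
    mu_max = mu * 1e7
    norm_D = np.linalg.norm(D)
    for _ in range(max_iter):
        # C-update: soft thresholding of the residual, then projection to [0, 1]
        G = D - L - N + Y / mu
        C = np.clip(np.sign(G) * np.maximum(np.abs(G) - lam / mu, 0.0), 0.0, 1.0)
        # L-update: singular value thresholding, then projection to [0, 1]
        U, s, Vt = np.linalg.svd(D - C - N + Y / mu, full_matrices=False)
        L = np.clip((U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt, 0.0, 1.0)
        # N-update: ridge-type closed form; the projection is optional
        N = np.clip(mu / (beta + mu) * (D - L - C + Y / mu), 0.0, 1.0)
        R = D - C - L - N
        Y = Y + mu * R
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(R) / norm_D < eps:
            break
    return L, C, N
```

With the $N$ block removed, the loop reduces to the plain RPCA iteration, which makes the relation between (14) and (1) easy to see in code.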

For the regularization parameters, we provide a theoretical analysis of the range of the main regularization parameter $\lambda$ in Appendix A. The results align with the empirical findings presented in the next section. Furthermore, we will present an empirical equation, based on a numerical method, to determine the best value of $\lambda$ as guidance for practical use. Figure 2 shows the flow chart of our method. It is indeed very straightforward. The input stack of images is first normalized by dividing by the maximum digital number of the instrument. Then the regularization parameter $\lambda$ is determined by the procedure provided in Section 3.4. Finally, Algorithm 2 is run to obtain the clean images and the cloud cover. We need to point out that all our models can be used directly to recover ground images, unlike the main contenders (Wen et al. Citation2018; Zhang et al. Citation2019) where a matrix completion (MC) step has to follow, although it is debatable whether MC is necessary. However, without some sort of ground truth, the question cannot be settled and the arguments would be meaningless. To address this long-standing issue, we design a semi-realistic simulation of cloud-covered images so that all methods can be analysed quantitatively.

Figure 2. Flowchart of the proposed method. Note that $L \oplus C \equiv D = L \circ (1 - C) + C$.


3. Quantification of performance

3.1. Simulation and performance indicator

Cloud removal experiments are normally conducted on real satellite images, and the effectiveness of the recovery is evaluated by visual checking, with cloud cover assessed by IoU (Intersection over Union), a measure originating in computer vision (Liggins et al. Citation1997) that is essentially the Jaccard index (Jaccard Citation1912). The ground truth of cloud cover is obtained by time-consuming manual labelling of cloud clusters. Due to the complex nature of cloud, it is extremely difficult to delineate the boundary of cloud clusters accurately, especially for thin clouds, and hence large errors arise when segmenting clouds manually. An ideal solution would be to build a cloud model that captures the shape and formation of all sorts of cloud, thick or thin. Unfortunately this is quite involved in physics and mathematics and is a multi-faceted problem (Dobashi et al. Citation1999, Citation2017; Yuan et al. Citation2014; Yumeihui Xing et al. Citation2017). Even if the cloud cover is known, the other side of the problem, far more important than the cloud, is the ground truth of the ground scene. The ultimate goal of cloud removal is to recover the ground scene accurately, whereas current practice largely relies on subjective evaluation, i.e. 'eye-balling', which is obviously vulnerable to bias. Therefore, an objective and robust evaluation is highly desirable. The work in Zheng, Liu, and Wang (Citation2021) used an overly simplified method to train the U-Net for cloud separation, simulating random strips of white rectangles or brightest-to-darkest colour gradient boxes on top of clear ground images. This is somewhat primitive: not only are these far from real clouds, but most importantly their regular shape reduces the complexity of the problem. Inspired by the success of Perlin noise (Perlin Citation1985, Citation2002) in the simulation of virtual landscapes, we adopt Perlin noise to generate synthetic cloud.
We take a cloud-free image, say from the Inria aerial image labelling dataset (Maggiori et al. Citation0000), convert it to greyscale as the true ground image $I$ (pixels rescaled to $[0,1]$), and generate multiple 2D Perlin noise images of the same size as the image, as $C_i$, $i = 1, \ldots, n$. Then the observed image $I_i$ is

$I_i = C_i + (1 - C_i) \circ I$

where the pixels in $C_i$ are rescaled to $[0,1]$. Optionally one can apply a transformation $f$ to $I$ before combining it with the clouds, e.g. geometric distortion, to study particular aspects of the methods; or generate a base $C$ and apply dynamics to it to produce a cloud time series mimicking cloud movement. We leave these for future work. By varying the parameters of the Perlin noise generator, we can control the density of the generated clouds, from sparse to heavy cover. We also apply some image correction, e.g. Gamma correction and histogram equalization (entirely optional), to enhance the similarity to real clouds and haze.
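A minimal simulation sketch is given below. It uses multi-octave value noise as a simple stand-in for the Perlin noise the paper uses (the generator and its parameters are illustrative assumptions), and combines it with the ground image exactly as $I_i = C_i + (1 - C_i) \circ I$:

```python
import numpy as np

def value_noise(shape, grid=8, octaves=4, rng=None):
    """Multi-octave 2D value noise, a simple stand-in for Perlin noise."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = shape
    cloud = np.zeros(shape)
    amp, total = 1.0, 0.0
    for o in range(octaves):
        g = grid * (2 ** o)
        coarse = rng.random((g + 1, g + 1))       # random lattice for this octave
        # bilinear upsample of the coarse lattice to the full resolution
        ys = np.linspace(0, g, h)
        xs = np.linspace(0, g, w)
        y0 = np.floor(ys).astype(int).clip(0, g - 1)
        x0 = np.floor(xs).astype(int).clip(0, g - 1)
        fy = (ys - y0)[:, None]
        fx = (xs - x0)[None, :]
        c = (coarse[np.ix_(y0, x0)] * (1 - fy) * (1 - fx)
             + coarse[np.ix_(y0 + 1, x0)] * fy * (1 - fx)
             + coarse[np.ix_(y0, x0 + 1)] * (1 - fy) * fx
             + coarse[np.ix_(y0 + 1, x0 + 1)] * fy * fx)
        cloud += amp * c
        total += amp
        amp *= 0.5                                # halve amplitude each octave
    cloud /= total
    # rescale to [0, 1]
    return (cloud - cloud.min()) / (cloud.max() - cloud.min() + 1e-12)

def simulate_observation(I, cloud):
    """Combine the true ground image I with cloud cover: I_i = C_i + (1 - C_i) ∘ I."""
    return cloud + (1.0 - cloud) * I
```

The `grid` and `octaves` parameters play the role of the density controls mentioned above: a coarser grid gives larger cloud clusters, while more octaves add fine wispy detail.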

As the ground truth is readily accessible, we can apply any suitable quantitative evaluation to the cloud removal methods for detailed study. Given the main focus is the fidelity of the recovered image, we use the following to quantify the goodness of recovery

(15) $r = \frac{\|\hat{I}_i - I\|_F}{\|I\|_F}$

where $\hat{I}_i$ is the recovered image from any method. The quantity $r$ defined in (15) is a normalized distance metric, which is not claimed to be the best. More sophisticated measures could certainly be applied. However, (15) is sufficient by virtue of the equivalence of norms (Şuhubi Citation2010, Ch. 6.6), although in the modelling process different norms affect model behaviour vastly.
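Computing (15) is a one-liner; the helper below is the form used in our discussion (the name is illustrative):

```python
import numpy as np

def recovery_error(I_hat, I):
    """Normalized Frobenius distance r from (15)."""
    return np.linalg.norm(I_hat - I) / np.linalg.norm(I)
```

A perfect recovery gives $r = 0$, and $r = 1$ corresponds to an error as large as the image itself.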

3.2. Performance evaluation on simulations on a single image

Thanks to the above semi-realistic simulation, we can now investigate another important aspect, that is the regularization parameters used in the models. Using the goodness of recovery r, we can determine the best values from large-scale randomized trials. Meanwhile we can also verify the necessity of the MC step.

Let us first visually check the outcomes of different methods on one set of simulated images. The true image is from the Inria dataset, named tyrol-w1, from Lienz in the Austrian Tyrol, resized to $1024 \times 1024$ ($d = 2^{20}$). It is a mixture of urban and nature scenes with some high-intensity areas such as roads and roof tops, shown in Figure 3. We simulate seven thin cloud covers. One simulated image and its cloud layer are also shown in Figure 3.

Figure 3. Simulated image. Left to right: the true clear image, one of the simulated images, its cloud cover.


The cloud looks very natural. Note that the cloud cover image appears to be sparse, as large dark areas exist, shown in the histograms in the top panel of Figure 4, where the right one shows details in the range $[0, 0.2]$. However, these values are not exactly zero and correspond to thin haze. If one thresholds them to zero, the cloud cover becomes very artificial visually. The bottom panels in Figure 4 show thresholded results, by 0.1 and 0.2 respectively from left to right. The visible boundaries of clouds are unpleasant and counter-intuitive due to the lack of the critical smoothness commonly present in natural images with cloud. This also shows the tremendous difficulty of manually separating clouds in real images.

Figure 4. Details of a simulated cloud cover.


Figure 5 shows the images recovered by different methods with the regularization parameters set to $\lambda = 1/\sqrt{d} = 1/1024$ and $\beta = 1$ in aATM. Simple visual checking tells us that aATM and RPCA are better than ATM, as the ATM results (with and without MC) contain a fair amount of cloud pixels. This may be straightforward. However, it is not clear which one is the best. It appears that aATM is slightly better for having fewer 'washed away' areas. It is also impossible to identify the effect of MC. These indicate the limits of visual examination. Nonetheless, the $r$ values of these methods are 0.1758, 0.3195, 0.1754 in the order of aATM, ATM, RPCA with MC, and 0.1678, 0.3373, 0.1681 without MC. Now it is clear that aATM without MC is the best and MC does nothing useful to enhance the results.

Figure 5. Recovered images. Top row: simulated images with cloud cover; 2nd-3rd row (in green box): aATM results with/without MC; 4th-5th row (in blue box): ATM results with/without MC; 6th-7th row (in yellow box): RPCA results with/without MC.


Figure 6 shows the cloud covers detected by these methods. The clouds separated by ATM are in better contrast, i.e. very bright and very dark, although it appears very conservative, that is, visually sparser than the others. In contrast, aATM and RPCA seem to have more cloud pixels identified. Again, it is impossible to tell the difference between the aATM cloud and the RPCA cloud by visual examination. Note that for aATM the cloud is the sum of $C$ and $N$.

Figure 6. Cloud covers estimated by various methods. From top: known cloud covers, aATM results, ATM results, RPCA results.


One major benefit of simulation is to study the sensitivity of the regularization parameter, mainly $\lambda$, in the models. We ran large-scale simulations with $n = 7$ and $15$, using the same true image. We tested 51 values of $\lambda$ equally spaced in log scale with the recommended value, $1/\sqrt{d}$, in the middle, i.e. from 9.7656e-05 to 0.0098, and for each value of $\lambda$ we ran 50 randomized trials. The results are collected in Figures 7 and 8, where each data point is the mean and $\pm 1$ standard deviation of the $r$ values across all trials for a given $\lambda$ value.
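The sweep grid is easy to reproduce: 51 log-spaced values spanning one decade either side of the recommended $1/\sqrt{d}$ match the quoted endpoints (for $d = 2^{20}$, $1/\sqrt{d} \approx 9.7656 \times 10^{-4}$, giving 9.7656e-05 to 9.7656e-03 $\approx$ 0.0098). A small sketch, with the one-decade span inferred from those endpoints:

```python
import numpy as np

def lambda_grid(d, n_values=51, decades=1.0):
    """Log-spaced lambda sweep centred on the recommended 1/sqrt(d)."""
    center = 1.0 / np.sqrt(d)
    return np.logspace(np.log10(center) - decades,
                       np.log10(center) + decades, n_values)
```

With an odd number of points the recommended value sits exactly at the middle index.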

Figure 7. Randomised trials (n=7). λ in log scale.


Figure 8. Randomised trials (n=15). λ in log scale.


Many things can be read from the plots. The first is that ATM is not as good as its competitors for small $n$, e.g. $n = 7$, regardless of the choice of $\lambda$. However, it begins to gain an advantage when $n$ is larger. This will be investigated later. The second is that $\lambda$ has roughly three zones: 1) the failure zone, where the sparsity is too weak and all methods fail to recover images; 2) the clamping zone, where the sparsity is overwhelming such that the sparse component is wiped out and all methods lose the capacity to identify cloud; 3) the Goldilocks zone, where the algorithms work reasonably well ($r \leq 0.2$ for $\lambda \in [0.0007, 0.0012]$), including at their best. Of course, these zones have different boundaries for different methods, and the $r$ values inside the zones have different shapes. For example, RPCA has rather flat $r$ values in its Goldilocks zone, meaning that its performance varies only a little if $\lambda$ is drawn from that zone. There exists a value of $\lambda$ that is better than the default recommended value. This holds for all methods, interestingly with different margins. For example, for RPCA the margin is smaller: the optimal value of $\lambda$ brings a 17.23% reduction in $r$ value on average in the $n = 7$ case, while the figure is 42.11% for aATM. Similar observations hold for $n = 15$. When all methods take the default value of $\lambda$, aATM without MC works the best on average, 22.84% better than RPCA in the expectation sense. The overall best performance of aATM against that of RPCA is a 43.06% reduction in $r$ value, down from 0.1625 to 0.0941, which is very significant. This is verified by a one-sided t-test with the null hypothesis of no $r$ value reduction, performed on the trials with the optimal and default $\lambda$ values at significance level $\alpha = 10^{-5}$. The resulting p-value, $1.7557 \times 10^{-29}$, is extremely low, strongly supporting the alternative hypothesis that the reduction is significant.
A very interesting observation is that ATM without MC comes second when $n = 15$ in terms of overall best performance, better than RPCA. Figure 9 reveals the details of the $r$ values of both methods across the trials when holding $\lambda$ constant at $\lambda = 4.6741\mathrm{e}{-4}$, the optimal value for both. The $r$ values of each method vary across the trials due to the randomness of the simulation. RPCA has higher $r$ values almost throughout, with greater variation than ATM. There is no doubt that ATM outperforms RPCA when $\lambda$ is optimal. The third observation is that MC does not bring much improvement and even acts adversely when $\lambda$ is in the Goldilocks zone. This claim is strongly supported by statistical evidence. Table 1 shows the one-sided t-test results performed on the trials of the various methods with optimal $\lambda$ values for both $n = 7$ and $n = 15$. The null hypothesis is that MC brings an $r$ value reduction on average, i.e. that the mean of $r_{\mathrm{MC}} - r_{\overline{\mathrm{MC}}}$ is no greater than 0, where $r_{\mathrm{MC}}$ and $r_{\overline{\mathrm{MC}}}$ are the $r$ values of a method with and without MC respectively. The significance level $\alpha$ is set as low as $10^{-5}$. The p-values are extremely low, suggesting that the null hypothesis should be rejected almost surely. The only exception is ATM when $n = 7$, which does favour MC to further improve its performance. So clearly the recommendation is to omit the MC step in cloud removal for these methods, which is extra computation with little benefit. We point out, though, that there are regularization parameters in MC as well, for which we took the default/recommended values; see the previous sections for details.
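The paired test statistic used here is standard; below is a hypothetical helper (not the authors' code) that computes it from the two sets of trial results:

```python
import numpy as np

def paired_t_statistic(r_mc, r_no_mc):
    """t statistic for the paired one-sided test of H0: mean(r_MC - r_noMC) <= 0.
    Compare against the upper tail of Student's t with n-1 degrees of freedom."""
    diff = np.asarray(r_mc, dtype=float) - np.asarray(r_no_mc, dtype=float)
    n = diff.size
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
```

A large positive $t$ (far beyond the critical value at $\alpha = 10^{-5}$) rejects $H_0$, matching the conclusion that MC does not reduce $r$.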

Figure 9. $r$ values of ATM and RPCA in randomised trials ($n = 15$) with their best $\lambda$ values ($\lambda = 4.6741\mathrm{e}{-4}$ for both). The x-axis is the repeat index from 1 to 50. The coloured horizontal lines are the means of the corresponding $r$ values.


Table 1. The null hypothesis $H_0$ is $r_{\mathrm{MC}} - r_{\overline{\mathrm{MC}}} \leq 0$. Significance level $\alpha = 10^{-5}$ in the t-tests.

To see the comparison more clearly, we present the means and standard deviations of the results over part of the $\lambda$ range (in the Goldilocks zone) in Tables 2 and 3. The double-ruled column in the middle of the tables indicates the values when $\lambda$ equals the default value. The column-wise best (minimum among all methods) is highlighted in italic and the overall best in bold. The tables show clearly that aATM is the best in terms of both expected $r$ value and stability, reflected by its smaller standard deviations.

Table 2. Mean r values of various methods from all trials for some λ values.

Table 3. Standard deviations of r values of various methods from all trials for some λ values.

3.3. Computation cost comparison on simulations using a single fixed image

We now report the computation time. Figure 10 shows the time consumed by various methods, with and without MC. Similar to previous plots, the data points are the means and ±1 standard deviation of the times (in seconds) across all trials for a given λ value. Apparently they vary across simulations.
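For reference, the RPCA baseline timed in this section can be implemented in a few dozen lines. The following is a minimal sketch of the standard inexact-ALM/ADMM solver for min ‖L‖_* + λ‖C‖_1 subject to D = L + C (in the spirit of Candès et al. 2011), not the released Matlab code; the μ schedule and tolerance are common defaults rather than the paper's settings.

```python
import numpy as np

def soft_threshold(X, t):
    """Elementwise soft thresholding: prox of t*||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def svd_threshold(X, t):
    """Singular value thresholding: prox of t*||.||_* (nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def rpca(D, lam=None, tol=1e-7, max_iter=500):
    """Inexact ALM for  min ||L||_* + lam*||C||_1  s.t.  D = L + C."""
    d, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(d, n))       # common default choice
    mu = 1.25 / np.linalg.norm(D, 2)         # initial penalty parameter
    rho, mu_max = 1.5, mu * 1e7              # penalty growth schedule
    Y = np.zeros_like(D)                     # scaled dual variable
    C = np.zeros_like(D)
    for _ in range(max_iter):
        L = svd_threshold(D - C + Y / mu, 1.0 / mu)
        C = soft_threshold(D - L + Y / mu, lam / mu)
        R = D - L - C                        # primal residual
        Y += mu * R
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(R) <= tol * np.linalg.norm(D):
            break
    return L, C
```

A call such as `L, C = rpca(D)` separates the low-rank ground component L from the sparse cloud component C; timing repeated calls over a grid of λ values reproduces the kind of cost curves discussed below.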

Figure 10. Time consumed in randomised trials (n=7). λ in log scale.

It is quite obvious that MC is extra work. Given no extra benefit, the computation for MC should be saved. ATM is indeed difficult to solve, reflected by its skyrocketing computational time compared with the others; there are two nested optimization loops inside its solver. Interestingly, when λ is optimal, ATM takes the most time to compute on average. When λ grows from the failure zone into the Goldilocks zone, a huge jump in required computation can be observed, which is statistically significant. As ATM's performance changes very sharply along λ values, its computation cost varies accordingly, peaking where ATM works best and dropping quickly afterwards. This is a very interesting observation that may lead to a way of selecting its regularization parameter, as well as a hypothesis relating required computational cost to the λ value. Along with the well-known regularization path in sparse models (Hastie et al. Citation2004; Friedman, Hastie, and Tibshirani Citation2010; Tibshirani and Taylor Citation2011), this may be a useful route towards optimal regularization selection in future. This was never possible previously without simulation. In general, aATM is more expensive to compute than RPCA because of the extra block of variables N, almost doubling the cost for all λ values. However, the base is quite small: when λ is in the Goldilocks zone, aATM goes from RPCA's roughly 6 seconds to about 10 seconds. Therefore it is not dramatic.

Again, we present the mean and standard deviation values of the results over a λ range (in the Goldilocks zone) in Tables 4 and 5 for clarity. The middle column of each table, set off by double rules, gives the values when λ equals the default value. The column-wise best (minimum time among all methods) is highlighted in italics and the overall best in bold. The tables show clearly that RPCA is the fastest and aATM is about 50% more expensive to run in this range of λ values. Considering its superiority in recovery performance, this cost is absolutely worthwhile.

Table 4. Means of time consumed by various methods from all trials for some λ values.

Table 5. Standard deviations of time consumed by various methods from all trials for some λ values.

Both aATM and RPCA exhibit the same pattern observed for ATM, but less pronounced. When λ goes from the failure zone into the Goldilocks zone, there is a leap in time cost, which stabilizes for a while and is then followed by some ups and downs. Again, this zone-changing pattern of time cost is a good indicator of entering the Goldilocks zone from the failure zone. It is possible to exploit it for finding a better λ value than the default one, although it is trickier than for ATM, where the pattern is very clear.

3.4. Determining the best λ value

What is the best value for the regularization parameter λ? This is an inevitable and yet critical question in practice. It is almost impossible to address without many assumptions and lengthy theoretic analysis; see Appendix A for the Goldilocks zone bounds to appreciate the complexity. However, thanks to simulation, we can fit the data to derive an equation for the best λ value. Different from drilling into the computational cost pattern suggested by the previous section, we look at the best λ values of different methods by stretching n from 2 to 250. The 'best' is defined as the λ value corresponding to the minimum average r value across trials, which we denote as λ*. Figure 11 shows the ratio of λ* of all methods to the suggested default value 1/√d, i.e. √d λ*. As n becomes larger, √d λ* decreases exponentially. We turn this into an almost linear relationship by applying log twice to n, as shown in the right panel. From this data, we fit a linear model and derive the following λ estimator

Figure 11. √d λ* vs. n = 2 ∼ 250.

(16) λ̂ = (−0.5682 log(log(n)) + 1.0747)/√d,  n ≥ 2 (16)

The red curves in Figure 11 are the values of √d λ̂ at different scales of n.

Remark 3.1. λ̂ is estimated jointly for all methods. However, it is possible to derive an estimator for each individual method; ATM may be disadvantaged, as its fit is not as good as the others'. λ̂ should be lower bounded by the minimum value of λ in Lemma A.1, so

λ̂ = max{(−0.5682 log(log(n)) + 1.0747)/√d, 1/√(dn)}

prevents λ from being too small. Finally, this λ̂ is empirical and approximate, with no model assumption. More sophisticated regression methods are possible.
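As a concrete illustration, the estimator is trivial to evaluate in code. The sketch below codes Eq. (16) with the Lemma A.1 floor; the coefficients 0.5682 and 1.0747 come from the paper's fit (as reconstructed here), and the example values of d and n follow the experiments.

```python
import math

def lambda_hat(d, n):
    """Empirical regularization estimator of Eq. (16),
    floored by the minimum value 1/sqrt(d*n) from Lemma A.1."""
    assert n >= 2
    fitted = (1.0747 - 0.5682 * math.log(math.log(n))) / math.sqrt(d)
    return max(fitted, 1.0 / math.sqrt(d * n))

d = 2**20                 # a 1024 x 1024 image, as in the experiments
print(lambda_hat(d, 15))  # close to the optimal 4.6741e-04 found by simulation
```

For very large n the fitted line would go negative, at which point the 1/√(dn) floor takes over.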

3.5. Performance evaluation on simulations on multiple images

Now we are ready for more comprehensive tests. One last question is how these evaluations hold across different n (image sequence length) and different scenes. To this end, we picked three other images from the Inria data set, chicago1, kitsap1 and vienna1, and ran the same randomised trials with n = 5, 7, 10, 12, 15, 20, each with 50 repeats. The ground truth images are displayed in Figure 12. In this experiment, we bring in the state-of-the-art deep learning methods of (Sarukkai et al. Citation2020) (called STGAN+Resnet and STGAN+Unet) and (Zheng, Liu, and Wang Citation2021) (called UNET and UNET+GAN) for a thorough comparison. STGAN provides two variants using Resnet and Unet backbone networks. UNET separates cloud and ground only, while UNET+GAN uses a GAN to fill areas covered by thick cloud. The training of these deep learning models strictly followed the procedures in their code repositories and was optimized for best performance as per the instructions. For our models and RPCA, λ was automatically determined by (16) for different n.

Figure 12. Four base images for randomised trials. (a) tyrol-w1, (b) chicago1, (c) kitsap1 and (d) vienna1.

Figure 13 visualises all r values in one place, where the heights of the bars are the means and the error bars on top show the standard deviations calculated from multiple trials. The results of the same method are grouped together, with differently coloured bars showing the results for different n. The overall impression is that the deep learning methods are not as good, although they have quite stable performance across different n values. They may have some advantage when n is small, say n=5. STGAN is better than RPCA on kitsap1, although no match for aATM and ATM when n>7. The deep learning methods have large performance variations across different scenes, while the others are rather consistent. All sparse models achieve better r values as n grows larger. This suggests that a strategy to boost performance is to increase the sampling frequency moderately. It makes perfect sense, as more images provide more information for the missing pixels covered by cloud, and it is more likely that areas covered in one image are not covered in another. The rank minimization in aATM/ATM/RPCA is designed to fully utilize this. That MC is unnecessary is once again verified in this test: the add-on value of MC is only observable for ATM when n is small, i.e. n<15.

Figure 13. Randomised trials for various base images (in the order shown in Figure 12) and sequence length n. The legend shows n.

The above observations provide clear answers to the questions raised at the beginning of this paper and justify our motivation. Deep learning methods in general have lower fidelity (higher r values). This uncertainty poses many questions for subsequent applications. They may perform well on some specific scenes, for example a purely natural scene like kitsap1. However, it is not clear how the GAN's distribution transformation works, and the large performance variation reveals their problems in dealing with different situations. Our models do not have these issues and are interpretable in terms of their working mechanism.

4. Conclusions

In this paper, we introduced the atmosphere scattering model into the cloud removal modelling process and proposed two ATM models as superior alternatives to the RPCA-based model. Furthermore, we proposed a method to simulate controllable cloud cover scenes. This semi-realistic simulation enables detailed study of various cloud removal methods and provides valuable insights into several aspects of the algorithms, as large-scale randomised trials and quantitative analysis become available. Examining the methods with this powerful experimental tool, we saw clearly that the proposed aATM outperforms, by quite a large margin, not only the RPCA model, the state of the art in this category of non-deep-learning-based cloud removal methods, but also the latest deep learning models built on large-scale backbone networks. There were many interesting findings in this process, for example the zoning of the regularization parameter λ, the computational cost pattern across zones, automated regularization parameter determination and so on. These would be out of the question without the assistance of the simulation. We envisage robust development of cloud removal algorithms under this framework in the near future. For this purpose, we released the Matlab code at https://github.com/yguo-wsu/clouldremoval.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  • Şuhubi, E. S. 2010. Functional Analysis. Dordrecht: Springer.
  • Bai, Z. D., and Y. Q. Yin. 1993. “Limit of the Smallest Eigenvalue of a Large Dimensional Sample Covariance Matrix.” Annals of Probability 21 (3, July). doi:10.1214/aop/1176989118.
  • Boyd, S., and L. Vandenberghe. 2004. Convex Optimization. Cambridge: Cambridge University Press.
  • Cai, J.F., E. J. Candès, and Z. Shen. 2010. “A Singular Value Thresholding Algorithm for Matrix Completion.” SIAM Journal on Optimization 20 (4, January): 1956–1982. doi:10.1137/080738970.
  • Chen, C., B. He, Y. Ye, and X. Yuan. 2014. “The Direct Extension of ADMM for Multi-Block Convex Minimization Problems is Not Necessarily Convergent.” Mathematical Programming: 1–23. Springer.
  • De la Torre, F., and M. J. Black. 2001. “Robust Principal Component Analysis for Computer Vision.” In Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), Vol. 1, 362–369. Vancouver, BC, Canada: IEEE Comput. Soc.
  • Dobashi, Y., K. Iwasaki, Y. Yue, and T. Nishita. 2017. “Visual Simulation of Clouds.” Visual Informatics 1 (1, March): 1–8. doi:10.1016/j.visinf.2017.01.001.
  • Dobashi, Y., T. Nishita, H. Yamashita, and T. Okita. 1999. “Using Metaballs to Modeling and Animate Clouds from Satellite Images.” The Visual Computer 15 (9, December): 471–482. doi:10.1007/s003710050193.
  • Candès, E. J., X. Li, Y. Ma, and J. Wright. 2011. “Robust Principal Component Analysis?” Journal of the ACM 58 (3, June): 11:1–11:37.
  • Friedman, J. H., T. Hastie, and R. Tibshirani. 2010 February. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1):1–22. doi:10.18637/jss.v033.i01.
  • Guo, Y., L. Feng, P. Caccetta, and D. Devereux. 2017. “Multiple Temporal Mosaicing for Landsat Satellite Images.” Journal of Applied Remote Sensing 11 (1, March): 015021. doi:10.1117/1.JRS.11.015021.
  • Guo, Y., L. Feng, P. Caccetta, D. Devereux, and M. Berman. 2016. “Cloud Filtering for Landsat TM Satellite Images Using Multiple Temporal Mosaicing.” In 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 7240–7243. Beijing, China: IEEE.
  • Hastie, T., S. Rosset, R. Tibshirani, and J. Zhu. 2004. “The Entire Regularization Path for the Support Vector Machine.” Journal of Machine Learning Research 5: 1391–1415.
  • Jaccard, P. 1912. “THE DISTRIBUTION OF THE FLORA IN THE ALPINE ZONE.1.” The New Phytologist 11 (2, February): 37–50. doi:10.1111/j.1469-8137.1912.tb05611.x.
  • Gao, J., P. W. Kwan, and Y. Guo. 2009. “Robust Multivariate L1 Principal Component Analysis and Dimensionality Reduction.” Neurocomputing 72 (4–6): 1242–1249. doi:10.1016/j.neucom.2008.01.027.
  • Liu, J., S. Ji, and J. Ye. 2009. SLEP: Sparse Learning with Efficient Projections. Technical report, Arizona State University.
  • Pong, T. K., P. Tseng, S. Ji, and J. Ye. 2010. “Trace Norm Regularization: Reformulations, Algorithms, and Multi-Task Learning.” SIAM Journal on Optimization 20 (6): 3465–3489. doi:10.1137/090763184.
  • Liggins, M. E., C. Y. Chong, I. Kadar, M. G. Alford, V. Vannicola, and S. Thomopoulos. 1997. “Distributed Fusion Architectures and Algorithms for Target Tracking.” Proceedings IEEE 85 (1): 95–107. doi:10.1109/JPROC.1997.554211.
  • Maggiori, E., Y. Tarabalka, G. Charpiat, and P. Alliez. 2017. “Can Semantic Labeling Methods Generalize to Any City? The Inria Aerial Image Labeling Benchmark.” In IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE.
  • Narasimhan, S. G., and S. K. Nayar. 2002. “Vision and the Atmosphere.” International Journal of Computer Vision 48 (3, July): 233–254. doi:10.1023/A:1016328200723.
  • Nesterov, Y. 2003. “Introductory Lectures on Convex Optimization: A Basic Course.“ Applied Optimization, edited by Pardalos Panos M., and W. Hearn Donald, Vol. 87. Kluwer Academic Publishers.
  • Perlin, K. 1985. “An Image Synthesizer.” ACM SIGGRAPH Computer Graphics 19 (3, July): 287–296. doi:10.1145/325165.325247.
  • Perlin, K. 2002. “Improving Noise.” ACM Transactions on Graphics 21 (3, July): 681–682. doi:10.1145/566654.566636.
  • Sarukkai, V., A. Jain, B. Uzkent, and S. Ermon. 2020. “Cloud Removal from Satellite Images Using Spatiotemporal Generator Networks.” In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Snowmass, CO, USA.
  • Tao, T., and V. Vu. 2010. “Random Matrices: The Distribution of the Smallest Singular Values.” Geometric and Functional Analysis 20 (1, June): 260–297. doi:10.1007/s00039-010-0057-8.
  • Tibshirani, R., and J. Taylor. 2011. “The Solution Path of the Generalized Lasso.” Annals of Statistics 39 (3): 1335–1371. doi:10.1214/11-AOS878.
  • Vershynin, R. 2012. “Introduction to the Non-Asymptotic Analysis of Random Matrices.” In Compressed Sensing, edited by Y. C. Eldar and G. Kutyniok, 210–268. Cambridge: Cambridge University Press.
  • Wen, F., Y. Zhang, Z. Gao, and X. Ling. 2018. “Two-Pass Robust Component Analysis for Cloud Removal in Satellite Image Sequence.” IEEE Geoscience and Remote Sensing Letters 15 (7, July): 1090–1094. doi:10.1109/LGRS.2018.2829028.
  • Xing, Y., J. Duan, Y. Zhu, and H. Wang. 2017. “Three-Dimensional Particle Cloud Simulation Based on Illumination Model.” In LIDAR Imaging Detection and Target Recognition 2017, edited by Y. Lv, J. Su, W. Gong, J. Yang, W. Bao, W. Chen, Z. Shi, J. Fei, S. Han, and W. Jin, 74. Changchun, China: SPIE.
  • Eldar, Y. C., and G. Kutyniok, eds. 2012. Compressed Sensing: Theory and Applications. Cambridge: Cambridge University Press.
  • Yuan, C., X. Liang, S. Hao, Q. Yue, and Q. Zhao. 2014. “Modelling Cumulus Cloud Shape from a Single Image: Modelling Cumulus Cloud Shape from a Single Image.” Computer Graphics Forum 33 (6, September): 288–297. doi:10.1111/cgf.12350.
  • Zhang, Y., F. Wen, Z. Gao, and X. Ling. 2019. “A Coarse-To-Fine Framework for Cloud Removal in Remote Sensing Image Sequence.” IEEE Transactions on Geoscience and Remote Sensing 57 (8, August): 5963–5974. doi:10.1109/TGRS.2019.2903594.
  • Zheng, J., X.Y. Liu, and X. Wang. 2021. “Single Image Cloud Removal Using U-Net and Generative Adversarial Networks.” IEEE Transactions on Geoscience and Remote Sensing 59 (8, August): 6371–6385. doi:10.1109/TGRS.2020.3027819.

Appendix

A Analysis of regularisation parameter λ

In this section, we focus on the theoretical analysis of the regularisation parameter λ in the models, in particular its valid range. We first have the following lemma on the extreme values of λ.

Lemma A.1.

For any given data D ∈ R^{d×n}, assuming d > n, the minimum and maximum values for λ in the ATM, aATM and RPCA models are 1/√(dn) and √n respectively. The extrema are in the sense of bounds for the models to generate non-trivial solutions.

Proof.

The optimality condition of both the RPCA and ATM models requires

0 ∈ ∂‖L*‖_* + λ ∂‖C*‖_1

leading to

(A.1) UV^T + W + λ sign(C*) = 0 (A.1)

where the superscript star * stands for the optimal value, L* = UΣV^T is the skinny SVD of L* (i.e. U ∈ R^{d×l} = [u_1, …, u_l], V ∈ R^{n×l} = [v_1, …, v_l], and Σ = I_l, the identity matrix of size l×l), and W ∈ R^{d×n} is any matrix satisfying

‖W‖_2 ≤ 1, U^T W = 0, W V = 0.

It is easy to see that W = U_⊥ E V_⊥^T, where A_⊥ denotes the complementary components in the ambient space orthogonal to matrix A, and E is a diagonal matrix with all elements e_j ∈ [0, 1], so as to satisfy ‖W‖_2 ≤ 1. Rewriting Eq. (A.1), we are seeking

(A.2) inf_λ { λ ≥ 0 : Σ_{i=1}^{l} H_i + Σ_{j>l}^{n} e_j H_j = −λ sign(C*) } (A.2)

where H_i = u_i v_i^T. The H_i's are orthogonal to each other and unitary in terms of the Frobenius norm. Equation (A.2) shows that the subgradient of ‖L‖_* at L* is clamped by λ regardless of C*, meaning the elements on the left hand side of (A.2) have to lie in the range [−λ, λ], a box condition. The largest norm within the λ box is attained at one of its corners; without loss of generality, we can choose the corner in the first orthant. By the Pythagorean theorem, adding an orthogonal component to a vector only increases its norm. Hence, to find the smallest λ for which the box condition can still be met, one seeks the subgradient with the smallest norm, which in (A.2) corresponds to letting l = 1 and setting e_j = 0 for all j. This is equivalent to choosing U and V to be the vectors with all elements equal to 1/√d and 1/√n respectively, and letting W = 0. In this case, UV^T + W = (1/√(dn)) 1, where 1 is the matrix of all ones with compatible dimensions. This is the smallest norm the subgradient of ‖L‖_* can fit into the λ box. Therefore the infimum in (A.2), i.e. what is required in this lemma, is 1/√(dn).

Similarly, the maximum is

(A.3) sup_λ { λ ≥ 0 : Σ_{i=1}^{l} H_i + Σ_{j>l}^{n} e_j H_j = −λ sign(C*) }. (A.3)

The only difference is that one now has to consider all possibilities, i.e. the maximum of the norm. Therefore, (A.3) is equivalent to

max ‖Σ_{i=1}^{l} H_i + Σ_{j>l}^{n} e_j H_j‖_F = √(tr[V_n U_n^T U_n V_n^T]) = √n

where U_n = [u_1, …, u_n] and V_n likewise.

Remark A.2. From the above we can see that, when λ < 1/√(dn), the only allowed solution is to nullify the elements in L, in which case L = 0 and C = D. This is what we observed in the experiments: when λ is very small, the r value is 1 as L = 0. aATM has the same result, although it has another regularisation term, because the infimum of λ is attained only when β = ∞; otherwise that term would further reduce the value of λ.

This matches the purpose of regularisation. When λ is too small, the penalty on sparsity is next to null, hence the sparse component is free. The sensible choice is then to set the low rank component to zero, so that the objective is quite small, although this is a trivial solution. Similar logic applies to the maximum value of λ.

Note that the extreme values of λ deduced in Lemma A.1 hold for general cases, in other words, under no specific conditions. The minimum value of λ is very close to the recommended value of λ in RPCA, while the maximum is rather loose due to this generality. Actually, we can obtain the following tighter upper bound of λ.

Lemma A.3.

For a given D ∈ R^{d×n}, assuming d > n, the maximum value for λ in ATM, aATM with β = ∞ and the RPCA model, written as λ_m, is ‖UV^T‖_∞, where U and V are from the skinny SVD of D = UΣV^T and ‖A‖_∞ is the matrix infinity norm, i.e. the maximum absolute value of all its elements.

Proof.

The proof is similar to that of Lemma A.1, except that now we consider only one data set D, i.e.

(A.4) sup_λ { λ ≥ 0 : the elements of Σ_{i=1}^{l} H_i lie in [−λ, λ] }, (A.4)

as the sparse component C is totally zero. Σ_{i=1}^{l} H_i = UV^T, and all its elements are surely no greater than ‖UV^T‖_∞ in absolute value. Therefore when λ = ‖UV^T‖_∞, the condition in (A.4) holds. This is equivalent to setting e_j = 0 for all j in (A.3).

The upper bound of λ in Lemma A.3 is much better than that in Lemma A.1, especially when d ≫ n and the elements of D lie in [0, 1]. However, it is possible to further quantify λ_m = ‖UV^T‖_∞ without computing the actual SVD. We give asymptotic results on the upper bound of ‖UV^T‖_∞ and hence λ_m. To proceed, we need the following proposition to bound ‖UV^T‖_∞.

Proposition A.4.

For any matrix A of size d×n (d > n) with skinny SVD A = UΣV^T, the following holds:

‖UV^T‖_∞ ≤ ‖A‖_∞ / σ_n(A)

where σ_i(A) is the i-th largest singular value of A, so that σ_n(A) is the smallest singular value of A.

Proof.

Following the same line of reasoning as in the previous lemmas, we see that ‖UV^T‖_F ≤ √n bounds the Frobenius norm of matrices spanned by the same bases. In other words, for any matrix X of size d×n spanned by the same bases, if ‖X‖_F ≥ √n, then ‖X‖_∞ ≥ ‖UV^T‖_∞. Also ‖Σ_i σ_i u_i v_i^T‖_F² = Σ_i σ_i². Therefore, if σ_i ≥ 1 for all i, then ‖Σ_i σ_i u_i v_i^T‖_F ≥ √n. Combining this observation with the fact that Ã = A/σ_n(A) = U(Σ/σ_n)V^T, we obtain the claim in this proposition, as we have ‖Ã‖_∞ ≥ ‖UV^T‖_∞ since σ_1(Ã) ≥ ⋯ ≥ σ_n(Ã) = 1.
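Proposition A.4 is easy to probe numerically. The snippet below is a verification sketch, not part of the paper's experiments: it draws tall uniform random matrices and checks that ‖UV^T‖_∞ stays below ‖A‖_∞/σ_n(A) (the dimensions are chosen arbitrarily for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(10):
    d, n = 200, 100
    A = rng.uniform(0.0, 1.0, (d, n))                 # tall matrix, d > n
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # skinny SVD
    lhs = np.abs(U @ Vt).max()                        # ||U V^T||_inf
    rhs = np.abs(A).max() / s[-1]                     # ||A||_inf / sigma_n(A)
    assert lhs <= rhs
print("Proposition A.4 bound held on all samples")
```

The margin between the two sides shrinks as n/d decreases, which is why the asymptotic analysis that follows is needed to turn the bound into a usable number.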

Now we treat the observed image matrix D as a random matrix whose elements are i.i.d. from the uniform distribution on [0, 1], and hence we do not assume any further structure. We are then concerned with the smallest singular value σ_n(D) of a non-central random matrix, i.e. one whose mean is non-zero. There are limited results on smallest singular values; the closest is (Tao and Vu Citation2010), which deals with centralised random matrices. Fortunately, we only need a lower bound on σ_n(D). We use the following theorem from (Bai and Yin Citation1993), which also appeared in (Vershynin Citation2012).

Theorem A.5

(Bai-Yin’s law). Let A be a d×n random matrix whose entries are independent copies of a random variable with zero mean, unit variance, and finite fourth moment. Suppose that the dimensions d and n grow to infinity while the aspect ratio n/d converges to a constant in [0,1]. Then

σ_n(A) = √d − √n + o(√n),  σ_1(A) = √d + √n + o(√n)

almost surely.

Note that Bai-Yin's law makes no distributional assumption other than centrality. We then write D = A + 0.5E, where A's elements are i.i.d. from the uniform distribution on [−0.5, 0.5] and E is the all-one matrix with compatible dimensions. We use the Courant-Fischer minimax characterisation of singular values to obtain the bound

σ_k(A) = max_{dim(S)=k} min_{‖v‖_2=1, v∈S} ‖Av‖_2

where S is any subspace of R^n. This leads to

σ_n(A) = min_{‖v‖_2=1, v∈R^n} ‖Av‖_2

as S is just R^n. We have the following lemma tailored to the non-central uniform distribution.

Lemma A.6.

Let A be a d×n random matrix whose entries are independent copies of a uniformly distributed random variable with mean ε and unit variance. Suppose that the dimensions d and n grow to infinity while the aspect ratio n/d converges to a constant in [0, 1]. Then

σ_n(A) ≥ √d − √n + o(√n)

almost surely.

Proof.

Let E be the all-one matrix with compatible dimensions; then A = Ã + εE, where Ã satisfies the conditions in Theorem A.5. We have

σ_n(A)² = min_{‖v‖_2=1, v∈R^n} ‖(Ã + εE)v‖_2²
≥ min_{‖v_1‖_2=1, v_1∈R^n} ‖Ã v_1‖_2² + min_{‖v_2‖_2=1, v_2∈R^n} ‖εE v_2‖_2² + 2ε min_{‖v_3‖_2=1, v_3∈R^n} v_3^T Ã^T E v_3.

Note we use v_j, j = 1, 2, 3, to highlight that these minimisations are separate, hence the inequality holds. Ã^T E in the third term gives

Ã^T E = (s_1, s_2, …, s_n)^T 1_n^T

where s_i = Σ_{j=1}^{d} ã_{ji}, ã_{ji} is the (j,i)-th element of Ã, and 1_n is the all-one vector of length n. As the ã_{ji}'s are drawn from a centralised population, under the asymptotic condition s_i/d → 0 almost surely and hence the third term vanishes. The first two terms are the squares of the smallest singular values of the corresponding matrices, i.e. Ã and εE. Since σ_i(E) = 0 for i ≠ 1, we obtain the required inequality by Theorem A.5.

Combining Lemma A.6 and Proposition A.4, we have the following corollary.

Corollary A.7.

Assume the elements of the data matrix D of size d×n are uniformly distributed in [0, 1] and its size grows asymptotically as in Lemma A.6. Given the conditions in Lemma A.3, the maximum value for λ almost surely satisfies

(A.5) λ_m ≤ 2√3/(√d − √n + o(√n)) (A.5)

Proof.

It is simply the rescaled result of Lemma A.6, recognising that the standard uniform distribution has variance 1/12 and that ‖D‖_∞ = 1.

Remark A.8. Although λ_m in Corollary A.7 is an asymptotic result, as the images are quite large, say 2^10 × 2^10 in our experiments, i.e. d = 2^20, the bound on λ_m is quite good. In practice, n is relatively small, typically of the order of 10, so n ≪ d. Therefore we can further simplify (A.5) to

λ_m ≈ 2√3/√d.

That is what we see in the experiments: when λ is too large, precisely larger than 0.0032 as shown in the figure, the sparse component is erased, i.e. C = 0. In this case, our theory predicts λ_m = 2√3/√(2^20) ≈ 0.0034, very close to our observation. When C = 0, L = D, i.e. the observed images with clouds. The variations we see in the figures are due to the simulated clouds.
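A quick numeric sanity check of the ingredients of Remark A.8 is straightforward. The sketch below (a verification aid, not from the paper) confirms Bai-Yin's prediction for the smallest singular value of a standardised centred uniform matrix, then evaluates the simplified ceiling 2√3/√d for d = 2^20; the matrix dimensions are arbitrary illustration choices.

```python
import math
import numpy as np

# Bai-Yin: sigma_n(A) ~ sqrt(d) - sqrt(n) for zero-mean unit-variance entries.
rng = np.random.default_rng(0)
d, n = 2000, 100
A = math.sqrt(12) * rng.uniform(-0.5, 0.5, (d, n))   # unit-variance entries
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
predicted = math.sqrt(d) - math.sqrt(n)
print(f"sigma_n = {sigma_min:.2f}, Bai-Yin prediction = {predicted:.2f}")

# Remark A.8: simplified upper bound for lambda with d = 2**20 pixels.
lam_max = 2 * math.sqrt(3) / math.sqrt(2**20)
print(round(lam_max, 4))  # ~0.0034, matching the observed 0.0032 threshold
```

Even at these moderate sizes the empirical smallest singular value lands close to the asymptotic prediction, which is why the bound works well at the image sizes used in the experiments.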

B Deep learning models training details

We describe how the deep learning models were trained. Basically, we strictly followed the training instructions from the original papers (Zheng, Liu, and Wang Citation2021) and (Sarukkai et al. Citation2020).

B.1 U-Net+GAN

We used the training data from the original paper. We use the cloudless aerial imagery data set from the National Institute for Research in Computer Science and Automation in France (INRIA) as the ground image dataset, and the greyscale infrared image data set from the National Oceanic and Atmospheric Administration (NOAA) as the cloud image dataset. A cloud covered satellite image was obtained by combining a clear (cloud free) ground image with a cloud-only image via the atmospheric scattering model, as follows:

  1. Clear ground images were taken from the INRIA aerial image data set, which contains 180 training images. Each image has 5000 × 5000 pixels with a spatial resolution of 0.3 m. As directed in (Zheng, Liu, and Wang Citation2021), each ground image was cropped into non-overlapping patches of 256 × 256 pixels. Therefore, each ground image results in 361 non-overlapping patches of size 256 × 256. Finally, a dataset of 64,980 ground images was generated.

  2. Following the instructions in (Zheng, Liu, and Wang Citation2021), we collected 150 cloud images from 2018 from NOAA to form a training set. These are multispectral (RGB + infrared) images with a resolution of 1920 × 1080 pixels. They were preprocessed into greyscale images. After preprocessing, each cloud image was downsampled and cropped into 16 non-overlapping patches of size 256 × 256. Finally, a dataset of 900 cloud images was obtained.

  3. Cloud and clear images were randomly combined using the atmospheric scattering model to form cloud covered image data, with a total of 64,980 images for model training.
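The data construction in steps 1-3 can be sketched as follows. This is an illustrative reimplementation, not the released code: the patch extraction mirrors step 1, and the compositing uses a simple scattering blend I = t·J + (1 − t)·A, where J is the ground patch, t a transmission map derived from the cloud patch, and A the airlight; the constant airlight A = 1.0 and the mapping from cloud greyscale to transmission are our assumptions.

```python
import numpy as np

def to_patches(img, p=256):
    """Crop an image into non-overlapping p x p patches (step 1).
    A 5000 x 5000 image yields 19 * 19 = 361 patches."""
    h, w = img.shape[:2]
    return [img[i:i + p, j:j + p]
            for i in range(0, h - p + 1, p)
            for j in range(0, w - p + 1, p)]

def composite(ground, cloud, airlight=1.0):
    """Blend a ground patch with a cloud patch via the scattering model.
    ground and cloud are in [0, 1]; thicker cloud -> lower transmission."""
    t = 1.0 - cloud                      # assumed transmission map
    return t * ground + (1.0 - t) * airlight

# Hypothetical stand-ins for an INRIA ground image and a NOAA cloud patch.
rng = np.random.default_rng(0)
ground_img = rng.uniform(0.0, 1.0, (5000, 5000))
patches = to_patches(ground_img)         # 361 patches of 256 x 256
cloudy = composite(patches[0], rng.uniform(0.0, 1.0, (256, 256)))
print(len(patches), cloudy.shape)
```

Randomly pairing the 64,980 ground patches with the 900 cloud patches in this way yields the 64,980 training images described above.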

Finally, the model structure and hyperparameters, including optimization parameters, were set as required in (Zheng, Liu, and Wang Citation2021).

B.2 STGAN

We used the training data from the original paper (Sarukkai et al. Citation2020). The dataset was constructed using publicly available real Sentinel-2 images. When training STGAN, a total of 3130 sets of 256 × 256 image pairs were constructed; each set was obtained by pairing one cloudless ground image with three cloud covered images of the same area at different times. The training, validation and test datasets were split according to the ratio 8:1:1, giving 2504, 313 and 313 image groups respectively.

Once again, the model structure and hyperparameters, including optimization parameters, were set as required in (Sarukkai et al. Citation2020).