1,869
Views
2
CrossRef citations to date
0
Altmetric
Research Article

Non-normal Data Simulation using Piecewise Linear Transforms

ORCID Icon &

ABSTRACT

We present PLSIM, a new method for generating nonnormal data with a pre-specified covariance matrix that is based on coordinate-wise piecewise linear transformations of standard normal variables. In our presentation, the piecewise linear transforms are chosen to match pre-specified skewness and kurtosis values for each marginal distribution. We demonstrate the flexibility of the new method, and an implementation using R software is provided.

It is well known that multivariate normally distributed data are rare in the social sciences (Cain et al., Citation2017; Micceri, Citation1989). Nevertheless, statistical estimation procedures and inferences that are based on multivariate normality are routinely used in data analysis in the educational and behavioral sciences. The study of whether this practice leads to approximately valid inference, i.e. whether methods based on the normality assumption are robust to non-normality, is most often based on a simulation design. Such simulation studies are thus important in evaluating various statistical procedures in structural equation modeling (SEM) and in multivariate statistics in general. A main concern in the context of SEM is to be able to generate random samples from a distribution whose covariance matrix is controlled. In addition, most simulation techniques control some other aspects of the simulated data, such as skewness and kurtosis.

In the present study, we introduce a new and flexible simulation technique for non-normal data that matches a pre-specified population covariance matrix, and which also allows researchers to specify values for some univariate moments. Our approach is based on transforming univariate normal variables using piecewise linear (PL) functions.

Let us limit our attention to PL functions H(x) that are continuous, so that their graphs consist of a finite number of line segments that are glued together at the end points. depicts one such PL function, where four line segments with different slopes are joined together. Now, assume Z is a standard normal variable, ZN(0,1). Consider the random variable Y:=H(Z), which is generally non-normal. Since Y is based on two analytically simple and well-known concepts, that of a standard normal variable and that of piecewise linearity, many aspects of the distribution of X are amenable to analytical and computational treatment. For instance, there are exact and computationally tractable formulas for the mean, variance, skewness, and kurtosis of Y. The same tractability holds for the bivariate case. That is, define Y1:=H(Z1) and Y2:=H(Z2), where (Z1,Z2) is a bivariate normal vector. Then, as outlined below, we may use straightforward formulas to calculate the covariance between Y1 and Y2. In the following, we refer to the piecewise linear simulation approach as PLSIM.

Figure 1. Graph of the continuous piecewise linear function H1(x)

Figure 1. Graph of the continuous piecewise linear function H1(x)

This article is organized as follows. We next review simulation techniques for non-normal data with pre-specified population covariance matrix. We then present our method formally, and include some illustrations in this discussion. A data illustration is thereafter given. We finally discuss strengths and limitations of PLSIM. Some R (R Core Team, Citation2020) code is provided in the text, and complete R code is provided in the supplemental material.

Simulating multivariate data with pre-specified covariance matrix

Given the importance of variances and covariances in factor analysis and SEM, it is not surprising that several methods have been proposed for drawing random samples from multivariate distributions whose covariance matrix is fixed. The most popular approach is the transform of Vale and Maurelli (Citation1983), which starts with a multivariate normal vector and then applies polynomial transforms in each coordinate. The polynomials are so chosen as to produce given univariate skewness and kurtosis in the resulting vector. If feasible, the method identifies the correlation matrix in the multivariate normal vector that ensures that the polynomially transformed variables have the required covariance matrix. A thorough theoretical study of the Vale-Maurelli (VM) transform is given by Foldnes and Grønneberg (Citation2015). The VM transform is the default method for generating non-normal data in commercial software such as EQS (Bentler, Citation2006) and LISREL (Jöreskog & Sörbom, Citation2006), and in the R package lavaan (Rosseel, Citation2012).

Recently, several alternative methods that also control the moments have been developed. The independent generator approach proposed by Foldnes and Olsson (Citation2016) is available in the R package covsim (Grønneberg et al., Citation2021), and can match pre-specified univariate skewness and kurtosis. It offers more flexibility than VM, since many possible marginal distributions are attainable. Based on the independent generator idea, Qu et al. (Citation2019) recently proposed a method which controls multivariate skewness and kurtosis, available in the R package mnonr (Qu & Zhang, Citation2020).

Both VM and the independent generator approach allows the asymptotic covariance matrix of the empirical covariances to be exactly calculated (Foldnes & Grønneberg, Citation2017). This means that the population-level properties of standard errors and fit statistics in SEM models may be exactly calculated using well-known formulas (Browne, Citation1984).

The above methods have in common with PLSIM that only some lower-order moments of the univariate distributions are controlled. Mair et al. (Citation2012) offered a different approach, based on the concept of a copula. A copula is a multivariate distribution with uniform marginals on [0,1]. In general, multivariate distributions may be split up into a copula component and the univariate distributions. In the Mair et al. (Citation2012) approach the marginals are specified together with the copula class, but in a final step the simulated vector is obtained by pre-multiplication with a matrix in order to reach the target covariance, leading to perturbations in the marginal distributions. We are aware of two approaches that allow complete control over the univariate distributions. First, the NORTA method of Cario and Nelson (Citation1997), which is implemented in package SimCorMultRes (Touloumis, Citation2016). In common with VM and PLSIM, it is based on generating multivariate normal data and then transforming each variable according to univariate specifications. A limitation of this method is that it can only produce data with a normal copula. Second, the VITA method of Grønneberg and Foldnes (Citation2017), implemented in package covsim (Grønneberg et al., Citation2021), fully specifies the marginal distributions. In addition, since VITA is based on regular vine distributions, the user may specify for each variable pair the (conditional) copula. The VITA approach is particularly suited for ordinal data simulation, as recently demonstrated by Foldnes and Grønneberg (Citation2021).

Piecewise linear simulation: The univariate case

In this section, we consider a random variable that is stochastically represented as a PL function of a standard normal variable Z. A general expression for a PL function consisting of d line segments is

(1) H(x)=i=1d[aix+bi]I{γi1<xγi},(1)

where γ0=,γd=. The indicator function I{A} evaluates to 1 if A is true, and to 0 otherwise. The γ are breakpoints where function evaluation shifts from one affine function to another. The d line segments have slopes denoted by ai and y-intercepts denoted by bi. The requirement that H(x) be continuous means that bi+1=(aiai+1)γi+bi for i=1,,d. Therefore, H(x) is parameterized by the a and γ vectors, in addition to b1 (a total of 2d parameters).

As an example, consider the graph in , which depicts the following function, where d=4:

(2) H1(x)=0.552x0.127ifx<0.6740.258x0.325if0.674x<00.585x0.325if0x<0.6742.185x1.404ifx>0.674.(2)

In this example the breakpoints are 0.674,0, and 0.674. The breakpoints are regular in the sense that they correspond to quantiles of regularly spaced probabilities (.25,.5, and .75) for the normal distribution. Note that all the slopes in this example are positive, so that H1 is a monotone function.

We now assume that ZN(0,1) is a standard normal variable, and define the random variable

(3) Y=H(Z)=i=1d[aiZ+bi]I{γi1<Zγi}.(3)

The cumulative distribution and density functions of Y may be deduced following straightforward arguments (See also Foldnes & Grønneberg, Citation2015, Prop.1). For instance, the density may, without further assumptions, be calculated as

(4) f(y)=i=1d|ai|1ϕ((ybi)/ai)I{γi1<(ybi)/aiγi},(4)

where ϕ() is the density function of a standard normal variable. depicts the density of H1(Z). It is evident that the four line segments in H1(Z) contribute separately to the density, producing rather pronounced shifts in the density curve.

Figure 2. The density of Y=H1(Z)

Figure 2. The density of Y=H1(Z)

In order to calculate the moments of Y, it is useful to first obtain the conditional moments mki:=E(Zk|γi1<Zγi), that is, the moments of a truncated normal variable. These may be obtained with the following recursive formula (Burkardt, Citation2014; Orjebin, Citation2014), where we initialize by m1i=0 and m0i=1:

mki=(k1)mk2iγik1ϕ(γi)γi1k1ϕ(γi1)Φ(γi)Φ(γi1),

where Φ() is the cumulative distribution function of a standard normal variable. Now, to calculate the k-th moment of Y, we apply the following formula, which is derived in Appendix A:

(5) E(Yk)=i=1dj=0kkjaikjbij(Φ(γi)Φ(γi1))mkji.(5)

Using this formula, the first four centralized moments of Y=H1(Z) from EquationEquation (2) are

E(Y)=0,E(Y2)=1
E(Y3)=2,E(Y4)=8.

In other words, Y is a zero mean unit variance random variable whose skewness is E(Y3)/E(Y2)3/2=2 and whose excess kurtosis is E(Y4)/E(Y2)23=5.

We have shown how to calculate the moments of a given PL random variable. However, in simulation applications we need to move in the opposite direction: We first specify the moments, and then we search for a PL function that generates the pre-specified moments. That is, we use a fast numerical routine to calibrate H1(x), based on the formula in EquationEquation (5). For given breakpoints, slope values a and y-intercepts b, this equation allows us to calculate the first four moments of Y=H(Z). If we are given pre-specified values μ, σ2, μ˜3, and μ˜4 for mean, variance, skewness, and excess kurtosis, respectively, we can therefore use numerical optimization to minimize

(E(Y3)/E(Y2)3/2μ˜3)2+(E(Y4)/E(Y2)2μ˜3)2

as a function of a1,,ad. The above expression is not dependent upon the specific b-values, but we assume that these are such that H(x) is continuous in each optimization step. The final step involves shifting the bi so that E(Y)=μ, and scaling the ai so that E(Y2)=σ2+μ2. The optimization routine has been implemented in the function rPLSIM in package covsim Foldnes and Grønneberg (Citation2020b). The default setting of rPLSIM is to use three regularly spaced break-points. We here demonstrate how one may request a sample of size 1000 from a population with skewness 2 and excess kurtosis 5, where the PL function is forced to be monotonous:

library(covsim)

library(psych)#for sample skew/kurtosis

set.seed(1)

res <- rPLSIM(N = 10^3, sigma.target = 1, skewness = 2, + excesskurtosis =5, monot = TRUE)

sim.sample <- res[[1]][[1]]# a simulated sample

skew(sim.sample)

kurtosi(sim.sample)

res[[2]]$a;res[[2]]$b#print slopes and intercepts

In the output below we see that the sample skewness and excess kurtosis values are close to the population values. Also, in the second element of the output we can inspect the fitted slope and intercept values, which agree with the values in EquationEquation (2):

[1] 1.973825

[1] 4.987873

[1] 0.5519887 0.2583700 0.5849776 2.1849716

[1] −0.1271060 −0.3251488 −0.3251488 −1.4043284

Matching pre-specified mean, variance, skewness, and kurtosis

In the example above the slopes a1,,a4 and the y-intercepts b1,,b4 were carefully chosen so that Y=H1(Z) has mean zero, unit variance, skewness 2 and excess kurtosis 5. One reason for choosing the first four moments in this way was to produce a condition of non-normality where the VM transform is not helpful, since the third-order Fleishman polynomial cannot produce skewness 2 in combination with excess kurtosis of 5. Using the lavaan package for calibration of Fleishman polynomials yields:

library(lavaan)

res <- simulateData(“x1~~x2,” skewness = 2, + kurtosis = 5)

lavaan WARNING: ValeMaurelli1983 method

+ did not convergence,

+ or it did not find the roots

So the skewness 2, kurtosis 5 case illustrates that there are conditions in which VM is infeasible but that are still within reach of PLSIM.

Given the flexibility of piecewise linear functions, it is not surprising that even with the same breakpoints, there are other PL functions that produce the same first four moments as H1(x). For instance, we may relax the monotonicity constraint:

res <- rPLSIM(N = 10^3, sigma.target = 1,

+ skewness = 2, excesskurtosis =5, monot = FALSE)

res[[2]]$a[[1]]

[1] 0.8500105 −0.9079488 1.2142742 2.1681442

The result is a non-monotonous function H2(x) depicted in , which ensures that Y2=H(Z) also has zero mean, unit variance, skewness 2 and excess kurtosis 5. The density of Y2 is depicted in . We observe that although Y1 and Y2 share the first four moments, their densities are quite dissimilar.

Figure 3. Graph of the piecewise linear function H2(x).

Figure 3. Graph of the piecewise linear function H2(x).

Figure 4. The density of Y=H2(Z).

Figure 4. The density of Y=H2(Z).

To further illustrate the flexibility of piecewise linear transforms, we next specify skewness 2 and excess kurtosis 4. With the default breakpoint settings, as also used in H1(x), our routine did not identify a PL that attained these skewness and excess kurtosis values. To find a valid PL function the breakpoints therefore need to be altered, either by introducing more line segments, or by keeping d = 4 and changing the location of the breakpoints. In our case, the latter was a viable option:

g <- list(c(−2,0.5, 2))

res <- rPLSIM(N = 10^3, sigma.target = 1,

+ skewness = 2, excesskurtosis =4,

+ monot = FALSE, gammalist = g) res[[2]]$a[[1]]

[1] 1.350564 0.201702 2.284732 1.398601

For the set of feasible breakpoints γ1=2,γ2=0.5, and γ3=2, the function H3(x) depicted in produces the density for Y3=H3(Z) depicted in . This density has skewness 2 and excess kurtosis 4. Note that, although we did not constrain the routine to monotonous solutions, the result is still monotonous in this special case.

Figure 5. Graph of the piecewise linear function H3(x).

Figure 5. Graph of the piecewise linear function H3(x).

Figure 6. The density of Y=H3(Z).

Figure 6. The density of Y=H3(Z).

The generality of univariate PLSIM

As exemplified above, PLSIM accommodates a much larger class of univariate marginal distributions than those provided under the VM transform. In fact, PLSIM may approximate to arbitrary precision the distribution F of any continuous random variable X. For, given the quantile function F1(u)=inf{x:F(x)=u}, 0<u<1, then F1(U)X, provided U is a uniform variable on [0,1]. Therefore, F1(Φ(Z))X, where ZN(0,1) (Shorack & Wellner, Citation2009, Theorem 1, p. 3). With increasing number of breakpoints we can approximate the function F1(Φ(x)) using a PL function H(x) arbitrarily well. It then follows that H(Z) will converge in distribution to F as the number of breakpoints increases, provided the breakpoints and line segments are suitably chosen.

The bivariate case

In the previous section, we demonstrated how a PL function H(x) could be fitted to accommodate univariate moments for the random variable Y=H(Z). In this section we move to the bivariate case where we consider 2-dimensional random vectors

Y=(Y,Y˜) =(H(Z),H˜(Z˜)) ,

where each coordinate is a PL transform of a standard normal variable. Moreover, Z and Z˜ are assumed to be bivariate normally distributed and have correlation ρ. We assume, without loss of generality, that H(x) and H˜(x) are such that Y=H(Z) and Y˜=H˜(Z˜) each has zero mean and unit variance.

Next  consider  the  covariance/correlation   between Y=H(Z)=i=1d[aiZ+bi]I{γi1<Zγi} and H˜(Z˜)=j=1d˜[a˜jZ˜+b˜j]I{γ˜j1<Z˜γ˜j}). For less notational burden, we let Ri,j denote the rectangle (γi1,γi]×(γ˜j1,γ˜j]. Then,

i=1d[aiZ+bi]I{γi1<Zγi}j=1d˜[a˜jZ˜+b˜j]I{γ˜j1<Z˜γ˜j}
=i=1dj=1d˜[aiZ+bi][a˜jZ˜+b˜j]I{(Z,Z˜)Ri,j}
=i=1dj=1d˜[aia˜jZZ˜+aib˜jZ+bia˜jZ˜+bib˜j]I{(Z,Z˜)Ri,j}.

This gives

(6) Cov(H(Z),H˜(Z˜))=E(H(Z)H˜(Z˜))(6)
=i=1dj=1d˜[aia˜jE(ZZ˜I{(Z,Z˜)Ri,j})+
aib˜jE(ZI{(Z,Z˜)Ri,j})+
bia˜jE(Z˜I{(Z,Z˜)Ri,j})+
bib˜jE(I{(Z,Z˜)Ri,j})]
=i=1dj=1d˜[aia˜jE(ZZ˜|(Z,Z˜)Ri,j)+
aib˜jE(Z|(Z,Z˜)Ri,j)+
bia˜jE(Z˜|(Z,Z˜)Ri,j)+
bib˜j]P((Z,Z˜)Ri,j).

Procedures for calculating the moments

E(ZZ˜|(Z,Z˜)R)andE(Z|(Z,Z˜)R)

of a truncated bivariate normal variable is a well-studied problem (e.g., Leppard & Tallis, Citation1989). In our implementation we use the R package tmvtnorm (Wilhelm & Manjunath, Citation2015) to calculate these moments.

It is important to notice that the expression in EquationEquation (6), in addition to being dependent upon the slopes, y-intercepts and breakpoints in H(x) and H˜(x), also depends on an additional parameter, namely ρ, the intermediate correlation between Z and Z˜. We emphasize this in notation by writing Cov(H(Z),H˜(Z˜);ρ).

We next illustrate the dependency of Cov(H(Z),H˜(Z˜);ρ) on ρ. We consider the three PL functions H1(x),H2(x), and H3(x) introduced in the previous section. For values of ρ between 1 and 1, we calculated the correlation Cov(H1(Z),H2(Z˜);ρ), Cov(H1(Z),H3(Z˜);ρ), and Cov(H2(Z),H3(Z˜);ρ). In we graphically depict the dependence of the correlations upon the correlation ρ between Z and Z˜. Clearly, combining H1(x) and H2(x) yields the largest possible range of correlations, although there is no way to produce a correlation below.68 for this pair of PL transforms. Combining H2(x) and H3(x) yields the smallest range, with the lowest attainable correlation being .55. The figure demonstrates that not all correlations are attainable once the univariate specification of skewness and excess kurtosis have been given.

Figure 7. The correlation among piecewise linear transforms of bivariate normal variables with correlation ρ.

Figure 7. The correlation among piecewise linear transforms of bivariate normal variables with correlation ρ.

The PLSIM simulation procedure

In the previous sections, we demonstrated how a PL function could be fitted to accommodate univariate moments of the simulated variable Y=H(Z), and how to calculate the covariance among two such variables. We now move to the full multivariate case and consider p-dimensional random vectors

(7) Y=(H1(Z1),H2(Z2),,Hp(Zp)) ,(7)

where each coordinate Hi(Zi) is a PL transform of a standard normal variable,Footnote1 ZiN(0,1). We assume, without loss of generality, that the Hi(x) functions (i=1,,p) have been calibrated so that Hi(Zi) has zero mean and unit variance. We also assume that Z=(Z1,,Zp) is multivariate normally distributed with standardized marginals and a covariance matrix equal to ΣZ. Note that ΣZ is in fact a correlation matrix containing p(p1)/2 non-redundant off-diagonal elements.

The steps in PLSIM are as follows:

  1. The user specifies

  • (a) Univariate properties (e.g., skewness and excess kurtosis) of each marginal variable Y1,,Yp.

  • (b) A target correlation matrix Σ.

  1. (2) The PL functions H1,,Hp are calibrated to match the properties specified in step 1(a).

  2. (3) For each correlation ρi,j, 1i<jp in Σ, we numerically determine a correlation ρi,jZ among Zi and Zj so that Hi(Zi) and Hj(Zj) have correlation ρi,j.

  3. (4) The matrix ΣZ is formed from entries ρi,jZ. A random sample from the multivariate normal distribution with zero mean and covariance matrix ΣZ is drawn. We apply the functions Hi(x), i=1,,p, coordinate-wise to the random sample to obtain our PLSIM sample.

There are two ways the above procedure may fail to complete. First, in Step 2, we may fail to identify a Hi(x) for some i=1,,p, that reaches the given skewness and kurtosis values. Although we may then change the breakpoint locations or increase the number of breakpoints, it is computationally burdensome to run PLSIM with, say, 20 breakpoints in each of 40 variables. In the R implementation provided in the supplemental material, the default number of breakpoints is 3, which are regularly placed (0.674,0,0.674) so that each line segment is associated with the same probability .25. In our experiments, this seems to offer a reasonable compromise between computational tractability and flexibility across various skewness, kurtosis and correlations combinations.

The second way the above PLSIM procedure may fail, is in Step 4, should ΣZ be negative definite. That is, it may happen that ΣZ is not a proper correlation matrix. The reason for this is that we, for computational simplicity, calibrate the entries in ΣZ independently. The alternative is full simultaneous calibration of all the entries, under the additional restraint of positive definiteness. However, we did not find this option viable in our numerical experiments. The issue of the intermediate matrix not being a proper correlation matrix arises also in the VM procedure. We here propose a simple solution to this problem, which is applicable to both PLSIM and VM. If ΣZ is negative definite, we calculate its nearest correlation matrix TZ. There are various approaches to defining TZ, and in our current implementation we use the method proposed by Higham (Citation2002), as implemented in package Matrix (Bates & Maechler, Citation2019). Then, in Step 4, we replace ΣZ by TZ. This means that the target covariance Σ is no longer reached. However, we may correct this by pre-multiplying the PLSIM vector Y by

P=Σ1/2M1/2,

where M is the implied covariance matrix of Y=(H1(Z1),,Hp(Zp)) , when the Zi, (i=1,,p) have covariance TZ. The square root matrices are symmetric and such that M1/2M1/2=M and Σ1/2Σ1/2=Σ. Note that, due to EquationEquation (6), exact calculation of M is straightforward. In most cases P will be fairly close to the identity matrix, so that the marginal distributions will be only slightly modified by pre-multiplication. So the pre-specified marginal properties, e.g., skewness and kurtosis, will still be closely matched, while the covariance matrix will be exactly matched. To summarize, to avoid the problem of negative definiteness, we rewrite Step 4 above as follows:

4. The matrix ΣZ is formed, with elements ρi,jZ. If ΣZ is negative definite, let TZ denotes its closest positive definite matrix. Draw a random sample from the multivariate normal distribution with zero mean and covariance matrix ΣZ (or TZ when needed). Apply the functions Hi(x), i=1,,p, coordinate-wise to the random sample. If TZ was used, the final random sample is obtained by post-multiplying the random data matrix by Σ1/2M1/2.

On the asymptotic covariance matrix for PLSIM

As argued by Foldnes and Grønneberg (Citation2017), it is desirable in a simulation study to specify non-normality more precisely than just reporting univariate skewness and kurtosis. Ideally, the full asymptotic covariance matrix Γ of the empirical second-order moments should be computed in each simulation condition. For SEM, access to Γ means that the population-level properties of standard errors and fit statistics may be exactly calculated using well-known formulas (Browne, Citation1984). For PLSIM, we here show that there are closed form formulas available from Γ. Unfortunately, no presently available implementation of these formulas are able to calculate these quantities within either an acceptable running time, or at an acceptable numerical precision. New implementations will be needed for calculating Γ for PLSIM. In light of this future availability, we here sketch how Γ can be obtained.

Consider a random p-dimensional vector Y whose expectation is zero and whose fourth-order moments exist. Let Σ be the covariance matrix of Y. Then Γ is defined as the p(p+1)/2×p(p+1)/2 matrix with elements

Γij,kl=E(YiYjYkYl)ΣijΣkl,

where all or some indices are allowed to be equal. To obtain Γ under PLSIM, we need to perform calculations for the expectation similar to the deductions in EquationEquation (6). The full expression for E(Hi(Zi)Hj(Zj)Hk(Zk)Hl(Zl)) is a linear combination of elements of the type

E(ZiZjZkZlI{(Zi,Zj,Zk,Zl)R})
(8) =E(ZiZjZkZl|(Zi,Zj,Zk,Zl)R)P((Zi,Zj,Zk,Zl)R),(8)

where R is a four-dimensional rectangle defined by four pairs of breakpoints in Hi(x),Hj(x),Hk(x), and Hl(x), and again indices are allowed to be equal. We therefore need to calculate higher-order moments of a truncated multivariate normal vector, and there exist both exact recursive and non-recursive formulas for this task. A numerical routine implementing the recursive formulas is found in package MomTrunc (Galarza et al., Citation2021), and succeeds in calculating a subset of the required moments at an acceptable speed and precision. However, in our experiments, tri- and four-variate moments as calculated by MomTrunc are sometimes not satisfactory, a finding echoed by Ogasawara (Citation2021), who developed non-recursive formulas which can be used to calculate higher-order moments of a truncated normal vector to arbitrary precision. Unfortunately, the only available implementation for the formulas in Ogasawara (Citation2021), which is given in the supplementary material of Ogasawara (Citation2021), takes an excessive long time to terminate in one of the cases we need, namely when all indices in EquationEquation (8) are distinct, and genuine four-dimensional integration is required. There is therefore no available implementation of an algorithm capable of computing Γ in reasonable time, and we therefore do not provide a function to calculate Γ in our implementation. This will be added when future efficiency improvements in the procedure proposed by Ogasawara (Citation2021) is made available. A rough approximation to Γ can be obtained, as always, by direct simulation from PLSIM.

Limitations

As argued above, the univariate generality of PLSIM is only limited by computational considerations. However, this is not the case in terms of multivariate dependency properties, as PLSIM takes a multivariate normal random vector and apply only coordinate-wise transformations. Since the transformation from Z to Y has no interaction between the coordinates, this restricts the multivariate dependency properties of Y, as shown in Foldnes and Grønneberg (Citation2015).

Recall that the copula of a continuous random vector (X1,,Xp)  is the distribution of (F1(X1),,Fd(Xp))  where F1,,Fp are the marginal cumulative distribution functions of X1,,Xp. In the case where each of the coordinate-wise transformations H1,,Hp are monotonous, the copula of the PLSIM random vector Y of EquationEquation (7) is exactly normal, meaning that it has the same copula as the multivariate normal vector Z.

A recent discovery (Grønneberg & Foldnes, Citation2019) warns against the widespread practice of using VM in robustness studies for ordinal SEM. In many relevant cases encountered in the simulation literature, VM has the normal copula, which in ordinal SEM has the unfortunate consequence of making the ordinal vector generated by discretizing a VM random vector numerically equal to a discretization of an exactly normal random vector. The distribution of the manifest variables in ordinal SEM is a function only of the copula of the latent continuous vector at certain points (Foldnes & Grønneberg, Citation2019, Citation2020a, Citation2021; Grønneberg & Foldnes, Citation2021; Grønneberg & Moss, Citation2021). Since PLSIM will have a normal copula when each H1,,Hp are monotonous, the discovery of Grønneberg and Foldnes (Citation2019) also applies to PLSIM. Therefore, PLSIM for simulation studies with ordinal SEM is not recommended, and if used, must be used for non-monotonous H1,,Hp. One important exception is if a normal copula is desired. For instance, Grønneberg and Foldnes (Citation2021) employed a simple bivariate PLSIM distribution to illustrate that ordinal SEM estimation is biased unless knowledge of all latent marginal distributions is provided. In that case, profiting from PLSIM’s marginal flexibility, a bivariate vector Y was constructed who followed a two-factor model, while the bivariate generator vector Z followed a one-factor model. Since Y had a normal copula, all normal theory methods based on discretizing Y estimate features of Z and not Y, illustrating the impossibility of identifying even the number of factors in ordinal SEM without latent marginal knowledge.

In general, PLSIM distributions are contained in the class investigated by Foldnes and Grønneberg (Citation2015), as Y is the coordinate-wise transformation of a continuous random vector Z, where each transformation is piecewise strictly monotonous over a finite set of intervals. In PLSIM, we have that Z is multivariate normal with standardized marginals, and the transformations are straight lines in each interval segment. When some of the coordinate-wise transformations are non-monotonous, the copula of Y is not exactly normal, but the multivariate distribution is still strongly connected to the normal generator variable Z, and for example, Y cannot have what is known as tail dependence, see Section 3.3 of Foldnes and Grønneberg (Citation2015).

The main limitation of PLSIM is therefore its close connection to the normal distribution in terms of copula properties. This limitation may be remedied by replacing the normal random vector with another class of distributions capable of detailed control of lower moments, and with computationally feasible formulas for conditional distributions over rectangles. As mentioned above, calculating the fourth-order moments contained in Γ leads to serious computational challenges even when Z is multivariate normal. It therefore seems that such extensions would either be restricted to very simple distributional classes which would limit its usefulness, or would depend on as of yet unavailable numerical methods.

A comparison of Fleishman polynomials and PLSIM

As discussed above, PLSIM and VM are similar in terms of multivariate dependency, as both are generated by coordinate-wise transformations of a normal random vector. We here further inquire into the univariate distributional differences between PLSIM and VM. In the latter procedure, the marginal distributions are generated by third-order polynomial transformations as proposed by Fleishman (Citation1978). Third-order polynomials f(x) grow more quickly to ± as x± than PL functions, who involve only a simple scale-and-shift of the identity function. In this sense PL functions are more stable transformations compared to Fleishman polynomials, yet they preserve the capability of approximating general functions to arbitrary precision. The stability concerns the tails of the resulting univariate distributions. As implied from EquationEquation (4), PL functions have the same tail behavior as a normal distribution. That is, when y, PL density approaches zero as adϕ((ybd)/ad), and when y, the density goes to zero as a1ϕ((yb1)/a1). That is, in PL functions, the tail behavior is still driven solely by the quick decrease to zero of ϕ(y)=(2π)1/2exp(y2/2). This is not the case for VM, which has heavier tails due to the third-order transformation. Whether heavier tails are desirable, and whether considerations as |y| are relevant or not, depends on the application.

Let us consider in a concrete example the relation between a normal (Z), a PL (YPL) and a Fleishman (YF) variable. The simplest case of a Fleishman polynomial is the standardized third-order transformation (Z3μ3)/σ3. We have μ3=EZ3=0 and σ3=VarZ3=EZ6(EZ3)2=EZ6=15. Therefore, YF=Z3/15. The excess kurtosis of YF is extreme, namely EYF43=(151/2)4EZ343=(15)2EZ123=46.23= 43.2. The density of YF is yϕ((15)1/6y1/3)(15)1/6(1/3)y2/3. While the term y2/3 quickens the convergence to zero, the y1/3 term inside ϕ slows down convergence, and this is the dominant term. We also fitted, using breakpoints 3,2,1,1,2, and 3, a PL function H(x) so that YPL=H(Z) had zero mean, unit variance, skew zero, and excess kurtosis 43.2. depicts the density curves of the three densities. Clearly, although the Fleishman and the PL distribution have common moments up to the fourth order, their distributions differ markedly. Both distributions are more peaked and more heavy-tailed than the standard normal distribution. But the peakedness of the Fleishman polynomial is much more pronounced than that of the PL distribution. Moreover, although not depicted in the figure, for extreme values, the tails of the Fleishman polynomial are fatter than those of the PL distribution. To illustrate this point, we simulated n=107 sample from both distributions, and found that the 99th-percentiles for the PL and Fleishman data were 2.75 and 3.8, respectively (see the supplemental material).

Figure 8. Graph of the density of Z and the standardized version of Z3 when ZN(0,1).

Figure 8. Graph of the density of Z and the standardized version of Z3 when Z∼N(0,1).

Illustration

Let us illustrate PLSIM with a real-world example. The datasets package contains the dataset attitude, based on a survey given to employees in a large financial organization related to satisfaction with their supervisors. For 30 randomly chosen departments the proportion of favorable responses for each item was collected in the attitude dataset. The correlation matrix and the marginal sample skewnesses and excess kurtosi are given . Our aim here is to construct a 7-variate distribution whose correlation matrix and marginal skewness and excess kurtosis values match exactly the values in .

library(datasets)

attach(attitude)

s <- skew(attitude)

k <- kurtosi(attitude)

sigma <- cor(attitude)

set.seed(1)

res <- rPLSIM(100, sigma.target = sigma, + skewness = s, excesskurtosis = k)

sim.sample <- res[[1]][[1]]

Table 1. Correlation matrix and marginal skewnesses and excess kurtosi for the attitude dataset

Note that the VM approach as implemented in lavaan is not up to this task, since it can not match the skewness and excess kurtosis of variable learning. However, with three regularly spaced thresholds, monotonous PL functions may be identified (Step 2) for each of the seven variables that match skewness and excess kurtosis exactly. In Step 3 we fit each pair of variables separately, and in Step 4 the correlations are aggregated into:

ΣZ=1.834.440.643.605.165.1631.570.611.678.196.2331.509.451.156.3531.656.123.5591.397.5881.3091

This matrix is positive definite, so we need not pre-multiply by P. A sample of size n=100 was simulated using PLSIM, with resulting scatter plots and univariate densities depicted in .

Figure 9. Plots for n=100 dataset simulated from PLSIM.

Figure 9. Plots for n=100 dataset simulated from PLSIM.

Finally, let us inspect sample estimates of kurtosis with respect to the pre-specified kurtosis values. In a generated sample, sample kurtosis will differ substantially from the specified kurtosis value. The latter holds at the population level but not in samples. To illustrate, we generated 1000 samples, each of size 30, from PLSIM. In each sample univariate kurtosis was estimated for each of the seven variables. The results are depicted in , where each panel is associated with one variable. It is seen that sample excess kurtosis varies substantially across samples. In each panel, the vertical red line indicates population excess kurtosis. Small-sample bias of the kurtosis estimator is manifested for most variables, with general downward bias in our case. Let us also consider the b2,p statistic (Mardia, Citation1970) for multivariate kurtosis. Qu et al. (Citation2019) proposed a simulation method where samples are generated from a distribution with pre-specified multivariate kurtosis. In our development of PLSIM, we control the univariate kurtosi, but we have no control over multivariate kurtosis. Over the same 1000 samples described above, we calculated b2,p, with results given in . The red line represents this statistic calculated in the attitude dataset. It is seen that the PLSIM datasets have lower b2,p values than the original attitude data. This illustrates that we do not control multivariate kurtosis with our method.

Figure 10. Histograms of sample estimates of univariate kurtosis based on 1000 simulated datasets of size n=30. The red line represents the population kurtosis derived from the attitude dataset.

Figure 10. Histograms of sample estimates of univariate kurtosis based on 1000 simulated datasets of size n=30. The red line represents the population kurtosis derived from the attitude dataset.

Figure 11. Histogram of multivariate kurtosis b2,p based on 1000 simulated datasets of size n=30. The red line represents b2,p calculated for the attitude dataset.

Figure 11. Histogram of multivariate kurtosis b2,p based on 1000 simulated datasets of size n=30. The red line represents b2,p calculated for the attitude dataset.

Conclusion

We have presented a new method to simulate univariate and multivariate non-normal data. We employ piecewise linear transforms of standard normal variables. This method is flexible since we may manipulate the slopes of the line segments in order to reach pre-specified skewness and kurtosis values. This is possible since the fourth-order moments of the transformed variable is exactly computable. The same holds for pairs of piecewise linearly transformed variables, i.e., the covariance may be calculated using exact formulas. This means that we may pairwise calibrate the correlation among the two standard normal variables in order to reach a given covariance among the transformed variables.

PLSIM has been implemented in R and is available in package covsim. We have demonstrated that PLSIM may be used in cases (e.g., skewness 2 and excess kurtosis 4) where the Vale-Maurelli procedure fails. PLSIM supports a flexible class of univariate distributions, since its framework is based on choosing arbitrary number and placement of breakpoints, and arbitrary line segments between the breakpoints. In our implementation the default number of line segments is four, separated by regularly spaced breakpoints. We have deduced the formulas necessary to exactly compute the asymptotic covariance matrix of the generated second-order moments under PLSIM. However, at the present time the needed routines to calculate moments of the truncated multivariate normal distributions are too slow for practical use. However, we project that this situation will be soon remedied, given the present active development around truncated multivariate moments in the field of multivariate statistics. We also proposed a simple correction to the case of negative definiteness of the intermediate correlation matrix, based on finding the nearest positive definite matrix. This correctional step may likewise be useful for extending the generality of the Vale-Maurelli procedure.

Notes

1 H1,H2,,Hp denote general functions of the form of EquationEquation (1), and not the specific illustrative functions defined in the previous sections.

References

Appendix

Calculating the moments of Y=H(Z)Consider the random variable Y=H(Z). Then

Yk=i=1d[aiZ+bi]I{γi1<Zγi}k
=i=1d[aiZ+bi]kI{γi1<Zγi}
=i=1dj=0kkjaikjbijZkjI{γi1<Zγi}.

Now, since

E(ZkI{γi1<Zγi})=
E(ZkI{γi1<Zγi}|Zγi1)Φ(γi1)+
E(ZkI{γi1<Zγi}|γi1<Zγi)(Φ(γi)Φ(γi1))+
E(ZkI{γi1<Zγi}|γi<Z)(1Φ(γi))=
E(Zk|γi1<Zγi)(Φ(γi)Φ(γi1))

the k-th moment is

E(Yk)=i=1dj=0kkjaikjbijE(ZkjI{γi1<Zγi})
=i=1dj=0kkjaikjbijE(Zkj|γi1<Zγi)(Φ(γi)Φ(γi1)).

To evaluate this expression we need to calculate the conditional moments mk:=E(Zk|γi1<Zγi), that is, the k-th moment of a truncated normal variable. Mean and variance formulas are provided in Johnson et al. (Citation1994). Higher-order moments may be obtained with the following recursive formula (Burkardt, Citation2014; Orjebin, Citation2014), where we initialize by m1=0 and m0=1:

mk=(k1)mk2γik1ϕ(γi)γi1k1ϕ(γi1)Φ(γi)Φ(γi1).

To sum up, we may calculate E(Yk) by first using the above recursive formula to calculate the conditional moments m1,,mk. Then we apply the formula

E(Yk)=i=1dj=0kkjaikjbij(Φ(γi)Φ(γi1))mkj.