Applications of discrete factor analysis

Pages 4592-4602 | Received 11 May 2021, Accepted 31 Jul 2021, Published online: 14 Aug 2021

Abstract

A recently proposed method for factor analysis of discrete data is extended to better handle overdispersion. Three empirical examples from veterinary sciences, musicology and agriculture are investigated, involving true count data as well as ordinal data. Comparisons are made with results from related statistical techniques, e.g., principal component analysis.

1. Introduction

Factor analysis is a statistical technique that can be employed in many fields of application. The variability among observed correlated variables is described in terms of a smaller number of unobserved variables called factors. Recently, a method for factor analysis of discrete data was presented (Larsson Citation2020), along with an empirical example: a seven-dimensional data set of ordinal data (a survey study). The main purpose of this paper is to illustrate the usefulness of that method by studying three different empirical examples. In the remainder of this introductory section, we first review the method of Larsson (Citation2020) and then present an extension, based on the negative binomial distribution, for the analysis of data with overdispersion.

The method is implemented in Matlab, and codes are available upon request.

1.1. Dependent Poisson models

In Larsson (Citation2020), a method for performing factor analysis of discrete data using a dependent Poisson distribution is presented. A short description of the method is as follows. There are many versions of dependent Poisson models in the literature; see Larsson (Citation2020) and the references therein. The model studied by Larsson (Citation2020) has been proposed by, e.g., Karlis (Citation2003). In its simplest form, it is given as the bivariate model

(1)  Y1 = U + X1,  Y2 = U + X2,

where U, X1 and X2 are independent Poisson variables, with intensities λ, μ1 and μ2 (say), respectively. Then Y1 and Y2 are also Poisson, but with intensities λ + μ1 and λ + μ2, respectively. The variables U, X1 and X2 are considered latent; only Y1 and Y2 are observed.

It is easily seen that

(2)  corr(Y1, Y2) = λ / √((λ + μ1)(λ + μ2)).

In particular, we see that (as is natural) the correlation is larger when λ is large relative to μ1 and μ2. Another important observation is that the model only allows for positive correlation.
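The construction in (1) and the correlation formula (2) are easy to check by simulation. The sketch below (in Python rather than the Matlab used by the authors, and with arbitrary illustrative parameter values) draws from the model and compares the empirical correlation with (2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (not taken from any of the paper's data sets)
lam, mu1, mu2 = 2.0, 1.0, 3.0
n = 200_000

# Model (1): Y1 = U + X1, Y2 = U + X2 with independent Poisson components
u = rng.poisson(lam, n)
y1 = u + rng.poisson(mu1, n)
y2 = u + rng.poisson(mu2, n)

# Y1 is Poisson with intensity lam + mu1, so its mean and variance agree
print(y1.mean(), y1.var())

# Empirical correlation versus formula (2)
emp = np.corrcoef(y1, y2)[0, 1]
theo = lam / np.sqrt((lam + mu1) * (lam + mu2))
print(emp, theo)
```

With these values, both printed correlations should be close to 2/√15 ≈ 0.52, and the correlation is positive, as the model requires.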

Suppose we have observation pairs (y11, y12), …, (yn1, yn2). In Larsson (Citation2020), it is laid out how the parameters may be estimated numerically by maximum likelihood. Indeed, for the model in (1), the likelihood takes the form

(3)  L(λ, μ1, μ2) = ∏_{i=1}^{n} ∑_{u=0}^{min(yi1, yi2)} f(u; λ) g(yi1 − u; μ1) g(yi2 − u; μ2),

where f and g are Poisson probability functions, e.g.,

(4)  f(u; λ) = (λ^u / u!) e^{−λ},

and we have used that Y1 and Y2 are conditionally independent given U = u.

In fact, the likelihood only needs to be numerically maximized over λ, since it can be proved that λ̂ + μ̂k = ȳk for k = 1, 2, where λ̂ and μ̂k are the MLEs and ȳk = n^{−1} ∑_i yik; cf. Larsson (Citation2020).
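A minimal sketch of this estimation, again in Python and on illustrative simulated data: the likelihood (3) is profiled over λ alone, with μk = ȳk − λ substituted as described above.

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize_scalar

def neg_loglik(lam, y1, y2):
    """Negative log-likelihood (3), with mu_k = ybar_k - lam substituted
    so that only lam remains to be optimized over."""
    mu1 = y1.mean() - lam
    mu2 = y2.mean() - lam
    ll = 0.0
    for a, b in zip(y1, y2):
        u = np.arange(min(a, b) + 1)   # the sum over the latent U in (3)
        terms = (poisson.pmf(u, lam)
                 * poisson.pmf(a - u, mu1)
                 * poisson.pmf(b - u, mu2))
        ll += np.log(terms.sum())
    return -ll

# Simulated data from model (1) with lam = 2, mu1 = 1, mu2 = 3 (illustrative)
rng = np.random.default_rng(1)
u = rng.poisson(2.0, 500)
y1, y2 = u + rng.poisson(1.0, 500), u + rng.poisson(3.0, 500)

# lam must satisfy mu_k = ybar_k - lam >= 0, giving an upper bound
upper = min(y1.mean(), y2.mean())
res = minimize_scalar(neg_loglik, bounds=(1e-6, upper - 1e-6),
                      args=(y1, y2), method="bounded")
lam_hat = res.x
print(lam_hat, y1.mean() - lam_hat, y2.mean() - lam_hat)
```

The printed estimates should be near the true (λ, μ1, μ2) = (2, 1, 3), and the constraint λ̂ + μ̂k = ȳk holds by construction.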

We may view (1) as a simple two-variable factor model, with a common factor U. The idea of Larsson (Citation2020) is to extend this model to a system with many variables, divided into different groups. Within each group, the variables are connected to each other through a specific common factor. For example, a model with five variables and two factors may be constructed as

(5)  Y1 = U1 + X1,  Y2 = U1 + X2,  Y3 = U1 + X3,  Y4 = U2 + X4,  Y5 = U2 + X5,

where U1, U2, X1, X2, X3, X4, X5 are independent.

Here, the variables Y1, Y2, Y3 are connected (correlated) through U1, while Y4 and Y5 are connected through U2. It is clear that each of Y1, Y2, Y3 is uncorrelated with (indeed independent of) each of Y4 and Y5. The parameters of model (5) may be estimated by maximum likelihood in much the same way as the parameters of (1). In fact, this estimation is quite simple because of the independence of the two subsystems (Y1, Y2, Y3) and (Y4, Y5).

It is now quite obvious how to form many other systems in the style of (5); see Larsson (Citation2020) for a general formulation of the model.

The next question is how to find the dependent Poisson model that best fits the observed data. Larsson (Citation2020) suggests that the model giving the smallest value of the Akaike information criterion (AIC) should be selected. An obstacle is that, in systems with many variables, there is a great number of possible models to go through. To alleviate this, a kind of forward search algorithm is proposed, starting with the simple independence model and then successively moving to models of higher complexity. This is also the approach that we adopt in the present paper.
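One step of such a search can be sketched as follows: for a pair of variables, the independence model (two parameters) is compared via AIC with the one-factor model (1) (three parameters). The data and parameter values below are simulated for illustration only; the full algorithm repeats comparisons of this kind over many candidate groupings.

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize_scalar

def pair_loglik(lam, y1, y2):
    # Log-likelihood (3), with mu_k = ybar_k - lam substituted
    mu1, mu2 = y1.mean() - lam, y2.mean() - lam
    ll = 0.0
    for a, b in zip(y1, y2):
        u = np.arange(min(a, b) + 1)
        ll += np.log((poisson.pmf(u, lam) * poisson.pmf(a - u, mu1)
                      * poisson.pmf(b - u, mu2)).sum())
    return ll

# Simulated data in which y1 and y2 really do share a factor (lam = 2)
rng = np.random.default_rng(2)
u = rng.poisson(2.0, 300)
y1, y2 = u + rng.poisson(1.0, 300), u + rng.poisson(3.0, 300)

# Independence model: each Y_k is Poisson(ybar_k); 2 parameters
ll_indep = sum(poisson.logpmf(y, y.mean()).sum() for y in (y1, y2))
aic_indep = 2 * 2 - 2 * ll_indep

# One-factor model (1): 3 parameters (lam, mu1, mu2)
upper = min(y1.mean(), y2.mean())
res = minimize_scalar(lambda l: -pair_loglik(l, y1, y2),
                      bounds=(1e-6, upper - 1e-6), method="bounded")
aic_factor = 2 * 3 + 2 * res.fun

print(aic_indep, aic_factor)
```

Since the simulated pair is genuinely correlated, the factor model should attain the smaller AIC here, and the search would keep this grouping before moving on.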

Larsson (Citation2020) also extends the model to deal with truncated Poisson distributions (suitable in the presence of ordinal data). Another extension is the so-called mixed model, where the same factor is allowed to 'load' on more than one group of variables. This gives the same kind of flexibility as in traditional factor analysis models. However, in order to be able to compare with cluster analysis ('kmeans') and principal component analysis, we will not consider mixed discrete factor models in this paper.
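As an illustration of the kind of distribution involved, the sketch below implements a Poisson distribution right-truncated at a maximal level kmax and renormalized. This is an assumed, simplified variant for illustration only; the exact truncated form used in the method is the one given in Larsson (Citation2020).

```python
import numpy as np
from scipy.stats import poisson

def trunc_poisson_pmf(k, lam, kmax):
    """Poisson pmf restricted to {0, ..., kmax} and renormalized,
    so that the support matches a bounded ordinal scale."""
    support = np.arange(kmax + 1)
    norm = poisson.pmf(support, lam).sum()
    k = np.asarray(k)
    return np.where((k >= 0) & (k <= kmax), poisson.pmf(k, lam) / norm, 0.0)

# An ordinal variable on the levels 0-4
pmf = trunc_poisson_pmf(np.arange(5), lam=1.2, kmax=4)
print(pmf, pmf.sum())   # a proper distribution: sums to 1 (up to rounding)
```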

In Larsson (Citation2020), empirical data from a questionnaire is analyzed. This data is ordinal, and it is modeled by dependent truncated Poisson distributions. The results from this exercise are shown to agree closely with the outcomes of the traditional factor analysis performed by Jöreskog, Olsson, and Wallentin (Citation2016).

1.2. Dependent negative binomial models

Many empirical count data sets are subject to overdispersion, i.e., the variances of the variables are much larger than their means, making Poisson modeling insufficient. To handle this, we suggest in the present paper to extend the models of Larsson (Citation2020) simply by replacing the Poisson distribution with the negative binomial. To this end, we replace probability functions such as (4) by

(6)  f(u; r, p) = C(r + u − 1, u) p^r (1 − p)^u,

where C(n, k) denotes the binomial coefficient. There are many different parametrisations, see, e.g., Hilbe (Citation2011), but we choose this one for convenience, since it is the one implemented in Matlab. Under this parametrisation, the expectation is r(1 − p)/p and the variance is r(1 − p)/p². Moreover, the correlations are given by the formula (cf. (2))

(7)  corr(Yi, Yj) = v0 / √((v0 + vi)(v0 + vj)),

where v0 is the variance of the factor (U) and vi is the variance of Xi.

For large r and p close to one, the distribution is close to a Poisson with parameter r(1 − p), while for small p, the Poisson approximation breaks down and there is considerable overdispersion.
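These properties are easy to verify numerically. The sketch below uses SciPy's nbinom, which follows the same (r, p) parametrisation as (6); the parameter values are arbitrary and chosen for illustration.

```python
import numpy as np
from scipy.stats import nbinom, poisson

# Parametrisation (6): expectation r(1-p)/p, variance r(1-p)/p^2
r, p = 4.0, 0.4
dist = nbinom(r, p)
print(dist.mean(), r * (1 - p) / p)      # expectation: approx 6
print(dist.var(), r * (1 - p) / p**2)    # variance: approx 15 > mean, i.e. overdispersed

# For large r and p close to one, nbinom(r, p) is close to Poisson(r(1-p))
k = np.arange(10)
diff = np.max(np.abs(nbinom.pmf(k, 1000.0, 0.999)
                     - poisson.pmf(k, 1000.0 * 0.001)))
print(diff)   # very small: the Poisson approximation works here
```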

Observe that, when going from Poisson to negative binomial, the number of parameters to estimate in models such as (1) and (5) is doubled. This may be expected to cause numerical problems for systems of large dimension. However, our calculations so far suggest that, analogously to the Poisson case, the number of parameters that need to be estimated numerically is reduced by the fact that the empirical means equal the estimated expectations under the model. To the best of our knowledge, this fact remains to be proved.

In the remainder of the article, three examples of data analysis follow in Secs. 2–4. In Sec. 5, a concluding discussion is provided.

2. Example: veterinary science, rearing of broilers

2.1. Background

We study data from organic rearing of broilers in Sweden, more precisely data concerning the welfare of chickens. Data from eight farms in Sweden were investigated. From visual inspections, various signs of injury to the chickens had been registered, with the value 0 meaning no injury and higher numbers on an ordinal scale indicating increasingly severe injury. The scope of the initial study was to gather new information regarding health and other welfare aspects, housing and management routines, in order to describe the present situation on organic broiler farms in Sweden (Göransson, Yngvesson, and Gunnarsson Citation2020).

After initial cleaning of the data set, disregarding some missing values, measurements of 300 individuals remained. For the analysis below, we chose to study four variables. In the table below, these variables are stated along with short descriptions and the values taken by each variable (ordinal scales):

Recall that 0 means no injury at all, and higher levels on the ordinal scale indicate increased injury on a bird. The full ranges of the variables were x1: 0–2, x2: 0–4, x3: 0–4, x4: 0–3, but the table above lists only the levels at which counts actually occurred in the data set.

2.2. Statistical analysis

To check the assumption of a Poisson distribution, we simply present empirical means and variances for all variables. In Table 1, we see that the Poisson assumption is not unreasonable. Note a slight underdispersion for the fourth variable.

Table 1. Veterinary data, means and variances.

We also give the empirical correlations; see Table 2. The method of Larsson (Citation2020) only works when all variables have non-negative correlation. Here we find one negative correlation, but its magnitude is quite small.

Table 2. Veterinary data, empirical correlations.

2.3. Discrete factor analysis

After applying the forward search method, discrete factor analysis proposes as the best model to link the variables x2 and x4 in one group with a common factor, and to keep the variables x1 and x3 in separate groups of their own. The parameter estimates are given in Table 3.

Table 3. Veterinary data, estimated parameters.

Note that, as we should have, the observed means are obtained as μ̂1 and μ̂3 for the variables x1 and x3, while for the variable x2, λ̂ + μ̂2 ≈ 0.12, which is the observed mean of x2, and for x4 we similarly have λ̂ + μ̂4 ≈ 0.54.

The estimated correlations from the factor model are given in Table 4. Here, we have used (2) to calculate the estimated correlation between x2 and x4.

Table 4. Veterinary data, estimated correlations from the factor model.

We find that the estimated correlation between x2 and x4 agrees well with the corresponding sample correlation. Naturally, all the other estimated correlations from the model are zero since these are between variables that are considered independent. In particular, this means that when comparing to sample values, we miss out on the correlation between x3 and x4 by some margin.

2.4. Discussion. Other techniques

With cluster analysis on these data, via the procedure kmeans with three clusters, it was the variables x1 and x2 that fell into the same cluster. Given the low sample correlation between x1 and x2, we find this result a little surprising. One reason could be the relative similarity between x1 and x2 in terms of their sample means and sample variances.

Finally, we compare the result of the discrete factor analysis to a principal component analysis (PCA). Denote by z1, z2, z3 the first three principal components. Usually, the cumulative percentage of the total variance is reported. Here, z1 accounts for 36%, z1 and z2 together for 62%, and the first three principal components for 83%.
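Cumulative percentages of this kind can be obtained from the eigenvalues of the sample correlation matrix. The sketch below illustrates this on simulated correlated counts; the data are purely illustrative, not the veterinary data set itself.

```python
import numpy as np

# Illustrative correlated count data: four variables sharing a common factor
rng = np.random.default_rng(3)
common = rng.poisson(2.0, (300, 1))
x = common + rng.poisson(1.0, (300, 4))

# PCA via the eigenvalues of the correlation matrix: each eigenvalue's share
# of the total gives the variance explained by the corresponding component
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(x, rowvar=False)))[::-1]
cumvar = np.cumsum(eigvals) / eigvals.sum()
print(np.round(100 * cumvar, 1))   # cumulative % explained by z1, z1+z2, ...
```

Because all four simulated variables load on one common factor, the first component carries most of the variance; in real data the profile is of course flatter, as in the percentages reported above.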

In Table 5, we note that the first principal component loads highly on variables x2, x3 and x4; this is not exactly the result from the discrete factor analysis, but it points in a similar direction.

Table 5. Veterinary data, estimated principal components.

We conclude with an interpretation, from an ethological point of view, of the finding that variables 2 and 4 form a factor. In earlier analyses, researchers have found relationships between hock burns and plumage cleanliness. The reason could be that birds with hock burns may to a large extent be seated in the litter material, and hence get dirtier (if the litter quality is bad).

3. Example: musicology, features of fugue subjects

3.1. Background

Rydén (Citation2020) has considered a problem arising in musicology. In the musical art form called the fugue, a so-called subject is first presented. By quantifying fugue subjects, comparisons can be made on a statistical basis between J.S. Bach and composers from later epochs, with works divided a priori into three categories depending on the historical background of the composition.

A subject could be seen as a melody, which needs to be quantified by some numerical measures. Rydén (Citation2020) chose the following integer-valued variables:

Consider, for instance, the following subject by Bach in Figure 1, from the Fantasia and Fugue in C minor (BWV 537):

Figure 1. J.S. Bach: subject from Fugue in C minor (BWV 537).

In this example, we find the following observed values of the variables:

Note that the tied note in bar 2 implies one single count of the note (A flat).

For details on the data collection, see Rydén (Citation2020). In all, 238 fugue subjects were collected and features compiled.

3.2. Statistical analysis: prerequisites

In our analysis, it is natural to subtract the respective minima (over individuals i) and consider the transformed data yj = xj − min_i xij.

We start by performing some descriptive analysis of the data. The means and variances are given in Table 6.

Table 6. Music data, means and variances.

We find signs of overdispersion for y1 in particular, and also for y2 and y4. The large variance in y1 is due to some outliers.

Next, we give the empirical Pearson correlations in Table 7. We find one negative correlation (between y3 and y4), but it is so close to zero that we consider it a minor problem; it might well be a zero correlation that has been estimated as negative by pure chance.

Table 7. Music data, empirical correlations.

3.3. Statistical analysis: discrete factor analysis

Next, we go on to search for the 'best' discrete factor analysis model, i.e., the one with the smallest possible AIC, using the forward search method of Larsson (Citation2020).

We find a model where variables 1, 2, 5, 6 have a common factor, while variables 3 and 4 turn out to be independent of the others, as well as of each other. This is in accord with Table 7, where we see that the estimated correlation between y3 and y4 is very close to zero (as already mentioned), and also that y3 and y4 are the variables that correlate the least with all the others.

The parameter estimates for the model are given in Table 8. In particular, note that for variables yj with a common factor, λ̂ + μ̂j equals the observed mean of yj, and that for unrelated variables (like y4), μ̂j equals the observed mean.

Table 8. Music data, estimated parameters, Poisson model.

In order to validate the model, we also give the estimated correlations from the model in Table 9; cf. Eq. (2). Comparing with Table 7, we see that in almost all cases, the model estimates of the correlations are lower than the corresponding empirical ones.

Table 9. Music data, estimated correlations from the Poisson factor model.

To account for the overdispersion present in some of the variables, we also fitted a negative binomial factor model to the group (y1, y2, y5, y6) as well as to y4. (The variable y3 did not show signs of overdispersion, so it was left as is.) The corresponding parameter estimates are given in Table 10.

Table 10. Music data, estimated parameters, negative binomial model.

Observe that, for the underdispersed variable y5, the estimated r is very large and the estimated p is extremely close to one. This is because for large r and p close to one, the negative binomial distribution is close to Poisson. Also, for y6, which is close to being underdispersed, we have a large estimated r and an estimated p fairly close to one.

A model check of the negative binomial model may be done both for variances and correlations. (As for Poisson, the expectations are estimated without error.) Estimated variances are given in Table 11, and estimated correlations in Table 12. Comparing with Table 6, the agreement of the variances is not too bad, with exceptions for variables 5 and 6, where the model overestimates by some margin. As for the correlations, they fit the empirical ones better than the Poisson estimates do, as long as variable 1 is not involved. The correlations between variable 1 and the other variables are more underestimated here than in the Poisson case.

Table 11. Music data, estimated variances from the negative binomial model.

Table 12. Music data, estimated correlations from the negative binomial model.

3.4. Discussion. Other techniques

It is interesting to compare with the results of a PCA. Concerning the cumulative percentage of the total variance for this data set, the first principal component accounts for 51%, the first two for 68% and the first three for 80%. An interpretation of the coefficients can be made (see Table 13): the first component is essentially a weighted linear combination of the variables with positive weights, with less weight on x4, the initial interval in semitones. Turning to the second principal component, we find a contrast between variables x1 and x3 (length and number of pitch classes, in a sense overall measures of the subject) and x4 and x6 (interval features, the inner construction of the subject).

Table 13. Music data, estimated principal components.

In addition, a conventional factor analysis with two factors was carried out. The first factor put less weight on variable x4 (cf. the discrete factor analysis).

Finally in this section, we mention that we have also tried traditional cluster analysis on these data, using the procedure kmeans in Matlab. As an initial normalization (after possible transformation), we divided the variables by their respective means. Instructing kmeans to form two clusters on the original data, we obtained one cluster with variables 1, 2, 3, 5, 6 and one with variable 4 only. For the transformed data and with three clusters, we obtained 1, 2, 5, 6, with 3 and 4 each alone. The latter result agrees exactly with the result of our discrete factor analysis procedure.

4. Example: Agriculture, damage to potato tubers

4.1. Background

When harvesting, potatoes can be damaged by the lifter device. In experiments performed at Wageningen, the Netherlands, eight types of lifting rods were compared (Keen and Engel Citation1997). Two energy levels, six genotypes/varieties and three weight classes were used. Most combinations of treatments involved about 20 potato tubers, and tubers were rated on a scale from undamaged to severely damaged. The data are found in the R package agridat (Wright Citation2018). In the following, we will only consider the four variables in the dataset that are either ordinal or counts.

In all, we consider a controlled experiment and face a data frame with 1152 observations on 4 variables:

Here, variables x1–x3 are ordinal and we present them simply as integers.

4.2. Statistical analysis

In order to adapt to the Poisson distribution (or negative binomial), and to get positive correlations, we transformed the original variables x1, …, x4 according to y1 = x1 − min_i xi1 and yj = max_i xij − xj for j = 2, 3, while y4 = x4 was left untransformed.
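The point of reflecting a variable about its maximum, y = max x − x, is that it flips the sign of every correlation involving that variable, since the transformation is affine with negative slope. A small sketch with simulated data (not the actual potato data):

```python
import numpy as np

# Two ordinal variables constructed to be negatively related (illustrative)
rng = np.random.default_rng(4)
x4 = rng.poisson(3.0, 1000)
x3 = np.clip(5 - x4 + rng.integers(-1, 2, 1000), 0, 5)

# Reflection about the maximum: an affine map with slope -1,
# so corr(y3, x4) = -corr(x3, x4) exactly
y3 = x3.max() - x3
c_before = np.corrcoef(x3, x4)[0, 1]
c_after = np.corrcoef(y3, x4)[0, 1]
print(c_before, c_after)   # negative before, positive after
```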

A preliminary check of means and variances (see Table 14) reveals that variable 4 is overdispersed, while the other variables are underdispersed.

Table 14. Potato data, means and variances.

The empirical correlations are given in Table 15. Because the data stem from a controlled experiment, the empirical correlations between y1, y2, y3 are all zero. The interest lies in the empirical correlations of y4 with the other variables. We find that, as intended by the initial transformation, all of these are positive. It is also clear that the only correlation of real interest is the one between y4 and y3.

Table 15. Potato data, empirical correlations.

4.3. Discrete factor analysis

We went on to search for the best Poisson factor model. As expected, this model identified a common factor for variables 3 and 4, while all other variables turned out to be independent. For these variables, the parameter estimates as usual equal the corresponding means; see Table 16.

Table 16. Potato data, estimated parameters, Poisson model.

Moving on to handle the overdispersion, we estimated negative binomial models for the pair (y3, y4). The results of these estimations are given in Table 17.

Table 17. Potato data, estimated parameters, negative binomial model.

As in the music example of Sec. 3, for the underdispersed variable y3, the large estimated r and the estimated p extremely close to one are striking.

Moreover, just as for the Poisson model, the estimated expectations of the negative binomial model equal the empirical means. However, the estimated variances are not equal to their sample counterparts; these are given in Table 18. As expected, the model overestimates the variance of the underdispersed variable 3. For variable 4, it also overestimates, but the estimate may still be considered better than the corresponding Poisson model variance, which equals the mean and is much too low.

Table 18. Potato data, estimated variances from the negative binomial model.

The estimated correlations from the Poisson model are zero except for the correlation between y3 and y4, which is 0.24. This underestimates the empirical correlation from Table 15, 0.58. The corresponding number for the negative binomial model is 0.06, an even more severe underestimation.

4.4. Discussion. Other techniques

As for other methods, kmeans (with two clusters) grouped variables 1-3 together and variable 4 separately. This might be because of the large sample variance of variable 4.

The first two principal components of a PCA are given in Table 19. We find that the first principal component mainly loads on variable 4, but it also agrees with our analysis in the sense that it loads a little on variable 3. The other components, of which the second is shown here, each load heavily on a single variable; given the design of the experiment, this should be considered natural. Finally, let us remark that the first principal component accounts for 40% of the variance, while the first two account for 65%.

Table 19. Potato data, estimated principal components.

Finally, let us mention that for this data set a conventional factor analysis with one factor gives a factor that loads heavily on y3 and y4 but basically not on any other variable. This is quite in accord with our results.

5. Discussion

Investigating (possible) relationships between various variables is common in many domains of science. Factor analysis is one of the classical tools of multivariate statistical analysis, here applied in terms of the methodology of Larsson (Citation2020), which is suitable for discrete data.

The three data sets considered in this article represent different situations in terms of data type. In Sec. 2, all four variables were on the ordinal scale, while the data set in Sec. 3 consisted exclusively of "true" count data. In Sec. 4, a situation with three ordinal variables and one count variable was investigated (a controlled experiment). Interpreting the results of a factor analysis from the applied user's point of view is often not straightforward, but comparisons with, e.g., PCA indicate that the results and practical conclusions of the new methodology are valid.

In this paper, in order to handle overdispersed data, we have generalized the Poisson assumption of Larsson (Citation2020) to the negative binomial. However, the search algorithm still builds on Poisson models, so incorporating the negative binomial into the search itself is a generalization that remains to be made. One could also consider other models, for example combinations of the Poisson with other distributions.

Acknowledgements

We would like to thank the editor and the anonymous referee for helpful suggestions to improve the manuscript.

References

  • Göransson, L., J. Yngvesson, and S. Gunnarsson. 2020. Bird health, housing and management routines on Swedish organic broiler chicken farms. Animals 10 (11):2098. doi:10.3390/ani10112098.
  • Hilbe, J. M. 2011. Negative binomial regression. 2nd ed. New York: Cambridge University Press.
  • Jöreskog, K. G., U. H. Olsson, and F. Y. Wallentin. 2016. Multivariate analysis with LISREL. Switzerland: Springer International Publishing.
  • Karlis, D. 2003. An EM algorithm for multivariate Poisson distribution and related models. Journal of Applied Statistics 30 (1):63–77. doi:10.1080/0266476022000018510.
  • Keen, A., and B. Engel. 1997. Analysis of a mixed model for ordinal data by iterative re-weighted REML. Statistica Neerlandica 51 (2):129–44. doi:10.1111/1467-9574.00044.
  • Larsson, R. 2020. Discrete factor analysis using a dependent Poisson model. Computational Statistics 35 (3):1133–52. doi:10.1007/s00180-020-00960-w.
  • Rydén, J. 2020. On features of fugue subjects. A comparison of J.S. Bach and later composers. Journal of Mathematics and Music 14 (1):1–20. doi:10.1080/17459737.2019.1610193.
  • Wright, K. 2018. Agridat: Agricultural Datasets. R package version 1.16. https://CRAN.R-project.org/package=agridat