Regression models of Pearson correlation coefficient


Abstract

We propose two simple regression models of Pearson correlation coefficient of two normal responses or binary responses to assess the effect of covariates of interest. Likelihood-based inference is established to estimate the regression coefficients, upon which bootstrap-based method is used to test the significance of covariates of interest. Simulation studies show the effectiveness of the method in terms of type-I error control, power performance in moderate sample size and robustness with respect to model mis-specification. We illustrate the application of the proposed method to some real data concerning health measurements.


1. Introduction

Most regression models have been developed to describe the relationship between the expected value of the response(s) and a number of covariates (predictors). In some situations, however, the aim is to understand how covariates influence the strength of the Pearson correlation between two responses. For example, in psychological studies it is of interest to understand the impact of age and/or sex on the association between the physical and mental functionality of older adults (Thomas et al., 2016).

One traditional measure of such association is the partial correlation coefficient, i.e., the correlation coefficient of two responses after removing the covariate effect (Anderson, 2003). However, its inference depends on the normality assumption, and it ignores the actual values of the covariates, which may have different effects on the association. As a rank-based counterpart, Liu et al. (2018) proposed a covariate-adjusted partial Spearman's rank correlation using probability-scale residuals, which is particularly useful for ordinal responses.

Another method is to use a conditional approach to assess the correlation at different levels of the covariates. Bartlett (1993) proposed modelling the Fisher z-transformed correlation by a polynomial regression on a single covariate z. His model builds a linear regression upon paired observations $(z_{k}, r(z_{k}))$, $k = 1, \dots, K$, where $r(z_{k})$ is the sample correlation coefficient of the two responses at $z_{k}$ over K distinct levels of z. One clear drawback of this method is that it requires repeated measures of the responses at each $z_{k}$, which demands a large total sample size, and an even larger one when multiple covariates are included. Other studies in this direction include Yu and Dunn (1982) and Paul (1989). Wilding et al. (2011) proposed a model relating a transformed correlation coefficient of two normal responses ($y_{1}$ and $y_{2}$) to a linear combination of covariates ($z_{1}, \dots, z_{p}$) through the probit link, i.e., $(1 + ρ)/2 = Φ(γ_{0} + γ_{1} z_{1} + \dots + γ_{p} z_{p})$, where $ρ = \mathrm{corr}(y_{1}, y_{2})$ and Φ is the distribution function of the standard normal variable. (A restricted-maximum-likelihood-corrected likelihood ratio test is used to test the significance of covariates.) We point out that the linearly transformed correlation $(1 + ρ)/2$ is guaranteed to lie in the range of a distribution function. However, the approach can lose efficiency when ρ is known to be positive (since ρ can then be modelled directly without transformation).

All the aforementioned methods consider continuous responses. In many situations, only binary responses are available, e.g., after dichotomization. In this paper, we propose two simple regression models of the Pearson correlation coefficient of two normal responses or two binary responses (Section 2). Specifically, when the correlation coefficient is known to be positive, the logistic link function is used, and when the correlation coefficient is arbitrary in $(-1, 1)$, the hyperbolic tangent (or Fisher z-transformation) link function is used. Likelihood-based inference is developed to estimate the regression coefficients (Section 3). We demonstrate the performance of the proposed method by simulation in Section 4 and illustrate its application in detail on a real data set concerning the physical and mental functionality of older adults (Section 5). Technical details are deferred to the Appendix.

2. Model

Let $(y_{1}, y_{2})$ be a pair of responses of a subject with correlation coefficient $ρ = \mathrm{corr}(y_{1}, y_{2})$. Let $x_{1}, \dots, x_{p}$ denote p covariates associated with the paired responses, which follow a joint distribution $f(x_{1}, \dots, x_{p})$. Consider modelling ρ as a monotonic function of a linear combination of these covariates through $$ρ = h(x^{⊤} β), \tag{1}$$ where $x = (1, x_{1}, \dots, x_{p})^{⊤}$ and $β = (β_{0}, β_{1}, \dots, β_{p})^{⊤}$. When $ρ > 0$, we use the logistic function $h(x) = (1 + e^{-x})^{-1}$, referred to as link 1. When $-1 < ρ < 1$, we use the hyperbolic tangent function $h(x) = \tanh(x) = (e^{2x} - 1)/(e^{2x} + 1)$, referred to as link 2. We adopt these two commonly used link functions to map the real line to the intervals $(0, 1)$ and $(-1, 1)$, respectively, as they are analytically simple. Other links can also be used, e.g., the probit link for $ρ > 0$. When the correlation is known to be positive, the logistic function is preferred as it is more efficient for estimation and easy to interpret. Let $h'$ and $h''$ denote the first and second derivatives of h, respectively. Our goal is to assess model (1).
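As a minimal sketch of the two links in model (1), the following Python snippet evaluates h and its first two derivatives, written as functions of ρ = h(t). The function names are illustrative and not part of the paper; the derivative formulas are the standard ones for the logistic and hyperbolic tangent functions.

```python
import numpy as np

def logistic_link(t):
    """Link 1: maps the real line to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def tanh_link(t):
    """Link 2: maps the real line to (-1, 1)."""
    return np.tanh(t)

def link_derivatives(rho, link):
    """Return (h', h'') expressed through rho = h(t)."""
    if link == 1:
        d1 = rho * (1.0 - rho)           # h' for the logistic function
        d2 = d1 * (1.0 - 2.0 * rho)      # h'' for the logistic function
    elif link == 2:
        d1 = 1.0 - rho**2                # h' for tanh
        d2 = -2.0 * rho * d1             # h'' for tanh
    else:
        raise ValueError("link must be 1 or 2")
    return d1, d2
```

Writing the derivatives in terms of ρ rather than the linear predictor is convenient because the score and Hessian in later sections are also expressed through ρ.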

2.1. Bivariate normal responses

Assume that $y_{1}$ and $y_{2}$ follow a bivariate normal distribution with marginal distributions $N(μ_{1}, σ_{1}^{2})$ and $N(μ_{2}, σ_{2}^{2})$, respectively. For j = 1, 2, we further model the expected value of $y_{j}$ by a linear regression, i.e., $E(y_{j} ∣ x) = μ_{j} = x^{⊤} γ_{j}$, where $γ_{j} = (γ_{j0}, γ_{j1}, \dots, γ_{jp})^{⊤}$. Denote by $η = (γ_{1}^{⊤}, γ_{2}^{⊤}, σ_{1}^{2}, σ_{2}^{2})^{⊤}$ the collection of all nuisance parameters.

The log-likelihood function is $$ℓ(β, η) = -\log(2π) - \frac{T_{1}}{2(1 - ρ^{2})} + \frac{ρ T_{2}}{1 - ρ^{2}} - \frac{\log(1 - ρ^{2})}{2}, \tag{2}$$ where $$T_{1} = \left(\frac{y_{1} - μ_{1}}{σ_{1}}\right)^{2} + \left(\frac{y_{2} - μ_{2}}{σ_{2}}\right)^{2}, \quad T_{2} = \frac{(y_{1} - μ_{1})(y_{2} - μ_{2})}{σ_{1} σ_{2}}, \tag{3}$$ $ρ = h(x^{⊤} β)$, and $μ_{j} = x^{⊤} γ_{j}$. The first and second derivatives of ℓ with respect to $β$ are, respectively, $$s(β, η) = \{a(ρ) T_{1} + b(ρ) T_{2} + c(ρ)\} x, \quad H(β, η) = \{u(ρ) T_{1} + v(ρ) T_{2} + w(ρ)\} x x^{⊤}, \tag{4}$$ where a, b, c, u, v and w are given in Table 1. Since $E(T_{1} ∣ x) = 2$ and $E(T_{2} ∣ x) = ρ$, we have $E(s) = 0$, where $0$ is the zero vector.
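Since Table 1 is not reproduced here, the sketch below uses coefficients obtained by differentiating (2) directly under link 2 (tanh), namely $a(ρ) = -ρ/(1 - ρ^{2})$, $b(ρ) = (1 + ρ^{2})/(1 - ρ^{2})$ and $c(ρ) = ρ$. These are a derivation made for this illustration, not a copy of the authors' table, and the function names are hypothetical. The snippet can be used to check the score against a finite difference of the log-likelihood and to confirm that the score has mean zero when $T_{1} = 2$ and $T_{2} = ρ$.

```python
import numpy as np

def loglik_normal_tanh(t, T1, T2):
    """Log-likelihood (2) for one subject as a function of the
    linear predictor t, with rho = tanh(t)."""
    rho = np.tanh(t)
    om = 1.0 - rho**2
    return -np.log(2.0 * np.pi) - T1 / (2.0 * om) + rho * T2 / om - 0.5 * np.log(om)

def score_multiplier_tanh(rho, T1, T2):
    """Scalar a(rho)*T1 + b(rho)*T2 + c(rho) multiplying x in the score (4),
    with a, b, c derived here for the tanh link (assumption, see lead-in)."""
    om = 1.0 - rho**2
    a = -rho / om
    b = (1.0 + rho**2) / om
    c = rho
    return a * T1 + b * T2 + c
```

Plugging in the conditional expectations gives $2a(ρ) + ρ\,b(ρ) + c(ρ) = 0$, which is the identity behind $E(s) = 0$.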

Table 1. Some functions under two link functions.


2.2. Bivariate binary responses

Assume that both $y_{1}$ and $y_{2}$ are binary responses. Let $E(y_{1}) = p_{1}$ and $E(y_{2}) = p_{2}$. For a, b = 0, 1, let $I_{ab} = I(y_{1} = a, y_{2} = b)$, where $I(\cdot)$ is the indicator function, and $p_{ab} = E(I_{ab}) = P(y_{1} = a, y_{2} = b)$. Clearly, $p_{00} + p_{01} + p_{10} + p_{11} = 1$. By definition, we have $$p_{1} = p_{11} + p_{10}, \quad p_{2} = p_{11} + p_{01}, \quad ρ = \frac{p_{11} - p_{1} p_{2}}{\sqrt{p_{1}(1 - p_{1}) p_{2}(1 - p_{2})}}.$$ Conversely, given $p_{1}$, $p_{2}$ and ρ, we obtain $$p_{11} = p_{1} p_{2} + ρ \sqrt{p_{1}(1 - p_{1}) p_{2}(1 - p_{2})}, \quad p_{10} = p_{1} - p_{11}, \quad p_{01} = p_{2} - p_{11}. \tag{5}$$ For j = 1, 2, we further model the expected value of $y_{j}$ by a logistic regression through $p_{j} = \{1 + \exp(-x^{⊤} γ_{j})\}^{-1}$, where $γ_{j}$ is as defined in Section 2.1. Denote $η = (γ_{1}^{⊤}, γ_{2}^{⊤})^{⊤}$. (Through (1) and (5), $p_{ab}$ is a function of $(β, η)$.)
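The inversion in (5) can be illustrated with a short Python sketch. The function name `cell_probs` and the explicit feasibility check are additions for illustration only; the check simply rejects a ρ that would produce a negative cell probability for the given margins.

```python
import math

def cell_probs(p1, p2, rho):
    """Joint cell probabilities of two binary responses from the
    margins p1, p2 and the Pearson correlation rho, following (5)."""
    d = math.sqrt(p1 * (1.0 - p1) * p2 * (1.0 - p2))
    p11 = p1 * p2 + rho * d
    p10 = p1 - p11
    p01 = p2 - p11
    p00 = 1.0 - p11 - p10 - p01
    probs = {"11": p11, "10": p10, "01": p01, "00": p00}
    if min(probs.values()) < 0.0:
        raise ValueError("rho is not feasible for these marginal probabilities")
    return probs
```

For example, `cell_probs(0.3, 0.6, 0.2)` returns four nonnegative probabilities that sum to one and reproduce the correlation 0.2 when substituted back into the definition of ρ.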

By (5), the log-likelihood is $$ℓ(β, η) = \sum_{a, b = 0, 1} I_{ab} \log p_{ab} = \sum_{a, b = 0, 1} I_{ab} \log(c_{ab} + d_{ab} ρ),$$ where $$\begin{aligned} c_{11} & = p_{1} p_{2}, & d_{11} & = \sqrt{p_{1}(1 - p_{1}) p_{2}(1 - p_{2})}, \\ c_{10} & = p_{1}(1 - p_{2}), & d_{10} & = -d_{11}, \\ c_{01} & = p_{2}(1 - p_{1}), & d_{01} & = -d_{11}, \\ c_{00} & = (1 - p_{1})(1 - p_{2}), & d_{00} & = d_{11}. \end{aligned}$$ Then, the first and second derivatives of ℓ with respect to $β$ are, respectively, $$s(β, η) = \sum_{a, b = 0, 1} I_{ab} e_{ab}(ρ) x, \quad H(β, η) = \sum_{a, b = 0, 1} I_{ab} f_{ab}(ρ) x x^{⊤}, \tag{6}$$ where $$e_{ab}(ρ) = \frac{d_{ab} h'(x^{⊤} β)}{c_{ab} + d_{ab} ρ}, \quad f_{ab}(ρ) = \frac{-d_{ab}^{2} \{h'(x^{⊤} β)\}^{2} + d_{ab}(c_{ab} + d_{ab} ρ) h''(x^{⊤} β)}{(c_{ab} + d_{ab} ρ)^{2}}.$$ The detailed expressions of $e_{ab}$ and $f_{ab}$ under the two choices of h are given in the Appendix. When $E(I_{ab} ∣ x) = p_{ab}$, we have $E(s) = 0$.
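The general expressions for $e_{ab}$ and $f_{ab}$ can be checked numerically. The sketch below, with a hypothetical helper `binary_terms`, evaluates $\log p_{ab}$, $e_{ab}$ and $f_{ab}$ for one cell at a given linear predictor, using the standard derivatives of the logistic and tanh functions; comparing the last two with finite differences of the first is a quick sanity check.

```python
import numpy as np

def binary_terms(t, c, d, link):
    """For one cell with constants (c_ab, d_ab), return
    (log p_ab, e_ab, f_ab) at linear predictor t under link 1 or 2."""
    if link == 1:
        rho = 1.0 / (1.0 + np.exp(-t))
        h1 = rho * (1.0 - rho)
        h2 = h1 * (1.0 - 2.0 * rho)
    elif link == 2:
        rho = np.tanh(t)
        h1 = 1.0 - rho**2
        h2 = -2.0 * rho * h1
    else:
        raise ValueError("link must be 1 or 2")
    p = c + d * rho                              # cell probability p_ab
    e = d * h1 / p                               # first-derivative multiplier
    f = (-d**2 * h1**2 + d * p * h2) / p**2      # second-derivative multiplier
    return np.log(p), e, f
```

Because $d_{11} + d_{10} + d_{01} + d_{00} = 0$, the weighted sum $\sum_{a,b} p_{ab} e_{ab}$ vanishes, which is the mean-zero property of the score stated above.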

3. Inference

Let $\{(y_{i1}, y_{i2}, x_{i1}, \dots, x_{ip}) : i = 1, \dots, n\}$ denote independent samples of $(y_{1}, y_{2}, x_{1}, \dots, x_{p})$ of size n. For $i = 1, \dots, n$ and j = 1, 2, denote $x_{i} = (1, x_{i1}, \dots, x_{ip})^{⊤}$, $y_{j} = (y_{1j}, \dots, y_{nj})^{⊤}$, $\bar{x} = n^{-1} \sum_{i=1}^{n} x_{i}$ and $X_{n \times (p+1)} = (x_{1}, \dots, x_{n})^{⊤}$.

The gradient (first derivative) and the Hessian matrix (second derivative) of the log-likelihood over n independent samples with respect to $β$ are $s(β, η) = \sum_{i=1}^{n} s_{i}(β, η)$ and $H(β, η) = \sum_{i=1}^{n} H_{i}(β, η)$, respectively, where $s_{i}(β, η)$ and $H_{i}(β, η)$ are obtained by replacing $(y_{1}, y_{2}, x)$ in (4) or (6) by $(y_{i1}, y_{i2}, x_{i})$ for the ith subject.

In addition, let $G_{i}(β, η) = \partial^{2} ℓ_{i} / \partial β \partial η^{⊤}$. Denote $Θ = E(H_{i})$ and $J = E(G_{i})$, where the expectation is taken with respect to the joint distribution of $(y_{i1}, y_{i2}, x_{i})$. Let $I = E\{s_{i}(β, η) s_{i}^{⊤}(β, η)\}$ denote the Fisher information matrix with respect to $β$. It is noted that, unlike the usual regression of the mean value of the response, the regularity condition does not hold for model (1) (Tsiatis, 2006, Theorem 3.2). (It can be easily verified that $I \neq -Θ$.)

Let $\hat{η}$ be the maximum likelihood estimate (MLE) of $η$ obtained from the marginal models. For instance, the marginal MLEs of the nuisance parameters under the bivariate normal response case are ${\hat{γ}}_{j} = (X^{⊤} X)^{- 1} X^{⊤} y_{j}$ , ${\hat{σ}}_{j}^{2} = n^{- 1} \sum_{i = 1}^{n} (y_{j i} - {\hat{μ}}_{j i})^{2}$ , where ${\hat{μ}}_{j i} = x_{i}^{⊤} {\hat{γ}}_{j}$ . By large sample theory, we have $\hat{η} \overset{p}{\to} η$ and $\sqrt{n} (\hat{η} - η) \overset{d}{\to} N (0, Σ_{η})$ , where $Σ_{η}$ is the covariance matrix.

Next, we estimate $β$ by the Newton–Raphson method, iterating $$β^{(r+1)} = β^{(r)} - H^{-1}(β^{(r)}, \hat{η}) s(β^{(r)}, \hat{η}), \tag{7}$$ where $β^{(r)}$ is the estimate of $β$ at the rth iteration. The convergence of the Newton–Raphson method depends on the initial point and on the negative definiteness of the Hessian matrix, which guarantees a (unique) global maximizer of the log-likelihood function. The initial estimate $β^{(1)}$ can be chosen such that $h(\bar{x}^{⊤} β^{(1)})$ is in the range of ρ. However, the Hessian matrix involves the plug-in estimators of the nuisance parameters, and it is not trivial to establish its negative definiteness theoretically. Nevertheless, our numerical studies show that the condition holds, with all eigenvalues being negative. Denote the limit of the Newton–Raphson iteration by $\hat{β}$.
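The two-stage procedure (marginal estimation of the nuisance parameters followed by the Newton–Raphson iteration for $β$) can be sketched as follows for bivariate normal responses under link 2. This is only an illustrative sketch, not the authors' R package: the function name is hypothetical, and the score and Hessian multipliers are derived here by differentiating the log-likelihood under the tanh link rather than taken from Table 1. No step-size control is included, so convergence relies on a reasonable starting value.

```python
import numpy as np

def fit_beta_tanh(X, y1, y2, beta_init, max_iter=50, tol=1e-8):
    """Newton-Raphson estimate of beta under rho = tanh(x' beta),
    with nuisance parameters replaced by their marginal MLEs."""
    # Stage 1: marginal least-squares fits and standardized residuals.
    resid = []
    for y in (y1, y2):
        gamma_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ gamma_hat
        resid.append(r / np.sqrt(np.mean(r**2)))   # divide by the MLE of sigma_j
    T1 = resid[0]**2 + resid[1]**2
    T2 = resid[0] * resid[1]

    # Stage 2: Newton-Raphson iterations for beta.
    beta = np.asarray(beta_init, dtype=float).copy()
    for _ in range(max_iter):
        rho = np.tanh(X @ beta)
        om = 1.0 - rho**2
        # Per-subject score and Hessian multipliers for the tanh link
        # (derived for this sketch).
        g = (-rho * T1 + (1.0 + rho**2) * T2) / om + rho
        k = (-(1.0 + rho**2) * T1 + 4.0 * rho * T2) / om + om
        score = X.T @ g
        hess = (X * k[:, None]).T @ X
        step = np.linalg.solve(hess, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Here `X` is the n × (p + 1) design matrix including the intercept column, and `beta_init` plays the role of $β^{(1)}$, for instance $(0.25, 0, \dots, 0)$ as in the simulation study.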

Theorem 3.1

Suppose that the conditions of Lemma A.1 for the bivariate normal responses or those of Lemma A.2 for the bivariate binary responses hold. (Lemmas A.1 and A.2 are given in the Appendix.) Suppose the iteration (7) converges. Then (i) $\hat{β}$ is a consistent estimator of $β$, i.e., $\hat{β} \overset{p}{\to} β$, and (ii) $\sqrt{n}(\hat{β} - β) \overset{d}{\to} N(0, Σ)$, where Σ is the covariance matrix given in the proof.

The significance of the overall model is assessed by testing the hypothesis $H_{0} : β_{1} = \dots = β_{p} = 0$. Denote the Wald statistic by $W = \hat{β}_{-0}^{⊤} \widehat{\mathrm{cov}}^{-1}(\hat{β}_{-0}) \hat{β}_{-0}$, where $\hat{β}_{-0} = (\hat{β}_{1}, \dots, \hat{β}_{p})^{⊤}$ and $\widehat{\mathrm{cov}}(\hat{β}_{-0})$ is the estimate of the covariance of $\hat{β}_{-0}$. By Theorem 3.1, under the null hypothesis, W asymptotically follows a chi-square distribution with p degrees of freedom. The significance of an individual covariate $x_{j}$ ($j = 1, \dots, p$) is assessed by testing the hypothesis $H_{0j} : β_{j} = 0$ using the statistic $w_{j} = \hat{β}_{j} / \widehat{\mathrm{se}}(\hat{β}_{j})$, where $\widehat{\mathrm{se}}(\hat{β}_{j})$ is the estimate of the standard deviation of $\hat{β}_{j}$, i.e., the square root of the $(j, j)$th element of $\widehat{\mathrm{cov}}(\hat{β}_{-0})$. The null distribution of $w_{j}$ is asymptotically standard normal. Since the exact expression of Σ is rather complicated, we use the bootstrap to estimate it in practice.
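The bootstrap-based Wald test can be sketched generically. The text does not specify the resampling scheme, so the snippet below assumes a nonparametric bootstrap that resamples subjects (rows) with replacement; the function name `bootstrap_wald` and the `estimator` callback are hypothetical. The callback is expected to return the full coefficient vector with the intercept first.

```python
import math
import numpy as np

def bootstrap_wald(estimator, data, n_boot=200, seed=0):
    """Wald statistic W for the slopes, individual z statistics and
    two-sided normal p-values, with the covariance estimated by a
    nonparametric row bootstrap (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    beta_hat = np.asarray(estimator(data), dtype=float)
    boots = np.empty((n_boot, beta_hat.size))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample subjects with replacement
        boots[b] = estimator(data[idx])
    cov = np.cov(boots, rowvar=False)[1:, 1:]   # covariance of the slopes only
    slopes = beta_hat[1:]
    W = float(slopes @ np.linalg.solve(cov, slopes))
    z = slopes / np.sqrt(np.diag(cov))
    p_ind = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return W, z, p_ind
```

The global test compares W with the upper quantile of the chi-square distribution with p degrees of freedom (for instance 5.991 at level 0.05 when p = 2), while each entry of `p_ind` is the two-sided p-value of the corresponding $w_{j}$.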

4. Simulation study

We conduct simulations to examine the performance of the proposed model for making inference in estimation and testing.

4.1. Setup

Unlike the mean model, which may involve a large number of covariates, a small number of relevant covariates is usually adequate for modelling the correlation coefficient. We consider p = 2 and p = 3 covariates, and set the sample size n to 100, 200, 400 and 1000.

For bivariate normal responses, set the marginal variances $σ_{1}^{2} = σ_{2}^{2} = 1$ . For both bivariate normal responses and bivariate binary responses, we set both $γ_{1}$ and $γ_{2}$ to be a vector of p + 1 zeros for the marginal mean models. It simplifies the data generation without simplifying the estimation procedure. Let $x_{1}, \dots, x_{p}$ be independent uniform random variables in (0,1).

For the correlation coefficient model, set $β_{0} = 0.25$. Under the nonnull model, set $β_{1}, \dots, β_{p}$ under the two link functions as in Table 2, where only the first covariate is significant. Under the null model, simply set all of $β_{1}, \dots, β_{p}$ to zero. We set $(0.25, 0, \dots, 0)$ as the initial value for $β$ under both link functions.

Table 2. Specifications of $β_{1}, \dots, β_{p}$ under p = 2, 3 and two link functions.


Throughout, we set the significance level to 0.05 for both the global test and the individual tests, and set the number of replications, B, to 1000.

4.2. Results

First, denote the empirical root mean square error of $\hat{β}$ by $\mathrm{ERMSE} = B^{-1} \sum_{b=1}^{B} \{\sum_{j=0}^{p} (\hat{β}_{j}^{(b)} - β_{j})^{2}\}^{1/2}$, where $\hat{β}^{(b)}$ is the estimate of $β$ in the bth replication. It measures the performance of the estimation in terms of magnitude. Second, denote the directional consistency rate (CR) by $\mathrm{CR} = B^{-1} \sum_{b=1}^{B} \{n^{-1} \sum_{i=1}^{n} I(\hat{ρ}_{i}^{(b)} ρ_{i}^{(b)} > 0)\}$, where $ρ_{i}^{(b)}$ is the true correlation coefficient in the bth replication and $\hat{ρ}_{i}^{(b)} = h(x_{i}^{⊤} \hat{β}^{(b)})$. It measures the performance of the estimation in terms of sign direction.
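The two performance measures translate directly into array operations. In the sketch below (function and argument names are illustrative), `beta_hats` is a B × (p + 1) array of estimates, and `rho_hats` and `rho_true` are B × n arrays of fitted and true correlations.

```python
import numpy as np

def ermse(beta_hats, beta_true):
    """Average over replications of the Euclidean distance
    between the estimate and the true coefficient vector."""
    diff = np.asarray(beta_hats, dtype=float) - np.asarray(beta_true, dtype=float)
    return float(np.mean(np.sqrt(np.sum(diff**2, axis=1))))

def consistency_rate(rho_hats, rho_true):
    """Average over replications of the proportion of subjects whose
    fitted correlation has the same sign as the true correlation."""
    same_sign = np.asarray(rho_hats) * np.asarray(rho_true) > 0
    return float(np.mean(np.mean(same_sign, axis=1)))
```

For instance, with estimates (1, 0) and (0, 1) of a true vector (0, 0), `ermse` returns 1.0, since each replication is at distance one from the truth.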

Table 3 presents the type-I errors of the global test and the individual tests under the null model. It is seen that the proposed resampling test controls the type-I error well at the nominal level.

Table 3. Type-I errors (in %) of the global test (for the model significance) and individual test (for the significance of the covariates) under two types of responses, two link functions and various sample sizes.


Columns 4–5 and 9–10 of Table 4 present the ERMSE and CR under the nonnull model considered in Section 4.1. As the sample size increases, the ERMSE decreases while the CR increases, as expected. The CR ranges from 85% to 100%, indicating that the proposed model yields a good fit in the sign direction of the correlation coefficient. Columns 7–9 and 12–15 of Table 4 present the powers of the global test and the individual tests of the covariates. Both the global test and the individual test for the significant covariate (i.e., $x_{1}$) gain power as the sample size increases, as desired. The rejection rates for the insignificant covariates (i.e., $x_{2}$ and $x_{3}$) are close to the nominal level, as desired. We note that the power performance depends on the actual model. In general, the proposed method works well under moderate sample sizes.

Table 4. ERMSE, CR and testing power of β's of models considered in Table 2 with various sample sizes.


4.3. Sensitivity analysis

We adopt the simulation model from Wilding et al. (2011) for bivariate normal responses, where $y_{1} \sim N(1 + 2x_{1}, 1)$, $y_{2} \sim N(1.5 + 1.2x_{1}, 1)$ and their correlation coefficient is given by $ρ(y_{1}, y_{2}) = 2Φ(0.5 + β_{1} x_{1}) - 1$, with $x_{1}$ being a uniform random variable in $(0, 1)$. Consider $β_{1}$ to be 0, 0.5 and 1, respectively, to represent the null case and two nonnull cases. Set n to be 30, 60 and 150, respectively.
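One way to generate data from this design is sketched below. The construction of the correlated pair from two independent standard normal variables is a standard device chosen for this illustration, and the function name is hypothetical; the means, unit variances and the probit-type correlation follow the description above.

```python
import math
import numpy as np

def simulate_probit_corr(n, beta1, rng):
    """Draw (x1, y1, y2, rho) with y1 ~ N(1 + 2 x1, 1), y2 ~ N(1.5 + 1.2 x1, 1)
    and corr(y1, y2 | x1) = 2 * Phi(0.5 + beta1 * x1) - 1."""
    x1 = rng.uniform(0.0, 1.0, n)
    lin = 0.5 + beta1 * x1
    # Standard normal distribution function via the error function.
    Phi = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in lin]))
    rho = 2.0 * Phi - 1.0
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    y1 = 1.0 + 2.0 * x1 + z1
    # rho*z1 + sqrt(1 - rho^2)*z2 has unit variance and correlation rho with z1.
    y2 = 1.5 + 1.2 * x1 + rho * z1 + np.sqrt(1.0 - rho**2) * z2
    return x1, y1, y2, rho
```

Under the null case $β_{1} = 0$, the correlation is constant and equal to $2Φ(0.5) - 1 ≈ 0.383$.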

We compute the rejection rate for testing the significance of $x_{1}$ using the proposed method with link 2, in comparison with the method of Wilding et al. (2011), under which the model is correctly specified. This serves as a sensitivity analysis under model mis-specification. Table 5 reports the rejection rates under the considered cases. Under the null case, when $β_{1} = 0$, the proposed method controls the type-I error well. Under the nonnull cases, when $β_{1} = 0.5$ and 1, the proposed method is less powerful than Wilding et al.'s method when the sample size is 30 and becomes comparable in power when the sample size gets larger ($⩾ 60$).

Table 5. Rejection rates (in %) of the method of Wilding et al. (2011) and the proposed method for testing the significance of the covariate.


5. Real data application

Our data are taken from the National Health Measurement Study survey of 2005–2006 (Fryback, 2009). The survey collected responses to health-related quality of life questionnaires and health status questions through Short Form SF-36 questionnaires from 3844 older adults in the continental United States (1641 males and 2203 females).

5.1. Bivariate normal responses

First, we illustrate the application of the proposed model to the bivariate normal responses case. The SF-36 items were aggregated into eight health concepts, which were further summarized into the physical component summary (PCS) and the mental component summary (MCS), both in the range $(0, 100)$, with higher scores indicating better physical and mental health functions, respectively [28]. Several medical studies showed that the PCS declined significantly with age, but that the MCS did not change with age (Kim, 2016; Ware Jr et al., 1994). To continue along these lines, we applied the proposed method to further investigate the effect of gender and marital status. We used the Fisher z-transformation (i.e., tanh link) to accommodate negative values of the correlation. The estimated model for the correlation between PCS and age is found to be $$\tanh(ρ(\mathrm{PCS}, \mathrm{age})) = -0.203 - 0.145 \times \mathrm{MARRIED} + 0.018 \times \mathrm{SEX},$$ where MARRIED is 1 if married or living with a partner and 0 otherwise, and SEX is 1 if female and 0 if male. The corresponding p-values for marital status and gender are $< 0.001$ and 0.592, respectively, indicating strong significance of marital status and insignificance of gender. The negative coefficient of MARRIED implies that being married or living with a partner aggravates the decline of PCS with age ($-0.283$ for people married or living with a partner and $-0.209$ for people living by themselves). In other words, people who are married or living with a partner generally have relatively lower physical functionality as they age than people who live by themselves. On the other hand, our model concurred with the conclusion of an insignificant correlation between MCS and age. In addition, neither gender nor marital status has a significant effect on this correlation. (The details are omitted.)

Moreover, we applied the proposed model to study the age effect on the correlation between PCS and MCS. (Throughout this section the observations of age are standardized, denoted by $\mathrm{AGE}^{*}$, before being used as a covariate in the model.) No significant effect is found, which is supported by visual inspection of the correlations over 20 age intervals determined by the percentiles (Figure 1(a)). This result differs from the finding of Wilding et al. (2011), who showed a nonlinear downward trend of the correlation coefficient over age based on the 2003–2004 Health Outcome Survey data. When we further include gender and marital status as covariates, there is no significant gender effect either (p-value 0.312), as also seen in Figure 1(b). However, there appears to be a very mild effect of marital status (p-value 0.236), in that being married or living with a partner slightly reduces the correlation from slightly positive to near zero (Figure 1(c)). (The fitted model is $\tanh(ρ(\mathrm{PCS}, \mathrm{MCS})) = 0.148 - 0.016 \times \mathrm{AGE}^{*} - 0.063 \times \mathrm{MARRIED} - 0.052 \times \mathrm{SEX}$.)

Figure 1. (a) Correlation of PCS and MCS over 20 age groups, (b) correlation of PCS and MCS over 20 age groups with different genders, (c) correlation of PCS and MCS over 20 age groups with different marital status.

5.2. Bivariate binary responses

Second, we applied the proposed method to model the correlation of two binary indicators of stroke (STR) and diabetes (DIA) using the covariates of age, gender and marital status. Since the correlation is known to be positive, the logistic link is used. The estimated model, which includes the second-order age effect, is $$\begin{aligned} \mathrm{logistic}(ρ(\mathrm{STR}, \mathrm{DIA})) = {} & -2.050 - 1.352 \times \mathrm{AGE}^{*} - 0.090 \times \mathrm{AGE}^{*2} \\ & + 0.938 \times \mathrm{MARRIED} + 0.270 \times \mathrm{SEX} \end{aligned}$$ with strong significance of the linear age effect (p-value $< 0.001$) and the marriage effect (p-value 0.028), and insignificance of the second-order term of age (p-value 0.786) and the gender effect (p-value 0.675). It is seen from Figure 2(a) that the correlation exhibits a parabolic shape in age, where the largest value (about 0.25) occurs around age 55, after which the correlation declines. This indicates that the association of stroke and diabetes is strongly age dependent. Diabetes becomes a less significant risk factor for stroke after a certain age, e.g., 70 (Kim, 2016). Being married or living with a partner contributes significantly to the positive correlation (0.145 for married vs 0.106 otherwise), especially for people who are more than 50 years old (Figure 2(c)), which could be explained by the evidence of low physical functionality found before.

Figure 2. (a) Correlation of STROKE and DIABETES over 10 age groups, (b) correlation of STROKE and DIABETES over 10 age groups with different genders, (c) correlation of STROKE and DIABETES over 10 age groups with different marital status.

6. Concluding remarks

We propose two simple regression models of the Pearson correlation coefficient of two continuous responses or two binary responses. Likelihood-based inference is developed to estimate the model and test the significance of covariates. Our method for the binary response case is new to the literature and is useful for analysing data when outcomes are observed in binary form (such as yes or no), as seen in many psychological and sociological studies. The proposed method is easy to implement and computationally affordable. (The Newton–Raphson iteration converges in a few steps.) An R package is available upon request.

It is noted that for the binary response case, the correlation ρ actually has a restricted range given by $\max\{-ψ_{1} ψ_{2}, -(ψ_{1} ψ_{2})^{-1}\} ⩽ ρ ⩽ \min\{ψ_{1} ψ_{2}^{-1}, ψ_{2} ψ_{1}^{-1}\}$, where $ψ_{1} = \sqrt{p_{1} / (1 - p_{1})}$ and $ψ_{2} = \sqrt{p_{2} / (1 - p_{2})}$ (Qaqish, 2003). A more precise model than the Fisher z-transformation is warranted to further improve the efficiency.
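The restricted range is easy to compute for given marginal probabilities. The following sketch (with an illustrative function name) evaluates the two bounds stated above.

```python
import math

def rho_bounds(p1, p2):
    """Lower and upper bounds of the Pearson correlation of two binary
    responses with marginal success probabilities p1 and p2."""
    psi1 = math.sqrt(p1 / (1.0 - p1))
    psi2 = math.sqrt(p2 / (1.0 - p2))
    lower = max(-psi1 * psi2, -1.0 / (psi1 * psi2))
    upper = min(psi1 / psi2, psi2 / psi1)
    return lower, upper
```

With equal margins $p_{1} = p_{2} = 0.5$ the range is the full interval $[-1, 1]$, whereas for $p_{1} = 0.1$ and $p_{2} = 0.5$ it shrinks to $[-1/3, 1/3]$, which shows how unbalanced margins limit the attainable correlation.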

The limitation of the present paper is that the proposed method does not handle the case of mixed responses, i.e., one continuous response and one binary response, or more generally an ordinal response categorized from a latent variable, such as ‘mild’, ‘moderate’, ‘severe’. Regression model of correlation involving latent variable(s) is worth investigation. We will report the result in a separate work.

Disclosure statement

All authors declare no conflict of interest.

Data availability statement

The data that support the findings of this study are available from the corresponding author upon request.

References

Anderson, T. (2003). An introduction to multivariate statistical analysis. 3rd ed. Wiley.
Bartlett, R. F. (1993). Linear modelling of Pearson's product moment correlation coefficient: An application of Fisher's z-transformation. Journal of the Royal Statistical Society: Series D, 42(1), 45–53.
Fryback, D. G. (2009). United States National Health Measurement Study, 2005-2006. Inter-University Consortium for Political and Social Research.
Kim, D. (2016). Correlation between physical function, cognitive function, and health-related quality of life in elderly persons. Journal of Physical Therapy Science, 28(6), 1844–1848. https://doi.org/10.1589/jpts.28.1844
Liu, Q., Li, C., Wanga, V., & Shepherd, B. E. (2018). Covariate-adjusted Spearman rank correlation with probability-scale residuals. Biometrics, 74(2), 595–605. https://doi.org/10.1111/biom.v74.2
Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245. https://doi.org/10.1016/S1573-4412(05)80005-4
Paul, S. (1989). Test for the equality of several correlation coefficients. Canadian Journal of Statistics, 17(2), 217–227. https://doi.org/10.2307/3314850
Qaqish, B. F. (2003). A family of multivariate binary distributions for simulating correlated binary variables with specified marginal means and correlations. Biometrika, 90(2), 455–463. https://doi.org/10.1093/biomet/90.2.455
Thomas, P., Jolanda, L., Antonius, J. M. d. C., Joris, P. J. S., & Rudi, G. J. W. (2016). Impact of physical and mental health on life satisfaction in old age: A population based observational study. BMC Geriatrics, 16(1), 194. https://doi.org/10.1186/s12877-016-0365-4
Tsiatis, A. A. (2006). Semiparametric theory and missing data. Springer.
Ware Jr, J. E., Kosinski, M., & Keller, S. D. (1994). SF-36 physical and mental summary scales: A user's manual. The Health Institute, New England Medical Center.
Wilding, G. E., Cai, X., Hutson, A., & Yu, Z. (2011). A linear model-based test for the heterogeneity of conditional correlations. Journal of Applied Statistics, 38(10), 2355–2366. https://doi.org/10.1080/02664763.2011.559201
Yu, M., & Dunn, O. (1982). Robust tests for the equality of two correlation coefficients: A monte carlo study. Educational and Psychological Measurement, 42(4), 987–1004. https://doi.org/10.1177/001316448204200407

Appendix. Technical details

A.1. Details of $e_{a b}$ and $f_{a b}$

The expressions of $e_{ab}$ and $f_{ab}$ under the two link functions follow from (6) with $h' = ρ(1 - ρ)$ and $h'' = ρ(1 - ρ)(1 - 2ρ)$ for link 1, and $h' = (1 + ρ)(1 - ρ)$ and $h'' = -2ρ(1 + ρ)(1 - ρ)$ for link 2. They are, respectively, $$e_{ab}(ρ) = \begin{cases} \frac{d_{ab} ρ(1 - ρ)}{c_{ab} + d_{ab} ρ}, & \text{link 1}, \\ \frac{d_{ab}(1 + ρ)(1 - ρ)}{c_{ab} + d_{ab} ρ}, & \text{link 2}, \end{cases}$$ and $$f_{ab}(ρ) = \begin{cases} \frac{\{-d_{ab}^{2} ρ(1 - ρ) + d_{ab}(c_{ab} + d_{ab} ρ)(1 - 2ρ)\} ρ(1 - ρ)}{(c_{ab} + d_{ab} ρ)^{2}}, & \text{link 1}, \\ \frac{\{-d_{ab}^{2}(1 + ρ)(1 - ρ) - 2ρ\, d_{ab}(c_{ab} + d_{ab} ρ)\}(1 + ρ)(1 - ρ)}{(c_{ab} + d_{ab} ρ)^{2}}, & \text{link 2}. \end{cases}$$

A.2. Lemmas

Lemma A.1

Assume that under the bivariate normal response model in Section 2.1, there exist constants $c_{1}$ , $c_{2}$ , $c_{3}$ , and $c_{4}$ such that $0 < c_{1} ⩽ | ρ | ⩽ c_{2} < 1$ , $0 ⩽ | {\hat{μ}}_{j} | < c_{3}$ , $| {\hat{σ}}_{j} | ⩾ c_{4} > 0$ for j = 1, 2. Then, (i) $n^{- 1} \sum_{i = 1}^{n} H_{i} (β^{*}, \hat{η}) \overset{p}{⟶} Θ$ , and (ii) $n^{- 1} \sum_{i = 1}^{n} G_{i} (β, η^{*}) \overset{p}{⟶} J$ .

Proof.

Note that we cannot apply the weak law of large numbers, as the $ℓ_{i}(β^{*}, \hat{η})$'s and the $ℓ_{i}(β, η^{*})$'s are neither independent nor identically distributed.

(i) Let $\hat{T}_{1i} = \left(\frac{y_{1i} - \hat{μ}_{1i}}{\hat{σ}_{1}}\right)^{2} + \left(\frac{y_{2i} - \hat{μ}_{2i}}{\hat{σ}_{2}}\right)^{2}$ and $\hat{T}_{2i} = \frac{(y_{1i} - \hat{μ}_{1i})(y_{2i} - \hat{μ}_{2i})}{\hat{σ}_{1} \hat{σ}_{2}}$. First, by the assumptions $0 ⩽ |\hat{μ}_{j}| < c_{3}$ and $|\hat{σ}_{j}| ⩾ c_{4} > 0$ for j = 1, 2, we have $$\begin{aligned} |\hat{T}_{1i}| & = \left| \left(\frac{y_{1i} - \hat{μ}_{1i}}{\hat{σ}_{1}}\right)^{2} + \left(\frac{y_{2i} - \hat{μ}_{2i}}{\hat{σ}_{2}}\right)^{2} \right| ⩽ \frac{y_{1i}^{2} + c_{3}^{2} + 2c_{3}|y_{1i}|}{c_{4}^{2}} + \frac{y_{2i}^{2} + c_{3}^{2} + 2c_{3}|y_{2i}|}{c_{4}^{2}}; \\ |\hat{T}_{2i}| & = \left| \frac{(y_{1i} - \hat{μ}_{1i})(y_{2i} - \hat{μ}_{2i})}{\hat{σ}_{1} \hat{σ}_{2}} \right| ⩽ \frac{(|y_{1i}| + c_{3})(|y_{2i}| + c_{3})}{c_{4}^{2}}. \end{aligned}$$ Second, by the assumption $0 < c_{1} ⩽ |ρ| ⩽ c_{2} < 1$, there exists a positive constant $M_{1}$ such that $|u(ρ)|$, $|v(ρ)|$ and $|w(ρ)|$ are all bounded by $M_{1}$. Since $H_{i}(β, \hat{η}) = \{u(ρ_{i}) \hat{T}_{1i} + v(ρ_{i}) \hat{T}_{2i} + w(ρ_{i})\} x_{i} x_{i}^{⊤}$, there exists a function $g(x, y)$ such that $\lVert H_{i}(β, \hat{η}) \rVert ⩽ g(x_{i}, y_{i})$ and $E\{g(x_{i}, y_{i})\} < \infty$, where $\lVert \cdot \rVert$ is the Euclidean norm (i.e., $\lVert A \rVert = \{\mathrm{tr}(A A^{⊤})\}^{1/2}$). Clearly, $H_{i}(β, η)$ is continuous in $(β^{⊤}, η^{⊤})^{⊤}$.

By Lemma 2.4 of Newey and McFadden (1994), we have $\sup_{β, η} \lVert n^{-1} \sum_{i=1}^{n} H_{i}(β, η) - Θ \rVert \overset{p}{⟶} 0$. Since $β^{*} \overset{p}{⟶} β$ (because $\hat{β} \overset{p}{⟶} β$ and $β^{*}$ is between $β$ and $\hat{β}$), $\hat{η} \overset{p}{⟶} η$ and Θ is continuous in $(β^{⊤}, η^{⊤})^{⊤}$, by Tsiatis (2006, Section 3.2), the result of (i) follows.

(ii) By (2) and (3), we have $\begin{aligned} \frac{\partial T_{1 i}}{\partial γ_{1}^{⊤}} & = - \frac{2 (y_{1 i} - μ_{1})}{σ_{1}^{2}} x_{i}^{⊤}, \quad \frac{\partial T_{2 i}}{\partial γ_{1}^{⊤}} = - \frac{y_{2 i} - μ_{2}}{σ_{1} σ_{2}} x_{i}^{⊤}, \\ \frac{\partial T_{1 i}}{\partial γ_{2}^{⊤}} & = - \frac{2 (y_{2 i} - μ_{2})}{σ_{2}^{2}} x_{i}^{⊤}, \quad \frac{\partial T_{2 i}}{\partial γ_{2}^{⊤}} = - \frac{y_{1 i} - μ_{1}}{σ_{1} σ_{2}} x_{i}^{⊤}, \\ \frac{\partial T_{1 i}}{\partial σ_{1}^{2}} & = - \frac{(y_{1 i} - μ_{1})^{2}}{σ_{1}^{4}}, \quad \frac{\partial T_{2 i}}{\partial σ_{1}^{2}} = - \frac{(y_{1 i} - μ_{1}) (y_{2 i} - μ_{2})}{2 σ_{1}^{3} σ_{2}}, \\ \frac{\partial T_{1 i}}{\partial σ_{2}^{2}} & = - \frac{(y_{2 i} - μ_{2})^{2}}{σ_{2}^{4}}, \quad \frac{\partial T_{2 i}}{\partial σ_{2}^{2}} = - \frac{(y_{1 i} - μ_{1}) (y_{2 i} - μ_{2})}{2 σ_{1} σ_{2}^{3}}, \end{aligned}$ and $\begin{aligned} \frac{\partial^{2} ℓ_{i} (β, η)}{\partial β \partial γ_{1}^{⊤}} & = \left\{ - a (ρ) \frac{2 (y_{1 i} - μ_{1})}{σ_{1}^{2}} - b (ρ) \frac{y_{2 i} - μ_{2}}{σ_{1} σ_{2}} \right\} x_{i} x_{i}^{⊤}, \\ \frac{\partial^{2} ℓ_{i} (β, η)}{\partial β \partial γ_{2}^{⊤}} & = \left\{ - a (ρ) \frac{2 (y_{2 i} - μ_{2})}{σ_{2}^{2}} - b (ρ) \frac{y_{1 i} - μ_{1}}{σ_{1} σ_{2}} \right\} x_{i} x_{i}^{⊤}, \\ \frac{\partial^{2} ℓ_{i} (β, η)}{\partial β \partial σ_{1}^{2}} & = \left\{ - a (ρ) \frac{(y_{1 i} - μ_{1})^{2}}{σ_{1}^{4}} - b (ρ) \frac{(y_{1 i} - μ_{1}) (y_{2 i} - μ_{2})}{2 σ_{1}^{3} σ_{2}} \right\} x_{i}, \\ \frac{\partial^{2} ℓ_{i} (β, η)}{\partial β \partial σ_{2}^{2}} & = \left\{ - a (ρ) \frac{(y_{2 i} - μ_{2})^{2}}{σ_{2}^{4}} - b (ρ) \frac{(y_{1 i} - μ_{1}) (y_{2 i} - μ_{2})}{2 σ_{1} σ_{2}^{3}} \right\} x_{i} . \end{aligned}$
By the assumptions, we can find a function $k (x, y)$ such that $‖ G_{i} (β, η) ‖ ⩽ k (x_{i}, y_{i})$ and $E \{ | k (x_{i}, y_{i}) | \} < \infty$ . Moreover, $G_{i} (β, η)$ is continuous in $(β^{⊤}, η^{⊤})^{⊤}$ . Using an argument similar to that in the proof of (i), by Lemma 2.4 of Newey and McFadden (Citation1994), $\sup_{β, η} ‖ n^{- 1} \sum_{i = 1}^{n} G_{i} (β, η) - J ‖ \overset{p}{⟶} 0$ . Since $η^{*} \overset{p}{⟶} η$ and $J$ is continuous in $(β^{⊤}, η^{⊤})^{⊤}$ , the result of (ii) follows.
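The first-order partial derivatives of $T_{1 i}$ and $T_{2 i}$ used in the proof of (ii) can be checked numerically. The following is a minimal sketch, not part of the original proof: it assumes a single observation, treats $μ_{1}$ as a scalar (the derivative with respect to $γ_{1}^{⊤}$ is then the derivative with respect to $μ_{1}$ multiplied by $x_{i}^{⊤}$, assuming $μ_{1} = x_{i}^{⊤} γ_{1}$), and uses hypothetical function names chosen for illustration.

```python
import math


def t_stats(y1, y2, mu1, mu2, var1, var2):
    """Return (T1, T2) as in Equation (3), parameterised by the variances."""
    s1, s2 = math.sqrt(var1), math.sqrt(var2)
    z1, z2 = (y1 - mu1) / s1, (y2 - mu2) / s2
    return z1 * z1 + z2 * z2, z1 * z2


def analytic_partials(y1, y2, mu1, mu2, var1, var2):
    """Closed-form partials w.r.t. mu1 and var1, as listed in the proof."""
    s1, s2 = math.sqrt(var1), math.sqrt(var2)
    r1, r2 = y1 - mu1, y2 - mu2
    return {
        "dT1_dmu1": -2.0 * r1 / var1,
        "dT2_dmu1": -r2 / (s1 * s2),
        "dT1_dvar1": -(r1 ** 2) / var1 ** 2,
        "dT2_dvar1": -r1 * r2 / (2.0 * s1 ** 3 * s2),
    }


def numeric_partials(y1, y2, mu1, mu2, var1, var2, eps=1e-6):
    """Central finite differences for the same four partials."""
    def diff(plus, minus):
        return tuple((a - b) / (2.0 * eps) for a, b in zip(plus, minus))

    d_mu1 = diff(t_stats(y1, y2, mu1 + eps, mu2, var1, var2),
                 t_stats(y1, y2, mu1 - eps, mu2, var1, var2))
    d_var1 = diff(t_stats(y1, y2, mu1, mu2, var1 + eps, var2),
                  t_stats(y1, y2, mu1, mu2, var1 - eps, var2))
    return {
        "dT1_dmu1": d_mu1[0],
        "dT2_dmu1": d_mu1[1],
        "dT1_dvar1": d_var1[0],
        "dT2_dvar1": d_var1[1],
    }
```

Comparing `analytic_partials` with `numeric_partials` at any point with positive variances should give agreement up to finite-difference error; the partials with respect to $μ_{2}$ and $σ_{2}^{2}$ follow by symmetry.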

Lemma A.2

Assume that, under the bivariate Bernoulli response model in Section 2.2, there exist constants $d_{1}, \dots, d_{6}$ such that $| h' (x^{⊤} β) | ⩽ d_{1}$ and $| h'' (x^{⊤} β) | ⩽ d_{2}$ for all $x$ , $0 < d_{3} < p_{j} < d_{4} < 1$ for $j = 1, 2$ , and $0 < d_{5} < | ρ | < d_{6} < 1$ . Then, (i) $n^{- 1} \sum_{i = 1}^{n} H_{i} (β^{*}, \hat{η}) \overset{p}{⟶} Θ$ , and (ii) $n^{- 1} \sum_{i = 1}^{n} G_{i} (β, η^{*}) \overset{p}{⟶} J$ .

Proof.

(i) By the assumptions, there exists $M_{2}$ such that $| f_{i a b} (β, η) | ⩽ M_{2}$ for all $a, b = 0, 1$ and $i = 1, \dots, n$ . Then, there exists a function $g_{1} (x, y)$ such that $‖ H_{i} (β, η) ‖ ⩽ g_{1} (x_{i}, y_{i})$ and $E \{ g_{1} (x_{i}, y_{i}) \} < \infty$ . Clearly, $H_{i} (β, η)$ is continuous in $(β^{⊤}, η^{⊤})^{⊤}$ . By Lemma 2.4 of Newey and McFadden (Citation1994), $\sup_{β, η} ‖ n^{- 1} \sum_{i = 1}^{n} H_{i} (β, η) - Θ ‖ \overset{p}{⟶} 0$ .

Since $β^{*} \overset{p}{⟶} β$ , $\hat{η} \overset{p}{⟶} η$ and Θ is continuous in $(β^{⊤}, η^{⊤})^{⊤}$ , by Tsiatis (Citation2006, Section 3.2), the result of (i) follows.

(ii) Observe that $\begin{aligned} G_{i} (β, η) & = \sum_{a, b = 0, 1} I_{i a b} \frac{\partial e_{i a b} (β, η)}{\partial η^{⊤}} x_{i} \\ & = \sum_{a, b = 0, 1} I_{i a b} \frac{\frac{\partial d_{i a b}}{\partial η^{⊤}} (c_{i a b} + d_{i a b} ρ_{i}) - (\frac{\partial c_{i a b}}{\partial η^{⊤}} + \frac{\partial d_{i a b}}{\partial η^{⊤}} ρ_{i}) d_{i a b}}{(c_{i a b} + d_{i a b} ρ_{i})^{2}} h' (x_{i}^{⊤} β) x_{i} . \end{aligned}$
By the logistic model defined in Section 2.2, $\partial p_{j i} / \partial η_{j}^{⊤} = p_{j i} (1 - p_{j i}) x_{i}^{⊤}$ . The expressions of $\partial d_{i a b} / \partial η^{⊤}$ and $\partial c_{i a b} / \partial η^{⊤}$ are obtained as follows. $\begin{aligned} \frac{\partial d_{i 11}}{\partial η_{j}^{⊤}} & = \{ p_{1 i} (1 - p_{1 i}) p_{2 i} (1 - p_{2 i}) \}^{1 / 2} (\frac{1}{2} - p_{j i}) x_{i}^{⊤}, \quad j = 1, 2; \\ \frac{\partial d_{i 10}}{\partial η_{j}^{⊤}} & = \frac{\partial d_{i 01}}{\partial η_{j}^{⊤}} = - \frac{\partial d_{i 11}}{\partial η_{j}^{⊤}}, \quad \frac{\partial d_{i 00}}{\partial η_{j}^{⊤}} = \frac{\partial d_{i 11}}{\partial η_{j}^{⊤}}, \quad j = 1, 2; \\ \frac{\partial c_{i 11}}{\partial η_{j}^{⊤}} & = p_{1 i} p_{2 i} (1 - p_{j i}) x_{i}^{⊤}, \quad j = 1, 2; \\ \frac{\partial c_{i 10}}{\partial η_{1}^{⊤}} & = \frac{\partial p_{1 i}}{\partial η_{1}^{⊤}} - p_{2 i} \frac{\partial p_{1 i}}{\partial η_{1}^{⊤}} = p_{1 i} (1 - p_{1 i}) (1 - p_{2 i}) x_{i}^{⊤}; \\ \frac{\partial c_{i 10}}{\partial η_{2}^{⊤}} & = - p_{1 i} \frac{\partial p_{2 i}}{\partial η_{2}^{⊤}} = - p_{1 i} p_{2 i} (1 - p_{2 i}) x_{i}^{⊤}; \\ \frac{\partial c_{i 01}}{\partial η_{1}^{⊤}} & = - p_{2 i} \frac{\partial p_{1 i}}{\partial η_{1}^{⊤}} = - p_{2 i} p_{1 i} (1 - p_{1 i}) x_{i}^{⊤}; \\ \frac{\partial c_{i 01}}{\partial η_{2}^{⊤}} & = \frac{\partial p_{2 i}}{\partial η_{2}^{⊤}} - p_{1 i} \frac{\partial p_{2 i}}{\partial η_{2}^{⊤}} = p_{2 i} (1 - p_{2 i}) (1 - p_{1 i}) x_{i}^{⊤}; \\ \frac{\partial c_{i 00}}{\partial η_{j}^{⊤}} & = - p_{j i} (1 - p_{1 i}) (1 - p_{2 i}) x_{i}^{⊤}, \quad j = 1, 2. \end{aligned}$
By the assumptions, we can find a function $ω (x, y)$ such that $‖ G_{i} (β, η) ‖ ⩽ ω (x_{i}, y_{i})$ and $E \{ ω (x_{i}, y_{i}) \} < \infty$ . Moreover, $G_{i} (β, η)$ is continuous in $(β^{⊤}, η^{⊤})^{⊤}$ . Using an argument similar to that in the proof of (i), by Lemma 2.4 of Newey and McFadden (Citation1994), $\sup_{β, η} ‖ n^{- 1} \sum_{i = 1}^{n} G_{i} (β, η) - J ‖ \overset{p}{⟶} 0$ . Since $η^{*} \overset{p}{⟶} η$ and $J$ is continuous in $(β^{⊤}, η^{⊤})^{⊤}$ , the result of (ii) follows.
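The derivative expressions for $c_{i a b}$ and $d_{i a b}$ can likewise be checked by finite differences. The sketch below is illustrative only. It assumes the forms $c_{i 11} = p_{1 i} p_{2 i}$, $c_{i 10} = p_{1 i} (1 - p_{2 i})$, $c_{i 01} = (1 - p_{1 i}) p_{2 i}$, $c_{i 00} = (1 - p_{1 i}) (1 - p_{2 i})$ and $d_{i 11} = d_{i 00} = - d_{i 10} = - d_{i 01} = \{ p_{1 i} (1 - p_{1 i}) p_{2 i} (1 - p_{2 i}) \}^{1 / 2}$, which are inferred from the derivatives displayed above rather than quoted from Section A.1. It works with scalar linear predictors $u_{j} = x_{i}^{⊤} η_{j}$, so the derivative with respect to $η_{j}^{⊤}$ is the derivative with respect to $u_{j}$ times $x_{i}^{⊤}$. All function names are hypothetical.

```python
import math


def expit(u):
    """Logistic function, p = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def cells(u1, u2):
    """Return {(a, b): (c_ab, d_ab)} for logistic margins p_j = expit(u_j)."""
    p1, p2 = expit(u1), expit(u2)
    d11 = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    return {
        (1, 1): (p1 * p2, d11),
        (1, 0): (p1 * (1 - p2), -d11),
        (0, 1): ((1 - p1) * p2, -d11),
        (0, 0): ((1 - p1) * (1 - p2), d11),
    }


def cell_probs(u1, u2, rho):
    """Joint probabilities c_ab + d_ab * rho for the four outcomes."""
    return {ab: c + d * rho for ab, (c, d) in cells(u1, u2).items()}


def analytic_du1(u1, u2):
    """Partials of (c_ab, d_ab) w.r.t. u1, following the proof with j = 1."""
    p1, p2 = expit(u1), expit(u2)
    d11 = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    dd11 = d11 * (0.5 - p1)
    return {
        (1, 1): (p1 * p2 * (1 - p1), dd11),
        (1, 0): (p1 * (1 - p1) * (1 - p2), -dd11),
        (0, 1): (-p2 * p1 * (1 - p1), -dd11),
        (0, 0): (-p1 * (1 - p1) * (1 - p2), dd11),
    }


def numeric_du1(u1, u2, eps=1e-6):
    """Central finite differences of (c_ab, d_ab) w.r.t. u1."""
    plus, minus = cells(u1 + eps, u2), cells(u1 - eps, u2)
    return {
        ab: tuple((x - y) / (2.0 * eps) for x, y in zip(plus[ab], minus[ab]))
        for ab in plus
    }
```

Under these assumed forms the four values returned by `cell_probs` sum to one, and the covariance of the two responses equals $d_{i 11} ρ_{i}$, so that $ρ_{i}$ is their Pearson correlation.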

A.3. Proof of Theorem 3.1

By the standard theory of regression, we have $\sqrt{n} (\hat{η} - η) \overset{d}{⟶} N (0, Σ_{η})$ .

First, since $\hat{β}$ solves $s (β, \hat{η}) = 0$ , expand $s (\hat{β}, \hat{η})$ (with respect to $β$ ) around $β$ as $0 = s (\hat{β}, \hat{η}) = s (β, \hat{η}) + \sum_{i = 1}^{n} H_{i} (β^{*}, \hat{η}) (\hat{β} - β),$ (A1) where $β^{*}$ is between $\hat{β}$ and $β$ component-wise.

Second, expand $s (β, \hat{η})$ (with respect to $η$ ) around $η$ as $s (β, \hat{η}) = \sum_{i = 1}^{n} s_{i} (β, \hat{η}) = \sum_{i = 1}^{n} s_{i} (β, η) + \sum_{i = 1}^{n} G_{i} (β, η^{*}) (\hat{η} - η),$ (A2) where $η^{*}$ is between $\hat{η}$ and $η$ component-wise.

Combining (A1) and (A2), $\sqrt{n} (\hat{β} - β)$ is expressed as $- \sqrt{n} \left\{ \frac{1}{n} \sum_{i = 1}^{n} H_{i} (β^{*}, \hat{η}) \right\}^{- 1} \left[ \frac{1}{n} \sum_{i = 1}^{n} s_{i} (β, η) + \left\{ \frac{1}{n} \sum_{i = 1}^{n} G_{i} (β, η^{*}) \right\} (\hat{η} - η) \right] .$ (A3)

By Lemma A.1 for the bivariate normal response case or Lemma A.2 for the bivariate Bernoulli response case, $\frac{1}{n} \sum_{i = 1}^{n} H_{i} (β^{*}, \hat{η}) \overset{p}{⟶} Θ \quad \text{and} \quad \frac{1}{n} \sum_{i = 1}^{n} G_{i} (β, η^{*}) \overset{p}{⟶} J .$ (A4)

By the central limit theorem, $n^{- 1 / 2} \sum_{i = 1}^{n} s_{i} (β, η) \overset{d}{⟶} N (0, I)$ . Then, by Slutsky's theorem, $- \left\{ \frac{1}{n} \sum_{i = 1}^{n} H_{i} (β^{*}, \hat{η}) \right\}^{- 1} \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} s_{i} (β, η) \overset{d}{⟶} N (0, Θ^{- 1} I Θ^{- 1}),$ (A5) and $- \left\{ \frac{1}{n} \sum_{i = 1}^{n} H_{i} (β^{*}, \hat{η}) \right\}^{- 1} \left\{ \frac{1}{n} \sum_{i = 1}^{n} G_{i} (β, η^{*}) \right\} \sqrt{n} (\hat{η} - η) \overset{d}{⟶} N (0, Θ^{- 1} J Σ_{η} J^{⊤} Θ^{- 1}) .$ (A6)

Combining (A3), (A5), and (A6), $\sqrt{n} (\hat{β} - β)$ is asymptotically normal with zero mean vector and covariance matrix given by $Σ = Θ^{- 1} I Θ^{- 1} + Θ^{- 1} J Σ_{η} J^{⊤} Θ^{- 1} + \operatorname{cov} \{ \text{LHS of (A5)}, \text{LHS of (A6)} \} .$
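For readers who want the intermediate algebra, the step from (A1) and (A2) to (A3) can be spelled out as follows. This is a sketch under the additional assumption, not stated explicitly in the proof, that $\sum_{i = 1}^{n} H_{i} (β^{*}, \hat{η})$ is invertible for large $n$.

```latex
\begin{aligned}
\hat{\beta} - \beta
  &= -\Big\{ \sum_{i=1}^{n} H_{i}(\beta^{*}, \hat{\eta}) \Big\}^{-1} s(\beta, \hat{\eta})
  && \text{(solve (A1) for } \hat{\beta} - \beta \text{)} \\
  &= -\Big\{ \frac{1}{n} \sum_{i=1}^{n} H_{i}(\beta^{*}, \hat{\eta}) \Big\}^{-1}
     \frac{1}{n} \Big[ \sum_{i=1}^{n} s_{i}(\beta, \eta)
       + \sum_{i=1}^{n} G_{i}(\beta, \eta^{*}) (\hat{\eta} - \eta) \Big]
  && \text{(substitute (A2))} \\
\sqrt{n}\,(\hat{\beta} - \beta)
  &= -\Big\{ \frac{1}{n} \sum_{i=1}^{n} H_{i}(\beta^{*}, \hat{\eta}) \Big\}^{-1}
     \frac{1}{\sqrt{n}} \sum_{i=1}^{n} s_{i}(\beta, \eta)
     \;-\; \Big\{ \frac{1}{n} \sum_{i=1}^{n} H_{i}(\beta^{*}, \hat{\eta}) \Big\}^{-1}
     \Big\{ \frac{1}{n} \sum_{i=1}^{n} G_{i}(\beta, \eta^{*}) \Big\}
     \sqrt{n}\,(\hat{\eta} - \eta).
\end{aligned}
```

The two terms in the last line are exactly the left-hand sides of (A5) and (A6), which is why the limiting covariance is the sum of their two variances plus the cross-covariance term.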
