Jonas Haslbeck

Selecting the Number of Factors in Exploratory Factor Analysis via out-of-sample Prediction Errors

Tue, 14 Feb 2023 09:00:00 +0000

Exploratory Factor Analysis (EFA) identifies a number of latent factors that explain correlations between observed variables. A key issue in the application of EFA is the selection of an adequate number of factors. This is a non-trivial problem because more factors always improve the fit of the model. Most methods for selecting the number of factors fall into two categories: either they analyze the patterns of eigenvalues of the correlation matrix, such as parallel analysis; or they frame the selection of the number of factors as a model selection problem and use approaches such as likelihood ratio tests or information criteria.

In a recent paper we proposed a new method based on model selection. We use the connection between model-implied correlation matrices and standardized regression coefficients to do model selection based on out-of-sample prediction errors, as is common in the field of machine learning. We show in a simulation study that our method slightly outperforms other standard methods on average and is relatively robust across specifications of the true model. An implementation is available in the R-package fspe, which I present here with a short code example.

We use a dataset with 24 measurements of cognitive tasks from 301 individuals from Holzinger and Swineford (1939). Harman (1967) presents both a four- and five-factor solution for this dataset. In the four-factor solution, the fifth factor corresponding to the variables 20–24 is eliminated. For this reason, we exclude variables 20–24, which gives us an example dataset in which we would theoretically expect four factors. This reduced dataset is included in the fspe-package:

library(fspe)
data(holzinger19)
dim(holzinger19)

## [1] 301  19

head(holzinger19)

##   t01_visperc t02_cubes t03_frmbord t04_lozenges t05_geninfo t06_paracomp
## 1          20        31          12            3          40            7
## 2          32        21          12           17          34            5
## 3          27        21          12           15          20            3
## 4          32        31          16           24          42            8
## 5          29        19          12            7          37            8
## 6          32        20          11           18          31            3
##   t07_sentcomp t08_wordclas t09_wordmean t10_addition t11_code t12_countdot
## 1           23           22            9           78       74          115
## 2           12           22            9           87       84          125
## 3            7           12            3           75       49           78
## 4           18           21           17           69       65          106
## 5           16           25           18           85       63          126
## 6           12           25            6          100       92          133
##   t13_sccaps t14_wordrecg t15_numbrecg t16_figrrecg t17_objnumb t18_numbfig
## 1        229          170           86           96           6           9
## 2        285          184           85          100          12          12
## 3        159          170           85           95           1           5
## 4        175          181           80           91           5           3
## 5        213          187           99          104          15          14
## 6        270          164           84          104           6           6
##   t19_figword
## 1          16
## 2          10
## 3           6
## 4          10
## 5          14
## 6          14

In addition to providing the data to the fspe() function, we specify that factor models with 1, 2, …, 10 factors should be considered (maxK = 10), that the cross-validation scheme should use 10 folds (nfold = 10) and be repeated 10 times (rep = 10), and that prediction errors (method = "PE") should be used. An alternative method (method = "CovE") computes an out-of-sample estimation error on the covariance matrix instead of a prediction error on the raw data. This method is similar to the one proposed by Browne & Cudeck (1989). Finally, we set a seed so that the analysis demonstrated here is fully reproducible.

set.seed(1)
fspe_out <- fspe(holzinger19,
                 maxK = 10,
                 nfold = 10,
                 rep = 10,
                 method = "PE", 
                 pbar = FALSE)

We can inspect the out-of-sample prediction error averaged across variables, folds, and repetitions as a function of the number of factors:

par(mar=c(4.5,4,0,1))
plot.new()
plot.window(xlim=c(1,10), ylim=c(0.6, 0.8))
axis(1, 1:10)
axis(2, las=2)
title(xlab="Number of Factors", ylab="Out-of-sample Prediction Error")
points(which.min(fspe_out$PEs), min(fspe_out$PEs), cex=3, col="red", lwd=2)
lines(fspe_out$PEs, lwd=2)
abline(h=min(fspe_out$PEs), col="grey", lty=2, lwd=2)

We see that the out-of-sample prediction error is minimized by the factor model with four factors. The number of factors with lowest prediction error can also be directly obtained from the output object:

fspe_out$nfactor

## [1] 4

The un-aggregated prediction errors of the 10 repetitions of the cross-validation scheme can be found in fspe_out$PE_array.
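To make the idea behind the method more concrete, here is a minimal base-R sketch of the connection between a model-implied correlation matrix and out-of-sample prediction errors. It is not the fspe implementation: the data are simulated, the factor models are fit with factanal() from the stats package, and a single train/test split stands in for the repeated cross-validation scheme. All object names (pe_for_k, r_imp, and so on) are hypothetical.

```r
# Sketch only: simulated data and stats::factanal(), not the fspe implementation.
set.seed(1)
n <- 400
p <- 6

# Hypothetical data: two latent factors with three indicators each
f <- matrix(rnorm(n * 2), n, 2)
L <- rbind(cbind(rep(0.8, 3), 0), cbind(0, rep(0.8, 3)))
X <- f %*% t(L) + matrix(rnorm(n * p, sd = 0.6), n, p)

# One train/test split; the test set is standardized with training means and SDs
train <- scale(X[1:300, ])
test  <- scale(X[301:400, ],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))

pe_for_k <- function(k) {
  fit   <- factanal(train, factors = k)
  lam   <- unclass(fit$loadings)                      # p x k loading matrix
  r_imp <- lam %*% t(lam) + diag(fit$uniquenesses)    # model-implied correlation matrix
  errs <- sapply(seq_len(p), function(j) {
    beta <- solve(r_imp[-j, -j], r_imp[-j, j])        # standardized regression weights
    mean((test[, j] - test[, -j] %*% beta)^2)         # squared prediction error on test data
  })
  mean(errs)                                          # average across variables
}

pe <- sapply(1:2, pe_for_k)  # prediction error for 1 and 2 factors
pe
which.min(pe)
```

Each variable is predicted from all others using regression weights derived from the implied correlation matrix, and the number of factors with the smallest average error on held-out data is selected.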

The Impact of Ordinal Scales on Gaussian Mixture Recovery

Thu, 14 Jul 2022 11:00:00 +0000

Gaussian Mixture Models (GMMs) and their special cases Latent Profile Analysis and k-Means are popular and versatile tools for exploring heterogeneity in multivariate continuous data. However, they assume that the observed data are continuous, an assumption that is often not met: for example, the severity of symptoms of diseases is often measured in ordinal categories such as not at all, several days, more than half the days, and nearly every day, and survey questions are often assessed using ordinal responses such as strongly disagree, disagree, neutral, agree, and strongly agree. In this blog post, I summarize a paper which investigates to what extent estimating GMMs is robust against observing ordinal instead of continuous variables.

Simulation Setup

To investigate this question we generate data from a number of GMMs that differ in the number of variables/dimensions $p \in \{2, \dots, 10 \}$, the number of components $K \in \{2,3,4\}$ and the pairwise Kullback-Leibler Divergence $\text{D}_{KL} \in \{2, 3.5, 5\}$ between the components. We then generate data from each GMM and threshold the continuous data into $12, 10, 8, 6, 5, 4, 3,$ or $2$ categories, using equally spaced thresholds ranging from the $0.5\%$ to the $99.5\%$ quantile. The following figure shows the result of this thresholding for a bivariate ($p=2$) GMM with two components ($K=2$):
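As a rough illustration of the thresholding step, the following base-R sketch cuts a continuous variable into a given number of ordinal categories. The exact placement of thresholds used in the paper is not spelled out here, so this sketch assumes that the range between the 0.5% and 99.5% quantiles is divided into equally wide bins; the function name to_ordinal and the simulated data are hypothetical.

```r
# Sketch under an assumption: equally wide bins between the 0.5% and 99.5% quantiles.
set.seed(1)
x <- c(rnorm(500, mean = -1.5), rnorm(500, mean = 1.5))  # hypothetical two-component data

to_ordinal <- function(x, n_cat) {
  q     <- unname(quantile(x, probs = c(0.005, 0.995)))
  edges <- seq(q[1], q[2], length.out = n_cat + 1)
  inner <- edges[-c(1, n_cat + 1)]   # n_cat - 1 interior thresholds
  findInterval(x, inner) + 1L        # categories 1, ..., n_cat
}

x5 <- to_ordinal(x, 5)
x2 <- to_ordinal(x, 2)
table(x5)
table(x2)
```

Values beyond the outer quantiles fall into the lowest or highest category, so no observation is dropped.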

We then estimate the GMM using arguably the most widely used algorithm, the Expectation-Maximization (EM) algorithm, and perform model selection with the Bayesian Information Criterion (BIC). The red Xs in the figure indicate the means of the selected model. We see that the EM algorithm and the BIC correctly recover two components and their means for the continuous data. However, when observing ordinal variables instead, we select incorrect numbers of components.

Summary of Results

In the following figure we show the accuracy of recovering the correct number of components $K$ averaged across the variations in $K$ and $\text{D}_{KL}$, as a function of the number of variables $p$ (y-axis) and the number of ordinal categories (x-axis) using $N=10000$ samples and averaged across $100$ repetitions. We see that if the number of variables or the number of ordinal categories is low, the accuracy is extremely low. However, when both are larger than $5$, the accuracy is above $0.90$.

In our paper we present performance as a function of the number of components $K$, the distance between components $\text{D}_{KL}$, the number of variables $p$, and the sample sizes $N \in \{1000, 2500, 10000\}$. In addition, we assess the estimation error of the parameters in the models for which $K$ has been correctly estimated, as a function of various characteristics of the data generating GMM. These results show that a sizable bias in parameter estimates remains across scenarios and this bias does not decrease with increasing sample size. In addition to the simulation results, we discuss possible alternative modeling approaches based on ordinal models with underlying latent Gaussian distributions and based on categorical data analysis in which ordinal variables are manifest variables. The code to reproduce all analyses and results in our paper can be found here on Github.

Computing Odds Ratios from Mixed Graphical Models

Tue, 25 Aug 2020 11:00:00 +0000

Interpreting statistical network models typically involves interpreting individual edge parameters. If the network model is a Gaussian Graphical Model (GGM), the interpretation is relatively simple: the pairwise interaction parameters are partial correlations, which indicate conditional linear relationships and vary from -1 to 1. Using the standard deviations of the two involved variables, the partial correlation can also be transformed into a linear regression coefficient (see for example here). However, when studying interactions involving categorical variables, such as in an Ising model or a Mixed Graphical Model (MGM), the parameters are not limited to a certain range and their interpretation is less intuitive. In these situations it may be helpful to report the interactions between variables in terms of odds ratios.

Odds Ratios

Odds ratios are made up of odds, which are themselves a ratio of probabilities:

\[\text{Odds} = \frac{P(X_1=1)}{P(X_1=0)} .\]

Since we chose to put $P(X_1=1)$ in the numerator, we interpret these odds as the “odds being in favor of $X_1=1$”. For example, if $X_1$ is the symptom sleep problems which takes the value 0 (no sleep problems) with probability 0.75 and the value 1 (sleep problems) with probability 0.25, then the odds of having sleep problems are 0.25/0.75, that is, 1 to 3.

However, these odds may be different in different circumstances. Let’s say these circumstances are captured by variable $X_2$ which takes values in $\{0,1\}$. In our example, those circumstances could be whether you live next to a busy street (1) or not (0). If the odds indeed depend on $X_2$ then we have

\[\text{Odds}_{X_2=1} = \frac{P(X_1=1 \mid X_2=1)}{P(X_1=0 \mid X_2=1)} \neq \frac{P(X_1=1 \mid X_2=0)}{P(X_1=0 \mid X_2=0)} = \text{Odds}_{X_2=0} .\]

A way to quantify the degree to which the odds differ depending on whether we set $X_2=1$ or $X_2=0$ is to divide the odds in those two situations

\[\text{Odds Ratio} = \frac{\text{Odds}_{X_2=1}}{\text{Odds}_{X_2=0}} ,\]

which gives rise to an odds ratio (OR).

How do we interpret this odds ratio? If the OR is equal to 1, then $X_2$ has no influence on the odds between $P(X_1=1)$ and $P(X_1=0)$; if OR > 1, $X_2=1$ increases the odds compared to $X_2=0$; and if OR < 1, $X_2=1$ decreases the odds compared to $X_2=0$. In our example from above, an OR = 4 would imply that the odds of sleep problems (vs. no sleep problems) are four times larger when living next to a busy street (vs. not living next to a busy street).
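The arithmetic behind odds and odds ratios can be sketched in a few lines of R. The probability of sleep problems away from a busy street (0.25) is taken from the example above; the probability next to a busy street (4/7) is a hypothetical value chosen so that the OR equals 4.

```r
# Odds in favor of X1 = 1, given the probability of X1 = 1
odds <- function(p) p / (1 - p)

p_quiet <- 0.25    # P(X1 = 1 | X2 = 0), from the example in the text
p_busy  <- 4 / 7   # P(X1 = 1 | X2 = 1), hypothetical value for illustration

odds_quiet <- odds(p_quiet)          # 1/3, that is, 1 to 3
odds_busy  <- odds(p_busy)           # 4/3
or <- odds_busy / odds_quiet         # odds ratio
or
```

An OR of 4 therefore corresponds here to the probability of sleep problems rising from 0.25 to about 0.57, which shows that an odds ratio is not a ratio of probabilities.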

In the remainder of this blog post, I will illustrate how to compute such odds ratios based on MGMs estimated with the R-package mgm.

Loading Example Data

We use a data set on Autism Spectrum Disorder (ASD) which contains $n=3521$ observations of seven variables, gender (1 = male, 2 = female), IQ (continuous), Integration in Society (3 categories ordinal), Number of comorbidities (count), Type of housing (1 = supervised, 2 = unsupervised), working hours (continuous), and satisfaction with treatment (continuous).

library(mgm) # data is loaded with mgm package (version 1.2-11)
head(autism_data$data)

##   Gender   IQ Integration in Society No of Comorbidities Type of Housing
## 1      1 6.00                      1                   1               1
## 2      2 6.00                      2                   1               1
## 3      1 5.00                      2                   0               1
## 4      1 6.00                      1                   0               1
## 5      1 5.00                      1                   1               1
## 6      1 4.49                      1                   1               1
##   Workinghours Satisfaction: Treatment
## 1            0                    3.00
## 2            0                    2.00
## 3            0                    4.00
## 4           10                    3.00
## 5            0                    1.00
## 6            0                    1.75

For more details on this data set have a look at this previous blog post.

Fitting MGM

We model gender, integration in society and type of housing as categorical variables (with 2, 3 and 2 categories), and all remaining variables as Gaussian variables

set.seed(1)
mod <- mgm(data = autism_data$data, 
           type =  c("c", "g", "c", "g", "c", "g", "g"), 
           level = c(2, 1, 3, 1, 2, 1, 1), 
           lambdaSel = "CV", 
           ruleReg = "AND",
           pbar=FALSE)

## Note that the sign of parameter estimates is stored separately; see ?mgm

and we visualize the dependencies of the resulting model using the qgraph package:

library(qgraph)
qgraph(mod$pairwise$wadj, 
       nodeNames = autism_data$colnames, 
       edge.color = mod$pairwise$edgecolor,
       # edge.labels = TRUE,
       legend = TRUE, 
       layout = "spring")

Example Calculation of Odds Ratio

We consider the simplest possible example of calculating the OR involving two binary variables. Note that we calculate the OR between two variables within a multivariate model. This means that the OR is conditional on all other variables in the model which implies that it can be different from the OR calculated based on those two variables alone (specifically, this will always happen when at least one additional variable is connected to both binary variables.)

Some of you might know that there is a simple relationship between the OR and the coefficients in logistic regression. Since multinomial regression with two outcomes is equivalent to logistic regression, we could use this simple rule in this specific example. Here, however, we start out with the definition of OR and show how to calculate it in the general case of $\geq 2$ categories. Along the way, we’ll also derive the simple relationship between parameters and ORs for the binary case.

In our data set we use the two binary variables type of housing and gender for illustration. Specifically, we look at how the odds of type of housing change as a function of gender. Corresponding to the column numbers of the two variables, let $X_5$ be type of housing, and $X_1$ gender.

The definition of ORs above shows that we need to calculate four conditional probabilities. We first calculate $P(X_5=1 \mid X_1=0)$ and $P(X_5=0 \mid X_1=0)$, which together make up $\text{Odds}_{X_1=0}$. To compute these probabilities, we need the estimated parameters of the multinomial regression on $X_5$. In the standard parameterization of multinomial regression one of the response categories serves as the reference category (see here). The regularization used within the mgm package allows a more direct parameterization in which the probability of each response category can be modeled directly (for details see Chapter 4 in this paper on the glmnet package). We therefore get a set of parameters for each response category.

We can find those parameters in mod$nodemodels[[5]] which contains the parameters of the multinomial regression on variable $X_5$:

coefs <- mod$nodemodels[[5]]$model
coefs

## $`1`
## 8 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  0.62372320
## V1.2        -0.38147442
## V2.         -0.42186120
## V3.2         0.13918915
## V3.3        -0.04001268
## V4.          0.01414994
## V6.         -0.22024227
## V7.          0.04346463
## 
## $`2`
## 8 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -0.62372320
## V1.2         0.38147442
## V2.          0.42186120
## V3.2        -0.13918915
## V3.3         0.04001268
## V4.         -0.01414994
## V6.          0.22024227
## V7.         -0.04346463

The first set of parameters in coefs[[1]] models $P(X_5 = 0 \mid \dots)$ and the second set of parameters in coefs[[2]] models $P(X_5 = 1 \mid \dots)$.

The data set contains seven variables, which means that we have six variables that predict variable 5. However, since variable 3 (Integration in Society) is a categorical variable with three categories, it is represented by two dummy variables that code for its 2nd and 3rd category. This is why we have a total of 5*1+2=7 predictor variables and 7 associated parameters.
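The dummy coding described above can be illustrated with model.matrix() from base R. This is only a sketch of the coding scheme with made-up values, not the design matrix that mgm builds internally; the variable names integration and iq are hypothetical.

```r
# Hypothetical toy data: one 3-category variable and one continuous variable
integration <- factor(c(1, 2, 3, 2), levels = 1:3)
iq          <- c(6, 5, 4.5, 6)

# The 3-category factor is expanded into two dummy columns (categories 2 and 3);
# category 1 is absorbed into the intercept
mm <- model.matrix(~ integration + iq)
colnames(mm)
mm
```

A categorical predictor with $m$ categories thus contributes $m-1$ columns, which is why the six predictors above yield seven parameters.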

Now, back to computing those probabilities. We would like to compute $P(X_5=1 \mid X_1=0)$ and $P(X_5=0 \mid X_1=0)$. However, we see that the probability of $X_5$ depends not only on $X_1$ but also on all other variables (none of the parameters are zero). We therefore need to fix all variables to some value in order to obtain a conditional probability. That is, we actually have to write the conditional probabilities as $P(X_5=1 \mid X_1, X_2, X_3, X_4, X_6, X_7)$ and $P(X_5=0 \mid X_1, X_2, X_3, X_4, X_6, X_7)$. Here we will fix all other variables to 0, but we will see later that it does not matter for our OR calculation to which value we set all these variables, as long as we choose the same values in all of the four probabilities.

The probability of $P(X_5=0 \mid \dots)$ is calculated by dividing the potential for this category by the sum of all (here two) potentials:

\[P(X_5=0 \mid \dots) = \frac{ \text{Potential}(X_5=0 \mid \dots) }{ \text{Potential}(X_5=0 \mid \dots) + \text{Potential}(X_5=1 \mid \dots) } .\]

If $X_5$ had $m$ categories, there would be $m$ terms in the denominator.

The potentials are specified by the estimated parameters

\[\text{Potential}(X_5=0 \mid X_1=0, X_2=0, \dots, X_7=0) = \exp \{ \beta_{0} + \beta_{0,1.2} \mathbb{I}(X_1=1) + \dots + \beta_{0,7} X_7 \} ,\]

where $\beta_{0}$ is the intercept and the remaining seven parameters are the ones associated with the predictor terms in the model, and $\mathbb{I}(X_1=1)$ is the indicator function (or dummy variable) for category $X_1=1$.

Notice that we set all variables to zero, which means that the above potential simplifies to

\[\text{Potential}(X_5=0 \mid \dots) = \exp \{\beta_{0} \} \approx \exp \{ 0.624 \} ,\]

where I took the intercept parameter $\beta_{0}$ from coefs[[1]][1,1]. Similarly, we have

\[\text{Potential}(X_5=1 \mid \dots) = \exp \{\beta_{1} \} \approx \exp \{ -0.624 \} ,\]

where the intercept $\beta_{1}$ is taken from coefs[[2]][1,1].

Using the two potentials, we can compute the probabilities. We now do this in R:

Potential0 <- exp(coefs[[1]][1,1])
Potential1 <- exp(coefs[[2]][1,1])

Prob0 <- Potential0 / (Potential0 + Potential1)
Prob1 <- Potential1 / (Potential0 + Potential1)
Prob0

## [1] 0.7768575

Prob1

## [1] 0.2231425

We calculated that the probability $P(X_5=0 \mid \dots)$ (supervised housing) is $\approx 0.78$ and the probability $P(X_5=1 \mid \dots)$ (unsupervised housing) is $\approx 0.22$.
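The same calculation carries over to a response with $m$ categories: exponentiate the linear predictor of each category and divide by the sum. The helper below is a hypothetical convenience function, not part of mgm; the two intercepts are copied from the coefficient output printed above.

```r
# Turn one linear predictor per response category into probabilities.
# Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
probs_from_linpred <- function(eta) {
  pot <- exp(eta - max(eta))  # potentials, up to a common factor
  pot / sum(pot)
}

# Intercepts for X5 = 0 and X5 = 1, copied from the printed coefficients
eta <- c(0.62372320, -0.62372320)
probs <- probs_from_linpred(eta)
probs
```

With all predictors fixed to zero this reproduces the two probabilities computed above, and passing a vector of length $m$ gives the $m$-category case.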

Now we can compute the odds $\text{Odds}_{X_1=0}$:

odds_x1_0 <- Prob1 / Prob0

odds_x1_0

## [1] 0.2872373

The odds are smaller than one, which means that it is more likely that an individual lives in supervised housing.

Note that when computing the odds, the denominator in the calculations for the probabilities cancels out, which means we could have immediately computed the odds with the potentials:

Potential1 / Potential0

## [1] 0.2872373

So far, we computed $\text{Odds}_{X_1=0}$, which is the denominator of the formula of the Odds Ratio. We now compute the numerator, $\text{Odds}_{X_1=1}$, which includes the same conditional probabilities as above, except that we set $X_1=1$ instead of $X_1=0$. As discussed above, all other variables are kept constant at 0. To keep things short, I only show the R code for this second case:

Potential0 <- exp(coefs[[1]][1,1] + coefs[[1]][2,1] * 1)
Potential1 <- exp(coefs[[2]][1,1] + coefs[[2]][2,1] * 1)

odds_x1_1 <- Potential1 / Potential0

Similar to above, coefs[[1]][1,1] contains the intercept and coefs[[1]][2,1] contains the parameter associated with predictor $X_1$ for probability $P(X_5 = 0 \mid \dots)$. coefs[[2]] contains the corresponding parameters for probability $P(X_5 = 1 \mid \dots)$.

We can now compute the OR:

OR <- odds_x1_1 / odds_x1_0
OR

## [1] 2.144591

We see that for females (coded 2) the odds of living in unsupervised housing (coded 2) are about twice as high as for males.

In a similar way, ORs can be calculated when the predicted variable or the predictor variable of interest has more than two categories. One can also compute ORs that combine the effect of several variables, for example by setting several variables to 1 in the numerator, and setting them all to 0 in the denominator.
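For a response with more than two categories, the same recipe applies to any pair of response categories. The sketch below uses made-up coefficients for a three-category response and one binary predictor; the values and the names beta0, beta_x and or_cat are hypothetical and only illustrate the calculation.

```r
# Hypothetical coefficients, one per response category (a, b, c)
beta0  <- c(a = 0.2,  b = -0.1, c = -0.1)  # intercepts
beta_x <- c(a = -0.4, b = 0.1,  c = 0.3)   # effect of a binary predictor x

# Potentials of the three categories for a given value of x
potentials <- function(x) exp(beta0 + beta_x * x)

# OR of category `num` versus category `den` when x goes from 0 to 1
or_cat <- function(num, den) {
  p1 <- potentials(1)
  p0 <- potentials(0)
  (p1[[num]] / p1[[den]]) / (p0[[num]] / p0[[den]])
}

or_ca <- or_cat("c", "a")
or_ca
```

As in the binary case, the intercepts cancel and the OR reduces to the exponentiated difference of the two relevant coefficients, here exp(0.3 - (-0.4)).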

Does it matter to which value we fix the other variables?

In the above calculation we fixed all other variables to zero. Would the OR have been different if we had fixed them to a different value? We will show with some basic calculations that the answer is no. Along the way, we will also derive a much simpler way to compute the OR from the parameter estimates for the special case in which the response variable is binary.

To keep the notation manageable, we only consider one third variable $X_3$ instead of the five in the empirical example above. However, we would reach the same conclusion with any number of additional variables that are kept constant.

We start out with the definition of the odds ratio:

\[\text{Odds Ratio} = \frac{\text{Odds}_{X_2=1}}{\text{Odds}_{X_2=0}} = \frac{ \frac{P(X_1=1 \mid X_2=1,X_3=x_3)}{P(X_1=0 \mid X_2=1,X_3=x_3)} }{ \frac{P(X_1=1 \mid X_2=0,X_3=x_3)}{P(X_1=0 \mid X_2=0,X_3=x_3)} } .\]

The question is whether it matters what we fill in for $x_3$. We will show that the terms associated with $x_3$ cancel out and it therefore does not matter to which value we fix $x_3$.

\[\frac{ \frac{P(X_1=1 \mid X_2=1,X_3=x_3)}{P(X_1=0 \mid X_2=1,X_3=x_3)} }{ \frac{P(X_1=1 \mid X_2=0,X_3=x_3)}{P(X_1=0 \mid X_2=0,X_3=x_3)} } = \frac{ \frac{\exp\{\beta_1 + \beta_{21}1 + \beta_{31}x_3\} }{\exp\{\beta_0 + \beta_{20}1 + \beta_{30}x_3\}} }{ \frac{\exp\{\beta_1 + \beta_{21}0 + \beta_{31}x_3\} }{ \exp\{\beta_0 + \beta_{20}0 + \beta_{30}x_3\} } } = \frac{ \frac{\exp\{\beta_1 + \beta_{21} + \beta_{31}x_3\} }{\exp\{\beta_0 + \beta_{20} + \beta_{30}x_3\}} }{ \frac{\exp\{\beta_1 + \beta_{31}x_3\} }{ \exp\{\beta_0 + \beta_{30}x_3\} } }\]

In the first step we fixed $X_2=1$ in the numerator and $X_2=0$ in the denominator and simplified. The parameter $\beta_{21}$ refers to the coefficient associated with $X_2$ in the equation modeling $P(X_1=1 \mid \dots)$, while $\beta_{20}$ refers to the coefficient associated with $X_2$ in the equation modeling $P(X_1=0 \mid \dots)$.

We further rearrange

\[\frac{ \frac{\exp\{\beta_1 + \beta_{21} + \beta_{31}x_3\} }{\exp\{\beta_0 + \beta_{20} + \beta_{30}x_3\}} }{ \frac{\exp\{\beta_1 + \beta_{31}x_3\} }{ \exp\{\beta_0 + \beta_{30}x_3\} } } = \frac{\exp\{\beta_1 + \beta_{21} + \beta_{31}x_3\} }{\exp\{\beta_0 + \beta_{20} + \beta_{30}x_3\}} \frac{\exp\{\beta_0 + \beta_{30}x_3\} }{ \exp\{\beta_1 + \beta_{31}x_3\} } ,\]

which is equal to

\[\exp\{\beta_1 + \beta_{21} + \beta_{31}x_3 + \beta_0 + \beta_{30}x_3 - (\beta_0 + \beta_{20} + \beta_{30}x_3 + \beta_1 + \beta_{31}x_3)\} .\]

We collect all the terms with $x_3$

\[\exp\{ (\beta_1 + \beta_{21} + \beta_0 - \beta_0 - \beta_{20} - \beta_1) + (\beta_{31}x_3 + \beta_{30}x_3 - \beta_{30}x_3 - \beta_{31}x_3) \}\]

and we see that the terms including $x_3$ add to zero, which shows that no matter what number we fill in for $x_3$, we will always obtain the same OR.

Actually, we can further simplify and get

\[\exp\{ (\beta_1 + \beta_{21} + \beta_0 - \beta_0 - \beta_{20} - \beta_1) + 0 \} = \exp\{ \beta_{21} - \beta_{20} \} ,\]

which reveals a simpler way to compute the OR. We can verify this with our estimated coefficients

exp(coefs[[2]][2,1] - coefs[[1]][2,1])

## [1] 2.144591

and indeed we get the same OR.
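The cancellation can also be checked numerically. The sketch below copies the intercepts and the coefficients for gender (V1.2) and IQ (V2.) from the coefficient output printed earlier, so it runs without the fitted model; the function names odds_x5 and or_at are hypothetical.

```r
# Coefficients copied from the printed output; first entry: X5 = 0, second entry: X5 = 1
b0   <- c(0.62372320, -0.62372320)   # intercepts
b_x1 <- c(-0.38147442, 0.38147442)   # gender (V1.2)
b_x2 <- c(-0.42186120, 0.42186120)   # IQ (V2.)

# Odds of X5 = 1 versus X5 = 0 for given values of gender and IQ
odds_x5 <- function(x1, x2) {
  pot <- exp(b0 + b_x1 * x1 + b_x2 * x2)
  pot[2] / pot[1]
}

# OR for gender, with IQ fixed at different values
or_at <- function(x2) odds_x5(1, x2) / odds_x5(0, x2)
ors <- sapply(c(-2, 0, 1, 5), or_at)
ors
```

All four values coincide with the OR computed above, whatever value IQ is fixed to.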

If we wanted the OR with the numerator and denominator swapped, we would calculate:

exp(coefs[[1]][2,1] - coefs[[2]][2,1])

## [1] 0.4662894

One could verify this by repeating the derivation above with swapped numerator/denominator or by using the general approach of calculating ORs that we used above.

This reflects the well-known relation between logistic regression parameters and the OR

\[\exp\{\beta_x\}=\text{OR}_x\]

since the relation between logistic regression parameterization and the symmetric multinomial regression parameterization used here is $\beta_x = 2 \beta_{x1}$, where $\beta_{x1}$ is the parameter corresponding to $\beta_x$ in the equation modeling $P(X_1=1 \mid \dots)$.
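To spell out the step behind this relation, here is a short worked derivation. It assumes, as in the coefficient output above, that the two sets of parameters mirror each other, that is, $\beta_{x0} = -\beta_{x1}$:

\[\text{OR}_x = \exp\{\beta_{x1} - \beta_{x0}\} = \exp\{\beta_{x1} - (-\beta_{x1})\} = \exp\{2\beta_{x1}\} = \exp\{\beta_x\} .\]

With the estimate $\beta_{x1} \approx 0.381$ from coefs[[2]][2,1], this gives $\exp\{0.763\} \approx 2.14$, which matches the OR computed above.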

Is the OR “significant”?

The model has been estimated with $\ell_1$-regularized regression, in which the regularization parameters have been selected with 10-fold cross-validation with the goal that the parameter estimates generalize to new samples. Thus, variable selection has already been performed and it is not necessary to perform an additional hypothesis test on the OR or the underlying variables.

An Alternative: Changes in Predicted Probabilities

An alternative to ORs is to report change in predicted probabilities of $X_5$ depending on which value we fill in for $X_1$. When considering only two variables such a change in probabilities is perhaps easier to interpret than ORs. However, we will see that changes in predicted probabilities do not have the nice property of ORs that it doesn’t matter to which value we fix all other variables. This makes this alternative less attractive for models that include more than two variables.

When looking at changes in predicted probabilities we are interested in the difference

\[P(X_5=1 \mid X_1=1, \dots) - P(X_5=1 \mid X_1=0, \dots) .\]

When calculating these probabilities we are again required to fix all other variables (“…”) to some value, which we again choose to be 0. In the interest of brevity I only show the R-code for this calculation.

We compute the probabilities for the case $X_1=0$

Potential0 <- exp(coefs[[1]][1,1])
Potential1 <- exp(coefs[[2]][1,1])
Prob1_x10 <- Potential1 / (Potential0 + Potential1)

and for the case $X_1=1$:

Potential0 <- exp(coefs[[1]][1,1] + coefs[[1]][2,1] * 1)
Potential1 <- exp(coefs[[2]][1,1] + coefs[[2]][2,1] * 1)
Prob1_x11 <- Potential1 / (Potential0 + Potential1)

We see a change in probability of

Prob1_x11 - Prob1_x10

## [1] 0.1580482

That is, the probability of living in unsupervised housing is $\approx 0.16$ higher for females, which is consistent with the OR > 1 calculated above.

However, now let’s set $X_2=1$ (IQ) instead of $X_2=0$; the corresponding parameter is in the third row of the coefficient matrices:

Potential0 <- exp(coefs[[1]][1,1] + coefs[[1]][3,1]*1)
Potential1 <- exp(coefs[[2]][1,1] + coefs[[2]][3,1]*1)
Prob1_x10 <- Potential1 / (Potential0 + Potential1)

Potential0 <- exp(coefs[[1]][1,1] + coefs[[1]][2,1] * 1 + coefs[[1]][3,1]*1)
Potential1 <- exp(coefs[[2]][1,1] + coefs[[2]][2,1] * 1 + coefs[[2]][3,1]*1)
Prob1_x11 <- Potential1 / (Potential0 + Potential1)

We now get an increase of probability of

Prob1_x11 - Prob1_x10

## [1] 0.1884348

We see that changes in the predicted probabilities of $X_5$ as a function of $X_1$ depend on where we fixed the other variables. So while changes in probabilities are maybe easier to interpret, they have the downside that they depend on the values at which we fix the other variables. However, this may be acceptable in situations in which we are interested in some specific state of all other variables.
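To see this dependence across a range of values, the following sketch computes the change in predicted probability for several values of the predictor in the third row of the coefficient matrices (V2., IQ). The coefficients are copied from the output printed earlier so the snippet is self-contained; the function names prob_x5_1 and diff_at are hypothetical.

```r
# Coefficients copied from the printed output; first entry: X5 = 0, second entry: X5 = 1
b0   <- c(0.62372320, -0.62372320)   # intercepts
b_x1 <- c(-0.38147442, 0.38147442)   # gender (V1.2)
b_x2 <- c(-0.42186120, 0.42186120)   # IQ (V2.)

# P(X5 = 1 | gender, IQ), all remaining variables fixed to 0
prob_x5_1 <- function(x1, x2) {
  pot <- exp(b0 + b_x1 * x1 + b_x2 * x2)
  pot[2] / sum(pot)
}

# Change in probability when gender goes from 0 to 1, at a fixed IQ value
diff_at <- function(x2) prob_x5_1(1, x2) - prob_x5_1(0, x2)
diffs <- sapply(c(0, 1, 2, 4), diff_at)
diffs
```

The first two entries reproduce the two differences reported above, and the remaining entries show that the difference keeps changing with the value of IQ, unlike the OR.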

Summary

Starting out with the definition of odds ratios, I showed how to compute them in a general setting and how to compute them from the output of the mgm-package. We looked into whether the OR depends on the specific values at which we fix the other variables (it doesn’t). Proving this fact revealed a simpler formula for calculating ORs for the special case of having a binary response variable. Finally, we considered predicted probabilities as an alternative to ORs.

I would like to thank Oisín Ryan for his feedback on this blog post.

Estimating Group Differences in Network Models using Moderation

Wed, 24 Jun 2020 11:00:00 +0000

Researchers are often interested in comparing statistical network models across groups. For example, Fritz and colleagues compared the relations between resilience factors in a network model for adolescents who did experience childhood adversity to those who did not. Several methods are already available to perform such comparisons. The Network Comparison Test (NCT) performs a permutation test to decide for each parameter whether it differs across two groups. The Fused Graphical Lasso (FGL) uses a lasso penalty to estimate group differences in Gaussian Graphical Models (GGMs). And the BGGM package allows one to test and estimate differences in GGMs in a Bayesian setting. In a recent preprint, I proposed an additional method based on moderation analysis which has the advantage that it can be applied to essentially any network model and at the same time allows for comparisons across more than two groups.

In this blog post I illustrate how to estimate group differences in network models via moderation analysis using the R-package mgm. I show how to estimate a moderated Mixed Graphical Model (MGM) in which the grouping variable serves as a moderator; how to analyze the moderated MGM by conditioning on the moderator; how to visualize the conditional MGMs; and how to assess the stability of group differences.

The Data

The data are automatically loaded with the R-package mgm and can be accessed in the object dataGD. The data set contains 3000 observations of seven variables $X_1, \dots, X_7$, where $X_7 \in \{1, 2, 3\}$ indicates group membership. We have 1000 observations from each group. Note that these variables are of mixed type, with $X_1$, $X_2$, $X_4$, and $X_6$ being continuous and the other variables being categorical.

library(mgm) # version 1.2-10

dim(dataGD)

## [1] 3000    7

head(dataGD)

##          x1     x2 x3     x4 x5     x6 x7
## [1,] -0.136 -0.009  0  0.399  0 -0.137  1
## [2,] -0.041  0.803  1  2.475  2  0.181  1
## [3,]  1.011  0.254  1 -0.194  0  1.329  1
## [4,] -0.158 -1.022  1 -1.587  2 -1.377  1
## [5,] -2.157  1.291  1  0.990  0 -0.018  1
## [6,]  0.499 -0.757  1 -0.941  2  1.099  1

In the Appendix of my preprint, I describe how these data were generated.

Fitting a Moderated MGM

Recall that a standard MGM describes the pairwise relationships between variables of mixed types. In order to detect group differences in the pairwise relationships between variables $X_1, X_2, \dots, X_6$ we fit a moderated MGM with the grouping variable $X_7$ being specified as a categorical moderator:

mgm_obj <- mgm(data = dataGD, 
               type = c("g", "g", "c", "g", "c", "g", "c"), 
               level = c(1, 1, 2, 1, 3, 1, 3), 
               moderators = 7, 
               lambdaSel = "EBIC", 
               lambdaGam = 0.25, 
               ruleReg = "AND", 
               pbar = FALSE)

## Note that the sign of parameter estimates is stored separately; see ?mgm

The argument type indicates the type of variable (“g” for continuous-Gaussian, and “c” for categorical) and level indicates the number of categories of each variable, which is set to 1 by default for continuous variables. The moderators argument specifies that the variable in the $7^{th}$ column is included as a moderator. Since we specified via the type argument that this variable is categorical, it will be treated as a categorical moderator. The remaining arguments specify that the regularization parameters in the $\ell_1$-regularized nodewise regression algorithm used by mgm are selected with the EBIC with a hyperparameter of $\gamma=0.25$ and that estimates are combined across nodewise regressions using the AND-rule.

Conditioning on the Moderator

In order to inspect the pairwise MGMs of the three groups, we need to condition the moderated MGM on the values of the moderator variable, which represent the three groups. This can be done with the function condition(), which takes the moderated MGM object and a list specifying the variables and the values on which the model should be conditioned. Here we only have a single moderator variable ($X_7$) and we condition on each of its values $\{1, 2, 3\}$, which represent the three groups, and save the three conditional pairwise MGMs in the list object l_mgm_cond:

l_mgm_cond <- list()
for(g in 1:3) l_mgm_cond[[g]] <- condition(object = mgm_obj, 
                                           values = list("7" = g))

Visualizing conditioned MGMs

We can now inspect the pairwise MGM in each group similar to when fitting a standard pairwise MGM (for details see the mgm paper or the other posts in my blog). Here we choose to visualize the strength of dependencies in the three MGMs in a network using the qgraph package. We provide the three conditional mgm-objects as an input and set the maximum argument in qgraph() for each visualization to the maximum parameter across all groups to ensure that the visualizations are comparable.

library(qgraph)

v_max <- rep(NA, 3)
for(g in 1:3) v_max[g] <- max(l_mgm_cond[[g]]$pairwise$wadj)

par(mfrow=c(1, 3))
for(g in 1:3) {
  qgraph(input = l_mgm_cond[[g]]$pairwise$wadj, 
         edge.color = l_mgm_cond[[g]]$pairwise$edgecolor_cb,
         lty = l_mgm_cond[[g]]$pairwise$edge_lty,
         layout = "circle", mar = c(2, 3, 5, 3),
         maximum = max(v_max), vsize = 16, esize = 23, 
         edge.labels  = TRUE, edge.label.cex = 3)
  mtext(text = paste0("Group ", g), line = 2.5)
}

The edges represent conditional dependence relationships and their width is proportional to their strength. The blue (red) edges indicate positive (negative) linear relationships. The grey edges indicate relationships involving categorical variables, for which no sign is defined (for details see ?mgm or the mgm paper). We see that there are conditional dependencies of equal strength between variables $X_1 - X_3$, $X_3 - X_4$ and $X_4 - X_6$ in all three groups. However, the linear dependency between $X_1 - X_2$ differs across groups: it is negative in Group 1, positive in Group 2 and almost absent in Group 3. In addition, there is no dependency between $X_3 - X_5$ in Group 1, but there is a dependency in Groups 2 and 3. Note that the comparable strength in dependencies between those variables in Groups 2 and 3 does not imply that the nature of these dependencies is the same. As with pairwise MGMs, it is possible to inspect the (non-aggregated) parameter estimates of these interactions with the function showInteraction().

Assessing the Stability of Estimates

Similar to pairwise MGMs, we can use the resample() function to assess the stability of all estimated parameters with bootstrapping. Here we only choose 50 bootstrap samples to keep the running time manageable for this tutorial. In practice, the number of bootstrap samples should be in the order of thousands.

res_obj <- resample(object = mgm_obj, 
                    data = dataGD, 
                    nB = 50, 
                    pbar = FALSE)

Finally, we can visualize the summaries of the bootstrapped sampling distributions using the function plotRes(). The location of the circles indicates the mean of the sampling distribution, the horizontal lines the 95% quantiles, and the number in the circle the proportion of estimates that were nonzero.

plotRes(res_obj)

We see that for these simulated data the moderation effects (group differences) are extremely stable. However, note that in observational data moderation effects are typically much smaller and less stable.
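To make the three summaries concrete, here is a minimal base-R sketch for a single parameter. The vector boot_est is hypothetical (it is not extracted from res_obj), and using the 5% and 95% quantiles is an assumption about which quantiles one wants to display.

```r
set.seed(1)

# Hypothetical bootstrap estimates of one parameter across 50 resamples;
# 10 of them are set to exactly zero, as l1-regularization can do
boot_est <- rnorm(50, mean = 0.2, sd = 0.05)
boot_est[sample(50, 10)] <- 0

# Mean, lower/upper quantiles, and proportion of nonzero estimates
summ <- c(mean = mean(boot_est),
          quantile(boot_est, probs = c(0.05, 0.95)),
          prop_nonzero = mean(boot_est != 0))
round(summ, 3)
```

A parameter with a proportion of nonzero estimates close to 1 and quantiles that exclude zero would be considered stable in this sense.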

Summary

In this blog post, I have shown how to use the mgm-package to estimate a moderated MGM in which the grouping variable serves as a moderator; how to analyze the moderated MGM by conditioning on the moderator; how to visualize the conditional MGMs; and how to assess the stability of group differences. For more details on this method and for a simulation study comparing its performance to the performance of existing methods, have a look at this preprint.

I would like to thank Fabian Dablander for his feedback on this blog post.

Estimating Time-varying Vector Autoregressive (VAR) Models

Tue, 02 Jun 2020 11:00:00 +0000

Models for individual subjects are becoming increasingly popular in psychological research. One reason is that it is difficult to make inferences from between-person data to within-person processes. Another is that time series obtained from individuals are becoming increasingly available due to the ubiquity of mobile devices. The central goal of so-called idiographic modeling is to tap into the within-person dynamics underlying psychological phenomena. With this goal in mind many researchers have set out to analyze the multivariate dependencies in within-person time series. The most simple and most popular model for such dependencies is the first-order Vector Autoregressive (VAR) model, in which each variable at the current time point is predicted by (a linear function of) all variables (including itself) at the previous time point.
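Written out, and using notation that is introduced here only for illustration, a first-order VAR model for $p$ variables takes the form

$$
X_t = \beta_0 + B \, X_{t-1} + \varepsilon_t ,
$$

where $X_t$ is the vector of the $p$ variables at time point $t$, $\beta_0$ is a vector of $p$ intercepts, $B$ is a $p \times p$ matrix of lagged effects (autoregressive effects on the diagonal, cross-lagged effects off the diagonal), and $\varepsilon_t$ is a noise term, typically assumed to be Gaussian. The model therefore has $p + p^2$ intercept and lagged-effect parameters.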

A key assumption of the standard VAR model is that its parameters do not change over time. However, often one is interested in exactly such changes over time. For example, one could be interested in relating changes in parameters to other variables, such as changes in a person’s environment. This could be a new job, the seasons, or the impact of a global pandemic. In less exploratory designs, one could examine what impact certain interventions (e.g., medication or therapy) have on the interactions between symptoms.

In this blog post I give a very brief overview of how to estimate a time-varying VAR model with the kernel smoothing approach, which we discussed in this recent tutorial paper. This method is based on the assumption that parameters can change smoothly over time, which means that parameters cannot “jump” from one value to another. I then focus on how to estimate and analyze this type of time-varying VAR models with the R-package mgm.

Estimating time-varying Models via Kernel Smoothing

The core idea of the kernel smoothing approach is the following: We choose equally spaced time points across the duration of the whole time series and then estimate “local” models at each of those time points. All local models taken together then constitute the time-varying model. With “local” models we mean that these models are largely based on time points that are close to the time point at hand. This is achieved by weighting observations accordingly during parameter estimation. This idea is illustrated for a toy data set in the following Figure:

Here we only illustrate estimating the local model at $t=3$. We see the 10 time points of this time series on the left panel. The column $w_{t_e=3}$ in red indicates a possible set of weights we could use to estimate the local model at $t=3$: the data at time points close to $t=3$ get the highest weight, and time points further away get an increasingly small weight. The function that defines these weights is shown on the right panel. The blue column in the left panel, and the corresponding blue function on the right, indicate another possible weighting. Using this weighting, we combine fewer observations close in time. This allows us to detect more “time-varyingness” in the parameters, because we smooth over fewer time points. On the other hand, however, we use less data, which makes our estimates less reliable. It is therefore important to choose a weighting function that strikes a good balance between sensitivity to “time-varyingness” and stable estimates. In the method presented here we use a Gaussian weighting function (also called a kernel) which is defined by its standard deviation (or bandwidth). We will return to how to select a good bandwidth parameter below.
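The weighting idea can be sketched in a few lines of base R. The helper kernel_weights() and the two bandwidth values are hypothetical choices for this toy example, and rescaling the weights so that the largest one equals 1 is an assumption made for readability, not a description of what mgm does internally.

```r
timepoints <- 1:10   # the 10 time points of the toy series
t_e <- 3             # estimation point of the local model

# Gaussian kernel weights, rescaled so that the largest weight is 1
kernel_weights <- function(timepoints, t_e, bandwidth) {
  w <- dnorm(timepoints, mean = t_e, sd = bandwidth)
  w / max(w)
}

w_wide   <- kernel_weights(timepoints, t_e, bandwidth = 2)
w_narrow <- kernel_weights(timepoints, t_e, bandwidth = 0.75)

# Weights per time point for both bandwidths
round(rbind(w_wide, w_narrow), 2)

# Sum of weights: a rough measure of how much data each local model uses
c(wide = sum(w_wide), narrow = sum(w_narrow))
```

The narrow kernel concentrates almost all weight on time points 2 to 4, which makes the local model more sensitive to change but leaves it with a smaller effective sample.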

In this blog post I focus on how to estimate time-varying models with the R-package mgm. For a more detailed explanation of the method see our recent tutorial paper.

Loading & Inspecting the Data

To illustrate estimating time-varying VAR models, I use an ESM time series of 12 mood related variables that are measured up to 10 times a day for 238 consecutive days (for details about this dataset see Kossakowski et al. (2017)). The questions are “I feel relaxed”, “I feel down”, “I feel irritated”, “I feel satisfied”, “I feel lonely”, “I feel anxious”, “I feel enthusiastic”, “I feel suspicious”, “I feel cheerful”, “I feel guilty”, “I feel indecisive”, and “I feel strong”. Each question is answered on a 7-point Likert scale ranging from “not” to “very”.

The data set is loaded with the mgm-package. We first subset the 12 mood variables:

library(mgm)
mood_data <- as.matrix(symptom_data$data[, 1:12])
mood_labels <- symptom_data$colnames[1:12]
colnames(mood_data) <- mood_labels
time_data <- symptom_data$data_time

We see that the data set has 1476 observations:

dim(mood_data)

## [1] 1476   12

head(mood_data)

##      Relaxed Down Irritated Satisfied Lonely Anxious Enthusiastic Suspicious
## [1,]       5   -1         1         5     -1      -1            4          1
## [2,]       4    0         3         3      0       0            3          1
## [3,]       4    0         2         3      0       0            4          1
## [4,]       4    0         1         4      0       0            4          1
## [5,]       4    0         2         4      0       0            4          1
## [6,]       5    0         1         4      0       0            3          1
##      Cheerful Guilty Doubt Strong
## [1,]        5     -1     1      5
## [2,]        4      0     1      4
## [3,]        4      0     2      4
## [4,]        4      1     1      4
## [5,]        4      1     2      3
## [6,]        3      1     2      3

time_data contains temporal information about each measurement. We will make use of the day on which the measurement occurred (dayno), the measurement prompt (beepno) and the overall time stamp (time_norm).

head(time_data)

##       date dayno beepno beeptime resptime_s resptime_e   time_norm
## 1 13/08/12   226      1    08:58   08:58:56   09:00:15 0.000000000
## 2 14/08/12   227      5    14:32   14:32:09   14:33:25 0.005164874
## 3 14/08/12   227      6    16:17   16:17:13   16:23:16 0.005470574
## 4 14/08/12   227      8    18:04   18:04:10   18:06:29 0.005782097
## 5 14/08/12   227      9    20:57   20:58:23   21:00:18 0.006285774
## 6 14/08/12   227     10    21:54   21:54:15   21:56:05 0.006451726

Selecting the optimal Bandwidth

One way of selecting a good bandwidth parameter is to fit time-varying models with different candidate bandwidth parameters on a training data set, and evaluate their prediction error on a test data set. The function bwSelect() implements such a bandwidth selection scheme. Here we do not show the specification of this function, because it has the same input arguments as tvmvar(), which we describe below, plus a candidate sequence of bandwidth values and some specifications for how to split the data into training and test data. In addition, data-driven bandwidth selection can take a considerable amount of time to run, which would not allow you to run the code while reading the blog post. For this tutorial, we therefore just fix the bandwidth to the value that was returned by bwSelect(). However, you can find the code to perform bandwidth selection with bwSelect() on the present data set here.

bandwidth <- .34
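To give an intuition for what data-driven bandwidth selection does, here is a toy sketch in base R. It does not use bwSelect(): instead of a VAR model it estimates a time-varying mean with a kernel-weighted average, and the simulated series, the train/test split and the candidate values in bw_seq are all hypothetical choices.

```r
set.seed(1)

# Toy series with a smoothly changing mean, on normalized time [0, 1]
n  <- 200
tp <- seq(0, 1, length.out = n)
y  <- sin(2 * pi * tp) + rnorm(n, sd = 0.5)

# Hold out every 5th observation as test data
test  <- seq(5, n, by = 5)
train <- setdiff(seq_len(n), test)

# Candidate bandwidths
bw_seq <- c(0.01, 0.05, 0.1, 0.2, 0.5)

# Out-of-sample RMSE of the kernel-weighted mean for each bandwidth
test_rmse <- sapply(bw_seq, function(bw) {
  pred <- sapply(tp[test], function(t0) {
    w <- dnorm(tp[train], mean = t0, sd = bw)
    sum(w * y[train]) / sum(w)
  })
  sqrt(mean((y[test] - pred)^2))
})

# Bandwidth with the lowest test error
bw_best <- bw_seq[which.min(test_rmse)]
bw_best
```

Very small bandwidths tend to overfit the noise and very large ones smooth away the change over time, so the test error is typically lowest somewhere in between.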

Estimating time-varying VAR Models

We can now specify the estimation of the time-varying VAR model. We provide the data as input and we specify the type of variables and how many categories they have with the type and level arguments. In our example data set all variables are continuous, and we therefore set type = rep("g", 12) for continuous-Gaussian, and set the number of categories to 1 by convention. We choose to select the regularization parameters with cross-validation with lambdaSel = "CV", and we specify that the VAR model should include a single lag with lags = 1. The arguments beepvar and dayvar provide the day and the notification number on a given day for each measurement, which is necessary to specify the VAR design matrix. In addition, we provide the time stamps of all measurements with timepoints = time_data$time_norm to account for missing measurements. Note, however, that we still assume a constant lag size of 1. The time stamps are only used to ensure that the weighting indeed gives those time points the highest weight that are closest to the current estimation point (for details see Section 2.5 in this paper). So far, the specification is the same as for the mvar() function, which fits stationary mixed VAR models.

For the time-varying model, we need to specify two additional arguments. First, with estpoints = seq(0, 1, length = 20) we specify that we would like to estimate 20 local models across the duration of the entire time series (which is normalized to [0,1]). The number of estimation points can be chosen arbitrarily large, but at some point adding more estimation points is not worth the additional computational costs, because subsequent local models are essentially identical. Finally, we specify the bandwidth with the bandwidth argument.

# Estimate Model on Full Dataset
set.seed(1)
tvvar_obj <- tvmvar(data = mood_data,
                    type = rep("g", 12),
                    level = rep(1, 12), 
                    lambdaSel = "CV",
                    lags = 1,
                    beepvar = time_data$beepno,
                    dayvar = time_data$dayno,
                    timepoints = time_data$time_norm, 
                    estpoints = seq(0, 1, length = 20), 
                    bandwidth = bandwidth,
                    pbar = FALSE)

## Note that the sign of parameter estimates is stored separately; see ?tvmvar

We can paste the output object into the console

# Check on how much data was used
tvvar_obj

## mgm fit-object 
## 
## Model class:  Time-varying mixed Vector Autoregressive (tv-mVAR) model 
## Lags:  1 
## Rows included in VAR design matrix:  876 / 1476 ( 59.35 %) 
## Nodes:  12 
## Estimation points:  20

which provides a summary of the model and also shows how many rows were in the VAR design matrix (876) compared to the number of time points in the data set (1476). The former number is lower, because a VAR(1) model can only be estimated if for a given time point also the time point 1 lag earlier is available. This is not the case for the first measurement on a given day or if there are missing responses during the day.
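The logic behind the reduced number of rows can be illustrated with a small base-R sketch. The day and beep numbers below are hypothetical, the rows are assumed to be ordered in time, and the rule shown is a simplified illustration of the idea rather than mgm's internal code.

```r
# Hypothetical day and beep numbers for 8 consecutive measurements
dayno  <- c(1, 1, 1, 1, 2, 2, 2, 3)
beepno <- c(1, 2, 3, 5, 1, 2, 3, 4)
n <- length(dayno)

# A measurement is usable for a VAR(1) model if the previous row is on
# the same day and exactly one beep earlier
usable <- c(FALSE,
            dayno[-1] == dayno[-n] & (beepno[-1] - beepno[-n]) == 1)

data.frame(dayno, beepno, usable)
sum(usable)  # 4 of the 8 measurements can enter the design matrix
```

The first measurement of each day and the measurement following the skipped beep 4 on day 1 are dropped, which is the same mechanism that reduces 1476 time points to 876 rows above.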

Computing Time-varying Prediction Errors

Similarly to stationary VAR models, we can compute prediction errors. This can be done with the predict() function, which takes the model object, the data, and two variables indicating the day number and notification number. Providing the data and the notification variables independently from the model object allows one to compute prediction errors for new samples.

The argument errorCon = c("R2", "RMSE") specifies that the proportion of explained variance ($R^2$) and the Root Mean Squared Error (RMSE) should be returned as prediction errors. The final argument tvMethod specifies how time-varying prediction errors should be calculated. The option tvMethod = "closestModel" makes predictions for a time point using the local model that is closest to it. The option chosen here, tvMethod = "weighted", provides a weighted average of the predictions of all local models, weighted using the weighting function centered on the location of the time point at hand. Typically, both methods give very similar results.

pred_obj <- predict(object = tvvar_obj, 
                    data = mood_data, 
                    beepvar = time_data$beepno,
                    dayvar = time_data$dayno,
                    errorCon = c("R2", "RMSE"),
                    tvMethod = "weighted")

The main outputs are the following two objects: pred_obj$tverrors is a list that includes the prediction errors for each estimation point / local model; pred_obj$errors contains the average error across estimation points.
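For reference, the two error measures can be written out in base R for a single variable. The vectors obs and pred are simulated for illustration, and the formulas are the textbook definitions; I am assuming, not asserting, that they correspond to what predict() reports.

```r
set.seed(1)

# Hypothetical observed values and model predictions for one variable
obs  <- rnorm(100)
pred <- 0.5 * obs + rnorm(100, sd = 0.1)

# Root Mean Squared Error: typical size of a prediction error
rmse <- sqrt(mean((obs - pred)^2))

# Proportion of explained variance: 1 - residual SS / total SS
r2 <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

round(c(RMSE = rmse, R2 = r2), 3)
```

RMSE is on the scale of the variable, whereas $R^2$ is unit-free, which makes it easier to compare across variables.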

Visualizing parts of the Model

The time-varying VAR(1) model consists of $(p + p^2) \times E$ parameters, where $p$ is the number of variables and $E$ is the number of estimation points. Visualizing all parameters at once is therefore challenging. Instead, one can pick the parameters that are of most interest for the research question at hand. Here, we choose two different visualizations. First, we use the qgraph() function from the R-package qgraph to inspect the VAR interaction parameters at estimation points 1, 10, and 20:

library(qgraph)

par(mfrow=c(1,3))
for(tp in c(1,10,20)) qgraph(t(tvvar_obj$wadj[, , 1, tp]), 
                             layout = "circle",
                             edge.color = t(tvvar_obj$edgecolor[, , 1, tp]), 
                             labels = mood_labels, 
                             mar = rep(5, 4), 
                             vsize=14, esize=15, asize=13,
                             maximum = .5, 
                             pie = pred_obj$tverrors[[tp]][, 3],
                             title = paste0("Estimation point = ", tp), 
                             title.cex=1.2)

dev.off()

## null device 
##           1

We see that some parameters in the VAR models are varying considerably over time. For example, the autocorrelation effect of Relaxed seems to be decreasing over time, the positive effect of Strong on Satisfied only appears at estimation point 20, and also the negative effect of Satisfied on Guilty only appears at estimation point 20.

We can zoom in on these individual parameters by plotting them as a function of time:

# Obtain parameter estimates with sign
par_ests <- tvvar_obj$wadj[, , 1, ]
par_ests[tvvar_obj$edgecolor[, , 1, ]=="red"] <- par_ests[tvvar_obj$edgecolor[, , 1, ]=="red"] * -1

# Select three parameters to plot
m_par_display <- matrix(c(1, 1, 
                          4, 12, 
                          10, 4), ncol = 2, byrow = TRUE)
# Plotting
plot.new()
par(mar = c(4, 4, 0, 1))
plot.window(xlim=c(1, 20), ylim=c(-.25, .55))
axis(1, c(1, 5, 10, 15, 20), labels = T)
axis(2, c(-.25, 0, .25, .5), las = 2)
abline(h = 0, col = "grey", lty = 2)
title(xlab = "Estimation points", cex.lab = 1.2)
title(ylab = "Parameter estimate", cex.lab = 1.2)

for(i in 1:nrow(m_par_display)) {
  par_row <- m_par_display[i, ]
  P1_pointest <- par_ests[par_row[1], par_row[2], ]
  lines(1:20, P1_pointest, lwd = 2, lty = i) 
}

legend_labels <- c(expression("Relaxed"["t-1"]  %->%  "Relaxed"["t"]),
                   expression("Strong"["t-1"]  %->%  "Satisfied"["t"]),
                   expression("Satisfied"["t-1"]  %->%  "Guilty"["t"]))
legend(1, .49, 
       legend_labels,
       lwd = 2, bty = "n", cex = 1, horiz = T, lty = 1:3)

We see that the effect of Relaxed on itself on the next time point is relatively strong at the beginning of the time series, but then decreases towards zero and remains zero from around estimation point 13. The cross-lagged effect of Strong on Satisfied on the next time point is equal to zero until around estimation point 9 but then seems to increase monotonically. Finally, the cross-lagged effect of Satisfied on Guilty is also equal to zero until around estimation point 13 and then decreases monotonically.

Stability of Estimates

Similar to stationary models, one can assess the stability of time-varying parameters using bootstrapped sampling distributions. The mgm package allows one to do that with the resample() function. Here you can find code on how to do that for the example in this tutorial.

Time-varying or not?

Clearly, “time-varyingness” is a continuum that goes from stationary to extremely time-varying. However, in some cases it might be necessary to decide whether the parameters of a VAR model are reliably time-varying. To reach such a decision one can use a hypothesis test with the null hypothesis that the model is not time-varying. Here is one way to perform such a hypothesis test: One begins by fitting a stationary VAR model to the data and then repeatedly simulates data from this estimated model. For each of these simulated time series data sets, one computes a pooled prediction error for the time-varying model. The distribution of these prediction errors serves as the sampling distribution of prediction errors under the null hypothesis. Now one can compute the pooled prediction error of the time-varying VAR model on the empirical data and use it as a test statistic. This test is explained in more detail here, and the code to implement this test for the data set used in this tutorial can be found here.
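The steps of this test can be sketched with a univariate toy example in base R. Everything here is a simplifying assumption made for illustration: the series is a simulated AR(1) process, the helper tv_error() fits kernel-weighted regressions with lm() instead of calling tvmvar(), the pooled error is an in-sample RMSE, and only 100 simulated data sets are used.

```r
set.seed(1)

# Toy AR(1) series whose coefficient drifts from 0.1 to 0.8
n <- 300
phi_true <- seq(0.1, 0.8, length.out = n)
y <- numeric(n)
for (i in 2:n) y[i] <- phi_true[i] * y[i - 1] + rnorm(1)

# Pooled RMSE of a kernel-smoothed (time-varying) AR(1) model
tv_error <- function(y, bandwidth = 0.2, n_est = 10) {
  n <- length(y)
  d <- data.frame(y = y[-1], ylag = y[-n],
                  tp = seq(0, 1, length.out = n - 1))
  estpoints <- seq(0, 1, length.out = n_est)
  # One weighted regression per estimation point; keep its fitted values
  preds <- sapply(estpoints, function(e) {
    w <- dnorm(d$tp, mean = e, sd = bandwidth)
    fitted(lm(y ~ ylag, data = d, weights = w))
  })
  # Predict every time point with its closest local model
  closest <- apply(abs(outer(d$tp, estpoints, "-")), 1, which.min)
  pred <- preds[cbind(seq_len(nrow(d)), closest)]
  sqrt(mean((d$y - pred)^2))
}

# Test statistic: error of the time-varying model on the empirical data
stat_emp <- tv_error(y)

# Null distribution: simulate from the fitted stationary AR(1) model
fit0 <- lm(y[-1] ~ y[-n])
phi0 <- unname(coef(fit0)[2])
sd0  <- sigma(fit0)
stat_null <- replicate(100, {
  ys <- numeric(n)
  for (i in 2:n) ys[i] <- phi0 * ys[i - 1] + rnorm(1, sd = sd0)
  tv_error(ys)
})

# Proportion of null errors that are at least as small as the empirical one
p_value <- mean(stat_null <= stat_emp)
p_value
```

A small p-value indicates that the time-varying model predicts the empirical data better than would be expected if the data came from a stationary model.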

Summary

In this blog post, I have shown how to estimate a time-varying VAR model with a kernel smoothing approach, which is based on the assumption that all parameters are a smooth function of time. In addition to estimating the model, we discussed the selection of an appropriate bandwidth parameter, how to compute (time-varying) prediction errors, and how to visualize different aspects of the model. Finally, I provided pointers to code that shows how to assess the stability of estimates via bootstrapping, and how to perform a hypothesis test one can use to select between stationary and time-varying VAR models.

I would like to thank Fabian Dablander for his feedback on this blog post.

Moderated Network Models for Continuous Data

Thu, 05 Sep 2019 11:00:00 +0000

Statistical network models have become a popular exploratory data analysis tool in psychology and related disciplines, allowing one to study relations between variables. The most popular models in this emerging literature are the binary-valued Ising model and the multivariate Gaussian distribution for continuous variables, which both model interactions between pairs of variables. In these pairwise models, the interaction between any pair of variables A and B is a constant and therefore does not depend on the values of any of the variables in the model. Put differently, none of the pairwise interactions is moderated. However, in highly complex and contextualized fields like psychology, such moderation effects are often plausible. In this blog post, I show how to fit, analyze, visualize and assess the stability of Moderated Network Models for continuous data with the R-package mgm.

Moderated Network Models (MNMs) for continuous data extend the pairwise multivariate Gaussian distribution with moderation effects (3-way interactions). The implementation in the mgm package estimates these MNMs with a nodewise regression approach, and allows one to condition on moderators, visualize the models and assess the stability of their parameter estimates. For a detailed description of how to construct such an MNM, and of how to estimate its parameters, have a look at our paper on MNMs. For a short recap on moderation and its relation to interactions in the regression framework, have a look at this blog post.

Loading & Inspecting the Data

We use a data set with $n=3896$ observations including the variables Hostile, Lonely, Nervous, Sleepy and Depressed, which we took from the 92-item Motivational State Questionnaire (MSQ) data set that comes with the R-package psych. Update June 5 2020: the msq data is not available anymore from the psych package, and we therefore load it manually. Each item is answered on a Likert scale with responses 0 (Not at all), 1 (A little), 2 (Moderately), and 3 (Very much).

data <- read.table("https://jmbh.github.io/files/data/msq.csv", sep=",", header=TRUE)
data <- data[, c("hostile", "lonely", "nervous", "sleepy", "depressed")] # subset
data <- na.omit(data) # exclude rows with missing values

head(data)

##   hostile lonely nervous sleepy depressed
## 1       0      1       1      1         0
## 2       0      0       0      1         0
## 3       0      0       0      0         0
## 4       0      1       2      1         1
## 5       0      0       0      1         0
## 6       0      0       0      2         1

Because MNMs include 3-way interactions in addition to 2-way (pairwise) interactions, they are more sensitive to extreme values. This is because multiplying three extreme values naturally leads to more extreme values than multiplying only two. It is therefore especially important to check the marginal distributions of all variables:

par(mfrow=c(2,3))
for(i in 1:5) {
  barplot(table(data[, i]), axes = FALSE, xlab = "", ylim = c(0, 3000))
  axis(2, las = 2, c(0, 1000, 2000, 3000))
  title(main = colnames(data)[i])
}

We see that the marginal distributions of all variables except Sleepy are right-skewed and are thereby most likely violating the assumption of MNMs that all variables are conditionally Gaussian. This is the same assumption as in any multiple linear regression model, i.e. that the residuals have a Gaussian distribution. One option would be to transform the variables, possibly by taking the log or square root, or applying the nonparanormal transform. However, any transformation renders the interpretation of parameters more difficult (for example, “increasing A by 1 unit increases the nonparanormal transform of B by $\beta_{AB}$, keeping everything else constant”), which is why we choose here to use the original variables, but to later check the reliability of our estimates using bootstrapping.

Estimating the Moderated Network Model

MNMs can be estimated with the mgm() function, and which moderation effects are included is specified with the moderators argument:

library(mgm) # 1.1-7

set.seed(1)
mgm_mod <- mgm(data = data,
               type = rep("g", 5),
               level = rep(1, 5),
               lambdaSel = "CV",
               ruleReg = "AND",
               moderators = 5, 
               threshold = "none", 
               pbar = FALSE)

## Note that the sign of parameter estimates is stored separately; see ?mgm

One can specify a particular set of moderation effects by providing a $M \times 3$ matrix to the moderators argument which indicates the specific set of 3-way interactions to be included in the model, where $M$ is the number of included moderation effects. If one provides a vector, then all moderation effects involving the specified variables are included. For the present example we include all moderation effects of the variable Depressed by setting moderators = 5 (Depressed is in column 5).
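As an illustration of the matrix format, the following base-R sketch builds a moderators matrix with $M = 2$ rows. The object name mod_matrix and the choice of the two 3-way interactions are hypothetical; each row lists the column numbers of the three variables involved in one moderation effect.

```r
# Two specific moderation effects, one per row:
# Hostile (1) - Nervous (3) moderated by Depressed (5), and
# Lonely (2)  - Sleepy (4)  moderated by Depressed (5)
mod_matrix <- matrix(c(1, 3, 5,
                       2, 4, 5),
                     ncol = 3, byrow = TRUE)

mod_matrix
dim(mod_matrix)  # M x 3, here 2 x 3

# This matrix would then be passed to mgm() via moderators = mod_matrix
```

Compared with moderators = 5, this restricts the model to two of the six possible moderation effects involving Depressed.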

The remaining arguments are standard arguments of mgm(): type indicates the type of each variable, which in our example is continuous for all variables and is specified with a "g" for “Gaussian”. level indicates the number of categories of a given variable, which is not applicable for continuous variables and set to 1 by convention. lambdaSel="CV" specifies that the regularization parameters in the $\ell_1$-regularized nodewise estimation approach used in mgm() are selected with cross-validation, and threshold = "none" specifies that no additional thresholding should be performed after estimation.

We can inspect the nonzero interaction parameters in the mgm() output object:

mgm_mod$interactions$indicator

## [[1]]
##       [,1] [,2]
##  [1,]    1    2
##  [2,]    1    3
##  [3,]    1    4
##  [4,]    1    5
##  [5,]    2    3
##  [6,]    2    4
##  [7,]    2    5
##  [8,]    3    4
##  [9,]    3    5
## [10,]    4    5
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    2    5
## [2,]    1    3    5
## [3,]    2    4    5
## [4,]    3    4    5

We see that we estimated ten pairwise interactions (that is, all possible $\frac{5(5-1)}{2} = 10$ pairwise interactions) and four 3-way interactions or moderation effects. We see that each nonzero 3-way interaction involves variable 5 (Depression), which has to be the case since we only specified Depression as a moderator. The parameter estimates can be retrieved with the showInteraction() function. For example, the parameter for the pairwise interaction between variables 1 (Hostile) and 3 (Nervous) can be obtained like this:

showInteraction(object = mgm_mod, int = c(1,3))

## Interaction: 1-3 
## Weight:  0.1145438 
## Sign:  1  (Positive)

We can interpret this parameter using the usual linear regression interpretation: when increasing Nervous (Hostile) by 1 unit, Hostile (Nervous) increases by $\approx 0.114$, keeping everything else constant.

Note, however, that variables 1 and 3 are also involved in a 3-way interaction with 5 (Depressed), that is, the pairwise interaction between 1 and 3 is moderated by 5 (Depression). Similarly to above, we can retrieve the moderation effect using showInteraction():

showInteraction(object = mgm_mod, int = c(1,3,5))

## Interaction: 1-3-5 
## Weight:  0.05279001 
## Sign:  1  (Positive)

We see that there is a positive moderation effect. This means that, if one increases the values of Depression, the pairwise interaction between Hostile and Nervous becomes stronger. For example, if Depression = 0, then the parameter for the pairwise interaction between Hostile and Nervous is equal to $0.114 + 0.053 \times 0 = 0.114$. If Depression = 1, then the pairwise interaction parameter is equal to $0.114 + 0.053 \times 1 = 0.167$. In this example, we fixed Depression to a given value and computed the resulting pairwise interaction between Hostile and Nervous. We can also do this for the entire network, and inspect the network for different values of Depression. The function condition() does that for you (see below). The discussion of the moderation effect shows that we have to correct our above interpretation of the pairwise interaction between Hostile and Nervous: when increasing Nervous (Hostile) by 1 unit, Hostile (Nervous) increases by $\approx 0.114$, if Depression is equal to zero and when keeping everything else constant.
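The hand calculation above can be written as a one-liner in R. The object names are hypothetical, and the two numbers are the rounded estimates reported by showInteraction() above; this mirrors the arithmetic in the text and is not the computation performed by condition().

```r
b_pairwise  <- 0.114  # pairwise interaction Hostile-Nervous (rounded)
b_moderator <- 0.053  # moderation effect of Depression (rounded)

# Pairwise interaction between Hostile and Nervous for Depression = 0, 1, 2
depression <- 0:2
cond_par <- b_pairwise + b_moderator * depression
names(cond_par) <- paste0("Depression = ", depression)
cond_par
```

Each additional unit of Depression increases the Hostile-Nervous interaction by the size of the moderation effect.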

Why does the pairwise effect represent exactly the effect when Depression is equal to zero? The reason is that we mean-centered all variables before estimation by default. This leads to the situation that the pairwise interaction parameter represents the moderated effect when conditioning on the moderator value with the highest density (assuming that the moderator variable is symmetric and unimodal, which we do here because we assume all variables to be conditional Gaussians). Uncentered variables often lead to pairwise parameters that are conditioned on moderator values that do not exist in the data, which leads to pairwise parameters that have no meaningful interpretation. For a detailed discussion of this see here.

Conditioning on the Moderator

The function condition() conditions on (fixes values of) a set of moderators. In our model, we included only a single moderator (Depression), so we only fix the values of Depression. Note that internally, mgm() scales all variables to mean = 0, SD = 1, to ensure that the regularization on parameters does not depend on the variance of the associated variable(s). We therefore need to specify values based on the scaled version of the Depression variable. To pick a reasonable set of values, we inspect the scaled version of the Depression variable:

tb <- table(scale(data$depressed))
names(tb) <- round(as.numeric(names(tb)), 2)
barplot(tb, axes=FALSE, xlab="", ylim=c(0, 3000))
axis(2, las=2, c(0, 1000, 2000, 3000))

Here, we choose the values 0, 1 and 2.
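To see how values on the scaled variable relate to the original response categories, one can convert between the two scales by hand. The sketch below uses a hypothetical right-skewed variable x on the 0 to 3 response scale instead of the real Depression item, so the numbers it prints do not apply to the MSQ data.

```r
set.seed(1)

# Hypothetical right-skewed responses on the 0-3 scale
x <- sample(0:3, 1000, replace = TRUE, prob = c(0.6, 0.25, 0.1, 0.05))

# Scaled values that correspond to the original categories 0, 1, 2, 3
scaled_vals <- (0:3 - mean(x)) / sd(x)
round(scaled_vals, 2)

# Conversely: original-scale values that correspond to scaled values 0, 1, 2
orig_vals <- c(0, 1, 2) * sd(x) + mean(x)
round(orig_vals, 2)
```

Applying the second conversion to the real data shows which response levels the chosen conditioning values represent.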

To condition on, that is, fix a set of variables, we provide two arguments to condition(): first, the mgm() output object; and second, a list in which the entry name indicates the variable (by its column number) and the entry value indicates the value to which the variable should be fixed. For example, if we would like to fix Depression = 2 we specify values = list('5' = 2). Here, we call condition() and fix Depression (5) to the values 0, 1 and 2.

cond0 <- condition(object = mgm_mod, 
                   values = list('5' = 0))

cond1 <- condition(object = mgm_mod, 
                   values = list('5' = 1))

cond2 <- condition(object = mgm_mod, 
                   values = list('5' = 2))
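If one wants to condition on many values, the lists passed to values can also be built programmatically. The following is a small base-R sketch of that idea. The names mod_col, fix_values and l_values are made up for illustration, and the call to condition() is only shown as a comment because it needs the fitted model from above.

```r
# Column index of the moderator (Depression is column 5 in this post)
mod_col    <- 5
fix_values <- c(0, 1, 2)

# One named list per value; the entry name is the column number as a string
l_values <- lapply(fix_values, function(v) {
  setNames(list(v), as.character(mod_col))
})

# Same structure as writing list('5' = 2) by hand
print(identical(l_values[[3]], list('5' = 2)))

# With the fitted model available one could then loop over the lists, e.g.
# lapply(l_values, function(v) condition(object = mgm_mod, values = v))
```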

The output of condition() is a complete mgm model object, conditioned on the provided set of variables. For example, cond0 contains the model object conditioned on Depression = 0. On the population level (i.e. if sample size does not play a role), this is the model one would obtain if one took only the rows in which Depression = 0 in the data set, and estimated a pairwise model on this subset of the data.

Note that we can compute the model object conditioned on any value. However, if we specify values that we do not actually observe, for example Depression = 7, we extrapolate beyond the observed data and therefore do not know whether the computed conditional mgm object is accurate.
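A simple safeguard against accidental extrapolation is to compare the conditioning values with the range of the scaled moderator before calling condition(). The sketch below uses a simulated stand-in variable, not the data from this post, and in_observed_range() is a hypothetical helper that is not part of mgm.

```r
# Toy stand-in for a skewed moderator variable (hypothetical data)
set.seed(1)
x <- sample(0:3, size = 500, replace = TRUE, prob = c(.6, .25, .1, .05))
x_scaled <- as.numeric(scale(x))

# Hypothetical helper: does a conditioning value lie inside the observed range?
in_observed_range <- function(value, observed) {
  value >= min(observed) & value <= max(observed)
}

print(round(range(x_scaled), 2))
print(in_observed_range(c(0, 1, 2, 7), x_scaled))
```

Being inside the observed range does not guarantee that many observations lie close to a value, so the bar plot above remains the more informative check.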

Since we only specified a single moderator in the present example, the three conditional mgm objects are now pairwise models. We can inspect their parameters as usual, or visualize them:

l_cond <- list(cond0, cond1, cond2)

library(qgraph)
par(mfrow=c(1,3))

max_val <- max(max(l_cond[[1]]$pairwise$wadj),
               max(l_cond[[2]]$pairwise$wadj),
               max(l_cond[[3]]$pairwise$wadj))

for(i in 1:3) qgraph(l_cond[[i]]$pairwise$wadj, layout="circle", 
                     edge.color=l_cond[[i]]$pairwise$edgecolor, 
                     labels = colnames(msq_p5), 
                     maximum = max_val, 
                     edge.labels = TRUE, edge.label.cex=2,
                     vsize=20, esize=18, 
                     title = paste0("Depression = ", (0:2)[i]))

Let’s focus again on the pairwise interaction between Hostile and Nervous. As illustrated above, the relationship becomes stronger when increasing Depression. We also see changes in the three other pairwise interactions that are moderated by Depression: Hostile-Lonely, Lonely-Sleepy, and Nervous-Sleepy. All these conditional pairwise effects can be interpreted as partial correlations. However, note that the pairwise interaction between Hostile-Nervous for Depression=1 computed above ($.167$) is not the same as the one shown in the network picture ($.18$). The reason is that showInteraction() reports the moderation effect aggregated over the estimates of this parameter from all three regressions, while condition() reports the moderation effect aggregated over the two regressions on the two variables in the pairwise interaction. For details have a look at our paper on Moderated Network Models.

Note that Depression is not connected to any other variable in all of the networks. The reason is that in each of the networks Depression is fixed to a value, and therefore technically not part of the model anymore. However, in mgm we keep the variable in the fit object to avoid confusion when comparing MNMs with pairwise models.

Displaying the pairwise networks after conditioning on different values of the moderator variable is perhaps the most intuitive way to report the results of a Moderated Network Model. However, this is only feasible if the number of moderators is small, because the number of cases to consider is equal to $3^m$ if we consider three values for each moderator and have $m$ moderators. In our case we had $3^1=3$ which is still easy to visualize. However, $m=2,3,4$ moderators lead to $9, 27, 81$ cases. An alternative way to visualize MNMs that also works for a larger number of moderators is to visualize the MNM in a factor graph.
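The growth in the number of conditional networks can be made concrete by enumerating all combinations of conditioning values. This is a minimal base-R sketch; the choice of the three values 0, 1 and 2 for every moderator simply mirrors the example above.

```r
# Three conditioning values per moderator, as in the example above
cond_values <- c(0, 1, 2)

# Number of conditional networks for m = 1, ..., 4 moderators
n_cases <- sapply(1:4, function(m) {
  grid <- expand.grid(rep(list(cond_values), m))  # one row per combination
  nrow(grid)
})
names(n_cases) <- paste0("m=", 1:4)

print(n_cases)
```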

Visualization using Factor Graph

Here we show how to visualize a MNM that includes several moderators. To this end, we fit the same model as above, but now include all variables as moderators:

set.seed(1)
mgm_mod_all <- mgm(data = data,
                   type = rep("g", 5),
                   level = rep(1, 5),
                   lambdaSel = "CV",
                   ruleReg = "AND",
                   moderators = 1:5, 
                   threshold = "none", 
                   pbar = FALSE)

## Note that the sign of parameter estimates is stored separately; see ?mgm

We again check which interactions have been estimated to be nonzero

mgm_mod_all$interactions$indicator

## [[1]]
##       [,1] [,2]
##  [1,]    1    2
##  [2,]    1    3
##  [3,]    1    4
##  [4,]    1    5
##  [5,]    2    3
##  [6,]    2    4
##  [7,]    2    5
##  [8,]    3    4
##  [9,]    3    5
## [10,]    4    5
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    2    4
## [3,]    1    3    4
## [4,]    1    3    5
## [5,]    2    3    4
## [6,]    2    4    5
## [7,]    3    4    5

and we see that we recovered a few additional 3-way interactions (moderation effects).

Before visualizing this model as a factor graph, let’s consider why we actually need to go beyond a typical network. For example, if there is a 3-way interaction between variables 1-2-3, one could simply connect those three nodes to indicate the 3-way interaction. The problem with this solution, however, is that we cannot tell from the graph alone anymore whether this triangle comes indeed from a 3-way interaction, or from three 2-way (pairwise) interactions, or even a combination of 2-way and 3-way interactions. We therefore need a more powerful graph.

A factor graph includes, as usual, nodes as variables. But it includes additional variables for interaction parameters. That is, we add an additional (factor) node for each 2-way and 3-way interaction. With an mgm() output object as input, the FactorGraph() function plots the factor graph for us:

FactorGraph(mgm_mod_all, 
            labels = colnames(data), 
            layout = "circle")

The five round nodes indicate the five variables, while the red square nodes indicate pairwise interactions, and the blue triangle nodes indicate 3-way interactions. Corresponding to the estimated interactions listed in mgm_mod_all$interactions$indicator we see that there are seven 3-way interactions, and 10 pairwise interactions.
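The counts can also be read off the indicator list directly. Since the sketch below should run without the fitted model, I typed in the two matrices from the output printed above; with the model object at hand one would use mgm_mod_all$interactions$indicator instead.

```r
# Indicator list typed in from the output printed above
indicator <- list(
  t(combn(5, 2)),  # all 10 pairs of the 5 variables
  rbind(c(1, 2, 3), c(1, 2, 4), c(1, 3, 4), c(1, 3, 5),
        c(2, 3, 4), c(2, 4, 5), c(3, 4, 5))
)

# Number of factor nodes of each order
n_by_order <- sapply(indicator, nrow)
names(n_by_order) <- c("pairwise", "3-way")
print(n_by_order)

# How often each variable takes part in a 3-way interaction
print(table(factor(as.vector(indicator[[2]]), levels = 1:5)))
```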

The FactorGraph() function also allows one to only plot 3-way interactions as factor nodes and 2-way interactions as standard edges between variables by setting PairwiseAsEdge = TRUE, which often helps to simplify the visualization. In addition, it allows one to pass any arguments to qgraph(), which is called by FactorGraph().

Assessing Stability of Estimates

As with pairwise network models, it is useful to obtain a measure for how stable the estimated parameters are. This is especially important in MNMs: first, as discussed at the beginning of this blog post, moderation effects depend more on extreme values than pairwise effects; second, MNMs have more parameters, and therefore the variance on estimates may be larger than for pairwise models; and third, moderation effects are typically smaller than pairwise effects.

To assess stability, we inspect the bootstrapped sampling distributions of all parameters. The function resample() takes the original model object, the data, and the number of bootstrap samples nB as input, and returns an object with nB models fitted on bootstrap samples. Here, we assess the reliability of estimates of the initial example in which Depression was the only moderator (this takes 30-60s to run):

set.seed(1)
res_obj <- resample(object = mgm_mod, 
                    data = data, 
                    nB = 50,
                    pbar = FALSE)

We can then visualize the summary of all sampling distributions using plotRes():

plotRes(res_obj, 
        axis.ticks = c(-.1, 0, .1, .2, .3, .4, .5), 
        axis.ticks.mod = c(-.1, -.05, 0, .05, .1), 
        cex.label = 1, 
        labels = colnames(msq_p5), 
        layout.width.labels = .40)

The first column displays the pairwise effects, and the second column shows the moderation effects. The horizontal lines show the 5% and 95% quantiles of the bootstrapped sampling distributions. The number is the proportion of bootstrap samples in which a parameter has been estimated to be nonzero, and the number is placed at the location of the mean of the sampling distribution. We observe that the variance of the sampling distribution is small for pairwise effects, indicating that they are stable.
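To make explicit what such a summary of a bootstrapped sampling distribution contains, here is a stripped-down sketch for a single unregularized regression with one moderation effect. The data are simulated and the model is a plain lm(), so this illustrates the general bootstrap idea only and is not what resample() does internally.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 0.3 * x1 + 0.1 * x1 * x2 + rnorm(n)  # toy model (assumed values)
d  <- data.frame(y, x1, x2)

# Refit the model on nB bootstrap samples and keep two coefficients
nB <- 200
boot_est <- t(replicate(nB, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x1 * x2, data = d[idx, ]))[c("x1", "x1:x2")]
}))

# 5% and 95% quantiles and mean of each bootstrapped sampling distribution
summary_tab <- apply(boot_est, 2, function(b) {
  c(quantile(b, c(.05, .95)), mean = mean(b))
})
print(round(summary_tab, 3))
```

Because lm() does not shrink estimates to exactly zero, the proportion of nonzero estimates shown by plotRes() has no counterpart in this sketch.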

Regarding the moderation effects, we first notice that moderation effects on pairwise interactions involving Depressed are always zero. Taking the pairwise interaction Lonely-Depressed as an example, the moderation effect would refer to the 3-way interaction Lonely-Depressed-Depressed. Because we do not include such quadratic effects by default, these parameters are always equal to zero. For the moderation effects that were estimated, we see that their mean sizes are smaller and that the variances are larger relative to their means. However, the moderation effect of Depression on Nervous-Sleepy seems quite stable, and at least the moderation effects on Lonely-Sleepy and Hostile-Nervous are somewhat stable.

Model Selection

Finally, I would like to address the topic of model selection; specifically, the problem of selecting between models with and without moderation effects. Since all effects are subject to regularization, the model selection in mgm() between models with different regularization parameters essentially selects between models with and without moderation.

An alternative would be to compare the fit of a (possibly unregularized) model including only pairwise interactions with the fit of a (possibly unregularized) model including pairwise interactions and (a set of) moderation effects. This could be done by comparing the fit in a separate held-out data set, or by approximating the out-of-sample error with the held-out-fold error of a cross-validation scheme. These approaches, however, have to be performed on a nodewise basis, because a limitation of MNMs is that it is unclear whether the estimates obtained from nodewise regression give rise to a proper joint distribution (for details see our paper). This precludes using any global likelihood ratio tests or global goodness-of-fit measures. However, all measures can be computed on a nodewise basis and combined to achieve a global model selection.
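For a single node, such a comparison could look as follows. This is a rough sketch on simulated data with unregularized lm() fits and a simple split into training and test halves. The parameter values and the helper mse() are assumptions for illustration, and in an MNM one would repeat this with each variable as the response.

```r
set.seed(1)
n  <- 4000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 0.3 * x1 + 0.2 * x2 + 0.3 * x1 * x2 + rnorm(n)  # toy node (assumed values)
d  <- data.frame(y, x1, x2)

train <- d[1:2000, ]
test  <- d[2001:4000, ]

fit_pair <- lm(y ~ x1 + x2, data = train)  # pairwise effects only
fit_mod  <- lm(y ~ x1 * x2, data = train)  # plus moderation effect

# Hypothetical helper: mean squared prediction error on new data
mse <- function(fit, newdata) {
  mean((newdata$y - predict(fit, newdata = newdata))^2)
}

err <- c(pairwise = mse(fit_pair, test), moderated = mse(fit_mod, test))
print(round(err, 3))
```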

Moderated Mixed Graphical Models

In the present blog post, I have focused on MNMs that include only continuous variables. However, mgm() also allows one to fit MGMs with moderation effects, in which any type of variable can be a moderator variable that moderates the pairwise interaction between pairs of any types of variables. An interesting special case of this flexible model is a model in which one includes a single categorical variable as a moderator, since this presents an alternative way to estimate group differences between 2 or more groups. I will write a blog post about this in the near future.

Summary

I showed how to estimate Moderated Network Models using mgm() and how to use showInteraction() to retrieve parameter estimates from the output object. For the model including only Depression as a moderator, we used condition() to fix, that is, condition on a set of values of the moderator variable, and used the qgraph package to visualize the separate conditional network models. We showed how to use FactorGraph() to draw a factor graph of an estimated MNM including more than one moderator, which is useful especially when the model includes several moderator variables. And finally, we used resample() and plotRes() to evaluate the stability of estimates using bootstrapping.

In case anything is unclear or if you have any questions or comments, please let me know in the comments!

I would like to thank Fabian Dablander and Matti Heino for their feedback on this blog post.

Regression with Interaction Terms - How Centering Predictors influences Main Effects

Thu, 16 Feb 2017 10:00:00 +0000

Centering predictors in a regression model with only main effects has no influence on the main effects. In contrast, in a regression model including interaction terms centering predictors does have an influence on the main effects. After getting confused by this, I read this nice paper by Afshartous & Preston (2011) on the topic and played around with the examples in R. I summarize the resulting notes and code snippets in this blogpost.

We give an explanation on two levels:

By illustrating the issue with the simplest possible example
By showing in general how main effects are a function of the constants (e.g. means) that are subtracted from predictor variables

Explanation 1: Simplest example

The simplest possible example to illustrate the issue is a regression model in which variable $Y$ is a linear function of variables $X_1$, $X_2$, and their product $X_1X_2$

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1X_2 + \epsilon,\]

where we set $\beta_0 = 1, \beta_1 = 0.3, \beta_2 = 0.2, \beta_3 = 0.2$, and $\epsilon \sim N(0, \sigma^2)$ is Gaussian noise with mean zero and variance $\sigma^2$ (we use $\sigma^2 = 1$ in the code below). We define the predictors $X_1, X_2$ as Gaussians with means $\mu_{X_1} = \mu_{X_2} = 1$ and variances $\sigma_{X_1}^{2}=\sigma_{X_2}^{2}=1$. This code samples $n = 10000$ observations from this model:

n <- 10000
b0 <- 1; b1 <- .3; b2 <- .2; b3 <- .2
set.seed(1)
x1 <- rnorm(n, mean = 1, sd = 1)
x2 <- rnorm(n, mean = 1, sd = 1)
y <- b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2 + rnorm(n, mean = 0, sd = 1)

Regression models with main effects

We first verify that centering variables indeed does not affect the main effects. To do so, we first fit the linear regression with only main effects with uncentered predictors

lm(y ~ x1 + x2)

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Coefficients:
## (Intercept)           x1           x2  
##      0.8088       0.4983       0.4015

and then with mean centered predictors

x1_c <- x1 - mean(x1) # center predictors
x2_c <- x2 - mean(x2)
lm(y ~ x1_c + x2_c)

## 
## Call:
## lm(formula = y ~ x1_c + x2_c)
## 
## Coefficients:
## (Intercept)         x1_c         x2_c  
##      1.7036       0.4983       0.4015

The parameter estimates of the regression with uncentered predictors are $\hat\beta_1 \approx 0.50$ and $\hat\beta_2 \approx 0.40$. The estimates of the regression with centered predictors are $\hat\beta_1^\ast \approx 0.50$ and $\hat\beta_2^\ast \approx 0.40$ (we denote estimates from regressions with centered predictors with an asterisk). And indeed, $\hat\beta_1 = \hat\beta_1^\ast$ and $\hat\beta_2 = \hat\beta_2^\ast$.

Regression models with main effects + interaction

We include the interaction term and show that centering the predictors now does affect the main effects. We first fit the regression model without centering

lm(y ~ x1 * x2)

## 
## Call:
## lm(formula = y ~ x1 * x2)
## 
## Coefficients:
## (Intercept)           x1           x2        x1:x2  
##      1.0183       0.2883       0.1898       0.2111

and then with centering

lm(y ~ x1_c * x2_c)

## 
## Call:
## lm(formula = y ~ x1_c * x2_c)
## 
## Coefficients:
## (Intercept)         x1_c         x2_c    x1_c:x2_c  
##      1.7026       0.4984       0.3995       0.2111

We see that $\hat\beta_1 \approx 0.29$ and $\hat\beta_2 \approx 0.19$ and $\hat\beta_1^\ast \approx 0.50$ and $\hat\beta_2^\ast \approx 0.40$. While the two models have different parameters, they are statistically equivalent. Here this means that expected values of both models are the same. In empirical terms this means that their coefficient of determination $R^2$ is the same. The reader will be able to verify this in Explanation 2 below.
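The equivalence of the two models can also be checked directly. The following self-contained sketch regenerates the data with the same seed and settings as above and compares $R^2$ and the fitted values of the centered and uncentered model; the object names fit_raw and fit_cen are chosen here for illustration.

```r
set.seed(1)
n  <- 10000
x1 <- rnorm(n, mean = 1, sd = 1)
x2 <- rnorm(n, mean = 1, sd = 1)
y  <- 1 + .3 * x1 + .2 * x2 + .2 * x1 * x2 + rnorm(n, mean = 0, sd = 1)

x1_c <- x1 - mean(x1)  # center predictors
x2_c <- x2 - mean(x2)

fit_raw <- lm(y ~ x1 * x2)
fit_cen <- lm(y ~ x1_c * x2_c)

# Same R2 and same fitted values, despite different main effects
print(all.equal(summary(fit_raw)$r.squared, summary(fit_cen)$r.squared))
print(all.equal(unname(fitted(fit_raw)), unname(fitted(fit_cen))))
```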

We make two observations:

In the model with interaction terms, the main effects differ between the regressions with/without centering of predictors
When centering predictors, the main effects are the same in the model with/without the interaction term (up to some numerical inaccuracy)

Why does centering influence main effects in the presence of an interaction term?

The reason is that in the model with the interaction term, the parameter $\beta_1$ (uncentered predictors) is the main effect of $X_1$ on $Y$ if $X_2 = 0$, and the parameter $\beta_1^\ast$ (centered predictors) is the main effect of $X_1$ on $Y$ if $X_2 = \mu_{X_2}$. This means that $\beta_1$ and $\beta_1^\ast$ are modeling different effects in the data. Here is a more detailed explanation:

Rewriting the model equation in the following way

\[\begin{aligned} \mathbb{E}[Y] &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1X_2 \\ &= \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 \end{aligned}\]

shows that in the model with interaction term, the effect of $X_1$ on $Y$ is equal to $(\beta_1 + \beta_3 X_2)$ and therefore a function of $X_2$. What does the parameter $\beta_1$ model here? It models the effect of $X_1$ on $Y$ when $X_2 = 0$. Similarly we could rewrite the effect of $X_2$ on $Y$ as a function of $X_1$.

Now let $X_1^c = X_1 - \mu_{X_1}$ and $X_2^c = X_2 - \mu_{X_2}$ be the centered predictors. We get the same model equations, now with the parameters estimated using the centered predictors $X_1^c, X_2^c$:

\[\begin{aligned} \mathbb{E}[Y] &= \beta_0^\ast + \beta_1^\ast X_1^c + \beta_2^\ast X_2^c + \beta_3^\ast X_1^c X_2^c \\ &= \beta_0^\ast + (\beta_1^\ast + \beta_3^\ast X_2^c) X_1^c + \beta_2^\ast X_2^c \\ \end{aligned}\]

Again we focus on the effect $(\beta_1^\ast + \beta_3^\ast X_2^c)$ of $X_1^c$ on $Y$. What does the parameter $\beta_1^\ast$ model here? It models the main effect of $X_1^c$ on $Y$ when $X_2^c = \mu_{X_2^c} = 0$. What remained the same is that $\beta_1^\ast$ is the main effect of $X_1^c$ on $Y$ when $X_2^c = 0$. But what is new is that $\mu_{X_2^c} = 0$.

To summarize, in the uncentered case $\beta_i$ is the main effect of $X_i$ when the other predictor variable is equal to zero; and in the centered case, $\beta_i^\ast$ is the main effect of $X_i$ when the other predictor variable is equal to its mean. Clearly, $\beta_i$ and $\beta_i^\ast$ model different effects in the data and it is therefore not surprising that the two regressions give us very different estimates.
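Plugging the true parameter values of our simulation into the conditional effect $(\beta_1 + \beta_3 X_2)$ shows where the two sets of estimates come from. In this small sketch the functions slope_x1() and slope_x2() are made up for illustration.

```r
b1 <- .3; b2 <- .2; b3 <- .2  # true values used in the simulation above
mu_x1 <- 1; mu_x2 <- 1        # population means of the predictors

slope_x1 <- function(x2) b1 + b3 * x2  # effect of X1 on Y at a given X2
slope_x2 <- function(x1) b2 + b3 * x1  # effect of X2 on Y at a given X1

# Uncentered model: other predictor at zero; centered model: at its mean
print(c(at_zero = slope_x1(0), at_mean = slope_x1(mu_x2)))
print(c(at_zero = slope_x2(0), at_mean = slope_x2(mu_x1)))
```

The values at zero ($0.3$ and $0.2$) correspond to the uncentered estimates $\approx 0.29$ and $\approx 0.19$, and the values at the mean ($0.5$ and $0.4$) correspond to the centered estimates $\approx 0.50$ and $\approx 0.40$.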

Centering $\rightarrow$ interpretation of $\beta$ remains the same when adding interaction

Our second observation above was that the estimates of main effects are the same with/without interaction term when centering the predictor variables. This is because in the models without interaction term (centered or uncentered predictors) the interpretation of $\beta_1$ is the same as in the model with interaction term and centered predictors.

More precisely, in the regression model with only main effects, $\beta_1$ is the main effect of $X_1$ on $Y$ averaged over all values of $X_2$, which is the same as the main effect of $X_1$ on $Y$ for $X_2 = \mu_{X_2}$. This means that if we center predictors, $\beta_1$ models the same effect in the data in a model with/without interaction term. This is an attractive property to have when one is interested in comparing models with/without interaction term.

Explanation 2: Main effects as functions of added constants

Subtracting the mean from predictors is a special case of adding constants to predictors. Here we first show numerically what happens to each regression parameter when adding constants to predictors. Then we show analytically how each parameter is a function of its value in the original regression model (no constant added) and the added constants.

Why are we doing this? We are doing this to develop a more general understanding of what happens when adding constants to predictors. It also puts the above example in a more general context, since we can consider it as a special case of the following analysis.

Numerical experiment I: Only main effects

We first fit a series of regression models with only main effects. In each of them we add a different constant to the predictors. We do this to verify that our claim that centering predictors does not change main effects extends to the more general situation of adding constants to predictors.

We first define a sequence of constant values we add to the predictors and create storage for parameter estimates:

n <- 25
c_sequence <- seq(-1.5, 1.5, length = n)

A <- as.data.frame(matrix(NA, ncol=5, nrow=n))
colnames(A) <- c("b0", "b1", "b2", "b3", "R2")

We now fit 25 regression models, and in each of them we add a constant c to both predictors, taken from the sequence c_sequence:

for(i in 1:25) {
  
  c <- c_sequence[i]
  x1_c <- x1 + c
  x2_c <- x2 + c
  
  lm_obj <- lm(y ~ x1_c + x2_c) # Fit model
  A$b0[i] <- lm_obj$coefficients[1]
  A$b1[i] <- lm_obj$coefficients[2]
  A$b2[i] <- lm_obj$coefficients[3]
  
  yhat <- predict(lm_obj)
  A$R2[i] <- 1 - var(yhat - y) / var(y) # Compute R2
  
}

Remark: in Explanation 1 we said that the coefficient of determination $R^2$ does not change when adding constants to the predictors. We invite the reader to verify this by inspecting A$R2.

We plot all parameters $\beta_0, \beta_1, \beta_2$ as a function of c:

library(RColorBrewer)
cols <- brewer.pal(4, "Set1") # Select nice colors

plot.new()
plot.window(xlim=range(c_sequence), ylim=c(-.5, 2.5))
axis(1, round(c_sequence, 2), cex.axis=0.75, las=2)
axis(2, c(-.5, 0, .5, 1, 1.5, 2, 2.5), las=2)
lines(c_sequence, A$b0, col = cols[1])
lines(c_sequence, A$b1, col = cols[2])
lines(c_sequence, A$b2, col = cols[3])
legend("topright", c("b0", "b1", "b2"), 
       col = cols[1:3], lty = rep(1,3), bty = "n")
title(xlab = "Added constant")
title(ylab = "Parameter value")

We see that the intercept changes as a function of c. The model at c = 0 corresponds to the very first model we fitted above. And the model at c = -1 corresponds to the model fitted with centered predictors. But the key observation is that the main effects $\beta_1, \beta_2$ do not change. A proof of this and an exact expression for the intercept will fall out of our analysis of the model with interaction term in the last section of this blogpost.

Numerical experiment II: main effects + interaction term

Next we show that this is different when adding the interaction term. We use the same sequence of c as above and fit regression models with interaction term:

for(i in 1:25) {
  
  c <- c_sequence[i]
  x1_c <- x1 + c
  x2_c <- x2 + c
  
  lm_obj <- lm(y ~ x1_c * x2_c) # Fit model
  A$b0[i] <- lm_obj$coefficients[1]
  A$b1[i] <- lm_obj$coefficients[2]
  A$b2[i] <- lm_obj$coefficients[3]
  A$b3[i] <- lm_obj$coefficients[4]
  
  yhat <- predict(lm_obj) # Predictions on the data used for fitting
  A$R2[i] <- 1 - var(yhat - y) / var(y) # Compute R2
  
}

And again we plot all parameters $\beta_0, \beta_1, \beta_2, \beta_3$ as a function of c:

plot.new()
plot.window(xlim=range(c_sequence), ylim=c(-.5, 2.5))
axis(1, round(c_sequence, 2), cex.axis=0.75, las=2)
axis(2, c(-.5, 0, .5, 1, 1.5, 2, 2.5), las=2)
lines(c_sequence, A$b0, col = cols[1])
lines(c_sequence, A$b1, col = cols[2])
lines(c_sequence, A$b2, col = cols[3])
lines(c_sequence, A$b3, col = cols[4])
legend("topright", c("b0", "b1", "b2", "b3"), 
       col = cols[1:4], lty = rep(1,4), bty = "n")
title(xlab = "Added constant")
title(ylab = "Parameter value")

This time both the intercept $\beta_0$ and the main effects $\beta_1, \beta_2$ are a function of c, while the interaction effect $\beta_3$ is constant. At this point the best explanation is simply to go through the algebra, which explains these results exactly. We do this in the next section.

Deriving function for all effects

We plug in the definition of centering in the population regression model we introduced at the very beginning of this blogpost. This gives us every parameter as a function of two things: (1) the parameters in the original model and (2) the added constant. Above we added the same constant to both predictors. Here we consider the general case where the constants can differ.

Our original (unaltered) model is given by:

\[\begin{aligned} \mathbb{E}[Y] &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1X_2 \end{aligned}\]

Now we plug in the predictors with added constants $c_1, c_2$, multiply out, and rearrange:

\[\begin{aligned} \mathbb{E}[Y] &= \beta_0^\ast + \beta_1^\ast (X_1 + c_1) + \beta_2^\ast (X_2 + c_2) + \beta_3^\ast (X_1 + c_1) (X_2 + c_2) \\ & = \beta_0^\ast + \beta_1^\ast X_1 + \beta_1^\ast c_1 + \beta_2^\ast X_2 + \beta_2^\ast c_2 + \beta_3^\ast X_1X_2 + \beta_3^\ast X_1 c_2 + \beta_3^\ast c_1X_2 + \beta_3^\ast c_1c_2 \\ &= (\beta_0^\ast + \beta_1^\ast c_1 + \beta_2^\ast c_2 + \beta_3^\ast c_1c_2) + (\beta_1^\ast + \beta_3^\ast c_2)X_1 + (\beta_2^\ast + \beta_3^\ast c_1)X_2 + \beta_3^\ast X_1X_2 \end{aligned}\]

Now if we equate the respective intercept and slope terms we get:

\[\beta_0 = \beta_0^\ast + \beta_1^\ast c_1 + \beta_2^\ast c_2 + \beta_3^\ast c_1c_2\] \[\beta_1 = \beta_1^\ast + \beta_3^\ast c_2\] \[\beta_2 = \beta_2^\ast + \beta_3^\ast c_1\]

and

\[\beta_3 = \beta_3^\ast\]

Now we solve for the parameters $\beta_0^\ast, \beta_1^\ast, \beta_2^\ast, \beta_3^\ast$ from the models with constants added to the predictors.

Because we know $\beta_3 = \beta_3^\ast$ we can write $\beta_2 = \beta_2^\ast + \beta_3 c_1$ and can solve

\[\beta_2^\ast = \beta_2 - \beta_3 c_1\]

The same goes for $\beta_1^\ast$ so we have

\[\beta_1^\ast = \beta_1 - \beta_3 c_2\]

Finally, to obtain a formula for $\beta_0^\ast$ we plug the just obtained expressions for $\beta_1^\ast$, $\beta_2^\ast$ and $\beta_3^\ast$ into

\[\beta_0 = \beta_0^\ast + \beta_1^\ast c_1 + \beta_2^\ast c_2 + \beta_3^\ast c_1c_2\]

and get

\[\begin{aligned} \beta_0 &= \beta_0^\ast + (\beta_1 - \beta_3 c_2)c_1 + (\beta_2 - \beta_3 c_1)c_2 + \beta_3 c_1c_2 \\ &= \beta_0^\ast + \beta_1 c_1 - \beta_3 c_2 c_1 + \beta_2 c_2 - \beta_3 c_2 c_1 + \beta_3 c_1c_2 \\ &= \beta_0^\ast + \beta_1 c_1 + \beta_2 c_2 - \beta_3 c_1c_2 \end{aligned}\]

and can solve for $\beta_0^\ast$:

\[\beta_0^\ast = \beta_0 - \beta_1 c_1 - \beta_2 c_2 + \beta_3 c_1c_2\]

Let’s check whether those formulas predict the parameter changes as a function of c in the numerical experiment above.

lm_obj <- lm(y ~ x1 * x2) # Reference model (no constant added)
b0 <- lm_obj$coefficients[1]
b1 <- lm_obj$coefficients[2]
b2 <- lm_obj$coefficients[3]
b3 <- lm_obj$coefficients[4]

B <- A # Storage for predicted parameters

for(i in 1:25) {
  
  c <- c_sequence[i]
  
  B$b0[i] <- b0 - b1*c - b2*c + b3*c*c
  B$b1[i] <- b1 - b3*c
  B$b2[i] <- b2 - b3*c
  B$b3[i] <- b3
  
}

We plot the computed parameters by the derived expressions as points on the empirical results from the numerical experiments above

plot.new()
plot.window(xlim=range(c_sequence), ylim=c(-.5, 2.5))
axis(1, round(c_sequence, 2), cex.axis=0.75, las=2)
axis(2, c(-.5, 0, .5, 1, 1.5, 2, 2.5), las=2)
lines(c_sequence, A$b0, col = cols[1])
lines(c_sequence, A$b1, col = cols[2])
lines(c_sequence, A$b2, col = cols[3])
lines(c_sequence, A$b3, col = cols[4])
legend("topright", c("b0", "b1", "b2", "b3"), 
       col = cols[1:4], lty = rep(1,4), bty = "n")

# Plot predictions
points(c_sequence, B$b0, col = cols[1])
points(c_sequence, B$b1, col = cols[2])
points(c_sequence, B$b2, col = cols[3])
points(c_sequence, B$b3, col = cols[4])
title(xlab = "Added constant")
title(ylab = "Parameter value")

and they match the numerical results exactly.

We see that the derived expressions explain exactly how parameters change as a function of the parameters of the reference model and the added constants.

If we set $\beta_3 = 0$, we get the same derivation for the regression model without interaction term. We find that $\beta_1^* = \beta_1$, $\beta_2^* = \beta_2$, and $\beta_0^* = \beta_0 - \beta_1 c_1 - \beta_2 c_2$.
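In the numerical experiments we always added the same constant to both predictors. As a further check, the sketch below applies the derived expressions with two different constants $c_1 \neq c_2$ and compares them with a refitted model. The helper shift_coefs() and the constants $-0.7$ and $1.3$ are arbitrary choices made here for illustration.

```r
set.seed(1)
n_obs <- 10000
x1 <- rnorm(n_obs, mean = 1, sd = 1)
x2 <- rnorm(n_obs, mean = 1, sd = 1)
y  <- 1 + .3 * x1 + .2 * x2 + .2 * x1 * x2 + rnorm(n_obs)

# Hypothetical helper implementing the derived expressions;
# b = (b0, b1, b2, b3) are the parameters of the reference model
shift_coefs <- function(b, c1, c2) {
  c(b0 = b[[1]] - b[[2]] * c1 - b[[3]] * c2 + b[[4]] * c1 * c2,
    b1 = b[[2]] - b[[4]] * c2,
    b2 = b[[3]] - b[[4]] * c1,
    b3 = b[[4]])
}

c1 <- -0.7; c2 <- 1.3                 # arbitrary, different constants
b_ref <- coef(lm(y ~ x1 * x2))        # reference model (no constant added)

x1_s <- x1 + c1
x2_s <- x2 + c2
b_shift <- coef(lm(y ~ x1_s * x2_s))  # refitted model with shifted predictors

print(all.equal(unname(shift_coefs(b_ref, c1, c2)), unname(b_shift)))
```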

Deconstructing 'Measurement error and the replication crisis'

Thu, 16 Feb 2017 10:00:00 +0000

Yesterday, I read ‘Measurement error and the replication crisis’ by Eric Loken and Andrew Gelman, which left me puzzled. The first part of the paper consists of general statements about measurement error. The second part consists of the claim that in the presence of measurement error, we overestimate the true effect when having a small sample size. This sounded wrong enough to ask the authors for their simulation code and spend a couple of hours to figure out what they did in their paper. I am offering a short and a long version.

Edit Feb 17th: After a nice email conversation with the authors, I now know that they do make their general argument only under the condition of selecting on significance. Their result then trivially follows from the increased variance of the sampling distribution due to adding ‘measurement error’ (see section (3) below). My source of confusion was that they talk about selection on significance in the paper, but then do not select on significance in the two scatter plots, and incorrectly state in the figure title, that they do. The conclusions of this blog post are still valid when making the assumptions in (1), so I leave it online in case somebody finds (parts of) it interesting.

The Short Version

My conclusion is that the authors show the following: If an estimator is biased (here by the presence of measurement error), then the proportion of estimates that overestimate the true effect depends on the variance of the sampling distribution (which depends on $N$). While this is an interesting insight, the authors do not say this clearly anywhere in the paper. Instead, they use formulations that suggest that they refer to the expected value of the estimator, which does not depend on the sample size. To make things worse, they plot the estimates in a way that suggests that the variance of the estimators is equal for N = 50 and N = 3000 and that the effect is driven by a difference in expected value, while the reverse is true.
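This claim can be illustrated with a small simulation. Note that this is my own sketch and not the authors' code: the true correlation r = 0.15, the error variance k = 0.5, and the number of replications are assumptions I picked for illustration, and sim_cor() is a made-up helper.

```r
set.seed(1)
r     <- 0.15  # true correlation (assumed value)
k     <- 0.5   # variance of the added measurement error (assumed value)
n_rep <- 500   # number of simulated data sets per sample size

# Draw N pairs with correlation r, add independent noise to both
# variables, and return the sample correlation of the noisy variables
sim_cor <- function(N) {
  x1 <- rnorm(N)
  x2 <- r * x1 + sqrt(1 - r^2) * rnorm(N)
  cor(x1 + rnorm(N, sd = sqrt(k)), x2 + rnorm(N, sd = sqrt(k)))
}

est_50   <- replicate(n_rep, sim_cor(50))
est_3000 <- replicate(n_rep, sim_cor(3000))

# Average estimate: attenuated for both sample sizes
print(round(c(N50 = mean(est_50), N3000 = mean(est_3000)), 3))

# Proportion of estimates larger than the true effect
print(c(N50 = mean(est_50 > r), N3000 = mean(est_3000 > r)))
```

In this setup the average estimate is similarly attenuated for both sample sizes, while the proportion of estimates above the true value is much larger for N = 50, because the sampling distribution is much wider.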

The Long Version

I try to make an argument for my claims in the ‘short version’ above in 6 steps. (1) We make clear what the claim is the authors make, (2) we define our terminology, (3) we investigate what adding measurement error does on the population level, (4) we see how this influences the characteristics of estimators based on different sample sizes, (5) we summarize our results and (6) get back to the paper.

(1) The exact claim

The authors write ‘In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance.’ (p. 584/585). From this we can deduce that the authors claim that ‘In a high noise setting, the presence of measurement error and selection on statistical significance leads to an increase in coefficient estimates’. However, the authors do not select on statistical significance in their simulation, hence we also drop this condition and arrive at the claim ‘In a high noise setting, the presence of measurement error leads to an increase in coefficient estimates’.

What this statement means is unclear to me. Under the reasonable assumption that the authors did not make a fundamental mistake, the rest of this blogpost is about finding out what the authors could have meant.

(2) Terminology (for reference)

In the paper, ‘measurement error’, ‘noise’ and ‘variance’ are used interchangeably. Here, with variances we refer to the variances of the dimensions of the bivariate Gaussian distribution, if not stated otherwise. With measurement error we mean another bivariate Gaussian distribution with zero covariance. By a noisy setting, we refer to a situation with a low signal to noise ratio. This is defined relative to another setting, which is less noisy. The signal to noise ratio is a function of $N$ and is related to the variance of the sampling distribution of the estimator. All these things will become clear in sections (3) and (4).

(3) What does ‘adding measurement error’ mean on the population level?

In order to evaluate the above claim with respect to the simulation setup of the authors, we need to know the simulation setup. Fortunately, the authors provided the code in a quick and friendly email.

The authors consider the problem of estimating the covariance of a bivariate Gaussian distribution from a finite number of observations. The bivariate Gaussian distribution has the density

\[f(x_1, x_2) = \frac{1}{\sqrt{(2 \pi)^2 | \mathbf{ \Sigma } | }} \exp \bigl\{ - \frac{1}{2} (x - \mu)^{\top} \mathbf{ \Sigma }^{-1} (x - \mu) \bigr\},\]

where in our case the covariance $cov(x_1, x_2) = r > 0$ is some positive value, so the covariance matrix $\Sigma$ has entries:

\[\Sigma = \begin{bmatrix} 1 & r \\[0.3em] r & 1 \end{bmatrix}\]

Note that if we scale both dimensions of the Gaussian to $\mu_1 = \mu_2 = 0$ and $\sigma_1 = \sigma_2 = 1$ the correlation coefficient is equal to the coefficient of the regression of $x_1$ on $x_2$ or vice versa. Thus all results obtained here also extend to the regression coefficient that is referred to in the paper.

Now the authors ‘add measurement error’ to the two variables, which consists of independent Gaussian noise with a variance $k > 0$, where $k$ is a constant. Notice that these two noise variables can also be described by a bivariate Gaussian with covariance matrix $\Sigma^{ME}$:

\[\Sigma^{ME} = \begin{bmatrix} k & 0 \\[0.3em] 0 & k \end{bmatrix}\]

Notice that adding ‘measurement error’ as done by the authors is the same as adding these two Gaussians. Addition is a linear transformation and hence the resulting distribution is again a bivariate Gaussian distribution. Indeed, it turns out that the covariance matrix $\Sigma^A$ of the resulting bivariate Gaussian is the sum of the covariance matrices $\Sigma$ and $\Sigma^{ME}$ of the two bivariate Gaussians:

\[\Sigma^A = \begin{bmatrix} 1 & r \\[0.3em] r & 1 \end{bmatrix} + \begin{bmatrix} k & 0 \\[0.3em] 0 & k \end{bmatrix} = \begin{bmatrix} k + 1 & r \\[0.3em] r & k + 1 \end{bmatrix}\]

Now, if we renormalize the variances to get back to a correlation matrix it becomes obvious that adding ‘measurement error’ has to decrease the absolute value of the covariance:

\[\Sigma^{A_{norm}} = \begin{bmatrix} 1 & \frac{r}{k + 1} \\[0.3em] \frac{r}{k + 1} & 1 \end{bmatrix}\]

Note that $k > 0$ and hence $\frac{r}{k + 1} < r$ and hence the absolute value of the covariance is smaller in $\Sigma^{A_{norm}}$ than in $\Sigma$ in the population.
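The attenuation derived above can be checked numerically. The following is a minimal sketch, not the authors' simulation code: the values r = 0.15 and k = 0.5 are arbitrary assumptions chosen for illustration, and the sample is made very large so that the estimates are close to the population values.

```r
set.seed(1)
r <- 0.15   # assumed 'true' covariance (illustrative value)
k <- 0.5    # assumed measurement error variance (illustrative value)
n <- 1e6    # very large sample to approximate the population

# Draw from the 'true' bivariate Gaussian with unit variances and covariance r
x1 <- rnorm(n)
x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)

# Add independent Gaussian measurement error with variance k to both variables
y1 <- x1 + rnorm(n, sd = sqrt(k))
y2 <- x2 + rnorm(n, sd = sqrt(k))

cov_noisy <- cov(y1, y2)  # stays close to r
cor_noisy <- cor(y1, y2)  # shrinks to about r / (k + 1)

round(c(covariance = cov_noisy,
        correlation = cor_noisy,
        predicted = r / (k + 1)), 3)
```

The covariance is unchanged by the added noise, while the correlation (the covariance after renormalizing the variances to 1) shrinks towards zero.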

(4) Properties of the Estimator

We now consider the estimate $\hat \sigma_{1,2}$ for the covariance between $x_1$ and $x_2$ in the bivariate Gaussian with covariance matrix $\Sigma^{A_{norm}}$ which is ‘corrupted’ by measurement error. We obtain $\hat \sigma_{1,2}$ via the least squares estimator, which is an unbiased estimator for $\frac{r}{k + 1}$.

What does this mean? This means that by the Central limit theorem, the sampling distribution will be a Gaussian distribution that is centered on the true coefficient, which is $\frac{r}{k + 1}$. Thus, if we take many samples of size $N$ and compute a coefficient estimate on each of them, the mean coefficient will be equal to $\frac{r}{k + 1}$:

\[\mathbb{E} [\hat \sigma_{1,2}] = \lim_{S \rightarrow \infty} \frac{1}{S} \sum_{i=1}^{S} \hat \sigma_{1,2}^i = \frac{r}{k + 1}\]

From the fact that the Gaussian density is symmetric and centered on the true effect, it follows that $\hat \sigma_{1,2}$ will equally often under- and overestimate the true effect $\frac{r}{k + 1}$. It is important to stress that this is true, irrespective of the variance of the sampling distribution (which depends on $N$). We illustrate this in the following Figure which shows the empirical sampling distributions from the simulation of the authors:

The solid black line indicates the density estimate of the empirical sampling distribution of the coefficient estimates in the low noise (N = 3000) case. The solid red line indicates the density of the empirical sampling distribution in the high noise (N = 50) case. The dashed black and red lines indicate the arithmetic means of the corresponding sampling distributions. The green dashed line indicates the true coefficient of the bivariate Gaussian with added measurement error. Now, as predicted from the fact that $\hat \sigma_{1,2}$ is an unbiased estimator independent of $N$, we see that the mean parameter estimates in both the low and high noise settings (black/red dashed lines) are close to the true coefficient $\frac{r}{k + 1}$ (dashed green line).
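A sampling distribution of this kind can be approximated with a few lines of R. This is a sketch under stated assumptions rather than the authors' code: r = 0.15, k = 0.5 and S = 2000 repetitions are illustrative choices, `sim_estimates()` is a hypothetical helper, and the sample correlation is used as the coefficient estimate (its small-sample bias is negligible at these settings).

```r
set.seed(1)
r <- 0.15   # assumed true covariance (illustrative)
k <- 0.5    # assumed measurement error variance (illustrative)

# Hypothetical helper: S coefficient estimates from samples of size N
sim_estimates <- function(N, S = 2000) {
  replicate(S, {
    x1 <- rnorm(N)
    x2 <- r * x1 + sqrt(1 - r^2) * rnorm(N)
    # correlation between the two variables after adding measurement error
    cor(x1 + rnorm(N, sd = sqrt(k)), x2 + rnorm(N, sd = sqrt(k)))
  })
}

est_high <- sim_estimates(N = 50)    # high noise setting
est_low  <- sim_estimates(N = 3000)  # low noise setting

# Both means should be close to r / (k + 1); only the spread differs
round(c(mean_high = mean(est_high), mean_low = mean(est_low),
        target = r / (k + 1)), 3)
round(c(sd_high = sd(est_high), sd_low = sd(est_low)), 3)
```

Plotting `density(est_high)` and `density(est_low)` in one figure gives two curves centered on roughly the same value with very different widths.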

Before moving on, we define $\mathcal{P}^\uparrow \in [0,1]$ as the proportion of coefficient estimates that are larger than the true effect under consideration (first $\frac{r}{k + 1}$, later $r$) and hence overestimate it. $\mathcal{P}^\uparrow_H$ refers to that proportion in the high noise (small $N$) setting, $\mathcal{P}^\uparrow_L$ refers to that proportion in the low noise (large $N$) setting.

Now, the second important observation is that for both noise settings we have $\mathcal{P}^\uparrow_H = \mathcal{P}^\uparrow_L = \frac{1}{2}$, which implies that we equally often under- and overestimate the true effect $\frac{r}{k + 1}$. Note that another way of saying this is that the area under the curve left of the green line is equal to the area under the curve right of the green line, for both sampling distributions.

We now make the crucial step by considering $\hat \sigma_{1,2}$ not as an estimate for the covariance $\frac{r}{k + 1}$ in $\Sigma^{A_{norm}}$, but for the covariance $r$ of the ‘true’ bivariate Gaussian without added measurement error with covariance matrix $\Sigma$. We know that $\hat \sigma_{1,2}$ is an unbiased estimator for $\frac{r}{k + 1}$ and we know $\frac{r}{k + 1} < r$. From this follows that $\hat{\sigma}_{1,2}$ is a biased estimator for $r$. Specifically, the estimator is biased downwards.

We again look at the proportions of coefficient estimates that under- and overestimate the true effect $r$ (the dashed blue line in the figure). We first consider the low noise case: the first observation is that we overestimate $r$ less often than we overestimated $\frac{r}{k + 1}$, which implies $\mathcal{P}^\uparrow_L < \frac{1}{2}$. Again, this is the same as saying that the area under the curve right of the blue line is smaller than the area under the curve left of the blue line.

For the high noise case the exact same is true, i.e. $\mathcal{P}^\uparrow_H < \frac{1}{2}$. Let’s define $q := \frac{\mathcal{P}^\uparrow_H}{\mathcal{P}^\uparrow_L}$. What we do have is that $\mathcal{P}^\uparrow_H > \mathcal{P}^\uparrow_L$ and hence $q > 1$. This means that in the presence of measurement error, we overestimate less often than we underestimate in all settings; however, we overestimate relatively more often in a high noise (small $N$) setting compared to a low noise (large $N$) setting. Let’s let this sink in for a moment and then move on to the summary:

(5) Summary

What have we found? We found that if our estimator is biased downwards (here by measurement error), then different sample sizes (and hence different variances of the sampling distribution) lead to different proportions of coefficient estimates that overestimate the true effect.

However, it is important to stress: when keeping $N$ constant and introducing measurement error, the proportion of overestimating estimates decreases compared to the situation without measurement error. This is because the whole sampling distribution is shifted towards zero in the presence of measurement error (the blue line is shifted to the position of the green line in the Figure).

The only thing that is increasing is $q$, which means that in the presence of measurement error in a high noise setting (small $N$) we relatively overestimate more than in a low noise setting (large $N$). What determines $q$? The larger the difference between the variances of two sampling distributions, the larger $q$. The more we shift the sampling distribution towards zero (by adding measurement error), the larger $q$.
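These statements about the proportions of overestimates and about $q$ can be made concrete with a normal approximation to the sampling distribution. The sketch below assumes r = 0.15 and uses the standard large-sample approximation $(1 - \rho^2)/\sqrt{N}$ for the standard deviation of a sample correlation with population value $\rho = \frac{r}{k + 1}$; the function `prop_over()` and all numeric values are illustrative assumptions, not taken from the paper.

```r
r <- 0.15  # assumed true covariance without measurement error (illustrative)

# Hypothetical helper: approximate proportion of estimates larger than r,
# given sample size N and measurement error variance k
prop_over <- function(N, k) {
  rho <- r / (k + 1)               # center of the sampling distribution
  se  <- (1 - rho^2) / sqrt(N)     # large-sample standard deviation
  1 - pnorm(r, mean = rho, sd = se)
}

p_high <- prop_over(N = 50,   k = 0.5)  # high noise setting
p_low  <- prop_over(N = 3000, k = 0.5)  # low noise setting
q <- p_high / p_low

# More measurement error (larger k) shifts the distribution further towards zero
q_more_error <- prop_over(N = 50, k = 1) / prop_over(N = 3000, k = 1)

c(p_high = p_high, p_low = p_low, q = q, q_more_error = q_more_error)
```

Both proportions stay below one half, the high noise proportion is larger than the low noise one, and increasing $k$ increases $q$. With $k = 0$ the function returns exactly one half for any $N$, which corresponds to the unbiased case.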

(6) Back to the Paper

I think the results stated in (5) are pretty far away from the claim in the paper, which was ‘In a high noise setting, the presence of measurement error leads to an increase in coefficient estimates’. This statement rather suggests that introducing measurement error increases the expected value of the sampling distribution (moving the blue line to the right instead of to the left) which is - as we have seen - incorrect. This false suggestion is strengthened by the scaling of the figures. We illustrate this here, by plotting the figure as shown in the paper (top row) and with equal coordinate systems (bottom row).

The top row suggests that the difference between the low/high noise setting is because the whole cloud is ‘shifted’ downwards in the low noise setting. This would mean that the sampling distributions are shifted differently depending on the noise setting (sample size) when adding measurement error. On the other hand, when plotting the data in the same coordinate system, it is clear that the expected values do not change and that the effect is driven by the differing variances of the estimator.

And one more thing: in the right panel in the figure of the paper the authors plot $\mathcal{P}^\uparrow$ as a function of $N$. Note that from the discussion in (4) it follows that this value can never be larger than $\frac{1}{2}$ as long as the estimator is unbiased or biased downwards. So there must have been some mistake.

Conclusion

This was a fun opportunity to do some statistics detective work. However, the lack of clarity does potentially also do quite some harm by confusing the reader about important concepts. There is of course also the possibility that I just fully misunderstood their paper. In that case I hope the reader will point to my mistakes.

The code to exactly reproduce the above figures can be found here.

I would like to thank Fabian Dablander and Peter Edelsbrunner for helpful comments on this blogpost. In addition, I would like to thank Oisín Ryan and Joris Broere for an interesting discussion on a train ride from Eindhoven to Utrecht yesterday, and I apologize to about 15 anonymous Dutch travelers because they had to endure a heated statistical debate.

I am looking forward to comments, complaints and corrections.

Predictability in Network Models

Tue, 01 Nov 2016 10:00:00 +0000

Network models have become a popular way to abstract complex systems and gain insights into relational patterns among observed variables in many areas of science. The majority of these applications focus on analyzing the structure of the network. However, if the network is not directly observed (Alice and Bob are friends) but estimated from data (there is a relation between smoking and cancer), we can analyze - in addition to the network structure - the predictability of the nodes in the network. That is, we would like to know: how well can a given node in the network be predicted by all remaining nodes in the network?

Predictability is interesting for several reasons:

It gives us an idea of how practically relevant edges are: if node A is connected to many other nodes but these explain, let’s say, only 1% of its variance, how interesting are the edges connected to A?
We get an indication of how to design an intervention in order to achieve a change in a certain set of nodes and we can estimate how efficient the intervention will be.
It tells us to which extent different parts of the network are self-determined or determined by other factors that are not included in the network.

In this blogpost, we use the R-package mgm to estimate a network model and compute node wise predictability measures for a dataset on Post Traumatic Stress Disorder (PTSD) symptoms of Chinese earthquake victims. We visualize the network model and predictability using the qgraph package and discuss how the combination of network model and node wise predictability can be used to design effective interventions on the symptom network.

Load Data

We load the data which the authors made openly available:

data <- read.csv('http://psychosystems.org/wp-content/uploads/2014/10/Wenchuan.csv')
data <- na.omit(data)
data <- as.matrix(data)
p <- ncol(data)
dim(data)

## [1] 344  17

The dataset contains complete responses to 17 PTSD symptoms of 344 individuals. The answer categories for the intensity of symptoms range from 1 ‘not at all’ to 5 ‘extremely’. The exact wording of all symptoms is in the paper of McNally and colleagues.

Estimate Network Model

We estimate a Mixed Graphical Model (MGM), where we treat all variables as continuous-Gaussian variables. Hence we set the type of all variables to type = 'g' and the number of categories for each variable to 1, which is the default for continuous variables level = 1:

library(mgm)

set.seed(1)
fit_obj <- mgm(data = data, 
               type = rep('g', p),
               level = rep(1, p),
               lambdaSel = 'CV',
               ruleReg = 'OR', 
               pbar = FALSE)

## Note that the sign of parameter estimates is stored separately; see ?mgm

For more info on how to estimate Mixed Graphical Models using the mgm package see this previous post or the mgm paper.

Compute Predictability of Nodes

After estimating the network model we are ready to compute the predictability for each node. Node wise predictability (or error) can be easily computed, because the graph is estimated by taking each node in turn and regressing it on all other nodes. As a measure for predictability we pick the proportion of explained variance, as it is straightforward to interpret: 0 means the node at hand is not explained at all by other nodes in the network, 1 means perfect prediction. We centered all variables before estimation in order to remove any influence of the intercepts. For a detailed description of how to compute predictions and to choose predictability measures, have a look at this paper. In case there are additional variable types (e.g. categorical) in the network, we can choose an appropriate measure for these variables (e.g. % correct classification, for details see ?predict.mgm).

pred_obj <- predict(object = fit_obj, 
                    data = data, 
                    errorCon = 'R2')

pred_obj$error

##     Variable    R2
## 1  intrusion 0.637
## 2     dreams 0.650
## 3      flash 0.602
## 4      upset 0.637
## 5    physior 0.627
## 6    avoidth 0.686
## 7   avoidact 0.681
## 8    amnesia 0.410
## 9    lossint 0.521
## 10   distant 0.498
## 11      numb 0.458
## 12    future 0.543
## 13     sleep 0.564
## 14     anger 0.561
## 15    concen 0.630
## 16     hyper 0.675
## 17   startle 0.621

We calculated the percentage of explained variance ($R^2$) for each of the nodes in the network. Next, we visualize the estimated network and discuss its structure in relation to explained variance.
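To make the idea of node wise explained variance concrete, here is a minimal sketch that computes $R^2$ by hand on simulated data. It is not what mgm does internally: mgm uses regularized regressions, whereas this sketch swaps in ordinary least squares via `lm()`, and the data, the five variable names and the effect size 0.7 are all hypothetical.

```r
set.seed(1)
n <- 500   # hypothetical number of cases
p <- 5     # hypothetical number of nodes

# Simulate p correlated variables that share one common component
common <- rnorm(n)
X <- sapply(1:p, function(j) 0.7 * common + rnorm(n))
colnames(X) <- paste0("V", 1:p)
X <- scale(X, center = TRUE, scale = FALSE)  # center, as in the text above

# For each node: regress it on all other nodes and compute explained variance
r2 <- sapply(1:p, function(j) {
  fit <- lm(X[, j] ~ X[, -j])
  1 - sum(residuals(fit)^2) / sum((X[, j] - mean(X[, j]))^2)
})
names(r2) <- colnames(X)

round(r2, 3)
```

Each entry is the proportion of variance of one node that is explained by the remaining nodes, which is the quantity reported in the `R2` column above (there obtained from the regularized model).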

Visualize Network & Predictability

We provide the estimated weighted adjacency matrix and the node wise predictability measures as arguments to qgraph() to obtain a network visualization including the predictability measure $R^2$:

library(qgraph)

qgraph(fit_obj$pairwise$wadj, # weighted adjacency matrix as input
       layout = 'spring', 
       pie = pred_obj$error[,2], # provide errors as input
       pieColor = rep('#377EB8',p),
       edge.color = fit_obj$pairwise$edgecolor,
       labels = colnames(data))

The mgm-package also allows computing predictability for higher-order (or moderated) MGMs and for (mixed) Vector Autoregressive (VAR) models. For details see this paper. For an early paper looking into the predictability of symptoms of different psychological disorders, see this paper.

Graphical Analysis of German Parliament Voting Pattern

Wed, 18 May 2016 10:00:00 +0000

We use network visualizations to look into the voting patterns in the current German parliament. I downloaded the data here and all figures can be reproduced using the R code available on Github.

Missing values, invalid votes, abstention from voting and not showing up for the vote were coded as (-1), such that all other responses are a yes (1) or no (2) vote. We use Pearson correlation as a measure of voting similarity, and voting behavior coded as (-1) is regarded as noise in the dataset. 36 of the 659 members of parliament were removed from the data because more than 50% of their votes were coded as (-1). The reason was that they either joined or left the parliament during the analyzed time period.

Disclaimer: note that only for a fraction of the bills passed in the German parliament votes are recorded (and used here) and that relations between single members of parliament might be artifacts of the noise-coding. Moreover, the data is quite scarce (136 bills). Therefore we should not draw any strong conclusions from this coarse-grained analysis.

Voting Pattern Amongst Members of Parliament

We first compute the correlations between the voting behavior of all pairs of members of parliament, which gives us a 623 x 623 correlation matrix. We then visualize this correlation matrix using the force-directed Fruchterman Reingold algorithm as implemented in the qgraph package. This algorithm puts nodes (politicians) on the plane such that edges (connections) have comparable length and that edges are crossing as little as possible.
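The correlation step can be sketched on a small simulated vote matrix. Everything here is hypothetical: six members in two made-up parties, the probabilities of deviating from the party line (5%) and of a (-1) entry (2%), and the layout of the matrix (bills in rows, members in columns). Only the coding of votes (1 = yes, 2 = no, -1 = noise) and the number of bills (136) follow the description above.

```r
set.seed(1)
n_bills <- 136
n_members <- 6
party <- rep(c("A", "B"), each = 3)  # hypothetical party labels

# Each party has its own 'party line' per bill: 1 = yes, 2 = no
line_A <- sample(1:2, n_bills, replace = TRUE)
line_B <- sample(1:2, n_bills, replace = TRUE)

votes <- sapply(1:n_members, function(m) {
  line <- if (party[m] == "A") line_A else line_B
  deviate <- runif(n_bills) < 0.05       # occasionally vote against the line
  v <- ifelse(deviate, 3 - line, line)   # 3 - line flips 1 <-> 2
  v[runif(n_bills) < 0.02] <- -1         # absent / invalid / abstained
  v
})
colnames(votes) <- paste0(party, 1:n_members)

# Pearson correlations between the voting behavior of all pairs of members
cor_mat <- cor(votes)
round(cor_mat, 2)

# With qgraph installed, the matrix could be drawn with a force-directed layout:
# qgraph::qgraph(cor_mat, layout = "spring")
```

Members of the same simulated party end up with clearly positive correlations, while correlations across parties scatter around zero, which is the pattern that produces clusters in the force-directed layout.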

(For readers on R-Bloggers.com: click here for the original post with larger figures.)

Green edges indicate positive correlations (voter agreement) and red edges indicate negative correlations (voter disagreement). The width of the edges is proportional to the strength (absolute value) of the correlation. We see that the green party (B90/GRUENE) clusters together, as well as the left party (DIE LINKE). The third and biggest cluster consists of members of the two largest parties, the social democrats (SPD) and the conservatives (CDU/CSU). This is the structure we would expect intuitively, as social democrats and conservatives currently form the government in a grand coalition.

With some imagination, one could also identify a couple of subclusters in this large cluster. A detailed analysis of smaller clusters would be especially interesting if we had additional information about politicians. We could then see whether the cluster assignment computed from the voting behavior relates to these additional variables. For instance, politicians with close ties to the economy might vote together, irrespective of their party.

So far we assumed that we can adequately describe the voting pattern of the whole period from 26.11.2013 - 14.04.2016 with one graph. This implies that we assume that the relative voting behavior does not change over time. For example, this means that if members of parliament A and B agree on votes at the beginning of the period, they also agree throughout the rest of the period and do not start to disagree at some point. In the next section we check whether the voting behavior changes over time.

Voting Pattern Amongst Members of Parliament across Time

To make graphs comparable over different time points and to be able to see growing (dis-) agreement between parties, we arrange individual members of parliament in circles that correspond to their parties. We compute a time-varying graph by visualizing a Gaussian kernel smoothed (bandwidth = .1, time interval [0,1]) correlation matrix at 20 equally spaced time points. Details can be found in the code used to create all figures, which is available here. We then combine these 20 graphs into the following video:
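One way to obtain a kernel smoothed correlation at a given time point is to weight each bill by a Gaussian kernel centered on that time point and compute a weighted correlation. The sketch below illustrates this with `cov.wt()` from base R on simulated data; it is not the code used for the figures. The helper `kernel_cor()`, the two simulated members, and the interpretation of the bandwidth as the standard deviation of the kernel are assumptions.

```r
set.seed(1)
n_bills <- 200
time <- seq(0, 1, length.out = n_bills)  # bill dates rescaled to [0, 1]

# Two hypothetical members whose agreement fades over the period
a <- rnorm(n_bills)
b <- (1 - time) * a + rnorm(n_bills, sd = 0.5)
votes <- cbind(a = a, b = b)

# Hypothetical helper: correlation matrix with Gaussian kernel weights around t0
kernel_cor <- function(x, time, t0, bandwidth = 0.1) {
  w <- dnorm(time, mean = t0, sd = bandwidth)
  cov.wt(x, wt = w / sum(w), cor = TRUE)$cor
}

# Evaluate at 20 equally spaced time points, as in the text above
time_points <- seq(0, 1, length.out = 20)
cor_ab <- sapply(time_points, function(t0) kernel_cor(votes, time, t0)[1, 2])

round(cor_ab, 2)
```

The resulting sequence starts at a high correlation and decays towards zero. Computing the full weighted correlation matrix at each time point and plotting it yields one graph per time point.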

We see that right after the time the parliament was elected and the grand coalition was formed in November 2013, there is relatively high agreement between members of CDU/CSU and SPD. Within the next three years, however, the agreement decreases. With regard to the parties in the opposition, at the beginning of the period the green and the left party disagree to a similar degree with the grand coalition. Over time, however, it appears that the green party increasingly agrees with the grand coalition, while the left party agrees less and less with the CDU/CSU- and SPD-led government.

As the number of seats the parties have in the parliament differs widely, it is hard to read agreement within parties from the above graph. For instance, the circle of CDU/CSU seems to be filled with more and thicker green edges than the one of SPD; however, this could well be because there are simply more politicians (307 vs. 191) and hence more edges displayed. Therefore, we have a closer look at within-party agreement in the following graph:

Collapsed over time we see the members of the left party agree most with each other and the members of the social democratic party agree the least with each other. The largest changes in agreement appear in the green and left party: from late 2014 to mid 2015, members of the green party seem to agree less with each other than usual, while members of the left party seem to agree more with each other than usual.

Zoom in on small Group of Members of Parliament

While the analyses so far gave a comprehensive overview of the voting behavior amongst members of parliament, the graph is too large to see which node in the graph corresponds to which politician. In the following graph we zoom in on a random subset of 30 politicians and match the nodes to their names:

Note that correlations are bivariate measures and therefore the correlations in this smaller graph are the same as the ones in the larger graph above. We see the same overall structure as above, but now with names assigned to nodes. Again the members of the green party cluster together, but for instance Nicole Maisch votes more often together with Steffi Lempke than with the other displayed colleagues. We also see that for instance Steffen Kampeter and Christian Schmidt are both members of the conservative party, but are placed at quite distant locations in the graph (and indeed the correlation between their voting behavior is almost zero: -0.04).

Analogous to above, we now look into how voting agreement between the politicians in our subset changes over time by computing a time-varying graph as before:

We see that voting agreement changes substantially: for instance members of the opposition parties seem to agree less and less with the grand coalition until mid-2015 and then agree again more and more until the end of the period in early 2016. Some politicians seem to change their voting pattern quite dramatically: for example the voting behavior of conservative party member Heike Bremer strongly correlates with the voting behavior of most of her party colleagues in 2014, however in late 2015 and early 2016 the correlations are close to zero. Also, interestingly, conservative Steffen Kampeter tends to vote in the opposite direction from his conservative colleagues in early 2014, but then agrees more and more with them until the last recorded votes.

‘Unique’ Agreement between Members of Parliament

So far we looked into how the voting patterns of any pair of members of parliament correlate with each other. While this is an informative measure and gives a first overview of how politicians vote relative to each other, it is also a measure that is tricky to interpret. For instance two politicians of a party might always vote together because they always align their votes with their common mentor in the party. Or because there is pressure from the whole party to vote for a bill together. Or because they are both members of a specific think tank within the parliament, …

An interesting alternative measure is conditional correlation, which is the correlation between any two members of parliament, after controlling for all other members of parliament. In case of a conditional correlation between two members of parliament there are still many possible explanations (e.g. both might be influenced by some person outside the parliament), however, we are sure that this correlation cannot be explained by the voting pattern of any other member of parliament. We compute this conditional correlation graph and visualize it using the same layout as in the corresponding correlation graph:
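The difference between correlation and conditional correlation can be illustrated with three simulated variables, where `a` and `b` are related only through a common influence `c`. This is a sketch on hypothetical data, not the computation used for the figure: it obtains conditional (partial) correlations by inverting the correlation matrix, which requires more observations than variables; with few recorded bills and many politicians some form of regularization would be needed.

```r
set.seed(1)
n <- 2000

# Hypothetical example: a and b both follow c, but not each other
c_common <- rnorm(n)
a <- c_common + rnorm(n)
b <- c_common + rnorm(n)
X <- cbind(a = a, b = b, c = c_common)

R <- cor(X)            # marginal correlations
P <- solve(R)          # inverse correlation (precision) matrix

# Partial correlations: negative standardized off-diagonal precision entries
pcor <- -cov2cor(P)
diag(pcor) <- 1

round(R, 2)
round(pcor, 2)
```

The marginal correlation between `a` and `b` is clearly positive, whereas their conditional correlation given `c` is close to zero: the agreement between `a` and `b` is fully explained by the third variable.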

It is apparent that there are fewer edges and that the remaining edges are weaker. Note that this is what we would expect in this dataset: in a parliament there is a general level of agreement within parties and also between parties, otherwise it would be difficult to pass bills. Therefore, we would expect that a substantial part of the correlation between the voting patterns of any two politicians can be explained by the voting patterns of other politicians. The strongest conditional correlation is the one between Nicole Gohlke and Norbert Mueller of the left party. For some reason these two politicians align their votes in a way that cannot be explained by the voting pattern of other politicians within and outside their party. Note here that

Concluding comments

It came as quite a surprise to me that the large majority of votes on bills in the German parliament are not recorded and hence not available to the public (please correct me if I missed something). While this is a major reason to interpret these data with caution, on the other hand the votes on bills that are recorded are the more controversial and therefore probably more interesting ones.

The graphs in this post were the first few obvious things I wanted to look into, but of course many more analyses are possible. I put the preprocessed data (no information lost, just everything in 3 linked files instead of hundreds) on Github alongside the code that produces the above figures. In case you have any comments, complaints or questions, please comment below!