Jonas Haslbeck, PhD Student Psych Methods
http://jmbh.github.io/
<h1>Regression with Interaction Terms - How Centering Predictors Influences Main Effects</h1>
<p>Centering predictors in a regression model with only main effects has no influence on the main effects. In contrast, in a regression model including interaction terms centering predictors <em>does</em> have an influence on the main effects. After getting confused by this, I read <a href="https://amstat.tandfonline.com/doi/pdf/10.1080/10691898.2011.11889620">this</a> nice paper by Afshartous &amp; Preston (2011) on the topic and played around with the examples in R. I summarize the resulting notes and code snippets in this blog post.</p>
<p>We give an explanation on two levels:</p>
<ol>
<li>By illustrating the issue with the simplest possible example</li>
<li>By showing in general how main effects are a function of the constants (e.g. means) that are subtracted from predictor variables</li>
</ol>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<h2 id="explanation-1-simplest-example">Explanation 1: Simplest example</h2>
<p>The simplest possible example to illustrate the issue is a regression model in which variable <script type="math/tex">Y</script> is a linear function of variables <script type="math/tex">X_1</script>, <script type="math/tex">X_2</script>, and their product <script type="math/tex">X_1X_2</script></p>
<script type="math/tex; mode=display">Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1X_2 + \epsilon,</script>
<p>where we set <script type="math/tex">\beta_0 = 1, \beta_1 = 0.3, \beta_2 = 0.2, \beta_3 = 0.2</script>, and <script type="math/tex">\epsilon \sim N(0, \sigma^2)</script> is Gaussian noise with mean zero and variance <script type="math/tex">\sigma^2</script>. We define the predictors <script type="math/tex">X_1, X_2</script> as Gaussians with means <script type="math/tex">\mu_{X_1} = \mu_{X_2} = 1</script> and variances <script type="math/tex">\sigma_{X_1}^{2}=\sigma_{X_2}^{2}=1</script>. This code samples <script type="math/tex">n = 10000</script> observations from this model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span><span class="w">
</span><span class="n">b</span><span class="m">0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="p">;</span><span class="w"> </span><span class="n">b</span><span class="m">1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">.3</span><span class="p">;</span><span class="w"> </span><span class="n">b</span><span class="m">2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">.2</span><span class="p">;</span><span class="w"> </span><span class="n">b</span><span class="m">3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">.2</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="m">2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">b</span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="m">1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="m">3</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<p><strong>Regression models with main effects</strong></p>
<p>We first verify that centering the predictors indeed does not affect the main effects. To do so, we fit the linear regression with only main effects, first with uncentered predictors</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">Call</span><span class="o">:</span><span class="w">
</span><span class="n">lm</span><span class="p">(</span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">Coefficients</span><span class="o">:</span><span class="w">
</span><span class="p">(</span><span class="n">Intercept</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="w">
</span><span class="m">0.8088</span><span class="w"> </span><span class="m">0.4983</span><span class="w"> </span><span class="m">0.4015</span><span class="w"> </span></code></pre></figure>
<p>and then with mean centered predictors</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="c1"># center predictors</span><span class="w">
</span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="p">)</span><span class="w">
</span><span class="n">Call</span><span class="o">:</span><span class="w">
</span><span class="n">lm</span><span class="p">(</span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="p">)</span><span class="w">
</span><span class="n">Coefficients</span><span class="o">:</span><span class="w">
</span><span class="p">(</span><span class="n">Intercept</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="w">
</span><span class="m">1.7036</span><span class="w"> </span><span class="m">0.4983</span><span class="w"> </span><span class="m">0.4015</span><span class="w"> </span></code></pre></figure>
<p>The parameter estimates of the regression with uncentered predictors are <script type="math/tex">\hat\beta_1 \approx 0.50</script> and <script type="math/tex">\hat\beta_2 \approx 0.40</script>. The estimates of the regression with <em>centered</em> predictors are <script type="math/tex">\hat\beta_1^* \approx 0.50</script> and <script type="math/tex">\hat\beta_2^* \approx 0.40</script> (we denote estimates from regressions with centered predictors with an asterisk). And indeed, <script type="math/tex">\hat\beta_1 = \hat\beta_1^*</script> and <script type="math/tex">\hat\beta_2 = \hat\beta_2^*</script>.</p>
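<p>This equality can be checked directly in R. The following is a small standalone sketch (it re-simulates the data from above, so the variable names mirror the code above):</p>

```r
# Re-simulate the data from above so this snippet runs standalone
set.seed(1)
n <- 10000
x1 <- rnorm(n, mean = 1, sd = 1)
x2 <- rnorm(n, mean = 1, sd = 1)
y <- 1 + .3 * x1 + .2 * x2 + .2 * x1 * x2 + rnorm(n)

fit_u <- lm(y ~ x1 + x2)                              # uncentered predictors
fit_c <- lm(y ~ I(x1 - mean(x1)) + I(x2 - mean(x2)))  # centered predictors

# The slope estimates agree up to floating-point precision
all.equal(unname(coef(fit_u)[2:3]), unname(coef(fit_c)[2:3]))
```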
<p><strong>Regression models with main effects + interaction</strong></p>
<p>We now include the interaction term and show that centering the predictors <em>does</em> affect the main effects. We first fit the regression model without centering</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">Call</span><span class="o">:</span><span class="w">
</span><span class="n">lm</span><span class="p">(</span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">Coefficients</span><span class="o">:</span><span class="w">
</span><span class="p">(</span><span class="n">Intercept</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="o">:</span><span class="n">x</span><span class="m">2</span><span class="w">
</span><span class="m">1.0183</span><span class="w"> </span><span class="m">0.2883</span><span class="w"> </span><span class="m">0.1898</span><span class="w"> </span><span class="m">0.2111</span><span class="w"> </span></code></pre></figure>
<p>and then with centering</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="p">)</span><span class="w">
</span><span class="n">Call</span><span class="o">:</span><span class="w">
</span><span class="n">lm</span><span class="p">(</span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="p">)</span><span class="w">
</span><span class="n">Coefficients</span><span class="o">:</span><span class="w">
</span><span class="p">(</span><span class="n">Intercept</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="o">:</span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="w">
</span><span class="m">1.7026</span><span class="w"> </span><span class="m">0.4984</span><span class="w"> </span><span class="m">0.3995</span><span class="w"> </span><span class="m">0.2111</span><span class="w"> </span></code></pre></figure>
<p>We see that <script type="math/tex">\hat\beta_1 \approx 0.29</script> and <script type="math/tex">\hat\beta_2 \approx 0.19</script>, while <script type="math/tex">\hat\beta_1^* \approx 0.50</script> and <script type="math/tex">\hat\beta_2^* \approx 0.40</script>. While the two models have different parameters, they are statistically equivalent: the expected values of both models are the same. In empirical terms, this means that their coefficient of determination <script type="math/tex">R^2</script> is the same. The reader will be able to verify this in Explanation 2 below.</p>
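<p>The claim about <script type="math/tex">R^2</script> can already be verified here with a few lines of R (a standalone sketch that re-simulates the data from above):</p>

```r
# Re-simulate the data from above so this snippet runs standalone
set.seed(1)
n <- 10000
x1 <- rnorm(n, mean = 1, sd = 1)
x2 <- rnorm(n, mean = 1, sd = 1)
y <- 1 + .3 * x1 + .2 * x2 + .2 * x1 * x2 + rnorm(n)
x1_c <- x1 - mean(x1)
x2_c <- x2 - mean(x2)

r2_u <- summary(lm(y ~ x1 * x2))$r.squared      # uncentered
r2_c <- summary(lm(y ~ x1_c * x2_c))$r.squared  # centered

# Both interaction models explain the same variance
all.equal(r2_u, r2_c)
```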
<p>We make two observations:</p>
<ol>
<li>In the model with interaction terms, the main effects differ between the regressions with/without centering of predictors</li>
<li>When centering predictors, the main effects are the same in the model with/without the interaction term (up to small differences due to sampling variability)</li>
</ol>
<p><strong>Why does centering influence main effects in the presence of an interaction term?</strong></p>
<p>The reason is that in the model with the interaction term, the parameter <script type="math/tex">\beta_1</script> (uncentered predictors) is the main effect of <script type="math/tex">X_1</script> on <script type="math/tex">Y</script> if <script type="math/tex">X_2 = 0</script>, and the parameter <script type="math/tex">\beta_1^*</script> (centered predictors) is the main effect of <script type="math/tex">X_1</script> on <script type="math/tex">Y</script> if <script type="math/tex">X_2 = \mu_{X_2}</script>. This means that <script type="math/tex">\beta_1</script> and <script type="math/tex">\beta_1^*</script> are modeling different effects in the data. Here is a more detailed explanation:</p>
<p>Rewriting the model equation in the following way</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[Y] &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1X_2 \\
&= \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2
\end{aligned} %]]></script>
<p>shows that in the model with interaction term, the effect of <script type="math/tex">X_1</script> on <script type="math/tex">Y</script> is equal to <script type="math/tex">(\beta_1 + \beta_3 X_2)</script> and therefore a function of <script type="math/tex">X_2</script>. What does the parameter <script type="math/tex">\beta_1</script> model here? It models the effect of <script type="math/tex">X_1</script> on <script type="math/tex">Y</script> when <script type="math/tex">X_2 = 0</script>. Similarly, we could rewrite the effect of <script type="math/tex">X_2</script> on <script type="math/tex">Y</script> as a function of <script type="math/tex">X_1</script>.</p>
<p>Now let <script type="math/tex">X_1^c = X_1 - \mu_{X_1}</script> and <script type="math/tex">X_2^c = X_2 - \mu_{X_2}</script> be the centered predictors. We get the same model equations, now with the parameters estimated using the centered predictors <script type="math/tex">X_1^c, X_2^c</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[Y] &= \beta_0^* + \beta_1^* X_1^c + \beta_2^* X_2^c + \beta_3^* X_1^c X_2^c \\
&= \beta_0^* + (\beta_1^* + \beta_3^* X_2^c) X_1^c + \beta_2^* X_2^c \\
\end{aligned} %]]></script>
<p>Again we focus on the effect <script type="math/tex">(\beta_1^* + \beta_3^* X_2^c)</script> of <script type="math/tex">X_1^c</script> on <script type="math/tex">Y</script>. What does the parameter <script type="math/tex">\beta_1^*</script> model here? As before, it models the main effect of <script type="math/tex">X_1^c</script> on <script type="math/tex">Y</script> when <script type="math/tex">X_2^c = 0</script>. What is new is that, because of the centering, <script type="math/tex">X_2^c = 0</script> now corresponds to the original predictor being at its mean: <script type="math/tex">\mu_{X_2^c} = 0</script> is equivalent to <script type="math/tex">X_2 = \mu_{X_2}</script>.</p>
<p>To summarize, in the uncentered case <script type="math/tex">\beta_i</script> is the main effect when the predictor variable <script type="math/tex">X_i</script> is equal to zero; and in the centered case, <script type="math/tex">\beta_i^*</script> is the main effect when the predictor variable <script type="math/tex">X_i</script> is equal to its mean. Clearly, <script type="math/tex">\beta_i</script> and <script type="math/tex">\beta_i^*</script> model different effects in the data and it is therefore not surprising that the two regressions give us very different estimates.</p>
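<p>Because centering is only a linear reparameterization of the design matrix, the two sets of estimates are linked by an exact identity: substituting <script type="math/tex">X_2 = X_2^c + \mu_{X_2}</script> into the model equation shows that <script type="math/tex">\hat\beta_1^* = \hat\beta_1 + \hat\beta_3 \mu_{X_2}</script>. A standalone sketch verifying this numerically (re-simulating the data from above):</p>

```r
# Re-simulate the data from above so this snippet runs standalone
set.seed(1)
n <- 10000
x1 <- rnorm(n, mean = 1, sd = 1)
x2 <- rnorm(n, mean = 1, sd = 1)
y <- 1 + .3 * x1 + .2 * x2 + .2 * x1 * x2 + rnorm(n)
x1_c <- x1 - mean(x1)
x2_c <- x2 - mean(x2)

b  <- coef(lm(y ~ x1 * x2))      # uncentered estimates
bs <- coef(lm(y ~ x1_c * x2_c))  # centered estimates

# beta1* equals beta1 + beta3 * mean(x2), up to floating-point precision
all.equal(unname(bs["x1_c"]), unname(b["x1"] + b["x1:x2"] * mean(x2)))
```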
<p><strong>Centering <script type="math/tex">\rightarrow</script> interpretation of <script type="math/tex">\beta</script> remains the same when adding interaction</strong></p>
<p>Our second observation above was that the estimates of main effects are the same with/without interaction term when centering the predictor variables. This is because in the models <em>without</em> interaction term (centered or uncentered predictors) the interpretation of <script type="math/tex">\beta_1</script> is the same as in the model <em>with</em> interaction term and centered predictors.</p>
<p>More precisely, in the regression model with only main effects, <script type="math/tex">\beta_1</script> is the main effect of <script type="math/tex">X_1</script> on <script type="math/tex">Y</script> averaged over all values of <script type="math/tex">X_2</script>, which is the same as the main effect of <script type="math/tex">X_1</script> on <script type="math/tex">Y</script> for <script type="math/tex">X_2 = \mu_{X_2}</script>. This means that if we center predictors, <script type="math/tex">\beta_1</script> models the same effect in the data in a model with/without interaction term. This is an attractive property to have when one is interested in comparing models with/without interaction term.</p>
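<p>Note that this equality holds for the effects being modeled, not exactly for their estimates in a finite sample. A standalone sketch comparing the two sets of estimates (re-simulating the data from above):</p>

```r
# Re-simulate the data from above so this snippet runs standalone
set.seed(1)
n <- 10000
x1 <- rnorm(n, mean = 1, sd = 1)
x2 <- rnorm(n, mean = 1, sd = 1)
y <- 1 + .3 * x1 + .2 * x2 + .2 * x1 * x2 + rnorm(n)
x1_c <- x1 - mean(x1)
x2_c <- x2 - mean(x2)

b_main <- coef(lm(y ~ x1_c + x2_c))[2:3]  # main effects only
b_int  <- coef(lm(y ~ x1_c * x2_c))[2:3]  # with interaction term

# The main-effect estimates are close, but not identical
max(abs(b_main - b_int))
```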
<h2 id="explanation-2-main-effects-as-functions-of-added-constants">Explanation 2: Main effects as functions of added constants</h2>
<p>Subtracting the mean from predictors is a special case of adding constants to predictors. Here we first show numerically what happens to each regression parameter when adding constants to predictors. Then we show analytically how each parameter is a function of its value in the original regression model (no constant added) and the added constants.</p>
<p>Why are we doing this? To develop a more general understanding of what happens when adding constants to predictors. It also puts the above example in a more general context, since we can consider it a special case of the following analysis.</p>
<p><strong>Numerical experiment I: Only main effects</strong></p>
<p>We first fit a series of regression models with only main effects. In each of them we add a different constant to the predictors. We do this to verify that our claim that centering predictors does not change main effects extends to the more general situation of adding constants to predictors.</p>
<p>We first define a sequence of constant values we add to the predictors and create storage for parameter estimates:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">25</span><span class="w">
</span><span class="n">c_sequence</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="o">=</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="o">=</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"b0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b3"</span><span class="p">,</span><span class="w"> </span><span class="s2">"R2"</span><span class="p">)</span></code></pre></figure>
<p>We now fit 25 regression models, and in each of them we add a constant <code class="highlighter-rouge">c</code> to both predictors, taken from the sequence <code class="highlighter-rouge">c_sequence</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">25</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">c_sequence</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="w">
</span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="w">
</span><span class="n">lm_obj</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="p">)</span><span class="w"> </span><span class="c1"># Fit model</span><span class="w">
</span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">0</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">yhat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">lm_obj</span><span class="p">)</span><span class="w">
</span><span class="n">A</span><span class="o">$</span><span class="n">R</span><span class="m">2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">yhat</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="c1"># Compute R2</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Remark: in Explanation 1 we said that the coefficient of determination <script type="math/tex">R^2</script> does not change when adding constants to the predictors. We invite the reader to verify this by inspecting <code class="highlighter-rouge">A$R2</code>.</p>
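<p>A compact standalone way to perform this check (adding a constant does not change the column space of the design matrix, so <script type="math/tex">R^2</script> is unaffected):</p>

```r
# Re-simulate the data from above so this snippet runs standalone
set.seed(1)
n_obs <- 10000
x1 <- rnorm(n_obs, mean = 1, sd = 1)
x2 <- rnorm(n_obs, mean = 1, sd = 1)
y <- 1 + .3 * x1 + .2 * x2 + .2 * x1 * x2 + rnorm(n_obs)

# R2 of the main-effects regression for a few added constants
r2 <- sapply(c(-1.5, 0, 1.5), function(const) {
  summary(lm(y ~ I(x1 + const) + I(x2 + const)))$r.squared
})
diff(range(r2))  # essentially zero
```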
<p>We plot all parameters <script type="math/tex">\beta_0, \beta_1, \beta_2</script> as a function of <code class="highlighter-rouge">c</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">RColorBrewer</span><span class="p">)</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">)</span><span class="w"> </span><span class="c1"># Select nice colors</span><span class="w">
</span><span class="n">plot.new</span><span class="p">()</span><span class="w">
</span><span class="n">plot.window</span><span class="p">(</span><span class="n">xlim</span><span class="o">=</span><span class="nf">range</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">-.5</span><span class="p">,</span><span class="w"> </span><span class="m">2.5</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">cex.axis</span><span class="o">=</span><span class="m">0.75</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-.5</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2.5</span><span class="p">),</span><span class="w"> </span><span class="n">las</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">3</span><span class="p">])</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s2">"topright"</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"b0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b2"</span><span class="p">),</span><span class="w">
</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">],</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">title</span><span class="p">(</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Added constant"</span><span class="p">)</span><span class="w">
</span><span class="n">title</span><span class="p">(</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Parameter value"</span><span class="p">)</span></code></pre></figure>
<p><img src="http://jmbh.github.io/figs/CenteringPredictors/Centering_Fig1.png" alt="center" /></p>
<p>We see that the intercept changes as a function of <code class="highlighter-rouge">c</code>. The model at <code class="highlighter-rouge">c = 0</code> corresponds to the very first model we fitted above. And the model at <code class="highlighter-rouge">c = -1</code> corresponds to the model fitted with centered predictors. But the key observation is that the main effects <script type="math/tex">\beta_1, \beta_2</script> do not change. A proof of this and an exact expression for the intercept will fall out of our analysis of the model with interaction term in the last section of this blogpost.</p>
<p><strong>Numerical experiment II: main effects + interaction term</strong></p>
<p>Next we show that this is different when adding the interaction term. We use the same sequence of <code class="highlighter-rouge">c</code> as above and fit regression models with interaction term:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">25</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">c_sequence</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="w">
</span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">c</span><span class="w">
</span><span class="n">lm_obj</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="p">)</span><span class="w"> </span><span class="c1"># Fit model</span><span class="w">
</span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">0</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">3</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">4</span><span class="p">]</span><span class="w">
</span><span class="n">yhat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">lm_obj</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="err">_</span><span class="n">c</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="err">_</span><span class="n">c</span><span class="p">))</span><span class="w">
</span><span class="n">A</span><span class="o">$</span><span class="n">R</span><span class="m">2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">yhat</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="c1"># Compute R2</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>And again we plot all parameters <script type="math/tex">\beta_0, \beta_1, \beta_2, \beta_3</script> as a function of <code class="highlighter-rouge">c</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot.new</span><span class="p">()</span><span class="w">
</span><span class="n">plot.window</span><span class="p">(</span><span class="n">xlim</span><span class="o">=</span><span class="nf">range</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">-.5</span><span class="p">,</span><span class="w"> </span><span class="m">2.5</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">cex.axis</span><span class="o">=</span><span class="m">0.75</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-.5</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2.5</span><span class="p">),</span><span class="w"> </span><span class="n">las</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">3</span><span class="p">])</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">4</span><span class="p">])</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s2">"topright"</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"b0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b3"</span><span class="p">),</span><span class="w">
</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">],</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">title</span><span class="p">(</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Added constant"</span><span class="p">)</span><span class="w">
</span><span class="n">title</span><span class="p">(</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Parameter value"</span><span class="p">)</span></code></pre></figure>
<p><img src="http://jmbh.github.io/figs/CenteringPredictors/Centering_Fig2.png" alt="center" /></p>
<p>This time both the intercept <script type="math/tex">\beta_0</script> and the main effects <script type="math/tex">\beta_1, \beta_2</script> are a function of <code class="highlighter-rouge">c</code>, while the interaction effect <script type="math/tex">\beta_3</script> is constant. At this point the clearest route is simply to go through the algebra, which reproduces these results exactly. We do this in the next section.</p>
<p><strong>Deriving formulas for all parameters</strong></p>
<p>We plug the definition of centering into the population regression model introduced at the very beginning of this blogpost. This gives us every parameter as a function of two things: (1) the parameters in the original model and (2) the added constants. Above we added the same constant to both predictors; here we consider the general case where the constants can differ.</p>
<p>Our original (unaltered) model is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[Y] &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1X_2
\end{aligned} %]]></script>
<p>Now we plug in the predictors with added constants <script type="math/tex">c_1, c_2</script>, multiply out, and rearrange:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[Y] &= \beta_0^* + \beta_1^* (X_1 + c_1) + \beta_2^* (X_2 + c_2) + \beta_3^* (X_1 + c_1) (X_2 + c_2) \\
& = \beta_0^* + \beta_1^*X_1 + \beta_1^*c_1 + \beta_2^*X_2 + \beta_2^*c_2
+ \beta_3^* X_1X_2 + \beta_3^*X_1 c_2 + \beta_3^* c_1X_2 + \beta_3^* c_1c_2 \\
&= (\beta_0^* + \beta_1^*c_1 + \beta_2^*c_2 + \beta_3^* c_1c_2) + (\beta_1^* + \beta_3^*c_2)X_1 + (\beta_2^* + \beta_3^*c_1)X_2 + \beta_3^* X_1X_2
\end{aligned} %]]></script>
<p>Now if we equate the respective intercept and slope terms we get:</p>
<script type="math/tex; mode=display">\beta_0 = \beta_0^* + \beta_1^*c_1 + \beta_2^*c_2 + \beta_3^* c_1c_2</script>
<script type="math/tex; mode=display">\beta_1 = \beta_1^* + \beta_3^*c_2</script>
<script type="math/tex; mode=display">\beta_2 = \beta_2^* + \beta_3^*c_1</script>
<p>and</p>
<script type="math/tex; mode=display">\beta_3 = \beta_3^*</script>
<p>Now we solve for the parameters <script type="math/tex">\beta_0^*, \beta_1^*, \beta_2^*, \beta_3^*</script> of the model with constants added to the predictors.</p>
<p>Because we know <script type="math/tex">\beta_3 = \beta_3^*</script>, we can write <script type="math/tex">\beta_2 = \beta_2^* + \beta_3 c_1</script> and solve for</p>
<script type="math/tex; mode=display">\beta_2^* = \beta_2 - \beta_3 c_1</script>
<p>The same goes for <script type="math/tex">\beta_1^*</script> so we have</p>
<script type="math/tex; mode=display">\beta_1^* = \beta_1 - \beta_3 c_2</script>
<p>Finally, to obtain a formula for <script type="math/tex">\beta_0^*</script> we plug the just obtained expressions for <script type="math/tex">\beta_1^*</script>, <script type="math/tex">\beta_2^*</script> and <script type="math/tex">\beta_3^*</script> into</p>
<script type="math/tex; mode=display">\beta_0 = \beta_0^* + \beta_1^*c_1 + \beta_2^*c_2 + \beta_3^* c_1c_2</script>
<p>and get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\beta_0 &= \beta_0^* + (\beta_1 - \beta_3 c_2)c_1 + (\beta_2 - \beta_3 c_1)c_2 + \beta_3 c_1c_2 \\
&= \beta_0^* + \beta_1 c_1 - \beta_3 c_2 c_1 + \beta_2 c_2 - \beta_3 c_2 c_1 + \beta_3 c_1c_2 \\
&= \beta_0^* + \beta_1 c_1 + \beta_2 c_2 - \beta_3 c_1c_2
\end{aligned} %]]></script>
<p>and can solve for <script type="math/tex">\beta_0^*</script>:</p>
<script type="math/tex; mode=display">\beta_0^* = \beta_0 - \beta_1 c_1 - \beta_2 c_2 + \beta_3 c_1c_2</script>
<p>Let’s check whether these formulas predict the parameter changes as a function of <code class="highlighter-rouge">c</code> in the numerical experiment above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">lm_obj</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="m">1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="c1"># Reference model (no constant added)</span><span class="w">
</span><span class="n">b</span><span class="m">0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">b</span><span class="m">1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">b</span><span class="m">2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">b</span><span class="m">3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm_obj</span><span class="o">$</span><span class="n">coefficients</span><span class="p">[</span><span class="m">4</span><span class="p">]</span><span class="w">
</span><span class="n">B</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="c1"># Storage for predicted parameters</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">25</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">c</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">c_sequence</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">B</span><span class="o">$</span><span class="n">b</span><span class="m">0</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">b</span><span class="m">0</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">b</span><span class="m">1</span><span class="o">*</span><span class="n">c</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">b</span><span class="m">2</span><span class="o">*</span><span class="n">c</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="m">3</span><span class="o">*</span><span class="n">c</span><span class="o">*</span><span class="n">c</span><span class="w">
</span><span class="n">B</span><span class="o">$</span><span class="n">b</span><span class="m">1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">b</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">b</span><span class="m">3</span><span class="o">*</span><span class="n">c</span><span class="w">
</span><span class="n">B</span><span class="o">$</span><span class="n">b</span><span class="m">2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">b</span><span class="m">2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">b</span><span class="m">3</span><span class="o">*</span><span class="n">c</span><span class="w">
</span><span class="n">B</span><span class="o">$</span><span class="n">b</span><span class="m">3</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">b</span><span class="m">3</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>We plot the parameters computed from the derived expressions as points on top of the empirical results from the numerical experiment above:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot.new</span><span class="p">()</span><span class="w">
</span><span class="n">plot.window</span><span class="p">(</span><span class="n">xlim</span><span class="o">=</span><span class="nf">range</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">-.5</span><span class="p">,</span><span class="w"> </span><span class="m">2.5</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">cex.axis</span><span class="o">=</span><span class="m">0.75</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-.5</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2.5</span><span class="p">),</span><span class="w"> </span><span class="n">las</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">3</span><span class="p">])</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">A</span><span class="o">$</span><span class="n">b</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">4</span><span class="p">])</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s2">"topright"</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"b0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b1"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"b3"</span><span class="p">),</span><span class="w">
</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">],</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="c1"># Plot predictions</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="o">$</span><span class="n">b</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="o">$</span><span class="n">b</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="o">$</span><span class="n">b</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">3</span><span class="p">])</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">c_sequence</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="o">$</span><span class="n">b</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="m">4</span><span class="p">])</span><span class="w">
</span><span class="n">title</span><span class="p">(</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Added constant"</span><span class="p">)</span><span class="w">
</span><span class="n">title</span><span class="p">(</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Parameter value"</span><span class="p">)</span></code></pre></figure>
<p><img src="http://jmbh.github.io/figs/CenteringPredictors/Centering_Fig3.png" alt="center" /></p>
<p>The points match the numerical results exactly: the derived expressions describe precisely how each parameter changes as a function of the parameters of the reference model and the added constants.</p>
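<p>The check above used the same constant <code class="highlighter-rouge">c</code> for both predictors. The derivation also covers different constants <script type="math/tex">c_1 \neq c_2</script>; the sketch below (the helper name <code class="highlighter-rouge">shift_coefs</code> and the specific constants are mine, for illustration) verifies the four formulas in that general case:</p>

```r
# Verify the derived coefficient mappings for different constants c1 != c2.
# The helper name `shift_coefs` is mine, for illustration.
shift_coefs <- function(b, c1, c2) {
  c(b[1] - b[2] * c1 - b[3] * c2 + b[4] * c1 * c2,  # b0*
    b[2] - b[4] * c2,                               # b1*
    b[3] - b[4] * c1,                               # b2*
    b[4])                                           # b3* (unchanged)
}

set.seed(1)
n  <- 10000
x1 <- rnorm(n, mean = 1)
x2 <- rnorm(n, mean = 1)
y  <- 1 + 0.3 * x1 + 0.2 * x2 + 0.2 * x1 * x2 + rnorm(n)

b_ref   <- coef(lm(y ~ x1 * x2))                    # reference model
b_shift <- coef(lm(y ~ I(x1 + 0.5) * I(x2 - 1.3)))  # c1 = 0.5, c2 = -1.3

all.equal(unname(b_shift), shift_coefs(unname(b_ref), c1 = 0.5, c2 = -1.3))
# TRUE (up to numerical precision)
```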
<p>If we set <script type="math/tex">\beta_3 = 0</script>, we get the same derivation for the regression model <em>without</em> interaction term. We find that <script type="math/tex">\beta_1^* = \beta_1</script>, <script type="math/tex">\beta_2^* = \beta_2</script>, and <script type="math/tex">\beta_0^* = \beta_0 - \beta_1 c_1 - \beta_2 c_2</script>.</p>
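<p>This special case is easy to verify numerically as well; a minimal sketch (data simulated here for illustration):</p>

```r
# With no interaction term, centering should change only the intercept:
# b1* = b1, b2* = b2, and b0* = b0 - b1*c1 - b2*c2, where centering means
# c1 = -mean(x1) and c2 = -mean(x2).
set.seed(2)
n  <- 10000
x1 <- rnorm(n, mean = 1)
x2 <- rnorm(n, mean = 1)
y  <- 1 + 0.3 * x1 + 0.2 * x2 + rnorm(n)

b_ref  <- coef(lm(y ~ x1 + x2))                               # raw predictors
b_cent <- coef(lm(y ~ I(x1 - mean(x1)) + I(x2 - mean(x2))))   # centered

all.equal(unname(b_ref[2:3]), unname(b_cent[2:3]))   # slopes unchanged: TRUE
all.equal(unname(b_cent[1]),                         # intercept shifts: TRUE
          unname(b_ref[1] + b_ref[2] * mean(x1) + b_ref[3] * mean(x2)))
```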
Sat, 05 May 2018 00:00:00 +0000
http://jmbh.github.io//CenteringPredictors/
Deconstructing 'Measurement error and the replication crisis'<p>Yesterday I read <a href="http://science.sciencemag.org/content/355/6325/584/tab-pdf">‘Measurement error and the replication crisis’</a> by <a href="http://hhd.psu.edu/dsg/eric-loken-phd-assistant-director">Eric Loken</a> and <a href="http://andrewgelman.com">Andrew Gelman</a>, which left me puzzled. The first part of the paper consists of general statements about measurement error. The second part claims that, in the presence of measurement error, we overestimate the true effect when the sample size is small. This sounded wrong enough that I asked the authors for their <a href="https://raw.githubusercontent.com/jmbh/jmbh.github.io/master/figs/measurementerror/graph%20codes%20to%20share%20for%20science%20paper%20final-2.txt">simulation code</a> and spent a couple of hours figuring out what they did in their paper. I offer a short and a long version.</p>
<p><strong>Edit Feb 17th:</strong> After a nice email conversation with the authors, I now know that they <em>do</em> make their general argument only under the condition of selecting on significance. Their result then trivially follows from the increased variance of the sampling distribution due to adding ‘measurement error’ (see section (3) below). My source of confusion was that they talk about selection on significance in the paper, but then do not select on significance in the two scatter plots, and incorrectly state in the figure title that they do. The conclusions of this blog post are still valid when making the assumptions in (1), so I leave it online in case somebody finds (parts of) it interesting.</p>
<h2 id="the-short-version">The Short Version</h2>
<p>My conclusion is that the authors show the following: If an estimator is biased (here by the presence of measurement error), then the proportion of estimates that overestimate the true effect depends on the variance of the sampling distribution (which depends on $N$). While this is an interesting insight, the authors do not say this clearly anywhere in the paper. Instead, they use formulations that suggest that they refer to the expected value of the estimator, which does not depend on the sample size. To make things worse, they plot the estimates in a way that suggests that the variance of the estimators is equal for N = 50 and N = 3000 and that the effect is driven by a difference in expected value, while the reverse is true.</p>
<h2 id="the-long-version">The Long Version</h2>
<p>I try to make an argument for my claims in the ‘short version’ above in 6 steps: (1) we make clear what claim the authors make, (2) we define our terminology, (3) we investigate what adding measurement error does on the population level, (4) we see how this influences the characteristics of estimators based on different sample sizes, (5) we summarize our results, and (6) we get back to the paper.</p>
<p><strong>(1) The exact claim</strong></p>
<p>The authors write <em>‘In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance.’ (p. 584/585)</em>. From this we can deduce that the authors claim that ‘In a high noise setting, the presence of measurement error and selection on statistical significance leads to an increase in coefficient estimates’. However, the authors do not select on statistical significance in their simulation, hence we also drop this condition and arrive at the claim ‘In a high noise setting, the presence of measurement error leads to an increase in coefficient estimates’.</p>
<p>What this statement means is unclear to me. Under the reasonable assumption that the authors did not make a fundamental mistake, the rest of this blogpost is about finding out what the authors could have meant.</p>
<p><strong>(2) Terminology (for reference)</strong></p>
<p>In the paper, ‘measurement error’, ‘noise’ and ‘variance’ are used interchangeably. Here, with variances we refer to the variances of the dimensions of the bivariate Gaussian distribution, unless stated otherwise. With measurement error we mean another bivariate Gaussian distribution with zero covariance. By a noisy setting, we refer to a situation with a low signal-to-noise ratio, defined relative to another, less noisy setting. The signal-to-noise ratio is a function of $N$ and is related to the variance of the sampling distribution of the estimator. All these things will become clear in sections (3) and (4).</p>
<p><strong>(3) What does ‘adding measurement error’ mean on the population level?</strong></p>
<p>In order to evaluate the above claim with respect to the simulation setup of the authors, we need to know the simulation setup. Fortunately, the authors provided the code in a quick and friendly email.</p>
<p>The authors consider the problem of estimating the covariance of a bivariate Gaussian distribution from a finite number of observations. The bivariate Gaussian distribution has the density</p>
<script type="math/tex; mode=display">f(x_1, x_2) = \frac{1}{\sqrt{(2 \pi)^2 | \mathbf{\Sigma} |}} \exp \left\{ - \frac{1}{2} (x - \mu)^{\top} \mathbf{\Sigma}^{-1} (x - \mu) \right\},</script>
<p>where in our case the covariance $cov(x_1, x_2) = r > 0$ is some positive value, so the covariance matrix $\Sigma$ has entries:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma = \begin{bmatrix}
1 & r \\[0.3em]
r & 1
\end{bmatrix} %]]></script>
<p>Note that if we scale both dimensions of the Gaussian to $\mu_1 = \mu_2 = 0$ and $\sigma_1 = \sigma_2 = 1$, the correlation coefficient is equal to the coefficient of the regression of $x_1$ on $x_2$ or vice versa. Thus all results obtained here also extend to the regression coefficient that is referred to in the paper.</p>
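This equivalence is easy to verify numerically (a quick sketch of my own, not part of the authors' code):

```r
# For standardized variables, the regression slope equals the correlation
set.seed(1)
n  <- 1e5; r <- 0.3
x1 <- rnorm(n)
x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)      # construct cor(x1, x2) ~ r
slope <- coef(lm(scale(x2) ~ scale(x1)))[2]  # slope on standardized scale
c(correlation = cor(x1, x2), slope = slope)  # the two values are identical
```

The equality is exact in-sample: after standardization both variables have unit variance, so the least-squares slope reduces to the sample correlation.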
<p>Now the authors ‘add measurement error’ to the two variables, which consists of independent Gaussian noise with a variance $k > 0$, where $k$ is a constant. Notice that these two noise variables can also be described by a bivariate Gaussian with covariance matrix $\Sigma^{ME}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma^{ME} = \begin{bmatrix}
k & 0 \\[0.3em]
0 & k
\end{bmatrix} %]]></script>
<p>Notice that adding ‘measurement error’ as done by the authors is the same as adding these two Gaussians. Addition is a linear transformation and hence the resulting distribution is again a bivariate Gaussian distribution. Indeed, it turns out that the covariance matrix $\Sigma^A$ of the resulting bivariate Gaussian is the sum of the covariance matrices $\Sigma$ and $\Sigma^{ME}$ of the two bivariate Gaussians:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma^A = \begin{bmatrix}
1 & r \\[0.3em]
r & 1
\end{bmatrix}
+
\begin{bmatrix}
k & 0 \\[0.3em]
0 & k
\end{bmatrix}
=
\begin{bmatrix}
k + 1 & r \\[0.3em]
r & k + 1
\end{bmatrix} %]]></script>
<p>Now, if we renormalize the variances to get back to a correlation matrix, it becomes obvious that adding ‘measurement error’ has to decrease the absolute value of the correlation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma^{A_{norm}} = \begin{bmatrix}
1 & \frac{r}{k + 1} \\[0.3em]
\frac{r}{k + 1} & 1
\end{bmatrix} %]]></script>
<p>Note that $k > 0$ and hence $\frac{r}{k + 1} < r$: the absolute value of the correlation is smaller in $\Sigma^{A_{norm}}$ than in $\Sigma$ in the population.</p>
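A quick numerical check of this shrinkage (a sketch; the particular values of $r$ and $k$ are arbitrary choices of mine):

```r
# Adding independent noise with variance k to both dimensions shrinks the
# correlation from r to r / (k + 1)
set.seed(1)
n  <- 1e6; r <- 0.3; k <- 0.5
x1 <- rnorm(n)
x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)   # cor(x1, x2) ~ r
x1n <- x1 + rnorm(n, 0, sqrt(k))          # add 'measurement error'
x2n <- x2 + rnorm(n, 0, sqrt(k))
c(cor(x1n, x2n), r / (k + 1))             # both ~ 0.2
```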
<p><strong>(4) Properties of the Estimator</strong></p>
<p>We now consider the estimate $\hat \sigma_{1,2}$ for the covariance between $x_1$ and $x_2$ in the bivariate Gaussian with covariance matrix $\Sigma^{A_{norm}}$ which is ‘corrupted’ by measurement error. We obtain $\hat \sigma_{1,2}$ via the least squares estimator, <a href="http://math.stackexchange.com/questions/787939/show-that-the-least-squares-estimator-of-the-slope-is-an-unbiased-estimator-of-t">which is an unbiased estimator</a> for $\frac{r}{k + 1}$.</p>
<p>What does this mean? By the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">Central limit theorem</a>, the sampling distribution will be approximately Gaussian and, since the estimator is unbiased, centered on the true coefficient $\frac{r}{k + 1}$. Thus, if we take many samples of size $N$ and compute a coefficient estimate on each of them, the mean coefficient will be equal to $\frac{r}{k + 1}$:</p>
<script type="math/tex; mode=display">\mathbb{E} [\hat \sigma_{1,2}] = \lim_{S \rightarrow \infty} \frac{1}{S} \sum_{i=1}^{S} \hat \sigma_{1,2}^i = \frac{r}{k + 1}</script>
<p>From the fact that the Gaussian density is symmetric and centered on the true effect, it follows that $\hat \sigma_{1,2}$ will <em>equally often</em> under- and overestimate the true effect $\frac{r}{k + 1}$. It is important to stress that this is true, irrespective of the variance of the sampling distribution (which depends on $N$). We illustrate this in the following Figure which shows the empirical sampling distributions from the simulation of the authors:</p>
<p><img src="https://raw.githubusercontent.com/jmbh/jmbh.github.io/master/figs/measurementerror/SamplingDistri_new.png" alt="center" /></p>
<p>The solid black line indicates the density estimate of the empirical sampling distribution of the coefficient estimates in the low noise (N = 3000) case. The solid red line indicates the density of the empirical sampling distribution of the coefficient estimates in the high noise (N = 50) case. The dashed black and red lines indicate the arithmetic means of the corresponding sampling distributions. The green dashed line indicates the true coefficient of the bivariate Gaussian with added measurement error. Now, as predicted from the fact that $\hat \sigma_{1,2}$ is an unbiased estimator independent of $N$, we see that the mean parameter estimates in both low/high noise settings (black/red dashed lines) are close to the true coefficient $\frac{r}{k + 1}$ (dashed green line).</p>
<p>Before moving on, we define $\mathcal{P}^\uparrow \in [0,1]$ as the proportion of coefficient estimates that are larger than the true effect under consideration and hence overestimate it. $\mathcal{P}^\uparrow_H$ refers to that proportion in the high noise (small $N$) setting, $\mathcal{P}^\uparrow_L$ to that proportion in the low noise (large $N$) setting.</p>
<p>Now, the second important observation is that, with respect to the true effect $\frac{r}{k + 1}$, we have $\mathcal{P}^\uparrow_H = \mathcal{P}^\uparrow_L = \frac{1}{2}$: we equally often under- and overestimate the true effect in both noise settings. Another way of saying this is that for both sampling distributions the area under the curve left of the green line equals the area under the curve right of it.</p>
<p>We now make the crucial step by considering $\hat \sigma_{1,2}$ not as an estimate for the covariance $\frac{r}{k + 1}$ in $\Sigma^{A_{norm}}$, but for the covariance $r$ of the ‘true’ bivariate Gaussian without added measurement error with covariance matrix $\Sigma$. We <em>know</em> that $\hat \sigma_{1,2}$ is an unbiased estimator for $\frac{r}{k + 1}$ and we know $\frac{r}{k + 1} < r$. From this follows that $\hat{\sigma}_{1,2}$ is a <em>biased</em> estimator for $r$. Specifically, the estimator is biased downwards.</p>
<p>We again look at the proportions of coefficient estimates that under- and overestimate the true effect $r$ (the dashed blue line in the figure). Considering first the low noise case: we overestimate $r$ <em>less often</em> than we overestimated $\frac{r}{k + 1}$, which implies $\mathcal{P}^\uparrow_L < \frac{1}{2}$. Again, this is the same as saying that the area under the curve right of the blue line is smaller than the area under the curve left of the blue line.</p>
<p>For the high noise case the exact same is true, i.e. $\mathcal{P}^\uparrow_H < \frac{1}{2}$. Let’s define $q := \frac{\mathcal{P}^\uparrow_H}{\mathcal{P}^\uparrow_L}$. What we <em>do</em> have is that $\mathcal{P}^\uparrow_H > \mathcal{P}^\uparrow_L$ and hence $q > 1$. This means that in the presence of measurement error, we overestimate <em>absolutely less</em> often than we underestimate in all settings; however, we overestimate <em>relatively more</em> in a high noise (small $N$) setting compared to a low noise (large $N$) setting. Let’s let this sink in for a moment and then move on to the summary:</p>
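The whole argument can be sketched in a few lines of simulation (my own sketch, not the authors' code; the values of $r$, $k$ and the sample sizes roughly mirror the setup above):

```r
# Estimates are centered on r/(k+1) < r, so both noise settings overestimate
# r less than half the time; the small-N (high noise) setting does so
# relatively more often because its sampling distribution is wider
set.seed(1)
r <- 0.3; k <- 0.5; S <- 2000
prop_over <- function(N) {
  est <- replicate(S, {
    x1 <- rnorm(N); x2 <- r * x1 + sqrt(1 - r^2) * rnorm(N)
    cor(x1 + rnorm(N, 0, sqrt(k)), x2 + rnorm(N, 0, sqrt(k)))
  })
  mean(est > r)                  # proportion of estimates overestimating r
}
p_H <- prop_over(50)             # high noise: small N
p_L <- prop_over(3000)           # low noise: large N
c(p_H = p_H, p_L = p_L)          # both < .5, and p_H > p_L, hence q > 1
```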
<p><strong>(5) Summary</strong></p>
<p>What have we found? We found that if our estimator is biased downwards (here by measurement error), then different sample sizes (and hence different variances of the sampling distribution) lead to different proportions of coefficient estimates that overestimate the true effect.</p>
<p>However, it is important to stress: when keeping $N$ constant and introducing measurement error, the proportion of overestimating estimates <em>decreases</em> compared to the situation without measurement error. This is because the whole sampling distribution is shifted towards zero in the presence of measurement error (the blue line is shifted to the position of the green line in the Figure).</p>
<p>The only thing that is increasing is $q$, which means that in the presence of measurement error in a high noise setting (small $N$) we <em>relatively</em> overestimate more than in a low noise setting (large $N$). What determines $q$? The larger the difference between the variances of the two sampling distributions, the larger $q$. The more we shift the sampling distribution towards zero (by adding measurement error), the larger $q$.</p>
<p><strong>(6) Back to the Paper</strong></p>
<p>I think the results stated in (5) are pretty far away from the claim in the paper, which was ‘In a high noise setting, the presence of measurement error leads to an increase in coefficient estimates’. This statement rather suggests that introducing measurement error increases the expected value of the sampling distribution (moving the blue line to the right instead of to the left), which is - as we have seen - incorrect. This false suggestion is strengthened by the scaling of the figures. We illustrate this here by plotting the figure as shown in the paper (top row) and with equal coordinate systems (bottom row).</p>
<p><img src="https://raw.githubusercontent.com/jmbh/jmbh.github.io/master/figs/measurementerror/ScalingIssue.png" alt="center" /></p>
<p>The top row suggests that the difference between the low/high noise settings arises because the whole cloud is ‘shifted’ downwards in the low noise setting. This would mean that the sampling distributions are shifted differently depending on the noise setting (sample size) when adding measurement error. On the other hand, when plotting the data in the same coordinate system, it is clear that the expected values do not change and that the effect is driven by the differing variances of the estimator.</p>
<p>And one more thing: in the right panel in the figure of the paper the authors plot $\mathcal{P}^\uparrow$ as a function of $N$. Note that from the discussion in (4) it follows that this value can <em>never</em> be larger than $\frac{1}{2}$ as long as the estimator is unbiased or biased downwards. So there must have been some mistake.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This was a fun opportunity to do some statistics detective work. However, the lack of clarity could also do quite some harm by confusing readers about important concepts. There is of course also the possibility that I just fully misunderstood their paper. In that case I hope the reader will point out my mistakes.</p>
<p>The code to exactly reproduce the above figures can be found <a href="https://raw.githubusercontent.com/jmbh/jmbh.github.io/master/figs/measurementerror/RCode_ME_comment.R">here</a>.</p>
<p>I would like to thank <a href="https://twitter.com/fdabl">Fabian Dablander</a> and <a href="https://www.gess.ethz.ch/en/the-department/people/person-detail.html?persid=191462">Peter Edelsbrunner</a> for helpful comments on this blogpost. In addition, I would like to thank <a href="https://www.uu.nl/staff/ORyan/0">Oisín Ryan</a> and <a href="https://www.uu.nl/medewerkers/JJBroere/0">Joris Broere</a> for an interesting discussion on a train ride from Eindhoven to Utrecht yesterday, and I apologize to about 15 anonymous Dutch travelers because they had to endure a heated statistical debate.</p>
<p>I am looking forward to comments, complaints and corrections.</p>
Thu, 16 Feb 2017 00:00:00 +0000
http://jmbh.github.io//Deconstructing-ME/
Predictability in Network Models<p>Network models have become a popular way to abstract complex systems and gain insights into relational patterns among observed variables in <a href="http://www.sachaepskamp.com/files/NA/NetworkTakeover.pdf">many areas of science</a>. The majority of these applications focuses on analyzing the structure of the network. However, if the network is not directly observed (Alice and Bob are friends) but <em>estimated</em> from data (there is a relation between smoking and cancer), we can analyze - in addition to the network structure - the predictability of the nodes in the network. That is, we would like to know: how well can a given node be predicted by all remaining nodes in the network?</p>
<p>Predictability is interesting for several reasons:</p>
<ol>
<li>It gives us an idea of how <em>practically relevant</em> edges are: if node A is connected to many other nodes but these explain, let’s say, only 1% of its variance, how interesting are the edges connected to A?</li>
<li>We get an indication of how to design an <em>intervention</em> in order to achieve a change in a certain set of nodes and we can estimate how efficient the intervention will be</li>
<li>It tells us to what extent different parts of the network are <em>self-determined or determined by other factors</em> that are not included in the network</li>
</ol>
<p>In this blogpost, we use the R-package <a href="https://cran.r-project.org/web/packages/mgm/index.html">mgm</a> to estimate a network model and compute node wise predictability measures for a <a href="http://cpx.sagepub.com/content/3/6/836.short">dataset</a> on <a href="https://en.wikipedia.org/wiki/Posttraumatic_stress_disorder">Post Traumatic Stress Disorder (PTSD)</a> symptoms of <a href="https://en.wikipedia.org/wiki/2008_Sichuan_earthquake">Chinese earthquake victims</a>. We visualize the network model and predictability using <a href="https://cran.r-project.org/web/packages/qgraph/index.html">the qgraph package</a> and discuss how the combination of network model and node wise predictability can be used to design effective interventions on the symptom network.</p>
<h2 id="load-data">Load Data</h2>
<p>We load the data which the authors made freely available:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'http://psychosystems.org/wp-content/uploads/2014/10/Wenchuan.csv'</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">na.omit</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">344</span><span class="w"> </span><span class="m">17</span></code></pre></figure>
<p>The dataset contains complete responses to 17 PTSD symptoms of 344 individuals. The answer categories for the intensity of symptoms range from 1 ‘not at all’ to 5 ‘extremely’. The exact wording of all symptoms is given in the <a href="http://cpx.sagepub.com/content/3/6/836.short">paper of McNally and colleagues</a>.</p>
<h2 id="estimate-network-model">Estimate Network Model</h2>
<p>We estimate a <a href="http://www.jmlr.org/proceedings/papers/v33/yang14a.pdf">Mixed Graphical Model (MGM)</a>, where we treat all variables as continuous-Gaussian variables. Hence we set the type of all variables to <code class="highlighter-rouge">type = 'g'</code> and the number of categories for each variable to 1, which is the default for continuous variables <code class="highlighter-rouge">lev = 1</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">mgm</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">fit_obj</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mgm</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w">
</span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s1">'g'</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">),</span><span class="w">
</span><span class="n">level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">),</span><span class="w">
</span><span class="n">lambdaSel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'CV'</span><span class="p">,</span><span class="w">
</span><span class="n">ruleReg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'OR'</span><span class="p">)</span></code></pre></figure>
<p>For more info on how to estimate Mixed Graphical Models using the mgm package see <a href="http://jmbh.github.io/Estimation-of-mixed-graphical-models/">this previous post</a> or the <a href="https://arxiv.org/pdf/1510.06871v2.pdf">mgm paper</a>.</p>
<h2 id="compute-predictability-of-nodes">Compute Predictability of Nodes</h2>
<p>After estimating the network model we are ready to compute the predictability of each node. Node-wise predictability (or error) can be computed easily, because the graph is estimated by taking each node in turn and regressing it on all other nodes. As a measure of predictability we pick the proportion of explained variance, as it is straightforward to interpret: 0 means the node at hand is not explained at all by the other nodes in the network, 1 means perfect prediction. We centered all variables before estimation in order to remove any influence of the intercepts. For a detailed description of how to compute predictions and choose predictability measures, <a href="https://arxiv.org/abs/1610.09108">check out this preprint</a>. In case there are additional variable types (e.g. categorical) in the network, we can choose an appropriate measure for those variables (e.g. % correct classification, see <code class="highlighter-rouge">?predict.mgm</code>).</p>
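To make the idea concrete, here is a conceptual sketch of node-wise explained variance using plain unregularized regression on a toy matrix (this is <em>not</em> mgm's actual implementation, which uses regularized nodewise regression, so the values will differ from what <code class="highlighter-rouge">predict.mgm</code> returns):

```r
# Concept sketch: the node-wise R^2 is the R^2 obtained by regressing
# each node on all remaining nodes
r2_nodewise <- function(X) {
  sapply(seq_len(ncol(X)), function(j)
    summary(lm(X[, j] ~ X[, -j]))$r.squared)
}

# toy example: the third variable is a noisy sum of the first two
set.seed(1)
X <- matrix(rnorm(300), 100, 3)
X[, 3] <- X[, 1] + X[, 2] + rnorm(100, 0, 0.5)
r2 <- r2_nodewise(X)
round(r2, 2)   # third value is high, since node 3 is well predicted
```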
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">pred_obj</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit_obj</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w">
</span><span class="n">errorCon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'R2'</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">pred_obj</span><span class="o">$</span><span class="n">error</span><span class="w">
</span><span class="n">Variable</span><span class="w"> </span><span class="n">Error.R2</span><span class="w">
</span><span class="m">1</span><span class="w"> </span><span class="n">intrusion</span><span class="w"> </span><span class="m">0.639</span><span class="w">
</span><span class="m">2</span><span class="w"> </span><span class="n">dreams</span><span class="w"> </span><span class="m">0.661</span><span class="w">
</span><span class="m">3</span><span class="w"> </span><span class="n">flash</span><span class="w"> </span><span class="m">0.601</span><span class="w">
</span><span class="m">4</span><span class="w"> </span><span class="n">upset</span><span class="w"> </span><span class="m">0.636</span><span class="w">
</span><span class="m">5</span><span class="w"> </span><span class="n">physior</span><span class="w"> </span><span class="m">0.627</span><span class="w">
</span><span class="m">6</span><span class="w"> </span><span class="n">avoidth</span><span class="w"> </span><span class="m">0.686</span><span class="w">
</span><span class="m">7</span><span class="w"> </span><span class="n">avoidact</span><span class="w"> </span><span class="m">0.681</span><span class="w">
</span><span class="m">8</span><span class="w"> </span><span class="n">amnesia</span><span class="w"> </span><span class="m">0.41</span><span class="w">
</span><span class="m">9</span><span class="w"> </span><span class="n">lossint</span><span class="w"> </span><span class="m">0.52</span><span class="w">
</span><span class="m">10</span><span class="w"> </span><span class="n">distant</span><span class="w"> </span><span class="m">0.498</span><span class="w">
</span><span class="m">11</span><span class="w"> </span><span class="n">numb</span><span class="w"> </span><span class="m">0.451</span><span class="w">
</span><span class="m">12</span><span class="w"> </span><span class="n">future</span><span class="w"> </span><span class="m">0.54</span><span class="w">
</span><span class="m">13</span><span class="w"> </span><span class="n">sleep</span><span class="w"> </span><span class="m">0.565</span><span class="w">
</span><span class="m">14</span><span class="w"> </span><span class="n">anger</span><span class="w"> </span><span class="m">0.562</span><span class="w">
</span><span class="m">15</span><span class="w"> </span><span class="n">concen</span><span class="w"> </span><span class="m">0.638</span><span class="w">
</span><span class="m">16</span><span class="w"> </span><span class="n">hyper</span><span class="w"> </span><span class="m">0.676</span><span class="w">
</span><span class="m">17</span><span class="w"> </span><span class="n">startle</span><span class="w"> </span><span class="m">0.626</span></code></pre></figure>
<p>We calculated the percentage of variance explained in each of the nodes in the network. Next, we visualize the estimated network and discuss its structure in relation to explained variance.</p>
<h2 id="visualize-network--predictability">Visualize Network & Predictability</h2>
<p>We provide the estimated weighted adjacency matrix and the node wise predictability measures as arguments to <code class="highlighter-rouge">qgraph()</code> …</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">qgraph</span><span class="p">)</span><span class="w">
</span><span class="n">qgraph</span><span class="p">(</span><span class="n">fit_obj</span><span class="o">$</span><span class="n">pairwise</span><span class="o">$</span><span class="n">wadj</span><span class="p">,</span><span class="w"> </span><span class="c1"># weighted adjacency matrix as input</span><span class="w">
</span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'spring'</span><span class="p">,</span><span class="w">
</span><span class="n">pie</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred_obj</span><span class="o">$</span><span class="n">error</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="c1"># provide errors as input</span><span class="w">
</span><span class="n">pieColor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s1">'#377EB8'</span><span class="p">,</span><span class="n">p</span><span class="p">),</span><span class="w">
</span><span class="n">edge.color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit_obj</span><span class="o">$</span><span class="n">pairwise</span><span class="o">$</span><span class="n">edgecolor</span><span class="p">,</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">data</span><span class="p">))</span><span class="w">
</span></code></pre></figure>
<p>… and get the following network visualization:</p>
<p><img src="http://jmbh.github.io/figs/2016-11-01-Predictability-in-network-models/McNellyNetwork.png" alt="center" /></p>
<p><a href="http://jmbh.github.io/Predictability-in-network-models/">[Click here for the original post with larger figures]</a></p>
<p>Each variable is represented by a node and the edges correspond to partial correlations, because in this dataset the MGM consists only of conditional Gaussian variables. The green color of the edges indicates that all partial correlations in this graph are positive, and the edge width is proportional to the absolute value of the partial correlation. The blue pie chart behind each node indicates its predictability (more blue = higher predictability).</p>
<p>We see that intrusive memories, traumatic dreams and flashbacks cluster together. Also, we observe that avoidance of thoughts (avoidth) about trauma interacts with avoidance of activities reminiscent of the trauma (avoidact) and that hypervigilant (hyper) behavior is related to feeling easily startled (startle). But there are also less obvious interactions, for instance between anger and concentration problems.</p>
<p>Now, if we would like to reduce sleep problems, the network model suggests intervening on the variables anger and startle. But what the network structure does not tell us is <em>how much</em> we could possibly change sleep through the variables anger and startle. The predictability measure gives us an answer to this question: 56.5%. If the goal was to intervene on amnesia, we see that all adjacent nodes in the network explain only 41% of its variance. In addition, we see that there are many small edges connected to amnesia, suggesting that it is hard to intervene on amnesia via other nodes in the symptom network. Thus, one would possibly try to find additional variables that are not included in the network that interact with amnesia, or try to intervene on amnesia directly.</p>
<h2 id="limitations">Limitations!</h2>
<p>Of course, there are limitations to interpreting explained variance as predicted treatment outcome: first, we cannot know the causal direction of the edges, so any edge could point in one or both directions. However, if there is no edge, there is also no causal effect in any direction. Also, it is often reasonable to combine the network model with general knowledge: for instance, it seems more likely that amnesia causes being upset than the other way around. Second, we estimated the model on cross-sectional data (each row is one person) and hence assume that all people are the same, an assumption that is always violated to some extent. To solve this problem we would need (many) repeated measurements of a single person, in order to estimate a model specific to that person. This also solves the first problem to some degree, as we can use the direction of time as the direction of causality. One would then use models that predict all symptoms at time point t from all symptoms at an earlier time point, let’s say t-1. An example of such a model is the <a href="https://en.wikipedia.org/wiki/Vector_autoregression">Vector Autoregressive (VAR) model</a>.</p>
<h2 id="compare-within-vs-out-of-sample-predictability">Compare Within vs. Out of Sample Predictability</h2>
<p>So far we looked into how well we can predict nodes by all other nodes within our sample. But in most situations we are interested in the predictability of nodes in new, unseen data. In what follows, we compare the within-sample predictability with the out-of-sample predictability.</p>
<p>We first split the data into two parts: a training part (60% of the data), which we use to estimate the network model, and a test part, on which we compute the predictability measures:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">ind</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="kc">TRUE</span><span class="p">,</span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">prob</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">.6</span><span class="p">,</span><span class="w"> </span><span class="m">.4</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="n">nrow</span><span class="p">(</span><span class="n">data</span><span class="p">),</span><span class="w"> </span><span class="n">replace</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span></code></pre></figure>
<p>Next, we estimate the network only on the training data and compute the predictability measure both on the training data and the test data:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">fit_obj_ts</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mgm</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data</span><span class="p">[</span><span class="n">ind</span><span class="p">,],</span><span class="w">
</span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s1">'g'</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">),</span><span class="w">
</span><span class="n">level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">),</span><span class="w">
</span><span class="n">lambdaSel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'CV'</span><span class="p">,</span><span class="w">
</span><span class="n">ruleReg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'OR'</span><span class="p">)</span><span class="w">
</span><span class="c1"># Compute Predictions on training data 60%</span><span class="w">
</span><span class="n">pred_obj_train</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit_obj_ts</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data</span><span class="p">[</span><span class="n">ind</span><span class="p">,],</span><span class="w">
</span><span class="n">errorCon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'R2'</span><span class="p">)</span><span class="w">
</span><span class="c1"># Compute Predictions on test data 40%</span><span class="w">
</span><span class="n">pred_obj_test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit_obj_ts</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data</span><span class="p">[</span><span class="o">!</span><span class="n">ind</span><span class="p">,],</span><span class="w">
</span><span class="n">errorCon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'R2'</span><span class="p">)</span></code></pre></figure>
<p>We now look at the mean predictability over nodes for the training and the test dataset:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mean</span><span class="p">(</span><span class="n">pred_obj_train</span><span class="o">$</span><span class="n">error</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w"> </span><span class="c1"># mean explained variance training data</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.6258235</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">pred_obj_test</span><span class="o">$</span><span class="n">error</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w"> </span><span class="c1"># mean explained variance test data</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.4915882</span></code></pre></figure>
<p>As expected, the explained variance is higher in the training dataset. This is because we fit the model to structure that is specific to the training data and is not present in the population (noise). Note that both means are lower than the mean we would get by taking the mean of the explained variances above, because we used fewer observations to estimate the model and hence have less power to detect edges.</p>
<p>While the explained variance values are lower in the test set, there is a positive relationship between the explained variance of a node in the training and the test set</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">cor</span><span class="p">(</span><span class="n">pred_obj_train</span><span class="o">$</span><span class="n">error</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">pred_obj_test</span><span class="o">$</span><span class="n">error</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.5539814</span></code></pre></figure>
<p>which means that if a node has high explained variance in the training set, it tends to also have a high explained variance in the test set.</p>
<h2 id="edit-nov-3rd-the-and--or-or-rule-and-predictability">Edit Nov 3rd: The AND- or OR-rule and Predictability</h2>
<p>In the above example I used the OR-rule to combine estimates in the <a href="http://www.jstor.org/stable/25463463">neighborhood regression approach</a>, without justifying why (thanks to <a href="https://scholar.google.com.br/citations?user=fH6qCDoAAAAJ&hl=en">Wagner de Lara Machado</a> for pointing this out). Here comes the explanation:</p>
<p>In the neighborhood regression approach to graph estimation we pick each node in the graph and regress all other nodes on this node. If we have three nodes <script type="math/tex">x_1</script>, <script type="math/tex">x_2</script>, <script type="math/tex">x_3</script>, this procedure leads to three regression models:</p>
<ol>
<li>
<script type="math/tex; mode=display">x_1 = \beta_{10} + \beta_{12} x_2 + \beta_{13} x_3</script>
</li>
<li>
<script type="math/tex; mode=display">x_2 = \beta_{20} + \beta_{21} x_1 + \beta_{23} x_3</script>
</li>
<li>
<script type="math/tex; mode=display">x_3 = \beta_{30} + \beta_{31} x_1 + \beta_{32} x_2</script>
</li>
</ol>
<p>This procedure leads to two estimates for the edge between <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script>: <script type="math/tex">\beta_{12}</script> from regression (1) and <script type="math/tex">\beta_{21}</script> from regression (2). If both parameters are nonzero, we clearly set the edge between <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> to present, and if both parameters are zero, we clearly set the edge between <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> to not present. However, in some cases the two estimates disagree and we need a rule for this situation: The OR-rule sets an edge to be present if <em>at least one</em> of the estimates is nonzero. The AND-rule sets an edge to be present only if <em>both</em> estimates are nonzero.</p>
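<p>As a minimal illustration of the two rules (a toy sketch, not the mgm implementation), suppose the two neighborhood regressions yield the following thresholded estimates for the edge between <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script>:</p>

```r
beta_12 <- 0.4  # estimate of the x1-x2 edge from regression (1)
beta_21 <- 0    # estimate of the same edge from regression (2)

edge_or  <- (beta_12 != 0) | (beta_21 != 0)  # OR-rule:  edge present (TRUE)
edge_and <- (beta_12 != 0) & (beta_21 != 0)  # AND-rule: edge absent (FALSE)
```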
<p>Now, to compute predictions and hence a measure of predictability we use the regression models 1-3. Let’s take regression model (3), where we predict <script type="math/tex">x_3</script> by <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script>. Now, if the betas agree (<script type="math/tex">\beta_{31}</script> and <script type="math/tex">\beta_{13}</script> agree and <script type="math/tex">\beta_{32}</script> and <script type="math/tex">\beta_{23}</script> agree), everything is fine. But if there is disagreement, we have the following problem:</p>
<ul>
<li>
<p>When using the AND-rule: if let’s say the parameter <script type="math/tex">\beta_{32}</script> is nonzero but <script type="math/tex">\beta_{23}</script> is zero, the AND rule sets the edge-parameter <script type="math/tex">x_3</script>-<script type="math/tex">x_2</script> in the graph to zero; however the parameter <script type="math/tex">\beta_{32}</script> will still be used for estimation of <script type="math/tex">x_3</script>. This leads to a predictability that is too high. Hence we could have a situation in which a node has no connection in the graph (obtained using the AND-rule) but has a nonzero predictability measure.</p>
</li>
<li>
<p>When using the OR-rule: if the parameter <script type="math/tex">\beta_{23}</script> is nonzero but <script type="math/tex">\beta_{32}</script> is zero, the OR-rule sets the edge-parameter <script type="math/tex">x_3</script>-<script type="math/tex">x_2</script> in the graph to be present; however, we use the (zero) parameter <script type="math/tex">\beta_{32}</script> in regression (3) for prediction. This leads to a predictability that is too small. Hence we could have a situation in which a node has a connection in the graph but has a zero predictability measure.</p>
</li>
</ul>
<p>Hence, when using the OR-rule, we <em>underestimate</em> the true predictability given the graph and thus get a <em>conservative</em> estimate of predictability in the graph. This is why I chose the OR-rule above.</p>
<p>Okay, but why don’t we adjust the parameters of the regression models 1-3 by setting parameters to zero (AND-rule) or filling in parameters (OR-rule)? This is not possible, because tinkering with the parameters will destroy the prediction model in most situations.</p>
<p>A possible way around this would be to take the estimated graph and then re-estimate the graph (by performing p regressions) but only use those variables as predictors that were connected to the predicted node in the initial graph. However, this 2-stage procedure would lead to a (possibly) completely different scaling for the estimation of each of the neighborhoods of the different nodes. This is likely to lead to an algorithm that does not consistently recover the true graph/network.</p>
Tue, 01 Nov 2016 00:00:00 +0000
http://jmbh.github.io//Predictability-in-network-models/
http://jmbh.github.io//Predictability-in-network-models/Graphical Analysis of German Parliament Voting Pattern<p>We use network visualizations to look into the voting patterns in the current German parliament. I downloaded the data <a href="https://www.bundestag.de/abstimmung">here</a> and all figures can be reproduced using the R code available on <a href="https://github.com/jmbh/bundestag">Github</a>.</p>
<p>Missing values, invalid votes, abstention from voting and not showing up for the vote were coded as (-1), such that all other responses are a yes (1) or no (2) vote. We use the Pearson correlation as a measure of voting similarity; voting behavior coded as (-1) is regarded as noise in the dataset. 36 of the 659 members of parliament were removed from the data because more than 50% of their votes were coded as (-1); the reason was that they either joined or left the parliament during the analyzed time period.</p>
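<p>The removal step described above might look roughly as follows (a hedged sketch; <code class="highlighter-rouge">votes</code> is an assumed members x bills matrix, not an object defined in this post):</p>

```r
# entries: 1 = yes, 2 = no, -1 = missing/invalid/abstained/absent
prop_noise <- rowMeans(votes == -1)
votes <- votes[prop_noise <= 0.5, ]  # drop members with more than 50% (-1)
```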
<p><em>Disclaimer: note that votes are recorded (and used here) for only a fraction of the bills passed in the German parliament, and that relations between single members of parliament might be artifacts of the noise-coding. Moreover, the data is quite scarce (136 bills). Therefore we should not draw any strong conclusions from this coarse-grained analysis.</em></p>
<h2 id="voting-pattern-amongst-members-of-parliament">Voting Pattern Amongst Members of Parliament</h2>
<p>We first compute the correlations between the voting behavior of all pairs of members of parliament, which gives us a 623 x 623 correlation matrix. We then visualize this correlation matrix using the force-directed <a href="https://en.wikipedia.org/wiki/Force-directed_graph_drawing">Fruchterman Reingold algorithm</a> as implemented in the <a href="https://cran.r-project.org/web/packages/qgraph/index.html">qgraph package</a>. This algorithm puts nodes (politicians) on the plane such that edges (connections) have comparable length and that edges are crossing as little as possible.</p>
<p><img src="http://jmbh.github.io/figs/bundestag/bundestag_cor_full.jpg" alt="center" /></p>
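<p>The two steps above, computing the correlation matrix and plotting it with a force-directed layout, can be sketched as follows (assuming a members x bills voting matrix <code class="highlighter-rouge">votes</code>; the actual code is on Github):</p>

```r
library(qgraph)
cormat <- cor(t(votes))            # 623 x 623 correlation matrix across bills
qgraph(cormat, layout = "spring")  # Fruchterman-Reingold layout
```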
<p>(For readers on R-Bloggers.com: <a href="http://jmbh.github.io/Analyzing-voting-pattern-of-German-parliament/">click here for the original post with larger figures.</a>)</p>
<p>Green edges indicate positive correlations (voter agreement) and red edges indicate negative correlations (voter disagreement). The width of the edges is proportional to the strength (absolute value) of the correlation. We see that the green party (B90/GRUENE) clusters together, as well as the left party (DIE LINKE). The third and biggest cluster consists of members of the two largest parties, the social democrats (SPD) and the conservatives (CDU/CSU). This is the structure we would expect intuitively, as social democrats and conservatives currently form the government in a grand coalition.</p>
<p>With some imagination, one could also identify a couple of subclusters in this large cluster. A detailed analysis of smaller clusters would be especially interesting if we had additional information about politicians. We could then see whether the cluster assignment computed from the voting behavior relates to these additional variables. For instance, politicians with close ties to the economy might vote together, irrespective of their party.</p>
<p>So far we assumed that we can adequately describe the voting pattern of the whole period from 26.11.2013 - 14.04.2016 with one graph. This implies that we assume that the relative voting behavior does not change over time. For example, this means that if members of parliament A and B agree on votes at the beginning of the period, they also agree throughout the rest of the period and do not start to disagree at some point. In the next section we check whether the voting behavior changes over time.</p>
<h2 id="voting-pattern-amongst-members-of-parliament-across-time">Voting Pattern Amongst Members of Parliament across Time</h2>
<p>To make graphs comparable over different time points and to be able to see growing (dis-) agreement between parties, we arrange individual members of parliament in circles that correspond to their parties. We compute a time-varying graph by visualizing a Gaussian kernel smoothed (bandwidth = .1, time interval [0,1]) correlation matrix at 20 equally spaced time points. Details can be found in the code used to create all figures, which is available <a href="https://github.com/jmbh/bundestag">here</a>. We then combine these 20 graphs into the following video:</p>
<p><img src="http://jmbh.github.io/figs/bundestag/bundestag_cor.gif" alt="center" /></p>
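<p>The kernel-smoothing step might be sketched as follows (an illustration under assumed inputs, not the actual code: <code class="highlighter-rouge">votes</code> is a members x bills matrix and <code class="highlighter-rouge">tp</code> holds the bill dates rescaled to the interval [0,1]):</p>

```r
# weighted correlation matrix around time point t0, bandwidth bw
smoothed_cor <- function(votes, tp, t0, bw = 0.1) {
  w <- dnorm(tp, mean = t0, sd = bw)        # Gaussian kernel weights
  cov.wt(t(votes), wt = w, cor = TRUE)$cor  # weighted correlation matrix
}
time_grid <- seq(0, 1, length.out = 20)     # 20 equally spaced time points
cor_list <- lapply(time_grid, function(t0) smoothed_cor(votes, tp, t0))
```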
<p>We see that right after the time the parliament was elected and the grand coalition was formed in November 2013, there is relatively high agreement between members of CDU/CSU and SPD. Within the next three years, however, the agreement decreases. With regard to the parties in the opposition, at the beginning of the period the green and the left party disagree to a similar degree with the grand coalition. Over time, however, it appears that the green party increasingly agrees with the grand coalition, while the left party agrees less and less with the CDU/CSU- and SPD-led government.</p>
<p>As the number of seats the parties have in the parliament differs widely, it is hard to read agreement <em>within</em> parties from the above graph. For instance, the circle of CDU/CSU seems to be filled with more and thicker green edges than that of the SPD; however, this could well be because there are simply more politicians (307 vs. 191) and hence more edges displayed. Therefore, we have a closer look at within-party agreement in the following graph:</p>
<center><img src="http://jmbh.github.io/figs/bundestag/bundestag_agreement_time.jpg" width="400" height="350" /></center>
<p>Collapsed over time, we see that the members of the left party agree most with each other and the members of the social democratic party agree the least with each other. The largest changes in agreement appear in the green and left parties: from late 2014 to mid 2015, members of the green party seem to agree less with each other than usual, while members of the left party seem to agree more with each other than usual.</p>
<h2 id="zoom-in-on-small-group-of-members-of-parliament">Zoom in on small Group of Members of Parliament</h2>
<p>While the analyses so far gave a comprehensive <em>overview</em> of the voting behavior amongst members of parliament, the graph is too large to see which node in the graph corresponds to which politician. In the following graph we zoom in on a random subset of 30 politicians and match the nodes to their names:</p>
<p><img src="http://jmbh.github.io/figs/bundestag/bundestag_cor_ss_names.jpg" alt="center" /></p>
<p>Note that correlations are bivariate measures and therefore the correlations in this smaller graph are the same as the ones in the larger graph above. We see the same overall structure as above, but now with names assigned to nodes. Again the members of the green party cluster together, but for instance Nicole Maisch votes more often together with Steffi Lempke than with the other displayed colleagues. We also see that for instance Steffen Kampeter and Christian Schmidt are both members of the conservative party, but are placed at quite distant locations in the graph (and indeed the correlation between their voting behavior is almost zero: -0.04).</p>
<p>Analogous to above, we now look into how voting agreement between the politicians in our subset changes over time by computing a time-varying graph as before:</p>
<p><img src="http://jmbh.github.io/figs/bundestag/bundestag_cor_ss.gif" alt="center" /></p>
<p>We see that voting agreement changes substantially: for instance, members of the opposition parties seem to agree less and less with the grand coalition until mid-2015 and then agree again more and more until the end of the period in early 2016. Some politicians seem to change their voting pattern quite dramatically: for example, the voting behavior of conservative party member Heike Bremer strongly correlates with the voting behavior of most of her party colleagues in 2014, but in late 2015 and early 2016 the correlations are close to zero. Also, interestingly, conservative Steffen Kampeter tends to vote in the opposite direction to his party colleagues in early 2014, but then agrees more and more with them until the last recorded votes.</p>
<h2 id="unique-agreement-between-members-of-parliament">‘Unique’ Agreement between Members of Parliament</h2>
<p>So far we looked into how the voting patterns of any pair of members of parliament correlate with each other. While this is an informative measure and gives a first overview of how politicians vote relative to each other, it is also a measure that is tricky to interpret. For instance two politicians of a party might always vote together because they always align their votes with their common mentor in the party. Or because there is pressure from the whole party to vote for a bill together. Or because they are both members of a specific think tank within the parliament, …</p>
<p>An interesting alternative measure is conditional correlation, which is the correlation between any two members of parliament, <em>after controlling for all other members of parliament</em>. In case of a conditional correlation between two members of parliament there are still many possible explanations (e.g. both might be influenced by some person <em>outside</em> the parliament), however, we are sure that this correlation cannot be explained by the voting pattern of any other member of parliament. We compute this conditional correlation graph and visualize it using the same layout as in the corresponding correlation graph:</p>
<p><img src="http://jmbh.github.io/figs/bundestag/bundestag_cond_ss_names.jpg" alt="center" /></p>
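<p>One common way to obtain such conditional (partial) correlations is to standardize the inverse of the correlation matrix. A sketch, assuming <code class="highlighter-rouge">cormat</code> is the 623 x 623 voting correlation matrix (note that with 623 politicians and only 136 bills the sample correlation matrix is singular, so in practice a regularized estimate would be needed before inverting):</p>

```r
prec <- solve(cormat)  # precision matrix (requires an invertible 'cormat')
pcor <- -prec / sqrt(outer(diag(prec), diag(prec)))  # partial correlations
diag(pcor) <- 1
```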
<p>It is apparent that there are fewer and weaker edges. Note that this is what we would expect in this dataset: in a parliament there is a general level of agreement within parties and also between parties; otherwise it would be difficult to pass bills. Therefore, we would expect that a substantial part of a correlation between the voting patterns of any two politicians can be explained by the voting patterns of other politicians. The strongest conditional correlation is the one between Nicole Gohlke and Norbert Mueller of the left party. For some reason these two politicians align their votes in a way that cannot be explained by the voting pattern of other politicians within and outside their party.</p>
<h2 id="concluding-comments">Concluding comments</h2>
<p>It came as quite a surprise to me that the large majority of votes on bills in the German parliament are not recorded and hence not available to the public (please correct me if I missed something). While this is a major reason to interpret these data with caution, on the other hand the votes on bills that <em>are</em> recorded are the more controversial and therefore probably more interesting ones.</p>
<p>The graphs in this post were the first few obvious things I wanted to look into, but of course many more analyses are possible. I put the preprocessed data (no information lost, just everything in 3 linked files instead of hundreds) on <a href="https://github.com/jmbh/bundestag">Github</a> along with the code that produces the above figures. In case you have any comments, complaints or questions, please comment below!</p>
Wed, 18 May 2016 00:00:00 +0000
http://jmbh.github.io//Analyzing-voting-pattern-of-German-parliament/
http://jmbh.github.io//Analyzing-voting-pattern-of-German-parliament/Interactions between Categorical Variables in Mixed Graphical Models<p>In a <a href="http://jmbh.github.io/Estimation-of-mixed-graphical-models/">previous post</a> we estimated a Mixed Graphical Model (MGM) on a dataset of <em>mixed variables</em> describing different aspects of the life of individuals diagnosed with Autism Spectrum Disorder, using the <a href="https://cran.r-project.org/web/packages/mgm/index.html">mgm package</a>. For interactions between continuous variables, the weighted adjacency matrix fully describes the underlying interaction parameter. Correspondingly, the parameters are represented in the graph visualization: the width of the edges is proportional to the absolute value of the parameter, and the edge color indicates the sign of the parameter. This means that we can clearly interpret an edge between two continuous variables as a positive or negative linear relationship of some strength.</p>
<p>Interactions between categorical variables, however, can involve several parameters that can describe non-linear relationships. A present edge between two categorical variables, or between a categorical and a continuous variable, only tells us that there is <em>some</em> interaction. In order to find out the exact nature of the interaction, we have to look at all estimated parameters. This is what this blog post is about.</p>
<p>We first re-estimate the MGM on the Autism Spectrum Disorder (ADS) dataset from this <a href="http://jmbh.github.io/Estimation-of-mixed-graphical-models/">previous post</a>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ADS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mgm</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">data</span><span class="p">),</span><span class="w">
</span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">type</span><span class="p">,</span><span class="w">
</span><span class="n">level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">level</span><span class="p">,</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="n">lambdaSel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'EBIC'</span><span class="p">,</span><span class="w">
</span><span class="n">lambdaGam</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.25</span><span class="p">)</span></code></pre></figure>
<p>We then plot the weighted adjacency matrix as in the previous blog post, however, we now group the variables by their type:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">groups_typeV</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s2">"Gaussian"</span><span class="o">=</span><span class="n">which</span><span class="p">(</span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">type</span><span class="o">==</span><span class="s1">'g'</span><span class="p">),</span><span class="w">
</span><span class="s2">"Poisson"</span><span class="o">=</span><span class="n">which</span><span class="p">(</span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">type</span><span class="o">==</span><span class="s1">'p'</span><span class="p">),</span><span class="w">
</span><span class="s2">"Categorical"</span><span class="o">=</span><span class="n">which</span><span class="p">(</span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">type</span><span class="o">==</span><span class="s1">'c'</span><span class="p">))</span><span class="w">
</span><span class="n">qgraph</span><span class="p">(</span><span class="n">fit_ADS</span><span class="o">$</span><span class="n">pairwise</span><span class="o">$</span><span class="n">wadj</span><span class="p">,</span><span class="w">
</span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'spring'</span><span class="p">,</span><span class="w"> </span><span class="n">repulsion</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.3</span><span class="p">,</span><span class="w">
</span><span class="n">edge.color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit_ADS</span><span class="o">$</span><span class="n">pairwise</span><span class="o">$</span><span class="n">edgecolor</span><span class="p">,</span><span class="w">
</span><span class="n">nodeNames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">colnames</span><span class="p">,</span><span class="w">
</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">groups_color</span><span class="p">,</span><span class="w">
</span><span class="n">groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">groups_typeV</span><span class="p">,</span><span class="w">
</span><span class="n">legend.mode</span><span class="o">=</span><span class="s2">"style2"</span><span class="p">,</span><span class="w"> </span><span class="n">legend.cex</span><span class="o">=</span><span class="m">.8</span><span class="p">,</span><span class="w">
</span><span class="n">vsize</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="n">esize</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">)</span><span class="w">
</span><span class="n">dev.off</span><span class="p">()</span></code></pre></figure>
<p>The above code produces the following figure:</p>
<p><img src="http://jmbh.github.io/figs/2017-11-30-Closer-Look/Fig_mgm_application_Autism_byTypes.png" alt="center" /></p>
<p>Red edges correspond to negative edge weights and green edges correspond to positive edge weights. The width of the edges is proportional to the absolute value of the parameter weight. Grey edges connect categorical variables to continuous variables or to other categorical variables; these edges are computed from more than one parameter and thus we cannot assign a sign to them.</p>
<p>While the interaction between continuous variables can be interpreted as a conditional covariance similar to the well-known multivariate Gaussian case, the interpretation of edge-weights involving categorical variables is more intricate as they are composed of several parameters. In the following two sections we show how to retrieve the necessary parameters from the <code class="highlighter-rouge">fit_ADS</code> object in order to interpret interactions between continuous and categorical, and between categorical and categorical variables.</p>
<h2 id="interpretation-of-interaction-continuous---categorical">Interpretation of Interaction: Continuous - Categorical</h2>
<p>We first consider the edge weight between the continuous Gaussian variable ‘Working hours’ and the categorical variable ‘Type of Work’, which has the categories (1) No work, (2) Supervised work, (3) Unpaid work and (4) Paid work.</p>
<p>In order to get the necessary parameters, we look up in which row this pairwise interaction is listed in <code class="highlighter-rouge">fit_ADS$rawfactor$indicator[[1]]</code>. We look at the first list entry here because we are looking for a pairwise interaction; if we had estimated an MGM involving 3-way interactions, those would be listed in the second list entry, etc. Here, however, we look for the pairwise interaction between ‘Type of Work’ (16) and ‘Working hours’ (17), and find it in row 86:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fit_ADS</span><span class="o">$</span><span class="n">rawfactor</span><span class="o">$</span><span class="n">indicator</span><span class="p">[[</span><span class="m">1</span><span class="p">]][</span><span class="m">86</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">16</span><span class="w"> </span><span class="m">17</span></code></pre></figure>
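Instead of scanning the indicator matrix by eye, the row can also be located programmatically. A minimal sketch, assuming the fitted `fit_ADS` object from above and that the first indicator entry is a two-column matrix of variable pairs:

```r
# Locate the row of the pairwise interaction between 'Type of Work' (16)
# and 'Working hours' (17); in the fit above this is row 86
ind <- fit_ADS$rawfactor$indicator[[1]]
which(ind[, 1] == 16 & ind[, 2] == 17)
```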
<p>Using the row number, we can now look up all estimated parameters in <code class="highlighter-rouge">fit_ADS$rawfactor$weights</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fit_ADS</span><span class="o">$</span><span class="n">rawfactor</span><span class="o">$</span><span class="n">weights</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">86</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">-14.6460488</span><span class="w"> </span><span class="m">-0.7576681</span><span class="w"> </span><span class="m">0.7576681</span><span class="w"> </span><span class="m">1.4885513</span><span class="w">
</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">V</span><span class="m">16.2</span><span class="w"> </span><span class="m">0.5150313</span><span class="w">
</span><span class="n">V</span><span class="m">16.3</span><span class="w"> </span><span class="m">1.3871043</span><span class="w">
</span><span class="n">V</span><span class="m">16.4</span><span class="w"> </span><span class="m">1.7926628</span></code></pre></figure>
<p>The first entry corresponds to the regression on ‘Type of Work’ (16). Since we model the probability of every level of a categorical variable (see the <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/pdf/nihms201118.pdf">glmnet paper</a> for a detailed explanation), we get a parameter for ‘Working hours’ for each of the four levels of ‘Type of Work’. We see a large negative parameter for the first category of ‘Type of Work’, which is ‘No work’. This makes sense, since in the data all individuals with no work logically also work 0 hours. The differences between the remaining categories are less pronounced. However, we see that the more hours one works, the more likely one is to be in category (4) ‘Paid work’.</p>
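To see how such multinomial parameters translate into category probabilities, one can push them through a softmax. This is only an illustrative sketch using the (rounded) parameters from the output above, with intercepts and all other predictors omitted:

```r
# Multinomial logistic model: P(category k) is proportional to exp(beta_k * x),
# where x is 'Working hours'; intercepts omitted for illustration
beta <- c(-14.65, -0.76, 0.76, 1.49)  # rounded parameters from the output above
softmax <- function(x) exp(x) / sum(exp(x))
round(softmax(beta * 8), 3)  # at 8 working hours, 'Paid work' (4) dominates
round(softmax(beta * 0), 3)  # at 0 hours, all categories are equally likely
                             # (only because intercepts are omitted here)
```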
<p>The second entry corresponds to the regression on ‘Working hours’ (17). Now the categorical variable is a predictor, which means that its first category is coded as a dummy category that is absorbed into the intercept. Note that we could also model all categories explicitly by using the overparameterized parameterization, by setting <code class="highlighter-rouge">overparameterize = TRUE</code> in <code class="highlighter-rouge">mgm()</code>. Here we see that being in category (3) ‘Unpaid work’ predicts more working hours than being in category (2) ‘Supervised work’, and that being in category (4) ‘Paid work’ predicts more working hours than being in category (3) ‘Unpaid work’, which makes sense.</p>
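The dummy (reference) coding described above can be inspected directly with base R's `model.matrix()`. A small self-contained sketch with a hypothetical four-level factor standing in for ‘Type of Work’:

```r
# Reference coding: the first level is absorbed into the intercept,
# so a 4-level factor contributes only 3 indicator columns
work_type <- factor(c("no", "supervised", "unpaid", "paid"),
                    levels = c("no", "supervised", "unpaid", "paid"))
model.matrix(~ work_type)  # intercept column plus 3 dummy columns
```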
<p>In order to interpret the interaction between ‘Type of Work’ (16) and ‘Working hours’ (17), one can choose either of the two regressions. One or the other may be more appropriate, depending on which interpretation is easier to understand or which regression reflects the more plausible causal direction.</p>
<h2 id="interpretation-of-interaction-categorical---categorical">Interpretation of Interaction: Categorical - Categorical</h2>
<p>Next we consider the edge weight between the categorical variables (14) ‘Type of Housing’ and the variable (16) ‘Type of Work’ from above. ‘Type of Housing’ has two categories, (a) ‘Not independent’ and (b) ‘Independent’. As in the previous example, we look up the row of the pairwise interaction in <code class="highlighter-rouge">fit_ADS$rawfactor$indicator[[1]]</code>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fit_ADS</span><span class="o">$</span><span class="n">rawfactor</span><span class="o">$</span><span class="n">indicator</span><span class="p">[[</span><span class="m">1</span><span class="p">]][</span><span class="m">81</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">14</span><span class="w"> </span><span class="m">16</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fit_ADS</span><span class="o">$</span><span class="n">rawfactor</span><span class="o">$</span><span class="n">weights</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">81</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">V</span><span class="m">16.2</span><span class="w"> </span><span class="m">0.00000000</span><span class="w">
</span><span class="n">V</span><span class="m">16.3</span><span class="w"> </span><span class="m">-0.08987943</span><span class="w">
</span><span class="n">V</span><span class="m">16.4</span><span class="w"> </span><span class="m">-0.62798733</span><span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">V</span><span class="m">16.2</span><span class="w"> </span><span class="m">0.00000000</span><span class="w">
</span><span class="n">V</span><span class="m">16.3</span><span class="w"> </span><span class="m">0.08987943</span><span class="w">
</span><span class="n">V</span><span class="m">16.4</span><span class="w"> </span><span class="m">0.62798733</span><span class="w">
</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">-0.5882431</span><span class="w"> </span><span class="m">-0.1582227</span><span class="w"> </span><span class="m">0.1582227</span><span class="w"> </span><span class="m">1.7812274</span></code></pre></figure>
<p>The first entry of <code class="highlighter-rouge">fit_ADS$rawfactor$weights[[1]][[81]]</code> shows the interaction between (14) ‘Type of Housing’ and (16) ‘Type of Work’ from the regression on ‘Type of Housing’. We predict the probability of both (a) ‘Not independent’ and (b) ‘Independent’. ‘Type of Work’ is a predictor variable, hence its first category is a dummy category that gets absorbed into the intercept. We see that ‘Unpaid work’ and ‘Paid work’ increase the probability of living independently, with the latter increasing this probability more.</p>
<p>The second entry shows the same interaction from the regression on ‘Type of Work’. We now have 4 parameters, corresponding to the 4 categories of ‘Type of Work’. Since ‘Type of Housing’ has only two categories and the first one (a) is a dummy category that gets absorbed into the intercept, only the indicator function for (b) is left as a predictor. We see that the better the work situation is, the higher the probability that the individual lives independently, which makes sense.</p>
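As a purely illustrative sketch, these second-entry parameters can also be pushed through a softmax to see how the ‘Independent’ indicator shifts the work-type probabilities. Intercepts and all other predictors are omitted, so the baseline (indicator = 0) is uniform by construction:

```r
# Effect of the 'Independent' housing indicator on the four 'Type of Work'
# category probabilities (rounded values from the output above)
beta_housing <- c(-0.588, -0.158, 0.158, 1.781)
softmax <- function(x) exp(x) / sum(exp(x))
round(softmax(beta_housing * 1), 3)  # living independently: mass shifts to 'Paid work'
round(softmax(beta_housing * 0), 3)  # not independent: uniform, since intercepts
                                     # are omitted in this sketch
```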
<p>As above, to interpret the interaction one can choose either of the two regressions, for instance the one that is easier to interpret and/or the one that reflects the more plausible causal direction.</p>
Fri, 29 Apr 2016 00:00:00 +0000
http://jmbh.github.io//Interactions-between-categorical-Variables-in-mixed-graphical-models/
Estimating Mixed Graphical Models<p>Determining conditional independence relationships through undirected graphical models is a key component in the statistical analysis of complex observational data in a wide variety of disciplines. In many situations one seeks to estimate the underlying graphical model of a dataset that includes <em>variables of different domains</em>.</p>
<p>As an example, take a typical dataset in the social, behavioral and medical sciences, where one is interested in interactions, for example between gender or country (categorical), frequencies of behaviors or experiences (count) and the dose of a drug (continuous). Other examples are Internet-scale marketing data or high-throughput sequencing data.</p>
<p>There are methods available to estimate mixed graphical models from mixed continuous data; however, these usually have two drawbacks: first, there is a possible information loss due to necessary transformations, and second, they cannot incorporate (nominal) categorical variables (for an overview see <a href="http://arxiv.org/abs/1510.05677">here</a>). When using the recently introduced class of <a href="http://www.jmlr.org/proceedings/papers/v33/yang14a.pdf">Mixed Graphical Models (MGMs)</a>, we avoid these problems because we are able to model each variable on its proper domain.</p>
<p>In the following, we use the R package <a href="https://cran.r-project.org/web/packages/mgm/index.html">mgm</a> to estimate a Mixed Graphical Model on a data set consisting of questionnaire responses of individuals diagnosed with Autism Spectrum Disorder. This dataset includes variables of different domains, such as age (continuous), type of housing (categorical) and number of treatments (count).</p>
<p>The dataset consists of responses of 3521 individuals diagnosed with Autism Spectrum Disorder (ASD) to a questionnaire including 28 variables from the continuous, count, and categorical domains, and is automatically loaded with the <a href="https://cran.r-project.org/web/packages/mgm/index.html">mgm</a> package.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="nf">dim</span><span class="p">(</span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">data</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">3521</span><span class="w"> </span><span class="m">28</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">data</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">]</span><span class="w">
</span><span class="n">Gender</span><span class="w"> </span><span class="n">IQ</span><span class="w"> </span><span class="n">Age</span><span class="w"> </span><span class="n">diagnosis</span><span class="w"> </span><span class="n">Openness</span><span class="w"> </span><span class="n">about</span><span class="w"> </span><span class="n">Diagnosis</span><span class="w"> </span><span class="n">Success</span><span class="w"> </span><span class="n">selfrating</span><span class="w">
</span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="m">-0.9605781</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">2.21</span><span class="w">
</span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="m">-0.5156103</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">6.11</span><span class="w">
</span><span class="m">3</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="m">-0.7063108</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">5.62</span><span class="w">
</span><span class="m">4</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="m">-0.4520435</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">8.00</span></code></pre></figure>
<p>We use our knowledge about the variables to specify the domain (type) of each variable and the number of levels for categorical variables (for non-categorical variables we choose 1 by convention). “c”, “g”, and “p” stand for categorical, Gaussian and Poisson (count), respectively:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">type</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="s2">"c"</span><span class="w"> </span><span class="s2">"g"</span><span class="w"> </span><span class="s2">"g"</span><span class="w"> </span><span class="s2">"c"</span><span class="w"> </span><span class="s2">"g"</span><span class="w"> </span><span class="s2">"c"</span><span class="w"> </span><span class="s2">"c"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"p"</span><span class="w">
</span><span class="p">[</span><span class="m">14</span><span class="p">]</span><span class="w"> </span><span class="s2">"c"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"c"</span><span class="w"> </span><span class="s2">"g"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"p"</span><span class="w"> </span><span class="s2">"g"</span><span class="w"> </span><span class="s2">"g"</span><span class="w"> </span><span class="s2">"g"</span><span class="w"> </span><span class="s2">"g"</span><span class="w"> </span><span class="s2">"g"</span><span class="w">
</span><span class="p">[</span><span class="m">27</span><span class="p">]</span><span class="w"> </span><span class="s2">"c"</span><span class="w"> </span><span class="s2">"g"</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">level</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">3</span><span class="w">
</span><span class="p">[</span><span class="m">28</span><span class="p">]</span><span class="w"> </span><span class="m">1</span></code></pre></figure>
<p><a href="https://cran.r-project.org/web/packages/mgm/index.html">mgm</a> allows estimating k-order MGMs (for more details see <a href="https://arxiv.org/abs/1510.06871">here</a>). Here we are interested in fitting a pairwise MGM, and we therefore choose <code class="highlighter-rouge">k = 2</code>. In order to get a sparse graph, we use L1-penalized regression, which minimizes the negative log likelihood together with the L1 norm of the parameter vector. This penalty is weighted by a parameter <script type="math/tex">\lambda</script>, which can be selected either using cross-validation (<code class="highlighter-rouge">lambdaSel = "CV"</code>) or an information criterion, such as the Extended Bayesian Information Criterion (EBIC) (<code class="highlighter-rouge">lambdaSel = "EBIC"</code>). Here, we choose to use the EBIC with a hyperparameter of <script type="math/tex">\gamma = 0.25</script>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">mgm</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ADS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mgm</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">data</span><span class="p">),</span><span class="w">
</span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">type</span><span class="p">,</span><span class="w">
</span><span class="n">level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">level</span><span class="p">,</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="n">lambdaSel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'EBIC'</span><span class="p">,</span><span class="w">
</span><span class="n">lambdaGam</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.25</span><span class="p">)</span></code></pre></figure>
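For comparison, lambda can also be selected by cross-validation instead of the EBIC. This is a sketch under the assumption that `lambdaFolds` is the argument controlling the number of folds (see `?mgm`):

```r
# Alternative: select the regularization parameter lambda by
# 10-fold cross-validation instead of the EBIC
fit_ADS_cv <- mgm(data = as.matrix(autism_data_large$data),
                  type = autism_data_large$type,
                  level = autism_data_large$level,
                  k = 2,
                  lambdaSel = "CV",
                  lambdaFolds = 10)  # lambdaFolds assumed per ?mgm
```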
<p>The fit function returns all estimated parameters and a weighted adjacency matrix. Here we use the <a href="http://www.jstatsoft.org/article/view/v048i04/v48i04.pdf">qgraph</a> package to visualize the weighted adjacency matrix. We separately provide the edge color for each edge, which indicates the sign of the edge parameter, if defined. For more information on the signs of edge parameters and when they are defined, see the <a href="https://arxiv.org/abs/1510.06871">mgm paper</a> or the help file <code class="highlighter-rouge">?mgm</code>. We also provide a grouping of the variables and associated colors, both of which are contained in the data list <code class="highlighter-rouge">autism_data_large</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># plot</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">qgraph</span><span class="p">)</span><span class="w">
</span><span class="n">qgraph</span><span class="p">(</span><span class="n">fit_ADS</span><span class="o">$</span><span class="n">pairwise</span><span class="o">$</span><span class="n">wadj</span><span class="p">,</span><span class="w">
</span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'spring'</span><span class="p">,</span><span class="w"> </span><span class="n">repulsion</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.3</span><span class="p">,</span><span class="w">
</span><span class="n">edge.color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit_ADS</span><span class="o">$</span><span class="n">pairwise</span><span class="o">$</span><span class="n">edgecolor</span><span class="p">,</span><span class="w">
</span><span class="n">nodeNames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">colnames</span><span class="p">,</span><span class="w">
</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">groups_color</span><span class="p">,</span><span class="w">
</span><span class="n">groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">autism_data_large</span><span class="o">$</span><span class="n">groups_list</span><span class="p">,</span><span class="w">
</span><span class="n">legend.mode</span><span class="o">=</span><span class="s2">"style2"</span><span class="p">,</span><span class="w"> </span><span class="n">legend.cex</span><span class="o">=</span><span class="m">.4</span><span class="p">,</span><span class="w">
</span><span class="n">vsize</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="n">esize</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">)</span><span class="w">
</span></code></pre></figure>
<p><img src="http://jmbh.github.io/figs/2015-10-31-Estimation-of-mixed-graphical-models/Fig_mgm_application_Autism.png" alt="center" /></p>
<p>The layout is created using the <a href="https://en.wikipedia.org/wiki/Force-directed_graph_drawing">Fruchterman-Reingold algorithm</a>, which places nodes such that all the edges are of more or less equal length and there are as few crossing edges as possible. Green edges indicate positive relationships, red edges indicate negative relationships and grey edges indicate relationships involving categorical variables for which no sign is defined. The width of the edges is proportional to the absolute value of the edge-parameter. The node color maps to the different domains Demographics, Psychological, Social Environment and Medical.</p>
<p>We observe, for instance, a strong positive relationship between age and age of diagnosis, which makes sense because the two variables are logically connected (one cannot be diagnosed before being born). The negative relationship between number of unfinished educations and satisfaction at work seems plausible, too. Well-being is strongly connected in the graph, with the strongest connections to satisfaction with social contacts and integration in society. These three variables are categorical variables with 5, 3 and 3 categories, respectively. In order to investigate the exact nature of the interaction, one needs to look up all parameters in <code class="highlighter-rouge">fit_ADS$rawfactor$indicator</code> and <code class="highlighter-rouge">fit_ADS$rawfactor$weights</code>.</p>
<p>For more examples on how to use the mgm package see the helpfiles in the package or the <a href="https://arxiv.org/abs/1510.06871">mgm paper</a>. For a tutorial on how to interpret interactions between categorical variables in MGMs see <a href="https://jmbh.github.io/Interactions-between-categorical-Variables-in-mixed-graphical-models/">here</a>. For a tutorial on how to compute nodewise predictability in MGMs see <a href="https://jmbh.github.io/Predictability-in-network-models/">here</a>.</p>
Mon, 30 Nov 2015 00:00:00 +0000
http://jmbh.github.io//Estimation-of-mixed-graphical-models/