\documentclass[11pt,letterpaper]{article}
\usepackage{graphicx}
%\usepackage[lmargin=3cm,rmargin=3cm,tmargin=3cm,bmargin=3cm]{geometry}
\usepackage{subfig}
\usepackage{paralist}
\usepackage{natbib}
\usepackage{color}
\usepackage{url}
\usepackage{hyperref}
\usepackage[utf8]{inputenc}
\newcommand{\correct}[1]{{\color{red}#1\color{black}}}
\newcommand{\todo}[1]{{\color{red}{\bf #1}\color{black}}}
\newcommand{\code}[1]{{\tt #1}}
\newcommand{\daic}{\Delta \mbox{AIC}}
\newcommand{\pkglink}[1]{\href{http://cran.r-project.org/web/packages/#1}{\nolinkurl{#1}}}
\newcommand{\rflink}[1]{\href{https://r-forge.r-project.org/projects/#1/}{\nolinkurl{#1}}}
\newcommand{\fnlink}[2]{\href{http://finzi.psych.upenn.edu/R/library/#1/html/#2.html}{\nolinkurl{#1:#2}}}
\usepackage{fancyvrb}
\VerbatimFootnotes
\begin{document}
%\linenumbers
\title{GLMMs in action: gene-by-environment interaction in total fruit production of wild populations of \emph{Arabidopsis thaliana} \\\textbf{Revised version, part 2}}
\author{Benjamin M. Bolker\footnote{
$^1$Departments of Mathematics \& Statistics and Biology, McMaster University, Hamilton, Ontario, Canada L8S 4K1;
$^2$ Department of Biology, University of Florida, PO Box 118525, Gainesville FL 32611-8525;
$^3$ Nicholas School of the Environment, Duke University, PO Box 90328, Durham, NC 27708;
$^4$School of Biological Sciences, Victoria University of Wellington PO Box 600, Wellington 6140, New Zealand;
$^5$Department of Botany, Miami University, Oxford OH, 45056;
$^6$Department of Biological Sciences, California State University, Chico, Chico, CA 9529-515.
\emph{Corresponding author}: Bolker, B. M. ({\tt bolker@mcmaster.ca})} \and
Mollie E. Brooks$^2$\and Connie J. Clark$^3$\and Shane W. Geange$^4$\and John R. Poulsen$^3$\and M. Henry H. Stevens$^5$ \and Jada-Simone S. White$^6$}
\maketitle
\SweaveOpts{eps=FALSE,prefix=FALSE,keep.source=TRUE,include=FALSE}
<<echo=FALSE>>=
options(width=80, contrasts=c("contr.treatment", "contr.poly"), continue=" ",
SweaveHooks=list(fig=function() par(bty="l",las=1)))
if (require(cacheSweave)) {
setCacheDir("banta_cache_part2")
}
@
\section{Introduction}
This is revised supplementary material for \citet{Bolker+2008b}.
As described in part~1 of this document,
we analyze the effects of nutrient availability and simulated herbivory (clipping) on total fruit production in genotypes and populations of mouse-ear cress (\emph{Arabidopsis thaliana}:
\cite{banta_comprehensive_2010}) \footnote{Data and context are provided courtesy of J. Banta and M. Pigliucci, State University of New York (Stony Brook).}.
Part~1 describes the data and the context of the analysis
and gives a detailed example of exploratory graphical analysis, model fitting,
model reduction to remove overfitted components, and inference on the fitted model(s).
In part~1, we focus on the \code{glmer} function from the \code{lme4} package, which
is in some ways the most developed package for fitting GLMMs in R.
In this part, we extend the analysis to compare the procedure and results from
\code{glmer} with other approaches (\code{MCMCglmm}, \code{MASS::glmmPQL},
\code{glmmADMB}, and \code{lmer} in the \code{lme4} package).
Load packages:
<<>>=
## install.packages("coefplot2",repos="http://r-forge.r-project.org")
library(coefplot2) ## for coefplot2
library(reshape)
library(plyr)
library(ggplot2)
library(emdbook) ## for qchibarsq
library(bbmle) ## for AICtab
library(lme4)
library(MCMCglmm)
source("glmm_funs.R")
@
Load results from part~1:
<<>>=
load("Banta_part1.RData")
@
\footnote{
We used \Sexpr{R.version$version.string} and package versions:
<<echo=FALSE>>=
usedpkgs <- sort(c("coefplot2","ggplot2","plyr","reshape",
"glmmADMB","bbmle","MCMCglmm","coda","lme4",
"RLRsim"))
i1 <- installed.packages()
print(i1[usedpkgs,"Version"],quote=FALSE)
@
The \rflink{coefplot2} and \rflink{glmmADMB} packages must be
installed from \url{http://r-forge.r-project.org}; all others are on CRAN.
}
Later we will also load the \code{glmmADMB} package, but there
are conflicts between some of its methods and \code{lme4}, so
we will wait to load it until we really need it.
\section{MCMCglmm}
Now we'll use the \pkglink{MCMCglmm} package to fit the model.
The main advantages of \code{MCMCglmm} are that it (1) uses
Bayesian Markov chain Monte Carlo (MCMC) methods, which are at present
one of the most reliable ways to get confidence intervals
on variance components, especially with small data sets, and (2) is
quite flexible (it allows estimation of zero-inflated
distributions, multivariate data, setting of priors, etc.).
Its disadvantages are that it is a bit slower than
\code{glmer} and that, as a side effect of its flexibility,
its syntax is slightly more complicated.
<<>>=
library(MCMCglmm)
@
The random effects in an \code{MCMCglmm} fit are specified
as \verb~v(X):g~ where \code{g} is the grouping factor
(population or genotype in our case); \code{X} is the
set of fixed effects that vary across the levels of the
grouping factor (\code{1}, \code{amd}, \code{nutrient}, etc.);
and \code{v} specifies the structure of the variance-covariance
matrix. Using \code{us} would fit an unstructured (i.e. fully
flexible) variance-covariance matrix, as is the default in
\code{glmer}. This failed with the error
\code{ill-conditioned G/R structure: use proper priors if you haven't
or rescale data if you have}, so we instead used the \code{idh}
specification which makes the variation in each effect independent.
We fit three models --- one with variation in nutrient, clipping,
and interaction effects at both levels; one with variation in
clipping effects at the population level and intercept variation
at the genotype level; and one with intercept-only variation
at population and genotype levels (i.e. equivalent to our final model,
\code{mp4}, in Part~1).
<<mcmc0,eval=FALSE>>=
mcmc0 <- MCMCglmm(total.fruits ~ nutrient*amd +
rack + status,
random=~us(amd*nutrient):popu+us(amd*nutrient):gen,
data=dat.tf, family="poisson",verbose=FALSE)
@
<<mcmcfits,cache=TRUE>>=
## use independent effects (idh); full covariance matrix (us) fails
mcmc1 <- MCMCglmm(total.fruits ~ nutrient*amd +
rack + status,
random=~idh(amd*nutrient):popu+idh(amd*nutrient):gen,
data=dat.tf, family="poisson",verbose=FALSE)
mcmc2 <- MCMCglmm(total.fruits ~ nutrient*amd +
rack + status,
random=~idh(amd):popu+idh(1):gen,
data=dat.tf, family="poisson",verbose=FALSE)
mcmc3 <- MCMCglmm(total.fruits ~ nutrient*amd +
rack + status,
random=~idh(1):popu+idh(1):gen,
data=dat.tf, family="poisson",verbose=FALSE)
@
Trace plots
are the first graphical diagnostic for
MCMC-based estimation.
For the full model,
the trace plots for the fixed effects
look as they should, with no evidence of trends or slowly varying
temporal patterns (Figure~\ref{fig:mcmcglmmfixtrace}).
<<MCMCglmmtrace,fig=TRUE>>=
print(xyplot(as.mcmc(mcmc1$Sol),layout=c(3,3)))
@
\begin{figure}[ht]
\begin{centering}
\includegraphics[width=.9\linewidth]{MCMCglmmtrace}
\caption{Trace plots for fixed effects in \code{MCMCglmm} fit of the full model.}
\label{fig:mcmcglmmfixtrace}
\end{centering}
\end{figure}
However, the variance parameters don't look good (Figure~\ref{fig:mcmcglmmrtrace}) --- many of the
variance parameters are getting stuck at zero for long periods of
time, or taking brief excursions to extreme values, or both.
<<MCMCglmmvtrace,fig=TRUE>>=
print(xyplot(as.mcmc(mcmc1$VCV),layout=c(3,3)))
@
\begin{figure}[ht]
\begin{centering}
\includegraphics[width=.9\linewidth]{MCMCglmmvtrace}
\caption{Trace plots for random effects in \code{MCMCglmm} fit of the full model.}
\label{fig:mcmcglmmrtrace}
\end{centering}
\end{figure}
Unfortunately, even in the simplest model the trace plots for the variances (not shown)
don't look much better: the genotype-level
variation still spends much of its time stuck at zero.\footnote{
For completeness, we show here how to compare the fixed-effect
parameters across the \code{MCMCglmm} fits and the
original \code{glmer} fit, and how to
compare variance parameters across \code{MCMCglmm} fits:
<<>>=
coefplot2(list("MCMC 1"=mcmc1,
"MCMC 2"=mcmc2,
"MCMC 3"=mcmc3,
"glmer"=mp4),merge.names=FALSE,
intercept=TRUE,
legend=TRUE,
legend.x="right")
@
<<>>=
op <- par(mfrow=c(2,1))
coefplot2(mcmc1,ptype="vcov",intercept=TRUE,
varnames=c(outer(c("base","clip","nutr","inter"),
c("pop","gen"),paste,sep="_"),"obs"))
coefplot2(mcmc3,ptype="vcov",col="red",
varnames=c("base_pop","base_gen","obs"),
intercept=TRUE,
top.axis=FALSE,main="",xlim=c(0,3))
par(op)
@
}
To be more confident in the results, we would try (1) dropping
the genotype-level variance term; (2) setting a weakly informative
prior to push the estimate of the genotype-level variance away from
zero and improve mixing; or (3) running the chain for much longer
(e.g. setting \code{nitt=1e6}, \code{thin=1000}).
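As a sketch of options (2) and (3) combined, one could give the two random-effect variances weakly informative, parameter-expanded priors and run a much longer, more thinly sampled chain. The prior values below are illustrative, not tuned for these data:
<<mcmc3long,eval=FALSE>>=
## Sketch only: parameter-expanded (weakly informative) priors for the two
## random-effect variances (G1 = popu, G2 = gen) plus a much longer chain;
## the prior values are illustrative, not tuned for these data
prior.px <- list(R=list(V=1, nu=0.002),
                 G=list(G1=list(V=1, nu=1, alpha.mu=0, alpha.V=25^2),
                        G2=list(V=1, nu=1, alpha.mu=0, alpha.V=25^2)))
mcmc3L <- MCMCglmm(total.fruits ~ nutrient*amd + rack + status,
                   random=~idh(1):popu+idh(1):gen,
                   prior=prior.px,
                   nitt=1e6, thin=1000, burnin=10000,
                   data=dat.tf, family="poisson", verbose=FALSE)
@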
In general, we recommend \code{coefplot2} from the \code{coefplot2}
package, along with \code{xyplot} and \code{densityplot} from
the \code{coda} package, for graphical summaries, and
\code{raftery.diag}, \code{geweke.diag}, \code{effectiveSize},
and \code{HPDinterval} from the \code{coda} package for numerical
summaries of MCMC information.
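For example, numerical summaries for the variance parameters of the simplest model could be obtained along these lines (a sketch using the \code{coda} functions just mentioned):
<<codadiag,eval=FALSE>>=
## Sketch: numerical MCMC diagnostics for the variance parameters
library(coda)
vcv <- as.mcmc(mcmc3$VCV)
effectiveSize(vcv)  ## effective sample size for each variance parameter
geweke.diag(vcv)    ## Z-scores comparing early and late segments of the chain
HPDinterval(vcv)    ## 95% highest posterior density intervals
@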
\section{glmmADMB}
<<>>=
detach("package:coefplot2")
detach("package:lme4")
library(glmmADMB)
library(lme4) ## load lme4 AFTERWARDS
@
\footnote{We would also have to \code{detach} the
\code{doBy} package at this point, if it were already loaded;
it requires \code{lme4}, so \code{lme4} can't be unloaded
without unloading \code{doBy} first.}
We run a batch file that fits NB1 and NB2
(i.e. linear- and quadratic-variance parameterizations of
the negative binomial: see Part~1), with and without zero-inflation,
using models such as
<<gnb2,eval=FALSE>>=
gnb2 <- glmmadmb(total.fruits ~ nutrient*amd +
rack + status,
random=~(amd*nutrient|gen)+
(amd*nutrient|popu),
data=dat.tf, family="nbinom")
@
where we specify \code{family="nbinom1"} for NB1 fits and
\code{zeroInflation=TRUE} for zero-inflated models.
(In \code{glmmADMB}
the random effects can be specified either in the same formula,
as in the \code{lme4} package, or in a separate \code{random}
argument, as in \code{nlme} or \code{MCMCglmm}.)
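For example, the NB1 and zero-inflated variants of the model above would be specified along the following lines (a sketch; these fits are slow, which is why we load saved results instead):
<<gnbvariants,eval=FALSE>>=
## Sketch of the variant fits produced by the batch file (not run here)
gnb1 <- glmmadmb(total.fruits ~ nutrient*amd + rack + status,
                 random=~(amd*nutrient|gen)+(amd*nutrient|popu),
                 data=dat.tf, family="nbinom1")       ## NB1
gnb2Z <- glmmadmb(total.fruits ~ nutrient*amd + rack + status,
                  random=~(amd*nutrient|gen)+(amd*nutrient|popu),
                  data=dat.tf, family="nbinom",
                  zeroInflation=TRUE)                 ## NB2, zero-inflated
@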
<<>>=
load("Banta_glmmADMB_fits.RData")
@
The loaded R object \code{fits} is a list
of fits with names of the form \code{gnb1}, \code{gnb2}, \code{gnb1Z},
and \code{gnb2Z}, denoting negative binomial type~1
(linear variance-mean relationship) or type~2 (quadratic
variance-mean relationship), with (\code{Z}) or without zero-inflation.
\code{gnb1B} is an additional model fit with nutrient effects
at the genotype level and intercept effects at the population
level, while \code{gnb1C} is our final model (\code{mp4} from Part~1:
intercept variance at genotype and population levels).
<<>>=
AICtab(fits)
@
It looks like NB1 is much better than NB2 ($\approx 100$ AIC units)
\ldots but zero-inflation
isn't doing anything.
Comparing the fixed-effect parameters (Figure~\ref{fig:gnbcoefs}):
<<gnbcoefs,fig=TRUE>>=
library(coefplot2)
ff <- fits[c("gnb1","gnb2","gnb1B","gnb1C")]
names(ff) <- c("NB1, full","NB2, full",
"NB1, nutxgen","NB1, int")
coefplot2(ff,legend.x="right",legend=TRUE)
@
\begin{figure}[ht]
\begin{centering}
\includegraphics[width=.7\linewidth]{gnbcoefs}
\caption{Comparison of fixed-effect estimates for models using the type~1 (linear variance: NB1) or type~2 (quadratic variance: NB2) negative binomial distribution and either:
all random effects (\code{full}); nutrient effects at the genotype level, but intercept effects at the population level (\code{nutxgen});
or the simplest model, with intercept effects estimated at both genotype and population levels (\code{int}).}
\label{fig:gnbcoefs}
\end{centering}
\end{figure}
Figure~\ref{fig:gnbcoefs} shows that the parameter estimates differ slightly
(NB2 estimates a stronger interaction
effect), but not qualitatively --- and we should go with
the better-fitting model in any case.
Switching to NB1 helps with, but does not
entirely eliminate, the artifacts in the location-scale plot.
Figure~\ref{fig:newlocscale} compares
the lognormal-Poisson fit from \code{glmer}
(i.e., \code{mp4} from Part~1) with
the NB1 fit from \code{glmmADMB}.
<<gnbresid,fig=TRUE>>=
op <- par(mfrow=c(1,2))
locscaleplot(mp4)
locscaleplot(fits$gnb1C)
par(op)
@
\begin{figure}[ht]
\begin{centering}
\includegraphics[width=.7\linewidth]{gnbresid}
\end{centering}
\caption{Location-scale plots of our final model from \code{glmer} (using a lognormal-Poisson distribution) and from
\code{glmmADMB} (using a type 1 negative binomial fit). Each plot has a superimposed loess fit.}
\label{fig:newlocscale}
\end{figure}
\section{Via LMM approximation (\code{lme4::lmer})}
Would we have come to a significantly different answer via the
traditional normal approximation? In general, we agree with those
who argue that one should prefer modeling solutions that explicitly
take the distribution of the data into account rather than
transforming to achieve normality \citep{ohara_not_2010,warton_arcsine_2011},
but since GLMMs can be significantly harder to fit than LMMs it
is worth comparing the approaches.
We will again use the final model we
settled on in part~1 (\code{mp4},
genotype- and population-level random intercepts):
<<>>=
library(lme4) ## re-load lme4 package
lm1 <- lmer(log(1+total.fruits)~nutrient*amd+rack+status+
(1|popu)+(1|gen),data=dat.tf)
@
The location-scale plot (Figure~\ref{fig:lmmlocscale}) looks reasonable, although it again
contains some artifacts based on the zero observations.
<<lmmlocscale,fig=TRUE>>=
locscaleplot(lm1,col=ifelse(dat.tf$total.fruits==0,"blue","black"))
@
\begin{figure}[ht]
\begin{centering}
\includegraphics[width=.7\linewidth]{lmmlocscale}
\end{centering}
\caption{Location-scale plot of our final model via LMM approximation using \code{lmer},
with superimposed loess fit. Zeros in the data are marked in blue.}
\label{fig:lmmlocscale}
\end{figure}
The Q-Q plot (Figure~\ref{fig:qqplot}) is more questionable.
<<qqplot2,fig=TRUE>>=
r <- residuals(lm1)
qqnorm(r)
qqline(r,col=2)
@
\begin{figure}[ht]
\begin{centering}
\includegraphics[width=.7\linewidth]{qqplot2}
\caption{Q-Q plot of the
residuals from a full model using a normal distribution
on $\log(1+x)$-transformed data (\code{lmer}).}
\label{fig:qqplot}
\end{centering}
\end{figure}
Neither the fixed-effect estimates nor the random-effect variances
differ dramatically between the LMM and GLMM fits (Figure~\ref{fig:lmerglmercomp}).
<<lmmglmmcomp,fig=TRUE>>=
op <- par(mfrow=c(2,1))
coefplot2(list(lm1,mp4))
coefplot2(list(lmer=lm1,glmer=mp4),ptype="vcov",
main="",intercept=TRUE,
xlim=c(0,1.5),
legend=TRUE)
par(op)
@
\begin{figure}[ht]
\begin{centering}
\includegraphics[width=.7\linewidth]{lmmglmmcomp}
\caption{Regression estimates for fixed effects (top) and random effects estimates / predictions (bottom) for both LMM (black = \code{lmer}) and GLMM (red = \code{glmer}) fits. Also included are estimates of the observation-level standard deviation for each model (\code{SD.resid.} and \code{SD.X.(Intercept)}, respectively).}
\label{fig:lmerglmercomp}
\end{centering}
\end{figure}
We briefly considered testing to see whether the LMM or the GLMM fits the data better,
but it is a little tricky (one has to add an additional factor to the log-likelihood
to account for the $\log(1+x)$ transformation), and we decided not to try.
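(For reference, the additional factor is the Jacobian of the change of variables $y = \log(1+x)$: since $dy/dx = 1/(1+x)$, putting the LMM log-likelihood on the scale of the original counts would require the adjustment
\[
\log L_{\mbox{\scriptsize orig}} = \log L_{\mbox{\scriptsize LMM}} - \sum_i \log(1 + x_i),
\]
i.e. \code{logLik(lm1) - sum(log(1 + dat.tf\$total.fruits))} in R --- and even then the comparison mixes discrete and continuous probability models.)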
As Jack Weiss comments in
\href{http://www.unc.edu/courses/2010fall/ecol/563/001/docs/lectures/lecture15.htm#caveats}{his course notes},
\begin{quote}
The best recommendation I can make is not to [compare count-data models such as GLMMs to continuous-data models such as transformed LMMs].
The use of continuous distributions to model count data dates from a time when fitting discrete probability distributions to data was difficult. Yet the use of the normal distribution as a probability model for count data or log-transformed count data is still widespread. Does it make sense to use AIC or log-likelihood to compare continuous probability models with discrete probability models?
\end{quote}
He goes on to point out some fairly subtle difficulties involved in such comparisons.
\section{Penalized quasi-likelihood (\code{MASS::glmmPQL})}
In the first version of this document, we used \code{family="quasipoisson"} in
\code{glmer} to account for overdispersion.
The \code{glmer} function no longer accepts \code{quasi} families, so we
cannot directly compare our new approach with observation-level variance
(Part~1), but we can use the \code{glmmPQL} function in the \code{MASS} package
to test a quasi-likelihood approach.
<<>>=
library(MASS)
@
<<mp1Q>>=
mp1Q <- glmmPQL(total.fruits ~ nutrient*amd +
rack + status,
random=list(~amd*nutrient|popu,
~amd*nutrient|gen),
data=dat.tf, family="quasipoisson")
@
<<echo=FALSE>>=
library(nlme)
vv <- rev(sort(suppressWarnings(as.numeric(VarCorr(mp1Q)))[1:11]))
@
The variance-covariance structure is a bit unwieldy:
<<>>=
VarCorr(mp1Q,rdig=2)
@
The take home message is that most of the estimated variance components
are very small: the three largest
variance components are the residual variance ($\sigma^2 \approx \Sexpr{round(vv[1])}$),
population-intercept ($\sigma^2 \approx \Sexpr{round(vv[2],2)}$),
and variation in nutrient effect across genotypes
($\sigma^2 \approx \Sexpr{round(vv[3],4)}$).
Formal significance testing of quasi-likelihood models is
difficult because we do not have a log-likelihood or its derived quantities
to work with, but we feel safe in asserting that this method does not
detect variation that the other methods failed to discover.
(Unfortunately, this also leads us to the conclusion that our
previous results --- from the first version of the supplementary
online material, published in 2009 --- based on quasi-likelihood methods with
\code{glmer} were flawed.)
The fixed-effect coefficients from all of the methods are similar (Figure~\ref{fig:finalcoefplot}).
<<finalcoefplot,fig=TRUE>>=
coefplot2(list(glmer=mp4,
MCMCglmm=mcmc3,
lmer=lm1,
glmmPQL=mp1Q,
glmmADMB_NB1=fits$gnb1C),
merge.names=FALSE,intercept=TRUE,
legend.x="right",legend=TRUE)
@
\begin{figure}[ht]
\begin{centering}
\includegraphics[width=.7\linewidth]{finalcoefplot}
\end{centering}
\caption{Summary of regression estimates of fixed effects for our final model from each of the major estimation methods discussed.}
\label{fig:finalcoefplot}
\end{figure}
While the pattern is not perfectly consistent,
Figure~\ref{fig:finalcoefplot}
shows that in general
\code{glmmADMB} and \code{glmmPQL}
results are most similar (both assume the
variance increases linearly with the mean);
the \code{lmer} fit
is slightly different
($V \propto \mu^2$); and
the \code{MCMCglmm} and
\code{glmer} fits ($V \propto \mu (1+C \mu)$)
are similar to each other.
(Surprisingly, a negative binomial (NB2) fit
from \code{glmmADMB} did not closely match
the \code{glmer} and \code{MCMCglmm} fits,
as we would have expected because it has
the same mean-variance relationship.)
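For reference, the mean-variance relationships referred to above are, in standard parameterizations,
\[
V_{\mbox{\scriptsize NB1}} = \phi \mu, \qquad
V_{\mbox{\scriptsize NB2}} = \mu + \mu^2/k, \qquad
V_{\mbox{\scriptsize LNP}} = \mu + \mu^2 \left( e^{\sigma^2} - 1 \right),
\]
where LNP denotes the lognormal-Poisson model (\code{glmer} with an observation-level random effect of variance $\sigma^2$); the LNP expression is exactly the $\mu(1 + C\mu)$ form given above, with $C = e^{\sigma^2} - 1$.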
However, these differences are really very small
when considered in biological terms.
In this
example, if we had proceeded carefully we would
have drawn the same biological conclusions no
matter what model we started with:
\begin{itemize}
\item There is insufficient information (alas!) to
reliably quantify
the variation in nutrient/clipping/interaction effects
across genotypes and populations;
\item Total fruit set varies both among genotypes within
a population and among populations;
\item Nutrient addition has a positive effect, clipping has a
negative effect, and adding nutrients appears to cancel
out the effect of clipping.
\end{itemize}
\bibliographystyle{ESA1009}
\bibliography{glmm}
\end{document}