Tutorial on fitting models M1 and M2 in R
August 2011
This tutorial demonstrates site population size estimation using models M1 and M2 described
by Dail and Madsen, Journal of Agricultural, Biological, and Environmental Statistics, 2011.
It largely follows the form of the tutorials provided by J.A. Royle and by D. Dail for
fitting the Generalized N-mixture models for repeated point counts, closed and open
populations, respectively.
The tutorial for closed populations is given in the supplemental appendix of Kéry, Royle, and
Schmid, Modeling avian abundance from replicated counts using binomial mixture models,
Ecological Applications, 2005. Likewise, the tutorial for open populations is given in the
Web Appendix of Dail and Madsen (2011), Models for estimating abundance from repeated counts
of an open metapopulation, Biometrics (67), 577-587.
This tutorial assumes the reader has some familiarity with R.
The R function "model_M1" takes a while to run; the time required for finding MLEs will be
reduced by working on a 64-bit computer (though this is not required).
Commands intended to be entered as R code into the R console are prefixed by ">", with
commentary between the commands describing the process of fitting models M1 and M2.
First, load the R file Web_Supplement_Files.R into an active R session, using the command:
> source("<path>/Web_Supplement_Files.R")
where <path> is the complete path name of the directory containing the file
Web_Supplement_Files.R. This loads 4 objects into the session, which can be verified using the
command:
> ls()
All four objects are R functions. The function "model_M1" includes a call to the compiled C
function named "M1_Trans", and the function "M1_ans" includes a call to the compiled C
function named "M1_ANS". These C functions are used to speed up calculations.
These functions require the logit and inverse-logit (expit) transforms, which are built by the
following commands:
> expit = function(x) { return((exp(x))/(1+exp(x))) }
> logit = function(x) { return(log(x/(1-x))) }
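As a quick sanity check (ours, not part of the supplement), the two transforms should invert each other:

```r
# expit and logit, as defined above
expit = function(x) { return((exp(x))/(1+exp(x))) }
logit = function(x) { return(log(x/(1-x))) }
# Applying one after the other should recover the original probabilities
p = c(0.1, 0.5, 0.9)
all.equal(expit(logit(p)), p)  # TRUE
```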
Before we start, the C functions "M1_Trans" and "M1_ANS" must be downloaded and compiled.
The steps required to do this in Windows are as follows.
1. Get Rtools here:
http://www.murdoch-sutherland.com/Rtools/
To install Rtools, run the executable. In setup, use all defaults and MAKE SURE THAT BOTH
BOXES ARE CHECKED on the screen titled, "Select Additional Tasks."
2. Save the files "Web_Supplement_Files.R", "M1_Trans.c", and "M1_ANS.c" in R's bin directory
(typically something like C:\Program Files\R\R-2.9.2\bin).
3. Open up the Windows command window: Start > Run > cmd
4. In the command window, cd to R's bin directory. For the pathway given above, the command
looks like this:
cd "C:\Program Files\R\R-2.9.2\bin"
5. Compile M1_Trans.c. This command produces the compiled file M1_Trans.so.
R CMD SHLIB M1_Trans.c -o M1_Trans.so
6. Back in the R console, load the file "M1_Trans.so" file into R, using the command:
> dyn.load("./M1_Trans.so")
It is a good idea to verify that the file loaded properly, using the command:
> is.loaded("M1_Trans") # should return TRUE
7. Repeat steps 5 and 6, replacing "M1_Trans" with "M1_ANS".
A webpage authored and maintained by Duncan Murdoch at the University of Western Ontario
offers assistance in the interface between R and C, and can be found at
http://www.stats.uwo.ca/faculty/murdoch/software/compilingDLLs/index.html#badname
Next, we need to load the observed data and build a matrix of observed counts and sampling
days. The data set we will use here is described in Section 6 of the accompanying manuscript.
It consists of live adult coho counts obtained in 28 sites in the Siuslaw basin in 2007.
We first will load the point counts collected in this study:
> R=28
> n.vec = scan()
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA NA
NA NA NA NA NA NA NA NA NA 1 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA 0 0 NA 0 NA NA NA NA NA NA NA NA NA NA NA 0 0 0 NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA 1 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0
0 NA NA NA NA NA NA NA NA NA NA 0 0 0 0 NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 5 NA NA NA NA NA NA NA
NA NA NA 0 0 NA 0 NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA 0 NA 0 0 0 0 NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 NA
0 0 NA 0 0 NA NA NA NA NA 0 NA 0 0 0 NA NA 0 0 NA 0 NA NA NA NA 0 NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA 0 2 NA NA NA NA NA NA NA NA NA NA 0 NA NA 0 NA
NA 0 0 NA 0 NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA 0 NA NA NA NA NA 0 0 NA 0 0 0 NA 0 NA NA NA
NA NA 0 NA 0 0 0 0 6 0 0 NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 0 NA 0 NA
NA NA 1 0 NA NA NA NA NA 4 NA 1 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 NA 3 0 6 NA NA NA NA NA NA NA 0 NA NA NA
3 0 11 2 0 NA 0 NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA 0 NA 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 NA 1 0 3 NA 8 0
0 NA NA NA NA NA 0 0 NA NA NA NA NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA 0 0 0 0 0 0 0 NA NA NA 0 NA NA NA NA NA NA 0 2 NA NA NA 0 NA
NA NA NA NA NA NA NA NA NA NA NA 0 0 0 NA 0 NA 2 NA NA NA NA NA NA NA 2 NA 1 9 0
NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 NA 7 0 0 NA NA 1 0 NA NA NA NA NA 0 4
NA NA NA 4 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 1
3 NA NA 0 0 NA NA 0 1 NA NA NA NA NA NA 0 0 NA NA NA 0 NA NA NA NA NA NA NA NA NA
NA NA NA 2 0 NA NA 2 0 0 NA NA NA NA NA NA NA 1 NA 0 1 0 NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA 1 0 0 NA NA NA NA NA NA NA NA NA 8 2 NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA 0 0 NA NA NA NA NA NA NA 0 NA 0 0 NA NA 3 0 0 NA
NA NA NA NA NA NA NA 0 3 NA NA NA 0 NA NA NA NA NA NA 0 NA NA 0 0 NA NA NA 0 0 1
0 0 NA NA 0 NA NA NA NA 0 NA 0 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
0 0 NA NA NA NA NA NA NA NA NA NA 0 0 0 NA 3 NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA 0 0 NA NA NA NA NA NA NA NA NA NA 0 2 0 2 0 0 NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA 0 0 0 NA NA NA NA NA NA 1 NA
NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 NA NA 0 NA NA NA NA 0
NA 2 0 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 1 NA NA NA NA NA NA NA NA
NA NA 1 0 0 NA 2 3 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA 0 NA NA 0 0 0 0 1 0 0 0 NA NA NA NA NA NA NA 0 0 NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 NA NA NA 0 NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 NA 0 NA NA 0 NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA 0 NA NA NA NA NA NA 0 0 NA NA 0 1 1 1 0 NA NA NA NA NA NA
0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 NA 0 0 0 NA NA 0
0 0 0 NA NA NA NA NA NA 0 NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 NA NA 0 NA 0 0 0 NA 0 NA
NA NA NA NA NA 0 NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
0 0 NA NA NA NA NA 0 NA NA NA NA NA NA 0 0 1 0 NA NA 0 NA NA NA NA 0 NA NA NA NA
NA NA NA 0 0 NA NA NA NA 0 0 NA NA NA NA NA NA NA NA NA NA 0 NA NA NA 0 NA NA NA NA
NA NA NA 1 NA 1 0 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 NA 0 1 0 NA 0 0
0 NA NA NA NA NA 0 0 NA NA NA NA NA 0 NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA 0 NA 0 NA NA NA NA 0 0 NA 0 0 NA NA NA NA NA 0 NA NA NA NA 0 NA
NA NA NA NA NA 0 NA 0 0 NA NA NA NA 0 NA NA 0 0 NA NA 0 0 NA NA NA 0 0 0 1 0
NA NA NA NA NA NA NA 0 NA NA 0 NA NA NA NA NA 0 0 NA NA 0 0 0 NA NA NA NA NA 0 0
0 NA 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
NA NA 0 NA 0 0 NA 0 0 0 NA NA NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA 0
NA NA NA NA 0 NA NA 0 NA NA NA 0 NA NA NA NA NA NA NA 0 0 0 NA NA NA NA NA NA NA NA
NA NA NA NA NA NA
>
>
Enter all the lines above, including the blank one just above; entering a blank line signals
to scan() that the data entry has stopped.
> n3.uniq = matrix(n.vec, nrow=R)
> n.it3 = n3.uniq
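Note that matrix() fills its data column by column by default, so n.vec lists the counts for all 28 sites on the first sampling day, then all 28 sites on the next day, and so on; n3.uniq is then a 28-row matrix with one row per site. A toy illustration of the fill order:

```r
# matrix() fills column by column: the first nrow values form column 1, etc.
toy = matrix(1:6, nrow=2)
toy
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
```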
Then, we need a matrix of sampling dates for each entry of the point count matrix:
> Date.row = c(1,8,9,10,15,16,17,18,22,23,24,25,30,31,32,36,37,38,43,44,45,46,52,53,54,
> 55,57,58,59,60,64,65,66,67,69,73,74,75,76,77,80,81,82,83,86,87,88,89,92,93,94,95,100,
> 101,102,107,108)
> Date3.uniq = matrix( rep(Date.row,R), nrow=R, byrow=TRUE)
> Date3 = Date3.uniq
Take a look at the first 5 rows of data:
> rbind(Date3.uniq[1,], n3.uniq[1:5,])
Now, we need to find starting values for fitting model_M1. For these data, we will use the
values of p and s assumed by the AUC model as starting values in the search for the MLEs,
which are p=0.75 and s=11.3.
For the rate parameters, we will use properties of the observed counts:
> x=numeric(0)
> y=numeric(0)
> for(i in 1:(dim(n.it3)[1])){
> x = c(x,as.vector(na.omit(n.it3[i,2:length(n.it3[i,])])))
> y = c(y,Date3[i,which(!is.na(n.it3[i,2:length(n.it3[i,])]))+1]-7)
> }
> non.zero=y[x>0]
> min.pos = min(non.zero) #first observed non-zero count
> max.pos = max(non.zero) #last observed non-zero count
> mean.pos = (min.pos+max.pos)/2 #middle day between the previous two
Now, we need a, b, and c for the quadratic form r = exp(c + a*d + b*d^2), where d is the sampling
date.
> b.in = -1/(((max.pos-min.pos)/5)^2*2) # plus or minus 2.5 sd's get all observed data
> a.in = -mean.pos*2*b.in
> b.trans = log(-b.in) # transformed so that b is negative.
We'll investigate two starting values, differing in the height of the arrival rate curve.
The first uses the max count and the second uses 70% of the max count.
> lam.max = max(x)
> lam.max.7=max(x)*.7 # helps to reduce the max observed count here, in finding c
> c0.in= log( lam.max / (pnorm(mean.pos+3,mean=-a.in/(2*b.in),sd=sqrt(-1/(2*b.in)))-
> pnorm(mean.pos-3,mean=-a.in/(2*b.in),sd=sqrt(-1/(2*b.in)))) *
> -b.in/sqrt(-b.in*pi) / exp(-a.in^2/(4*b.in)) )
> c0.in.7= log( lam.max.7 / (pnorm(mean.pos+3,mean=-a.in/(2*b.in),sd=sqrt(-1/(2*b.in)))-
> pnorm(mean.pos-3,mean=-a.in/(2*b.in),sd=sqrt(-1/(2*b.in)))) *
> -b.in/sqrt(-b.in*pi) / exp(-a.in^2/(4*b.in)) )
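To see what such starting values imply, it can help to plot the arrival rate curve r = exp(c + a*d + b*d^2); the coefficients below are illustrative stand-ins, not the data-derived values computed above:

```r
# Illustrative coefficients for the quadratic-exponential arrival rate curve
c0 <- -4; a <- 0.2; b <- -0.002
d <- 1:108                     # the sampling dates here run from day 1 to day 108
r <- exp(c0 + a*d + b*d^2)
plot(d, r, type="l", xlab="sampling day", ylab="arrival rate")
# With b < 0 the curve is bell-shaped, peaking at d = -a/(2*b)
-a/(2*b)  # 50
```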
The starting parameter vectors are:
> pars.in = c(log(11.3),c0.in,a.in,b.trans,logit(.75))
> pars.in.7=c(log(11.3),c0.in.7,a.in,b.trans,logit(.75))
Now we can find the negative log likelihood for each of these possible starting values. The
provided R function model_M1() takes four inputs:
1. vars gives the vector of starting values in the MLE search, in the order of
log(s), c, a, log(-b), logit(p). The log and logit transformations allow an
unrestricted parameter space in R^5.
2. n.it gives a matrix of the observed counts. Each row gives a different site, and each
column gives a different sampling day, in order with sampling day 1 in column 1.
Occasions when a site was not surveyed should be recorded as NA.
3. Date gives a matrix of the dates associated with observed counts. This should have
identical rows, and the left-most entries should be 1, with each column giving the
next day of sampling.
4. K gives the cut-off value for the infinite sums over site abundances A.it. The
default is 80, but should be higher when observed counts are large; different K
values can be tested until one is found that is sufficiently large so that
increasing it further does not affect the resulting negative log likelihood.
Note that increasing K increases the time required to run the function model_M1().
The function model_M1() returns the negative log likelihood of the data given the parameter
values.
We will here use a value of 80 for K, since the largest observed count is only 11. One can
verify that increasing K above 80 does not change the results below.
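The reason K = 80 suffices here can be seen from the tail of the latent abundance distribution. For instance, for a Poisson-distributed abundance with mean anywhere near the observed counts (at most 11), essentially no probability mass lies beyond 80:

```r
# Upper-tail Poisson mass beyond K = 80, for a mean equal to the largest observed count
1 - ppois(80, lambda=11)  # effectively zero
```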
Note that each of the next two commands takes about 28 minutes to run.
> hold.1= model_M1(vars=pars.in, n.it=n3.uniq, Date=Date3.uniq, K=80)
> hold.7= model_M1(vars=pars.in.7, n.it=n3.uniq, Date=Date3.uniq, K=80)
> c(hold.1,hold.7) # should give 1354.2558 and 974.2455
Now, we will use optim() to find the MLE. More help for optim() can be found by typing
> help(optim)
Note that running the following command takes approximately 15 days on a 64-bit server.
> Lit.mle = optim(pars.in.7, model_M1, n.it=n3.uniq, Date=Date3.uniq, K=80, method="BFGS",
> hessian=TRUE, control=list(parscale=c(1,20,.4,.02,.4)) )
First, verify convergence:
> Lit.mle$conv # should be 0
Then, investigate the Hessian matrix. Eigenvalues near 0 or negative indicate over-
parameterization:
> min(eigen(Lit.mle$hess)$val) # should be approx. 0.975
Check out the negative log likelihood and the MLEs:
> Lit.mle$val # should be 321.1458
> Lit.mle$par # should be 2.5975736 -5.3256695 0.2216255 -5.8723271 -1.1620682
Next, we can get (asymptotic) 95% confidence intervals for p and s, using the inverse Hessian
at the MLEs as the variance-covariance matrix and back-transforming from logit(p) and log(s).
> p.ci = expit(Lit.mle$par[5]+c(-1.96,1.96)*sqrt( solve(Lit.mle$hess)[5,5]))
> s.ci = exp(Lit.mle$par[1]+c(-1.96,1.96)*sqrt( solve(Lit.mle$hess)[1,1]))
> p.ci # should be (0.1388, 0.3779)
> s.ci # should be (9.0356, 19.9650)
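Both intervals follow the same pattern (back-transform est +/- 1.96 standard errors), so a small convenience wrapper (ours, not part of the supplement) can save typing:

```r
# Wald interval on the transformed scale, mapped back by `trans`
wald.ci <- function(est, var, trans = identity, level = 0.95) {
  z <- qnorm(1 - (1 - level)/2)   # 1.959964 for level = 0.95
  trans(est + c(-z, z) * sqrt(var))
}
```

For example, wald.ci(Lit.mle$par[5], solve(Lit.mle$hess)[5,5], expit) reproduces p.ci (up to the rounding of 1.96), and wald.ci(Lit.mle$par[1], solve(Lit.mle$hess)[1,1], exp) reproduces s.ci.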
Now we can get the estimates for site population size. To do this, we can use the function
M1_ans(), which takes the same inputs as model_M1() and returns a list of two components.
The first component is a vector of estimated site abundances, and the second is a vector of
values related to the arrival rate curve. Note that the following takes several hours to run.
> M1.ans.ret = M1_ans(vars=Lit.mle$par, n.it=n3.uniq, Date=Date3.uniq, K=80)
> M1.hat = M1.ans.ret[[1]]
The summaries of the arrival rate curve are four numbers, as follows:
1. Integral of the arrival curve from day 1 to the last sampling day
2. The mean arrival day
3 and 4. The days between which 95% of arrivals occur
> M1.ans.ret[[2]] # should be 12.69210 39.34677 13.23112 65.46241
Next, we will fit model M2. The inputs for model_M2 are the same as those for model_M1; the
arrival curve is different, and requires different initial values.
Again we will need some starting values for our search for the MLEs:
> Date=Date3.uniq
> n.it.z = n3.uniq
> n.it.z[which(is.na(n3.uniq))]=0
> sum.obs.1 = apply(n.it.z,2,sum)
> dates.1 = rep(Date[1,],sum.obs.1)
We will again use the observed counts to dictate our starting values for the rate curve.
Below, midd.in, ht.in, and delta.in will be the middle, height, and standard deviation of the
arrival curve, respectively. Note that it is helpful for ht.in to be positive.
> midd.in = median(dates.1)
> ht.in = log(max(apply((n.it.z),2,sum))/dim(n.it.z)[1] )
> if(ht.in<0) ht.in = max(apply(n.it.z,2,sum)) / dim(n.it.z)[1]
> first = quantile(dates.1,.025)
> last = quantile(dates.1,.975)
> delta.in= as.numeric( (midd.in-first + last-midd.in)/2)
The initial values use rate = exp(var.2 + var.3*d + var.4*d^2), where d is sampling day. The
values of var.2, var.3, and var.4 can be determined from the quantities midd.in, ht.in, and
delta.in from above:
> var.2 = ht.in-midd.in^2*ht.in/delta.in^2
> var.3 = 2*midd.in*ht.in/delta.in^2
> var.4 = -abs(ht.in)/delta.in^2
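These expressions come from expanding ht.in*(1 - ((d - midd.in)/delta.in)^2) as a quadratic in d, so (for positive ht.in) the log rate peaks at d = midd.in with value ht.in. A quick check with illustrative numbers:

```r
midd.in <- 55; ht.in <- 0.5; delta.in <- 25     # illustrative values only
var.2 <- ht.in - midd.in^2*ht.in/delta.in^2
var.3 <- 2*midd.in*ht.in/delta.in^2
var.4 <- -abs(ht.in)/delta.in^2
q <- function(d) var.2 + var.3*d + var.4*d^2    # the exponent of the rate curve
all.equal(q(midd.in), ht.in)                    # TRUE: peak of height ht.in at d = midd.in
```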
It is helpful to transform var.4, using the following:
> var.4.new2 = logit(.25*abs(var.4))
Again, the starting values for p and s are 0.75 and 11.3, which come from the values assumed
for AUC:
> p.in = .75
> ms.in = 11.3
> var.1 = log(ms.in)
> var.5 = logit(p.in)
Our vector of initial values is therefore:
> vars.1=c(var.1,var.2,var.3,var.4.new2,var.5)
Next we will obtain the negative log likelihood at the starting value. The likelihood for
model M2 is calculated using the provided R function model_M2(). Inputs are the same as for
model_M1() above.
> h1=model_M2(vars.1, n.it=n3.uniq, Date=Date3.uniq, K=80)
> h1 # should be 428.4991
Now, use optim() to find the MLE. (Note: running the following command takes approximately 3
days on a 64-bit server.)
> open.mle = optim(vars.1, model_M2, n.it=n3.uniq, Date=Date3.uniq, K=80, method="BFGS",
> hessian=TRUE, control=list(parscale=c(1,10,.6,.2,.5)) )
First, verify convergence:
> open.mle$conv # should be 0
Then, investigate the Hessian matrix. Eigenvalues near 0 or negative indicate over-
parameterization:
> min(eigen(open.mle$hess)$val) # should be 0.89784
Check out the negative log likelihood and MLEs:
> open.mle$val # should be 323.1582
> open.mle$par # should be 2.6175175 -3.5223278 0.2172912 -7.3551756 -1.1625109
Next, we can get (asymptotic) 95% confidence intervals for p and s, again back-transforming
from logit(p) and log(s).
> p.ci.2 = expit(open.mle$par[5]+c(-1.96,1.96)*sqrt( solve(open.mle$hess)[5,5]))
> s.ci.2 = exp(open.mle$par[1]+c(-1.96,1.96)*sqrt( solve(open.mle$hess)[1,1]))
> p.ci.2 # should be (0.1388, 0.3777)
> s.ci.2 # should be (9.1433, 20.5327)
Now we can get the estimates for site population size. The function M2_ans() takes the same
inputs as model_M2(), and returns a vector of site abundance estimates. It takes about 30
seconds to run.
> M2.hat = M2_ans(n.it=n3.uniq, vars=open.mle$par, Date=Date3.uniq, K=80)
Lastly, we can get the AUC estimate of site population sizes, where we assume p is 0.75 and
s is 11.3:
> p.old=0.75
> MS.old=11.3
> AUC.old = numeric(dim(n3.uniq)[1])
> for(i in 1:dim(n3.uniq)[1]){
> T = sum(!is.na(n3.uniq[i,]))
> n.t = n3.uniq[i,which(!is.na(n3.uniq[i,]))]
> Date.t = Date3.uniq[i,which(!is.na(n3.uniq[i,]))]
> j.num.old=numeric(T)
> for(j in 2:T){
> j.num.old[j] = (Date.t[j]-Date.t[j-1])*(n.t[j]/p.old/MS.old +n.t[j-1]/p.old/MS.old)
> }
> AUC.old[i] = .5*sum(j.num.old) + .5 * n.t[1]/p.old + .5*n.t[T]/p.old
> }
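The loop above is the trapezoid rule applied to the detection-corrected counts n.t/p.old/MS.old, plus half-weight corrections at the two endpoints. A toy example (two visits, with p.old and MS.old set to 1 so the counts enter directly):

```r
n.t <- c(2, 4); Date.t <- c(1, 3)    # toy counts on sampling days 1 and 3
p.old <- 1; MS.old <- 1; T <- 2
j.num.old <- numeric(T)
for (j in 2:T) {
  j.num.old[j] <- (Date.t[j]-Date.t[j-1])*(n.t[j]/p.old/MS.old + n.t[j-1]/p.old/MS.old)
}
auc <- 0.5*sum(j.num.old) + 0.5*n.t[1]/p.old + 0.5*n.t[T]/p.old
auc  # trapezoid area 0.5*2*(2+4) = 6, plus end corrections 1 + 2, gives 9
```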
A helpful reference is the max observed count and total count at each site:
> max.n=numeric(0)
> sum.n=numeric(0)
> for(i in 1:dim(n3.uniq)[1]){
> n.t = n.it3[i,which(!is.na(n.it3[i,]))]
> max.n = c(max.n, max(n.t))
> sum.n = c(sum.n, sum(n.t))
> }
A table with all the various estimates can now be compiled:
> ans.all = cbind(max.n, sum.n, AUC.old, M1.hat, M2.hat)
> round(ans.all,2)
The matrix ans.all should match Table 3 of the accompanying manuscript. Finally, we can get
the column totals:
> round(apply(ans.all,2,sum),2)