Sunday 28 August 2011

Likelihood estimation and optimisation

We noticed when generating the multivariate dataset, mydata, that although the mvrnorm command allows us to specify the correlation between the simulated variables, the actual correlation observed may depart from this value, especially when the sample size is small. You will also be familiar with the idea of a 'nonsignificant' yet positive correlation. For instance, if in a sample of 20 cases you obtain a correlation of .2, this is not statistically significant, which is to say that it is not reliably different from zero. These observations illustrate the point that any observed statistic is an imprecise estimate of a population parameter; the degree of precision will depend in part on the sample size. Rather than just asking whether a correlation is significant or not, we can adopt a different approach: estimating the likelihood that the observed value comes from a population with a given true value. This is a key concept that you need to grasp in order to understand how OpenMx works, and so we will spend some time illustrating it with examples.
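If you want to check that claim for yourself, the p-value for a correlation of .2 in a sample of 20 can be computed from the standard t-transformation of a correlation coefficient; type the following into the Console:

   myr = .2; myn = 20
   myt = myr*sqrt((myn-2)/(1-myr^2))   # t statistic with n-2 degrees of freedom
   2*pt(-abs(myt), df=myn-2)           # two-tailed p-value, roughly .4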
 
Our script will use the mydata file that you created earlier. If you are still running the same session, it will still be present in memory. If you have shut down R since then, it will only be present if you accepted the option "Save Workspace Image" when exiting R. Even if it has gone, this is not a problem, because we saved the data to a file and can just read it back in. Or you can always recreate it!
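If mydata has gone from your workspace and you have also lost the saved file, here is a minimal sketch of how you could recreate something like it (this assumes, as in the earlier section, two variables simulated with mvrnorm and a true correlation of .5; substitute your own sample size and correlation if they differed):

   library(MASS)                              # mvrnorm is in the MASS package
   mycovmat = matrix(c(1,.5,.5,1), nrow=2)    # true correlation of .5
   mydata = mvrnorm(n=100, mu=c(0,0), Sigma=mycovmat)   # 100 simulated cases
   write.table(mydata, "myfile")              # save, so the script below can read it back in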

There is an easy way to check which variables are present in memory. At the R console, type
    objects()
and you get a list of all variables.

NB. If at any point you want to clear all variables from the workspace, type:
rm(list = ls(all = TRUE))
Alternatively, you can go to Misc on the menubar and select Remove all Objects.

Now copy this program into your Editor and run it.
#-------------------------------------------------------------------------------------------------------
# Likelihood_demo
#  Computes likelihood of observed correlation relative to range of true population correlations
#   by DVM Bishop, 6th March 2010
#-------------------------------------------------------------------------------------------------------

mydata=read.table("myfile")  # read in the data we created earlier
    # NB the name is not saved with the data – we reassign it here
                                          # Could use any filename we like
myn=nrow(mydata)             # set myn to the number of cases

mycoro=cor(mydata)           # set mycoro to the correlation matrix for mydata
mynvar=nrow(mycoro)          # set mynvar to the number of variables

myresult=matrix(0,nrow=10,ncol=2)   
# set up a blank matrix with 10 rows to hold expected correlations and likelihoods

mylabels=c('True correlation','-LL')    # Put labels for 2 columns of myresult in a vector
dimnames(myresult)=list(NULL,mylabels) # Allocate the labels to myresult
                                    
for (myx in c(1:10))   # Start a loop, which will run with value of myx updated
                            # from 1 to 10 on each pass through
{
 mytruecor=matrix(c(1,(myx-1)/10,(myx-1)/10,1),nrow=2)    
# Divide (myx-1) by 10 to give expected correlation for this pass of loop,
# i.e. likelihood is computed for correlations ranging from 0 to .9
 myresult[myx,1]=(myx-1)/10   # Results table has true correlation value in column 1

 myLL=(myn-1)%*%log(det(mycoro))-log(det(mytruecor))+sum(diag(mytruecor%*%solve(mycoro))-mynvar)
#  formula for computing negative log likelihood of observed correlation matrix, 
#  given true population value

myresult[myx,2]=myLL # put -LL value for this pass of the loop in myresult
}  # close curly bracket denotes end of loop

plot(myresult) #plot true correlation vs likelihood
#-------------------------------------------------------------------------------------------------------
In practice, for computational reasons, it is customary to report a value of 2 times the negative log likelihood, rather than the raw likelihood. The lower this value, the better the fit.
Figure 2: Output from Likelihood_demo

As you will see from the plot, the negative log likelihood function is at a minimum when the true population correlation is .5.
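Rather than reading the minimum off the plot, you can ask R for it directly. With myresult still in memory from the script, type:

   myresult[which.min(myresult[,2]),]   # row of myresult with the smallest -LL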
For further demonstration, we'll take advantage of R's prepackaged datasets.

Prepackaged datasets. The 'data' function can be used to read sample data that has been pre-packaged into the R library. To see a list of the prepackaged datasets that are available, type:

   data()
One of the listed datasets is 'attitude'. If you type
  data(attitude)

you will load in this dataset. You can see the data by typing attitude.

You will see that the demonstration file has 30 cases and 7 variables.
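You can confirm this, and look at just the first few rows, by typing:

   dim(attitude)    # number of cases (rows) and variables (columns)
   head(attitude)   # the first six rows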
We shall first re-run the Likelihood_demo script using just the first two variables from 'attitude'.
To select those variables, type:

   mydata=attitude[,1:2]

To use the 'mydata' you have just created, rather than reading in the saved file, just add a # to the start of the first uncommented program line (the read.table command); this 'comments out' that line so it will not run, and the script will instead use the version of 'mydata' already in memory, as shown in the sketch below. Compare the likelihood plot with the actual correlation in this dataset.
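The top of the modified script would then look something like this (a sketch of the change, with the correlation computed alongside for comparison):

   # mydata=read.table("myfile")   # commented out: we use the attitude-based mydata instead
   mydata=attitude[,1:2]           # first two columns of the attitude dataset
   cor(mydata)                     # the observed correlation, about .82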

Why are we doing this?! Of course, if you want to know what the correlation is between two variables you can just compute it (and R will do that easily for you!). But what we've done here is rather different. We've considered how good a fit there is between the observed data and specific values of the correlation, ranging from 0 to .9. For the attitude data, the actual correlation is .82. The fit is poor (higher -LL values) if we test the data against values less than .6, reaching a peak -LL value (worst fit) if we assume the variables are uncorrelated, r = 0. The fit is not so bad, though, for values from .7 to .9. This illustrates one feature of likelihood estimation: it can give an indication of how precise an estimate of a statistic is. But more importantly, this method comes into its own when we want to test a model that has several parameters and various constraints. Merely estimating a set of statistics from a dataset is not very exciting, but things get much more interesting when we start to develop models that aim to parsimoniously explain the data in terms of latent, unobserved variables. This will be clearer when we come (eventually!) to consider modelling twin data. But first, let's illustrate further using a more complicated dataset with 3 variables.
 
Likelihood estimation of correlation matrix with three variables
Our next script demonstrates how the same matrix formula for likelihood estimation works if we have three variables, and so three correlations to estimate: between variables 1 and 2, 1 and 3, and 2 and 3. We use this demonstration to show how some combinations of correlations yield incomputable likelihood estimates. Read this script and try to understand how it works; it can be useful to type it into the Console one step at a time and then look at the values of the different variables by typing their names into the Console.
#-------------------------------------------------------------------------------------------------------
# Likelihood_3var
#  Computes likelihood of observed correlation relative to range of true population correlations
#  for 3 variables
#  by DVM Bishop, 6th March 2010
#-------------------------------------------------------------------------------------------------------

data(attitude)  # read in one of R's prepackaged datasets
mydata=attitude[,1:3]  #select the first 3 columns (ratings, complaints, privileges) for new dataset called mydata

myn=nrow(mydata)             # set myn to the number of cases

mycoro=cor(mydata)           # set mycoro to the correlation matrix for mydata
mynvar=nrow(mycoro)          # set mynvar to the number of variables

mytable=matrix(0,nrow=1000,ncol=2)  # table to hold results
                                 
myloopcount=0 #initialise counter that increments each time you run through the loop
for (myx in c(0:9))   # Start a loop, which will run with value of myx updated from 0 to 9 on each pass
                      # myx determines correlation between vars 1 and 2 (= myx/10)
{                     # Curly brackets enclose all commands within the loop
  for (myy in c(0:9))   # Next loop (for correlation between vars 1 and 3)
  {
    for (myz in c(0:9))   # Next loop (for correlation between vars 2 and 3)
    {
     mytruecor=matrix(c(1,myx/10,myy/10, myx/10, 1,myz/10,myy/10,myz/10,1),nrow=3)    
#  expected correlation for this pass of loops; NB only positive correlations used here,
#  but could extend to negative values
     myloopcount=myloopcount+1
     myLL=(myn-1)%*%log(det(mycoro))-log(det(mytruecor))+sum(diag(mytruecor%*%solve(mycoro))-mynvar)
     # formula for computing log likelihood of observed correlation matrix, 
     # given true population value
     # Note this is same formula as for single variable; 
     # matrix algebra means we can just scale up N variables without altering formula

     mytable[myloopcount,]=c(myx*100+myy*10+myz,myLL)
     # Create 3 digit number that represents myx in first digit, myy in second, and myz in third
     }   # close curly bracket denotes end of innermost loop
   }     # end of next loop
}        # end of outermost loop

 droprow=c(which(is.nan(mytable[,2])),which(is.infinite(mytable[,2]))) 
 # identify rows where -LL is not calculated (not a number (NaN) or infinity),
 # You can see how the elements of this formula work by typing 
# help(which), help(is.nan),help(is.infinite)
mytable2=mytable[-droprow,]   #mytable2 is same as mytable but with dropped rows
                                 # (minus sign before droprow means select all but these rows)
plot(mytable2, xlab="100*r12 plus 10*r13 plus r23",ylab="-LL")

 myresult=match(mytable2[,2],min(mytable2[,2]))  
# gives NA for every row of mytable2 except the one whose col. 2 value equals the minimum
 mymin=mytable2[is.finite(myresult)]  
# picks out the index and the -LL from that matching row (the only finite entry in myresult)
 print ("Index for minimum value of -LL, -LL")
 print(mymin)
#---------------------------------------------------------------------------------------------------------------
When you run this script, you get a plot showing –LL values in relation to the pattern of correlations. This uses a simple, if rather clunky, method to show the three correlations associated with a given value of –LL as a 3-digit number (likelihood triplets); e.g. the number 435 means that the correlation between rating and complaints is .4 (the first digit), the correlation between rating and privileges is .3 (the second digit) and the correlation between complaints and privileges is .5 (the third digit). You can see that the –LL reaches a minimum somewhere in the middle of the range from 800 to 900. The precise value of the correlation index, 846, is stored in variable mymin and is printed on the console, together with the associated value of –LL. So the best estimates of the correlations between variables 1-2, 1-3 and 2-3 are .8, .4 and .6 respectively. You can compare this with the actual correlations in mydata by typing
   cor(mydata)
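If you would rather have R unpack that 3-digit index for you, the integer-division operator %/% and the remainder operator %% will do it (this assumes mymin is still in memory from the script):

   myindex=mymin[1]                                      # the 3-digit index, e.g. 846
   c(myindex%/%100, (myindex%/%10)%%10, myindex%%10)/10  # r12, r13 and r23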
Note that the level of precision of estimates is constrained by the fact that we have only estimated likelihood for true values of correlation coefficients in steps of .1. After you have finished this section, as an exercise you might like to consider how you could modify the program to give an estimate to two decimal places. Note that as well as modifying the ranges for the loops and the formulae for rx,  ry and rz, you would need to change the command that initialises mytable, as it will need to hold many more values (100^3, rather than 10^3).
Before making changes to the script, take a look at the values in mytable. This contains the data that are plotted in the output window, i.e. likelihood triplets with their corresponding log likelihood values. For some rows there are numeric values of –LL, but for others the value is given as NaN, which means 'not a number', or Inf (infinity).  The  mathematical explanation for this is that these are cases where the covariance matrix is 'not positive definite', i.e. it has one or more negative eigenvalues. There is also an intuitive explanation: these are cases where the set of correlations is not possible. For instance, the first row where a NaN occurs is where the index is 59, i.e. the fit of the data is being tested against this correlation matrix:
  1.0  0.0  0.5
  0.0  1.0  0.9
  0.5  0.9  1.0

A moment's thought indicates that if the correlation between variables 2 and 3 is .9, then they would be closely similar: it would not then be possible for variable 1 to be totally uncorrelated with variable 2, but still have a correlation of .5 with variable 3.

Create this matrix on the console. The quickest way to do this is by typing
         thismat=diag(1,3)   # creates a 3 x 3 identity matrix (ones on the diagonal)

and then use
thismat=edit(thismat)

to open the edit screen and change the off-diagonal elements.
Now type:
eigen(thismat)

You will see that one of the eigenvalues is negative.
If you modify the matrix so that the correlation between variables 2 and 3 is .8, rather than .9, you will find that the eigenvalues are all positive. 
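If you prefer not to use the interactive editor, you can build the same matrix directly and check its eigenvalues in one go; this does exactly what the steps above do, without the edit screen:

   thismat=matrix(c(1,0,.5, 0,1,.9, .5,.9,1), nrow=3)  # the correlation matrix for index 59
   eigen(thismat)$values   # one eigenvalue is negative: not positive definite
   thismat[2,3]=.8         # reduce the correlation between variables 2 and 3 to .8
   thismat[3,2]=.8
   eigen(thismat)$values   # now all eigenvalues are positive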


Optimisation
In the context of structural equation modelling, optimisation is a process whereby a model is specified, and different values of its free parameters are iteratively tested by computing the likelihood of observed vs. expected covariances (or other statistics). In our previous example, we could obtain an optimised estimate of the correlation matrix by testing all possible combinations of the three correlations, and picking the correlation matrix with the lowest value of –LL. This is, however, a very labour-intensive method, especially if we want estimates to a precision of more than one decimal place. One can see that if more than three variables were involved, it would quickly become non-viable, even on a fast computer.
Optimisation is the process of searching through the range of possible values that a model's parameters can take to identify the set that gives the best fit. Various computational algorithms can be used in optimisation: this is a technically complex area. In OpenMx, non-expert users need not concern themselves with the details of how optimisation is done, because the default procedures will usually work well. In effect, the method homes in on a set of values by tweaking estimated values in the direction that reduces the estimate of –LL. It is important to realise that optimisation is not infallible, especially with complicated multivariate models.
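To make this concrete, here is a minimal sketch (not OpenMx, just R's general-purpose optim function) showing how an optimiser can replace the grid search of Likelihood_3var. It assumes you have just run that script, so that mycoro, myn and mynvar are still in memory, and it reuses the same -LL formula:
#-------------------------------------------------------------------------------------------------------
# Optim_sketch
#  Uses R's general-purpose optimiser to find the correlations that minimise -LL,
#  instead of searching a grid; assumes mycoro, myn and mynvar from Likelihood_3var
#-------------------------------------------------------------------------------------------------------

myfit=function(r)    # r holds the three free parameters: r12, r13, r23
{
 mytruecor=matrix(c(1,r[1],r[2], r[1],1,r[3], r[2],r[3],1),nrow=3)
 if (det(mytruecor)<=0) return(1e10)   # penalise impossible (non-positive-definite) matrices
 (myn-1)*log(det(mycoro))-log(det(mytruecor))+sum(diag(mytruecor%*%solve(mycoro))-mynvar)
}                    # same -LL formula as in the scripts above

myopt=optim(c(.5,.5,.5),myfit)   # start all three correlations at .5 and let optim adjust them
print(myopt$par)     # the optimised estimates - compare with cor(mydata)
print(myopt$value)   # the minimised value of -LL
#-------------------------------------------------------------------------------------------------------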
