I wrote a quick post about Gavin Schmidt’s post comparing models to the satellite datasets. I thought Gavin’s post was very good, and explained the various issues really well. Steve McIntyre, however, is claiming that Schmidt’s histogram doesn’t refute Christy. This isn’t a surprise and isn’t what I was planning on discussing.
What I found more interesting was his criticism of a post Gavin wrote in 2007 (he also seems not to have got over the delayed publication of a comment in 2005). Nic Lewis also seems to think that Gavin's 2007 post was wrong. So, I thought I'd have a look.
In discussing a paper by Douglass, Christy, Pearson & Singer, Gavin says
the formula given defines the uncertainty on the estimate of the mean – i.e. how well we know what the average trend really is. But it only takes a moment to realise why that is irrelevant. Imagine there were 1000’s of simulations drawn from the same distribution, then our estimate of the mean trend would get sharper and sharper as N increased. However, the chances that any one realisation would be within those error bars, would become smaller and smaller.
In other words, in comparing the models and observations, Douglass et al. assumed that the uncertainty in the model trends was the uncertainty in the mean of those trends, not the uncertainty (or standard deviation) in the trends themselves. This seems obviously wrong, as Gavin says, but Steve McIntyre and Nic Lewis appear to disagree.
The key point, though, is that we only have one realisation of the real world, which is very unlikely to match the mean of all possible realisations. With enough model realisations, however, we could produce a very accurate estimate of the mean model trend. Given, however, that the observations are very unlikely to produce a trend that matches the mean of all possible real trends, the model mean is very unlikely to match the observed trend, even if the model is a good representation of reality.
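To make this concrete, here's a minimal sketch in Python (the numbers are invented for illustration: a hypothetical forced trend of 0.2 with internal variability of 0.1). As the ensemble grows, the standard error of the mean shrinks towards zero, while the chance that any single realisation, such as the real world, lands within ±2 standard errors of the ensemble mean shrinks with it.

```python
import numpy as np

rng = np.random.default_rng(42)
true_trend, sigma = 0.2, 0.1  # hypothetical forced trend and internal variability

for n in [5, 50, 500, 5000]:
    ensemble = rng.normal(true_trend, sigma, size=n)  # n model realisations
    mean = ensemble.mean()
    sem = ensemble.std(ddof=1) / np.sqrt(n)           # standard error of the mean
    # How often would a single new realisation (e.g. the real world)
    # fall within +/- 2 standard errors of the ensemble mean?
    draws = rng.normal(true_trend, sigma, size=100_000)
    coverage = np.mean(np.abs(draws - mean) < 2 * sem)
    print(f"n={n:5d}  2*SEM={2 * sem:.4f}  P(single realisation within 2 SEM)={coverage:.3f}")
```

With a handful of runs, a single realisation lands inside the ±2 standard error range roughly half the time; with thousands of runs, almost never. That is exactly Gavin's point.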
Gavin Cawley – who is also mentioned in Steve McIntyre's post – discusses it in more detail here, saying:
It is worth noting that the statistical test used in Douglass et al. (2008) is obviously inappropriate as a perfect climate model is almost guaranteed to fail it! This is because the uncertainty is measured by the standard error of the mean, rather than the standard deviation, which falls to zero as the number of models in the ensemble goes to infinity. If we could visit parallel universes, we could construct a perfect climate model by observing the climate on those parallel Earths with identical forcings and climate physics, but which differed only in variations in initial conditions. We could perfectly characterise the remaining uncertainty by using an infinite ensemble of these parallel Earths (showing the range of outcomes that are consistent with the forcings). Clearly as the actual Earth is statistically interchangeable with any of the parallel Earths, there is no reason to expect the climate on the actual Earth to be any closer to the ensemble mean than any randomly selected parallel Earth. However, as the Douglass et al test requires the observations to lie within +/- 2 standard errors of the mean, the perfect ensemble will fail the test unless the observations exactly match the ensemble mean as the standard error is zero (because it is an infinite ensemble). Had we used +/- twice the standard deviation, on the other hand, the perfect model would be very likely to pass the test. Having a test that becomes more and more difficult to pass as the size of the ensemble grows is clearly unreasonable. The spread of the ensemble is essentially an indication of the outcomes that are consistent with the forcings, given our ignorance of the initial conditions and our best understanding of the physics. Adding members to the ensemble does not reduce this uncertainty, but it does help to characterise it.
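Gavin Cawley's parallel-Earths argument is easy to check numerically. In the sketch below (again with invented numbers), the "model" is perfect by construction: the observation is drawn from exactly the same distribution as the ensemble members. A ±2 standard error test of the kind used by Douglass et al. still fails the perfect model more and more often as the ensemble grows, while a ±2 standard deviation test passes it about 95% of the time regardless of ensemble size.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, trials = 0.1, 10_000  # hypothetical internal variability; Monte Carlo trials

for n in [10, 100, 1000]:
    ensembles = rng.normal(0.2, sigma, size=(trials, n))  # "perfect model" ensembles
    truths = rng.normal(0.2, sigma, size=trials)          # one real-world realisation each
    means = ensembles.mean(axis=1)
    sds = ensembles.std(axis=1, ddof=1)
    sems = sds / np.sqrt(n)
    pass_se = np.mean(np.abs(truths - means) < 2 * sems)  # standard-error criterion
    pass_sd = np.mean(np.abs(truths - means) < 2 * sds)   # standard-deviation criterion
    print(f"n={n:4d}  +/-2 SE pass rate: {pass_se:.2f}   +/-2 SD pass rate: {pass_sd:.2f}")
```

(I believe Douglass et al. defined their standard error with √(N−1) rather than the √N used here, but that makes no difference to the point.)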
I should probably clarify something, though. If you want to characterise the uncertainty in the model mean, then of course you would use the uncertainty in the mean. However, if you want to compare models and observations, you can't use this as the uncertainty, since the observed trend is a single realisation, not the mean of all possible observed trends.
To do such a comparison, the standard deviation of the trends would seem more appropriate. However, even this may not be quite right because, as Victor points out, the model spread is not the uncertainty. Typically, what is presented is the 95% model spread (5% of the models would fall outside this range at any time if the distribution were Gaussian). However, to take into account other possible uncertainties, this is typically presented as a likely range (66%), rather than as an extremely likely range (95%). Of course, if the assumed forcings turn out to be correct, and the models are regarded as a good representation of reality, then the model spread will start to approximate the actual uncertainty.
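As a toy illustration of why the spread is not the full uncertainty (all numbers invented): if the models share a common structural error, the real world is drawn from a distribution that is offset from the ensemble's, and the 95% model spread then covers the observed trend less than 95% of the time. Something like this is presumably part of why the 95% spread gets presented as only a likely range.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, trials = 0.1, 100, 10_000  # hypothetical spread, ensemble size, trials

for shared_error in [0.0, 0.05, 0.1]:  # common structural bias across all models
    # Each trial: an n-member ensemble, all offset by the same shared error.
    ensembles = rng.normal(0.2 + shared_error, sigma, size=(trials, n))
    truths = rng.normal(0.2, sigma, size=trials)  # the real world has no such bias
    lo = np.percentile(ensembles, 2.5, axis=1)
    hi = np.percentile(ensembles, 97.5, axis=1)
    coverage = np.mean((truths > lo) & (truths < hi))
    print(f"shared error={shared_error:.2f}: spread covers the obs {coverage:.1%} of the time")
```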
As usual, maybe I'm missing something, but the criticism of Gavin's 2007 post seems incorrect: Gavin was quite right to point out that using the uncertainty in the mean, when comparing models and observations, is wrong. This seems like another example of people with related expertise making a technical criticism of what someone else has said without really considering the details of what was done, and doing so in a way that makes it hard for non-experts to recognise the problem with the criticism. Feature, not bug?
Update: It seems that some are arguing that the end of my post is wrong, because a paper by Santer et al. did indeed use the uncertainty in the mean when comparing models and observations. However, Santer et al. also included the uncertainty in the observed trend in their comparison, which makes it more reasonable than what was done in Douglass et al., who did not include the uncertainty in the observed trend (this post is about Gavin's comments on Douglass et al.). Having said that, I'm still not convinced that the Santer et al. test is sufficient to establish whether the models are biased, given that the distribution of the observed trends (i.e., trend plus uncertainty in trend) will not necessarily be equivalent to the distribution of all possible trends for the system being observed.
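For what it's worth, the difference comes down to what sits in the denominator of the test statistic. Here's a sketch with invented trend numbers; the "Santer-style" statistic simply adds the observational trend uncertainty in quadrature, which is my reading of the description above.

```python
import numpy as np

# Invented illustrative numbers (in K/decade): eight model trends and an observed trend.
model_trends = np.array([0.20, 0.24, 0.18, 0.27, 0.22, 0.25, 0.19, 0.23])
obs_trend, obs_se = 0.12, 0.04  # observed trend and its own regression uncertainty

n = len(model_trends)
mean = model_trends.mean()
sem = model_trends.std(ddof=1) / np.sqrt(n)

# Douglass-style: only the standard error of the model mean in the denominator.
d_douglass = (obs_trend - mean) / sem
# Santer-style: observational trend uncertainty combined in quadrature.
d_santer = (obs_trend - mean) / np.hypot(sem, obs_se)

print(f"Douglass-style d = {d_douglass:.2f}")  # a large |d|, so models 'fail' easily
print(f"Santer-style   d = {d_santer:.2f}")    # much smaller once obs uncertainty counts
```

With only the standard error of the model mean in the denominator, a modest model-obs difference looks like an enormous discrepancy; once the observational uncertainty is included, the same difference is far less dramatic.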
I realise that there might be even more confusion. Apart from when I quoted Gavin Cawley's comment, the Gavin I'm mentioning throughout is Gavin Schmidt.