Came across a very interesting paper on the art and science of climate model tuning. Based on the comments here, it appears that some are interpreting this as confirming their claims that climate models are tuned to give preferred results. However, I think it is a good deal subtler than that. As the abstract says

Tuning is an essential aspect of climate modeling with its own scientific issues, which is probably not advertised enough outside the community of model developers.

and the paper

concludes with a series of recommendations to make the process of climate model tuning more transparent.

Basically, model tuning is crucial and inevitable, but it would be much improved if the process was more transparent. I don’t want to go into too much detail, because the paper is actually quite readable and I’d encourage those who are interested to read it themselves.

What I will say, however, is that tuning is a key part of climate modelling; the system is too complex to model all aspects from first principles. The fundamental physics is well understood, but some processes require sub-grid models or parametrisations. These parameters are typically constrained in some way (for example, by physical calculations, or observations) but some are better constrained than others. The goal of tuning is then to minimise some difference between the model output and selected observations and theories. Although there are a number of different observations/theories that could be used for tuning, something I had not realised is that there is a

dominant shared target for coupled climate models: the climate system should reach a mean equilibrium temperature close to observations when energy received from the sun is close to its real value (340 W/m²).
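To make that shared target concrete, here is a toy zero-dimensional energy balance (a sketch with illustrative numbers, not anything from the paper): the absorbed solar flux is fixed near its real value, and a single effective emissivity, standing in for the greenhouse-related parametrisations, is "tuned" by bisection until the equilibrium temperature matches the observed ~288 K.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def equilibrium_temp(absorbed, emissivity):
    """Equilibrium temperature of a zero-dimensional energy balance model."""
    return (absorbed / (emissivity * SIGMA)) ** 0.25

def tune_emissivity(absorbed, target_temp, lo=0.1, hi=1.0, tol=1e-10):
    """Bisect on the effective emissivity until the model matches the target."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        # Higher emissivity means more efficient cooling, hence a lower temperature.
        if equilibrium_temp(absorbed, mid) > target_temp:
            lo = mid  # too warm: raise the emissivity
        else:
            hi = mid  # too cold: lower the emissivity
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

S0 = 340.0              # incoming solar per unit area, W/m^2 (the shared target above)
ALBEDO = 0.3            # planetary albedo, illustrative
absorbed = (1 - ALBEDO) * S0

eps = tune_emissivity(absorbed, 288.0)  # tune until the model sits near 288 K
```

The real process is of course multi-parameter and multi-objective, but the structure is the same: adjust poorly constrained parameters until a global target is met.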

The bit of the paper that I found most interesting was the section on *Tuning to 20th century warming*. The suggestion is that even though ECS is an emergent property of models and the match to the 20th century is typically used to evaluate models, there is an indication that some tuning to fit the 20th century is probable. This is largely because it’s been noted that high sensitivity models tend to have smaller total forcing, while low sensitivity models have larger forcing. Hence, there is less spread in historical warming than might be expected.

The other comment I found interesting was that internal variability could produce a variation of ± 0.1K on centennial timescales. Since we only have observations of one realisation, the models do not need to be closer than ± 0.1K to well represent our climate. Matching too closely might, in fact, suggest over-tuning. I also think that this relates to something I discussed in an earlier post.
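As a rough illustration of that ±0.1K point (a toy calculation, not the paper's): strongly persistent year-to-year variability, modelled here as an AR(1) process with invented parameters, leaves close to a tenth of a degree of spread even in century-long means.

```python
import random
import statistics

def ar1_series(n, phi, sd, rng):
    """One realization of an AR(1) process with stationary standard deviation sd."""
    innov_sd = sd * (1 - phi * phi) ** 0.5
    x = rng.gauss(0, sd)  # start from the stationary distribution
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0, innov_sd)
        out.append(x)
    return out

rng = random.Random(0)
# Invented parameters: strongly persistent interannual variability.
PHI, SD, YEARS, REALIZATIONS = 0.95, 0.15, 100, 2000
century_means = [statistics.fmean(ar1_series(YEARS, PHI, SD, rng))
                 for _ in range(REALIZATIONS)]
spread = statistics.stdev(century_means)  # close to a tenth of a degree here
```

Since we observe only one such realisation, a model run that matched the observed century to much better than this spread would be suspicious rather than reassuring.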

Given that there are some indications of tuning to match the historical record, what was suggested is that one could construct outlier low- and high-sensitivity models and then run these in pre-historic climates to see if one can rule out some of the more extreme values. This seems like a particularly interesting possibility.

Anyway, I’ve ended up saying more than I intended. I think the basic idea of the paper is very good; being more transparent about how models are tuned will be very valuable, as it will not only make clear what is being done, but also provides the possibility that it will make clearer what role this parameter tuning is having on the model results. I would, however, certainly be interested in getting other people’s views about this, in particular it would be good to get Michael Tobis’s views, as he has commented before on how we could work to improve climate models.

Glad you posted on this

What I found interesting: The different modes they used for tuning and the different time periods.

Example: A smallish percentage tuned in a coupled mode using the historical period and global temp. (at least that is how I read the survey)

Steven,

Yes, I agree. Although I assume that it also means that some models (maybe all) are being tuned in multiple modes.

It would be interesting to see the ensemble mean of 20th century global temperatures from those models which were not tuned against the 20th century global temperatures (approximately 16 of the surveyed groups), if anyone wants to put it together.

I like the comparison to tuning musical instruments. You can’t just tune a musical instrument to whatever you want and then have the result come out as music. There are lots of constraints and restrictions that need to be met. Perhaps in theory climate modelling could be like everyone trying to tune within a heavy metal band and concluding that music is loud and heavy, but a bit more exploration of tuning options would reveal the possibility of slow lullabies.

So the question is how well has the possible model tuning space been explored? Is there an alternate tuning which would produce a much lower (or perhaps higher) sensitivity? If there is an alternate tuning which produces a low sensitivity then why hasn’t any ‘skeptic’ found it yet? It’s not as if there is no money at all available for someone to do such a study, with fossil fuel interests etc. And with some climate models open source and runnable on a home PC it’s not as if the resources can only be obtained by a large well-funded government-backed body.

“So the question is how well has the possible model tuning space been explored? Is there an alternate tuning which would produce a much lower (or perhaps higher) sensitivity? If there is an alternate tuning which produces a low sensitivity then why hasn’t any ‘skeptic’ found it yet?”

This is being worked on by mainstream scientists. Richard Millar used these “objective” model tuning methods to find combinations of parameters that would give a very low equilibrium climate sensitivity, if I remember right one of 2°C. He managed to do so while staying in the physically possible range for the parameters.

I asked him about how the clouds and precipitation looked in such a model. His optimistic answer was, if I remember right, that some clouds looked good. In other words: others did not, it is very much work in progress. Maybe he will still find a combination that gives low sensitivity and still has somewhat decent clouds and precipitation.

It was a talk at the IMSC this summer: FINDING LOW CLIMATE SENSITIVITY GENERAL CIRCULATION MODELS THROUGH VERY LARGE PERTURBED PHYSICS ENSEMBLES by Richard Millar, University of Oxford

http://13imsc.pacificclimate.org/13imsc-program.pdf

Tom Curtis says:

“It would be interesting to see the ensemble mean of 20th century global temperatures from those models which were not tuned against the 20th century global temperatures (approximately 16 of the surveyed groups), if anyone wants to put it together.”

You would need to find a large group of climate modellers who do not know how much the temperature has increased over the instrumental period. Good luck. 🙂

When the model fits well to the observations, so that there is no need for tuning, that is also implicit tuning. It is better to use the models for what they do best: study how the different parts of the climate system interplay and what is important.

Maybe it is also good to emphasize that for the detection of climate change and its attribution to human activities, the historical temperature increase is traditionally not used. If I understand it correctly this is exactly to avoid problems with tuning, which makes specifying the uncertainties very difficult. The attribution is made via

correlations with the 3-dimensional spatial patterns between observations and models. By using correlations (rather than root mean square errors), the magnitude of the change in either the models or the observations is no longer important.

Tom,

Yes, that would be interesting to see.
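Victor's point above, that attribution via pattern correlations is insensitive to the magnitude of the change, can be illustrated with made-up numbers: scaling a pattern leaves the correlation untouched while the RMSE changes.

```python
import statistics

def pattern_correlation(a, b):
    """Centered pattern correlation between two flattened spatial fields."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = (sum(x * x for x in da) * sum(y * y for y in db)) ** 0.5
    return num / den

def rmse(a, b):
    return statistics.fmean([(x - y) ** 2 for x, y in zip(a, b)]) ** 0.5

obs = [0.2, 0.5, 1.1, 0.8, -0.1, 0.4]   # hypothetical observed warming pattern
model = [2 * v for v in obs]            # same spatial pattern, twice the amplitude
# The correlation is 1 regardless of amplitude; the RMSE is not.
```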

Victor,

I didn’t realise that someone was actively looking at the possibility of tuning a model to produce low CS. Interesting.

It would be very interesting to see whether the models could be tuned to explain the 20th century climate without CO2 being a greenhouse gas, without setting the parameters to physically unrealistic values. A lot of work to address a climate skeptic argument, but it would give a definitive answer (rather than the tacit answer provided by the lack of a climate skeptic scientist having done it already).

It is a pity we don’t have (almost) infinite computational resources to marginalise (integrate over) the uncertainties in the parameters. Optimisation is the root of all evil in statistics, so it is generally better to marginalise than tune (where possible/feasible).
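Dikran's distinction between tuning (optimising) and marginalising can be sketched with a toy inverse problem (entirely hypothetical, nothing to do with an actual climate model): observe a noisy y = θ², infer θ on a grid, and compare the prediction from the single best-fit θ with the prediction averaged over the whole posterior.

```python
import math

def gaussian(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Toy inverse problem: observe y = theta^2 plus noise, infer theta on a grid.
y_obs, noise_sd = 1.0, 0.5
grid = [i * 0.01 for i in range(1, 301)]  # theta in (0, 3]

# Unnormalised posterior with a flat prior over the grid.
post = [gaussian(y_obs, t * t, noise_sd) for t in grid]
z = sum(post)
post = [p / z for p in post]

theta_map = grid[post.index(max(post))]        # "tuning": keep the single best theta
pred_tuned = theta_map ** 2
pred_marginal = sum(p * t * t for p, t in zip(post, grid))  # average over the posterior
```

With a symmetric, well-behaved posterior the two answers would coincide; with a skewed one, as here, the tuned and the marginalised predictions differ, and only the latter reflects the full parameter uncertainty.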

Dikran,

Interesting. I guess the 95% attribution study has partly done that, but you’re suggesting actually trying to tune to match the historical record without any extra CO2 forcing. Would be interesting to see what would need to be done. What would produce the change, for example?

Well quite ;o)

The code for several of the models is in the public domain, so the only things that stand in the way of performing the experiment is the computational expense (I suspect that there is probably a skeptic academic with access to a HPC, and if not then there is always the SETI@home type approach) and the effort required to understand the science and the code. Of course making assertions without seeing if they are valid is rather less work! ;o)

I think the perturbed physics experiments are also a partial answer, but I haven’t looked in detail at the results. Perhaps it would make a reasonable grant proposal to survey the boundaries of the space of plausible parameters to see what is just about plausible (rather than trying to survey the general shape of the p.d.f.).

Dikran Marsupial says: “It would be very interesting to see whether the models could be tuned to explain the 20th century climate without CO2 being a greenhouse gas, without setting the parameters to physically unrealistic values.”

That might be an interesting new task for Richard Millar. He was searching for climate models with an equilibrium climate sensitivity between 1.5 and 2°C because he thought that the research with simple statistical energy balance models suggested such values. We now know that the raw results of these energy balance models are biased too low and that their actual estimates may even be higher than average.

http://variable-variability.blogspot.com/2016/07/climate-sensitivity-energy-balance-models.html

Thus Millar’s original question is now less relevant. (Producing models with a high climate sensitivity is easy.)

I wonder, with Physics, what forcing the climate models would work with to produce any warming. Or do you see a way to flip the natural cooling into a warming signal with some bizarre parameters?

Dikran Marsupial says:

“The code for several of the models is in the public domain, so the only things that stand in the way of performing the experiment is the computational expense (I suspect that there is probably a skeptic academic with access to a HPC)”

You can reduce the computational power needed with a trick: train a statistical model (an emulator) on a small number of climate model runs. When you find potentially working parameter sets this way, you can use them in the original model and see how it behaves.
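The emulator trick can be sketched in a few lines (a deliberately trivial stand-in: the "expensive model" here is just an analytic function, and the emulator a quadratic fit through three training runs).

```python
def expensive_model(p):
    """Stand-in for a costly climate-model run; really just an analytic function."""
    return 2.0 + 1.5 * p - 0.8 * p * p

def quadratic_through(points):
    """Coefficients (a, b, c) of the quadratic a + b*x + c*x^2 through three points."""
    (x0, y0), (x1, y1), (x2, y2) = points
    f01 = (y1 - y0) / (x1 - x0)   # divided differences
    f12 = (y2 - y1) / (x2 - x1)
    c = (f12 - f01) / (x2 - x0)
    b = f01 - c * (x0 + x1)
    a = y0 - b * x0 - c * x0 * x0
    return a, b, c

# Three "expensive" training runs, then a cheap emulator for everything else.
train_x = [0.0, 0.5, 1.0]
a, b, c = quadratic_through([(x, expensive_model(x)) for x in train_x])
emulator = lambda p: a + b * p + c * p * p

candidates = [i / 100 for i in range(101)]
promising = [p for p in candidates if emulator(p) < 2.2]        # screened cheaply
confirmed = [p for p in promising if expensive_model(p) < 2.2]  # verified with the full model
```

In practice the emulator would be something like a Gaussian process over many parameters, but the workflow is the same: screen cheaply, confirm expensively.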

I cannot find it any more, but I noticed some time ago a “prominent” UK mitigation sceptic who complained to Tamsin Edwards about wasting tax payer money working on such climate model emulators. They really have no clue what they are talking about, otherwise they would have welcomed this kind of research to test the limits of physical climate models (under the generous assumption they actually doubt the science, not the political solutions).

ATTP – I hope this is not off topic, but I saw recently an interesting talk by Prof. Julia Slingo (who as you know is Chief Scientist at the Met Office): “Building a climate laboratory: How climate models have revolutionised our understanding of planet Earth”

Part of the talk looked at resolution in the models and said that AR5 used 80km grid, but the Met Office is now using 20km as standard, and with its new 16 petaflop computer is testing 5km. On one aspect of the model (thermohaline circulation), the 5km results were not bad, but the 20km were not so good.

My question: presumably there are opportunities when increasing the resolution to ‘de-parameterize’ some components of the model, when appropriate and useful?

Of course in many cases it may simply be a case of running the same old (unchanging) basic physics at finer and finer granularity (e.g. for the GCM). One of the figures she showed, revealed qualitatively better results when moving to these finer scales, meaning that some of the things that ’emerge’ are surprising (not just another decimal point or better error bar) and make sense of things we struggled to understand before (she gave examples).

The interplay (innovation feedback) between the models and the computing power is interesting to me, and is a phenomenon not unique to climate science of course.

Pingback: Puntare sul riscaldamento globale - Ocasapiens - Blog - Repubblica.it

Running models at a higher resolution does not mean that you no longer need parametrisations for small-scale processes. The climate system shows variability at all spatial and temporal (averaging) scales and has processes at a wide range of scales. Increasing the resolution will make some of the current parameterizations less important, but it also adds new problems. Resolutions between 10 km and 100 m are especially difficult.

An important parameterization is for convection (showers), which transports heat and moisture from the boundary layer near the surface to higher altitudes. At resolutions coarser than 10 km you can parametrize this with a mass flux scheme that detects situations that typically trigger convection and then transports heat and moisture to higher levels.

Once you go below 10 km, the flow of the model itself also starts to partially account for convection, and the air next to the shower that flows down (subsidence) no longer does so in the same atmospheric column. Below 100 or 10 m the atmospheric model can model the vertical movement itself and no longer needs a convection parameterization. Below 10 m it would still need an additional parameterization for the turbulent fluxes due to vertically moving air. And it would need the right cloud microphysics (all the different cloud and precipitation particles, how they freeze and interact) so that the shower behaves realistically. Weather prediction models are currently in this twilight zone and it is a tough spot.

Unfortunately, higher resolutions are easier to sell. They require bigger computers, which means money for private industry. Working on better parameterizations would mean more scientists, which means more government employees. We live in a time where private = good and government employees = bureaucracy = bad.

One of the authors, I think:

She seems to like improving climate modelling.

Thanks Victor.

“That might be an interesting new task for Richard Millar. He was searching for climate models with a equilibrium climate sensitivity between 1.5 and 2°C because he thought that the research with simple statistical energy balance models suggested such values. We now know that the raw results of these energy balance models are biased too low and their actual estimates may even be higher than average.”

Even though Energy Balance models are biased low, even with the adjustments, you aren’t excluding ECS values from 1.5 C – 2 C at the 95% confidence level.

Thanks for leaving that mt bait. I’ll probably pick it up tomorrow. I have thought quite a bit about these questions. Whether that effort leaves me with useful things to say is another question.

-1,

I think that is an important point. The IPCC range is only a likely range, not an indication of where we definitely expect it to lie. It is clearly more likely to lie within this range, than outside this range, but being outside this range is not precluded.

MT,

I look forward to it.

In a podcast Victor V. links to in his article on the Art, Bjorn Stevens lays out this:

Among the model results – the northern hemisphere is frozen over

Observation – the northern hemisphere is NOT frozen over

Job – find the bug in the model that froze the northern hemisphere

Solution – tune the model

Too simplistic?

JCH says: “In a podcast Victor V. links to in his article on the Art, Bjorn Stevens lays out this:”

Someone clicked on a link!! I am so proud of my readers.

I even put in a link joke, that I am sure nearly no one will find.

JCH says: “Among the model results – the northern hemisphere is frozen over”

Observation – the northern hemisphere is NOT frozen over

Job – find the bug in the model that froze the northern hemisphere

Solution – tune the model

Too simplistic?

Simplistic, I don’t know, but maybe a bit exaggerated; such a big bug would have been found in some other way. Bjorn Stevens is used to communicating with his colleagues, thus I think the example is mainly meant to refute claims by some people that they did not tune their model. There are no climate modellers without intimate knowledge of the climate of the Earth. A bug that deviates from this expectation is more likely to be found than one that happens to make the model fit better. Thus even if you would go out of your way not to tune your model, it will be tuned somewhat.

-1,

If one of the three different lines of evidence that Energy Balance Models give biased results still holds when the next IPCC report is written, it is highly likely that the lower bound of the Equilibrium Climate Sensitivity will be raised again to 2°C.

The IPCC is tasked with presenting a consensus view of our current understanding of climate change. That is likely why they dropped the lower bound from 2°C to 1.5°C in the last IPCC report in response to the new low estimates from EBMs. I presume they realised the evidence was weak and went against most of the other evidence, but the EBM estimates were credible published research and the papers showing the biases were not published yet. Had it been a normal review, presenting our best understanding of the climate system, they would likely have stayed with 2°C and added a footnote. But the IPCC reports are a consensus view. The lovely irony.

Victor – I have discovered you write really informative stuff on your blog.

Is GISS Model E tuned to produce late 20th-century warming? Not intentionally.

“There are no climate modellers without intimate knowledge of the climate of the Earth. A bug that deviates from this expectation is more likely to be found than one that happens to make the model fit better. Thus even if you would go out of your way not to tune your model, it will be tuned somewhat.”

I was looking into the relationship between the CMIP model spread and the trend uncertainty for the global mean temperature because I was interested in seeing whether this evidence from climate models would constrain how much warming we have seen.

If someone would claim that the observations show 0.2°C more warming or 0.2°C less warming, would you be able to say: sorry that is likely wrong because such a new observational temperature curve would fall outside of the ensemble model spread. I would currently think that 0.2°C deviations would be no problem even if that means leaving the model spread, because the model uncertainty is much larger.

If someone would claim that there was no warming or that the warming was 2°C since 1880, I would say the models exclude that (and much other observational evidence even more so). Where the limit is between 0.2 and 1°C is hard to say because much of the tuning (and the other reasons for a lack of spread) is implicit.

http://variable-variability.blogspot.com/2016/08/Climate-models-ensembles-spread-confidence-interval-uncertainty.html

Before I dig in (I haven’t even finished reading the Hourdin paper) I’d like to capture some interesting threads on Twitter.

Unsurprisingly, Roger Sr takes the bait:

“Tuning & lack of demonstrated skill at predicting changes regional climate shows major defects in climate policy tool”

“Remarkable admission – 96% of the models tune to obtain radiation balance.”

Also unsurprisingly, Roger is woefully muddled about details:

@thesinkovich: Does this come under the same heading as what the IPCC called “flux adjustments”?

RPSr: Yes. That is one part.

==

A somewhat reasonable skeptic, Derek Sorensen (@th3derek), buys in

@RogerAPielkeSr Is this as damning as it appears, at first glance, to this layman?

and before wandering off to Watts-land and observations of GMST and “altering data” and all that, does make this valid point:

@th3derek: I’ve always been uncomfortable with tuning; von Neuman’s elephant, and all that.

===

Chris Colose, Andy Dessler, and Gavin Schmidt make the reasonable point that Pielke ought to have known all this:

@ClimateofGavin: I love how repetition of a basic modeling fundamental is a ‘remarkable admission’ to ppl who don’t know the basics.

@AndrewDessler: I’m sure @RogerAPielkeSr will next be shocked that satellite instruments are “calibrated”. Clearly nefarious.

@CColose: but the existence of ‘tuning’ is completely well-known and anyone telling you it is some ‘admission’ is selling u snake oil

NOTE:

However, I think it is fair to say, as Hourdin et al. do say, that the issue has been somewhat avoided in public. Indeed, when I tried, about a decade ago when I still had some standing in the field, to get NCAR to document their tuning procedures, I got a bit of a runaround.

The tuning step is common knowledge among real climate modelers, but it’s been largely informal/grey literature knowledge. I think there are reasons that it used to be informal, and I was arguing a decade ago that we ought to formalize it.

Why modelers haven’t been eager to take that step is made eminently clear by RPSr’s behavior, which was, unlike much in climate, utterly predictable, except for whether it would be he or Curry who’d run with the ball first.

===

@jim_bouldin also makes some interesting claims:

@jim_bouldin: It’s not as if tuning were a desirable or preferred thing to be doing when building a model. It’s a last resort approach.

@ClimateOfGavin: hmm.. Rather, it’s unavoidable when dealing with a complex multi-scale system.

JB: I guess I’d say its avoidability depends on the urgency of building a model and the available info to build it

JB: The fact that first principles won’t explain it all means we have to stay very aware of exactly how all of our model parameters were generated, how hypothetical. Gotta be very clear and distinct on this issue.

@CColose: sure,in practice no complex/interesting science would ever get done if everything was knowable from first princ

NOTE:

I think this is a classic case of people bringing their backgrounds to a different science. I would say the epistemic status (believability) of climate models is stronger than that of ecological models. Jim doesn’t seem to understand this, and I’d venture that Chris may not understand that [Jim] doesn’t understand it.

===

Meanwhile, in a thread having nothing directly to do with Hourdin, I found an interesting item from Jonathan Koomey that ties into my thinking about the public-facing aspects of the tuning issue.

@jgkoomey: Great example of forecasts diverging from reality https://www.washingtonpost.com/posteverything/wp/2016/08/02/the-progressive-victory-nobodys-talking-about/?postshare=4141470168634141&tid=ss_tw

NOTE:

Here Jonathan points to a failed economic forecast of some importance, and an essay about it that I find dissatisfying.

The public concludes from this sort of thing that “experts don’t know what they are talking about”. I think we need to convey the more complex idea that “experts in some fields don’t know what they are talking about”.

My position is that the believability of **economic** models is extremely weak. And you will find that economists are among the people having the hardest time placing any value on climate models.

Thus, I am arguing that the epistemic status of different fields differs, and that the status of climate science is rather high, much higher than most outsiders have been led to believe.

SUMMARY:

The key internal topic for climate modeling is how to formalize model tuning, and whether a uniform approach across CMIP or multiple different approaches optimizing for different concerns is indicated. The Hourdin paper addresses this, I believe.

The key topic for public communication is rather the extent to which the necessity for tuning should reduce the believability of climatology and of climate models. This is an unusually complicated topic, which is also unusually interesting. In my opinion it says a great deal about how science will be conducted in a world where massive computation joins theory and observation as a third leg of science. The topic is deep and rich. But, given that it’s closely related to climatology itself, it will inevitably suffer from an immense amount of noise injection from confused and/or cynical parties.

I think we should address both here. I’m afraid, though, that it may be necessary to be a bit pedantic about which one we are talking about at any given time. We’ll see.

I will begin by advising NOT to resort to ad hominem or ad verecundiam dismissals of questions directed at us from “skeptical” camps, much as they are tempting based on track record.

Inevitably, the usual suspects will generate a great deal of noise here, but that doesn’t mean there won’t be a few challenging and important questions hidden here and there among their fulminations. I believe it is important to identify and respond to the best questions without falling prey to arguing about the great mass of nonsense that they will make sure to insert into this discussion.

Eh, it’s just inaccurate.

The model itself is one thing, while the parameters or inputs that go into the model are another. Both influence the outcome. A flawed outcome (e.g., frozen N Hemisphere) could come from either.

So if there’s a bug in the model, fix that. But if the parameters are off, and can be improved via tuning, do that.

Your real job is “find the source of the error”, whether that comes from the model or the parameters.

All the models are tuned to predict the past. Perhaps this is reasonable.

But the models quickly diverge from one another when applied to a given scenario.

It is possible and seems likely that tuning to solve one infidelity subsequently amplifies other infidelities.

TE,

You might have to define what you mean by “quickly diverge”.

It’s an interesting paper, and surprisingly readable considering it’s dealing with a fairly narrow technical topic. There could be any number of other terms to use than ‘tuning’… consistent multiparameter estimation comes to mind, but tuning is short and snappy.

I have little doubt after reading the reactions of Curry and Pielke Sr. (including airing his longstanding grudges against Gavin Schmidt for the nth time) that “but tuning” will quickly become a meme used by the contrarians and deniers. Hopefully a short-lived one.

Off-topic, but I see Curry also tweeted links to a couple of comment pieces by others. I’ll spare the links; the titles say enough:

Activism – Or, How to Turn People Off and Stall Progress

Weak Minds Think Alike

Oh dear, is Judith promoting posts from the new “Climate Denialism” site? I had a brief discussion with Ben Pile on a follow-up post based on a comment on the Weak Minds Think Alike post. It didn’t go very well, but that wasn’t a surprise. I do find it quite fascinating that people who give the impression that they regard themselves as intellectuals will make such appallingly weak arguments. Full of ad homs, and drawing all sorts of definitive conclusions from what is clearly limited information.

“quickly diverge”

TE,

But those are all the future emission pathways. Of course they diverge, they’re considering different changes of forcing beyond 2005.

Ben Pile makes some of the more vacuous arguments I’ve come across. That Judith has highlighted his stuff quite a few times speaks very poorly for her, IMO.

Please note that in the comments at Judith’s we have many of the “coffin nail” and “stake through the heart” variety. For those who see some broad-scale changes on the climate-o-sphere, I offer the following as evidence of samesameo – just from the first few comments.

https://judithcurry.com/2016/08/01/the-art-and-science-of-climate-model-tuning/#comment-800122

https://judithcurry.com/2016/08/01/the-art-and-science-of-climate-model-tuning/#comment-800238

https://judithcurry.com/2016/08/01/the-art-and-science-of-climate-model-tuning/#comment-800215

There are many more, of course.

Windchaser – Victor’s explanation, I think, gets to the point, which is that it is probably difficult for modellers to un-know what they know, so when a defective model feature results in agreement with what they know, they’re not as likely to find and fix that defect.

I don’t like the music analogy, but say you want a pure 5-string banjo improvisation of Generic Classical Piece Z (a climate model not tuned to 20th-century climate.) Would you go to Juilliard and find a musical genius kid who slums on the 5-string with some hillbillies, or would you hire the musical genius hillbilly banjo player from Deliverance? Both are capable of producing a brilliant result (which is why this paper is not even remotely damning of climate modelling.) To work brilliantly, the improvisation has to be deeply rooted in Classical Piece Z (in crafting their improvisations, painful-to-the-ears deviations are avoided; otherwise, they would not be geniuses.) Both are capable of purity, but only one is incapable of impurity… because he is a musical genius who has no experience at all with classical music. The Juilliard guy knows the melody. It’s impossible for him to not know it. It could seep into his improvisation without his awareness because it would sound right.

MT,

Thanks for the comment. I agree with the end of your comment, as hard as it might sometimes be.

“But those are all the future emission pathways. Of course they diverge, they’re considering different changes of forcing beyond 2005.”

They diverge within each scenario.

The dark blues diverge from other dark blues.

The reds diverge from other reds.

But, there is good reason to believe the models can predict the past.

TE,

They clearly don’t all follow the same path – we’re well aware that there is a range of future warming along all pathways; the range is about 0.5K for the lowest emission pathway and about 2K for the highest. However, I think your figure overstates the level of divergence, and it only goes to 2050.

I do not know whether it is possible to see this in the figure of Turbulent Eddy, but during the historical period the spread is too low. That is explained in my blog post.

http://variable-variability.blogspot.com/2016/08/Climate-models-ensembles-spread-confidence-interval-uncertainty.html

An important reason is that models with a high climate sensitivity tend to have more cooling by aerosols and low climate sensitivity models have less cooling by aerosols. In the future the influence of aerosol cooling (proportional to emissions) becomes smaller than the influence of greenhouse gases (CO2 accumulates), thus you see the spread in climate sensitivities more clearly in the future.
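A toy calculation (all numbers invented) shows why anti-correlated sensitivity and aerosol forcing compress the historical spread but not the future spread.

```python
# Invented numbers: higher-sensitivity models paired with stronger aerosol cooling.
models = [
    # (sensitivity, K per W/m^2, aerosol forcing, W/m^2)
    (0.4, -0.4),
    (0.6, -0.9),
    (0.8, -1.3),
]

F_GHG_HIST, F_GHG_FUTURE = 2.0, 6.0  # greenhouse forcing keeps accumulating
AEROSOL_FUTURE_SCALE = 0.5           # aerosol forcing shrinks with emissions

hist = [s * (F_GHG_HIST + aer) for s, aer in models]
future = [s * (F_GHG_FUTURE + AEROSOL_FUTURE_SCALE * aer) for s, aer in models]

spread_hist = max(hist) - min(hist)        # the compensation keeps this small
spread_future = max(future) - min(future)  # the compensation breaks down
```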

The range of climate sensitivities of climate models fits the most accurate estimates we have of the climate sensitivity.

Ranges and best estimates of ECS based on different lines of evidence. Bars show 5-95% uncertainty ranges with the best estimates marked by dots. Dashed lines give alternative estimates within one study. The grey shaded range marks the likely 1.5°C to 4.5°C range as reported in AR5, the grey solid line the extremely unlikely less than 1°C, and the grey dashed line the very unlikely greater than 6°C. Figure taken from figure 1 of Box 12.2 in the IPCC 5th assessment report (AR5). Unlabeled ranges refer to studies cited in AR4. The figure in the review article by Knutti and Hegerl (2008) presented by Skeptical Science is also a very insightful overview.

Joshua… Funny. We don’t use differential equations for simulations anywhere else, except… you know… engineering. I wonder how they figured out how to design the space shuttle:

https://en.wikipedia.org/wiki/Space_Shuttle

Musta guessed. Yep, that’s it.

My point was more that if it makes sense to have climate models with climate sensitivities on the high range (3C < ECS < 4C) then it also makes sense to have climate models with sensitivities on the low end (1.5C < ECS < 2C). For the most part we have many climate models on the high end, but very few on the low end.

I see no problem with tuning against the whole 20th century; consider the 20th c. to be a score sheet that the performers should follow as exactly as possible. Then if by some fluke the improvisation of the early 21st century produces two really low notes on the bass at beats 2007 and 2012, we could still have a very good model. Maybe the bass player had a stroke in those beats. Hah. But of course the composer here would be physics, if you’re not into religious explanations, in which case any earthly… (no, actually solar-systemic, or even more exactly sun-moon-earth-systemic) God would suffice as the composer of the piece.

Yeah, I know, the score sheet is an idealization of the music intended: if it’s meant to be played on natural scales, it sounds somewhat off when played on the even-tempered scale. I’d say that could be a better analogy for the difference between reality and models.

@ Victor Venema – Anyway, it’s pretty clear that we have different understandings of the best estimate of climate sensitivity and the probability distribution associated with that. In the past I’ve explained why climate model distributions can be biased, why paleoclimate estimates are generally overestimates, and why the Kyle Armour modifications to energy budget values are too high. I don’t see the best evidence excluding 1.5C-2.0C at the 95% confidence level.

-1,

But some of that is emergent. The more we engineer/tune the ECS of the models, the less confidence we can have in the resulting range. The idea that we tune models to have outlier sensitivity (high and low) and then apply those to pre-historic climates makes a lot of sense to me. I don’t think it makes quite as much sense to do for projections.

TE I think your definition of “quickly diverge” diverges from mine (given the uncertainties involved the 3-4 (?) fold increase in model spread over 45 years doesn’t seem unduly rapid).

It is also worth noting that there will be a constriction of the spread of the models immediately after the baseline period which would give the (spurious) appearance of some divergence even if there were none.

“For the most part we have many climate models on the high end, but very few on the low end. ”

If we simulate the movement of gas molecules in a container there will be many solutions where the molecules are evenly distributed but very few where they are mostly in one half of the container. If there are many plausible ways of making/tuning a climate model so that it has a high ECS and few plausible models/parameter settings that give a low climate sensitivity, that just means that low ECS is less plausible than high ECS, and what we see is what we should expect to see.

Exploring the bounds of plausibility is a sensible thing to do to understand the climate system (science often works by bounding). However, as ATTP suggests, that would make no sense for projections, as for projections we want the distribution of outcomes to be correct (as that is needed for risk assessment), not just the bounds.

@ ATTP – fair enough, it is emergent. But if there is a discrepancy between climate sensitivity as suggested by climate models and climate sensitivity as suggested by instrumental or paleoclimate data, then finding what parameters result in lower climate sensitivity models that are consistent with observations may help explain the discrepancy between observations and models. That’s why I think it could be useful to explore tuning model parameters to get low sensitivity.

@Dikran –

“that just means that low ECS is less plausible than high ECS, and what we see is what we should expect to see.”

Or… the distribution of climate models has a systematic bias.

-1,

It sounds like we’re saying the same thing. Engineering outlier models to then test against paleo estimates, for example, seems like an obvious thing to do (although I don’t think there is a major discrepancy between paleo and models). I was simply suggesting that doing so for projections doesn’t make as much sense since you would really like the CS to be emergent.

Of course, if you discovered that you couldn’t eliminate some of the outliers, that may well inform how you then tune the models for projections.

While it would be nice for CS to be emergent, it’s really difficult to exclude the possibility of the distribution being biased, or the model spread being too small. Maybe it makes more sense to use an empirically justified distribution of climate sensitivity to determine the models to use for projections (and to use for economic analysis, etc.), rather than use the model distribution itself.

-1,

Of course you can’t exclude the possibility that the distribution is biased in some way – in a sense this paper is suggesting that that is possible. I would argue that explicitly tuning the ECS would, however, remove information, since then the ECS would be coming from another source. At least if the ECS is potentially emergent, you can regard this as a somewhat independent measure. It might still be biased (which is why testing it is important) but I think moving to a phase where the models are tuned to give an ECS distribution from another source would be the wrong way to go.

As I understand it, most of the economic analysis is done on the basis of some kind of future temperature change. You could then impose any distribution onto that to determine the likelihood of various impacts. I don’t think you need to specifically tune models in order to apply different possible ECS distributions to economic analyses.

“Or… the distribution of climate models has a systematic bias.”

Yes, possibly, but then again only looking at the extremes will create an even greater bias (unless you think the distribution actually is bimodal), so what is your point? At the end of the day, you have to estimate the distribution as well as you can using the resources actually available and the approaches the research community consider plausible. Of course if you actively want climate sensitivity to be low then you can always claim that there is a systematic bias in order to discount the models with high ECS, but that is not an argument that is likely to convince others that do not share your prior belief. A better approach would be to demonstrate that there actually is a systematic bias. Good luck with that (genuinely, I’d like climate sensitivity to be low as well).

“or the model spread being too small. ”

we already know that is the case, there are many uncertainties that are not fully accounted for.

“Maybe it makes more sense to use an empirically justified distribution of climate sensitivity ”

The problem is that these estimates have a high variance. Error is composed of both bias and variance, so choosing models in close agreement is not necessarily a good way of choosing which models to use (although it is a good idea for decadal scale projections where getting things like ENSO right are important).

Consider taking a large number of model runs from a single climate model with the same parameter settings. Now choose one of these to be the “real Earth”. Next estimate ECS (or some other statistic of your choice) using that one model run using a period of say 30 years. Next select those models that best match the “true” model run on the same basis. The models you didn’t select are by definition equally useful for long term climate prediction, so what have you gained by deleting them? Nothing.
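The thought experiment above can be sketched numerically. This is a toy with made-up numbers: each "run" is just a true trend plus simulated internal-variability noise, standing in for full model runs.

```python
import numpy as np

# Toy version of the thought experiment: 100 runs of ONE model, so the
# true trend (a stand-in for ECS) is identical in every run; the runs
# differ only through simulated internal variability.
rng = np.random.default_rng(0)
TRUE_TREND = 0.20                                    # K/decade, in all runs
runs = TRUE_TREND + rng.normal(0.0, 0.05, size=100)  # noisy 30-yr estimates

earth = runs[0]          # designate one run as the "real Earth"
others = runs[1:]

# Keep the 20 runs whose estimate best matches the "observations".
order = np.argsort(np.abs(others - earth))
kept, discarded = others[order[:20]], others[order[20:]]

# Every run, kept or discarded, was generated with the same true trend,
# so discarding the poor matches gains nothing for the long term -- the
# selection only filtered on internal-variability noise.
```

Since the kept and discarded runs come from the same model with the same parameters, selecting on agreement with one realisation tells you nothing about which runs are "better" in the long run.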

As far as the model spread being too small is concerned, my understanding is that the ECS range presented by the IPCC is formally a 95% range (extremely likely), but is presented as a 66% range (likely) in order to account for other uncertainties.

“The models you didn’t select are by definition equally useful for long term climate prediction,”

The point I was making is that of course they appear biased compared to the “one true model run”, but the appearance of bias is spurious and due to the variance of the estimate made from the “one true model run”. Of course ECS is essentially the same in all model runs as they are all from the same model with the same parameters.

ATTP, as usual it is amazing what you can find out about the mainstream position on climate change from actually reading the IPCC reports ;o)

“I think moving to a phase where the models are tuned to give an ECS distribution from another source would be the wrong way to go.”

Why would this be the case if empirically derived distributions are more reliable?

” I don’t think you need to specifically tune models in order to apply different possible ECS distributions to economic analyses.”

If you want to do proper economic analysis, you need to know the geographic distribution of temperature and rainfall changes, and one way you can get that is by using climate models. From memory, I believe that the FUND model uses CMIP5 model output to perform economic analysis. However, if the distribution of CMIP5 models is oversensitive, then this will result in biased estimates of the ‘social cost of carbon’ (which should really be called the net external cost of CO2 emissions, but whatever). One way you could fix this bias is to use an empirically justified distribution of climate sensitivity to determine the distribution and weighting of models. However, if the empirically justified distribution involves climate sensitivity values outside the range of the climate models (for example, perhaps there is a 5% chance of an ECS of ~1.5C), then one way you could get a climate model with an ECS of 1.5C is to tune the parameters of a pre-existing model.

In other words, tuning model parameters can give us models with ECS values that are plausible according to empirical evidence, but outside of the range of climate models.

“A better approach would be to demonstrate that there actually is a systematic bias. Good luck with that”

This is pretty easy to do. Just take the distribution of temperature trends for CMIP5 models under historical runs and compare that to observations. Compare the slope of the trends and determine the p-value. If there is no systematic bias, then one expects that the null hypothesis, that the trends are the same, will not be rejected.

If they were, this might be, but I don’t think they are.

I also don’t think your latter point is correct. The models represent a wide range of parameters and also initial conditions. The observations represent a single realisation. A difference between the model trends and observations could be a bias, but it could also simply be that the initial condition in reality produced a trend that happened to lie within one region of model space, but could have been in a different region had the initial conditions been different. I think there are assumptions in your suggested process that may not be correct.

“Why would this be the case if empirically derived distributions are more reliable?”

what makes you think the empirically derived distributions are more reliable?

“This is pretty easy to do. Just take the distribution of temperature trends for CMIP5 models under historical runs and compare that to observations.”

No, this is not correct, for the reason I have already pointed out (consider the thought experiment).

“The models represent a wide range of parameters and also initial conditions. The observations represent a single realisation.”

The problem is that there is little reason to believe that the distributions of these parameters and initial conditions are unbiased, or that the spread is not an underestimate, or that there isn’t significant bias due to specification error in the models used.

For example, there could be systematic bias in all models due to the fact that the grid size used is not infinitely small, and there is basically no way to get around this bias due to the limitation of computational resources. However, one could attempt to correct for this bias by comparing model output with observations.

The way I see it, if we have empirical evidence which can be used to correct for potential climate model bias and get a better distribution, then why not do that rather than use the raw distribution of climate models, which can have all sorts of problems?

“what makes you think the empirically derived distributions are more reliable?”

Because they avoid the systematic biases of climate models, they can avoid overfitting, etc. I would trust empirically derived distributions more provided that the methodology behind the empirically derived distributions is well justified (it’s really easy to specify a bad functional form and do empirical estimates poorly) than just taking the output of GCMs. Of course that doesn’t mean that you can’t get an even better distribution by correcting for climate model output with empirical evidence.

“No, this is not correct, for the reason I have already pointed out (consider the thought experiment).”

… I’ve done it in this past…

-1,

Because this requires assumptions about the observations that will almost certainly not be true. There’s no point in doing something that might narrow a distribution if you’re NOT confident that that narrower distribution is actually a reasonable representation of reality. It’s the classic precision versus accuracy issue. There’s not really much point in making something more precise if it’s not accurate (or, you’re not confident that it’s accurate).

This doesn’t make what you did the correct thing to do.

“… I’ve done it in this past…”

It is a shame when I take the trouble to explain why some line of reasoning is incorrect, with a thought experiment that shows that it is wrong, and yet the person refuses to engage with it and dismisses it with a flippant one-liner.

“Because they avoid the systematic biases of climate models”

which you haven’t demonstrated, because your reasoning is faulty, however you refuse to engage with the thought experiment, so there is little more I can do to help you see the flaw in your reasoning.

“Because this requires assumptions about the observations that will almost certainly not be true.”

Which assumptions in particular are you referring to?

“with a thought experiment that shows that it is wrong”

Could you please specify what you are referring to in this comment section as a thought experiment?

-1,

I think to do what you’re suggesting requires assuming that the observations represent some kind of mean of all possible observations, which may not be the case. In other words, the range of possible observed trends is likely bigger than the range of the observed trend plus uncertainty in that trend, and you don’t know where the currently observed trend should fit within that possible range.

I’ll try to explain what I think Dikran is getting at. He can correct me if I get it wrong, and I’ll do it the other way around. Let’s say you want to eliminate models. To do that, I think what you would have to do is run each model with initial conditions that cover the range of possible initial conditions, and with parameters that cover the range of possible parameters. If you do so and can show that your model results only match the observations in a small fraction of cases (say less than 5%) then you could possibly eliminate that model. However, eliminating models on the basis of a few runs not matching the observations closely enough is not really a suitable manner in which to eliminate models. You still wouldn’t know that that model was not a reasonable representation of how our climate responds to changes.

Ultimately, my point is that you shouldn’t be searching for ways in which to eliminate models (which is what you seem to be suggesting). You should really only eliminate models if you’re very confident that that model really has some kind of bias.

“Could you please specify what you are referring to in this comment section as a thought experiment?”

Consider…

“from a single climate model.. those models”

Could you please clarify? You talk about a single model, then you refer to other models that you haven’t defined yet. What are these other models in addition to the single model you refer to?

“all possible observations”

What do you mean by all possible observations? The only possible observations were the ones that were observed.

With respect to testing if the distribution of climate models is unbiased or not, what you can do is test the null hypothesis that the climate model distribution is unbiased and the model spread is a good estimate of the true spread. Then you can determine the p-value that you would get a set of observations as extreme or more extreme than what is observed given the null hypothesis. If the p-value is very small (say less than 5%) then I think it is reasonable to reject the null hypothesis.
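For concreteness, the kind of test being proposed might look like this simple two-sided z-test. All the numbers are invented for illustration, and note the built-in assumption: the ensemble spread is taken to correctly represent the spread of possible realisations, which is exactly the point under dispute in this thread.

```python
import math

# Sketch of the proposed test (numbers invented for illustration).
# Null hypothesis: the ensemble is unbiased AND its spread is right,
# so the observed trend behaves like one draw from N(mean, sd).
ens_mean, ens_sd = 0.21, 0.06    # K/decade: hypothetical ensemble stats
obs = 0.12                       # K/decade: hypothetical observed trend

z = (obs - ens_mean) / ens_sd                 # -1.5
p = math.erfc(abs(z) / math.sqrt(2.0))        # two-sided p-value, ~0.13

# p > 0.05 here, so the null is not rejected with these numbers. Note
# that even a small p would only show the observation sits in the tail
# of the ensemble; with a single realisation it cannot by itself say
# whether the ensemble is biased or reality was simply "unlucky".
```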

“However, eliminating models on the basis of a few runs not matching the observations closely enough is not really a suitable manner in which to eliminate models.”

Well if you have few runs, then your p-value will likely not be small enough to reject the null hypothesis even if the null hypothesis is not true.

The main thing to test here is not if an individual model is biased, but if the overall distribution of models is unbiased (because that is what people are using to construct confidence intervals, perform attribution, perform economic analysis, etc.).

Let’s try it a different way. Say we had Stargate SG1’s quantum mirror that allows us to travel to the Earths in parallel universes. Say we were good enough at operating the controls that we could restrict our travels to parallel universes in which the climate physics of the parallel Earths was identical to that of our Earth, and where the forcings were identical and the climate only differed as a result of the initial conditions (if there are an infinite number of parallel universes then there should be plenty of these). Would you agree that the true ECS of all of the parallel Earths would be identical to that of ours?

-1,

Let’s ensure that we agree about something. If we don’t, then continuing this may not be worth it. We do indeed have only one set of observations. However, that set of observations does not necessarily represent some kind of best estimate of all possible observations. In other words, if we had multiple Earths, each of which had very slightly different – but possible – conditions in the mid-1800s, the resulting temperatures for the period from then to now would not all be the same. Do you agree?

ATTP “In other words, if we had multiple Earths,”

That sounds like the thought experiment I am setting out above, perhaps the easiest approach would be for -1 to go through it step by step and get agreement at each step. If there is an error in the reasoning that explains the flaw in -1’s argument, then he/she will be able to point it out when it crops up.

Dikran,

Yes, I think we are heading in the same direction. Maybe what -1 could focus on is whether or not he agrees that if we could access multiple Earths with the same climate physics and the same geological history, but that only differed in terms of the precise conditions in the mid-1800s (i.e., as might be expected from variability), would the temperatures between then and now be identical, or would there be a non-zero range?

Alternatively I am happy to leave the exercise to you. In my experience people on blogs are generally unwilling to go through the argument step by step and explicitly agree where they can find no error as it means it limits wriggle room later, so I am happy not to expend the energy ;o)

“Would you agree that the true ECS of all of the parallel Earths would be identical to that of ours?”

Yes.

“However, that set of observations does not necessarily represent some kind of best estimate of all possible observations. Do you agree?”

Exactly. Which is why a discrepancy between observations and the average of climate model output does not necessarily imply that the climate models are biased. Which is why if you test for unbiased, you want to perform a statistical test and determine the p-value. One set of observations is sufficient to do such a test.

Maybe an analogy would be useful. Let’s say I wanted to know who has more support among the American people, Trump or Clinton. I poll 1000 people and find that Clinton is ahead. Just because Clinton is ahead in the poll doesn’t necessarily mean that Clinton has more support in the overall population. However, I can still construct a statistical test and determine the p-value under the null hypothesis that Clinton and Trump have the same support. If the p-value is small enough then it would be reasonable to reject the null hypothesis that Trump and Clinton have the same level of support in favour of the alternative hypothesis that Clinton has more support than Trump. And I can do this even though I don’t have all possible realizations of 1000 people (I only have 1 realization).
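The polling analogy can be written as a one-sided test using a normal approximation to the binomial. The 530/1000 split below is an invented illustration, not a real poll.

```python
import math

# The polling analogy as a one-sided test (the 530/1000 split is an
# invented illustration). Null hypothesis: support is 50/50.
n, k = 1000, 530
p0 = 0.5

# Normal approximation to the binomial distribution of k under the null.
z = (k - n * p0) / math.sqrt(n * p0 * (1 - p0))   # ~1.90
p_value = 0.5 * math.erfc(z / math.sqrt(2.0))     # one-sided, ~0.029

# p_value < 0.05: with this sample we would reject equal support in
# favour of Clinton having more support -- from a single poll.
```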

“Which is why if you test for unbiased” should read “Which is why if you test for bias”

-1 it would probably be better if you just answered ATTP’s questions and see where they take you. If you actively want to find the errors in your own reasoning, then the best approach is to listen to the person trying to explain them to you, rather than concentrate on reasons why you think you are right. Of course if ATTP is wrong, then the best way to show that he is wrong is to point out the error in his logic when it crops up. All else is just getting in the way at this point.

-1,

I think your analogy is wrong. Clearly you can do the test that you suggest, but in inferring something about the entire population, you’re essentially assuming that your 1000 people is large enough to be a representative sample. In other words, your result would not depend strongly on which group of 1000 people you happened to choose. This is not necessarily true for the temperature observations that we actually have. A different Earth could have a completely different temperature history.

Maybe I’ll go back a step, though. If your suggestion is that for each model we have a large number of runs that cover the range of initial conditions and the range of parameters, then (as I think I said earlier) you could do a test to see if that model was consistent with the observations at some level. However, we don’t have this. We have an ensemble of models runs with – in some cases – a few runs per model. I don’t think this is sufficient to eliminate individual models given that we don’t know that an individual model would not be able to match the observations we do have.

“A different Earth could have a completely different temperature history. ”

And a different set of 1000 people could give very different results even if both are representative.

What matters is using the results you have to test hypotheses.

“I don’t think this is sufficient to eliminate individual models given that we don’t know that an individual model would not be able to match the observations we do have.”

But this isn’t not what I’m suggesting. I’m not suggesting that we test individual models. I’m suggesting we test the overall model distribution. I.e. is the model distribution oversensitive?

No, I think this is wrong. Representative implies that you can use that sample to represent the entire population. If two different samples give different results, then they can’t both be representative.

But I think that is the issue. We already know that the observed temperatures lie within the overall model distribution. Given that the observations are a single realisation of all possible observations, I do not see how you can say anything about the bias in the model distribution from a single observational realisation.

For example, Steven McIntyre does such a test here:

https://climateaudit.org/2016/05/05/schmidts-histogram-diagram-doesnt-refute-christy/

Even if one disagrees with Steve’s methodology or with the dataset, etc., the point is that such statistical tests can be done.

-1 wrote “And a different set of 1000 people could give very different results even if both are representative.”

I think this is an indication that each step needs to end with a specific (preferably yes/no) question so that the line of argument is not continually deflected away onto something else.

-1,

Okay, you really do need to read what other people say. I’ve already pointed out that of course one can do the test. That isn’t the issue. The key issue is what that test tells you, not whether or not that test can actually be done.

My issue with SM’s methodology is that his two distributions are not – IMO – equivalent. One is the distribution of the model trends. The other is the observed trend plus its uncertainty (i.e., it’s the distribution of possible trends given a single observation). The latter is not necessarily the same as the distribution of all possible observed trends. Therefore, I do not think his test (even though he can actually do it) really suggests what he is claiming.

-1, ironically it was actually Schmidt, not McIntyre, that did the more appropriate test; again, to see why, consider the parallel Earths thought experiment (which was pointed out at CA). Nobody is saying that statistical tests can’t be done, it is just that they don’t mean what you think they mean (bias and consistency are not the same thing).

“If two different samples give different results, then they can’t both be representative.”

We disagree on this point. By representative sample, I mean in the statistical sense. I.e.

“A representative sample should be an unbiased indication of what the population is like.”

Or in other words, the expected value of such a sample is equal to the population mean. In practice, this generally means that you don’t have any reason to believe that the sample isn’t representative (so if you poll 1000 people and the geographic, age, sex, race, etc. distributions appear good then for all intents and purposes it is representative).

“We already know that the observed temperatures lie within the overall model distribution.”

They do lie within the distribution, but we can still determine a p-value of the observations based on that distribution.

Let’s say I flip a coin a million times and each time I get heads. Technically, a million heads is within the distribution of a fair coin. But if you get heads all one million times, is it not reasonable to conclude that the coin is not fair?
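In numbers: the fair-coin p-value for all heads is just 0.5^n. A million flips underflows a double, so this sketch uses 20 flips and the base-10 log for the million-flip case.

```python
import math

# Probability of n heads in n flips of a fair coin is 0.5**n.
n = 20
p_value = 0.5 ** n               # ~9.5e-7 already for just 20 flips

# A million flips underflows a float, but the base-10 log shows the
# scale: about 10**(-301030).
log10_p_million = -1_000_000 * math.log10(2.0)

# All-heads lies "within the distribution" of a fair coin, yet the
# p-value is far below any conventional threshold, so we reject
# fairness anyway.
```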

“I do not see how you can say anything about the bias in the model distribution from a single observational realisation.”

So you don’t see how you can do a statistical test and conclude something like: based on the results of the test, there is a 97.6% chance that the model distribution is not unbiased?

“The latter is not necessarily the same as the distribution of all possible observed trends.”

But it doesn’t need to be. You are trying to determine the probability that you would get observations as extreme or more extreme as what is observed given the distribution of climate models (i.e. obtain the p-value).

-1,

As far as I can tell, we agree about the term “representative”. If the sample is an unbiased indication of what the population is like, then I don’t think you would expect wildly inconsistent results if you were to repeat the test using a different sample of the same size.

Yes, but you seem to be changing the test. If the test is “fair”/“not fair” then you can clearly do that test and choose some threshold beyond which you would conclude that the coin was “not fair”. However, how would you quantify a bias? In other words, if you tossed a coin and produced a result that only had a small chance of happening, how would you distinguish between it being a fair coin that just happened to produce that result, and a coin with some kind of bias?

I think you can obviously do a test to compare the observations and models. That is obvious. What I don’t think you can do is draw strong conclusions about actual bias on the basis of that test.

-1,

You appear to be changing the test. Of course you can determine the p-value. What I’m suggesting is that you can’t necessarily draw strong conclusions about a bias based on this test. I’m not suggesting that you can’t do the test.

I may have asked you this before, but can you please try to actually read what others have written before responding? Maybe reading it more than once would be good.

-1 wrote “A representative sample should be an unbiased indication of what the population is like.”

Of course it has already been pointed out to -1 that the problem is variance, not bias, but this was quite a good way of derailing ATTP’s thought experiment. Who could have predicted that?

If -1 is willing to go through the chain of reasoning step by step and answer the questions posed to establish agreement (or explore disagreement) as we go along without digression, then I would be happy to have a go at going through my version of the thought experiment.

“how would you distinguish between it being a fair coin that just happened to produce that result, and a coin with some kind of bias?”

One approach is to calculate the p-value. If the p-value is below a certain significance level you reject the null hypothesis. The most common significance level used in many branches of science is 5%.

“What I don’t think you can do is draw strong conclusions about actual bias on the basis of that test.”

You can make a conclusion about the probability of getting a result as extreme or more extreme under the null hypothesis and then use that to reject / accept the null hypothesis.

Pretty sure they do something similar in AR5 to conclude that we are at least 95% certain that more than half of warming since 1950 was anthropogenic.

If one can’t draw strong conclusions about the bias in climate models as you claim, then does that mean that the IPCC could not draw a strong conclusion such as ‘we are at least 95% certain that more than half of warming since 1950 was anthropogenic’ in AR5?

“it has already been pointed out to -1 that the problem is variance, not bias”

Let me rephrase the null hypothesis: the null hypothesis is that the distribution is unbiased and that the variance is not an underestimate of the true variance.

And if you think that the CMIP5 variance is not good enough to test for bias in the distribution, then shouldn’t it also not be good enough to make confidence intervals about climate sensitivity, or to determine whether at least half of warming since 1950 was anthropogenic?

Yes, this is what I said. You can clearly define some threshold beyond which you would conclude that it is “not fair”. However, it’s binary; you either reject the hypothesis, or you don’t.

They rejected the hypothesis that it was more than 50% non-anthropogenic.

You can clearly run a test with the hypothesis that the model trends are inconsistent with the observations at the 95% level. It’s been done. They aren’t.

-1,

Just to be clear, you didn’t really answer this question.

-1 wrote “And if you don’t think that the CMIP5 variance is not good enough to test for bias in the distribution, then shouldn’t it also not be good enough to make confidence intervals about climate sensitivity …”

ATTP earlier wrote

“As far as the model spread being too small is concerned, my understanding is that the ECS range presented by the IPCC is formally a 95% range (extremely likely), but is presented as a 66% range (likely) in order to account for other uncertainties.”

So the IPCC obviously don’t think it is intrinsically good enough and hence downgrade their assessment of this confidence to account for this.

Ironically ATTP also wrote

“I may have asked you this before, but can you please try to actually read what others have written before responding? Maybe reading it more than once would be good.”

Testing for consistency is not the same as testing for bias. Say we roll a die and get a four, but we don’t know what kind of die it was: a d4 (i.e. a four-sided die), a d6 or a d8. If it was actually a d6, then a sample from a d4 or a d8 would give a biased distribution. But given one die roll, can we tell if any of the three distributions are biased? No; that is because (like climate) we only have one realisation of a chaotic process to observe. Of course we can test for consistency: in this case the observed roll is consistent with all three distributions/models, whereas had we rolled a 6 it would have been inconsistent with the d4 distribution.

Consistency and bias are not the same things and you generally can’t make strong statements about bias from the result of a test for consistency.
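Dikran’s die example in a few lines of Python (a minimal sketch; the helper name is mine, not from the discussion): a single roll can establish inconsistency with a model, but it carries no information about bias among the models that remain consistent.

```python
# One observed roll tells us about consistency, not bias: a 4 is a
# possible outcome of a d4, a d6 or a d8, so none of the three "models"
# can be rejected, whereas a 6 would rule out the d4 only.
def consistent(roll, sides):
    """A roll is consistent with a die if that die can produce it."""
    return 1 <= roll <= sides

for sides in (4, 6, 8):
    print(f"d{sides}: roll of 4 consistent? {consistent(4, sides)}, "
          f"roll of 6 consistent? {consistent(6, sides)}")
```

Nothing in the single roll distinguishes the d4, d6 and d8 once all three pass the consistency check, which is the point of the analogy.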

“However, it’s binary; you either reject the hypothesis, or you don’t.”

Yes… are you implying that there is something wrong with that?

“It’s been done. They aren’t.”

I’ve done it and it is… Could you please provide a link to someone performing a test and getting a different result?

“how would you distinguish between it being a fair coin that just happened to produce that result, and a coin with some kind of bias?”

Did I misunderstand your question? You can test for bias? Or by distinguish do you mean know with 100% certainty? Well, obviously you can’t know with 100% certainty, just like we don’t know with 100% certainty that the Higgs boson exists or that bowling balls fall down on earth (maybe they fall up 50% of the time and we’ve just been really, really lucky). It makes sense to form beliefs based on Occam’s Razor.

Let’s assume the observations have a trend bias of 0.2°C per century either due to natural variability or due to inhomogeneities we did not fully remove.

Let’s assume we have models that are unbiased, thus a good statistical test would let them pass.

Now if you simply do a simple t-test whether the mean of the distribution of the models fits to the observations, your test will say the null-hypothesis is rejected if you only have enough models. In our thought experiment, we assumed the models were unbiased, thus this result is wrong.

You thus actually have to study the uncertainties in the observations (because it has biases) and the models (because model spread is not uncertainty) and cannot simplistically compare the time series.

“Testing for consistency is not the same as testing for bias. Say we roll a die and get a four, but we don’t know what kind of die it was: a d4 (i.e. a four-sided die), a d6 or a d8. If it was actually a d6, then a sample from a d4 or a d8 would give a biased distribution.”

This is a terrible analogy. There is no information in a single die roll about whether or not the die has a biased distribution. You are comparing tests where there is information to test a hypothesis with a situation where there is literally ZERO information in your sample to test your hypothesis.

“Now if you simply do a simple t-test whether the mean of the distribution of the models fits to the observations, your test will say the null-hypothesis is rejected if you only have enough models. In our thought experiment, we assumed the models were unbiased, thus this result is wrong.”

Not necessarily. The climate model runs contain natural variability. I think you are confusing unbiased with no natural variability. Unless you are referring to a case where you get extreme levels of natural variability, in which case rejecting the null hypothesis under such an extreme case is reasonable.

“You thus actually have to study the uncertainties in the observations (because it has biases) and the models (because model spread is not uncertainty) and cannot simplistically compare the time series.”

Yes, you should take these uncertainties into account. Did anyone say you shouldn’t?

-1,

No, but it doesn’t tell you how biased something is. It tells you – based on a judgement – whether you reject your hypothesis, or not.

If you’ve done it, why not show your analysis. My understanding is that the observed trend is not outside the 95% confidence interval of the model trends.

-1,

Try reading Victor’s comment again. I don’t think you’ve got his point.

“No, but it doesn’t tell you how biased something is.”

You can still get an estimate of the bias.

“If you’ve done it, why not show your analysis.”

You want me to do a guest post?

“My understanding is that the observed trend is not outside the 95% confidence interval of the model trends.”

My guess is that you are just looking at whether the data for each year is inside the 95% confidence level for each year and are ‘eyeballing’ it (sorry if I am wrong, I do not mean to cause offense), rather than perform a proper statistical test. A few years outside the annual 95% confidence interval does not imply you reject the null hypothesis and alternatively, even if every year is inside the annual 95% confidence level it is possible to reject the null hypothesis.
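-1’s claim that every year can sit inside the annual 95% band while a joint test still rejects is easy to demonstrate (a toy sketch with made-up numbers, not real data): feed a small but steady drift into a comparison against a zero-trend, unit-spread baseline.

```python
import math

N = 70        # years of annual data
SIGMA = 1.0   # assumed year-to-year spread of the model band (illustrative)

# Observations that stay inside the pointwise 95% band (|y| < 2*SIGMA)
# every single year...
obs = [0.025 * i for i in range(N)]
assert all(abs(y) < 2 * SIGMA for y in obs)

# ...but whose overall trend a single joint test rejects: with known
# white noise of sd SIGMA, the OLS slope has standard error SIGMA/sqrt(Sxx).
xbar = (N - 1) / 2
sxx = sum((i - xbar) ** 2 for i in range(N))
slope = sum((i - xbar) * obs[i] for i in range(N)) / sxx
se = SIGMA / math.sqrt(sxx)
print(f"slope t-statistic: {slope / se:.1f}")  # well beyond 2
```

Seventy pointwise checks and one trend test are simply different tests, which is why eyeballing annual bands cannot settle the question.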

“Try reading Victor’s comment again. I don’t think you’ve got his point.”

I probably don’t. But it is a bit unclear. Particularly, the distribution of natural variability according to climate models is not defined so I don’t know how extreme the 0.2 C effect of natural variability is relative to the overall distribution.

-1,

Hmm, no I don’t, but thanks for offering. I was hoping you could point to some analysis.

No, I’m suggesting that for the surface temperature datasets, the trends are consistent with the model trends at the 95% level.

Victor’s point is that we have observations of one actual realisation which might have been influenced by internal variability at the level of a few tenths of a K per century. Now imagine we have an unbiased model. We can run that model many times with slightly different initial conditions. If we did so, we could substantially reduce the error on the mean. You now have a very precise mean trend from the model runs and a single trend (with uncertainties) from the observations. If you were to do a t-test on these two distributions, it would almost certainly fail even though the model is unbiased.
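This can be sketched numerically (all numbers below are illustrative assumptions, not values from any actual model or dataset): give an unbiased model a large ensemble, let the single “observed” realisation carry a 0.2 trend contribution from internal variability, and a naive t-test rejects the perfect model almost surely.

```python
import math
import random
import statistics

random.seed(42)

TRUE_TREND = 1.0   # the forced trend (arbitrary units)
RUN_SPREAD = 0.2   # run-to-run trend spread from internal variability
N_RUNS = 10_000    # many initial-condition runs of the same model

# An unbiased model: each run's trend is the true trend plus a draw of
# internal variability.  With many runs the standard error of the
# ensemble-mean trend becomes tiny.
runs = [random.gauss(TRUE_TREND, RUN_SPREAD) for _ in range(N_RUNS)]
ens_mean = statistics.mean(runs)
ens_sem = statistics.stdev(runs) / math.sqrt(N_RUNS)

# A single observed realisation whose trend happens to include a +0.2
# contribution from unforced variability (Victor's hypothetical bias).
obs_trend = TRUE_TREND + 0.2

# Naive t-statistic of the observation against the ensemble mean: it is
# huge, so the 'test' rejects a model we constructed to be unbiased.
t_stat = (obs_trend - ens_mean) / ens_sem
print(f"t = {t_stat:.0f}")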

Here’s a post that seems relevant.

“No, I’m suggesting that for the surface temperature datasets, the trends are consistent with the model trends at the 95% level.”

And I’m suggesting it isn’t. I would point to the work of others, but I don’t know of anyone who does it properly. I can point to a lot of people eyeballing things and handwaving that observations are consistent with models.

If you change your mind, let me know about the guest post.

“If you were to do a t-test on these two distributions, it would almost certainly fail even though the model is unbiased.”

That’s sort of the point of a confidence level. If you have a 95% confidence level then there is a 5% chance that when you reject the null hypothesis that the null hypothesis is true. The case you refer to is going to be that 5% extreme case.

-1,

I know you’re suggesting it, but you haven’t really shown it.

No, the point is that you can do what might seem like a valid statistical test that even a perfect model would almost certainly fail.

This Realclimate post also seems relevant.

“Here’s a post that seems relevant.”

Looking at it, Cato does not use a correct methodology to demonstrate that observations disagree with climate models, but Stickman doesn’t demonstrate that observations do not disagree with climate models (rather, he just shows that the Cato test fails, but it’s not a valid test anyway). You want to test the overall trend, thus a single p-value, not 70 p-values.

-1,

I’m not sure what your real criticism is. Stickman’s Corral is showing that for end years of 2005, 2000, and 1995 trends of all lengths up to about 70 years fall within the 95% confidence interval of the model trends.

“but you haven’t really shown it.”

I can suggest it in a guest post… 🙂

“No, the point is that you can do what might seem like a valid statistical test that even a perfect model would almost certainly fail.”

Yeah… that’s why we don’t call it a 100% confidence interval.

“This Realclimate post also seems relevant.”

Oh, that’s a great post by Gavin.

To be fair though, based on the histograms of the TMT trends, the satellite data is pretty close to being inconsistent with climate models at the 95% confidence level. It’s pretty hard to tell whether it is or not by eyeballing. Unfortunately, Gavin does not perform a proper statistical test and obtain an overall p-value.

With respect to Gavin’s confidence intervals for the TMT trends according to satellite data, he uses a standard OLS approach, which is invalid in this case because there is significant autocorrelation in temperature trends. It would be far better to take autocorrelation into account (i.e. do a Cochrane-Orcutt regression). And because of this Gavin’s confidence intervals for the satellite data trends are too large.

“Stickman’s Corral is showing that for end years of 2005, 2000, and 1995 trends of all lengths up to about 70 years fall within the 95% confidence interval of the model trends.”

That doesn’t show that the overall trend is in agreement with observations, rather that if you take any 2 years, they are in agreement. There is far more information in 150 years of data than in just 2 years of data.

My point is that you should do a single test that takes all of the data into account, not 70 tests that each only take a small portion of the data into account. The Cato test is simply not an appropriate test for model distribution bias.

No, that’s not the reason.

That’s true for some of them, but would you really reject the models when there is so much disagreement amongst the observational datasets?

His 70-year trends are not based on 2-years worth of data, but on 70 years worth of data.

“I can suggest it” should read “I can show it”

“No, that’s not the reason.”

You are referring to an extreme case. The distribution of climate model output includes uncertainty due to natural variability. So extreme cases of natural variability will cause a rejection of the null hypothesis even if the null hypothesis is true, but this is represented by the confidence level.

“would you really reject the models when there is so much disagreement amongst the observational datasets.”

Is there really? NOAA, NASA-GISTEMP, Berkeley Earth, Cowtan & Way, JMA all seem pretty consistent to me.

“His 70-year trends are not based on 2-years worth of data, but on 70 years worth of data.”

Oh, sorry, I misunderstood. However, in this case the choice of OLS is invalid because there is significant autocorrelation in the data. A Cochrane-Orcutt regression is more appropriate.

Actually this is pretty much my main criticism with most statistical tests using OLS to test if climate models are biased or not. There is significant autocorrelation in temperature data.

-1,

No, try reading it again. Victor is referring to an example of what might seem like a valid test, but that a perfect model would almost certainly fail. It has nothing to do with internal variability.

You do realise that auto-correlation does not influence the mean trend. It’s auto-correlation in the residuals. Also, I don’t know if he considered this when determining his confidence interval, since he doesn’t actually seem to say (at least, not that I can tell).

““No, the point is that you can do what might seem like a valid statistical test that even a perfect model would almost certainly fail.”

Yeah… that’s why we don’t call it a 100% confidence interval.”

If you want proof that -1 isn’t paying much attention to what is being written, this is it.

“It has nothing to do with internal variability.”

Uhh… to quote Victor: “Let’s assume the observations have a trend bias of 0.2°C per century either due to natural variability”

“You do realise that auto-correlation does not influence the mean trend.”

It influences the estimate of the uncertainty of the mean trend. In particular, it causes an overestimate of the uncertainty.

“I don’t know if he used considered this when determining his confidence interval”

He uses the same statistical model as Cato, which doesn’t take autocorrelation into account. Thus Stickman doesn’t take autocorrelation into account.

“That’s sort of the point of a confidence level. If you have a 95% confidence level then there is a 5% chance that when you reject the null hypothesis that the null hypothesis is true. ”

No, this is an elementary statistics error called the p-value fallacy. A frequentist test can never tell you the chance/probability that a given hypothesis is true or false, for the simple reason that a given hypothesis has no (non-trivial) long-run frequency (which is how frequentists define probabilities). I would recommend you get a copy of Grant Foster’s book “Understanding Statistics”; it is a really good primer on basic statistics that will help avoid errors like this one.

@ Dikran – sorry, I should have worded that differently. ‘Given that the null hypothesis is true, there is a 5% chance that you will reject it.’ Happy now?

-1,

Yes, but that means that a perfect model will almost always fail the test that Victor suggests. It has nothing to do with internal variability in the models, which is what you seem to be suggesting.

Indeed, but if the mean trend lies within the 95% confidence interval of the models, it seems unlikely (impossible) that changing how you determine the uncertainty in the trend is going to influence the test result.

” It would be far better to take autocorrelation into account (i.e. do a Cochrane-Orcutt regression). And because of this Gavin’s confidence intervals for the satellite data trends are too large.”

Surely the autocorrelation means that the confidence intervals are too narrow, rather than too large. Autocorrelation means there is less information in the data than is implied by the number of samples; less information means more uncertainty and hence wider confidence intervals. BTW the autocorrelation issue is well known.
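The direction of the effect is easy to check by simulation (a rough sketch with arbitrary parameters): fit OLS trends to trend-free AR(1) noise and the nominal white-noise intervals reject a true null far more often than 5% of the time, i.e. they are too narrow.

```python
import math
import random

random.seed(1)

def ols_slope_se(y):
    """OLS slope of y against 0..n-1, with the usual white-noise
    standard error (which ignores autocorrelation)."""
    n = len(y)
    xbar = (n - 1) / 2
    ybar = sum(y) / n
    sxx = sum((i - xbar) ** 2 for i in range(n))
    slope = sum((i - xbar) * (y[i] - ybar) for i in range(n)) / sxx
    resid = [y[i] - ybar - slope * (i - xbar) for i in range(n)]
    s2 = sum(r * r for r in resid) / (n - 2)
    return slope, math.sqrt(s2 / sxx)

def false_alarm_rate(phi, trials=2000, n=100):
    """Fraction of trend-free AR(1) series whose OLS slope looks
    'significant' at roughly the 95% level (|t| > 2)."""
    hits = 0
    for _ in range(trials):
        y, prev = [], 0.0
        for _ in range(n):
            prev = phi * prev + random.gauss(0, 1)
            y.append(prev)
        slope, se = ols_slope_se(y)
        if abs(slope / se) > 2:
            hits += 1
    return hits / trials

print("white noise  :", false_alarm_rate(0.0))  # close to the nominal 0.05
print("AR(1), phi=.6:", false_alarm_rate(0.6))  # far higher: CIs too narrow
```

Corrections such as Cochrane–Orcutt or an effective-sample-size adjustment (as mentioned in the thread) widen the interval back toward the nominal rate.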

I’ve just done a 60-year HadCRUT4 trend ending in 2005 from the Skeptical Science Trend Calculator. It comes out at 0.084 +- 0.023 K/decade. Eyeballing it, it looks similar to what is in the Stickman’s Corral post.

“‘Given that the null hypothesis is true, there is a 5% chance that you will reject it.’ Happy now?”

That is better. So what does that tell us? Actually not very much, which is why people tend to misinterpret frequentist tests as if they were Bayesian ones and draw faulty conclusions. Note the “if the null hypothesis is true” bit.

““You do realise that auto-correlation does not influence the mean trend.”

It influences the estimate of the uncertainty of the mean trend. In particular, it causes an overestimate of the uncertainty.”

The fact that it does not influence the mean trend means that it has no effect on the bias of the estimator (which depends only on the mean). Autocorrelation causes OLS to UNDERestimate the uncertainty, not overestimate it.

==> it is a really good primer on basic statistics that will help avoid errors like this one. ==>

That’s reminiscent of Nic Lewis’ or Steve McIntyre’s rhetoric. Of a sort of which I am very critical. Do you really think that minus 1 lacks knowledge of basic statistics?

To quote Michelle Obama, when they go low we go high.

Joshua, yes, repeating the p-value fallacy and the other statistical errors made while I was off for a swim are indeed indicative that -1 has issues in his understanding of basic statistics, on topics that Foster’s book explains very well. The recommendation was a genuine one, not rhetoric. There is no shame in not knowing or understanding something, and being pointed to a good source of information is usually considered helpful.

I do however admit to being a bit terse with -1, largely because of his avoidance of the flaw in the argument may (sic) have made me a bit irritable, so there is some truth in the Michelle Obama quote. I stand admonished, my apologies to -1.

@ ATTP – “It has nothing to do with internal variability in the models”

Victor refers to natural variability with respect to the 0.2 C ‘bias’ or whatever. If you don’t think Victor’s example involves natural variability, then I have no idea what your interpretation of Victor’s poorly worded hypothetical is. So I suggest if that if you wish to continue with this hypothetical, you more clearly define it so that we are at least on common ground as to what Victor is referring to.

“if the mean trend lies within the 95% confidence interval of the models, it’s seem unlikely (impossible) that changing how you determine the uncertainty in the trend is going to influence the test result.”

True. Some of the cato values lie in the 95% CI.

To be fair, with respect to the tests I have done in the past, I used 150+ year trends, rather than 60 year trends. So maybe that is why there is a difference. (alternatively, I may be underestimating the spread of climate model output, although I don’t think this is the case)

“Surely the autocorrellation means that the confidence intervals are too narrow, rather than too large.”

Not necessarily true, because the estimate of the variance of the residuals also decreases.

“TE I think your definition of “quickly diverge” diverges from mine”

Well, let’s keep it in terms of the model runs, then.

The chart indicates that the model runs for a given scenario diverge from one another much more when predicting the future than they do when predicting the past.

That’s because they’re all tuned to the past. But evidently, that doesn’t help with predicting the future.

ATTP should be more than a little upset by all this, because tuning isn’t physics, is it?

And Then There’s Parameter Twiddling?

This isn’t anything new, of course. The atmosphere isn’t predictable, as even the IPCC noted.

It is misinformation to the public to intimate that climate is predictable.

Incidentally, it is pretty much impossible to create a coin that is asymmetrical (having heads and tails) that is exactly unbiased, so we will always be able to reject the null hypothesis that the coin is exactly unbiased: all we need to do is observe enough coin flips and eventually the p-value will be low enough. This is a common problem with standard frequentist NHSTs, especially with point null hypotheses. We know the coin is biased a priori, because all coins are biased to some extent. The real question is whether the effect size (the degree of bias) is high enough to be an issue.

Similarly, if we want to compare, say, the model mean and the observations: the model is necessarily a simplification of reality, so it is unreasonable to expect the ensemble mean to exactly match the observations. This means the test for whether the model mean is exactly the same as the observations is not the most meaningful test. There is also of course the point that we wouldn’t actually expect the observations to lie any closer to the ensemble mean than a randomly chosen ensemble member (which is why Schmidt’s test is the more meaningful one), but that is another matter.

-1,

I don’t really care. You read it again, if you want to.

As do all of those for end years of 2005, 2000, 1995, suggesting that the choice of end year can have an impact on the test.

TE,

Tuning is unavoidable, as the paper makes clear.

I don’t think anyone suggests that it is. Strawman?

-1: “Victor refers to natural variability with respect to the 0.2 C ‘bias’ or whatever. If you don’t think Victor’s example involves natural variability, then I have no idea what your interpretation of Victor’s poorly worded hypothetical is.”

My example referred to a bias due to natural variability or remaining inhomogeneities. You somehow cut off that last part when you quoted me. In fact the argument is valid for a trend bias for any reason and thus not really related to natural variability.

“Not necessarily true, because the estimate of the variance of the residuals also decreases.”

http://iopscience.iop.org/article/10.1088/1748-9326/6/4/044022

Sorry, I don’t think that I can continue the discussion in a civil manner if you are going to continue to respond in this way to people trying to help you see the flaw in your reasoning.

TE wrote “TE I think your definition of “quickly diverge” diverges from mine

Well, let’s keep it in terms of the model runs, then.

The chart indicates that the model runs for a given scenario diverge from one another much more when predicting the future than they do when predicting the past.

That’s because they’re all tuned to the past. But evidently, that doesn’t help with predicting the future.”

No, it is because they are all running under the same scenario in the past (as we have observed emissions etc.), but the scenarios diverge in the future (which is kind of the point of having more than one of them). Note the different RCP numbers.

The divergence within an RCP is not clearly shown in the diagram.

Note also that baselining is not the same as tuning; the grey bit looks like the baselining period, where offsets are added to each run to explicitly minimise the differences, so of course they diverge from the point where they were artificially made to converge!

-1,

From the paper Dikran highlights,

“so we will always be able to reject the null hypothesis”

You sure about that? Can you prove no fair sided coin exists?

“You read it again, if you want to.”

I don’t see a conflict. Oh well.

“As do all of those for end years of 2005, 2000, 1995, suggesting that the choice of end year can have an impact on the test.”

Indeed. Which is why you don’t want to base the choice of the time interval on something arbitrary. Using all the data available isn’t arbitrary and is the best way to reduce the magnitude of such impacts.

“ATTP should be more than a little upset by all this, because tuning isn’t physics, is it?”

Funnily enough, I am reading Alan Guth’s book “The Inflationary universe” at the moment, which has a fair bit of discussion of free parameters. Seemed like physics to me.

-1 wrote ““You read it again, if you want to.”

I don’t see a conflict. Oh well.”

This really sums up the problem, if someone tries to point out a flaw in your reasoning and you don’t see the conflict then the correct response is to try and get them to explain it to you again, not “Oh well”.

@ Victor –

“My example referred to a bias due to natural variability or remaining inhomogeneities.”

By inhomogeneities, do you mean inhomogeneities in the temperature data set?

@ Dikran –

“Sorry, I don’t think that I can continue the discussion in a civil manner if you are going to continue to respond in this way to people trying to help you see the flaw in your reasoning.”

Sorry, I’m not familiar with this correction factor. Do you know where I can find the derivation of it?

To just try and clarify Victor’s point. Unless I’m mistaken, what he was referring to was an unknown bias in the observations (internal variability influencing the long-term trend, homogenization bias). Therefore you have a single observational dataset that may be biased in some way that you don’t know. You can run your model many, many times with different initial conditions and produce a very precise estimate of the mean model trend. If you then test this against the observational trend, it will almost certainly fail, even if the model is unbiased.

-1,

Dikran linked to a paper, the appendix of which makes very clear that autocorrelation makes the standard error larger than if the noise is assumed to be white noise.

“You sure about that? Can you prove no fair sided coin exists?”

You may conceive of a perfectly unbiased coin, but you won’t be able to manufacture it in the real world; there will inevitably be some deviation from the design, which will introduce some random bias. Say this bias is distributed according to a zero-mean (so the bias is unbiased) Gaussian distribution (just to make things easy). Given a standard normal distribution, what is the probability of drawing a sample that is exactly zero?

“Sorry, I’m not familiar with this correction factor. Do you know where I can find the derivation of it?”

that really is the last straw 😦

Dikran –

Not to be a tone/concern troll, and I won’t further distract from the technical discussion beyond this comment…but…

I can think of many reasons other than a lack of basic knowledge that might lead minus one to make a fundamental error in his reasoning (assuming that’s the case, I wouldn’t know).

I can’t come close to parsing the technical discussion, but I have a strong sense that it isn’t plausible that minus one lacks basic knowledge of statistics. I highly doubt you and Anders would be engaging him at this level if that were the case.

The argument that his error is because he lacks basic knowledge is of a sort that I often see from “skeptics” – often when I actually do have the direct knowledge needed to know just how fallacious those arguments are.

Yes, you apologized, but IMO, it isn’t really about apologizing. It’s about recognizing the weakness of your own assumption, whether driven by annoyance, or minus one refusing to acknowledge an error, or whatever. IMO, justification doesn’t really address the underlying problem. I say this because I want to rely on your expertise as a guide for helping me to parse these discussions, and I gain confidence in people when they can recognize and acknowledge, what seems to me to be, basic problems in reasoning.

-1: “@ Victor – ‘My example referred to a bias due to natural variability or remaining inhomogeneities.’ By inhomogeneities, do you mean inhomogeneities in the temperature data set?”

I wrote: “Let’s assume the observations have a trend bias of 0.2°C per century either due to natural variability or due to inhomogeneities we did not fully remove.” Naturally I was talking about observations.

Joshua may not like it, but I am able to judge the technical quality of the comments of -1 and I would be dishonest to be assuming good faith.

[Mod: redacted]

“The divergence within an RCP is not clearly shown in the diagram.”

Sure it is.

Look at all the dark blue lines. For 2050, they range from about 0.5C to 1.8C. ( 1.3C range )

Look at all the red lines. For 2050, they range from about 1.0C to 2.5C. ( 1.5C range ).

Look at all the past gray lines. For 1985, they range from about -0.5C to 0C. ( 0.5C range ).

“To just try and clarify Victor’s point. Unless I’m mistaken, what he was referring to was an unknown bias in the observations (internal variability influencing the long-term trend, homogenization bias).”

If it’s due to internal variability, given that internal variability is part of the uncertainty of the model distribution, I don’t see this as an issue.

If it’s due to homogenization, this is obviously an issue. I don’t really see any reason why there would be such a large homogenization bias, especially when the different temperature data sets homogenize in different ways. If there is no evidence of a homogenization bias, I don’t see a good reason to believe in one.

“Dikran linked to a paper, the appendix of which makes very clear that auto-correlation make the standard error larger”

I was interested in the derivation though. Maybe I have to look through Lee and Lund 2004.

“But you won’t be able to manufacture it in the real world”

How do you know? You can’t prove that I can’t manufacture such a perfect coin.

Heck, how about creating a coin of a few atoms in a ring? Let’s say hypothetically I create a ring of silicon and carbon which is (below are double bonds)

C

/ \

C Si

| |

Si Si

\ /

C

on one side and when flipped over looks like:

C

/ \

Si C

| |

Si Si

\ /

C

Seems like such a ring would act as a fair sided coin.

Or maybe I could use a sheet of benzene or something instead.

Sorry, I meant a graphene sheet, not a benzene sheet.

You’re still not reading his example carefully enough. I also can’t explain any more clearly than I have on more than one occasion.

Joshua “I can think of many reasons other than a lack of basic knowledge that might lead minus one to make a fundamental error in his reasoning (assuming that’s the case, I wouldn’t know). ”

No, if you fall into the p-value fallacy, that does mean you don’t have a good grasp of basic statistics (not a problem, most people don’t, including many working scientists). It is central to understanding what null hypothesis statistical tests do and what they mean (and more importantly, what they don’t mean). It wasn’t as if it was the only statistical error made while I was having my lunchtime swim.

” I highly doubt you and Anders would be engaging him at this level if that were the case.”

Actually both ATTP and I tried to engage him/her with a much more elementary thought experiment that unfortunately -1 wouldn’t engage with, and instead started discussing the technical details of statistical tests. I’d much rather we went back to the basics, because the statistics really isn’t the problem; the problem is understanding the physics so you know what statistics are sensible. It is -1’s choice to be discussing the technical statistics, not mine, and I suspect not ATTP’s either.

-1,

The latter part of your most recent comment just seems like pedantry.

I agree.

“Seems like such a ring would act as a fair sided coin.”

sorry, you are just taking the p*** now. I was discussing a general problem with point null hypotheses and it is obvious that you are engaging in this pedantic “badinage” about unbiased coins just to avoid discussing the problem with point null hypotheses.
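The general problem with point null hypotheses can be illustrated with a quick simulation (the 0.51 bias is an arbitrary illustrative value, not a claim about any real coin): for any fixed nonzero bias, the p-value of the exact-fairness null shrinks as the number of flips grows, so with enough data the point null is always rejected.

```python
import math
import random

random.seed(7)

def fair_coin_pvalue(heads, n, p0=0.5):
    """Two-sided normal-approximation p-value for H0: P(heads) = p0."""
    z = (heads - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

P_TRUE = 0.51  # a tiny but nonzero bias (an arbitrary assumption)

# For a fixed real bias, the p-value of the exact-fairness point null
# shrinks as the number of flips grows, so it is eventually rejected.
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < P_TRUE for _ in range(n))
    print(f"n = {n:>9}: p-value = {fair_coin_pvalue(heads, n):.3g}")
```

So “can you prove no fair coin exists?” misses the point: the practical question is the effect size, not whether the point null is exactly true.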

OK, gotta respond to this…then I’ll let it go.

==> No, if you fall into the p-value fallacy, that does mean you don’t have a good grasp of basic statistics (not a problem, most people don’t, including many working scientists). ==>

When I read this, again, I think of what I read at someplace like Lucia’s crib, where people are absolutely convinced that Anders lacks basic knowledge on any variety of issues. I find it quite amusing that people throw that kind of accusation around so frequently, when it so frequently seems completely implausible to me. The example I remember quite well in particular was when one of Judith’s “denizens” insisted that Pekka had no idea what he was talking about (I believe when discussing renewable energy in Scandinavia).

A brainfart could be one explanation. A misunderstanding could be an explanation (you and I have exchanged views about how that dynamic can play out, and precautions to take to address that possibility). Biased reasoning could be an explanation (one that I happen to think explains a lot in these discussions).

I’ll repeat, I can’t say for sure, but it seems entirely implausible to me that minus one lacks a basic understanding of statistics. I would highly doubt that he couldn’t provide evidence of expertise, at least to the level of a basic understanding.

I’ll also point out that pedantry and taking a piss could also be explanations that seem entirely plausible to me.

“I was discussing a general problem with point null hypotheses”

Uhh… You were referring to “so we will always be able to reject the null hypothesis that the coin is exactly unbiased”. I asked why you think you could do this a priori and initially thought you meant in theory. But then you said “but you won’t be able to manufacture it in the real world”. Then I gave an example of a real world fair sided coin.

So forgive me if I’m lost as to why you think that you can exclude a priori the possibility that a coin is fair sided.

-1,

Come on, no one would regard a ring of silicon and carbon as a real-world fair coin. We’re using words as they’re normally used. Finding ways to continually disagree with people just gets a bit much after a while.

@ ATTP – to relate it back to the discussion, I was under the impression that Dikran was arguing that one can exclude a priori the possibility of a fair sided coin and similarly the possibility that the distribution of climate model output is perfectly unbiased. Is this accurate?

Joshua: “When I read this, again, I think of what I read at someplace like Lucia’s crib, where people are absolutely convinced that Anders lacks basic knowledge on any variety of issues.”

Karl Rove strategy #3: Accuse your opponent of your own weakness

This strategy is so successful because people who do not want to dive into the technical argument (nearly everyone) will just see two groups making the same claim. I am afraid that at some point you will need to go into the details, will have to make an effort, at least once in a while to sample the credibility of a source.

Joshua If I joined in a discussion of (say) biblical hermeneutics where one contributor suggested that agápē could mean brotherly love and someone else pointed out that they had fundamental misunderstandings and recommended a basic textbook, it would be sheer hubris for me to say that they shouldn’t assume a lack of basic understanding, e.g. it could just be a brainfart. I don’t know enough about hermeneutics to know, and I know it, so I would keep quiet.

In a world with Wikipedia it isn’t that difficult to give a lay audience an impression of technical competence without it actually having foundations. I’m not saying that is definitely what is happening here, but -1’s responses are indicative (e.g. badinage about graphene coins in order to evade the technical problem with point null hypotheses).

I am disappointed.

Anyway, thanks for the link to Lee and Lund 2004. It answered my question and I retract my earlier claims of autocorrelation causing an overestimation of the confidence intervals.

Apologies for continuing the badinage, but I can’t help noticing that with -1’s graphene coin, being perfectly symmetrical, one would be unable to tell which side is “heads” and which is “tails”…

Phil, I think the point was that it wasn’t actually graphene, but a ring of carbon and silicon, which gets rid of the rotational symmetry, so you could use it as a coin. Rather difficult to keep in your pocket though, and they are not negotiable currency, because the Galactibanks refuse to deal in fiddling small change.

@ Phil – Depends on the shape of the graphene. For example, if you had an F shaped graphene coin, then you could determine which side is which.

Perhaps an example of a bias in observation that Victor cites could be the sea surface temperature data. Because of the changing methodology and coverage it is known to have a bias with a positive warming trend. A model that is tuned to that trend would appear skillful…

Adjusting the observations to remove the methodological biases however is difficult without an independent ‘Gold standard’.

This is an ongoing problem with satellite measurements. The derivation of temperature from sensor data is complex and has gone through numerous changes. It still may be in contradiction with other means of measurement, or it could be accurate. If only we could use models to determine what the troposphere ‘SHOULD’ be doing!

Graphene rings are not flat. One of the valence electrons sticks out of the plane. Which side of the plane it protrudes from is a quantum process. Makes it far more asymmetrical than any coin.

izen, interesting, glad I’ve learned something from the badinage ;o)

Anders,

I would say it isn’t a strawman. But this is: there are two kinds of error bounds in contrarianville:

1) Too large to be useful.

2) Too small to be credible.

> I guess I’d say [model tuning]’s avoidability depends on the urgency of building a model and the available info to build it.

I’m quite sure it’s a wrong guess: it mainly depends upon the size of the state space.

VV wrote “I am afraid that at some point you will need to go into the details, will have to make an effort, at least once in a while to sample the credibility of a source.”

Indeed, the “interpreter of interpretations” illustrates that you do actually need to read the peer-reviewed papers and understand them if you are to be a useful interpreter of interpretations. To paraphrase Arthur C. Clarke “Any sufficiently advanced bullshit is indistinguishable from technical competence” and generally only those that understand the technology in question can readily see the difference.

Okay, I retract my earlier claim, there was an error in my code.

The full range of Cowtan and Way observations do not falsify at the 95% confidence level the distribution of climate models when I compare the rates of warming.

I also checked acceleration, and again it does not falsify the hypothesis that the distribution of climate models is unbiased.

I still suspect the climate model distribution is not unbiased (many reasons to believe this is the case), but the variance is so large it can’t be shown based on the observations we have.

One thing I did was treat climate models as mere predictors of temperature and treat the true model as a linear combination of the climate models plus error (similar to what Annan and Hargreaves do in their paper on warming since the LGM). If I do this and use historical model runs that cover 1851-2012 (since this is the longest period available, and doing so avoids overfitting issues), and use the published ECS and TCR values for each model then I can use the estimate of the true model to estimate climate sensitivity and its error. Doing so gives me a best estimate of ECS as 2.2 C ([0.8,3.7] C is the 95% CI) and a best estimate of TCR as 1.0 C ([0.2,1.9] C is the 95% CI). To be fair, Cowtan and Way use sea surface temperature, so if this causes a 9% downward bias in temperature trends as people like Kyle Armour claim, then the best estimate of ECS is closer to 2.4 C. Treating climate models as imperfect predictors and using a linear combination of models allows one to try to correct for bias in the distribution of models.
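For what it’s worth, the linear-combination idea is easy to sketch. The following is only a toy version of the approach described above: the “model runs”, “observations” and ECS values are all synthetic stand-ins, not the actual CMIP data or published sensitivities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 5 "model" temperature series and one "observed" series.
years = np.arange(1851, 2013)
true_trend = 0.005  # degrees/year, made up for illustration
models = np.array([(true_trend * (0.7 + 0.6 * rng.random())) * (years - years[0])
                   + rng.normal(0, 0.1, years.size) for _ in range(5)])
obs = true_trend * (years - years[0]) + rng.normal(0, 0.1, years.size)

# Treat the observations as a linear combination of the model runs plus error,
# and solve for the combination weights by ordinary least squares.
weights, *_ = np.linalg.lstsq(models.T, obs, rcond=None)

# Hypothetical ECS values for each toy model; the fitted weights then give
# the implied sensitivity of the combination.
ecs = np.array([2.1, 2.9, 3.4, 2.5, 4.0])
ecs_estimate = float(weights @ ecs)
print(round(ecs_estimate, 2))
```

A real version would of course need the actual model runs, an error model for the residuals, and a way to propagate uncertainty into the ECS estimate.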

-1,

Kudos.

I don’t really think anyone is seriously suggesting that there isn’t some kind of bias. We can’t probe all possible initial conditions and all possible parameters. We also know that there is a range of climate sensitivity within the model ensemble; they can’t all be right. The issue – as I see it – is that we’re not really in a position where we can definitively reject certain of the models and hence it is better to work with the ensemble than to assume something about the bias. Of course, there are also other avenues of investigation, such as perturbed physics ensembles, which are also valuable.

-1 kudos^2.

“I still suspect the climate model distribution is not unbiased (many reasons to believe this is the case), but the variance is so large it can’t be shown based on the observations we have.”

As I pointed out, the climate model distribution is essentially guaranteed to have some bias, so a test for exact unbiasedness is not a particularly meaningful exercise. Estimating the magnitude of the bias, that would be a different matter. However we can’t estimate this bias from a single realisation of the observed climate system, because the bias is the difference between the expected value of the modelled climate sensitivity and the expectation of the distribution of observed CS that we would get if we had an infinite number of parallel Earths to observe. The CS estimated from our one realisation is not that expected value.

I should add this is basically the problem with the “truth centered” interpretation of the ensemble, which suggests that the ensemble mean should converge to the observations as the size of the ensemble grows. The alternative interpretation is one of “statistical exchangeability”, which says that the observations can be treated as a random sample from the same distribution as the model runs (if we had an ensemble of parallel Earths we would have the best climate model we could make, and in that case our Earth would obviously be statistically exchangeable with the others). The truth centered interpretation is very hard to support, as it basically requires that the effects of internal climate variability (on the estimate of CS in this case) are precisely zero, which seems pretty untenable. Statistical exchangeability on the other hand is a basic assumption of Monte Carlo simulation, and seems pretty obvious. Assuming that the difference between the estimated CS from a single observed realisation and the ensemble mean is a measure of bias would be valid under the “truth centered” interpretation, but not under “statistical exchangeability”.
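The exchangeability point can be illustrated with a toy Monte Carlo sketch. The assumptions here are entirely made up (a linear forced response plus AR(1) noise standing in for internal variability): every run, including the “observed Earth”, is a draw from the same distribution, yet the single observed realisation still differs from the ensemble mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "perfect model": fixed forced trend plus AR(1) "internal variability".
# Each run is one draw from the same distribution the toy Earth is drawn from.
def one_realisation(n_years=150, trend=0.01, sigma=0.1, phi=0.6):
    noise = np.zeros(n_years)
    for t in range(1, n_years):
        noise[t] = phi * noise[t - 1] + rng.normal(0, sigma)
    return trend * np.arange(n_years) + noise

ensemble = np.array([one_realisation() for _ in range(200)])
earth = one_realisation()  # our single observed realisation

# The ensemble mean converges on the forced response, but the single observed
# realisation never matches it exactly: the gap is variability, not bias.
gap = earth - ensemble.mean(axis=0)
print(np.abs(gap).max() > 0)
```

Interpreting that gap as model bias would be the “truth centered” mistake: here the model is perfect by construction.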

Well the main issue with the possibly biased distribution is that it affects attribution calculations and economic analysis. So if there is no attempt to correct for it then this can significantly affect optimal mitigation policy, etc. It’s quite a concerning problem.

I suspect that most of any such bias is due to confirmation bias that started over 100 years ago with Arrhenius. He suspected a climate sensitivity of ~4C because he only looked at the water vapour feedback (in particular, he ignored the lapse rate); fast forward to the 70s and you have some scientists supposedly predicting the coming ice age due to overestimating the strength of aerosols. The tendency in the past has been to overestimate both climate sensitivity and the strength of aerosols, so this may cause a slight bias towards an overestimate of climate sensitivity and aerosol strength as people develop models, because they may either consciously or subconsciously want their models to conform to the current literature. And of course, since GHGs and aerosols are working in opposite directions over the historical period, it is quite possible to have a grossly biased set of climate models that appear plausible. The results of Bjorn Stevens however suggest that there is a fair amount of overestimation of the strength of aerosols.

Found another error in my code. The confidence intervals I wrote in the last post are quite a bit larger to the point of being useless. So forget what I wrote.

“However we can’t estimate this bias from a single realisation of the observed climate system because the bias is the difference”

I’m not quite sure I follow. Sure we can’t know the bias perfectly, but we can still get imperfect estimates.

I think I’ll give up on trying to estimate the bias by doing something similar to Annan and Hargreaves for now. Main issue is that most of the KNMI historic runs end in 2005 (so don’t take into account the recent slowdown), so I only have 5 model runs that go to 2012 and that I have sensitivity values for.

“The alternative interpretation being one of “statistical exchangeability” which says that the observations can be treated as a random sample from the same distribution of model runs. The truth centered interpretations is very hard to support as it basically requires that the effects of internal climate variability (on the estimate of CS in this case) are precisely zero, which seems pretty untenable.”

I’m not sure I follow. If the distribution of model runs does not include variation due to natural variability, sure. But if the distribution of model runs does include variation due to natural variability, then can’t observations be treated as a random sample from the same distribution?

“I’m not sure I follow.”

It is a shame that you wouldn’t engage constructively with the parallel Earths thought experiment upthread; the statistical exchangeability was what the thought experiment was about. It is ironic that you talk about confirmation bias and yet are so resistant to being corrected (it happens, which is good, many can’t do that at all, but you make it hard work).

“But if the distribution of model runs does include variation due to natural variability, then can’t observations be treated as a random sample from the same distribution?”

yes, in which case you can’t determine the bias from a single observation as the observation is unlikely to be the expectation of the true (rather than modelled) distribution. That is the point!

Yes, but if the observations happen to lie near the boundary of the model range, it doesn’t necessarily imply a bias, given that the model range is intended to represent the range in which the observations could lie. Of course, there probably is bias, and maybe there are some tests that could illustrate this (I think models tuned to give outlier ECS values that are then tested against paleo-climate might be instructive), but I think Dikran’s point is that since we only have one realisation of reality, we can’t really use that to quantify the bias.

“the expectation of the distribution of observed CS”

But isn’t the realization of observations an unbiased estimate of the expectation of the distribution of observed CS?

“as the observation is unlikely”

Keyword is unlikely here. I agree that the estimate will have a fair amount of variation due to natural variability. But do you agree that this terrible estimate will at least be unbiased?

I’m under the impression that we agree but we just disagree over semantics.

-1,

No, I don’t think so; we can only draw a single realisation.

“But isn’t the realization of observations an unbiased estimate of the expectation of the distribution of observed CS?”

An estimate being unbiased does not mean that it is exactly correct; it means that if you repeated the experiment, substituting parallel Earths, you would get the correct answer on average, but that doesn’t mean any of the parallel Earths would give you that result.

As I suggested upthread, you need to understand the physics well enough to pose the question correctly before getting into the details of the statistics used to answer it.

“I’m under the impression that we agree but we just disagree over semantics.”

No, you still have at least one fundamental flaw in your understanding of the problem (see above).

I have a die in my pocket, but am not going to tell you whether it is a d4, a d6 or a d8. I roll it and get a three. Please tell me what the expected value of a roll of my die is.

Did rolling the die and observing the value give me an unbiased estimator of the expected value of the die roll? (Yes.)
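To make the point concrete, here is a quick numerical check (using a d6 purely for illustration): a single roll is an unbiased estimator of the expectation, because over many hypothetical repeats it averages to 3.5, even though no individual roll can ever equal 3.5.

```python
import numpy as np

rng = np.random.default_rng(1)

# A d6: the expected value of a roll is 3.5.
faces = np.arange(1, 7)

# Each of these is one hypothetical "single roll"; their average over many
# repeats approaches 3.5, but 3.5 is not a value any single roll can take.
single_rolls = rng.choice(faces, size=100_000)
print(single_rolls.mean())  # close to 3.5
print(3.5 in faces)         # False
```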

“An estimate being unbiased does not mean that it exactly correct”

Yes, we agree.

“I have a die in my pocket, but am not going to tell you whether it is a d4, a d6 or a d8. I roll it and get a three. Please tell me what the expected value of a roll of my die is.”

Let’s see if I get this analogy correct… The different type of dice represent different parallel earths. And the different parallel Earths represent different levels/trends of natural variability?

Now if you have no information on the probability of the number of sides of the die, you can’t answer the question. However, with respect to natural variability, the climate models themselves give a distribution of natural variability. Now if you did have a probability distribution over the number of sides of the die, then you could calculate the expected value of a roll given those probabilities. You could then test for possible bias in the die even if you do not know the number of sides, although a single observation isn’t enough. However, in the climate case, we have more than a single temperature measurement for observational data.

So I think for this analogy to work, you need both multiple dice rolls and a known probability distribution of the number of sides of the die.

“Let’s see if I get this analogy correct… The different type of dice represent different parallel earths”

No. I’m sorry, but again you are derailing the attempt to explain the flaw in your reasoning. It is much easier if you just answer the question and wait for the next step. I don’t have the energy or enthusiasm to address the stream of additional misunderstandings you keep introducing.

BTW who said it was an analogy?

… I did answer your question. I said you don’t have adequate information to estimate the expected value.

I’m guessing it’s an analogy because you probably want to relate it back to climate models in some way.

“So I think for this analogy to work, you need both multiple dice rolls and a known probability distribution of the number of sides of the die.”

Talk about missing the point. If it were an analogy, do we have multiple realisations of the climate (dice rolls) from which to estimate the real CS? No, THAT IS THE POINT!!!!!!

We have multiple data points, which we can use to estimate CS. You seem to be confusing multiple realizations with multiple data points. A single realization with a single data point is not the same thing as a single realization with multiple data points.

-1,

I think Dikran is using a single dice roll as equivalent to a single temperature time series. They are both single realisations of a system.

-1 wrote “… I did answer your question. I said you don’t have adequate information to estimate the expected value.”

yes, inside a Gish gallop of misunderstandings that I don’t have the energy or enthusiasm to explain.

“I’m guessing it’s an analogy because you probably want to relate it back to climate models in some way.”

you would make more progress if you did more listening and less anticipating. You are not a mind reader (apparently). I was using it to make a point about unbiasedness in the estimation of bias in a generalised setting. It could be used as an analogy, but look at all the extra effort you have put me to in trying to explain this to you. Ask yourself why you are making so much work for me, rather than just assuming that I may have a point and letting me explain it without digression (note I asked you not to digress in this way earlier).

-1 “We have multiple data points, which we can use to estimate CS.”

we have multiple data points FROM ONE REALISATION OF THE CLIMATE SYSTEM.

“You seem to be confusing multiple realizations with multiple data points.”

this is hilarious coming from someone that commented about confirmation bias just a few comments ago! ;o)

” They are both single realisations of a system.”

They are both realisations. But you can’t test for climate sensitivity with a single temperature data point; you need at least 2 regardless of what method you use. With respect to the die, a single roll does not allow you to test if it is fair, but multiple rolls can. Heck, even two rolls would allow you to test it to some extent, since the probability of getting the same number twice in a row is higher for an unfair die than for a fair die.

If you don’t think the number of dice rolls matters, then you should have no problem giving me a second roll.
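The two-rolls claim does check out numerically: the probability that two independent rolls match is the sum of the squared face probabilities, which is minimised when the die is fair, so any bias raises the repeat probability. A quick sketch with a made-up biased die:

```python
import numpy as np

# P(two independent rolls match) = sum of p_i^2, minimised by the uniform die.
fair = np.full(6, 1 / 6)
biased = np.array([0.3, 0.3, 0.1, 0.1, 0.1, 0.1])  # hypothetical biased die

p_repeat_fair = np.sum(fair ** 2)      # 1/6 ≈ 0.1667
p_repeat_biased = np.sum(biased ** 2)  # 0.22

print(p_repeat_biased > p_repeat_fair)  # True
```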

-1,

I think you’re being way too literal.

No, I think the point is that in the same way we can’t get a temperature time series from another Earth, you can’t have a second dice roll.

Yes, but you have more than a single data point in that time series.

This is what I mean by it being a bad analogy.

Or you just haven’t really bothered to try and understand what’s being illustrated?

Can you give an analogy where I have at least 2 data points to work with then?

-1,

I don’t think the form of the analogy/illustration is all that important. The point is really that you can develop models that potentially cover the full range of outputs from a particular system. If you have only one set of observations from that system then it becomes difficult to determine bias in the models. You might be able to reject models if the observations lie outside the model range, but that’s not really the same as quantifying the bias.

O.K., I’ll make it an analogy, if that will help -1 to get the idea, but it will be more complicated and hence more difficult to understand, but if -1 won’t co-operate with a straight forward thought experiment, what can I do.

Let’s assume that the temperature over the course of a century on the Earth can be represented as a linear function of time, with some added noise. The true slope of the line is denoted by mu; however we only get one time series of observations from which to estimate the slope, and these are corrupted by (autocorrelated) noise (representing internal climate variability), so we estimate a value mu’ from our multiple data points from a single realisation.

Now mu’ is not equal to mu (which we don’t know), although mu’ is an unbiased estimator of mu.

Now say we have N model runs and we want to determine the bias of the model estimates of CS, so we take the time series estimates from each model and estimate the slopes as mu”_1, m”_2, …, mu”_N, where the double prime indicates a model estimate.

The bias of the models is given by mu – mean(mu”), not mu’ – mean(mu”), but we don’t know mu, only mu’. So how different from mu can we expect mu’ to be? We don’t know, because we only have one realisation of the observed planet, only one mu’, so we don’t know the variance of the estimator.

“Can you give an analogy where I have at least 2 data points to work with then?”

I don’t know why I bothered. 😦

BTW the observational estimates of CS are not actually unbiased either, I may have pointed that out already.
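Dikran’s mu/mu’ setup can be simulated in a toy sketch (all numbers synthetic, with AR(1) noise standing in for internal variability). With access to many parallel realisations the spread of mu’ around mu is directly visible; with one realisation you get a single mu’ and no direct handle on that spread.

```python
import numpy as np

rng = np.random.default_rng(7)

def slope_estimate(n=100, mu=0.02, sigma=0.2, phi=0.7):
    """Fit a line to one noisy AR(1)-corrupted realisation; return mu'."""
    t = np.arange(n)
    noise = np.zeros(n)
    for i in range(1, n):
        noise[i] = phi * noise[i - 1] + rng.normal(0, sigma)
    return np.polyfit(t, mu * t + noise, 1)[0]

# With parallel Earths we could observe the distribution of mu' directly...
estimates = np.array([slope_estimate() for _ in range(1000)])
print(abs(estimates.mean() - 0.02) < 0.005)  # unbiased: mean(mu') ≈ mu
print(estimates.std() > 0)  # ...but any single mu' misses mu by an unknown amount
```

With only one draw from `estimates`, nothing in that single number tells you how far it is likely to sit from mu.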

Indeed, it is almost certain that there is some bias in the models and model mean. Pedantically true. But the question is whether the bias is large enough to matter. And, “matter” for what purpose? Broad questions about sensitivity? Regional planning on climate change? Etc. Your answer will vary depending on the question.

Climate scientists are working hard to fix known problems with the models. And working hard to nail down sensitivity better. And while climate science is very much a work in progress, there’s also quite a bit of evidence that sensitivity is likely to be high enough that we should think about reducing emissions.

That’s what really matters here. Where does the weight of the evidence lie?

“we want to determine the bias”

If by determine you mean know with 100% certainty, of course not.

If by determine you mean estimate, yes you can.

“mu – mean(mu”), not mu’ – mean(mu”)”

Yes, and mu’ – mean(mu) is an unbiased estimator of mu, which I’m sure you agree with.

“so we don’t know the variance of the estimator.”

Yes and no. No for your example, but with respect to the information from CMIP5 data, one can get an estimate for the variance of the estimator mu’. Since one can look at the variation caused by changing initial conditions while keeping the climate model the same and the forcing history the same to infer the variance of mu’.


Dikran,

I’m still trying to follow that thought experiment

If these parallel Earths only varied in internal variability, then wouldn’t any one Earth have a 95% chance of falling within the 95% interval of this “ensemble”?

I’d agree that this represents a problem for determining the bias. Okay, backing up. Say you randomly select one of these parallel Earths, and you find that its warming since 1850 falls at or above the 70th percentile of the Parallel-Earth Ensemble (PEE).

And say that scientists on that Earth also have a perfectly correct model, that produces the exact same result as the PEE. But they don’t know that their model is perfect. On the contrary, they’re trying to figure out if their (actually-perfect) model has a bias or not.

So they compare their perfect model ensemble (PME) against their observations. And because the observations run high, they say “hmmm, there’s a 70% chance that our models run too cold”.

But, we know that’s wrong. We know that their models are perfect, and we know that the discrepancy is caused by sheer random chance, from picking just one of the observations among all the ones that are possible.

Is that right? It makes sense to me — the scientists there are essentially using bad logic, in how they compare their observations with their model ensemble.

“there’s also quite a bit of evidence that sensitivity is likely to be high enough that we should think about reducing emissions.”

You need a lot more information than just climate sensitivity to properly justify mitigation.

No, I think the point is that we don’t have the PEE. We have a model ensemble and one selection from the PEE. If the models were perfect, then the distribution from the models would match the PEE distribution. However, we only have one sample from the PEE, therefore we can’t really tell if the models are biased, or not.

“If by determine you mean estimate, yes you can.”

I’m sorry, but this is pure bullshit. I can estimate CS as 42, but that doesn’t make it a useful estimate. To be a useful estimate you need to know the variance of the estimator so you know if the confidence interval on the bias includes zero.

“Yes, and mu’ – mean(mu) is an unbiased estimator of mu, which I’m sure you agree with.”

yes, and if you had taken the time to read the analogy you would find I said so explicitly. Yes, mu’ – mean(mu”) is an unbiased estimate of the bias, but to know whether the estimate is meaningful you need to know the VARIANCE (a word that you don’t seem to want to use) of the estimate. And we can’t estimate that VARIANCE properly from one realisation.

“Yes and no. No for your example, but with respect to the information from CMIP5 data, one can get an estimate for the variance of the estimator mu’.”

Yes, sure of course we can estimate the variance from the variance of the model runs, but if the models are not trustworthy because they are biased (i.e. the mean is wrong) why should we trust their estimate of the variance (a higher moment of the distribution)? Do you not see the circularity?

Right! That’s what I’m getting at. We’re like the scientists on that parallel Earth.

We don’t know if our model is biased or not. That’s what we want to find out. But it’s really hard to do from a statistical perspective, with the limited data set we have. (I mean, sure, if we had tens of thousands of years of high-quality climate data, then it’d be a different story.) We have to mostly focus on the physics instead.

The PEE is part of the thought experiment. It shows how even a perfect model could “fail” if we judge it by frequentist statistics against a single realization.

Ah, okay.

Of course 🙂

Yes, that is essentially the point.

“Yes, sure of course we can estimate the variance from the variance of the model runs, but if the models are not trustworthy because they are biased (i.e. the mean is wrong) why should we trust their estimate of the variance (a higher moment of the distribution)? Do you not see the circularity?”

The idea is to test for bias under the assumption that the estimate of the variance is good. If it fails the test, then that is an indication that the distribution of climate models isn’t very useful to do things like attribution or economic analysis. If it passes the test, it may still be terrible to do attribution and economic analysis with, but we won’t have any good evidence to believe that it is terrible.

Windchaser, I’ll try and answer your questions, but my energy is currently a bit depleted.

“If these parallel Earths only varied in internal variability, then wouldn’t any one Earth have a 95% chance of falling within the 95% interval of this “ensemble”?”

Yes, that is correct, which is why Gavin Schmidt’s test for a model-observation inconsistency (seeing if the observations lie in the spread of the models) is a sensible (and obvious) one, although you need to take into account that the spread of the models is almost certainly a substantial underestimate of the true uncertainty (VV had an excellent blog post on this a while back).

“I’d agree that this represents a problem for determining the bias. Okay, backing up. Say you randomly select one of these parallel Earths, and you find that its warming since 1850 falls at or above the 70th percentile of the Parallel-Earth Ensemble (PEE).

And say that scientists on that Earth also have a perfectly correct model, that produces the exact same result as the PEE. But they don’t know that their model is perfect. On the contrary, they’re trying to figure out if their (actually-perfect) model has a bias or not.

So they compare their perfect model ensemble (PME) against their observations. And because the observations run high, they say “hmmm, there’s a 70% chance that our models run too cold”.

But, we know that’s wrong. We know that their models are perfect, and we know that the discrepancy is caused by sheer random chance, from picking just one of the observations among all the ones that are possible.

Is that right? It makes sense to me — the scientists there are essentially using bad logic, in how they compare their observations with their model ensemble.”

yes, that is exactly correct. I think there is a natural tendency to think that the model mean is supposed to be directly an estimate of the observed climate, but it obviously isn’t. The observed climate change is a result of the forced response plus the effects of the unforced response (i.e. weather noise). They don’t actually combine linearly, but assuming they do makes it easier to understand. If we take a large number of model runs and take the average, the effects of the unforced response won’t be coherent across model runs, and so will largely average out to near zero, leaving the expected forced response. The observed climate though is still forced response plus unforced response, so it won’t match the ensemble mean unless the unforced response just happened by random chance to be almost zero. Of course, if we want to know the effects of GHGs, then it is the forced response that is of immediate interest.
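The averaging-out argument is easy to demonstrate with a toy sketch (a synthetic linear forced response plus independent Gaussian “weather” noise, assumed additive as in the simplification above):

```python
import numpy as np

rng = np.random.default_rng(3)

n_years = 120
forced = 0.01 * np.arange(n_years)  # synthetic forced response

def run():
    # each "model run" is the forced response plus incoherent unforced noise
    return forced + rng.normal(0, 0.15, n_years)

# The unforced part averages out across runs, so the ensemble mean converges
# on the forced response as the ensemble grows (RMS misfit shrinks with n).
for n in (5, 50, 500):
    ensemble_mean = np.mean([run() for _ in range(n)], axis=0)
    rms = np.sqrt(np.mean((ensemble_mean - forced) ** 2))
    print(n, round(float(rms), 4))

# A single "observed" realisation keeps its full unforced component,
# so it should not be expected to match the ensemble mean.
```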

I think you’ve explained it much more clearly than I did!

-1,

It sounds like you’re now saying what most others have said for more than a day. Assume the models produce a reasonable estimation of the range of warming. Now test the single set of observations against the model range. If it falls outside the range, it would be an indication of a problem with the models. If it falls within the model range, there may still be problems with the models, but we can’t really say from that test.

“The idea is to test for bias under the assumption that the estimate of the variance is good”

That is ridiculous. The mean response of the climate model largely depends on basic energy balance considerations; the variance requires careful modelling of internal climate variability (i.e. where the heat is distributed). Expecting the models to get the variance right but not the mean is unrealistic.

This is one of the problems with papers like Fyfe et al. If we detect a model-observation inconsistency, we don’t know whether it is because the models are biased or because they underestimate the variability. My initial intuition was that the underestimating the variance was more likely (actually read VV’s excellent blog post, which gives lots of justification for this being the case), however some climate modellers I have discussed this with suggests it is more likely to be a bit of both.

Yeah, it just took me a little while to actually figure it out. ^_^ Seriously, until about a half-hour ago, I didn’t get what VV meant when he talked about the observations being “biased” too high or low.

And it’s not that applying the statistics correctly gives us no knowledge about whether the models are biased. It just gives us rather little.

ATTP: “If it falls outside the range, it would be an indication of a problem with the models.”

and the problem isn’t necessarily bias.

That is a test for consistency, not bias, as I pointed out here.

Windchaser “And it’s not that applying the statistics correctly gives us no knowledge about whether the models are biased. It just gives us rather little.”

Yes, if the observations are running cooler than the models then it is more likely that the models are biased warm than biased cold, but that doesn’t even mean there actually is a bias.

“however some climate modellers I have discussed this with suggests it is more likely to be a bit of both.”

IIRC, because they are not independent.

I think this makes sense. Although it’s simple to think of a long-term forced trend plus variability, this isn’t really correct.

Unfortunately I can’t remember the detailed explanation, but I suspect the climate modellers know more about this sort of thing than I do, so I’ll take their word for it on this occasion! ;o)

Well, we can add another layer to that.

Say that some scientists judge their models by how well they agree with observations of GMT, whether implicitly or explicitly. Like, say they discard some of the models that perform particularly badly by this measure, rather than simply judging their models by their physics. To continue the earlier thought experiment, if the observations are at the 70th percentile of the PEE, then scientists might trim away the models that are showing up at <20th percentile. And then the remainder might end up averaging at the 60th percentile.

So, with this pruning of the model ensemble, the models will *tend* towards agreeing with observations, even if the observations are “biased”. If the observations are high, this implicit model pruning will tend towards producing high-biased models, as compared against the PEE. If low, then low.

To the extent that this implicit model pruning occurs, it would actually mean that the models are biased in the *wrong* direction. If the models appear too high compared to observations, then they actually should have been *higher*.

And then you have to weigh that against the possibility of *actual* bias. Is there a greater chance that the models are actually biased high, or that, because of this implicit pruning, they’re actually biased low? Heh. Good luck with that.

Windchaser says: “Seriously, until about a half-hour ago, I didn’t get what VV meant when he talked about the observations being “biased” too high or low.”

I honestly thought this was a rather trivial question. 🙂 At least the concepts are; the statistical tests are not easy. There goes my Nobel prize for science communication. 😐

In the context of purely internal variability (noise) it would have been better not to talk about a “bias”, but simply of a difference, because the realisation that we live in is drawn from an unbiased sample as Dikran made clear above.

I was thinking of a bias due to inhomogeneities in the way temperature was measured. It is my job to remove such biases, so that is what I think of by default.

In case of natural variability it is also possible to talk about bias once you understand the reasons. It is expected that without global warming natural causes would have cooled the Earth a little over the last century.

Dikran, I have a new post on uncertainty and model spread:

http://variable-variability.blogspot.com/2016/08/Climate-models-ensembles-spread-confidence-interval-uncertainty.html

Sounds like you read my old one, which was only about this topic in the context of this “hiatus” thingy:

http://variable-variability.blogspot.com/2015/09/model-spread-is-not-uncertainty-nwp.html

@ Dikran –

“I think there is a natural tendency to think that the model mean is supposed to be directly an estimate of the observed climate, but it obviously isn’t.”

But it is an estimate, just not a very good one. If I have a factory that produces donuts and I want to know the average mass of the donuts, I could take a sample, calculate the mean, and use that as an estimate of the average… even if the sample size is 1.

“If we detect a model-observation inconsistency, we don’t know whether it is because the models are biased or because they underestimate the variability.”

My main issue here is that if the distribution of climate models is so terrible that you cannot meaningfully test for bias using empirical evidence, then why trust these climate models to make confidence intervals about climate sensitivity, perform attribution or perform economic analysis? But my impression is this means that the position of some people in these comments is a bit inconsistent.

Hypothetical conversation in an attempt to illustrate point:

Me: “Both instrumental and paleoclimate evidence suggests that climate sensitivity is in the lower half of the IPCC’s confidence interval, here is the evidence.”

Someone else: “But climate models suggest a higher climate sensitivity, in particular they suggest with high confidence that ECS > 2 C, therefore we can’t exclude high sensitivities.”

Me: “But there is a lot of reason to suspect that the climate model distribution is significantly biased, so those results are not reliable. How about we test for bias and try to make a correction to the distribution?”

Someone else: “The variance in climate model results due to natural variability is unreliable so we cannot make such a test.”

Me: “If climate model results are that unreliable, then how can we use them to counter the results of empirical evidence which suggest that climate sensitivity is not on the upper half of the IPCC’s confidence interval?”

So far, I haven’t seen someone properly address that if the climate model distribution isn’t reliable enough to test for bias in the distribution, then why should it be reliable for estimating climate sensitivity (let alone performing attribution or economic analysis)?

-1=e^iπ,

Seriously, how much centennial-scale analysis do *you* do every time you fill the tank with petrol? I can say with 100% certainty that if *we* weren’t burning the stuff, attribution studies would be moot. Capisce?

-1 wrote

“But it is an estimate, just not a very good one”

No, it isn’t. As I pointed out, it is an estimate of the forced response of the climate system. If you want to treat it as an estimate of the observed climate then you need (i) to give the caveat that we know it is unlikely to match the observations closely and (ii) to know the variance of the observations, so you know how uncertain the estimator is (i.e. credible interval), which you can’t do because you only have one realisation, and you are back to square one.

“If I have a factory that produces donuts and I want to know the average mass of the donuts, I could take a sample, calculate the mean, and use that as an estimate of the average… even if the sample size is 1.”

Yes, but if you only have a sample of 1 you can’t compute the variance, so you have no idea whether that is a good estimate of the mean or a real outlier. You can’t tell whether the estimator you have is biased because you don’t know the variance, so you don’t know the standard error of the mean.

“My main issue here is that if the distribution of climate models is so terrible that you cannot meaningfully test for bias using empirical evidence, …”

The reason you can’t reasonably test for bias is not a problem with the model distribution, but with the observations. How often do I need to point that out to you? We only have one realisation to work with, and if we use the model spread as an estimate instead then the argument is circular and it is no longer a meaningful test of BIAS, but you can still reasonably test for CONSISTENCY.

“But my impression is this means that the position of some people in these comments is a bit inconsistent.”

Yes, your position is inconsistent. Your argument can be characterised as follows: if we assume the variance of the model runs accurately reflects the true uncertainty then any inconsistency must be bias instead of the variance being too small. If you can’t see the inconsistency there, you need to think a bit more.

“My main issue here is that if the distribution of climate models is so terrible that you cannot meaningfully test for bias using empirical evidence, then why trust these climate models to make confidence intervals about climate sensitivity, …”

As has already been pointed out to you, and you have ignored yet again, we know of reasons to suppose that the spread is too small, and the IPCC downgrade their confidence to reflect that in their assessment of climate sensitivity. You appear not to be reading the replies to your posts.

“So far, I haven’t seen someone properly address that if the climate model distribution isn’t reliable enough to test for bias in the distribution, then why should it be reliable for estimating climate sensitivity (let alone performing attribution or economic analysis)?”

That is only because you have ignored the posts that point out how this problem is dealt with by the IPCC, and because there is a flaw in your understanding of the statistics that you are unwilling to accept. Given that you have accepted three such errors already on this thread, I don’t understand your reluctance to engage with this one.

VV cheers, I’ll take a look!

“to know the variance of the observations”

Which you can get from climate models under the null hypothesis.

“If we ASSUME the variance of the model runs accurately reflects the true uncertainty then any inconsistency must be bias instead of the variance being too small. If you can’t see the inconsistency there, you need to think a bit more.”

There is no inconsistency there. That’s the point in making an assumption.

“and the IPCC downgrade their confidence to reflect that in their assessment of climate sensitivity”

In many branches of science, usually if you have multiple measures of the same thing, you better constrain what you are trying to estimate. However, for climate sensitivity, the opposite seems to occur.

If climate models aren’t performing well / cannot well constrain sensitivity, then why not go with purely empirical confidence intervals? As it is, the poor performance is being used both to increase the median of the confidence interval and to increase the size of the interval. In practice, in economic applications both effects will tend to produce a mitigation bias, due to the increased average marginal effect of CO2 and the higher uncertainty. It could easily change the optimal tax on CO2 emissions from $15 per metric ton to $40 per metric ton, for example.

-1,

Come on. The point is that if you use the variance from the climate models to then test the climate models, your test is circular!

-1,

We do have multiple measures. The range is based on all the different lines of evidence.

Because climate sensitivity is defined in terms of a doubling of atmospheric CO2 (technically, it’s a model metric). If, by empirical, you mean observationally-based, then we haven’t actually doubled atmospheric CO2 nor reached equilibrium, so they’re not really fully empirical; they still require assumptions in order to estimate the climate sensitivity.

ATTP sez:

-1 can ponder the ~0.9C transient response so far to 400ppm CO2 and the implications for the plausible lower bound of (formally defined) TCR estimates.

“The point is that if you use the variance from the climate models to then test the climate models, your test is circular!”

If you are testing bias and not variance, then it’s not circular.

“We do have multiple measures. The range is based on all the different lines of evidence.”

I mean usually if I have say 2 estimates of X, say one is X is 2 +/- 2 (95% CI) and a second is X is 4 +/- 2 (95% CI) then if I combined the information I might end up with X is 3 +/- 1.4 (95% CI); the estimate is more constrained.
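Assuming the two estimates are independent, unbiased and Gaussian, the standard way to combine them is inverse-variance weighting, which reproduces the 3 +/- 1.4 figure above (a sketch; the 1.96-sigma interpretation of the CIs is an assumption):

```python
import numpy as np

# Two hypothetical estimates of the same quantity X, each with a 95% CI of +/- 2
means = np.array([2.0, 4.0])
half_widths = np.array([2.0, 2.0])    # 95% half-widths
sigmas = half_widths / 1.96           # convert to standard errors

# Inverse-variance weighting (assumes independent, unbiased estimates)
weights = 1.0 / sigmas**2
combined_mean = np.sum(weights * means) / np.sum(weights)
combined_sigma = np.sqrt(1.0 / np.sum(weights))
combined_half_width = 1.96 * combined_sigma   # = 2 / sqrt(2), about 1.4
```

The independence and unbiasedness assumptions are exactly what is in question for climate sensitivity estimates, which is why the combined interval does not simply shrink in that case.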

But in the case of climate sensitivity, the final CI is far less constrained than many of the individual estimates used to obtain it; it goes in the opposite direction. Now admittedly, many of those individual estimates underestimate uncertainty. But even so, one of the primary reasons for the increased uncertainty is that climate models are predicting climate sensitivity values that empirical evidence suggests are ridiculously unlikely.

“they still require assumptions in order to estimate the climate sensitivity.”

Assumptions that aren’t necessarily that unreasonable. If I estimate the acceleration due to gravity of earth, I will often neglect the gravitational effects of the moon, but that assumption is pretty reasonable.

“-1 can ponder the ~0.9C transient response so far to 400ppm CO2 and the implications for the plausible lower bound of (formally defined) TCR estimates.”

It is a lot more than just CO2 that caused that increase.

-1,

Maybe I’ll let Dikran respond in detail – if he can be bothered – but I think the point is that with a single observational realisation, we can’t test for bias.

I don’t think you’re right about the different estimates. A lot are much broader than our overall range.

Yes, because we can be pretty confident that in many circumstances it is negligible. To estimate CS from recent observations, you need to make assumptions that may not be reasonable.

-1

CO2 would appear to be the major forcing change, but yes, of course there are others. This is of course taken into account in attribution studies.

With just a single, short observational record, we cannot separate the two. We can’t separate the variance and the bias. You can’t test “bias and not variance”.

With more data, sure, we could. But we don’t have that data, as far as I can tell.

“With just a single, short observational record, we cannot separate the two.”

We can try, but yeah, there really isn’t enough data.

Now if we were to get some really good paleoclimate data (both temperature & forcing) we might be better able to test things. This is why things like the PAGES 2K reconstruction are so important.

-1 wrote “If you are testing bias and not variance, then it’s not circular.”

If there is an inconsistency between the observations and the model ensemble there can be three reasons: either the models are biased, or the model spread is narrower than it should be, or both. If you assume that the model spread is correct then you can’t conclude that the problem is bias, because the analysis assumes that any inconsistency is bias (as that is the only option left if you assume there is no error in the variance). Hence the argument is circular.
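One way to see the trap (a toy sketch with made-up numbers): if the ensemble is perfectly unbiased but its spread is too narrow, a test that assumes the spread is correct will still flag "bias" far more often than the nominal 5% rate:

```python
import numpy as np

rng = np.random.default_rng(0)

true_mean, true_sd = 0.0, 1.0   # "reality": completely unbiased, but variable
model_sd = 0.4                  # ensemble spread too narrow (no bias at all)

n_trials = 10_000
obs = rng.normal(true_mean, true_sd, n_trials)   # many hypothetical realisations

# Test each observation against the model spread, assuming that spread is correct
z = obs / model_sd
reject = np.abs(z) > 1.96

# The test "detects bias" far more often than the nominal 5% rate,
# even though the models here have zero bias by construction
false_bias_rate = reject.mean()
```

Under these numbers roughly 40% of realisations would be flagged, all of them misattributed to bias because the variance error was assumed away from the start.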

“If climate models aren’t performing well / cannot well constrain sensitivity, then why not go with pure empirical confidence intervals?”

This sentence suggests that -1 doesn’t really know what a confidence interval is. A credible interval yes, and ISTR there are several papers that attempt to construct such an interval from the various estimates of CS. I suspect that the CS from climate models (being somewhat in the middle of the range) don’t have that much effect on the tails.

“Assumptions that aren’t necessarily that unreasonable. If I estimate the acceleration due to gravity of earth, I will often neglect the gravitational effects of the moon, but that assumption is pretty reasonable.”

separating out the gravitational effects of the moon from that of the earth is rather more straightforward than separating the forced response of the climate from the unforced response (which is what needs to be done to estimate CS, which depends on the forced response only). Wishing the assumptions were reasonable by comparing them with a case where they are trivial doesn’t make them reasonable. As I pointed out upthread, observational estimates are likely to be biased (due to missing variable bias) anyway, it is curious that you don’t seem to care about that bias, only the bias of the GCMs.

dikranmarsupial says: “If there is an inconsistency between the observations and the model ensemble there can be three reasons, either the models are biased, or the model spread is narrower than it should be, or both”

Some more options: the observations are biased, the analysis is biased, the comparison is biased. Please, stop assuming that the observations are perfect. That is a sure way to make sure they do not become better.

That is also why we need many lines of evidence. As Physics showed above for the climate sensitivity.

https://i1.wp.com/static.skepticalscience.com/graphics/Climate_Sensitivity_500.jpg?zoom=2

VV absolutely. Perhaps another would be that the analyst is biased ;o)

VV’s blog post is very interesting BTW, well worth a read.

Yes, the analyst can also be biased. When I was young and inexperienced I was quite afraid of that. Now as an old geezer I no longer see that much room for it, as long as you stay honest. It would, for example, be very hard to make a relative homogenization algorithm that exaggerates the trend; that it is not a good algorithm would be revealed in the next benchmarking study.

Thanks.

Ideally the analyst should be biased … against themselves, the most important form of skepticism being self-skepticism. In the long run being biased is a career limiting move (as the results of your bias will eventually be exposed), which is why I suspect most highly successful scientists operate in a truth seeking manner and know that being self-skeptical (so they don’t let their judgement get clouded by their enthusiasms) is good for them. Some less successful scientists try to do that as well ;o)

“it is curious that you don’t seem to care about that bias, only the bias of the GCMs.”

I do care about that bias.

But it seems far easier to solve/correct for such missing variable bias for empirical estimates than it does to correct for a bias in the distribution of GCMs.

Like if someone thinks that there is a bias with an empirical estimate then there are ways to test and quantify it (example: Marvel et al.). With respect to the GCM distribution, it is much harder to do.

(in writing my “Kepler” comment I looked up an old post and found a ref to https://scienceofdoom.com/2014/07/11/models-on-and-off-the-catwalk-part-four-tuning-the-magic-behind-the-scenes/ which you might find interesting)

“But it seems far easier to solve/correct for such missing variable bias for empirical estimates than it does to correct for a bias in the distribution of GCMs.”

really? Please tell us exactly how.

BTW I notice that -1 has not admitted the circularity of his argument. If you assume from the outset that the model spread is reliable then you have decided a-priori that any discrepancy must be due to bias, not variance. Thus if you conclude that the problem is bias rather than model spread, the argument is circular as you began by assuming your conclusion.

WMC,

Thanks, that SoD post is – as usual for SoD – very good. Do you have views of your own about this issue? Seems to me that the suggestion of more transparency is a very good one; can only help. However, these are complex models in which having parameters that will need tuning is unavoidable; can obviously aim to make it clearer how they are tuned and also maybe develop procedures that are more robust (although I suspect you’ll never get everyone to agree on the optimal procedure). I also get the sense that some seem to judge these as if they should be engineering tools (being used to design something), rather than as scientific tools that are mainly used to understand how a complex system responds to changes.

“Like if someone thinks that there is a bias with an empirical estimate then there are ways to test and quantify it (example: Marvel et al.). With respect to the GCM distribution, it is much harder to do.”

Actually, no. The reason that bias is hard to quantify is that we only have one realisation of the observations, and that is as true of the empirical estimates of CS as it is for those from the GCMs.

Of course you can show that there are problems with the models that will result in bias, but that is not the same thing. For example, you could show that the GCMs didn’t handle aerosols correctly and underestimated their effects.

ATTP: what makes you think I ever *read* the SoD post 😕 But now I look I must have, since I commented on some of the details. I think it does a fair job of conveying the complexity of the issue.

The answer, I think, is that it is very detailed, and very model / centre specific. There’s also a lot of “implicit” tuning, in the sense that if something works out, you leave it alone; and if it doesn’t, you tweak it. It is also such a multi-layered process (even picking one thing I’m vaguely familiar with, like the sea ice) that you’d be hard pressed to go back afterwards and work out what all the tweaks even were (which is why the Hourdin recommendation for “better documentation” is rather naive; that’s the sort of recommendation anything like this always comes up with). The Curry post is useless, except for its recommendation to read the Hourdin paper.

Thanks. Your point about it not being easy to work out all the tweaks is interesting and does make sense. It’s certainly possible that one of the reasons for the supposed lack of transparency is that it is actually very difficult to produce a coherent explanation of the process, rather than any explicit attempt to hide what is done.

Almost goes without saying 🙂

@ Dikran – “really? Please tell us exactly how.”

Well, look at the Kyle Armour paper, for example; it discusses and quantifies all sorts of biases with the energy balance approach.

@ ATTP – “However, these are complex models in which having parameters that will need tuning is unavoidable; can obviously aim to make it clearer how they are tuned and also maybe develop procedures that are more robust”

Maybe my ignorance of GCMs is showing but, why not fit all of the tuned parameters simultaneously (as well as determine their uncertainty) using something like a Nelder-Mead or Gauss-Newton algorithm? Now you can’t really do this with high-end GCMs due to computational complexity, but for low- or medium-complexity GCMs this should be possible. Then the parameter tuning is extremely transparent, and one can determine uncertainty in climate sensitivity by performing Monte Carlo simulations over the range of model parameters. So maybe the focus should be on trying to create one really good medium-complexity GCM that is computationally fast enough that one can fit model parameters to observations, as opposed to creating many super-complex GCMs that take forever to run.
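As a sketch of the idea (the two-parameter "model", forcing series and observations below are all made up, nothing like a real GCM), simultaneous derivative-free tuning with Nelder-Mead might look like:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Toy "low-complexity model": T(t) = lam * F(t) + offset, two tunable parameters
years = np.arange(100)
forcing = 0.01 * years                                        # hypothetical forcing
obs = 0.5 * forcing + 0.1 + rng.normal(0, 0.02, years.size)   # synthetic "observations"

def model(params):
    lam, offset = params
    return lam * forcing + offset

def cost(params):
    # Sum-of-squares misfit between model output and observations
    return np.sum((model(params) - obs) ** 2)

# Tune both parameters at once with Nelder-Mead (no gradients needed,
# which matters when each model evaluation is an expensive simulation)
result = minimize(cost, x0=[1.0, 0.0], method="Nelder-Mead")
lam_hat, offset_hat = result.x
```

For a real model each cost evaluation is a full simulation, which is exactly why this only becomes feasible for models fast enough to run thousands of times.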

-1,

I had thought that what you suggested at the end of your comment was similar to what Michael Tobis suggested in this guest post. Seems I’d remembered wrong (the guest post is still worth reading). I do think, however, that MT did make a suggestion that there would be merit to investing in climate models that had intermediate complexity (if I can find where, I’ll post a link). I agree that this can be very useful. On the other hand, there are limited resources and people can disagree about how best to use what is available.

-1 the same sorts of exercises can be performed with GCMs, but it still isn’t bias in the strict statistical sense because we have no means of directly observing the forced response of the climate which is what the models actually estimate.

This comment from MT seems to be suggesting something similar to what -1 was suggesting.

“Then the parameter tuning is extremely transparent”

well actually not completely, as pointed out model tuning is also introduced by the fact that the model builders know what recent climate looks like and that will influence the direction that the development of the models take. This is a problem that crops up in my own field of research (machine learning – model selection/tuning is my main research topic) where the data-sets we use have been over-fitted to some (unknown) degree by the fact that machine learning researchers have developed algorithms to work well on them. You also see this in machine learning challenges (such as those run by Kaggle) where often entries that score highly on the leaderboard perform badly in the final evaluation because the operators have developed the solution so that it works well on the leaderboard dataset.

This sort of model tuning is also almost guaranteed to lead to over-fitting as we only have one realisation of historical climate. A better approach would be to use an ensemble over all plausible parameter values weighted by their posterior probability, but there are technical difficulties that make that a decidedly non-trivial exercise.
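A minimal sketch of that idea for a toy one-parameter model (the model, noise level and flat prior are all illustrative assumptions): instead of keeping only the single best-fitting parameter value, average over the plausible values weighted by their posterior probability:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy one-parameter model y = lam * x, flat prior over lam
x = np.linspace(0.0, 1.0, 50)
obs = 0.5 * x + rng.normal(0.0, 0.05, x.size)   # synthetic "observations"
sigma = 0.05                                    # assumed observational noise

lams = np.linspace(0.0, 1.0, 201)               # grid of plausible parameter values

# Gaussian log-likelihood of each parameter value given the observations
log_lik = np.array([-np.sum((lam * x - obs) ** 2) / (2.0 * sigma**2) for lam in lams])
weights = np.exp(log_lik - log_lik.max())
weights /= weights.sum()                        # posterior weights (flat prior)

# Posterior-weighted ensemble instead of a single tuned optimum
posterior_mean_lam = np.sum(weights * lams)
prediction = posterior_mean_lam * x
```

For a real GCM the grid evaluation is the hard part (each point is an expensive simulation, and the parameter space is high-dimensional), which is one of the technical difficulties alluded to above.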

My maxim is “In statistics optimisation (tuning) is the root of all (well most, anyway) evil”, and I think it has some relevance here as well.

Yes, MT’s comment is a good one, although personally I would just average over the “pretty darn good” region, rather than optimise.

@ ATTP – Thanks for the links. Yeah, what MT is suggesting makes a lot of sense.

@ Dikran – With respect to the problem of overfitting, couldn’t this be resolved by choosing simpler GCMs with a fewer number of parameters that need to be optimized?

The other advantage of using lower complexity models is that, if they are computationally fast enough to optimize then they can be incorporated directly into integrated assessment models, which should make the conclusions of integrated assessment models more reasonable.

-1,

My suspicion is that they won’t be quite that fast for quite some time. Also, it’s not obvious what the benefit would be unless you tried to couple them in some other way.

“@ Dikran – With respect to the problem of overfitting, couldn’t this be resolved by choosing simpler GCMs with a fewer number of parameters that need to be optimized?”

In which case we are leaving out physics that the modellers consider important, which means they will be more biased*. We only have one realisation of the historical record, and it is not clear how many parameters we can tune without introducing a non-trivial degree of over-fitting. Having only one realisation of the observations is a fundamental limitation, and we need to recognize that if we are to draw reliable conclusions.

* note over-fitting is often discussed as a “bias-variance trade-off”, limiting the complexity of the model reduces variance, but it increases bias, error is minimised by choosing a complexity that achieves the best compromise between these conflicting qualities. However it is ironic that you first complain that the models are biased and then suggest a solution to a problem that will increase the bias!

@ Dikran – I’m a bit confused by your post.

With respect to my comment, I was comparing low complexity GCMs and optimizing parameters using fit to observations with medium complexity GCMs and optimizing parameters using fit to observations. You can’t do such an approach with high complexity GCMs due to computation resources.

Yes obviously one has to look at the tradeoff.

However, then you claim ” However it is ironic that you first complain that the models are biased and then suggest a solution to a problem that will increase the bias!”.

Which is why I’m confused by what you are trying to say. My issue is that there is no reason to believe that the distribution of high-complexity GCMs will be significantly unbiased or have a reasonable variance, since the choice of such a distribution depends on the choice of parameters people arbitrarily pick. This is quite different from having an observationally constrained distribution of model parameters in a single model. Going from medium-complexity GCMs to low-complexity GCMs while optimizing parameters would increase bias, but going from high-complexity GCMs to low-complexity GCMs doesn’t necessarily increase bias, because we are talking about two different approaches, one observationally constrained and the other not observationally constrained.

-1,

I think Dikran’s point is that it is quite likely that a climate model of moderate complexity will almost certainly leave out some physics. Therefore it probably has a bias and so does not necessarily help you to resolve that issue.

-1, it sounds like the people working on making tuning explicit, the people who wrote the BAMS article, would like to go in the direction you suggest. In my blog post I explain why this is not as easy as you seem to think (there is more to it than randomly varying parameters over their distribution) and why that would still only solve the problem partially (because there are more reasons why the model ensemble spread is too narrow).

ATTP precisely. Bias essentially means that the model is systematically different from the reality that it is trying to model. The physics in the model is there because the modellers think it is necessary as it is one of the things that governs reality. If you leave it out, then the model is likely to be more systematically different than it was.

So you take a biased model and you tune it to death on the observations you have, then part of that tuning will act to compensate for the bias. That is not necessarily a good thing as (i) it means that the parameter values you get are not the true ones (as they have an offset that bodges a problem elsewhere in the model) and (ii) while it may reduce the discrepancy on the observations it may make the predictions of the model worse (because it gets the observations right, but for the wrong reason, so it doesn’t generalise).

Model selection (even in statistical models) is not straightforward, which is why people like me find it an interesting topic for research.

Pingback: Matt Ridley responds to Tim Palmer | …and Then There's Physics