Richard Tol’s fourth draft

So, Richard Tol has a fourth draft of his climate consensus paper. I don’t really want to say too much more about this as I’ve discussed his draft paper before (here and here). My basic view hasn’t really changed. Richard and I have also exchanged a few tweets about this work and they’ve been perfectly congenial exchanges.

I was, however, going to make a couple of general comments. In the paper Richard says

Consensus or near-consensus is not a scientific argument. Indeed, the heroes in the history of science are those who challenged the prevailing consensus and convincingly demonstrated that everyone else thought wrong.

A claim of consensus serves a political purpose, rather than a scientific one.

Yes, I don’t think that anyone is claiming that the existence of a consensus means that the science is settled. The motivation behind this work, as far as I understand it, is that some are claiming that no such consensus exists. Richard himself acknowledges that the basic results of this paper are probably correct. That there probably is good agreement, in the scientific literature, about the fundamentals of global warming. That is really all that this paper was trying to illustrate. It’s because this has become a political issue that such a paper may be necessary. In most fields, it wouldn’t be necessary.

Richard goes on to say

Others, however, are concerned about the standards of proof in climate research. They would emphasize the complexities of the climate system and highlight lack of rigour in peer-review, substandard statistical analysis, and unwillingness to share data. These people are unlikely to be convinced by Cook et al. It is well-known that most papers and most authors in the climate literature support the hypothesis of anthropogenic climate change. It does not matter whether the exact number is 90% or 99.9%. These people are concerned about the quality of the research. More papers does not mean better papers.

In fact, the paper by Cook et al. may strengthen the belief that all is not well in climate research. I argue below that data are hidden, that the conducted survey did not follow best practice, that there are signs of bias in the data, and that the sample is not representative. In sum, the conclusion of Cook et al. does not stand. It may well be right, but it does not stand.

So, basically, Richard appears to be saying that people don’t trust climate scientists for various reasons, and that the Cook et al. study adds to this concern because it is biased, lacks rigour and is not representative; hence, although maybe essentially correct, it does not stand.

Here’s where I think I have the biggest issue with what appears to be motivating Richard Tol. I’m not a climate scientist, but am an active researcher in the physical sciences. I will typically make my data available to those who ask, but I’m expected to do research, teach and help with the administration of my department. I don’t really have the resources to properly prepare all my codes and data for use by the general public. I have no strong objection to doing so. It’s really just a matter of time and money (although I will say that I would be slightly concerned about how many would actually be able to understand what the data was telling them and what the codes were really doing). Academic researchers don’t, typically, have hordes of support staff to help them with all the administrative work associated with making codes and data accessible to others.

Having become interested in global warming/climate science I am, however, quite impressed by what is available. I can access all sorts of datasets. I can access various online codes (MODTRAN for example). I’m sure that there are examples of those unwilling to share data or who are secretive, but – by and large – it all seems quite open to me. It seems quite likely that there are those who would like people to believe that there are fundamental problems with climate science. That there are issues with peer-review – there are, but it’s not unique to climate science. That there are some papers that aren’t very good – indeed, but again not unique to climate science. That some won’t share their data – I’m sure this is true, but it seems as though a remarkable amount is actually available. However, it doesn’t seem – to me – that there is any real evidence of a particular problem with climate science. What there is evidence of is people who want to believe – or to make others believe – that such problems do exist.

Essentially, this is what Richard appears to be doing. The Cook et al. paper is a study that appears to produce results that Richard doesn’t actually dispute. He, however, is going to publish a paper telling everyone that, despite this, the method is “flawed”, the results “unfounded”, and the authors “secretive and incompetent”. He’s then suggesting that this could be perceived by some as a metaphor for climate science in general and that, consequently, the Cook et al. study may do more harm than good. Well maybe, but that seems to be because Richard has chosen to tell people that there are problems with the study, not that there necessarily are any problems.

One might argue that if there were fundamental problems, then even if the results appear reasonable, the study would be flawed. In general I would agree. However, much of what Richard is highlighting either seems minor, reflects his opinion about how a study should be carried out, or concerns issues that might have no effect on the overall result. In fact, Richard’s paper makes claims about inconsistent ratings and biases, but it’s very unclear how he gets these results. He actually says

In the data provided, raters are not identified and time of rating is missing. I therefore cannot check for inconsistencies that may indicate fatigue. I nonetheless do so.

How did he do so, if he couldn’t do it? He concludes by saying

The reported data show signs of inconsistent rating, and a bias towards endorsement of the hypothesis of anthropogenic climate change. These concerns could easily be dismissed with the full data-set.

Okay, so it shows signs of inconsistencies, but this can’t be confirmed because the full data set isn’t available. Admittedly, the authors have not made it available; personally, I’m not sure I would either, given the tone of Richard’s earlier drafts.

It may even be that Richard has some valid points. However, simply writing a paper highlighting potential issues with another study is not the normal approach (at least not in my field). Typically, one would repeat part or all of a study to show how these issues influence the result. If Richard is concerned about this paper being a metaphor for general issues in the climate sciences (largely unfounded in my opinion), maybe he should be a little more careful about what he writes about this paper and make sure that what he claims is well-founded. In a sense, what Richard is doing is not consistent with my understanding of the “scientific method”. The “scientific method” works by collecting more data, doing more analysis, and refining methods and techniques. It’s a continual, evolutionary process. It, typically, doesn’t involve poking holes in other people’s work. If Richard really thinks this is important, he should do a study of his own to see how his results (obviously using the ideal methods, assumptions and strategy that would, clearly, be beyond reproach) differ from those obtained by Cook et al.

Okay, I will admit that my short post about Richard’s fourth draft has got a little lengthier than I intended. Oh well, can’t be helped. Comments, as usual, welcome.


377 Responses to Richard Tol’s fourth draft

  1. Seems that this draft was sent to ERL:

    ***

    > If Richard is concerned about this paper being a metaphor for general issues in the climate sciences (largely unfounded in my opinion) maybe he should be a little more careful about what he writes about this paper and makes sure that what he claims is well-founded.

    Richard has not changed much of his text between his versions. I see many problems with his commentary. I’ve only had the chance to tell him about half of them. There are still many comments left in my notes. These are not simple wordsmithing issues, as Richard tries to dismiss them.

    A pity, since Richard may be right to criticize the paper. But now I see why Richard wrote so much. He does not come across to me as a perfectionist.

    Let’s wish him the best of luck in getting published.

  2. Yes, I read quite a few of your tweeted comments. There may well be valid criticisms, but what Richard has written doesn’t seem all that well founded. Even as papers go, it doesn’t actually explain what is being done particularly well. I don’t even really understand most of the figures, and reading the text doesn’t help much. Maybe that reflects more poorly on me than on Richard, but I would probably expect more detail in the text than he has provided.

  3. Fragmeister says:

    Perhaps the point is to get rejected by the peer review process and claim martyrdom – “They let Cook et al through but won’t publish me. Conspiracy!”

  4. Skeptikal says:

    I’m not going to comment on Richard Tol’s work as I haven’t really looked at it. What I will comment on is where you say:

    The “scientific method” works by collecting more data, doing more analysis, and refining methods and techniques. It’s a continual, evolutionary process. It, typically, doesn’t involve poking holes in other people’s work.

    I would suggest that part of the “scientific method” is peer review, and this extends beyond the handful of people who review a paper prior to publication. It is the whole scientific community being able to look at your work and “poke holes in it” if they find problems. You’re never going to refine your methods and techniques unless problems with your methods and techniques are brought to your attention.

    In my opinion, the fear of having others poke holes in their work is a problem which plagues climate science in general. Climate scientists are so fearful of others poking holes in their work that they become secretive. This is to the detriment of climate science as information, which should be shared, is being withheld and faults in scientists’ work go undetected. This secrecy also makes the general public less confident in the science… and rightly so.

  5. Sou says:

    That’s silly and wrong, Skeptikal. Climate science and related fields must be the most transparent of all the sciences in the world today. There is no other science that provides so much data and code and also prepares regular reports on the state of the science. That’s a huge voluntary effort by scientists from all over the world. Climate scientists even run websites set up specifically to engage with the general public. Sure, there are blogs and popular magazines discussing other sciences, but nothing like what is available for climate science, or as accessible to the general public.

    And for all their plaintive cries for data and code – now that there is more data and code around than ever before, tell me which of those ‘skeptics’ have ever done anything with it?

    As for Richard Tol, why stop him? He’s shown that he’s highly motivated to do whatever it is he thinks he’s doing and I for one won’t be trying to save him from himself. (He wants to study fatigue patterns, that’s got to be the funniest thing I’ve heard in ages. Yeah, I shouldn’t have even written that but couldn’t resist. It’s hilarious!)

  6. Sou says:

    I wouldn’t mind betting the “fatigue” notion came from Richard Tol’s colleague at the Global Warming Denial Foundation (GWPF) (have I got the initials right?), Benny Peiser. Isn’t he a phys ed trainer or some such thing?

  7. Skeptikal, I have to say that I agree with Sou. As far as I can tell, climate science is under such scrutiny that it is – most probably – the least likely science area to suffer from some fundamental problem. Although I agree with you that peer-review is part of the scientific process, the point I was trying to make is that normally one wouldn’t expect someone to hand over all their working so that others can scrutinise it in fine detail. What is expected is that a paper is written in such a way that someone else can replicate the work, and that, if the paper relies on data or computer codes that aren’t easily available or reproducible, these too are made available. Science progresses by many people being involved in the research, so that over time methods are indeed honed and improved, as is our understanding of the area in question. It doesn’t work by simply finding possible problems with someone else’s work. One also needs to show what influence these “possible problems” will have on the results and how the results might change if these “problems” are fixed.

  8. Possibly, and I can see it being a problem. He has a whole bunch of figures showing rolling standard deviations, but having read the text a number of times, I can’t really work out what these figures are illustrating. Maybe I’m just being dense, but one would hope a reviewer would – as a bare minimum – make him clarify much of what he is presenting.
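
    For what it’s worth, my best guess (and it is only a guess, since the text doesn’t spell it out) is that these figures show something like a rolling standard deviation of the endorsement ratings, taken in some assumed order. A minimal sketch of that kind of calculation, using entirely made-up ratings rather than the actual Cook et al. data, would look something like this:

    ```python
    import numpy as np
    import pandas as pd

    # Entirely made-up ratings on a 1-7 endorsement scale (4 = no position),
    # standing in for the real data, whose rating order is not public.
    rng = np.random.default_rng(0)
    ratings = pd.Series(rng.choice([2, 3, 4, 5], size=2000, p=[0.05, 0.25, 0.65, 0.05]))

    # Rolling standard deviation over a window of 100 consecutive ratings.
    # A drift in this curve is the sort of thing one might read as a change in
    # rater behaviour over time -- if the ratings really were in rating order.
    rolling_sd = ratings.rolling(window=100).std()
    print(rolling_sd.describe())
    ```

    Without knowing the ordering Richard assumed, or the window he used, it’s hard to say whether this is anything like what his figures actually show.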

  9. Tom Curtis says:

    Willard, I personally would consider it an insult to both have your comments on draft editions so completely ignored by Tol, and then to have him claim in acknowledgements that he has had “useful discussions” with you.

  10. In fairness, I might acknowledge someone with whom I’ve had discussions about some work, even if I didn’t actually take heed of – or disagreed with – much of what they said.

  11. Tom Curtis says:

    wuwtb, you ignore the fact that the climate “auditors” have redefined replicability to mean “hand over all data and code”. This is regardless of the cost of generating the data and code in the first place, any possible commercial considerations or (as is likely in this case) whether or not the authors have retained some data for further analysis in a subsequent paper. (In this case we know the authors are researching the results of a public access replication of their survey, reporting of which may well include much of the detailed data Tol and others have complained about not receiving.)

    What the various climate “auditors” conveniently forget is that:
    1) In the past, it has not been customary to report all data related to various studies, but only sufficient to establish the result and to allow replication. That has been in large part because, without computers and the internet, publication of all data was simply not feasible. Nevertheless it shows the demand for full revelation is not the standard procedure in science.

    2) In the past, as a courtesy, scientists have often handed over additional data on request. It is, however, a courtesy, not a right. Mistaking a courtesy for a right is, IMO, an obnoxious trait and typically results in the courtesy no longer being extended to the fools who do it.

    and most importantly
    3) Replication involves doing a parallel experiment, thereby producing your own, distinct data for analysis. This step, it appears, the climate “auditors” almost never get around to. I suspect that is because they know full well that a replication using their exalted standards would produce the same results as the studies they criticize – a result that would not be politically useful.

  12. Tom Curtis says:

    wuwtb, that would be appropriate if you had given serious consideration to the arguments of the other person. In this case we have Tol dismissing Dana’s information that only 5 of the 1000 endorsement level 4 abstracts in the subsidiary survey had a position on climate change on the basis that “Dana’s been saying all sorts of things”; then continuing to suggest that it is unclear whether there were 5 or 40, while acknowledging “useful discussions” with Dana. Clearly, while Tol has discussed Cook et al with Dana, he has been completely dismissive of Dana’s input. Therefore, Tol’s claim that the discussion was “useful” is a lie.

    I see no evidence that Tol has been any less dismissive of Willard, or you, or Halpern.

    Put another way, claiming that he had “useful discussions” with X is a way of saying that he took X’s points on board, even if he continued to disagree with them. It is also a way of letting his readers know that his expressed opinions in his comment are considered opinions formed after giving due weight to considerations raised by a variety of people. In fact, we know he did not take alternative points of view on board, and we know that the views expressed in his comment are not “considered opinions” but rather a biased and hostile attempt to find any flaw however irrational the evidence for the flaw is.

  13. Tom Curtis says:

    The draft submitted to ERL was not draft four, but draft five. I am not sure how they differ, if at all.

    One thing that is noteworthy about the differences between draft 5 and draft 3 is that a number of edits have been made to make the language more negative and critical without, in fact, any addition to the argument.

    In draft three, for example, Tol begins the paragraph discussing the subsidiary survey of 1000 endorsement level 4 papers by writing:

    “Cook et al. claim that 97% of abstracts endorse the hypothesis of anthropogenic climate. The available data, however, has 98%.”

    In draft five, however, those two sentences stand apart as a separate paragraph. By separating them from the discussion, Tol gives the appearance of an error where in fact none exists.

    When moving into the discussion of the subsidiary survey, Tol has not corrected any facts. He now ends the discussion by writing:

    “Data for the 4th rating are not available. The headline conclusion is not reproducible.”

    Of course, data for the fourth rating (ie, the subsidiary survey) is available. We know that 1000 abstracts were rated, and further, we know from the paper that 0.5% of those were rated 4b (Uncertain on AGW). That percentage has been confirmed publicly by Dana, a co-author of the paper. It has also been confirmed in private correspondence by John Cook. Given that, the headline result is easily reproducible. Presumably Tol means only that the raw data of the subsidiary survey is not available. Access to the raw data, however, is not necessary in order to reproduce a result.

    Perhaps Tol is claiming that access to that data is necessary to reproduce the headline result, based on his fiction that there is doubt as to whether 0.5% or 4% of abstracts rated in the subsidiary survey were rated 4b. As noted, however, any doubt on that basis that existed in the paper (and I maintain no such reasonable doubt existed) was put to rest by the public statement of an author. What is more, we know that Tol was aware of the statement. His failure to mention that statement in his discussion, therefore, constitutes scientific misconduct. He has concealed data of which he is aware, and which rebuts the position he argues in his paper.

    Further on, Tol’s claim that “A number of authors have come out to publicly state that their papers were rated wrong, but their number is too small for any firm conclusion” has been turned into the end of a paragraph and had the qualification dropped, thus giving it emphasis. At the same time he has identified himself as one of the authors who disagreed with the rating of his paper. He does not, however, note that he responded to the survey of authors so that his disagreement is already included in the overall statistics. Again, this is relevant information for assessing the significance of the seven authors who disagreed with the ratings, but it is excluded because it runs contrary to Tol’s narrative.

    In fact, given the nature of Tol’s comment, the most germane questions regarding these seven authors are: are they a representative sample, and what measures have been taken to ensure that they are? The answers, clearly, are no, and none. Given Tol’s critique of Cook et al, his mention of these cherry-picked examples is the rankest hypocrisy.

  14. Cook et al. is based on a survey. The lead author is based in a psychology department. The quality of the Cook paper should thus be held to the survey standard in psychology. This involves publication of the survey protocol, the data, and the paradata. Cook et al. have not done so.

    The Institute of Physics, the publisher of the journal, also demands that papers be documented such that the results can be checked. This is not the same as making all data available so that anyone can reproduce the paper.

    It is customary to test extensively for validity and consistency. Cook et al. do not. I ran what I could with the little data provided, and found signs of invalidity and inconsistency.

    For example: the data generating process implies homoskedasticity. The data are heteroskedastic. Why is that? The autocorrelation test suggests that someone fell asleep on the keyboard with her nose on the “4”.

    The skewness test suggests that the first fifth of the ratings are different than the rest, I guess because the ratings were being reinterpreted during the survey. These data should have been removed as they are pre-tests.

    The skewness test also indicates that strong endorsements are clustered in an inexplicable way.

    Papers were rated 2-4 times. Only one rating was published. Cook et al. hint at strong deviations between the ratings. Data quality is therefore low. How low? The reader is not told, and cannot check.

    Tests like these are typically run before you show your results at an internal seminar.

    A quick glance at my publication record would show that most of my research is the classic “let’s try and do things a bit better”.

    Climate research has been severely damaged by its low standards of research and conduct. Cook et al. is probably not the worst example. It did come to my attention, though, and I have more than a passing interest in surveys, meta-analysis, and scientometrics.

  15. Once again, thanks for commenting. I’m not a psychologist so don’t really know the expected standards for publishing details about surveys. Maybe you have a point. I’ll leave it at that, as I don’t really know how to expand on that.

    However, you’re critical of the details in the Cook et al. paper. You still insist that the data give you 98% while the authors claim 97%. This appears to be related to those 40 reclassified papers, for which you claim it still isn’t clear whether they are 40 out of 1000 or 40 out of 7970. You and I both know that one of the authors has clarified that this is 40 out of 7970. I don’t understand why you continue to insist that this isn’t known. I completely accept that it isn’t clear from the paper, but in my field clarification from the authors would seem acceptable.

    You’re critical of the details provided by the Cook et al. authors. You have a series of paragraphs on inconsistencies, fatigue and skewness, and a set of figures in your supplementary data. I have no idea how you produced these figures or what data you used. Maybe this is obvious to someone in your field, but I can’t work out what you’ve done here or what these results indicate. For a paper critical of the details in another paper, it seems very lacking in detail.

    You finish your above comment by saying

    Climate research has been severely damaged by its low standards of research and conduct.

    Really? I presume this is based on the climategate emails. I’ve read quite a few of these and I can find perfectly reasonable alternative explanations for many of the emails that are often touted as being examples of bad practice in climate science.

    It seems to me that what is really happening is that a bunch of people want others to believe that there is bad practice in climate science and have managed to find some questionable evidence to support this claim. As I mentioned in the post above, the whole concept of writing a paper to simply highlight errors in another paper, without actually carrying out any significant research yourself, seems odd and somewhat unprofessional to me – especially as the result of the original study isn’t really disputed. I can’t really see what’s motivating you to do this.

  16. Yes, as I mention in my comment to Richard below, I find the details in his own paper somewhat lacking. I have no real idea what he’s done to test for skewness, etc. There are a few paragraphs of text and some figures that I don’t understand. Might be me, but I would normally expect more detail.

    Fixating on these 40 re-rated papers also seems odd. That it was done is made clear in the paper. That the 40 refer to 40 out of 7970 has been clarified by the authors. Continuing to criticise Cook et al. for this seems unnecessary.

  17. You say you’re a physicist. Presumably you have some knowledge of experimental physics.

    Suppose someone ran an experiment. She finds that the results are inconclusive. So she runs 1/12 of a variant of the experiment. She submits a paper with data on 11/12 of the original experiment and 1/12 of the variant. Would that paper be accepted?

    I would think not.

  18. I’m not sure that I agree with the analogy. If someone did that without acknowledging it, indeed that would be poor. However, there may be scenarios where some retesting is necessary. As far as the Cook et al. paper is concerned, you have the original results (which give 98%) and you have the results from the re-analysis (which give 97%). In some sense, one could argue that the re-classification hasn’t made much difference. In a physics experiment, if you were concerned about some of your results and decided to retest and discovered that it made a 1% difference, you would probably present it as confirming that your experiment is robust. Maybe Cook et al. have expressed this in a manner that a physicist wouldn’t. I’m finding it difficult – as you can tell – to see the significance of these 40 re-classified papers.
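
    Just to put a rough number on what I mean by a small difference, here is a back-of-envelope, purely illustrative sketch of the statistical sampling uncertainty on a proportion like 97%, assuming (for illustration only) that roughly 4,000 abstracts expressed a position. It deliberately ignores the harder-to-quantify uncertainty coming from the subjectivity of the ratings themselves, which is presumably larger.

    ```python
    import math

    # Back-of-envelope standard error of a binomial proportion p estimated from n items.
    # p = 0.97 and n = 4000 are illustrative numbers, not the exact Cook et al. figures.
    def proportion_se(p, n):
        return math.sqrt(p * (1.0 - p) / n)

    p, n = 0.97, 4000
    se = proportion_se(p, n)
    print(f"97% +/- {100 * se:.2f} percentage points (statistical sampling error only)")
    # Prints roughly 0.27 percentage points; rating subjectivity would add to this.
    ```

    So the purely statistical part is small; the bigger, harder-to-pin-down uncertainty is in how consistently the abstracts are rated.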

  19. Tom Curtis says:

    Richard Tol, you just make things up as you go along.

    John Cook is not “based in a psychology department”. He is a member of the Global Change Institute at UQ, only one of whose researchers has a background in psychology. The director of the Global Change Institute reports to the Deputy Vice Chancellor Research. In contrast, the School of Psychology is part of the Faculty of Social and Behavioral Sciences, whose Executive Dean reports to the Senior Deputy Vice Chancellor. Organizationally, the only relationship between the School of Psychology and the Global Change Institute is that they both come under the purview of the CEO of the university (ie, the Vice Chancellor).

    I have commented elsewhere on your insistence that data ordered randomly with respect to order of rating can contain information about the effects of order of rating on the rating procedure; and above about your concealment in your submitted manuscript of the fact, of which you were perfectly aware, that it had been confirmed that only 0.5% of papers in the subsidiary survey were rated 4b.

    All I can add is that, given the quality of your comment, your purporting to discuss lack of merit in others’ research is the rankest hypocrisy.

  20. Richard Tol says:

    Ad hoc changes to survey results are not acceptable.

    97% is almost 98%? 3% is almost 2%?

    I hope you hold your own research to a higher standard.

  21. Richard Tol says:

    So, Tom, you think that John Cook lied about his affiliation? The ERL paper puts him at the School of Psychology of the University of Western Australia.

  22. Tom Curtis says:

    No, Richard. I think you make assumptions without knowledge, and allow them to dominate your analysis regardless of correcting information. In this case, Cook is not a member of staff at the School of Psychology at the University of Western Australia. His doctoral supervisor is, hence the listed affiliation.

  23. I wasn’t suggesting that 97% is almost 98%. I was suggesting that retesting a sample of your data and getting a consistent result – i.e., consistent within the uncertainties – can improve confidence in the result. I accept that the Cook et al. paper doesn’t quote uncertainties, but surely you’d agree that the result must have uncertainties of at least the percent level.

    I was a little put out by the final sentence of your above comment, but given that I’m anonymously critiquing your paper, it’s probably fair to have a bit of a go at me 🙂

    I have, however, been pondering our disagreements and wanted to run something past you. We don’t have to reach any kind of agreement though. Nothing wrong with disagreeing.

    I suspect we’ll agree on the following. In any experiment/survey it is important to be consistent. One can’t change the experimental procedure or survey strategy during the experiment/survey or it will compromise the data. Agreed?

    However, the analysis of experimental data/survey data is less rigid. There may well be multiple ways in which to do the analysis. There may be various tests for consistency. One may retest data that produced anomalous results. Of course it’s important to explain the analysis clearly in any publication but the rules are less rigid. The same set of raw data may well be analysed in different ways by different groups. Maybe you don’t agree with this, but this is my experience.

    This is where I suspect our disagreement comes from. In the Cook et al. survey, the raw data – in my view – are the abstracts. The survey is therefore the process of extracting these abstracts from the database. This is the part that has to be consistent – in my view. The rating of these abstracts is the analysis. The goal is to produce a rating for each abstract using the volunteers. Given that this is the analysis, sorting out discrepancies and retesting some of the results all seems perfectly normal to me.

    It seems to me, however, that you regard the rating of the abstracts as the survey, and the ratings as the raw data. I would argue that that is not correct. If the goal were to see how well a group of volunteers could rate a set of abstracts, I would agree with you. That, however, is not the goal. The goal is to rate (analyse) the abstracts (the raw data).

    Maybe we will still disagree, and there’s nothing wrong with that. The above is, however, why I’m less convinced by your claims of inconsistency and your concerns about the survey strategy.

  24. Richard Tol says:

    Agreed: You shouldn’t change the experiment/survey after you started.
    I disagree on your other point. The ratings are the raw data, rather than the abstracts. Your point is a bit like saying “I’m not giving you my data. Instead, I give you the blueprint. Go build your own Large Hadron Collider.”

  25. Richard Tol says:

    So John doesn’t have a PhD although he is a “post-doc”?
    I thought he had a PhD in psychology and is affiliated with a psychology department, and thus bound by the codes of practice in psychology.
    You argue he is a PhD candidate in psychology. He is thus bound by the codes of practice in psychology.

  26. I think this is where we are likely to just have to agree to disagree. You can’t have two sets of raw data – the abstracts and the ratings. As I said above, if the goal was to run a survey to analyse how a group of volunteers rated a set of abstracts, I would agree with you. That isn’t the goal. The volunteers are your data analysis tools who are acting to rate the abstracts. The goal is to end up with a rating for each abstract.

    Your LHC analogy is odd. The raw data would be the measurements from each collision. What we see in the publications is an ensemble of many collisions that tells us whether or not something – like the Higgs boson – exists. Producing the figures you see in the papers takes a great deal of analysis. That analysis could be repeated by someone using the same raw data, but with a different technique. I accept that it is not always cut and dried what one should regard as raw data and what should be regarded as analysis data, but in the Cook et al. survey I would argue that the abstracts are the raw data. You could take the same set of abstracts and analyse them in a different way (which in some sense is what you’re suggesting). That would be a valid way in which to test the basic results of the Cook et al. survey.

    I could stretch your analogy to: I’m not giving you the abstracts – go away and write 11,944 papers.

  27. Richard Tol says:

    The convention in the social sciences is that the survey is the instrument (the LHC) with which you try to measure something about reality (the abstracts and the elementary particles, respectively). The raw data follow from the application of the instrument to the sample.

    The final draft has links to the codes of practice of the relevant professional bodies. They are quite clear that providing the questionnaire and the sampling strategy are necessary but not sufficient.

  28. Exactly. This is one of the issues I have with many supposed skeptics. They seem to think that the scientific method is a bunch of scientists doing some research that they then pick holes in and consequently claim that it’s all wrong and that the scientists should think again. It’s as if they see themselves in a special place. That they have a role as “auditors” – as you call them. Science doesn’t really work like this and this lack of understanding simply underlines, in my opinion, their lack of scientific knowledge.

  29. Well, that almost seems like we’re agreeing. The survey is something “with which you try to measure something about reality”. I think I agree with that – at least in the sense that in the Cook et al. paper the “survey” is formally the extraction of the abstracts from the database.

    I was referring to this as the raw data, but maybe one could argue that some of the data from the analysis is also a form of raw data. However, what I was suggesting was that the part of the process that has to be rigorous and consistent is the extraction of the sample (i.e., the abstracts). It’s not as obvious to me that the analysis of these abstracts qualifies as being a survey in the sense that I understand it.

    If I was walking down the street with a sample of abstracts and stopping people to ask them to rate these abstracts, that would be a survey of people’s ability to rate abstracts. That would require a consistent survey strategy as the goal would be (presumably) to determine how well people can rate abstracts.

    If the goal is to rate a sample of abstracts, then it’s not clear to me that the rating process is – strictly speaking – analogous to a survey. It is simply a process by which the survey sample (the abstracts) is analysed.

  30. Tom Curtis says:

    People have become full professors before this based on real life experience rather than academic qualifications. It is unusual but no contradiction that they should become post-docs on the same basis.

    Further, I did not claim Cook was a doctoral candidate in psychology. I pointed out that his doctoral supervisor was a psychologist. I believe his thesis will be interdisciplinary, but am not sure. You, however, are the one making an argument without verifying your assumptions. If you want to make the argument, it is incumbent on you to find out the exact field of his thesis.

  31. Richard Tol says:

    Tom: Cook claims to be affiliated to a psychology department. I feel therefore entitled to judge him by the standards of that discipline.

  32. Richard Tol says:

    I think you get that all wrong.
    The object of study is elementary particles. The sample is the particles that hang around near Geneva. The instrument is the LHC. The raw data are whatever the LHC produces. The processed data go into the journal.
    The object of study is the literature of climate change. The sample is whatever the WoS produced. The instrument is the rating. The raw data are the ratings. The processed data go into the journal.

  33. Essentially I think you’re wrong (we can both believe we’re right I guess). In the case of the Cook et al. survey, the instrument is the WoS search engine, not the analysis of the abstracts. The survey sample is the abstracts. Collecting the sample is what one has to do in a consistent and rigorous manner. You can’t change the manner in which you collect the sample during the sampling. The survey – as far as I see it – is the search of the WoS database.

    The analysis of this sample is the next step. This is what is reported in the literature. The analysis process must be explained clearly. It must be repeatable. All the standard practices must be followed. However, analysing the sample is not a survey. It is analysis. Claiming that the Cook et al. paper is flawed because they don’t follow a rigid survey strategy when analysing their sample seems wrong to me. You can still criticise the analysis method (assuming there is something to criticise) but criticising it simply because it didn’t follow a rigid survey strategy seems wrong, given that it wasn’t a survey – it was an analysis of survey data.

  34. Marco says:

    And of course, Richard Tol will now point us to the “code of practice” or standards in psychology and the specific point(s) relevant to the literature survey we are discussing.

  35. Tom Curtis says:

    Well said Richard. Of course, the instrument, ie, the system of rating, in Cook et al is the full process of two independent ratings plus dispute resolution. Consequently the raw data is the output of that system, ie, a singular rating of classification and endorsement level for each abstract. And that raw data you have been provided. You are insisting that you should be given additional information about the first, second, third and fourth ratings. That, however, is like insisting on getting the data on power usage for each electromagnet at the LHC.

  36. Tom Curtis says:

    Feel free, but as yet you have not done so. You have merely alluded to those standards without quoting them or showing how Cook et al fail to apply those standards.

    I on the other hand will continue to feel free to judge you by the standards whereby intentionally concealing relevant information that runs counter to your hypothesis is considered scientific malpractice. I will also judge you by the standard that presenting data known to be irrelevant to your hypothesis as though it supported your hypothesis is also scientific malpractice.

  37. Richard Tol says:

    References to accepted standards of data availability and documentation are in footnotes 2, 3 and 4 of the submitted paper.
    Lyberg and Biemer argue that keystrokes should be recorded and made available, but I think that is a bit much unless it’s a behavioural study.

  38. Curtis, got to hand it to you. The distinct data points produced by two different rating events for each abstract are not raw data?

    wotts, please pause and consider Tom Curtis’ contention above. No one can do anything about the Cook team’s actual work. For what it is worth, they went through thousands of abstracts. No sane person is going to replicate the experiment point-to-point. Yet, you have a member of the team, fighting every inch. Curtis is not even one of the authors on the list, though I’d presume the authors might listen to him if he gave them his advice.

    And this is what he has to say? That the initial ratings and the reconciled second ratings are not ‘raw data’.

    Curtis, for your benefit: any rating, be it the first, the second, or the final dispute-resolved one, produces the same data type – i.e., a rating. In each instance, the process is the same – it produces a rating. So, it is not like looking at power usage for electromagnets at the LHC. That would be like asking what was going on in the heads of the raters when they actually rated each abstract.

    If you can accept the handing out of the dispute-resolved data points, you should accept handing out the pre-resolved ratings as well. The process that produced either, is the same.

  39. Richard Tol says:

    What Shub forgot to say is that I requested the additional data so as to test for validity and consistency. Why hide data that would show that the analysis is valid and consistent?

  40. 33% of the final output is post-processed data. It is not raw. If one mechanism in the study was to get raters to read a set of instructions and carry out rating on thousands of abstracts, the dispute resolution steps are a second, distinct mechanism. Both together produce the final output. Data generated by individual steps which act together to produce the final output are raw data.

  41. Tom Curtis says:

    Shub, based on Tol’s analogy, it is as appropriate to consider the entire rating process as the instrument as it is to consider each individual act of rating as the instrument. The analogy gives no reason to prefer one interpretation to the other, and therefore does not advance his case, or yours.

    If you want to go back to first principles and explain why handing over all data related to scientific papers has suddenly become mandatory, whereas for the last two hundred years of scientific practice it has only been a courtesy, by all means go ahead. Please be sure to explain just exactly which principle of free market economics it is that entitles you to free access to the fruit of other people’s labour without regard to their opinions on the matter.

    As it stands, Cook et al conducted research and reported the result. They reported their result in sufficient detail so that anybody else who wants to could replicate the research. That is the limit of what is mandatory in scientific research. All else beyond that is courtesy. That is especially the case given that your frankly pathetic excuse for why it is unrealistic to expect replication of Cook et al is that it is a lot of work.

  42. Tom Curtis says:

    Richard, you had previously demonstrated that you had misread the paper and not even understood its basic purpose. You had further demonstrated that you had hostile intent, and were extremely biased in your analysis. Given that, there was exactly zero reason to extend you any cooperation.

    And as I am sure you will misrepresent this argument, please be clear that it was not your intent to check details that justifies lack of cooperation, but your demonstrated malice and bias.

  43. Richard Tol says:

    John Locke would be proud.

  44. “If you want to go back to first principles and explain why handing over all data related to scientific papers has suddenly become mandatory”

    Of course not. It is not mandatory. But if regulatory orgs and journals require it of authors, it might become so.

    Saying that replication of Cook’s work is impossible is not ‘pathetic’. A better, more unbiased method would have to be implemented, instead of a half-dozen raters ploughing through thousands of abstracts. Based on my knowledge of next-generation sequencing techniques, I can think of at least one crowd-sourcing method that would roughly replicate the survey methodology, but would >not< require dispute-resolution of thousands of ratings and get better coverage. Even then, it would require prolonged surveying with brief volunteer participation from lot more volunteers. Cook's method overcame this issue with committed participation from motivated volunteers. It introduces biases. Cook holds the data to examine these biases.

    Intermediate data are the key to validation exercises, and you know it. The claims have been made in a pre-meditated, dramatic fashion. Correspondingly, the responsibility of the authors to provide data increases. How can you retreat behind 'courtesy' after appearing on national TV?

  45. Richard Tol says:

    If I submit a comment that argues that the Cook data are inconsistent and invalid, even though they are not, my reputation is in tatters. Why would I risk that? What better way to mock me than for Cook to release all his data and show that my assertions are baseless?

  46. Richard, I have to go with Tom on this one. I still maintain that the rating of the papers is part of the analysis (call the results raw data if you wish, but the ratings aren’t a survey). They’ve presumably invested a lot of time and effort into carrying out this work. They have presented the results in a peer-reviewed paper. Their survey/search strategy is clear (you can get the abstracts yourself if you wish). They, also, appear to be doing follow up work and so could, reasonably, expect to keep this data to themselves for additional analysis, rather than simply passing it on to anyone who asks. Clearly what they have done can be repeated, although – I accept – that it isn’t a trivial process (in terms of effort at least).

    If someone asks me for some of my work, I will often give it happily. However, I’m not sure that I would – or that I would be obliged to – if the person asking was someone who had made it fairly clear that they thought my work was flawed. My job isn’t to convince everyone. My job (as a scientist) is to do research and to report on that research and to do so in a manner that allows others to replicate what I’ve done. This would require making any raw data or codes that aren’t easily accessible available and to explain my procedures clearly and openly. It’s not, however, to simply provide all my working to whoever asks for it.

  47. Richard Tol says:

    Why would I give you my data if all you want to do is show me wrong?

  48. Yes, in a sense that’s what I’m suggesting. Maybe you think I’m characterising you unfairly, but it’s hard to see how else to interpret the drafts of your paper. You seem quite comfortable making quite strong statements about the authors and their work – “secretive”, “incompetent”, “flawed”, “unfounded”. Maybe you’re right. Maybe also, if you were given all the data you would find something more positive to say and might acknowledge what they’ve done that is worthy of credit (I’d be surprised if there was nothing, but maybe I’m wrong). However, I still maintain that there is no obligation to simply provide all working to anyone who asks. Given that, I wouldn’t be that keen to provide all my working to someone who appeared to already have made up their mind (your paper is quite definite).

    However, even if there is some obligation to provide all this working, it would seem reasonable to have some timescale over which the data is released. They may want to work on it further themselves. It may take time to put it into a suitable format and to provide any additional information. Maybe they’re busy. To be expecting this and to – to a certain extent – be basing your judgement on their lack of openness seems a little unreasonable given that their paper only appeared a couple of weeks ago.

  49. Tom Curtis says:

    Frankly, I think your determination to analyze data that you know has no bearing on the subject you purport to analyze will be more than enough to leave your reputation in tatters.

    Prior to this episode, I used to respect you. Now, I wouldn’t trust one of your analyses for love nor money. You have far too demonstrably been distorting the data you do have to fit a predetermined outcome.

  50. ” to free access to the fruit of other peoples labour”

    What the…? I don’t want to take Cook group’s data and claim it as my own or derive personal profit from it.

    Once a scientific finding is out there (i.e., published), the data in reality belong with it, i.e., out there. The common courtesy that researchers extend to each other is that they don’t ask for all pieces of data. But if asked, you are under some form of obligation to provide it. The claims based on the data were made public, weren’t they?

  51. Tom Curtis says:

    No! Why should I give you my data when you will only use it to generate talking points for climate change deniers while suppressing data that counters your preferred narrative? Your entire approach has been polemical and antagonistic from the start. You have clearly not applied to yourself standards you purport to require of Cook. And as I have previously noted, you have not even understood the basic purpose of Cook et al survey, or indeed, the meaning of relevant words when they have stood in the way of your polemic.

  52. I think this is where we will probably continue to disagree. I don’t think this is strictly correct. I’ve said this a number of times, so am starting to feel like a stuck record, but what needs to be available are the raw data and any codes that might be necessary but that aren’t easily accessible. You also need to explain your assumptions and methods clearly so that they can be replicated. Essentially, the results of the work are in the paper. If you are asking for everything, in a sense you’re asking to check that they presented their work properly, which seems somewhat insulting.

    There’s also a “scientific method” argument for why simply providing everything is not the right thing to do. The “scientific method” is a process through which we improve our understanding by continually collecting more data, developing new theories and models, and continuing to study the area in more detail. At times, it may seem that someone’s work is incorrect. The “scientific method” does not suggest that someone takes all of their work and checks it thoroughly. It suggests that someone else redoes the work to see what they get, and to see whether the other person’s results are credible or not. In a sense we want reproducibility, not checkability.

  53. Tom Curtis says:

    Shub, the claims were based on data that was made public. The search terms and database were made public. The list of the abstracts was made public along with the ratings for category and endorsement level. The only substantive data withheld was the self-rating data for each abstract, which Cook is ethically required to withhold to preserve anonymity.

    On that point, the mere fact that you can have an ethical obligation to withhold data demonstrates that you do not have a universal obligation to reveal it.

    But thanks for the display of idiocy in equating proprietary interest with commercial interest.

  54. toby52 says:

    Would he be proud of this?
    The autocorrelation test suggests that someone fell asleep on the keyboard with her nose on the “4”.

    Just about sums up your attitude. I read no farther, nor will I bother reading your publication.

    Though it might raise a laugh at your next dinner with Lord Lawson.

  55. tallbloke says:

    “Why should I make the data available to you, when your aim is to try and find something wrong with it?” – P Jones

  56. tallbloke says:

    The faux outrage bus is fired up and rolling.
    “I won’t even read your work because you sir, are a bounder and a cad”

  57. I see, is that the context? I missed that. I think how you interpret that depends on other factors. If someone assumes that immediately, that would seem unfortunate. If, however, the person asking for the data has already written something that appears to suggest that they’ve already made up their mind, then it may be a reasonable assumption to make.

    Maybe Richard did ask for all the data before writing his first draft. As far as I can tell, however, this may have happened almost immediately after the Cook et al. paper came out. It wouldn’t seem unreasonable to give them some time. Even if not, once you write something expressing, quite strongly, your view that the study is flawed, expecting the authors to then give you more of their work would seem to be asking a bit much.

  58. Tom
    Personal profit does not have to be money, does it?

    If you performed research that was wrong, and became famous because of it, you are virtually guaranteeing the fame of someone else who would be able to show that you were wrong. Unless you hold back the basis for your work and keep it hidden.

    All research is an attempt to reveal a hidden facet or fact of the real world. If you claim you found something new, lay outsiders may be impressed by what you found (or what you say you found). Your peers are more likely to be impressed by the validity of what you found.

    Cook has been provided with methods to release the authors’ self-ratings without compromising their identity. Just release a bunch of author ratings, the corresponding categories, and the volunteer ratings. No one’s interested in knowing who these scientists were. Only their ratings.

    Taking ethics lessons from Cook is like trying to learn the Scripture from the devil. This is a person who went around provoking people and collecting their responses to write a psychology profiling paper.

  59. > I can think of at least one crowd-sourcing method that would roughly replicate the survey methodology […]

    Citation needed.

  60. > I don’t want to take Cook group’s data and claim it as my own or derive personal profit from it.

    Define “profit”, then we’ll see how we can falsify that claim.

  61. wotts
    I would slightly reword that, and agree that there is no one 'scientific method' that encompasses all of what science is. Reproducibility is science. Replication is science too. If the former is pursued at the cost of the latter, science would be littered with isolated findings that are potentially wrong but can never be tested because the authors withheld their material. Replicability checks are important in that they might lead to the uncovering of methodological flaws, which might then mean the findings have to be abandoned.

    I wrote about it here: http://nigguraths.wordpress.com/2011/02/27/nature-code-availability/

  62. > Your peers are more likely to be impressed by the validity of what you found.

    And after all these man-hours invested in this hurly-burly, we have yet to see one attempt to provide a formal concept of validity that might help improve such future endeavours.

    Richard’s introduction applies to his own comment: it looks like a political hit job.

  63. I haven’t had a chance to read your post, but will do so. I agree with some of what I think you’re saying, and I guess my point is that the Cook et al. study is repeatable/replicable/reproducible (whatever term one wishes to use). Ideally, if there are concerns and if it is important, then it should be repeated by another group to see if the results stand. That’s why I find it hard to get worked up about whether or not they’re releasing all of their data. Firstly, it’s my view that they’re not obliged to. Secondly, they may still be working on it. Most studies have some kind of proprietary period. Reading Richard’s paper (and some of what you’ve written), it does come across as an attempt to discredit what they’ve done, rather than a genuine attempt to simply understand it better. Maybe I’m wrong and I apologise if I’m characterising you unfairly, but that is how it appears.

  64. > Replicability checks are important that they might lead to uncovering of methodologic flaws […]

    Reading too is important to uncover methodological flaws. And even if we grant Richard’s point about inter-judge reliability, it might not lead to where he claims.

    In fact, following his tweets of consciousness suffices to show that all Richard is looking for are talking points.

  65. > If, however, the person asking for the data has already written something that appears to suggest that they’ve already made up their mind, then it may be a reasonable assumption to make.

    Even if it is not reasonable, it is a natural one to make. In that case, we can surmise that the whole point of Richard’s provocations is to make sure that Cook & al won’t cooperate. Let Richard file a complaint and make his case to the relevant authorities, justifying his needs, something he has failed to do in his comment. If you think about it, asking for missing data goes against his conclusions, or at the very least shows that they might have been hasty.

    Instead of taking it upstairs, Richard might prefer brownie points from the auditing community. This might explain the recursive fury of it all. The recursion already started with the gentle bullying around Phil’s famous sentence.

  66. > The common courtesy that researchers […]

    Here you go:

    Common courtesy. Check.

  67. Marco says:

    WUWTB, do note the context. I cite here the whole e-mail:

    “I should warn you that some data we have we are not supposed to pass on to others. We can pass on the gridded data – which we do. Even if WMO agrees, I will still not pass on the data. We have 25 or so years invested in the work. Why should I make the data available to you, when your aim is to try and find something wrong with it. There is IPR to consider.

    You can get similar data from GHCN at NCDC. Australia isn’t restricted there. Several European countries are. Basically because, for example, France doesn’t want the French picking up data on France from Asheville. Meteo France wants to supply data to the French on France. Same story in most of the others.

    Cheers
    Phil”

    In other words: do your own homework, you lazy …

  68. Yes, interesting. Thanks. I hadn’t seen that before. As I’ve said in a number of places, there is a big difference between scientific work being repeatable/reproducible and simply handing over years’ worth of work to anyone who asks. Especially if it appears that what’s motivating them is a sense that you’ve got it all horribly wrong!

  69. willard
    There are sensible people everywhere, including in Cook’s own camp. Examine their latest post, by Ari Jokkimaki, the guy who has probably rated more papers in the Cook survey than anyone else. He’s attempting to examine the validity of the sample that survey organizer Cook pulled from the literature.

  70. Shub,

    As I said at Eli’s and elsewhere, I’m interested in knowing what would satisfy the contrarians. That includes knowing all the relevant tests they would deem enough to quench their thirst for science.

    If contrarians don’t present that before they start their audits, all we have is a bunch of ad hoc, semi-formal constructions.

    Considering that Richard’s main point of the moment is the ad hocness of the 98%, I’m sure you can get the irony of his recursive fury.

  71. Shub, I have indeed seen the post by Ari Jokkimaki. But doesn’t that then suggest that the team who worked on the Cook et al. paper should be given some time to work through the additional analysis? There is normally a proprietary period, and the request for access to all their data comes so soon after publication that it falls well within any reasonable proprietary period.

  72. “Why should I give you my data when you will only use it to generate talking points for climate change deniers while suppressing data that counters your preferred narrative?”

    If you give out your data, it might be shown that your methodology and calculations were wrong, which formed the basis of your talking points.

    Remember how John Cook poached public comments, given by individuals in good faith, to generate talking points that the commenters were conspirators.

  73. Except that the main point of the “scientific method” is that work that is wrong will, at some point, be discovered to be wrong, without the original authors needing to hand over all their data. That’s the basis of the method. Sure, it can take some time, but it is more robust – in my view – than people spending their time poring over other people’s workings.

    No, I don’t actually remember what John Cook did.

  74. > 97% is almost 98%?

    The authors might have used the most conservative number:

    Among abstracts that expressed a position on AGW, 97.1% endorsed the scientific consensus. Among scientists who expressed a position on AGW in their abstract, 98.4% endorsed the consensus.

    http://iopscience.iop.org/1748-9326/8/2/024024/article

    If Richard wishes to use 98%, fine. At least we now have a better idea of what he means by:

    There is no doubt in my mind that the literature on climate change overwhelmingly supports the hypothesis that climate change is caused by humans. I have little reason to doubt that this is indeed true and that the consensus is correct.

    https://docs.google.com/file/d/0Bz17rNCpfuDNOHRsMVZFYXdxR0k/edit?pli=1

  75. > Remember how John Cook poached public comments, given by individuals in good faith, to generate talking points that the commenters were conspirators.

    You must be new here, Wott. Shub is coatracking the episode surrounding this:

    Conspiracist ideation has been repeatedly implicated in the rejection of scientific propositions, although empirical evidence to date has been sparse. A recent study involving visitors to climate blogs found that conspiracist ideation was associated with the rejection of climate science and the rejection of other scientific propositions such as the link between lung cancer and smoking, and between HIV and AIDS (Lewandowsky, Oberauer, & Gignac, in press; LOG12 from here on). This article analyzes the response of the climate blogosphere to the publication of LOG12. We identify and trace the hypotheses that emerged in response to LOG12 and that questioned the validity of the paper’s conclusions. Using established criteria to identify conspiracist ideation, we show that many of the hypotheses exhibited conspiratorial content and counterfactual thinking. For example, whereas hypotheses were initially narrowly focused on LOG12, some ultimately grew in scope to include actors beyond the authors of LOG12, such as university executives, a media organization, and the Australian government. The overall pattern of the blogosphere’s response to LOG12 illustrates the possible role of conspiracist ideation in the rejection of science, although alternative scholarly interpretations may be advanced in the future

    http://www.frontiersin.org/Personality_Science_and_Individual_Differences/10.3389/fpsyg.2013.00073/full

  76. Okay, that does ring a bell now. I didn’t read much about it at the time though. I hadn’t yet become infuriated enough by what was written on WUWT to start my own blog at that stage 🙂

  77. > I’m not going to comment on Richard Tol’s work as I haven’t really looked at it.

    I did.

    Start here for the comments on his drafts:

    For now, it goes up to 86. I’m still on p. 3. Other notes are awaiting.

  78. Speaking of outrage, here’s my first tweet about this hurly burly:

    “Just asking questions”: never heard that one before?

  79. Speaking of counterfactual thinking:

    Goldilocks was a tough character to satisfy. By chance Wee Bear was there. Not that the story left Wee Bear happy.

  80. BBD says:

    Hello again willard

    Richard’s introduction applies to his own comment: it looks like a political hit job.

    Well, we know why that is, don’t we?

  81. Thanks. Finally!

    I think the keystrokes were introduced for perceptual studies, to inspect recorded response times that are in milliseconds or less. This could be useful here to discriminate raters who would have followed the Auditor’s advice:

    Note that I suggested that readers spend equivalent time to those who responded to Lew’s survey. If, for example, you don’t care about the quality of your answer or you are answering the question the same way – as some Lew respondents did -, it takes scarcely any time to fill out the survey. Indeed, if one were so inclined, one could submit multiple responses very quickly. If one were so inclined, HideMyAss.com enables IP address changes in the blink of an eye as well.

    http://climateaudit.org/2013/05/05/cooks-survey/#comment-417816

    Our emphasis.

    Perhaps that suggestion belongs to the doctrine of preemptive audit.

  82. “Except that the main point of the “scientific method” is that work that is wrong will, at some point, be discovered to be wrong, without the original authors needing to hand over all their data.”

    wotts
    You have a benign, idealized view of science. In all areas where science is intimately linked to public policy – environmental decision-making and public and community health, for instance – the standards ought to be higher. Can I do what the authors did? If I do what the authors did, do I get the same answer? Did the authors do what they say they did? These are all valid questions. You insist that ‘science’ is limited to only the first question.

  83. There’s no need to repeat your distinction between reproducibility and replicability, Shub.

    Repeating this distinction won’t justify why you presume it’s necessary for producing valid results.

    Richard is not even trying to reproduce the study with the raters’ data anyway. At best, he’s trying to analyze the inter-judge reliability; at worst, he’s trying to reveal the behaviours of the raters.

    Just try to pretend that you can hide the results from 12 raters. We know who most are. We know how many they rated. We even have “data” from private discussions.

    I would not play the ethics card if I were you.

  84. Why only the first question? What I’m suggesting applies also to the second question and, in a sense, to the third – although this might be indirect in the sense that if you do what they say they did and get the same answer, you can assume that they did what they said they did. I’m suggesting that the “scientific method” is a process through which multiple researchers converge on results that are consistent. Over time there is a general agreement about a particular area – maybe not that the science is absolutely settled, but that the fundamentals are well understood.

    You seem to want some kind of auditing system where people can check that researchers have done what they say they did and check that you get the same results from the same data. My argument is that the “scientific method” already does this, although it can sometimes be a slow process. The problem with what you suggest is: who does the auditing? How do you make it objective and unbiased? How do you ensure that people who really don’t understand the field don’t confuse things by claiming problems with something when there isn’t really a problem – or rather, there is a problem, but it’s with their understanding rather than with the original research?

  85. BBD says:

    And once again, the contrarians created a vast hullaballoo about *nothing* which nonetheless leaves the *impression* that there was some sort of problem. They can also reference back to the morass they have created as if it actually had substance (rather than concealed the lack of it).

    See Marcott et al. for a recent example of this strategy in action.

  86. Indeed. And it is a little ironic that one of the premises in Richard’s paper is that the behaviour of Cook et al. further “confirms” that climate scientists are “secretive”. That their strategy is “flawed” adds to the sense that climate scientists are “incompetent”. Given that the only place where this is being said is Richard’s own paper, it does seem like a somewhat circular argument.

  87. Tom Curtis,

    Richard did accept some of my recommendations. For instance, he seems to have added his own name to the list of persons concerned, as suggested here:

    There’s also the under/over sampling issue that got resolved (see below), at least in part.

    I think my criticisms were fair, but Richard is the best judge as to what he’s willing to own as a comment. They were offered pro bono. They are still there, for what it’s worth.

    Tweeting comments was fun. Writing them made me feel I was onto something. Besides, my online persona will soon be able to brag about having improved a real piece of science. For it was a piece of science, right?

    There is no need to get mad.

  88. I really can’t understand this fixation with Richard Tol and the usefulness of debating with him. Tol is the Lindzen of the 2010s because of his academic credentials and 4x positions. He is one of the most successful signings of the denial machine, or should be. But he doesn’t deserve the attention he gets.

    The trouble with Tol is that he is more difficult to debunk than, say, Lindzen or Michaels. He is not a “hard” scientist, someone you can argue with through a shared set of rules that you can show to be valid. Alas, Tol is an economist, and those niceties are much less demanding in his world – while he pretends they are at the top of his concerns and demands them of others. More precisely, he is what is usually termed a “neoclassical” economist (with David Colander’s permission), using the equations of the “general equilibrium model” borrowed from physics in the late nineteenth century.

    Of course he, like most economists, knows those “fundamental” equations don’t work, nor do the models based on them, which are unable, for example, to anticipate economic crashes. This pushes economists in general toward a more or less conscious sceptical view of any kind of mathematical model, a worldview they extend to models coming from the natural sciences. Even so, they are not uneasy when politicians make decisions based on them (i.e. tackling climate change is expensive! – provided the convenient, unnoticed assumptions are deemed neutral). But they ask for an impossible degree of certainty in climate models while offering none at all for their own.

    This applies to the economic side of the FUND Integrated Assessment Model (IAM), Tol’s creature. Here, the discount rate is one of the highest in the whole (disappointing) IAM family, and the cost functions used have been seriously challenged for being soft and not updated with the most recent knowledge.

    On the climate side, FUND takes climate sensitivity at a fixed value of 2.5 °C (no uncertainty), at the lower, less probable end of the range. The carbon-cycle feedback with the climate system is weak and too simple.

    It is no coincidence that he has stated that a little warming is a good thing, even if he has recently moderated this position. He has supported the Copenhagen Consensus Center, where climate change is absolutely not a priority, and has been Bjorn Lomborg’s reference economist. He states that the EU 20/20/20 goal does not survive a cost-benefit analysis.

    Tax carbon? Well, OK, I’m not a hard denier; my role in the machine is to be “reasonable”. But then I “demonstrate” that the tax should be very low, and in the end the result won’t make a real difference compared with normal price fluctuations.

    We may say Tol is a “soft” denier, which is the most dangerous group: they don’t seem to be deniers. They are more credible and respected by people who discount hard denial and look for non-“alarming” messages. Indeed, politicians pay attention to this kind of economist, not to natural scientists. Look at him closely and you will find the same rhetorical figures, the same memes, the same tactics, the same instruments, the same timing used by the denial machine as a whole. And, of course, the same ideology. One difference to his credit: Tol is much more intelligent and active than most of his peers in the cause.

    So everyone must decide whether debating with him is a pleasure, or whether it means playing his game and is thus a waste of time.

  89. wotts
    You know I provided a list of items that can only be checked if the authors release data they haven’t released to date. I have specific questions, not just a generalized request.

    Is there any justification beyond:

    “we won’t give you ’cause we don’t like you”,
    “if you want to, do a survey yourself”,
    “you want to find something wrong with it”,
    “you’ll generate denier talking points”
    “what I’ve given you is enough for you”

    for not releasing this data? Because these are childish excuses. One of the authors, dana1981, is giving a +1 to the “denier talking points” argument. The entire paper is a gigantic talking point.

    For instance, why not release individual volunteer ratings, after anonymising them? Why not release author self-ratings after de-identifying them?

  90. Tom Curtis says:

    Shub says: “For instance, why not release individual volunteer ratings, after anonymising them?”

    Because unethical people have made use of hacked data to publish the approximate number of abstracts rated by most of the abstract raters. Therefore “anonymising” abstract ratings will not confer anonymity.

    “Why not release author self-ratings after de-identifying them?”

    Because papers have a limited number of authors, releasing author self-ratings of papers violates the ethical requirement for anonymity. It follows that the only appropriate way to release author self-ratings is in tabular form, which has been done. (See tables 4, 5 and S2 in the original paper and supplementary information.)

    Let’s add another on Shub’s behalf:
    Why not release detailed cross comparisons of endorsement ratings between abstract and self-ratings?

    I admit I was disappointed not to see that. It is not obligatory for the authors to release that data, but the paper would have been better if they had. But it is known that the authors are undertaking additional research related to the paper, in which comparisons between original abstract ratings and ratings in a public survey are likely to be made. It is quite possible that detailed cross comparisons have been held back in order to be published in the paper related to that research (an entirely appropriate reason to hold back data). The same may also be true for detailed cross comparisons of initial abstract ratings and final abstract ratings.

    Are these the actual reasons data was held back? Don’t know and don’t care. The authors own the data that is being held back. They have copyright, and can publish or not publish at their own discretion. The only data release incumbent on them is that they should release sufficient data to allow replication of their result, which they have done. Any further requirement, if it exists, exists solely based on the contract between them and their publisher, and prima facie has been met based on the fact that they were published.

  91. willard says:

    > why not release individual volunteer ratings, after anonymising them?

    Again, Shub tries to pretend that one can hide the results from 12 raters when we know most of them, we know how many papers each rated, and we have “data” from their private discussions, some of which has been quoted by Richard.

    More to the point, the whole idea of testing for rating fatigue might be misguided. We are not testing peanut sorters. Perfect raters would be even more worrisome. There’s a nice #Goldilocks framework right there:

    – The raters were tired!

    – Could be. What about the authors themselves?

    – Sorry, my bad. I meant to say that they are too perfect!

    – Would you please tell me of your formal tests before making up your mind?

    – AUDIT ALL THE DATA!

    Et cetera.

    ***

    > One of the authors, dana1981, is giving a +1 to the “denier talking points” argument.

    Dana might have given a +1 on more than that. Here is a sample of what Tom said:

    (1) Nothing prevents you from suppressing data that counters your preferred narrative.

    (2) The entire approach has been polemical and antagonistic from the start.

    (3) You have clearly not applied to yourself standards you purport to require of Cook.

    (4) You have not even understood the basic purpose of Cook et al survey.

    (5) You have not even understood the meaning of relevant words when they have stood in the way of your polemic.

    None of these points can be dismissed with a simple “Yes, but the D word”. One simply does not claim victimhood after starting a discussion with “your data is a load of nonsense”.

  92. “Because unethical people have made use of hacked data to publish the approximate number of abstracts rated by most of the abstract raters. Therefore “anonymysing” abstract ratings will not confer anonymity.”

    This is not correct. Cook has handed the data that you say people obtained from the forum leak, in spreadsheet form, to Richard Tol. Secondly, authorship of the paper was gifted to volunteers. The Skepticalscience website states this directly. Thirdly, dana1981 volunteered the information that a ‘vast majority’ of the abstracts were rated by the paper’s authors.

    Enough data can and should be released to permit validation of any author’s work. If intermediary data would show a paper’s weaknesses, it is understandable that the authors would not be inclined to release such data, but they are not simply exempt. The mere fact that there was a 33% discrepancy rate shows that there was significant subjectivity involved in rating. This has a direct bearing on the paper’s results, as I have shown here. It is the subjective component of the rated papers that gives rise to the high number of consensus abstracts, and to the 97% result.

  93. @Tom C
    Cook can simply release an 11944×14 table with
    Abstract rating 1, 2, 3, 4
    Rating time 1, 2, 3, 4
    Rater ID 1, 2, 3, 4
    Paper rating

    Authors cannot be identified from that.

    Raters could be identified. This is a design flaw. Why are raters not anonymous to the authors of the paper? Normally, this alone would be a reason for rejection.

    I would happily sign a confidentiality agreement that says I can have the rater IDs on condition that I will not reveal their identities. That is easily done by publishing the results of the consistency and validity tests as p-values, suppressing the degrees of freedom.
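
    For concreteness, here is a minimal sketch of the flat-file layout being proposed above; the column names are purely illustrative and no real Cook et al. data is involved:

    import csv

    # Illustrative header for the table described above: per-pass abstract
    # ratings, rating times and anonymised rater IDs, plus the paper rating.
    columns = (
        [f"abstract_rating_{i}" for i in range(1, 5)]
        + [f"rating_time_{i}" for i in range(1, 5)]
        + [f"rater_id_{i}" for i in range(1, 5)]
        + ["paper_rating"]
    )

    with open("ratings_release_sketch.csv", "w", newline="") as f:
        csv.writer(f).writerow(columns)  # one row per rated abstract would follow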

  94. Tom Curtis says:

    Shub,
    1) Tol’s complaint is that he was not handed the data in spreadsheet or any other form;
    2) While some volunteers became coauthors and some received acknowledgement, they all had the option of retaining anonymity, and some took it. Publishing “individual volunteer ratings” would, together with the forum leak, remove that anonymity for some at least.

    Ergo your implicit claim that anonymity of raters has already been violated by Cook is false. It would, however, be violated by the release of individual rating data plus the unethical use of forum material.

    The 33% initial discrepancy in no way indicates subjective judgements were a major factor in rating. It merely shows that there are a number of borderline cases. It should be noted that the 33% discrepancy is not the discrepancy rate between endorsements, rejections and no positions. An abstract rated 3 by one rater and 2 by another would count towards the discrepancy, even though both rated the paper as an endorsement.

    Further, even those discrepancies that do make a difference between endorsement, rejection and no position will not necessarily change the result to any degree. I am currently, and gradually, rating abstracts to compare my assessments with those in the paper. Ideally, I intend to rate one to two thousand abstracts to allow a reasonable statistical comparison. At the moment, I am running at a 14% discrepancy rate, and of those discrepancies, in an equal number of cases I have rated as 3 an abstract rated as 4 in the paper, and vice versa. The net effect is that in only 2% of cases does the difference in rating actually change the result.

    Moreover, the linked blog post exemplifies your usual tactic of distortion and misdirection. Notably, despite all the graphs you produce, you do not graph the trend of endorsements as a percentage of abstracts taking a position. That percentage has a positive trend, and the trend is statistically significant, as shown by RomanM. The trend is positive regardless of whether or not you include the 0.5% of neutral papers among the papers that “take a position”. Cook et al report the percentages excluding endorsement level 4b ratings in the paper, which you misrepresent as a claim that endorsements as a percentage of all abstracts were increasing.

    Further, you claim Cook et al “identify ‘strengthening consensus’”, but in fact, the only use of the term ‘strengthening consensus’ in the paper is when they write, “For both self-ratings and our abstract ratings, the percentage of endorsements among papers expressing a position on AGW marginally increased over time, consistent with Bray (2010) in finding a strengthening consensus.” Given that Cook et al clearly identify the increasing percentage as the percentage of papers “taking a position”, your failure to calculate the trend of that percentage when criticizing them must count as deliberate evasion. Like Tol, you keep inconvenient facts out of sight.

    Finally, as I have noted elsewhere, the fact that papers endorsing the consensus increase relative to papers rejecting or uncertain about the consensus falsifies claims that the increasing proportion of neutral papers is because of increasing doubt about AGW. Indeed, it is strong corroborating evidence for the explanation of that fact given in Cook et al.

  95. @Tom
    Cook et al. do not allow replication. For instance, the released data have a 98% consensus. The paper has 97%.

    Cook should release an 11944×14 table with
    Rating 1, 2, 3, reconciled, 4
    Rater ID 1, 2, 3, 4
    Rating time 1, 2, 3, 4
    Paper rating

    Authors cannot be identified from that.

    Some raters can be identified. This indicates a major design flaw. Why are the raters not anonymous to the authors? Normally, this alone would lead to rejection.

    I would be happy to sign a confidentiality agreement that gives me the full data under the condition of not identifying the raters. That is readily implemented by showing the results of the validity and consistency tests as p-values, suppressing the degrees of freedom.

  96. @Tom
    See above on anonymity.

    The trend is a trend in composition rather than a trend in endorsement.

  97. Richard, I think you misunderstand the concept of replication. It does not mean “can I use their data to replicate their result?” It means “do I have enough information to redo the research (using whatever method I see fit) to see if I get the same result (to within the uncertainties)?” Clearly you can do this. You can get the abstracts and you could use some method to rate the abstracts according to whether or not they endorse AGW, or have no position. That is what is required.

    Also, what do you mean by “the released data has 98%… The paper has 97%”? Are you saying that the numbers in the released data do not match those in the paper?

  98. George Orwell would be proud: Replication does not mean that the results can be replicated.

  99. Richard, okay I see what you mean. The spreadsheet appears to be data before the 4s were re-rated. A little odd maybe, but you have a paper stating that 40 of the 4s were re-rated.

  100. Oh come on Richard. Do you really understand the scientific method? Okay, I’ve acknowledged that the spreadsheet appears to be prior to the re-rating of the 4s. Okay, so maybe they should have provided the final sheet. Do you really believe that they didn’t do the re-rating? That somehow their arithmetic is wrong?

    The point I was making – in case you really didn’t get it – was that from a scientific perspective, replication is not about checking another researcher’s arithmetic. It’s a process through which other researchers repeat some research so that over time it becomes clear (or not) that a particular set of results stand (or not). What you want to do is audit this work, not replicate it.

  101. The Defense of Cook descends to new levels of farce.
    Replication does not mean replication.
    Validity and consistency tests are optional.
    Data gathered by an employee of a public university funded by taxpayers’ money belongs to that employee.
    Data can be hidden if the data could be used to reveal low data quality or to support an opposite conclusion.

  102. Okay, I’ve kind of had enough. It seems like I might have hit a nerve, which makes me think that some of what I’ve said has merit. You, on the other hand, haven’t attempted to engage in a pleasant and reasonable manner. You’re completely misrepresenting the discussions we (and others) have had on this post. You’re welcome to keep commenting. I haven’t blocked anyone yet (as tempting as it may be), but I’m not going to waste my time engaging with someone who is clearly incapable of having a decent discussion about a reasonably complex topic. Good luck with the paper though.

  103. tallbloke says:

    The Dutch are well known for their directness and dislike of beating about the bush. Richard Tol is entitled to his honest opinion, and has produced the work to back it up so far as he can in the absence of the requested data, which *may* have better enabled him to prove the point.

    Also, it’s of note that Richard Tol is happy to extend an offer to Dana to help him improve his stats capability – a substantial commitment in terms of his time. It’s also noticeable that where he isn’t limited to a 140-character tweet, he fully explains the brief synopsis of his own opinion.

    So get off the outrage bus and assist science by encouraging Dana and John Cook to allow Tol to do a rigorous examination of the full dataset.

  104. Oh, so it’s called directness, is it – and it’s because he’s Dutch. I see. Well, if someone called my work a “load of nonsense” and then later asked to work with me, I know what my answer would be. Maybe John Cook and his team are more forgiving than me, though.

  105. The sequence of events gets mixed up.

    First, I alerted people to some negative reviews of the Cook paper.

    Second, Dana accused me of not reading the paper (I had), of denial (I have campaigned for carbon taxes for 20 years), of lies, and of misrepresentation (which is against the law in my country of residence).

    Third, I repeatedly asked Dana to explicate the alleged lies and misrepresentation. He has yet to do so.

    Fourth, I used moderately intemperate words.

  106. So, let’s see. You highlighted, in particular, some negative reviews of the Cook et al. paper (choosing, for some reason, not to highlight any positive reviews). Dana defended the paper – although maybe making some unfounded accusations. You asked him to retract these accusations. He hasn’t done so. You call his strategy “a load of nonsense”. So far, a fair representation?

    So, maybe neither of you comes out of this looking particularly good. However, the idea that the Cook et al. team should now give you their data so that you can check whether or not the strategy is “a load of nonsense” just seems absurd.

  107. Paul Matthews says:

    Farce indeed. I would say that the real farce is that someone claiming to be a serious scientist from a highly regarded university is prepared to devote so much time and energy to trying to pick holes in a brief comment by Richard, and defend a misleading piece of junk by some notorious activists.

  108. Fair enough. You’re entitled to your opinion. I have devoted somewhat too much time to this, so I’ll grant you that. Although, I would be quite keen to know why you think the Cook et al. study is misleading. Results are consistent with earlier work and not even Richard disputes that the level of consensus they get from their work is probably about right. Actually, I take that back. I’m not that keen to know what you think.

  109. Tom
    Your explanation above about anonymity is wildly hand-wavy. No one cares who did how many ratings; what matters for analysis is what any given rater did and how it was resolved. The authors all did thousands of them and that information has been released. The forum leak is immaterial.

    Examine what you wrote: “The 33% initial discrepancy in no way indicates subjective judgements were a major factor in rating. It merely shows that there are a number of borderline cases.”

    “The 33% initial discrepancy in no way indicates subjective judgements were a major factor in rating” = “It merely shows that there are a number of borderline cases.” Both statements convey the same issue. There are a lot of borderline cases because there are a lot of non-explicit papers which had to be forced into an ‘implicit’ category via the use of subjective judgement (which, by definition, is the mechanism for identifying implicit endorsement), and the high rate of discrepancy reflects this process. Figures 4 and 5 show exactly this.

    Subjective decision-making is the basis of the whole study. Tens of thousands of abstracts don’t declare themselves one way or the other.

    Cook et al perform only two or three calculations on the number of abstracts. Their figure 1(a) does show an increasing number of ‘endorse’ abstracts over time. Their discussion para [1] does draw the inference that a decreasing proportion of consensus abstracts means an increasing consensus. Regardless, there is no critical analysis of data in Cook et al, excepting discussion para 1. Space constraints might have been an issue, I don’t know. The paper is presented purely as an observational work.

    The trend in the proportion of explicit orthodox papers among all explicits can be calculated. It hovers largely between 95% and 100% through the years, and the linear trend is not statistically significant. Sure, the percentage of explicit orthodox position papers as a percentage of all explicits is high. In fact, it would have been best to present only the explicits. But that would change the paper itself. Secondly, the authors would then have to present this ratio: 986:25. Doesn’t look as good as Oreskes’ 928:0, right?

    wotts
    You are tempted to block people? Bye. I thought this was going well.

  110. Marco says:

    Richard, several of the authors are NOT employees of any publicly funded organisation. Doesn’t that mean you can only have some, not all of the data?

    Besides, John Cook is an employee of an Australian university, paid by Australian tax payer money. Why should you, as a person not paying any taxes in Australia, have any right to that data? You likely haven’t contributed one euro to it!

    See, if you want to use stupid arguments, I can make some, too…

  111. Tom Curtis says:

    Above I commented:

    “And as I am sure you will misrepresent this argument, please be clear that it was not your intent to check details that justifies lack of cooperation, but your demonstrated malice and bias.”

    (Emphasis added)

    And sure enough, along comes Richard to misrepresent the point I predicted would be misrepresented.

    Pathetic.

  112. I’m not really tempted to block anyone and haven’t done so. Maybe the last response to Richard was a little intemperate so – if it came across that way – I apologise. So far, exchanges between yourself and myself have been quite reasonable.

    I guess I am being openly critical of Richard’s paper, so him being openly critical of my post and comments is fair enough. Starting to tend towards ad hominems, though, is a little less easy to accept.

  113. Richard Tol says:

    @Tom
    I paraphrased your
    “Why should I give you my data when you will only use it to generate talking points for climate change deniers”
    as
    “Data can be hidden if the data could be used to support an opposite conclusion.”

  114. Tom Curtis says:

    Richard, I believe the term you are looking for is “misrepresent”, not “paraphrase”.

    If you want a paraphrase, try, “Why should I give you data if you’ll only use it for propaganda while suppressing data that runs counter to your preferred narrative?”.

    That is evidently your purpose, as shown by your out of context quotation of my statement, “Why should I give you my data when you will only use it to generate talking points for climate change deniers while suppressing data that counters your preferred narrative?”
    (Emphasis added)

    You couldn’t even bring yourself to use an ellipsis to mark the omitted text.

  115. “Why should I give you data if you’ll only use it for propaganda while suppressing data that runs counter to your preferred narrative?”

    That is a lame excuse not to give data. “You might do something else with it”.

    Do you believe John Cook’s team has presented all possible aspects of the data he collected, or do you think that he has presented some of it?

    More importantly, do you think Cook has presented aspects of the data that do not support the already-presented conclusions?

  116. Richard Tol says:

    I omitted the bit about suppressing data because that is not an option. I am not in control of the data. Any attempt by me to suppress data would be obvious.
    Cook, on the other hand, is in the position to suppress data and is exercising that option.

  117. I suspect we’re not going to reach some kind of agreement, but that’s fine. Let me present my views – again I guess.

    Do you believe John Cook’s team has presented all possible aspects of the data he collected, or do you think that he has presented some of it?

    No, they probably haven’t. Their publication is only a few weeks old (I think) and you yourself pointed out the paper coverage work by Ari Jokkimaki at Skeptical Science. They also seem to be running another form of consensus project that will presumably use some of the data from their earlier work. It’s not unreasonable for there to be a proprietary period during which the team who did the work get to continue working with the data before making it available to others.

    More importantly, do you think Cook has presented aspects of the data that do not support the already-presented conclusions?

    Tricky question to answer. Do I think there is a chance that the actual consensus in the literature (assuming we can agree on what we mean by the term consensus) is wildly different from what they present? No, I would be very surprised. Will there be some inconsistencies in their ratings? Possibly. Will these be important? I clearly don’t know (apart from the ones already discussed in their paper), but my guess would be that they’re not going to change the result. Are there some biases? Maybe. Should they give the data to you and to Richard Tol to check? I’m not so sure. It’s not clear to me that you or Richard can claim to be objective about this. It could be a never-ending circle. You find some discrepancy in their work. They find some discrepancy in your work. No agreement is ever reached. This is why the end of my post was about the “scientific method”. It resolves these issues by others doing similar work and seeing if a consistent result is obtained. Auditing each other’s work seems, in general, an unworkable way to proceed.

    On the other hand, if what you want to do is do some work on their data then convince them that it is worth doing and see if they’ll let you get involved. That’s the norm in my field and can often be quite a valuable way to proceed.

  118. Tom Curtis says:

    Richard, you are correct. Attempts by you to suppress data have been obvious – as, for example, when you raise doubt as to whether the survey of 1000 endorsement level 4 papers found 5 or 40 papers rated as endorsement level 4b, when you know it has been confirmed by a co-author that it was 5, and do not mention that fact.

    While it is conceivable that Cook et al have suppressed contrary data in that they have not released all data, and it is logically possible that some of the data they have not released contradicts the claim that 97% of papers endorsed the consensus, I know for a fact that you have suppressed data that counters your claims in a paper you have submitted for publication.

  119. Tom Curtis says:

    Actually, Shub, it is a very good reason for not releasing data. The release of data is a courtesy based on the assumption that the data will be analyzed in good faith. If you have good reason to doubt that it will be analyzed in good faith, you thereby have good reason to not extend the courtesy.

  120. Grant M says:

    I’ve just started reading through Tol’s critique. Anyone have an idea of how he gets his Chi-squared statistics on p. 4?

    However, it is clear from Table 5 in Cook et al. (1) that the subsample of abstracts that were also rated by the authors is not representative for the whole sample (X^2 = 22; p<0.001) and (2) that the paper ratings are different from the abstract ratings (X^2 = 5793; p<0.001).

    If I quickly calculate these stats myself, I get 1) X^2 = 18 and 2) X^2 = 684, respectively…


    (791-699)^2 / 699 + (1339-1429)^2 / 1429 + (12-14)^2 / 14 = 18

    (791-1342)^2 / 1342 + (1339-761)^2 / 761 + (12-39)^2 / 39 = 684

    (For both cases I assume that abstract ratings act as the “observed” values.)
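
    (For anyone who wants to check the arithmetic, a minimal Python sketch that reproduces the two sums above; the expected counts are simply the ones used in the calculation.)

    def chi_square(observed, expected):
        """Pearson chi-square statistic: sum over categories of (O - E)^2 / E."""
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    abstract_counts = [791, 1339, 12]  # endorse, no position, reject (observed)

    # (1) expected counts implied by the full-sample proportions
    print(chi_square(abstract_counts, [699, 1429, 14]))   # ~18

    # (2) expected counts taken from the author self-ratings
    print(chi_square(abstract_counts, [1342, 761, 39]))   # ~684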

  121. Richard Tol says:

    Tom: I am aware of Dana’s claim. He has said a few other things about the paper that are not true (e.g., that the Science Citation Index was used rather than the Web of Science), so I discounted this.

  122. > I used moderately intemperate words.

  123. Good question. I don’t know the answer. Maybe Richard can clarify.

  124. > Authors cannot be identified from that.

    Wanna bet, Richard?

    Here’s a hint:

    You’ll never guess where I got that reference.

  125. On second reading, it seems that by “authors”, Richard refers to the papers’ authors, not the raters, some of whom are also the authors of Cook & al, since he admits:

    > Some raters can be identified.

    This might indicate a “major” design flaw. (Cf. “moderately” intemperate words.) This also explains why Richard’s request might look unreasonable to the authors – all the more so since rater IDs and rating times won’t help him replicate the 97%.

    Richard has yet to justify his request in other terms than the usual auditing claptraps.

    Conflating replication with V&V does not help.

  126. Richard Tol says:

    I don’t understand where you get the numbers for your first calculation. You should compare the numbers that were rated by abstract only (Table 3) to the numbers rated by both abstract and paper (Table 5).
    The numbers in the second line are clear. The null is that the subsamples are from the same population. You should therefore first from the population totals, and then test for deviations from that.

  127. Richard Tol says:

    form the population totals

  128. Tom Curtis says:

    Shub,

    Suppose that you ask five people to mark an interval of 23.2 mm using a standard 300 mm rule. There will be appreciable variance in the lengths of the intervals marked; and those variations will depend on factors relating to the people marking the intervals ie, subjective factors. It does not follow from this that marking a 23.2 mm interval is a subjective activity, or involves subjective judgements. On the contrary, it is possible to measure the intervals marked and judge them objectively. Some will be more accurate, others less so – and that is an objective matter. In contrast, if it were indeed a matter of subjective judgement, then there would be no objective standard against which the judgements could be compared. Judgements could differ, but they could not be wrong (or right).

    That is not the situation with rating papers in Cook et al. I, and Tol, and I suspect you, certainly want to be able to say that some of the judgements made were wrong, or right. Ergo they are not subjective judgements, except in the trivial sense that they are judgements made by subjects. Trivial, because in that sense all measurements or judgements are subjective.

    In this instance, they are not subjective because there are rules of implicature which make the concept of “implicit endorsement” formalisable. In principle we could design an algorithm which would rate papers according to that rule. The raters did not in fact use such a formal algorithm, but they used an implicit algorithm, and took measures to ensure the implicit algorithms they used were the same.

    If we assume they were entirely successful in employing the same implicit algorithm, then there would still be differences in ratings, just as there would be differences in the intervals marked when attempting to mark 23.2 mm. That is because there are borderline cases, and in an example such as this they are tricky.

    Now, you appear to be trying to explain all discrepancies in terms of different implicit algorithms (subjective judgement) whereas it is probable that most are due simply to applying effectively the same implicit algorithm in difficult cases. Some will also, no doubt, be due to fatigue or distraction, or some other factor. That is, errors will have been made.

    If you wish to find a statistic that indicates the rate of discrepancy due to different implicit algorithms (ie, subjectivity of rating), it is probably the 16% discrepancy after the first rerating. In borderline cases raters would have been 50-50 on their assigned rating to begin with. Ergo it is likely that a discrepancy would be resolved by rerating. For different reasons, that is also likely in cases of outright error. But use of a different implicit algorithm is likely to result in an irresolvable dispute.

    Again, it should be noted that even with the 16%, many of those discrepancies would be within the overall category (endorsement, no position, rejection).

  129. The more I think about this, the more confused I am about these numbers. Maybe Richard could clarify. As far as I’m aware, all we have is the data from Cook et al. that tells us that 2142 papers were rated both by the volunteers and by the authors. Of these papers, the volunteers rated 791 as endorsing AGW, while the authors rated 1342 as endorsing AGW. The volunteers found 1339 that had no position, while the authors rated only 761 as having no position. For rejection there were 12 from the volunteers and 39 from the authors.

    Firstly, these aren’t quite the same comparisons, as the volunteers were rating only the abstracts, while the authors were rating the whole paper. There could be cases where the abstract isn’t clear but the paper itself is. However, I still don’t understand how you get Chi-squared from this data. You don’t know the results for individual papers. One could make some sensible assumptions. It seems unlikely that a paper rated as endorse by the volunteers would be rated differently by the authors. It also seems unlikely that a paper rated as reject by the volunteers would be rated differently by the authors (although this may not be as obvious, as the numbers are small and so it could be a completely different set). What seems clear is that it is much less likely that a paper rated as having no position by the volunteers would be rated as such by the authors.

    Therefore, it seems to me that when determining the consistency between the volunteers and the authors, one needs to be clear about what question is actually being asked. It would also seem to require individual-paper data, so it’s not clear to me that Richard’s numbers have any real meaning – although I’m happy to be corrected if I’m wrong.

  130. Tom Curtis says:

    Richard, I know you discounted that. You are not entitled to discount it without mentioning it. You are especially not entitled to discount it without attempting to verify the information with the lead author.

    The simple fact is that mentioning Dana’s comment would have made your attempt to generate doubt where there was none look absurd, so you neglected to mention it even though it was clearly relevant. That is scientific malpractice.

  131. Maybe I’m being dense, but can you actually explain the calculation in more detail?

  132. Tom Curtis says:

    Wotts, there are at least some known instances of papers rated as endorsing the consensus from the abstract that are rated as rejecting the consensus by the authors. Some of these are due to creative redefinition of the consensus position by the author, or incomprehension of what is meant by “endorse” by the author; but some represent genuine errors by the abstract raters. Of course, Richard does not have the statistics on this. Personally, I would have liked to see those statistics as well; but that is a different thing from the authors being obligated to present them.

  133. Grant M says:

    I’m replying from out in town, so I can’t give exact figures. However, I got my numbers for the first case by taking the expected values that are implied by the full sample (i.e. the percentage distribution of each category; “Endorse”, “No position”, etc) for the subsample. I then compare these expected values with those that we actually observed for the subsample.

  134. Richard Tol says:

    See http://www.sussex.ac.uk/Users/rt220/consensus.html
    Click “data and graphs”
    Sheet compare.
    Top calculations are for representativeness, bottom calculations are for similarity.

  135. Richard Tol says:

    Tom: Usefulness is in the eye of the beholder. Eli and Willard indicated passages that were over the top. Dana convinced me that it was worth my while to keep digging.

  136. It is not a matter of courtesy when it comes to scientific data. Of course, like the school kid who owns the ball, a scientist who has data may refuse to share it, and in the end there may well be nothing anyone can do to coerce him. But he is not entitled in any way to his data. It may sound outrageous, but it is so.

    John Cook may not trust Tol or me or anyone else to do any good with his data. But the only valid recourse he has, as a scientist (if he is one), is to defend himself in the scientific arena. If Tol’s calculations are invalid, then show them to be so. If you think my graphs have a problem, write about it. Do I not write about the Cook paper, even though, let me tell you, I hold work of this type in low regard? Many skeptical commenters may instinctively recoil from this type of publicity-oriented, premeditated publication, but I think it is not possible to offer criticism without nominally accepting the paper’s and authors’ premises and examining the details. Engagement is the only way, not secrecy.

  137. Okay, I’ve been through that and I have a genuinely serious comment that could be quite significant. You decide, though. The way you’ve calculated the Chi-squared seems to be as follows. I’ll consider only the endorse papers, as it is the same calculation for all three categories (endorse, no position, reject).

    For the endorse category you do the following. You consider initially the 2142 papers rated by both volunteers and authors and add the number rated as endorse by the volunteers to the number rated as endorse by the authors (791 + 1342 = 2133). You then divide this by the total number of papers in the survey (2133/11944 = 0.1785). You then use this to determine how many of the papers rated by both volunteers and authors should have been rated as endorse (0.1785 × 2142 = 383). Firstly, I have no idea why this is the right way to make the estimate. According to the full survey (and your own spreadsheet) the volunteers rated 32.6% of the papers as endorse. One would then expect 698 of the 2142 doubly rated sample to be endorsed by the volunteers. You then go on to calculate the Chi-squared from the difference between the number actually endorsed by the volunteers and the number predicted [(791 – 383)^2/383 = 436].

    This just seems completely wrong to me. Why would you add the papers endorsed by the volunteers to those endorsed by the authors? Presumably about half of them are the same papers. Why would the sum of these papers, divided by the total in the survey, give you an estimate of how many should have been endorsed by the volunteers? That doesn’t make sense to me. You also use exactly the same numbers to determine how many should have been endorsed by both (i.e., you get 383 for the single ratings and 383 for both). Hence you get a much bigger Chi-squared for the both sample than you do for the single.

    I don’t really know where to go with this, because the whole calculation just doesn’t make any sense to me. I’m happy to be corrected if you’re willing to do so, but I think that you’ve just made a mistake here. I will acknowledge that my statistics is not necessarily that strong, so feel free to clarify where I’m going wrong.
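
    To make this concrete, here is a minimal Python sketch of the calculation as I understand it, using only the numbers quoted above; treat it as an illustration of my reading of the spreadsheet rather than a reproduction of it.

    # Counts for the 2142 papers rated both by the volunteers (abstracts) and
    # by the papers' authors (self-ratings): endorse, no position, reject.
    volunteers = [791, 1339, 12]
    authors = [1342, 761, 39]
    n_sub, n_total = 2142, 11944

    # The calculation as I read it: add the two sets of counts, divide by the
    # full-sample total, then scale by the size of the doubly rated sample.
    fractions = [(v + a) / n_total for v, a in zip(volunteers, authors)]
    expected = [f * n_sub for f in fractions]
    print(round(sum(fractions), 3))                       # ~0.359, not 1
    print(round(sum(expected)))                           # ~768, not 2142
    print(round((791 - expected[0]) ** 2 / expected[0]))  # ~436, the quoted term

    # The expectation I would have used for the endorse category: the full-survey
    # proportion of endorse abstracts (32.6%) applied to the 2142 papers.
    print(round(0.326 * n_sub))                           # ~698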

  138. I’ve just read your response and I think what I’ve said below echoes what you’ve said. Richard seems to have done rather an odd calculation in which he adds those endorsed by the volunteers to those endorsed by the authors (and similarly for no position and reject) and then uses these numbers to make the estimates. I think this is wrong and that what you’ve done seems more correct, but I’m happy for Richard to correct my understanding if I’m wrong.

  139. wotts, re your post above

    Here is a bit of history. It may edify you, or you may already be aware of parts of the story:

    [1] LOG12 – a paper published by Cook’s mentor Lewandowsky, in which L claimed he surveyed skeptics at consensus blogs. L recently supported its method by stating that the survey was offered to the readers of Skepticalscience, which is Cook’s website. There is no record whatsoever that this was done. Cook has completely stonewalled all communication. The paper remains published.

    [2] Cook, along with L and Hubble Marriott, a skeptic-hostile Australian blogger, surreptitiously collected public comments *offered in good faith*, written on blogs in response to the non-release of survey methods in the previously mentioned study. Cook was a part of the conversation even as he collected data on members participating in it for research purposes. Scientists engaged in science communication, Judith Curry and Richard Betts, and seasoned bloggers Watts, McIntyre, Nova, et al, were all named as conspiratorial commenters in data collected by Cook. This paper has been temporarily withdrawn following complaints.

    [4] Cook, after his conversion to the climate orthodox side, began an extensive rewriting of his web articles, and retrospectively deleted, modified and/or censored comments on his own website, thus altering the public record extensively. This was carried out with no forewarning or notice to readers.

    [5] Cook orchestrated an extensive book review writing campaign for scientist Michael Mann’s book on Amazon.com

    [6] Cook now refuses to release data in the present paper.

    Cook’s record of good practice in science is weak.

  140. Okay, I don’t know how to respond to that. Thanks for letting me know; I will give it some thought. It’s an awful lot of accusations and could reflect poorly on John Cook – if true – but, apart from possibly illustrating his character, it’s not clear how relevant it is to the current discussion.

  141. It’s the test for equality of proportions. Karl Pearson designed it.

  142. Yes, I realise how it works. You need two distributions: one expected and one estimated – or whatever the correct terminology is. My claim above is that the calculation you use to determine the expected distribution of the doubly rated sample is wrong. Firstly, I don’t see why you should be adding the numbers together (why would you add the volunteer numbers to the author numbers when they’re not independent?). Once you’ve done this, you then determine the proportion by using these added numbers in the numerator with the total from the full sample in the denominator – hence your numbers F9 to F11 do not add to 1, which I think they should, and your predicted numbers (G9 to G11) don’t even add up to the correct total (it’s neither 2142 nor 4284). In my view your calculation is simply wrong. Telling me who designed the test isn’t really responding to my basic question, which is, I guess: please explain why you’ve done it the way you have, and why you think the predicted numbers you get are a correct representation of the predicted distribution.

  143. That a professor of mathematics uses ‘et tu, Wott’ may at least make climate ballers smile.

    That Richard changed his mind and invested far more time than he said he would, would have been good for Sound Science ™, had he taken the opportunity not to play Goldilocks.

  144. For a professional denier, the relevance of the arguments is of secondary importance. Remember, their goal is not to win an argument but to make an effect. The minimum effect is to cast doubt on any reader approaching the subject in good faith, thinking the conversation is honest and has the goal of clarifying something – to say “seeking the truth” is probably too demanding. The professional denier learned how to win arguments (and which arguments can never be won) at university. Later, in think tanks advised by PR companies, they learned how to make an effect. For example, they know very well that ad hominem arguments and hard words in a discussion numb the casual reader and make him disconnect from the subject. Victory.

    One of the many techniques is to always be the last speaker, for which Web 2.0 is very well suited. It is well known and documented that the last argument has the highest influence. A spin-off is that this forces the opponent never to stop arguing, so wasting his time until mental and physical exhaustion becomes a genuinely productive intermediate goal. Victory.

    To face Anthony Watts and Co. means being aware of these and many other tactics, having a deep understanding of the denial machine, knowing that you will never win the effect, being able to anticipate and counteract their moves, and being aware of the need to be supported by a network of emergency followers. I’m not sure John Cook is, but at first glance Wotts, with his insistence on argument relevance, looks ill-equipped for the task. It merits otherwise.

  145. “He has yet to do so.”

    We can add that to the long list of Tol’s misrepresentations of me and Cook et al. (2013).

  146. Dana, would you care to remind us? What did I write, before you accused me of lies, that was untrue and that I knew to be untrue? What did I write, before you accused me of misrepresentation, that was incorrect?

  147. The actual series of events:

    1) Tol misrepresented Cook et al. (2013), for example equating the abstract ratings with the author self-ratings. He re-Tweeted Tweets from deniers like Climate Depot with the same misrepresentations.

    2) I generously assumed Tol was misrepresenting our paper simply because he had not read it and was misinformed. He told me he had read it and repeated his misrepresentations of our work.

    3) I said “I didn’t have you pegged as a denier before”, meaning to point out that, both by disparaging our paper based on misrepresentations and by encouraging deniers to do the same (I hope nobody will dispute that Marc Morano is a denier), he was behaving like a denier.

    I have explained Tol’s misrepresentations of our paper several times. He obviously doesn’t agree with my explanations, but to claim I haven’t provided them – frankly that is a lie. He knows it’s not true.

  148. If Shub, Richard, or anyone else think they have a case, they can complain to IoS.

    Handwaving graphs ain’t a case yet.

    Easier to raise #concerns in the public arena than to provide constructive criticisms, it seems.

  149. [1] is mostly true.

    [2] isn’t very accurate. All Cook did was collect comments from ‘skeptic’ blogs espousing various conspiracy theories about LOG12.

    [4] is absurd. We update myth rebuttals as new relevant research is published. That’s all there is to this accusation.

    [5] is also absurd. Cook has suggested that people who have read Mann’s book should comment on it at Amazon, in part because so many ‘skeptics’ who have not read the book have given it bogus 1-star reviews.

    [6] is a lie. Our data are publicly available in the supplementary material and on SkS.

  150. Shub does a political hit job. Relevance has nothing to do with this. Nor does Sound Science ™, for that matter. The auditing sciences are word placement disciplines.

    Think of it as black hat marketing.

  151. Shub, but aren’t you essentially saying what some of us have been saying all the time? Use the “scientific arena”. If you think there is a problem with Cook et al., publish a paper that not only demonstrates what they’ve done wrong but how a correct approach would influence the results. You don’t need any more of their data for that. You just do your own work on the same or an equivalent sample. That’s the scientific method.

  152. For those whose memory needs refreshing, below is the entire conversation I had with Dana until he accused me of misrepresentation.

    At that point, I had raised concerns with sampling and data quality.

    Dana Nuccitelli ‏@dana1981 23 May
    @richardabetts @richardtol is behaving like one, RTing Marc Morano’s Climate Depot and misrepresenting our paper.

    Richard Tol ‏@RichardTol 23 May
    @Foxgoose interesting they apply the D word to me, one of the 1st to show the A in AGW, argued for carbon taxes for 20 yr @hro001 @dana1981

    Richard Tol ‏@RichardTol 23 May
    .@dana1981 Most importantly, consensus is not an argument.

    Richard Tol ‏@RichardTol 23 May
    .@dana1981 I published 118 neutral (in your parlance) papers. You missed 111. Of the 7 you assessed, you misclassified 4.

    Richard Tol ‏@RichardTol 23 May
    .@dana1981 I published 4 papers that show that humans are the main cause of global warming. You missed 1, and classified another as lukewarm

    Dana Nuccitelli ‏@dana1981 23 May
    @RichardTol Have to say I’m disappointed. Didn’t have you pegged as a denier before. Fine to dislike our paper, but don’t lie about it.

    Richard Tol ‏@RichardTol 23 May
    .@dana1981 Don’t worry. I did read your paper. A silly idea poorly implemented.

    Dana Nuccitelli ‏@dana1981 23 May
    @RichardTol You might want to actually read our paper before claiming it’s ‘coming apart’ based on ignorant and wrong claims.

    Richard Tol ‏@RichardTol 22 May
    Cooked survey (ctd) http://www.populartechnology.net/2013/05/97-study-falsely-classifies-scientists.html

    Richard Tol ‏@RichardTol 22 May
    Cook survey included 10 of my 122 eligible papers. 5/10 were rated incorrectly. 4/5 were rated as endorse rather than neutral.

    Richard Tol ‏@RichardTol 21 May
    The Cook paper comes further apart http://www.populartechnology.net/2013/05/97-study-falsely-classifies-scientists.html

  153. Dana, thanks for the clarification.

  154. Dana: I suggest that you consult a dictionary.

    The entire conversation will be reproduced below. I raised issues with your data quality, and with your sampling strategy.

    I wrote what I then thought was true. That is, I did not lie.

    I did not write a single word about what you did. I therefore cannot have misrepresented you.

  155. Don’t want to get stuck in the middle of a fight, but Richard, you’ve been very rigid about setting a survey strategy. The Cook et al. survey strategy was the Web of Science search “global warming” or “global climate change”. That search returns 10 of your papers. No one has claimed that it was a search of all possible papers published in the period. Yes, they could have chosen different search terms and got an expanded (or different) sample, but there isn’t really any evidence (at this stage) that the sample returned by their search wasn’t a reasonable sample to use for this study.

  156. I’m almost certain I’m ill-equipped for this. Trying to do my best though 🙂

  157. Wott must be new here. (An old Internet saying, Wott, wink wink.) But he’s learning fast. He’s being tested, right now, that’s all.

    Wott’s style reminds me of Bart’s, who knows some jujitsu. Nobody touches Bart anymore.

    As the Yi-King oftentimes says, no harm.

  158. Vintage May 19:

  159. Cook’s history rewriting is extensively documented with screenshots. No one knows why he did it, but he did.

    Cook administers a Lewandowsky-associated website called shapingtomorrowsworld.org. A substantial portion of the argument about LOG13 appeared at this website. Cook’s group members moderated the website comments. Complaints about censorship were moderated. Comments about this website’s conduct were harvested and entered as examples of ‘conspiracist thinking’ by Cook and Marriott. It is available in the paper’s supplementary data. See for yourself.

    Cook’s Mann book review astroturf campaign consisted of bulk-emailing pre-release review copies of the book to Skepticalscience readers and soliciting reviews. The reviews were posted *hours* after the book was released on Amazon.com. Following this, Cook’s group members monitored the Amazon.com review pages, arguing with anyone else who posted reviews. Again, documented.

    The specific items of data requested from Cook and co-authors are listed right on this page. Tweets have been sent out several times.

    Dana1981 rarely offers any form of argument, beyond calling things *lies*, *falsehoods* or some such simplified trope.

    These practices illustrate the unfortunately abysmal standards consensus practitioners adhere to. They feel justified in doing it in the climate blog wars; that is their prerogative. But these issues and such standards are being transplanted into the academic domain.

  160. “Lie” was perhaps the wrong choice of words. I assumed when you claimed to have read our paper, that meant you had understood it. My mistake.

  161. Thank you for taking back that I lied.

    Care to comment on the alleged misrepresentation?

  162. Sorry. I had overlooked that one.

  163. 1) I just explained why he revised myth rebuttals – because new relevant research was published. I do this all the time on SkS (we now have a date at the bottom documenting when a rebuttal was last revised). This is a really ridiculous thing to complain about. Would you prefer we leave the myths old and outdated? I’m sure you would, but that’s not going to happen.

    2) You’re complaining that they moderate complaints about moderation at STW? Is this a joke?

    3) Reviews of Mann’s book were posted hours after the book was released because we had early copies of it. Yes, some people monitored the page for deniers who gave bogus 1-star reviews without reading the book (there were dozens of them). And? Is this a criticism of Cook, or of deniers? And what does it have to do with Cook et al. (2013)?

  164. 4) I don’t use the term “lie” lightly. I rarely use it – only in cases where I know the person knows what he is saying is false. And you are lying. You know we released all of our abstract data. No, we didn’t release every bit of data, like what time we went to the bathroom while rating abstracts, as Tol would like, but we have released a whole lot of data.

    5) Engaging in a smear campaign against the paper’s lead author in order to discredit the results is kind of pathetic.

  165. @shubclimate If I replicate the query “global climate change” or “global warming” and “1991-2011”, I find 34,651 papers (rather than 11,944)

  166. I believe this is an issue related to whether or not one uses single quotes or double quotes.

  167. and I had overlooked that one too …

    still no misrepresentation, though, just a failed attempt to replicate (I believe by mixing up the Web of Science and the Web of Knowledge)

  168. Reich.Eschhaus says:

    Irrespective of whether the numbers are correct, I do wonder what they are supposed to be a criticism of. The paper self-ratings and the abstract ratings do not agree? Well, that’s acknowledged in the paper. Nothing new. The self-ratings are not representative of the whole sample? Why should they be? They were used as a check on the abstract ratings. Beyond that, it cannot be expected that a self-selected sample of authors happens to be representative. Check the supplementary information. Response rate rises with recency of the paper. Later years are overrepresented. Endorsement levels change over the years as well… Not representative –> no wonder.

  169. @Reich
    Agreed.

    Now re-read Cook.

    In the abstract:
    “Among abstracts expressing a position on AGW, 97.1% endorsed the consensus position that humans are causing global warming.” and “Among self-rated papers expressing a position on AGW, 97.2% endorsed the consensus.”

    In the conclusion:
    “Among papers expressing a position on AGW, an overwhelming percentage (97.2% based on self-ratings, 97.1% based on abstract ratings) endorses the scientific consensus on AGW.”

    They emphasize the similarity of the result.

  170. > still no misrepresentation […]

    Not so fast.

    First, trying to replicate a WoS search using Scopus because:

    you just can’t wait to return to the office to replicate Cook’s search using the proper tools does not bode very well for my inter-judge reliability score.

    It sure seems to misrepresent what replication means.

    (Also note the cameo appearance of a G search for “Scopus” in the conversation.)

    ***

    Second, a good candidate to try to understand why Dana spoke of misrepresentation would be this one:

    This tweet does seem to refer to two of your tweets already mentioned, which you seemed to endorse, while they were endorsing you, which may remind auditors of check kiting:

    http://neverendingaudit.tumblr.com/post/318623946

    Paying due diligence to what has been done there might explain Dana’s loss of temper.

    ***

    Third, another candidate for misrepresentation would be this one you mentioned, which contains a claim about your own papers that does not seem to hold after Tom Curtis paid due diligence to it:

    http://bybrisbanewaters.blogspot.ca/2013/05/tols-gaffe.html

    ***

    Fourth, Dana already answered your question:

    ***

    There’s also the tweet where you mentioned a lukewarm category (I still have no idea what you’re talking about), using Morano for your social networking, and, for good measure, an undersampling claim at the time.

    ***

    It’s tough to know which misrepresentation Dana had in mind. But that’s your own problem with Dana. What matters to me is how you came to the conclusion that 97% was “a load of nonsense”. Had you done any kind of analysis at the time? Can you provide evidence that you made an analysis that justifies this claim at the time?

    To that effect, it would be nice if you could send the file of this analysis, with all the relevant timestamps from your version control system. For you do use a versioning system, right? I know there’s one in G Drive.

    Playing the replication game cuts both ways.

  171. Willard: You may want to consult a dictionary too.

    Misrepresentation would be if I had claimed they used Scopus.

  172. Reich.Eschhaus says:

    “Agreed.”

    On what exactly? Do you agree that those Chi squares are not really a criticism of the paper? I ask because you – after that statement – come up with some other possible criticism, something to do with the presentation of the results:

    “Now re-read Cook.”

    Not really necessary 😉

    “In the abstract:
    “Among abstracts expressing a position on AGW, 97.1% endorsed the consensus position that humans are causing global warming.” and “Among self-rated papers expressing a position on AGW, 97.2% endorsed the consensus.””

    Also from the abstract: “We find that 66.4% of abstracts expressed no position on AGW” and “Compared to abstract ratings, a smaller percentage of self-rated papers expressed no position on AGW (35.5%)”. Thus mentioning a clear difference in self-ratings vs abstract ratings.

    “In the conclusion:
    “Among papers expressing a position on AGW, an overwhelming percentage (97.2% based on self-ratings, 97.1% based on abstract ratings) endorses the scientific consensus on AGW.”

    Well, they did find that, didn’t they? (Sample 1: 97.1%; Sample 2: 97.2%; of papers taking a position). In the Discussion, before the Conclusion, uncertainties in the data are discussed, along with comparisons with other studies. That sounds more relevant than the Conclusion section, which is all about public misperception of the scientific consensus, and the sentence you quote is the “take-home message”. I see that. Is there something wrong with that?

    “They emphasize the similarity of the result.”

    In my opinion not in the way you try to portray it, but that’s just me!

  173. Cook revised his ‘rebuttals’ and revised and/or altered comments that had previously been offered on threads. This is re-writing history. The practice is illustrative of scholarly standards, or of a poor understanding of them.

    I don’t think you understood the point about Cook’s STW involvement and its relation to the now suspended Frontiers paper. You don’t seem to be well-informed about these activities. I wouldn’t blame you.

    You’ve accepted my points about the review-writing campaign undertaken for Mann’s book. The activities with Mann’s book show a lack of standards and campaigning zeal.

    You can release tagged volunteer ratings (initial, secondary and tertiary (if any)), iteration number/version number and author self-ratings. Unless I am mistaken, this data is not released. The list of abstracts is data, no doubt, and your team has to be commended for releasing it. Your team has to be commended for releasing the final ratings too, but that is not data. That is the final output. That is the outcome of the processing of initial data points, namely the volunteer ratings.

    I appreciate Cook for enthusiastically leading a large team of volunteers and goading them/motivating them into reading a large number of abstracts. I do research and I am familiar with the monotonous tasks that are required to generate data of this kind. I might not appreciate his conduct with the Lewandowsky Frontiers affair, but that is a different matter. This, is merely a matter of data and validation of results.

    My own take on the paper: If one sets aside the implicits, the explicits give a ratio of .967 favoring the consensus. Your team identified a handful of skeptical papers that Oreskes missed. Sure you wouldn’t want to highlight this aspect, but it is in your data. Standing in your shoes, I would say, good stuff. I think Tol has said pretty much the same thing in his ERL comment.

  174. Tom Curtis says:

    Starting with the first Tol tweet of which I was aware (21 May), Tol clearly endorses the article by Poptech. Given that, and given the nature of his criticisms of Cook et al, I’m sure he will be able to tell us:

    1) On what basis he decided that the three authors surveyed by Poptech (at the time of tweeting) constituted a representative sample of authors of papers surveyed in Cook et al.

    2) On what basis he determined that those three authors constituted a sufficient sample size to draw statistical conclusions.

    Failing that, I’m sure he can tell us why a cherry picked sample of three should have more weight than the survey of authors in the paper.

    While he is about it, perhaps he would comment on Scafetta’s claim (which he has implicitly endorsed) that:

    “Cook et al. (2013) is based on a strawman argument because it does not correctly define the IPCC AGW theory, which is NOT that human emissions have contributed 50%+ of the global warming since 1900 but that almost 90-100% of the observed global warming was induced by human emission.”

    IMO, that the IPCC states only that:

    “Most of the observed increase in global average temperatures since the mid-20th century is very likely due to the observed increase in anthropogenic GHG concentrations.”

    makes Scafetta’s claim about the “IPCC AGW theory” simply false, yet Tol implicitly endorses it without comment.

    Further, given that the abstract of Scafetta’s paper states:

    “We estimate that the sun contributed as much as 45–50% of the 1900–2000 global warming, and 25–35% of the 1980–2000 global warming.”

    and the conclusion states that:

    “By considering a 20–30% uncertainty of the sensitivity parameters, the sun could have roughly contributed 35–60% and 20–40% of the 1900–2000 and 1980–2000 global warming, respectively”

    does Tol not agree that Scafetta has significantly misrepresented his paper in saying,

    “What my papers say is that the IPCC view is erroneous because about 40-70% of the global warming observed from 1900 to 2000 was induced by the sun.”

    ?

    Further, given that the abstract says “as much as 45–50%”, where “as much as” indicates the following number sets an upper limit on the contribution, isn’t Scafetta’s discussion of the paper in reference to a rating of the abstract misleading?

    Again, perhaps Tol could explain why he implicitly endorsed these claims by Scafetta?

    Or did he not know what he was endorsing because he was so eager to criticize Cook et al that he did not bother checking the validity of criticisms of it?

  175. Tom Curtis says:

    Cook and his team have:

    1) Revised articles to update them. Initially they did not note the time and nature of alterations which was poor practice. When they recognized that was poor practice, however, they started archiving prior versions and noting updates.

    2) Deleted posts that violate the comments policy, and deleted segments of posts in violation of comments policy as a substitute for deleting the whole post. As the comments policies are clearly delineated and as warnings are typically given prior to any such deletion, there is no ground for complaint with regard to this practice. Put simply, if you don’t want to adhere to the comments policy, don’t post there. If you do want to adhere to the comments policy, then you will have no objection when posts not adhering to that policy are deleted. Other than the deletion of segments of text in violation of the comments policy, there has not been to my knowledge any editing of comments except by request of the author of the comment, or to fix broken html code/links.

    3) On one occasion a moderator marked a very frequent violator of the comments policy as spam. He did so not knowing that as a consequence all of the violator’s comments were deleted. This was recognized as a mistake immediately, and it was made very clear that marking posts as spam was to be reserved for genuine commercial spam only. Deniers who have trawled through the forum hack know this to be the case because they have read the internal discussion – but still bring the example up as if it were a case of deliberate malfeasance.

    4) On one occasion a post was modified and then, at a later date, a response was made to an earlier comment which predated the edit, in which the response chided the prior commenter for not having read the material introduced by the edit. Again, this was an error brought about by the failure to note the time of updates. As noted above, SkS have introduced procedures to avoid recurrence.

    Shub’s complaints come down to this – any error by SkS or its authors is treated as indicative of a general pattern of behaviour or strategy, even when steps are taken to correct the error and to ensure it does not reoccur. Meanwhile equally or more outrageous errors by deniers are simply ignored.

  176. Reich.Eschhaus says:

    And on another point

    “In the data provided, raters are not identified and time of rating is missing. I therefore cannot check for inconsistencies that may indicate fatigue. I nonetheless do so. Figures S1-S9 shows the 50-, 100- and 500-paper rolling standard deviation, first-order autocorrelation – tests for fatigue – and skewness – a test for drift. I bootstrapped the data 10,000 times to estimate the expected value of these indicators and the 95% confidence interval. Table 1 summarizes the exceedence frequencies.

    The data do not behave as expected. Rolling standard deviations are occasionally too large, and more frequently so than would be expected by chance alone. This may be because, in part of the sample, raters alternated between endorsement and rejection. It may also be because, in part of the sample, all abstracts were rated near the mean. First-order autocorrelation should be zero, but it is not. In parts of the sample, ratings are consistently above average – perhaps because long sequences of abstracts were rated neutral (4).8 The results for skewness indicates drift (towards endorsement of anthropogenic climate change) in the first fifth of the sample. Some parts of the sample show more negative skew than would be expected by chance: Endorsements are clustered. It thus appears that rating was not done consistently, perhaps because the raters tired.”

    The paper states that abstracts were rated randomly. How is analysing the abstract ratings ordered by year and title going to tell us something of interest? So you see some patterns in the YEAR-TITLE order that seem to disagree with random rating? Why could that be? Are the raters to blame or are the abstracts to blame? You don’t even consider this question. That’s quite weak. For instance, you see skewness getting more negative. The abstract ratings show increases over the years in neutral (huge increase) and endorsing (increase), but not for rejecting (not much of a change). Here is your increase in negative skewness. Or explain how the ratings procedure caused this and not the abstracts themselves (why is it ‘tired raters’ but not ‘consensus changing over the years’? You don’t differentiate.)

    You do not look at the abstracts to consider why the YEAR-TITLE order might produce some apparent statistical inconsistencies, but instead only observe that there may be some patterns in the abstract rating data and allege that this might be due to the raters, while forgetting to consider the abstracts themselves as a source as well. Not very scientific.
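
    For anyone who wants to see mechanically what these rolling-window and bootstrap tests amount to, here is a minimal sketch in Python. To be clear about the assumptions: the ratings sequence is invented purely for illustration (it is not the Cook et al. data), the window is one of the sizes quoted above, and far fewer bootstrap replicates than Tol’s 10,000 are used to keep the sketch quick.

        import numpy as np

        rng = np.random.default_rng(0)

        # Invented sequence of abstract ratings on the 1-7 endorsement scale;
        # this is NOT the Cook et al. data, just a stand-in for illustration.
        ratings = rng.choice([2, 3, 4, 5], size=1000, p=[0.05, 0.25, 0.65, 0.05])

        def rolling(x, window, stat):
            """Apply `stat` to every rolling window of length `window`."""
            return np.array([stat(x[i:i + window]) for i in range(len(x) - window + 1)])

        window = 100
        observed_sd = rolling(ratings, window, np.std)

        # Bootstrap the 95% band of the rolling standard deviation under the
        # null that ratings are exchangeable (no fatigue, no drift): shuffle
        # the sequence and recompute each time.
        n_boot = 200
        boot_sd = np.array([rolling(rng.permutation(ratings), window, np.std)
                            for _ in range(n_boot)])
        low, high = np.percentile(boot_sd, [2.5, 97.5], axis=0)

        exceedance = np.mean((observed_sd < low) | (observed_sd > high))
        print(f"fraction of windows outside the 95% band: {exceedance:.3f}")

    The same recipe works for rolling skewness or lag-1 autocorrelation, but it cannot answer the question raised above: whether any excursions come from the raters or from the abstracts themselves.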

  177. Tom Curtis says:

    Tol writes:

    “The trend is a trend in composition rather than a trend in endorsement.”

    Foolishly I just accepted that on your say so. It was irrelevant, if true, because your imputation from it is based on misinterpreting “endorse” to mean “is evidence of”, or something similar.

    Then I recognized that the trend towards endorsement which you identify in the first 20% (1991-2000) of the sample shows that trend to be strongest in the earliest half of the period, while the trend in composition over that period is close to zero. More specifically, the trend in composition in the period 1991-2000 (0.048) is less than the trend from 2001-2011 (1.074), whereas the trend in endorsements in the earlier period (0.004) is greater than the trend in the later period (0.001). Puzzled by this I checked the correlation over the entire interval, which yielded an r^2 of 0.065.

    I do not think these facts support your claim.

    In fact, it appears that your claim is based on no more than an eyeball assessment of a single graph. Can you confirm that? Can you confirm that after insulting Dana by suggesting he needed to be lectured by you about statistics, you have made a claim in a comment submitted for publication based on no stronger statistical analysis than simple eyeballing?

    And if that is not the case, where are the statistics that support your claim?
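
    For what it is worth, here is a minimal sketch of how such a claim could be checked rather than eyeballed. The yearly series are invented stand-ins; the real composition and endorsement fractions would have to be computed from the rating data.

        import numpy as np
        from scipy.stats import linregress

        years = np.arange(1991, 2012)
        rng = np.random.default_rng(1)

        # Invented stand-ins for illustration only: the yearly fraction of
        # "no position" abstracts (composition) and the yearly fraction of
        # stated positions that are endorsements.
        composition = np.linspace(0.60, 0.70, years.size) + rng.normal(0, 0.01, years.size)
        endorsement = np.linspace(0.96, 0.98, years.size) + rng.normal(0, 0.005, years.size)

        comp_trend = linregress(years, composition)
        endo_trend = linregress(years, endorsement)
        relation = linregress(composition, endorsement)

        print(f"composition trend: {comp_trend.slope:.4f} per year (p = {comp_trend.pvalue:.3f})")
        print(f"endorsement trend: {endo_trend.slope:.4f} per year (p = {endo_trend.pvalue:.3f})")
        print(f"r^2 between the two series: {relation.rvalue ** 2:.3f}")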

  178. Tom: Conjecture. You infer a lot from 6 words.

    People had already noted that the 97% wasn’t quite 97% of the surveyed papers. People had already noted that large numbers of papers had been omitted. And now people started questioning data quality, and raised the prospect of systematic bias.

    That’s what I meant to say.

  179. @Reich
    If an unrepresentative subsample confirms the finding of the larger sample (as is the case in Cook et al.), then representative sample would contradict those findings.

    Furthermore, the unrepresentative subsample only leads to the same conclusion through cancelling errors.

    The paper ratings thus reveal that the abstract ratings are invalid.

  180. Marco says:

    Interesting how many directly relevant prior tweets Richard Tol “missed”.

  181. Richard, I was wondering if there was a reason why you haven’t really responded to my comment about your Chi-squared tests. The calculation seems – to me – incorrect and I am genuinely interested in hearing an explanation for why you’ve done it the way you have.

  182. The test is standard: Squared deviation between actual and predicted (under the null).

  183. I know, you’ve said that before. I’ll be more specific. Chi-squared tests if a particular distribution is as expected (or tests the underlying assumptions of the distribution – or whatever the correct terminology is). A classic is testing the fairness of a coin. You would expect 50% heads, 50% tails and hence you can toss a coin many times and use Chi-squared to check that the coin is indeed fair (or not). I’m not saying this because you don’t know this, but simply for completeness.
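
    For that same completeness, here is a minimal sketch of the coin example in Python (the tosses are invented):

        from scipy.stats import chisquare

        # 1000 invented tosses: 520 heads, 480 tails.
        observed = [520, 480]
        # A fair coin predicts 500 of each. Note that the expected counts sum
        # to the same total (1000) as the observed counts.
        expected = [500, 500]

        stat, p = chisquare(f_obs=observed, f_exp=expected)
        print(f"chi-squared = {stat:.2f}, p = {p:.3f}")   # 1.60, p of about 0.21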

    Given the above, it’s my understanding that both distributions need to have the same number of elements. In your Chi-squared test for the doubly-rated papers (volunteers plus authors) your two distributions are (383,377,9 – total 768) and (791,1339,12 – total 2142) for the volunteer rated papers, and (383, 377, 9 – total 768) and (1342, 761, 39 – total 2142) for the author rated papers.

    Straight away this seems wrong. How can you use Chi-squared to test a distribution if it doesn’t have the same number of elements as the distribution you’re comparing it to?

    There are other issues with this too (in my opinion) but the above is, I think, fundamental. I don’t see how this can be correct, but am happy to be corrected if I am indeed wrong. However, that will require you doing a little more than simply telling me what the test does or telling me the definition of the test.

  184. @Wotts
    The test is for proportions, of course.

  185. Yes, but doesn’t the distribution you’re comparing with have to have the same size (total) as the distribution you’re testing? That was question 1 🙂

    Question 2: When producing your distributions for testing (G9 to G11 and H9 to H11 in your spreadsheet) you sum the distributions for the volunteers (C9 to C11) with the distributions for the authors (D9 to D11). How is this a reasonable representation of a distribution for the volunteer and author rated papers? These are not two sets of separate papers rated by two different groups. They’re the same set of papers rated by two different groups. All you’ve done is averaged the two distributions. It’s clear (without a Chi-squared) that the author-rated distribution is very different to the volunteer-rated distribution (and one could argue that there is a perfectly good reason why this is so). By summing this with the distribution from the volunteers it’s clear that what you get will have a large Chi-squared value. This isn’t surprising and doesn’t, in my view, tell us anything that we didn’t already know and that isn’t already in the Cook et al. paper.

  186. Indeed. The null is that the distributions are the same, that is, addition is okay.

  187. No, it’s not. Look at your spreadsheet properly. Your Chi-squared is calculated by comparing the distributions in column G (G9 to G11) and column H (H9 to H11) to the distribution in column B (B9 to B11). These distributions do not have the same number of elements. Those in columns G and H total to 768 while that in column B totals to 2142. They are different. How can you use Chi-squared to compare these distributions? This seems like a fundamental mistake to me. You’re not obliged to do anything, but it would be nice if you would put a little bit of effort into explaining why what you’ve done is correct and why what I’m saying is wrong.

    Also, you haven’t even attempted to explain why adding the volunteer distribution to the author distribution (for those papers that are double rated) is appropriate. I think it is not. You’ve just responded by saying that it’s okay. Why is it okay? Again, I don’t think you’ve produced a representative distribution. You’ve produced an average. That, again, doesn’t seem correct to me. A slight expansion of your explanation would be appreciated.

  188. Google testing for equality of proportions

  189. I have, that’s why I think what you’ve done is incorrect. Maybe you’re really busy and one line answers are all you have time for. If what I’ve said is wrong, it shouldn’t take long for you to explain where I’ve gone wrong. Just to be clear. At the moment, I think your Chi-squared calculation is incorrect and I think your calculation of a predicted distribution for the doubly-rated papers is also incorrect (or unfounded/unjustified). Happy to be corrected, but until you do so, I’m unlikely to change my mind. Will keep considering it though. Maybe, I’ve made a silly mistake. Every time I look at it though, it just seems to confirm your error.

  190. Tom Curtis says:

    Tol,
    1) The 97% was 97% of abstracts taking a position on AGW, as indicated in the paper. People noting what the paper says and pretending that was a criticism is not valid criticism – it is posturing.

    2) No papers had been omitted. The abstracts rated consisted of all abstracts returned by the search, which was specified in the paper. Describing the fact that alternative searches, or searches on alternative databases, return more abstracts as the omitting of abstracts is a straightforward misrepresentation.

    3) People in general were not questioning the data quality. Only well known AGW deniers. They did so using a small sample that was not representative and which, for all you knew, had already been included in the large, much more representative sample from the author self-ratings. Given your criticism of Cook et al, that should have been enough for you to dismiss the criticism out of hand. Instead you laud it and ignore the fact that it is effectively pre-rebutted by the author self ratings.

    4) I did not force you to use twitter to mount your criticisms. You do not get to choose an abbreviated communication form and then plead that you did not have room to express appropriate qualifications etc. If you thought qualifications were necessary, you should have used a less abbreviated form of communication. If you thought no qualification was necessary, then you in fact endorsed the contents of the article to which you linked, including Scafetta’s absurdities.

  191. Tom: Internal validity and external validity are different things.

  192. Tom Curtis says:

    Tol writes:

    “If an unrepresentative subsample confirms the finding of the larger sample (as is the case in Cook et al.), then representative sample would contradict those findings.”

    Restricting the analysis to only those abstracts for which the paper was self-rated by authors, we find that endorsements are 98.5% of endorsements plus rejections. That compares to the 97.2% in the self-rated papers. Ergo, even restricting our analysis to only those papers for which we have both an abstract rating and a self-rating, the 97% figure is effectively confirmed.

    Tol goes on to say:

    “Furthermore, the unrepresentative subsample only leads to the same conclusion through cancelling errors.”

    In fact they lead to effectively the same conclusion because abstract ratings are conservative relative to self-ratings, being biased away from either endorsement or rejection towards neutrality. That may be because abstract raters were conservative, because self-raters had more information with which to rate, or because authors who strongly endorsed or rejected the consensus were more motivated to respond to the survey of authors; or some combination of the three. The combination of these three factors does lead to a slightly stronger conservative bias in the rating of rejection abstracts than of endorsement abstracts. The slight difference in bias, however, is sufficiently small that the self-ratings show the overall results of the abstract ratings are likely to be close to correct.

    So, in two more instances Tol takes the trouble to show where there may be an issue, but takes no trouble to quantify the potential magnitude of the effect on the outcome of the paper. In each instance where Tol has done this, the magnitude of the effect can be shown to be small. That is unlikely to be coincidence, IMO.

  193. @Richard,

    I’m back in the office and have now had some time to look at your chi-squared calculations. (Thanks for providing the link to your Excel file.)

    1) Okay, you included the non-subsample (i.e. “single”) deviations in your chi-squared calculation. This explains the difference in our chi-2 numbers. (I would say that including the non-subsample deviations is somewhat redundant since you can compare the representativeness of your subsample to the sample directly, but obviously it isn’t wrong to include them either.)

    2) Here, I still do not understand what you are doing. In particular, look at your predicted values for the “single” and “both” cases (cells G9:H11). Neither of these sums to the expected subsample total (i.e. 2142)! The reason is clear: in working out your proportions, you have divided by the sample total (not the subsample total). You can see this by looking at the formulas in cells F9:F11… You have divided by E$6. (Surely you should be dividing by E$12?) This would vastly inflate your calculated chi-2 stat.
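
    To spell out the sanity check I have in mind (a sketch only, paraphrasing the spreadsheet rather than reproducing its layout), the expected counts for the subsample should be built from proportions that sum to 1, so that they sum back to the subsample total:

        import numpy as np
        from scipy.stats import chisquare

        # Counts quoted in this thread (endorse, no position, reject):
        full_sample = np.array([3896, 7970, 78])  # all abstract ratings, total 11944
        subsample = np.array([791, 1339, 12])     # abstract ratings of papers also self-rated, total 2142

        # Expected counts for the subsample under the full-sample proportions.
        # Because the proportions sum to 1, the expected counts sum to 2142;
        # dividing by the wrong total (the E$6-for-E$12 slip) breaks this and
        # inflates the chi-squared statistic.
        expected = full_sample / full_sample.sum() * subsample.sum()
        assert np.isclose(expected.sum(), subsample.sum())

        stat, p = chisquare(f_obs=subsample, f_exp=expected)
        print(expected.round(1), f"chi2 = {stat:.1f}, p = {p:.1e}")

    The printed statistic is not meant to reproduce any figure quoted in the thread; it only illustrates the bookkeeping.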

  194. Tom
    Instead of carrying the water for others’ mistakes, why don’t you step up and ask for the raw data to be released?

    Your use of words like “deniers” is meaningless. Do you even know who you are talking to?

  195. @Reich

    You’ve exactly pinpointed another issue that I wanted to raise; autocorrelation in terms of what dimension? (i.e. Given the random assignment of abstracts among raters.) To not even mention the possibility of evolving scientific opinion — or some such similar factor — alongside the suggestions of “author fatigue”, etc, strikes me as rather disingenuous:

    You do not look at the abstracts to consider why the YEAR-TITLE order might produce some apparent statistical inconsistencies, but instead only observe that there may be some patterns in the abstract rating data and allege that this might be due to the raters, while forgetting to consider the abstracts themselves as a source as well.

    PS – FWIW, I completely agree with your above comment that performing chi-squared tests on the subsample is of pretty limited value as a critique of the Cook paper, given the stated qualifications in the original text (acknowledged differences, author self-selection, etc.) However, I was/am interested in how Richard got the exact chi-squared figures that he did provide.

  196. Wott,

    What your guests are doing, right now, I call bulldozing to the endzone. It is a way to punt a conversation that belongs in a subthread down to the end of the comments. It seems that it has an effect on readership.

    Here’s how to replicate this.

    ***

    > Misrepresentation would be if I had claimed they used Scopus.

    I said you were misrepresenting what replication is, Richard. One simply does not replicate a WoS search by using Scopus. In any case, my main point was that to say

    At that point, I had raised concerns with sampling and data quality.

    is at the very least a misrepresentation of what you did at this point.

    ***

    And speaking of what you did at that point, we have no way to replicate the analysis you did to backup the concerns you were raising about what you called “load of nonsense”.

    On what basis did you say to Klein that the 97% was a load of nonsense?

    Please send:

    – ALL THE TIMESTAMPS
    – ALL THE FILES
    – ALL THE VERSIONS
    – ALL THE RELATED CORRESPONDENCE

    If anything from that is missing, I won’t be able to replicate what you did. Which would spell doom for the significance of your claim, if I’m to apply your own criteria. I have the tweet to attest to that, in case you’re wondering.

    Just trying to make sure proper methods are being followed, of course.

    ***

    > You may want to consult a dictionary too.

    This is the second time Richard uses that line:

    Readers will still have to wonder in which sense “your 97% is a load of nonsense” disputes anything, and if

    > Questioning a sampling strategy is the new anti-science

    does not misrepresent a bit what Richard was doing at the time.

    Some may argue that Richard might have been sugar-coating his own participation in this hurly burly, to say the least.

    ***

    Also note my response:

    Here are the meanings:

    1. To put a question to. See Synonyms at ask.
    2. To examine (a witness, for example) by questioning; interrogate.
    3. To express doubt about; dispute.
    4. To analyze; examine.

    In retrospect, we’re glad Richard tried to fulfill other meanings of “question” than “dispute”.

  197. > Internal validity and external validity are different things.

    Indeed. In some other fields, these concepts are not even represented by the same words.

    Patronizing crap won’t cut it anymore, Richard.

  198. There are others too, Marco. So what Richard seems to be implying, namely that:

    > At that point, I had raised concerns with sampling and data quality.

    might not represent very well what he did.

    Nor has Richard replicated his concerns very well.

  199. If I may summarize.

    1) Tol promoted absurd inaccurate attacks on our paper from contrarians like Poptech, Scafetta, etc.

    2) This ticked me off. I suggested Tol was behaving like a denier. Probably a poor word choice (though I’d still argue an accurate description of his behavior).

    3) Tol subsequently dug in deeper and found what he believes are valid criticisms of our paper, completely distinct from those original inaccurate criticisms that he promoted.

    4) Tol now argues that because he believes his current criticisms are valid, the original criticisms were also valid? More precisely, he’s not willing to admit he made a mistake in promoting those inaccurate attacks on our paper.

  200. @Grant

    Exactly and I’ve asked Richard that exact question and not really had a satisfactory answer.

  201. Dana, I guess you mean this:
    http://www.populartechnology.net/2013/05/97-study-falsely-classifies-scientists.html

    Andrew asked a number of people to give their views. I think it is perfectly valid to ask “Cook says you said A. Is that so?”

    Of course, it does not mean much that 7 out of a large number disagree, but they all do. Carlin’s remarks are intriguing, but we all know that John Cook and integrity are synonyms.

  202. shub says:

    Summarize please. But provide the data and participate as well.

    It is natural for authors to feel criticism to be unfair. Name-calling torques the discussion. Your group uses “denier”, “liar” and other designations quite frequently. I don’t think it benefits anyone.

  203. > Your group uses “denier”, “liar” and other designations quite frequently. I don’t think it benefits anyone.

    That must be because they ain’t true Dutch, for if they were, we would excuse them for “directness”, and perhaps even see dislike of fidgeting around the bit as a benefit.

    But speaking of designations, what about anti-science?

  204. Another interesting, or should I say Dutch, choice of epithet:

    Note that this was meant for a formal comment.

  205. Peter Jacobson says:

    This has become pathetic. Climate science is too important to have descended into such childish behaviour. I am certain that the vast majority of literature published in academic journals supports the notion of anthropogenic global warming. It’s also apparent that the Cook et al paper unfortunately lacks statistical rigor.

    Let’s learn from this lesson and move on. We, the climate science community, need to improve: we need to be more open and honest, we need to be unbiased, we need to accept valid criticism and we need to improve the quality of our analyses. And we need to do so whether or not we feel the ‘other side’ fails to exhibit these very same qualities.

    Let’s elevate the level of scientific discourse with respect to climate science, rather than devalue it.

  206. “swivel-eyed loon” is well-understood in British English, but less suited for an international audience.

  207. Certainly, I agree with the importance of climate science and how we should elevate the level of discourse. Part of my criticism of Richard’s paper was the level of the discourse. The style of writing is antagonistic and insulting to the authors of the other work. Maybe this post is a bad example, but my intent in writing this blog was to try and keep the discussion civil. I may not have achieved that myself and what I certainly haven’t done is work out what to do if others choose not to do so.

    I must admit, however, that I don’t really know what lesson we’ve learned from this. I agree that one should always aim to improve, to be more open and honest, to improve the quality of the analysis and to do so irrespective of how others behave. I’m just not quite sure what this has to do with this particular discussion (although maybe it’s valid for some aspects).

  208. Agreed.

    I would add that it is high time that climate researchers stand up against sloppy research, and against the apologists of sloppy research.

  209. Richard, absolutely. My understanding of your position is that you are a rigorous researcher who has an understanding of statistical analysis. Given that, can you please explain why your Chi-squared test calculation isn’t wrong. I (and others above such as Grant M) have been very clear – and polite I believe – in our issues with that calculation. It seems to be incorrect (which – if true – would be a little ironic given your statements about rigour and sloppy research). It shouldn’t be difficult to explain and I’d be happy to be shown to be wrong.

  210. The second test was indeed wrong, but not as suggested by Grant or you. The chi2-stat is 300 rather than 6000; p-value is still less than 0.1%.

    The test for equality of proportions is just that. There are two samples. No assumptions about replacement, independence, whatever.

    I would rather do a paired test, but that would require the hidden data.

  211. Grant: Well spotted. Chi2 is 316 or so. p < 0.1%

  212. > Andrew asked a number of people to give their views. I think it is perfectly valid to ask “Cook says you said A. Is that so?”

    Then at least Richard agrees that the second part of Cook’s study is “perfectly valid”, which might be the most relevant part anyway. Not that this means the authors were perfect. See how Richard rated his own papers, for instance.

    ***

    Network analysts will sure appreciate this tweet:

    “Frolics”. Spoken like a true Dutchman.

  213. “but not as suggested by Grant or you”

    Come on Richard, you really are being as elliptical as a Sphinx here. Mind telling us how it was wrong in a way not suggested by us?

    “p-value is still less than 0.1%”

    Yes, but that’s not entirely the point is it? After all, I know of papers that have been written about potential addition mistakes that confuse 98% with 97%… 🙂

  214. Grant: Our posts crossed. You spotted the error. That’s why we make data and code available for all to inspect.

  215. “It’s also apparent that the Cook et al paper unfortunately lacks statistical rigor.”

    Sorry, maybe I missed something – how is that apparent?

  216. > I would add that it is high time that climate researchers stand up against sloppy research, and against the apologists of sloppy research.

    Those damn “apologists,” again. Spoken like a true Mike.

    Let’s hope that when they are finished grandstanding, these climate researchers will stand down from their soap boxes and answer simple questions like Grant’s above:

    2) Here, I still do not understand what you are doing.

    Grandstanding does not help answer such questions.

  217. > I must admit, however, that I don’t really know what lesson we’ve learned from this.

    Mine would be that we’ve just seen a reason why the word to question evolved the way it did, and that prolific writers are certainly not perfectionists.

  218. Reich.Eschhaus says:

    @Richard Tol

    I did a quick check on the numbers now:

    You made an error in your Excel file. Chi Square 5793 is definitely wrong.

    Maybe you forgot to change a reference to a cell after copying the formula. E9/E$6 should be E9/E$12. Hope this clears up some confusion. 😉

  219. Same error, found independently by three people.
    I love open review.

  220. Reich.Eschhaus says:

    Oh well, Grant beat me to it… Good catch Grant!

  221. Thanks for confirming.

    about data, though some restrictions will always remain.

    The lesson is that we should automate our tests as far as possible (e.g. the “tabi” command in Stata). Excel calculations are fast becoming the bane of economists.

  222. Tom Curtis says:

    Dana, it is a weasel word. Cook et al did not apply every conceivable statistical test. Indeed, it applied very few statistical tests, which is a weakness in the paper. Therefore it is not as rigorous statistically as it could be. That is then treated as indicating that the results do not follow from the data, which is false.

    Bizarrely this is then turned around to attack people who point out that many criticisms of the paper either are based in clear misinterpretations of the paper, or are statistically invalid, or (like most of Tol’s criticisms) depend on pointing out statistical facts which have no consequence on the result of the paper, while carefully not drawing attention to the irrelevance of the statistical facts that are mentioned.

    One example of this was RomanM’s discussion showing that Cook et al could have used a better method for determining linear trends but carefully not mentioning that his better method showed the trend in endorsements as a percentage of endorsements plus rejections was positive and statistically significant, and attempting to dismiss the relevance of that after it was highlighted.

    The most absurd example of the latter strategy has to be Tol’s statistical analysis of the effect of the order of listing of abstracts on ratings. As the order of rating was random with respect to the order of listing, the order of listing conveyed zero information about fatigue or any other influence on rating. Tol knew this, yet included the analysis as though he could find out about influences of the order of rating on ratings by examining the statistical properties of the order of listing.

    Peter Jacobsen adopts the absurd position that defenders of climate science should not defend papers from such absurd and irrelevant criticisms. If sincere, he is at best ignorant of the fact that climate change deniers (ie, Andrew Kahan and the like) and their fellow travelers (ie, Richard Tol and the like) will attack any paper they find inconvenient, regardless of its merits. His call to not defend climate science papers is merely a disguised call to let deniers dominate the public discourse with false or irrelevant criticisms (at best).

  223. Reich.Eschhaus says:

    @Richard Tol

    (In addition to what Tom Curtis already mentioned.)

    “If an unrepresentative subsample confirms the finding of the larger sample (as is the case in Cook et al.), then representative sample would contradict those findings.”

    Not necessarily relevant. It depends on how the subsample is unrepresentative and if it matters to the conclusion that is under discussion. As said, rate of response rises with recency. Probably a big source of unrepresentativeness (and might explain (part of) the 22 Chi square). There is however a comparison over the years in the paper. For the conclusion under discussion, i.e. percentage endorsement of papers taking a position, the abstract ratings and the self-ratings align quite well (but again, not perfect).

    “The paper ratings thus reveal that the abstract ratings are invalid.”

    No, they don’t. But here we touch on the more interesting question. Why do paper self-ratings and abstract ratings diverge (while roughly agreeing on the aggregate statistic)? Is it because the abstract ratings were too much biased towards the neutral position or is it because the abstracts themselves were formulated too neutrally (this appears to be the biggest discrepancy between abstract ratings and self-ratings)? I hope Cook et al will write some more on that still. Because it is an interesting question in itself to what extent abstracts can be used to estimate the variability of opinion in a scientific field, and some information about possible pitfalls there, hopefully accompanied by some clear numbers, would be most welcome.

    But we appear to be back at square one. I still do not see that those Chi squares are relevant criticisms. Abstract ratings different from self-ratings: check. Subsample different from whole sample: check.

  224. Richard, I’m impressed that you’ve admitted an error (nothing wrong with there being an error in fact). Of course, I’m not quite sure what Grant or I noted that wasn’t the error, but I should be thankful for small mercies I guess. Just to be clear, I didn’t ever state that the p value was likely to be greater than 0.05, simply that your calculation had an error. Maybe you’re suggesting that we randomly guessed that there was an error and just by lucky chance happened to be right. Maybe you could test for the likelihood of that being true.

    However, you still haven’t clarified why you think the two doubly-rated samples should be summed. I don’t really see how this represents anything. It is essentially an average of the distribution of the doubly-rated papers that were ranked by the volunteers and the same set of papers ranked by the authors. What does a Chi-squared test on this “average” distribution tell us. I guess it tells us that it isn’t consistent with the test distribution but so what, that’s obvious. No one is claiming that it should be or that it not being consistent is significant. I would quite appreciate you clarifying this if you have some time and are willing to put the effort into trying to do so.

  225. Reich.Eschhaus says:

    @Richard
    @Dana

    The point where Richard “runs” with the PopTech story appears, to an onlooker like me, to be the pivotal point in the discussion. PopTech writes a piece in which he makes it look as if Cook misrepresented PAPERS. What Cook et al did was rate ABSTRACTS. In the paper it is clear that abstract ratings and self-ratings are not a 100% fit. Nowhere in the paper does it say (e.g.) “Idso/Scafetta/Tol endorse AGW!” PopTech’s piece is a clear misrepresentation.

    This explains (I guess) Dana losing his temper somewhat, and it all kind of ‘escalates’.

    What’s also interesting: the reason that PopTech could do that piece at all was that Cook et al made it possible on the Skeptical Science website for people to rate abstracts themselves and compare with the Cook et al abstract ratings (or to check the supplementary data file). You give access to the data, now see what they do with it…

  226. So, I’ve just checked and indeed I get – as do you I assume – a chi-squared of 608. The two distributions that I used were: the expected distribution, based on the proportions of the results from the full Cook et al. sample and a total of 4284 (2142 x 2)

    endorse – 1397
    no position – 2858
    reject – 28

    and the distribution being tested was

    endorse – 2133
    no position – 2100
    reject – 51

    So, my question to you – which I’ve posed before – is why the above distribution (which is obtained by summing the distribution of the volunteer-rated sample with the author-rated sample) is a distribution that is worth testing. It’s clear that the author-rated sample would have a large Chi-squared. We don’t need to do a Chi-squared test to tell us that. Adding this to the volunteer-rated sample will therefore clearly also produce a large Chi-squared value. As others have pointed out, what is the significance of these differences? Maybe there are very good reasons. It is clear from the Cook et al. paper that they’re different. They didn’t try to hide this. It seems to me that a Chi-squared test only has any value if there is some valid reason for comparing the two distributions and if one then tries to interpret why they differ. Simply showing that they differ (especially if that is already obvious) doesn’t seem to have much value.
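
    For anyone wanting to check the arithmetic, here is a minimal sketch of that calculation in Python (scipy is simply my choice of tool; the numbers are the ones listed above):

        import numpy as np
        from scipy.stats import chisquare

        # Expected distribution: full-sample abstract-rating proportions
        # (endorse, no position, reject) scaled to a total of 4284 (2142 x 2).
        full_sample = np.array([3896, 7970, 78])
        expected = full_sample / full_sample.sum() * 4284   # roughly (1397, 2858, 28)

        # Distribution being tested: volunteer abstract ratings plus author
        # self-ratings for the doubly-rated papers, summed.
        observed = np.array([791 + 1342, 1339 + 761, 12 + 39])   # (2133, 2100, 51)

        stat, p = chisquare(f_obs=observed, f_exp=expected)
        print(f"chi-squared = {stat:.0f}, p = {p:.1e}")   # comes out around 607-608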

  227. Okay, I’ve just realised that I mis-remembered your new Chi-squared value. You got 300, not 600. I got 608. I believe that this is simply because my two distributions total 4284 and yours total 2142. With that correction in mind, the rest of my comment is unchanged.

  228. > Abstract ratings different from self-ratings: check.

    Another replication success!

  229. Reich.Eschhaus says:

    Wotts,

    Go here:

    http://www.quantpsy.org/chisq/chisq.htm

    In the first calculator, fill in 791, 1339, 12 in the first row, and 1342, 761, 39 in the second row. You’ll get the 316 Chi Square. Hope that helps.

  230. Reich.Eschhaus says:

    Wotts,

    In the same calculator, fill in 3105, 6631, 66 in the first row, and 791, 1339, 12 in the second. This will give you the 22 Chi Square. The first are the abstract ratings that were not also self rated, the second the abstract ratings that were also self-rated.
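
    If you prefer code to the web calculator, here is a minimal sketch that runs the same two tests with scipy (the numbers are the same as above, and the results should match the 316 and 22 figures):

        from scipy.stats import chi2_contingency

        # Doubly-rated papers: abstract ratings (row 1) vs author self-ratings
        # (row 2), as (endorse, no position, reject).
        table_both = [[791, 1339, 12],
                      [1342, 761, 39]]
        stat_both, p_both, dof_both, _ = chi2_contingency(table_both)
        print(f"abstract vs self-ratings: chi2 = {stat_both:.1f}, p = {p_both:.1e}")

        # Abstract ratings of papers not also self-rated (row 1) vs abstract
        # ratings of papers that were also self-rated (row 2).
        table_sub = [[3105, 6631, 66],
                     [791, 1339, 12]]
        stat_sub, p_sub, dof_sub, _ = chi2_contingency(table_sub)
        print(f"singly vs doubly rated: chi2 = {stat_sub:.1f}, p = {p_sub:.1e}")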

  231. Speaking of RomanM, on May 21st, 2013 at 5:18 am:

    [A] test for the difference of two proportions or equivalently, a chi-squared contingency test might be appropriate here. [Follows two implementations in R, then] I say “might be” appropriate because there are possible complications due to the fact that the same authors can appear in different papers so that the ratings of individual papers need not be independent.

    http://rankexploits.com/musings/2013/possible-self-selection-bias-in-cook-author-responses/#comment-113569

    The comment finishes off with the mandatory peroration.

    Readers will also observe that once RomanM chimed in, the thread petered out quite fast. Chewbacca mentioned my name, but did not get the motherly refuge he might have been seeking.

  232. Reich.Eschhaus says:

    Willard

    Now I feel like a replicant!

    (There was a Dutch replicant too btw!)

  233. Tom Curtis says:

    Reich, thank you for the link to the calculator.

    Using it I have checked the predicted number of neutral (endorsement level 4) and rejection papers among just the neutral and rejection papers. The predicted quantities are 1338 neutral and 13 rejections, compared to 1339 neutral and 12 rejections among abstract ratings that were also self-rated. The Chi squared was 0.04 with a p value of 0.841, indicating (if I am interpreting this correctly) that authors of papers rated as rejections in the abstract rating were neither more nor less likely to self-rate their papers than were authors of papers rated as neutral.

    Repeating the calculation for endorsement and neutral abstracts, the predicted values are 699 endorsements and 1431 neutrals, with a Chi squared of 8.736 and a p value of 0.003, indicating that authors of papers rated as endorsing AGW were more likely to respond than those of papers not rated as endorsing AGW.

    For the comparison across endorsements, neutrals and rejections, the predictions are 699, 1429 and 14, with a Chi squared of 8.761 and a p value of 0.125. That is significantly smaller than the Chi squared of 315.7 from comparing abstract ratings to self-ratings among papers that were rated both ways. That suggests that response bias was significantly smaller than the combined effect of the bias due to additional information and the conservative bias towards a neutral rating, which between them explain the difference.

  234. Tom Curtis says:

    I’ll just add that if you weight the self ratings by the ratio of expected number to actual number of abstract ratings in each category, the result is that endorsements are 96.27% of papers taking a position. Once again, it is easy to identify potential biases in the data – but when you compensate for them they always turn out to be non-consequential.
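
    For what it’s worth, here is a rough sketch of the weighting Tom describes, as I understand it: I’m assuming the weights are the expected-to-actual ratios of abstract ratings in the doubly-rated subsample, applied to the marginal self-rating counts. That assumption, plus rounding, is presumably why I get roughly 96.3% rather than exactly the 96.27% quoted.

    # columns: endorse, no position, reject
    full_abstracts = [3896, 7970, 78]   # all abstract ratings (11944 papers)
    sub_abstracts = [791, 1339, 12]     # abstract ratings of the self-rated papers
    self_ratings = [1342, 761, 39]      # self-ratings of the same papers

    n_full, n_sub = sum(full_abstracts), sum(sub_abstracts)

    # expected abstract ratings if the self-rated papers were a representative sample
    expected = [n_sub * x / n_full for x in full_abstracts]

    # weight each self-rating category by expected/actual abstract counts
    weighted = [s * e / a for s, e, a in zip(self_ratings, expected, sub_abstracts)]

    endorse, _, reject = weighted
    print(round(100 * endorse / (endorse + reject), 1))  # roughly 96.3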

  235. A. The sample of paper ratings is not representative for the sample of abstract ratings.

    Any aggregate statistic (say, 97% endorsement) for the actual sample paper ratings would be different for a representative sample of paper ratings.

    B. The distribution of paper ratings is different than the distribution of abstract ratings.

    Therefore, the paper ratings are statistically significantly different from the abstract ratings. If we follow Cook and take the paper ratings as closer to the true ratings, then the abstract ratings are full of errors.

    The similarity of the aggregate endorsement level is then due to cancelling errors in the abstract ratings.

  236. Richard, I agree, but I am unsure why what you say is surprising given that it is obvious from Cook et al. that the distribution of paper ratings differs from the distribution of abstract ratings. You still haven’t actually answered the fairly simple question of why you decided to Chi-squared test the sum of the doubly-rated volunteer sample and the author sample with the full volunteer sample.

  237. Okay, I’ve just read Reich’s comment. Richard, maybe you could clarify: have you changed your calculation to the one done by Reich, which does make a little more sense in that it no longer sums the two doubly-rated distributions?

  238. I do the test designed by Karl Pearson: Form the supersample, test whether the subsamples are identical.

    You can of course directly test for similarity between the samples A and B, but then the result depends on whether A is the null or B.

  239. Richard, I know the test. You don’t need to keep telling me who designed it. The question I’m asking (which is very simple) is that, in your spreadsheet, one of your samples was formed by summing the distribution of the volunteer-rated papers – those that were also rated by the authors – with the distribution of the author-rated papers. Is this what you did in your new calculation and, if so, why?

  240. You need the supersample so that you have an objective null.

  241. In what way is that an objective null? It is the average of the volunteer rated papers and the author rated papers. What does it represent? Nothing as far as I can tell. The papers are not independent, so you’ve just added together two different ratings for the same set of papers. In my view, you’ve simply done a Chi-squared test on a distribution that has no relevance.

  242. Wott: Maybe it is time that you reread Pearson.

  243. we’re in a different thread all of a sudden

    the reason that I used Pearson’s test is simple: I want to test that the two samples are drawn from the same population

    I do not want to test whether the two samples are the same, because then I would need to do two tests

    as the chi2 is symmetric, the result is the same, of course
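
    In case it helps with the supersample question, here is a minimal sketch of my reading of “form the supersample, test whether the subsamples are identical”, using the doubly-rated counts from Reich’s comment above: pool the two samples, use the pooled proportions as the expected values, and sum (O − E)^2/E over both subsamples. Numerically this gives the same statistic as the ordinary two-sample contingency test.

    from scipy.stats import chi2_contingency

    abstract = [791, 1339, 12]     # abstract ratings of the doubly-rated papers
    self_rated = [1342, 761, 39]   # self-ratings of the same papers

    # "supersample": pool the two samples
    pooled = [a + s for a, s in zip(abstract, self_rated)]  # 2133, 2100, 51
    n_pooled = sum(pooled)

    # test each subsample against the pooled proportions
    chi2_super = 0.0
    for sample in (abstract, self_rated):
        n = sum(sample)
        for obs, pool in zip(sample, pooled):
            expected = n * pool / n_pooled
            chi2_super += (obs - expected) ** 2 / expected

    # the standard two-sample contingency test on the same data
    chi2_cont = chi2_contingency([abstract, self_rated])[0]

    print(round(chi2_super, 1), round(chi2_cont, 1))  # both roughly 315.7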

  244. Indeed, maybe I should. However, maybe I should also consider the possibility that you’re unwilling or unable to answer my question 🙂 In fact, I could ask it differently. Did you use the same numbers as used by Reich in a comment above? They at least appear to make sense.

  245. Yes, I managed to get into the wrong thread and then moved my comment to the right place as you submitted.

    I will acknowledge that maybe you’ve done something really clever and I simply don’t understand this test. However, I’ve read quite a bit about this recently and nowhere can I find something that says: if you want to test that two distributions are from the same population, add them together and then do a Chi-squared test on the new, summed distribution. Okay, maybe I can see that it could be a method (i.e., if they were from the same population, then adding them together should still give you a small Chi-squared). However, it’s patently obvious that the author-rated distribution differs from the abstract-rated distribution. Adding this to the volunteer-rated sample is not going to suddenly give a small Chi-squared.

    This is obvious from Cook et al. Maybe one should be rigorous and do the test, but I really don’t think anyone would claim that there was any chance that the Chi-squared would end up being small. So, you’ve done a test that gives the answer that everyone would have expected. What’s the significance? It’s been discussed elsewhere, so you can treat the question as rhetorical.

  246. Reich, thanks. Very useful. However, I don’t think Richard is doing quite the same test as you – in the first case you describe – despite you getting about the same answer. Or at least, Richard isn’t testing exactly the same distribution. Maybe it doesn’t really make any difference to the interpretation. The author distribution is clearly different to the volunteer-rated distribution, so you’re showing the same thing, but in a slightly different way.

  247. Reich.Eschhaus says:

    @Wotts,

    It is exactly the same calculation. Check Tol’s excel file (and correct E$6 to E$12 in the relevant F9 to F11 cells).

    I had hoped the numbers and how they were used to calculate test statistics were clear by now.

    @Tom Curtis

    “For the comparison across endorsements, neutral and rejections, predictions are 699, 1429 and 14,”

    You could have lifted those numbers straight out of Tol’s excel file. 😉

    @Richard Tol

    You could have ended this “numbers discussion” ages ago. No need to have it going on perpetually.

  248. Reich, indeed that is what I had done. We can probably stop this now as it’s fairly clear, but I think the actual numbers you put into your earlier comment are not quite the numbers for the distributions that were used – at least I think so (although maybe they are and I’m still confused :-)). Anyway, we could have sorted this out ages ago and it seems we’ve at least reached agreement about the actual calculation.

  249. Richard Tol says:

    Maybe the confusion arises because hypothesis tests are not reflexive. That is, a test for A=P is not a test for P=A.
    The Pearson test (as used by me) tests for both A=P and P=A.
    The test by Reich and others only tests for P=A.

  250. Indeed, that may well be it. Reich, above, indicates that he did do the same test (using your spreadsheet) – I was a little confused by the actual numbers he presented in his comment. I guess we could continue this and discuss the relevance and significance of the tests and other such things (which have been addressed in some comments above), but since we’ve actually seemed to reach some kind of agreement about the calculation itself, that might seem like a good place to simply stop and accept that maybe we’ve actually achieved something here 🙂

  251. Tom Curtis says:

    Richard Tol:

    “Any aggregate statistic (say, 97% endorsement) for the actual sample paper ratings would be different for a representative sample of paper ratings.”

    Yes. As already noted a representative sample of paper ratings would show a 96.27% endorsement percentage, compared to a 98.04% endorsement percentage for abstract ratings. How does that difference undermine the paper? And given the small size of the difference, why have you calculated the Chi squared but not calculated the impact of the sample difference on the results?

    “The similarity of the aggregate endorsement level is then due to cancelling errors in the abstract ratings.”

    Well possibly. But it is far more likely that it is the result of the same source of bias in favour of neutral ratings being applied near equally to both endorsement and rejection papers. If we accept that possibility, but still conclude that the similarity of the aggregate endorsement level is accidental, we are rejecting outright the validity of being conservative in rating, making estimates etc. For my part, I believe that if you are conservative and still obtain similar aggregate endorsement levels, that merely shows that the result is robust.

  252. Richard Tol says:

    Tom:
    That’s an issue of scale.
    The consensus rate is 98% for the abstracts. It is 97% for the papers, and 96% for the bias-corrected papers.
    The dissensus rate is therefore 2% for the abstracts, 3% for papers, and 4% for the bias-corrected papers.

  253. Reich.Eschhaus says:

    @Richard

    You should have checked first. The calculator I directed Wotts towards (in the hope that the numbers discussion would stop) produces exactly the same test statistics as your excel file does. Now, why would that be?

    Btw, if in a discussion like the one happening here there are many people raising doubts about your numbers, it can do no harm to check the numbers in your own excel file. The discovered error is glaringly obvious. No need to keep the discussion going until someone pinpoints the exact position where you went wrong. It saves everybody a lot of time.

  254. Richard Tol says:

    Sorry, Reich, my bad.

  255. Tom Curtis says:

    Richard, assume that there are 4% rejection papers in the scientific literature. In what way does that call into question the claim that there is a consensus in support of AGW in the scientific literature? Does the difference between 3% claimed in the paper and 4% in bias corrected self-rated papers call that consensus into question?

    It is very clear, in my mind, that you are trying to make a mountain out of a molehill.

  256. Richard Tol says:

    Well, Tom, if you think that 2 is a good approximation of 4, then I don’t really see the difference between a mountain and a molehill.

  257. Richard, I think that is an unfair representation of Tom’s comment. He clearly isn’t suggesting that 2 is the same as 4. He is suggesting that the corrections one can make to the data still result in a consensus level in the high nineties.

    Here’s my summary of where we seem to currently stand. Given the data that is available, there is some indication of a bias (based on using the author rankings, if I’ve understood Tom correctly). It appears that there is general agreement about this. Correcting for this bias indicates that the consensus may be more correctly represented as 96% rather than 97% or 98%. Maybe one could argue that Cook et al. should have done this analysis, but it’s not changing the ultimate conclusions significantly (the conclusions being that a significant majority of papers endorse AGW – as you yourself seem to agree).

    I accept that there may be additional biases and inconsistencies that we can’t yet know about given that we don’t have access to all of the data, but I’m not that keen to start that whole discussion again.

  258. Richard Tol says:

    The consensus is of course in the high nineties. No one ever said it was not. We don’t need Cook’s survey to tell us that.
    Cook’s paper tries to put a precise number on something everyone knows. They failed. Their number is not very precise.

  259. I think you’re talking about symmetry, Richard. Testing A on A or P on P would give reflexivity. Any flavor of such tests should be reflexive.

  260. Okay, so maybe we actually agree about something now. The consensus is somewhere in the high-nineties and Cook et al. presented a number that was more precise than their analysis merited. They could have done some corrections for biases and maybe presented some semblance of error analysis (although I think they did discuss some of this in the paper).

    Here’s where I disagree with you. You say “no one said it was not”. I think this is not correct. There are plenty claiming that it is not. That’s why a paper such as that by Cook et al. has some relevance. If people weren’t claiming that such a consensus did not exist, such a paper would be completely unnecessary.

  261. Tom, the proportion of rejection papers in the literature, as has been used by Cook’s paper, is a meaningless statistic. Think about it: Cook et al propose that high consensus means fewer and fewer people will talk about it. But the proportion of explicit acceptors hovers around 97% starting right from 1991, i.e., roughly the same proportion of papers keeps accepting.

  262. > [T]he proportion of explicit acceptors hovers around 97% starting right from 1991 […]

    And yet, this meaningless statistic disproves this narrative:

    The narrative presented by some dissenters is that the scientific consensus is ‘…on the point of collapse’ (Oddie 2012) while ‘…the number of scientific “heretics” is growing with each passing year’ (Allègre et al 2012). A systematic, comprehensive review of the literature provides quantitative evidence countering this assertion.

    http://iopscience.iop.org/1748-9326/8/2/024024/article

    Here’s the list of authors of Allègre et al 2012:

    Claude Allegre, former director of the Institute for the Study of the Earth, University of Paris; J. Scott Armstrong, cofounder of the Journal of Forecasting and the International Journal of Forecasting; Jan Breslow, head of the Laboratory of Biochemical Genetics and Metabolism, Rockefeller University; Roger Cohen, fellow, American Physical Society; Edward David, member, National Academy of Engineering and National Academy of Sciences; William Happer, professor of physics, Princeton; Michael Kelly, professor of technology, University of Cambridge, U.K.; William Kininmonth, former head of climate research at the Australian Bureau of Meteorology; Richard Lindzen, professor of atmospheric sciences, MIT; James McGrath, professor of chemistry, Virginia Technical University; Rodney Nichols, former president and CEO of the New York Academy of Sciences; Burt Rutan, aerospace engineer, designer of Voyager and SpaceShipOne; Harrison H. Schmitt, Apollo 17 astronaut and former U.S. senator; Nir Shaviv, professor of astrophysics, Hebrew University, Jerusalem; Henk Tennekes, former director, Royal Dutch Meteorological Service; Antonio Zichichi, president of the World Federation of Scientists, Geneva.

    These illustrious gentlemen endorse a narrative that is false, if we accept, like Richard does, that the literature on climate change overwhelmingly supports the hypothesis that climate change is caused by humans, and that we have little reason to doubt that this is indeed true and that the consensus is correct.

    We can wonder what role these guys are playing in Richard’s introductory remarks from his comment: are they heroes of the future history of science or swivel-eyed loons?

  263. We should also note that if the Pearson test is symmetrical the conclusion that:

    > A. The sample of paper ratings is not representative for the sample of abstract ratings.

    entails its converse, i.e. that the sample of ABSTRACT ratings is not representative for the sample of PAPER ratings, not only that the former are different than the latter.

  264. Richard Tol says:

    The literature has been overwhelmingly pro-AGW for 20 years or more. The people who I know that disagree with the consensus are well aware that they are a tiny minority.

  265. Tom Curtis says:

    Richard, the paper you purport to dissect concludes that:

    “The number of papers rejecting AGW is a miniscule proportion of the published research, with the percentage slightly decreasing over time. Among papers expressing a position on AGW, an overwhelming percentage (97.2% based on self-ratings, 97.1% based on abstract ratings) endorses the scientific consensus on AGW.”

    Clearly the key point is that only a miniscule proportion of papers challenge AGW, while an overwhelming proportion of those taking a position endorse (please note, not “are evidence of”, but “endorse”) AGW. The specific percentages mentioned are mentioned only because they happen to be the results of the study. They are not spurious claims to precision, and cannot be in that they have no error margin. Elsewhere I have claimed the state of knowledge after Cook et al on endorsement of AGW in academic papers is that it is almost certainly greater than 90% of those taking a position, and very likely greater than 95%. I do not think the authors of Cook et al would disagree with that assessment, and doubt very much they are trying to claim a greater precision than that assessment.

    Among my problems with your critique of Cook et al is that, where valid, you raise points which if properly assessed shift the percentage points around by 1 or 2%, but in a way which represents no challenge to that assessment. In each case, however, you are careful to not extend the analysis to the point where it is clear that it does not challenge the assessment. You leave it open for the swivel eyed loons to believe that you have demolished Cook et al, and that there is no consensus in the scientific literature.

    So, when I say you are making mountains out of molehills, it is because you are (repeatedly) casting facts that do not challenge the claim that “very likely greater than 95% of papers expressing an opinion endorse AGW” in such a way as to suggest that they do challenge that claim.

    I also think your reasons for doing that are closely related to your reasons for being on the academic advisory council of the GWPF, whose pseudoscience, I note, you show no inclination to criticize.

    As for the suggestion that nobody denies that there is a consensus, Cook et al clearly document the falsehood of that claim.

  266. > Cook’s paper tries to put a precise number on something everyone knows.

    Again, from the horse’s mouth:

    We analyze the evolution of the scientific consensus on anthropogenic global warming (AGW) in the peer-reviewed scientific literature, examining 11 944 climate abstracts from 1991–2011 matching the topics ‘global climate change’ or ‘global warming’. […] Our analysis indicates that the number of papers rejecting the consensus on AGW is a vanishingly small proportion of the published research.

    http://iopscience.iop.org/1748-9326/8/2/024024/article

    If we compare with previous efforts, the scale of the endeavour makes it the first of its kind. The validation with the authors themselves also provides a nice touch.

    ***

    I also tried to replicate Richard’s conclusion by:

    – loading Cook & al 2013 in my browser;
    – hitting CTRL-F;
    – entering “prec”;

    This was made once. It took less than 10 seconds. I found no hit.

    If somebody can confirm this, that would be appreciated.

    My provisory conclusion is that if Cook & al 2013 were after precision, they sure did not oversell it.

    As I see it, they simply assume that bigger is better.

  267. Tom Curtis says:

    Tol writes:

    “The literature has been overwhelmingly pro-AGW for 20 years or more. The people who I know that disagree with the consensus are well aware that they are a tiny minority.”

    Perhaps Richard can explain why it is then, that a large number of people who disagree with the consensus go around telling people that there is no consensus; and how they managed to convince the US public that only 50% of climate scientists agree with the consensus.

  268. Richard Tol says:

    Tom: My draft paper lists five reasons why people may not accept anthropogenic climate change or greenhouse gas emission reduction. (There may be more.) Four reasons are impervious to Cook’s arguments. People who worry about sloppy climate research have been reconfirmed by Cook.

  269. Reich.Eschhaus says:

    @Willard

    “I also tried to replicate Richard’s conclusion by:

    – loading Cook & al 2013 in my browser;
    – hitting CTRL-F;
    – entering “prec”;

    This was made once. It took less than 10 seconds. I found no hit.

    If somebody can confirm this, that would be appreciated.”

    Confirmed by a slightly different method. Cook et al open in a pdf viewer window, clicked with mouse in search field at the top right of the window, entered “prec”: ‘Found on 0 pages.’ Greetings, replicant.

  270. Reich.Eschhaus says:

    OK, then I guess that the discussion about what numbers were used to calculate what statistic and how is now over. Finally!

    The discussion of the interpretation rages on above.

    One could still discuss if the statistics are the right ones in both cases, but this appears futile to me (i.e., one could consider the whole sample of abstracts as the population from which the subsample of abstracts (for which there exists also a self-rating) is a sample in itself. Then one can define expected frequencies for the sample from the population. The resulting Chi square is around 18 instead of 22: still p < .001).

  271. “One could still discuss if the statistics are the right ones in both cases, but this appears futile to me (i.e., one could consider the whole sample of abstracts as the population from which the subsample of abstracts (for which there exists also a self-rating) is a sample in itself. Then one can define expected frequencies for the sample from the population. The resulting Chi square is around 18 instead of 22: still p < .001)."

    Exactly. (That's precisely what I did for my first stab at the Chi-2 test above.) Indeed, I would think that’s more appropriate since a test of equal proportions among (sub)samples is unnecessary in this case — we can compare the subsample to the sample (i.e. “population”) directly — and can lead to complications in certain situations… but that’s another story!

    Reich, I must say that I have yet to disagree with a single thing that you’ve said on this thread. You’re proving yourself to be a highly competent and intelligent individual!
    😉
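
    As a cross-check on the parenthetical point above, here is a minimal sketch of that alternative: treat the full set of abstract ratings as the population and test the doubly-rated subsample against its proportions. Again, scipy is simply my choice of tool; the counts are the ones already quoted in this thread.

    from scipy.stats import chisquare

    population = [3896, 7970, 78]   # all abstract ratings: endorse, no position, reject
    subsample = [791, 1339, 12]     # abstract ratings of the self-rated papers

    n_pop, n_sub = sum(population), sum(subsample)
    expected = [n_sub * x / n_pop for x in population]

    chi2, p = chisquare(subsample, f_exp=expected)
    print(round(chi2, 1), p)  # roughly 18, p well below 0.001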

  272. > One could still discuss if the statistics are the right ones in both cases […]

    To that effect, it might be useful to revisit the discussion at Lucia’s, starting with Carrick’s autobiographical testimony:

    I derived this stuff back in the 1980s from first principles (what any respectable physicist would do) and use generalizations of it for wind-noise studies (where log-normal-like distributions are more typically what is seen).

    http://rankexploits.com/musings/2013/richard-tol-draft-comment-on-skscook-survey-paper/#comment-115679

    Carrick also points to the assumptions mentioned by thy Wiki:

    The chi-squared test, when used with the standard approximation that a chi-squared distribution is applicable, has the following assumptions: [citation needed] Simple random sample […] Sample size […] Expected cell count […] Independence.

    We emphasize the citation needed, which might show that anti-science apologists might be invading thy Wiki.

    I have not seen any reason to trust that the data analyzed form a simple random sample. RomanM also questioned (though did not dispute) independence.

    ***

    A bit later on the same thread, we can read a comment by Richard:

    @Kenneth
    There are some threads you cannot depart.

    My tests and yours show that something is amiss. I hope that is enough to trigger the release of the data.

    I thought that Richard’s test was purported to show more than “something’s amiss”. The starting hypothesis was not “more research is needed”, i.e. something’s amiss, but that Dana’s 97% was a load of nonsense.

    Also, cf. Richard’s conclusion to the comment he submitted to ERL:

    https://docs.google.com/file/d/0Bz17rNCpfuDNMU1GWERQdV9zUkk/edit?pli=1

    It might be useful to replicate Richard’s adjectives from that conclusion, but we won’t.

  273. Willard, you miss the point. There are no meaningful consensus-related trends in the literature. The proportion of accepting papers doesn’t change much. The proportion of rejecting papers doesn’t change much. If Claude Allegre and Oddie (IDK who that is) allege a trend of some entity (increasing heretics, collapsing consensus), Cook’s data is not seen to refute that point because (a) what they measured shows nothing opposite to what was alleged and (b) what actually increases over time (i.e., shows a trend) is simply an artifact of a trend in total papers being published.

    The literature is not behaving the way Cook et al say it does.

  274. Reich.Eschhaus says:

    @Grant

    Thanks for the kind words. You are no slouch yourself! I did see the 18 you produced far, far up in the comments. But I entered the discussion wondering what the Chi squares were meant to prove (that’s why you beat me to finding the Excel cell mix-up!). However, the discussion went on and on about the numbers and got really mixed up about which numbers were used, what statistics were calculated with them, and what all of that meant. So I decided to look at the numbers as well, to try to get some order in the whole mess by first making it clear what numbers Richard used and which statistics he calculated. Now we can move on. Mvh, replicant 😉

  275. Reich.Eschhaus says:

    @Shub

    “There are no meaningful consensus-related trends in the literature. The proportion of accepting papers doesn’t change much. The proportion of rejecting-papers doesn’t change much.”

    Well, since the proportion of endorsing papers of the papers taking a position is near ceiling level, it cannot change much anymore towards higher proportions 😉

  276. Strictly speaking, I contend that even a positive trend would not refute a claim about an increasing number of scientists whom we would be tempted to characterize, were we true Dutchmen, as attention-seeking, Galileo-complexed contrarians.

    So perhaps the authors of Cook & al 2013 should clarify what they mean by:

    A systematic, comprehensive review of the literature provides quantitative evidence countering this assertion.

    What has been underlined is a narrative, not a single assertion. Even if we accept an absence of trend in Cook’s data, the theory of collapse has no merit. So Shub’s claim that “what they measured shows nothing opposite to what was alleged” is false, and infringes upon the limits of justified disingenuousness.

  277. Reich
    Impressed as you are by the ceiling-high trends in endorsements, please note that these are constituted by ‘implicit endorsers’. All other groups of accepters or rejectors will declare themselves. This group won’t. What you are seeing is merely an artifact of the classification schema chosen.

  278. Reich.Eschhaus says:

    Should have added extra 😉 ‘s! Now continue!

  279. > What you are seeing is merely an artifact of the classification schema chosen.

    This has very little to do with a statistic argued to be meaningless according to something that Cook et al were allegedly proposing.

    Let the rope-a-dope continue: where’s the replication data for that “artifact” hypothesis?

  280. Data for artifact has been published at blog. Tol has it in his paper. Think about the Cook abstract categories once again and you’ll see it.

  281. Reich.Eschhaus says:

    “People who worry about sloppy climate research have been reconfirmed by Cook.”

    One little problem: Cook et al is not climate research. It is a social science study estimating the variability of opinion in a scientific research area (yes, climate research) by the interesting method of rating abstracts. This allows a huge sample, but has its disadvantages, as is clear from the paper. (If you will, one can say it is climate research in the sense that it measures the atmospheric conditions in the climate research community. Conclusion: mostly calm weather, some clouds in a far-away corner. Somewhere someone is trying to make it rain…).

    Now, you could have written an interesting comment on Cook et al, describing the advantages and disadvantages of the method they used. Maybe give them advice on where there could be improvements. Like you, I am hoping for some more information about the mismatches between abstract ratings and self-ratings, as I already alluded to here. A paper on that would be interesting.

    Instead, all you do – IMO – is highlight minor issues and allege that the raters must somehow have rated badly (as I mentioned here). How about some more constructive criticism? How would you have done this study? #Looking for nails at low tide

    You also sidestepped Tom’s question “why it is then, that a large number of people who disagree with the consensus go around telling people that there is no consensus; and how they managed to convince the US public that only 50% of climate scientists agree with the consensus.”

  282. > Data for artifact has been published at blog. Tol has it in his paper. Think about the Cook abstract categories once again and you’ll see it.

    And the disproof of P = NP must be somewhere on the Internetz.

    Doubts about the artifactness proof were raised at Eli’s in May. Half of my criticisms of Tol’s work have already been tweeted. Thoughts about Cook’s categories were put forward a month ago at Bart’s.

  283. Reich.Eschhaus says:

    @shub

    I have tried to follow your comments here just now, but it is impossible; they are all over the place (learn to place comments inside a thread!). I have read some of your blog too. So, to allow people to understand what you are on about, please give a concise description of your issues with the Cook paper in this thread, or open a new thread for this purpose. It is really not possible to follow your argument here as it is presented now. Thanks.

  284. Wow, willard, anyone reading your responses will surely think you know what you’re talking about.

  285. Reich
    Thanks for the admonishment. Nesting comments is not easy.

    I have a handful of issues with this paper and this type of work. Many people (skeptical people) believe this paper is not worth bothering with since it was written in a pre-determined fashion. Be that as it may, I think you should look at a paper for what it’s worth and decide. There are a few questions that I’ve already raised, a few which I will, and a few which I cannot (because they won’t release the data).

    I’ve raised two major issues to date:
    [1] The paper’s search strategy: contrary to all the noise that has been made to date, Cook and co-authors do not perform any validation testing to examine the result of their search strategy. The components of the search itself are self-evidently considered the standard, i.e., they used a standard academic database, a large time period, and appropriate search terms. However, their results are significantly smaller than those obtained from other databases. This is because the authors performed no validation of any kind. Post-publication, one of the helpers has ventured to examine the representativeness of their own search. Does their search produce a ‘representative’ sample of the literature? Unknown, but 11944 is surely a large number. Does their search produce an absolutely representative result of the *search strategy* the authors assumed would self-validate the results? Surely not. What is the impact of this change? Unknown, but it is certain to inflate numbers of the ‘No position’, and implicit endorsers group and dilute the proportion of explicit endorsements.

    [2] The composition of the classified subgroups, and its effect on the conclusions of the paper: Cook et al conclude that the number of implicit acceptance papers increases with time – this is the first graph they show. Further, they conclude that consensus is increasing with time. But both are artifacts of a sharp increase in the number of papers that make no explicit statement about AGW. This group gives rise to categories ‘3’, ‘4’, and ‘5’. This large group gets split by the volunteers into roughly steady proportions of ‘3’ and ‘4’ papers, with a minuscule ‘5’, through the years. The slight decrease in the proportion of ‘implicits’ and the sharp increases in the numbers of implicits and neutrals are just consequences of the increasing number of papers. Unlike the (explicit) groups, three categories – no position, implicit endorser, and implicit rejector – are completely made up. The other papers carry an explicit position stated by the authors. The made-up categories, lacking intrinsic properties of their own for identification, simply reflect, or exhibit, bulk trends observed in the total literature.

  286. Shub, so your first point is suggesting that they didn’t do anything to check if the result of their chosen search produced a representative sample. Okay, maybe you have a point. But you finish by saying

    What is the impact of this change? Unknown, but it is certain to inflate numbers of the ‘No position’, and implicit endorsers group and dilute the proportion of explicit endorsements.

    Possibly, but we can’t know this so saying “certainly” seems a little presumptuous. Also, this issue of a representative sample is presumably quite a complex concept. I would suggest that they have certainly chosen what I would regard as a “search limited” sample. By this I mean that they have selected all papers that satisfy “global warming” or “global climate change” in a particular database. For this not to be suitable you’d need to show that a sample produced by different search terms would – for example – have a higher (or lower) fraction of endorse papers (with respect to those that take a position) than the sample they’ve used. I can see no obvious reason why an author whose papers satisfy “climate change” or “global warming” would be more likely to reject AGW – for example – than authors whose papers satisfy “global warming” or “global climate change”. It may be that the test should be done, but I’d be incredibly surprised if the results were very different.

    I must admit, I don’t really get your second point – or at least I don’t have the data at hand to check what you’re trying to say in your second point. You seem to be making a big deal out of the increase of endorsement with time. However, the Cook et al. paper says “increases marginally with time” and gives a rate of about 0.1% per year (so 1% per decade) and suggests that it is tending asymptotically to 98%. So maybe this increase is indeed due to the change in the number of papers published (I can’t really tell), but correcting for that would presumably only make a change of a percent or so.

  287. > [A]nyone reading your responses will surely think you know what you’re talking about.

    Strangely enough, I think I do. Here’s a link to my first comment on the matter:

    http://rabett.blogspot.com/2013/05/cook-et-al-preview-teeth-gnashing.html?showComment=1368814912313#c7443561847389996947

    Notice the date.

    Handwaving can sometimes be a sufficient response to handwaving. If Shub wants something else, he’ll have to go first.

  288. Reich.Eschhaus says:

    Shub

    “Many people (skeptical people) believe this paper is not worth bothering since it was written in pre-determined fashion.”

    The inclusion of this comment shows a bias (not that anything you say about the paper is untrue because of that).

    1. “Cook and co-authors do not perform any validation testing to examine the result of their search strategy.”

    so what is your issue with the search strategy? What search strategy do you propose that would include ALL relevant articles? (what counts as climate research, which database has it all, questions everywhere.)

    2. “The composition of the classified subgroups, and its effect on the conclusions of the paper”. Please explain.

    “Cook et al conclude that the number of implicit acceptance papers increases with time – this is the first graph they show.”

    First graph is endorse, reject, no position, please explain.

    “both are artifacts of a sharp increase in numbers of papers that make no explicit statement about AGW.”

    I guess you refer to your blog there. The interpretation there is different from the Cook interpretation. You need to show why they are wrong and you are right.

    The 3,4,5 stuff you need to rewrite so others can understand what you are on about.

    “completely made up”

    How?

  289. Eli Rabett says:

    Tol’s problem with Cook et al is that they did not bow down before his 122 papers. That is all.

    Basically, as with any blowhard (hi Shub), if you have the time and patience to dig down into their bleats there is nothing there. Take the consensus nonsense; as mt said

    Consensus as commonly understood is not the process by which science is decided. But consensus is the evidence that the decision has happened.

    Next

  290. wotts
    The search-related issues were examined earlier. “Global climate change” traps papers of the profile that Oreskes examined. It turns up about 3800 papers (approx). Adding “global warming” captures a lot of papers that are enriched in the impacts and mitigation category.

    The search has to return as many items as satisfy the conditions that set its own boundaries – that is all there is to it. If you’ve performed social studies research, or any large-scale literature analysis, you’d know that your literature search terms are formally defined and cross-checked. The sampling characteristics of a search are a matter separate from the pre-decided parameters of the search. For example, if you decide to restrict your search to a specific database, you do not claim that you retrieved all possible results. If searching other databases pulls almost as many excluded papers as you included, your search is not comprehensive.

    There are in excess of 9000 papers that satisfy Cook’s search criteria. Going by the trend exhibited in the paper, it is likely a majority of these papers are either neutrals or implicit acceptors.

    W.r.t. the second point: it is not I who is making a great deal about increasing consensus. Go to the Skepticalscience website and look at the graph they have on a Heartland post. See what message the graph conveys and what their post says. Observe which quantity they are selling.

  291. Reich,
    The second para was not clear. There are some mistakes. The first portion should read: “Cook et al conclude that the number of acceptance papers increases with time – this is the first graph they show”

    There is nothing to ‘interpret’. There are two kinds of papers: those that ‘state a position’ and those that don’t. This is mentioned by the authors, in the paper. Now, if you look at Fig 4 in my post, you’ll see that after 2005 there is a somewhat sharp increase in total papers per year. Next, using Cook’s data, if you divide papers into those stating a position and those that don’t, you’ll see the sharp rise to be made up almost wholly of papers that don’t state a position. This is shown in Fig 5.

    Explicit acceptance, or rejection, is easy to detect. The authors will state this directly in the abstract text. The non-explicit fraction gets classified into mainly two groups – implicit endorser and no position, i.e., ‘3’ and ‘4’ in the paper. Both these categories are ‘made up’, because, when author/s want to quantify, accept, or reject a consensus position, the author/s will state it directly, whereas these two categories arise only in the context of the classification exercise undertaken here. No one writes a paper in order to provide implicit support. ‘Implicit support’ is an inference made by the volunteer rater.

    Familiarize yourself with the paper. ‘3’, ‘4’ and ‘5’ is better shorthand for a blog discussion.

    Don’t ask me to undertake exercises to disprove something the authors never undertook to prove in the first place. Showing they failed by their own standards is enough.

    There is more on the way.

  292. Tom Curtis says:

    From Shub Niggurath’s blog post:

    “Cook and co-authors say they identify ‘strengthening consensus’, among other increasing consensus trends. The underlying data however does not support their claims. Instead, there is a remarkable stability in the overall composition of the literature. There is a steady increase in the proportion of neutral papers (called ‘No position’). In other words, no partisan category increases (or decreases) at the cost of another (Figure 3A & 3B).”

    (My emphasis)

    The bolded phrase is ambiguous. It could refer to absolute numbers, in which case it is true, but given the rapid increase in the number of papers overall, irrelevant. However, Cook et al always, when referring to the increasing consensus, refer to the percentage of endorsement papers as a percentage of papers taking a position. From context, and given that the highlighted phrase disputes Cook et al’s claim, it must also be taken as referring to the percentage.

    So understood, it is false. The percentage of papers taking a position that endorse AGW shows an OLS trend of 0.1 per annum (0.26 GLM trend), indicating that, in percentage terms, endorsements rise at the expense of rejections. Shub tries to suggest that this is due to the rising number of implicit endorsements; but the percentage of papers taking an explicit position that endorse AGW shows an OLS trend of 0.26 per annum. That is, including the implicit category reduces the trend. Including the implicit categories does increase the aggregate percentage of endorsements, but only by 0.4%.

    Clearly, despite Shub’s misdirection, the inclusion of the implicit endorsement category has no impact on the overall results of Cook et al. It was necessary for comparison to prior studies in that both Oreskes 2004 and Schulte 2008 included implicit categories. It is also, IMO, informative. It certainly is not, however, the basis on which the claim of increasing consensus is founded.

  293. Welcome. Indeed, that was the interpretation that I drew in the first post I wrote about Richard Tol.

  294. “However, Cook et al always, when referring to the increasing consensus, refer to the percentage of endorsement papers as a percentage of papers taking a position. From context, and given that the highlighted phrase disputes Cook et al’s claim, it must also be taken as referring to the percentage.”

    This amounts to saying “I don’t understand what you mean, so I’ll take it to mean what I understand”.

    The yearly percentage of papers that explicitly accept the orthodox position – categories ‘1’ and ‘2’ – decreases with time. It goes from 11% to 8%. It doesn’t really fall in a straight line. It dips down to 5% and rises. The number of explicit accepting papers increases. This is shown in Fig 1 and Fig 5, respectively in the blog.

    Why would you classify thousands of papers and then throw all those away when calculating proportions of your desired subset? Is it because, only then, you obtain figures that are favourable to your hypothesis?

    Oreskes had no implicit/explicit division and she did not quantify them.

  295. Tom Curtis says:

    Shub:

    “This amounts to saying “I don’t understand what you mean, so I’ll take it to mean what I understand”.”

    No, it amounts to assuming that you were trying to make a relevant point rather than trying to baffle people with irrelevancies. If you would prefer, however, that I assume you are trying to dissemble, it would make a lot of sense of much that you write. As, for instance, when you write in your blog that Cook et al “… take the increase in absolute numbers of orthodox position papers as evidence for ‘increasing consensus’.” Cook et al, of course, only ever refer to changes in relative percentages of papers taking a position as indicating an ‘increasing consensus’.

    “The yearly percentage of papers …”

    The percentage of all papers with an abstract rating of 1 or 2 falls as you describe, the OLS trend being -0.06. The percentage of endorsement papers rated 1 or 2, on the other hand, rises with an OLS trend of 0.18. And yes, that means the percentage of all papers with an abstract rating of 3 falls faster (OLS trend of -0.36) than does the percentage rated 1 or 2.
    Any way you cut it, the fall in the proportion of 1 or 2 rated papers relative to all papers is purely an artifact of the increase in 4 rated papers. They rise as a percentage of endorsement papers. They rise as a percentage of all papers excluding those rated 4, i.e., even if you counted implicit endorsements as rejections, the percentage of endorsements would still rise over time (though only just).

    “Why would you classify …”

    Your position, then, comes down to the equivalent of asserting that, because papers on planetary orbits almost never endorse general relativity either explicitly or implicitly, astronomers therefore do not accept General Relativity. Endorsement rating 4a papers are not counted because they provide no information, one way or the other, about the beliefs of scientists about AGW. That is the sole reason for not counting them. Even then, Cook et al adopted a conservative approach in that, arguably, nearly all impacts and mitigation papers implicitly endorse AGW from their choice of subject matter, yet most of them were rated at endorsement level 4.

    And finally, the capstone of Shub’s ignorance in this debate:

    “Oreskes had no implicit/explicit division and she did not quantify them.”

    From Oreskes 2004:

    “The 928 papers were divided into six categories: explicit endorsement of the consensus position, evaluation of impacts, mitigation proposals, methods, paleoclimate analysis, and rejection of the consensus position. Of all the papers, 75% fell into the first three categories, either explicitly or implicitly accepting the consensus view; 25% dealt with methods or paleoclimate, taking no position on current anthropogenic climate change. Remarkably, none of the papers disagreed with the consensus position.”

    (My emphasis)

  296. dana1981 says:

    Oreskes had an explicit endorsement category, and then various other categories similar to ours (methods, impacts, etc.). Some of these categories (i.e. impacts) she assumed were implicit endorsements (which isn’t a bad assumption, as Tom notes, but we chose to rate each abstract rather than making that assumption. Note that Tol made the bizarre assumption that impacts papers are ‘no opinion’). So she did have both implicit and explicit endorsements, though the implicits were based on assumptions about the type of research category.

    It strikes me that most of the criticisms of our paper are based on people essentially arguing “they should have analyzed the data in this other, totally nonsensical way.” For example, Shub arguing that the growth (or lack thereof) of the consensus should be determined based on the category percentages among total papers, as opposed to the percentages of papers taking a position on the cause of GW. Shub isn’t actually arguing the consensus isn’t growing, he’s just arguing that fewer papers are taking a position on the cause of GW in their abstracts. That by the way could very easily be an indicator of a growing consensus – scientists no longer feel it’s necessary to state something so obvious in the abstract.

    Similarly, Brandon has argued that our Category 1 should be compared not just with 7, but with 5+6+7. Implicit rejections count, but explicit endorsements don’t! That makes loads of sense.

    And of course there’s the Tol argument that our ratings are wrong because we actually read and categorized every single abstract rather than making blanket generalizations like impacts = ‘no opinion’ (which as noted above, makes no sense at all).

    Suffice it to say I’ve found these criticisms unimpressive and unconvincing.

  297. Dana,

    For what it’s worth, I think that some of Shub’s remarks about the paper’s trendology should be accepted as constructive criticisms. For instance:

    Cook and co-authors rationalize the decrease in the proportion of papers supporting the consensus, via a convoluted theory, as evidence for a high degree of consensus. They contend the decrease implies more papers have accepted the consensus and therefore don’t need to talk about it. At the same time, they take the increase in absolute numbers of orthodox position papers as evidence for ‘increasing consensus’.

    http://nigguraths.wordpress.com/2013/06/08/why-the-cook-paper-is-bunk-part-i

    My emphasis. Nevermind the contrarian packaging. Shub has his own narrative to push and his eyeballs to attract. This is a small price to pay if that provides you with ways to improve your work.

    If Cook & al needs to rationalize its trendology the way it does, there’s no amount of formal results that would compensate for the heavy assumptions on which its interpretation rests. Besides, the trends observed in the paper look quite constant to me. And they do not seem to rest on the most robust stats around. And they do not look contrarian-proof.

    So I think responding to this criticism would be a good idea here. Take both criticisms like a scientist would, and use them to construct a better paper, and with it a better world.

    ***

    If I had any advice for climate scientists, it would be:

    One does not simply graph a trendology in front of a contrarian crowd.

    Not following that rule is asking for trouble, Dana. You know why, I hope. If not, please read back the last 7 years of Steve’s.

    Don’t you know any statisticians? They could help, you know. There’s a post-doc who contributed to this very thread who might be interested in getting his name on some papers.

    ***

    I digress.

    My main point should be this: even if there’s no real trend, that does not affect your results much. The point of the paper, as I see it, is to refute the theory of consensus collapse. (Please note, to that effect, my criticism in my comment above: the results of the paper refute a narrative, not an assertion, as we can read in the paper.) This theory is shown to be false even with a flat trend.

    There is no collapse. That is all you need to show. This meme is untrue. No need to push the trendology further: drop the stick and back slowly away from the horse carcass.

  298. Tom
    “… trying to baffle people with irrelevancies”

    Yeah right. That’s why I referred to Figure 3A and 3B.

    Dana
    Imagine there were only 20 papers that took a position in all of 11944 papers you analysed. Let us say, of the 20, 19 explicitly accepted/supported the orthodox position. By your method this would still be reported as “95% endorse consensus”. It would plainly be wrong, though literally true.

    The right denominator is important in studies of this type, as any epidemiologist worth his salt will tell you. A somewhat suitable parallel is Lab Corp’s Ovasure test fiasco with researcher Gil Mor. You can always generate impressive percentage figures by arbitrarily chopping off chunks of the larger studied population from the denominator. In most instances, the flaw is harder to approach as the larger population remains unstudied (researchers examine 30 samples and want to extrapolate). Cook’s strategy is interesting because his team actually examined a fairly large sample and yet want to report percentages on smaller groups. It is the classical cherry-pick.

    You have not released data that has been requested.
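
    Just to make the denominator point concrete with numbers already in this thread, here is a trivial sketch showing how the headline percentage changes depending on whether the denominator is papers taking a position or all sampled papers. It is only an illustration of the arithmetic, not a judgement on which denominator is the right one.

    # abstract ratings for the whole sample: endorse, no position, reject
    endorse, no_position, reject = 3896, 7970, 78

    share_of_position_takers = endorse / (endorse + reject)
    share_of_all_papers = endorse / (endorse + no_position + reject)

    print(round(100 * share_of_position_takers, 1))  # roughly 98.0
    print(round(100 * share_of_all_papers, 1))       # roughly 32.6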

  299. Willard
    We cross-posted.

    You are right that I have my ‘agenda’. But examine my posts on the Cook consensus paper. I’ve kept editorializing to a minimum.

  300. Tom Curtis says:

    Willard quotes Shub saying:

    “At the same time, they take the increase in absolute numbers of orthodox position papers as evidence for ‘increasing consensus’.”

    Can you show me where in the paper that argument is made? I can only find instances where it is argued that endorsements as a percentage of abstracts expressing an opinion is rising, and that this is indicative of an increasing consensus. Examples are:

    “For both abstract ratings and authors’ self-ratings, the percentage of endorsements among papers expressing a position on AGW marginally increased over time.”

    “The time series of each level of endorsement of the consensus on AGW was analyzed in terms of the number of abstracts (figure 1(a)) and the percentage of abstracts (figure 1(b)). Over time, the no position percentage has increased (simple linear regression trend 0.87% ± 0.28% yr^−1, 95% CI, R^2 = 0.66, p < 0.001) and the percentage of papers taking a position on AGW has equally decreased.”

    “The percentage of self-rated rejection papers decreased (simple linear regression trend −0.25% ± 0.18% yr^−1, 95% CI, R^2 = 0.28, p = 0.01, figure 2(b)). The time series of self-rated no position and consensus endorsement papers both show no clear trend over time.”

    “The percentage of AGW endorsements for both self-rating and abstract-rated papers increase marginally over time (simple linear regression trends 0.10 ± 0.09% yr^−1, 95% CI, R^2 = 0.20, p = 0.04 for abstracts, 0.35 ± 0.26% yr^−1, 95% CI, R^2 = 0.26, p = 0.02 for self-ratings), with both series approaching approximately 98% endorsements in 2011.”

    “For both self-ratings and our abstract ratings, the percentage of endorsements among papers expressing a position on AGW marginally increased over time, consistent with Bray (2010) in finding a strengthening consensus.”

    (My emphasis)

    “The number of papers rejecting AGW is a miniscule proportion of the published research, with the percentage slightly decreasing over time. Among papers expressing a position on AGW, an overwhelming percentage (97.2% based on self-ratings, 97.1% based on abstract ratings) endorses the scientific consensus on AGW.”

    That is all the references to increasing or decreasing numbers I can find in the paper. The only one that even marginally echoes the argument that Shub purports was made by Cook et al is the one I have bolded. Even it only makes the argument from the increase in percentage of papers taking a position, and then only in the weak form of saying that it is consistent with the finding of another paper.

    The argument does not even feature in the popular presentation of the result by SkS authors. In the initial popular announcement of the result, there is no discussion of increase or decrease at all.

    Further, you should not make the mistake of thinking that good science can only be done through statistics. Some of the best science ever done was done without statistics. A case in point is Darwin, who in all his works included just one equation (not statistical), which he got wrong. Despite this, his observational accounts on the Beagle were brilliant; he made a major contribution to geology by, essentially, formulating the modern theory of the formation of coral atolls; introduced the first formal and reasonably comprehensive classification of barnacles, which uncovered previously unknown features of barnacle life cycles; invented ethology; and of course, discovered and established on firm evidence the theory of evolution by natural selection. All without a statistician in sight.

    Improved statistical analysis can make papers better, and in some areas it is genuinely essential. For this paper, however, the improved analysis has only shown that the trend of increasing endorsements, as a percentage of papers taking a position, was larger than reported and statistically significant (i.e., it strengthened the result), and has pointed out possible areas of weakness already identified in the paper, without being used to quantify the likely impact of those weaknesses.
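
    As an aside, here is a minimal sketch of the kind of “simple linear regression trend” quoted above (a slope in % per year with a rough 95% CI, an R^2 and a p-value). The yearly percentages are invented purely to illustrate the calculation; they are not taken from the paper, and scipy is assumed to be available.

        import numpy as np
        from scipy import stats

        # Hypothetical yearly endorsement percentages, 1991-2011 (illustration only)
        years = np.arange(1991, 2012)
        rng = np.random.default_rng(1)
        endorsement_pct = 96.0 + 0.1 * (years - 1991) + rng.normal(0, 0.5, years.size)

        res = stats.linregress(years, endorsement_pct)
        ci95 = 1.96 * res.stderr  # rough 95% half-width on the slope
        print(f"trend = {res.slope:.2f} ± {ci95:.2f} % yr^-1, "
              f"R^2 = {res.rvalue**2:.2f}, p = {res.pvalue:.3g}")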

  301. dana1981 says:

    “Imagine there were only 20 papers that took a position”
    Imagine bananas were blue. On second thought, let’s just stay grounded in reality.
    “The right denominator is important in studies of this type”
    I 100% agree. The problem is you’re quite obviously using the wrong denominator.

  302. Tom Curtis,

    Not much time to answer tonight, as the second game of the Stanley Cup finals is beginning in a minute. I still have time to offer two clarifications.

    First, I have not said that Shub had a KO argument. I said this was an argument that should be met. Clarification and correction are always welcome. Richard just made a blunder by copy-pasting his Excel file. No big deal, except for his pride.

    Second, I could not care less about stats. To say more about that part of the paper, I would have to take a better look at it during the week. What matters to me now is not that the scope of one’s claims must be balanced against the technical apparatus one uses. It’s the other way around: my point is that one does not need to push quantifiers and modalities beyond what is needed.

    In a way, my point is about Grice’s maxims:

    ***

    I don’t think it’s a sound strategy to claim “consistency with Bray (2010) in finding a strengthening consensus” when the trend is not that obvious and you need to weasel your way out with the word “consistency”. Disproving this might not matter much anyway, except for ClimateBallers, and Bray might be the only person who cares about his result being paid lip service that way.

    This strengthening is so small as to beg for a statistical counter-analysis. Worse, all this trendology to insert “is consistent with a strengthening” distracts us from what should jump out at you when looking at the graph and what matters to the paper: the constancy of the consensus. This constancy refutes a meme that justifies a paper.

    Statistical prowess is quite secondary to that end.

    ***

    None of this would prevent Shub’s concerns from being raised. That’s what Shub is dedicated to doing. This dedication may turn out to be his biggest weakness: sooner or later, his competence will get the better of him, and he will have to put forward constructive criticisms.

    I don’t even have to hide that plan. Our society is built on this. Science wins, in the end.

    Wott is very wise.

    I don’t think I’ve seen you respond to any criticism constructively. That’s not to say you haven’t done so; I just haven’t seen it. You are impeding the discussion of the paper at this stage. Surprisingly enough, it is your paper. What’s more, you guys have already accomplished what you wanted with this paper, i.e., made a big media splash.

    The representative nature of a percent figure you quote, for any quantity, is the whole reason for quoting percent figures. If you have to constantly qualify what your denominator is, you are better off not using that percent figure in the first place. For the previous example, that would be 95% of 0.1% (20/19,444). With the Cook study, it is 97% of 30%. Between two such statements, the differences are only quantitative; qualitatively, they belong in the same category of selective reporting. Nothing can be done if it doesn’t bother you.

    “The argument does not even feature in the popular presentation of the result by SkS authors.”

    Tom, take a look at the latest post on the Skeptical Science website, figure 1a of the paper, and the consensus project website and its head graphic. They all present the same thing.

    Clearly, the authors and the people making these graphs have no understanding of their underlying data.

  305. > They all present the same thing

    That sentence does not even deserve a dot.

    > Clearly, the authors and the people making these graphs have no understanding of their underlying data.

    Shub should at least have taken the time to finish his previous sentence before editorializing in this minimal manner.

  306. > I don’t think I’ve seen you respond to any criticism constructively.

    I don’t think I’ve ever seen Shub respond to anything.

  307. > I’ve kept editorializing to a minimum.

    Indeed, in this one:

    http://nigguraths.wordpress.com/2013/06/05/the-stuff-that-makes-up-the-97-of-the-consensus/

    there is no explicit editorial content, only dog whistling about a misread abstract. Paraphrasing the conversation that follows:

    [willard] This looks like an explicit endorsement to me.

    Human fossil fuel consumption dumps 90 million metric tons of carbon into the atmosphere annually. Increasing CO2 levels are linked to global warming, melting Arctic ice, rising sea levels, and climate instability

    [Shub] Which abstract is this?

    [Willard] It’s an excerpt from an abstract taken to illustrate stuff like this that makes up the ’97%’, which I read in [your] op-ed [,] taken from stuff that makes up the set of contrarian constructive criticisms.

  308. @Dana
    Of course, impact papers often express an opinion on the extent and causes of climate change.

    That opinion is irrelevant, however. According to your title, you aimed to survey the “scientific literature”.

    Somebody who is an expert in mollusks may well have an opinion on climate change, just as she may well have an opinion on the goings-on at Taksim Square and on the relative merits of apple pie and rhubarb crumble. Scientifically, however, such opinions are irrelevant. They are the opinions of a lay-person. Only her opinions on mollusks have scientific validity.

    It may of course be that you had intended to study the opinions of lay-people on the extent and causes of climate change. Then your sample is even less representative.

    Richard, there is – however – a subtlety here. You’re suggesting, I think, that simply expressing an opinion (in the abstract) about AGW doesn’t actually mean that those authors have sufficient in-depth knowledge about AGW for their opinion to be relevant. However, if authors take a position on something in their abstract, it’s not normally because it is simply their opinion; it’s normally because they are aware of other scientific evidence suggesting that that position has merit. To suggest that it is simply the opinion of a lay-person is, in my opinion, wrong. It is the opinion of a published scientist based on their interpretation of the existing scientific literature.

    The goal of the Cook et al. study was to determine the consensus in the literature, and so the fact that authors take a position on AGW in their abstracts is relevant, as that indicates the consensus in the literature. That consensus is presumably based on the results of studies that have considered the drivers of global warming and that have been largely accepted by the rest of the scientific community.

  310. Just look at the start of this abstract: “There is broad scientific consensus that Earth’s climate is warming rapidly and at an accelerating rate. Human activities, primarily the burning of fossil fuels, are very likely (>90% probability) to be the main cause of this warming.”

    Published in PEDIATRICS Vol. 120 No. 5 November 1, 2007, pp. 1149 -1152.

    If the doctor says so …

  311. A Scopus search on “Higgs boson” returns papers in business, economics, law, theology …

    I must say I did not believe it was really found until I saw confirmation in Queens Quarterly.

  312. I think you’re missing (or choosing to ignore) the point I was trying to make. As an author of a paper, if I want to make some comment or statement about something in my abstract I should base that statement on my understanding of the current scientific literature. Therefore, if you want to establish the consensus, evaluating the abstracts in the broader literature seems entirely reasonable.

    Having said that, there may well be some papers that aren’t really relevant. However, pointing to a few papers for which this may be true doesn’t necessarily indicate a particularly significant issue. Furthermore, you were very insistent about having a strict survey strategy. If Cook et al. were to leave papers out even though they satisfied their search strategy, they would need to justify such a decision and would probably then be criticised for these judgements.

    Indeed, a search for “Higgs Boson” on SCOPUS returns papers in some areas that don’t appear to be relevant. However, my SCOPUS search for “Higgs Boson” returns 5632 documents, of which 5536 are Physics and Astronomy, 805 are Mathematics, and 115 are Engineering. Yes, there are some in areas that don’t seem relevant, but the number seems insignificant (single digits). If I repeat the WoS search done by Cook et al. there are indeed some Research Areas that don’t seem relevant but, again, the numbers in these areas seem very small. Yes, they could have expanded their strategy to exclude articles in pediatrics, urology, respiratory systems,… but it appears that this would have complicated the strategy and have had virtually no effect on the result.

    That’s exactly the point, wotts. The paper’s authors start with the premise that there is a consensus in the literature. What there actually might be is a bandwagon effect, with a number of unrelated specialties mentioning ‘global warming’ in their abstracts. The search period encompassed a period of rapid growth in the adaptation and mitigation areas. These increases are all recorded as an increase in ‘consensus’ due to the loosely structured nature of the classification scheme.

  315. Alright willard, maybe it did not convey the meaning I intended.

    Tom, whose sentence I quoted is starting to see that using number figures for ‘endorse’ is likely a meaningless act. Which is why he says the authors do not highlight or use the idea in the paper or in the popular presentation. In the paper, all the material presented highlights what the authors think the data show. In their popular material, they are pushing the same numbers, even as Tom, here, is arguing they are not. Clearly, he understands this better than whoever is making the graphs. Cook, certainly, approves of them.

    I don’t understand what you’re saying or why it’s “exactly the point”. Who started with the assumption that there is a consensus: Cook et al., or the authors of the papers whose abstracts they’ve rated?

  317. Yes, they could have expanded their strategy to exclude articles in pediatrics, urology, respiratory systems,…. this would complicate the strategy.

    It would have at least been a good start. But, as you say, it would complicate things, because then you’d have to decide what types of abstracts to leave out and exclude and it would mess up an already complicated classification system. So the authors don’t do anything much (except for weeding out non-climate related abstracts). So they end up including and classifying abstracts which are related to climate, mention the key words, and are by authors who are in no position to evaluate the “GW is anthropogenic” orthodox position. These papers look like ‘endorsements’ because that is what the authors are looking for.

  318. I’m not really sure how to interpret what you say above. You say

    These papers look like ‘endorsements’ because that is what the authors are looking for.

    That seems very much like an accusation for which you have no evidence, which is ironic given that a great deal of this debate revolves around the rigour of the analysis in Cook et al. rather than the results they obtain.

  319. > Tom, whose sentence I quoted is starting to see that using number figures for ‘endorse’ is likely a meaningless act.

    In that sentence, Tom was speaking of a trend, not a quantity.

    Sentences make more sense when we don’t use them to inject irrelevant arguments.

  320. chris says:

    As several commenters have shown, including wottsupwiththatblog in blue just above (re the Higgs boson papers), the number of these apparently “off subject” papers is small. I guess if you’re really interested in the effects of eliminating these, you could do so yourself.

    However, why should one remove, for example, the paper in Pediatrics? I disagree with your assertion that this is an abstract “…by authors who are in no position to evaluate the “GW is anthropogenic” orthodox position”. This is a paper by a group of experienced scientists who have addressed potential impacts of global warming on children. Inspection of the literature they cite indicates that they have sourced authoritative scientific information (IPCC, NCDC, EPA summary reports and relevant scientific literature) and, as practicing and publishing scientists, they should be well qualified to assess scientific information. No doubt if (in their investigations) they had encountered contrary data and interpretations that cast doubt on the consensus position, then they wouldn’t have made the statements in the abstracts that they did make.

    Surely the extent to which the scientific consensus on the underlying anthropogenic origin of 20th century and contemporary global warming is “tested” in relation to sciences (like pediatrics) that are not directly involved in the physics of global warming, but which have important consequential relationships with the subject, gives us valuable insight into the strength of the scientific consensus and its evidence base.

  321. Chris, thanks for the comment and I agree. I was trying to say something similar in an earlier comment. A “statement” in an abstract is not typically simply the opinion of the author, it is typically based on their assessment of the literature that addresses that particular issue. As you say, there is no reason to suspect that someone publishing a paper in a pediatrics journal is unable to assess the literature relating to AGW.

  322. Tom Curtis says:

    Richard Tol (June 16, 8:58 AM) returns to discussion of an abstract he has previously, and incorrectly, suggested was “identical to another abstract” based solely on the fact that he did not distinguish between the “policy statement” from which he now quotes, and the associated “technical report”, which he does not quote. The two articles have the same authors. They are the only two articles from Pediatrics within the Cook et al database.

    The technical report starts with an eight paragraph summary of the evidence for global warming and the anthropogenic cause of that warming. The preparation of that eight paragraph summary means that the authors, while not expert, are informed on the topic. Not only that, but they clearly took the effort to be guided by relevant experts in the field. Their endorsement of AGW is not evidence of AGW, but it is an informed and considered endorsement and therefore deserves to be included in the database.

    Tol disagrees. He ignores, in doing so, that the raters did not have access to journal name, author’s names or date of publication. They, therefore, cannot decide the rating on the presumed expertise (or lack of it) of the authors. Indeed, for all they knew, the authors of the technical report included invited climatologists to ensure relevant expertise.

    On the opposite side of the coin, there is a paper rated “explicitly rejects but does not quantify” (6) in the database, whose abstract reads as follows:

    “Computer models predict that clear signs of the greenhouse effect should have appeared as a consequence of increases in greenhouse gases, equivalent to a 50% increase in carbon dioxide in the last 100 years. The predictions are contradicted by the climate record in nearly every important respect. Contrary to the models:

    (1) the Northern Hemisphere has not warmed more than the Southern Hemisphere,
    (2) high latitudes have not warmed more than low latitudes, and
    (3) the U.S. has not shown the predicted warming trend, although this is the largest area in the world for which well-distributed, reliable records are available.

    Finally, all of the computations of the greenhouse effect show an accelerating increase in temperature in the 1980s, reflecting the rapid increase in greenhouse gases in recent years. However, measurements from orbiting satellites with a precision of 0.01 °C show no trend to higher temperatures in the 1980s.”

    The authors, as it happens, were physicists without a significant background in climate studies. Therefore they are, by Tol’s criterion, mere laity and the rejection should not have been recorded. Their paper, like that from Pediatrics, involves a survey of the evidence by non-experts, and in this case non-experts with a clear and known political agenda.

    The simple fact is that title and abstract do not tell you these details about authorship, and hence assumptions about author expertise cannot form the basis for rejection or inclusion of abstracts in the database.

  323. Marco says:

    WOWTB and Richard Tol: one wonders why strawmen are attacked here. Cook et al did NOT set out to measure how many had shown GW to be A vs N. This is immediately clear to anyone reading the introduction (or the abstract, for that matter). They set out to determine how many assume GW to be A vs N. Why? Because the general public seems to think it is more like a 50-50 than the actual 97%. Because several people claim this consensus is falling apart.

    That MDs decide to follow the consensus is not anything that weakens the consensus, but rather reinforces the consensus: even scientists outside the field have noticed the consensus and align themselves with it in their scientific research. You can also turn it around: if scientists outside the field do not align themselves with a believed consensus, it suggests there is no such consensus (or one that is very recent).

    At no point does this say anything about the *validity* of the consensus position, but Cook et al make no claims about that.

  324. Tom Curtis says:

    Shub says:

    “That’s exactly the point, wotts. The paper’s authors start with the premise that there is a consensus in the literature.”

    This illustrates perfectly what is wrong with Tol and Shub’s argument. Shub merely assumes that the authors of the abstract quoted by Tol start with the premise of consensus. He is, however, wrong. They started with a survey of evidence. (See my post currently in moderation.) The survey of evidence was not comprehensive, and relied on guidance by bodies with relevant expertise – but it was there. Tol and Shub make incorrect assumptions about the evidentiary basis of the statements of endorsement because they assume they know something about the authors; but abstract raters were not entitled to any such assumption.

  325. > [I]mpact papers often express an opinion on the extent and causes of climate change.

    To express an opinion can mean lots of things. For the sake of Cook et al’s study, the relevant meaning should be to endorse. In his formal comment submitted to ERL, Richard omits to mention the fact that impact papers often endorse the consensus on AGW:

    The majority of the selected papers are not on climate change per se, but rather on its impacts or on climate policy. The causes for climate change are irrelevant for its impact. Therefore, impact papers should be rated as normal (if included).

    Richard’s argument could amount to saying that the opinion of an author is relevant if and only if their research subject is logically entailed by AGW. That argument sounds a bit far-fetched: it is quite possible for researchers on impact or even policy to have an informed opinion on AGW. As Wott says, we have no reason to expect that researchers express personal, uninformed opinions in their abstracts.

    Richard might need to look up a dictionary to see various meanings of the verb to endorse.

    ***

    > That opinion is irrelevant, however.

    However, in Richard’s drive-by, that claim is argued by assertion. Perhaps there’s an implicit argument behind it: portraying endorsements as “opinions” can hint at the fact that the authors of impact studies or policy scenarios are no valid authorities on AGW. If that’s the case, the argument is different than the one made in Richard’s formal comment to ERL.

    The argument from authority, even if true (I don’t think so), is of little relevance in our case, as opinions do not need to be authoritative claims to count as endorsements.

    ***

    My comments #57-61 are relevant to this discussion, starting here:

    Richard is still unresponsive to these tweets. In fact, Richard is unresponsive to lots of things, the latest being Wott’s comment above.

    Too bad there’s no Excel files to replicate bad arguments.

  326. > Tom, whose sentence I quoted is starting to see that using number figures for ‘endorse’ is likely a meaningless act.

    Tom was talking about trends, not quantities. Shub’s points do not relate to trends, but quantities. Therefore, it might be useful to Shub not to mention what Tom was talking about.

    Not only are Shub’s points irrelevant to what Tom was saying in the sentence Shub commented on, but his “meaningless act” claim channels his inner Chewbacca:

    http://neverendingaudit.tumblr.com/tagged/chewbacca

    The easiest way to conclude that something makes no sense is to refuse to read it.

  327. Brilliant arguments. I’m fully convinced now.
    Next time my children fall ill, I’ll consult a meteorologist.

    Well done. Once again you’ve managed to effectively misrepresent the arguments that others were making – unless you were being serious, that is 🙂

  329. > Next time my children fall ill, I’ll consult a meteorologist.

    An econometrist might be an even better choice.

  330. Guess who wrote: “Unless the speaker is an expert in the field, their opinions should be given no more weight than any other uninformed observer. Would you ask your allergist about the heart surgery your cardiologist recommends?”

    I don’t know who wrote that. Maybe you could tell me. However, the point that others (and I) were trying to make above was that a statement or comment in the abstract of a paper is not simply the author’s opinion, so whether or not someone’s opinion should be given any weight is not really relevant. I would explain further, but maybe you could just try reading the above comments again.

  332. Jonathan Koomey wrote that. John Cook wrote a positive review, highlighting the above phrase.

    Not that this matters one bit for what has been replicated so far by Wott, chris, and many more commenters here and elsewhere, but I see no reason not to provide the data so we can replicate Richard’s Gedankenexperiment:

    The first few chapters outline the science in a user-friendly, readable manner, explaining what’s happening to our climate and the reasons why we need serious, significant action and fast. My ears did prick up when I reached Chapter 7: Talking to Skeptics. You’ll have to indulge me if I excerpt the section titled “There’s an app for that”:

    The single most important web site for addressing claims of climate deniers is Skeptical Science. It lists every argument made by the deniers and summarizes what the peer-reviewed scientific literature says about the topic. In fact, Skeptical Science even has apps for iOS, Android, and Nokia phones, so you can access it while on the move. This is especially handy when you’re at a party and someone makes an incorrect statement about climate —you can then quickly find the exact issue and show why their concern is unfounded. […]

    Koomey also suggests a higher-level approach that he’s found particularly effective:

    If you hear someone using such talking points, try asking the speaker these questions: “Do you feel qualified to judge the current findings of the science on combustion, or gravity, or quantum physics? No? Why then do you opine on a topic that is equally complex but upon which you have no more mastery? Why do you think your judgment on these complex issues is the equal of that of people who have studied the topic for decades?”

    Typically the speaker will reply with some statement of authority, like “I’ve studied engineering for years”, or “I’m a weather forecaster”, or “my uncle Joe the physicist said so”. Such responses are beside the point. Unless the speaker is an expert in the field, their opinions should be given no more weight than any other uninformed observer. Would you ask your allergist about the heart surgery your cardiologist recommends?

    This book is written for entrepreneurs who are looking for business opportunities in climate change, with Koomey acting in the role of the scientific advisor to a start-up company.

    http://www.skepticalscience.com/Book-review-Cold-Cash-Cool-Climate-Jonathan-Koomey.html

    Our emphasis.

    It would be interesting to see how Koomey’s high-level approach, that uses the SkS app to counter your uncle’s libertarian talking points, would fare when reading scientific abstracts to see if they simply endorse AGW. Perhaps Richard could ask Koomey to rate his own papers.

    Wait. Does that mean Richard read that John Cook endorsed that high-level approach in his review? Now, that’s interesting.

  334. Tom
    “[Shub] merely assumes that the authors of the abstract quoted by Tol start with the premise of consensus.”

    It is nothing as crude as that. Since the rating system looks for the presence or absence of consensus, any property of the data or the method that gives rise to consensus ratings will result in a consensus result, even if it doesn’t actually reflect the underlying data.

    I just posted data to support this: http://nigguraths.wordpress.com/2013/06/16/why-the-cook-paper-is-bunk-part-ii/

  335. Shub, I’ve read your post. How did you get Figure 2? The left-hand panel is supposedly the fraction of papers with “no stated position” that got classified as “implicit endorsers”. This whole discussion has got rather convoluted, but – if my memory serves me right – only 40 papers initially rated as “no position” were later changed to “endorse”.

    Presumably, you also got the right-hand panel of Figure 2 by multiplying the number of papers (top panel of Figure 1) by 0.33. Firstly, you don’t know for certain that this is the correct fraction of error ratings. Secondly, given that the endorsement level doesn’t appear to vary much with time, surely the fact that there is a correlation between the error rate (which is simply some fixed fraction of the total number of papers) and the number rated as implicit endorsement is largely meaningless. It’s simply that they both – in a sense – depend on the same underlying distribution (the total number of papers).
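
    To illustrate that last point with a minimal sketch (all numbers invented, and this is only my reading of how the figure was constructed): if the “error” series is defined as a fixed fraction of the yearly paper count, and implicit endorsements are roughly a fixed share of that same count, the two series will correlate strongly by construction.

        import numpy as np

        rng = np.random.default_rng(0)
        years = np.arange(1991, 2012)
        papers_per_year = np.round(100 * 1.15 ** (years - 1991))           # hypothetical growth in papers
        implicit = 0.25 * papers_per_year + rng.normal(0, 5, years.size)   # roughly a fixed share of the total
        assumed_errors = 0.33 * papers_per_year                            # "errors" as a fixed fraction

        r = np.corrcoef(implicit, assumed_errors)[0, 1]
        print(round(r, 3))  # close to 1, yet it tells us nothing about the actual error rate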

  336. Tom Curtis says:

    wotts, the left panel is endorsement rating 2. The right panel is 33% of the sum of endorsement ratings 3, 4 and 5. The whole post is nonsense. He treats the high correlation between 3 and 33% of 3-5 as significant evidence that 3-rated papers are largely just the errors from the pool of 3-5 papers, disregarding the fact that the correlation between papers with an endorsement rating of 2 and 3-5 is almost as high. He treats the 33% disagreement rate in initial ratings as the error rate, whereas there will be a disagreement whenever one rating is correct and the other incorrect. It follows that the 33% figure is two times the initial error rate minus the initial error rate squared. From that it can be determined that the error rate of initial rating is approx 17.6% and the final error rate (i.e., the error rate after dispute resolution) is less than 15% and probably less than 10%. That is the error rate on the specific rating, of course, not the error rate between endorse/no position/reject, which will be smaller as some errors in ratings will be between endorsement categories or rejection categories.

    Assume (as Shub does) that nearly all errors consist of rating neutral papers as implicitly endorsing AGW. In that case, from among the 3-5 rated abstracts, the initial error rate is 28%, with a final error rate approximately equal to 21.3%. Subtracting 21.3% times the sum of rating 3-5 papers from the rating 3 papers, we are left with 582 out of an initial 2910 abstracts rated 3. Assume all abstracts rated 5 are correctly rated. Then, with these revised figures, 95.26% of papers with a position are endorsement papers. That is the figure that shows Cook’s paper to be bunk.

    Well, it would be if Shub had bothered calculating it. Instead he simply makes a few ridiculous assumptions, makes no adequate test of his hypothesis, compounds error after error, and makes no attempt to actually calculate the impact on Cook’s results. And then he thinks he is actually contributing something useful.

    Indeed, that was my interpretation too. The other issue is that the 33% is the initial disagreement, which is then moderated and a final rating assigned. Therefore, it’s not clear that one should really regard this 33% as an error, since it isn’t actually present in the final ratings.

  338. Tom Curtis says:

    I accepted Shub’s assumption that all errors consisted of rating neutral (4) papers as implicitly endorsing AGW (3). Then the initial disagreement rate consists of the rate at which abstracts are rated incorrectly by just one rater. Thus we have that 0.33 = 2E – E^2, where E is the initial error rate. The error rate after dispute resolution consists of the rate at which both raters incorrectly rated the abstract in the initial rating, plus the rate of incorrect rating at the first and second stages of dispute resolution. Hence the final error rate is approximately E(E + 0.33); with E ≈ 0.176, the final rate of errors is about 9% on this assumption.

    Of course, we know from the comparison between abstract ratings and author self-ratings that the error rate is greater than that, but that most errors consist of rating papers that in fact endorse or reject AGW as neutral.

  339. Thanks. I’ll have to give that some thought as I’m not quite getting it at this stage.

  340. Richard Tol says:

    Let 1-p be the error rate.

    If there are 2 classes only, there is a chance of p^2 that both ratings of an abstract are correct, a chance of (1-p)^2 that both are incorrect, and a chance of 2p(1-p) that the two ratings differ.

    A disagreement rate of 0.33 implies 0.33=2p(1-p) or p=20.8% or 79.2%.

    If there are 7 classes, there is a 5 in 6 chance that an abstract rated incorrectly by both raters will receive two different ratings.

    A disagreement rate of 0.33 implies 0.33=2(1-p) + 5/6(1-p)^2 or p=81.5%.

    The error rate in Cook et al is therefore 18.5%.
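
    A minimal numerical check of this algebra, assuming two independent ratings per abstract, each correct with probability p, and (for the 7-class case) errors falling uniformly on the wrong classes so that the disagreement probability is 2p(1-p) + (5/6)(1-p)^2, which reproduces the quoted 81.5%:

        import numpy as np

        def disagree_2_classes(p):
            # a disagreement needs exactly one of the two ratings to be wrong
            return 2 * p * (1 - p)

        def disagree_7_classes(p):
            # when both ratings are wrong, 5/6 of the time they still differ
            return 2 * p * (1 - p) + (5.0 / 6.0) * (1 - p) ** 2

        grid = np.linspace(0.5, 1.0, 200001)
        for label, f in (("2 classes", disagree_2_classes),
                         ("7 classes", disagree_7_classes)):
            p = grid[np.argmin(np.abs(f(grid) - 0.33))]
            print(f"{label}: p = {p:.3f}, error rate = {1 - p:.3f}")
        # -> roughly p = 0.792 (error 20.8%) and p = 0.815 (error 18.5%)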

    Okay, that kind of makes sense. However – as I said to Tom – I’ll have to give it some more thought. But what you’ve presented is presumably the error rate after the first rating. It doesn’t really tell you what the rate is after the second rating and after moderation. As Tom mentioned, one could use the author ratings to quantify the final error in the abstract ratings, but even this is not necessarily appropriate. It’s not clear whether the difference between the volunteer ratings and the author ratings is because the volunteers rated the abstracts poorly or because the abstracts did not have as clear a position on AGW as the paper – as a whole – did. That could – I guess – still indicate an error, in the sense that abstracts are not ideal indicators of the position of a paper with respect to AGW. However, for that to be significant there would need to be some indication that this difference (between the abstract and the paper) depends on whether or not the paper endorses or rejects AGW.

  342. Richard Tol says:

    We know that (1-p)^2/6 = 0.6% were incorrectly but identically rated.
    We know that 16% were re-rated. Assuming the same error rate, this makes 3.0%.
    We know that 17% were reconciled. Assuming that half were reconciled to the true rate, this makes 8.5%.
    The error rate in the reported data is 12%.
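
    Reading those steps literally (my own interpretation, with p = 0.815 carried over from the 7-class calculation above), the arithmetic can be reproduced as follows:

        p = 0.815                            # assumed single-rating accuracy
        both_wrong_same = (1 - p) ** 2 / 6   # incorrectly but identically rated
        re_rated = 0.16 * (1 - p)            # 16% re-rated, same error rate assumed
        reconciled = 0.17 * 0.5              # 17% reconciled, half to the true rating

        total = both_wrong_same + re_rated + reconciled
        print([round(100 * x, 1) for x in (both_wrong_same, re_rated, reconciled, total)])
        # -> [0.6, 3.0, 8.5, 12.0]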

    Where does your figure of (1-p)^2/6 = 0.6% incorrectly but identically rated come from? Is that through comparison with the author ratings?

    If so, let me make a comment about this. I think we need to define what we mean by an error here. The goal was to rate abstracts and see if this could provide an indicator of the consensus in the literature. If we, at this stage, ignore the fact that authors provided their own ratings of some of the papers, what does an error in the ratings mean? Does it imply that there is some correct rating for each abstract? I don’t quite see how this is actually possible. Hence, prior to our knowledge of the authors’ ratings, I don’t see how we can actually define an error in the final ratings of the abstracts.

    Now, we can include our knowledge of the author ratings. We now know that a substantial fraction of the abstracts were rated differently to the rating given by the author to the paper as a whole. What does this indicate? Does it indicate that the volunteers incorrectly rated the abstracts (i.e., the information about AGW in the abstract was the same as that in the paper, but the volunteers simply misinterpreted that abstract and gave it an incorrect rating), or does it simply mean that the position wrt AGW taken by a paper isn’t necessarily reflected in the abstract? Now, I don’t know the answer to this question, but it’s not clear that it is relevant.

    The goal of the Cook et al. study was to use abstracts to quantify the consensus in the literature, it wasn’t to see if you could use an abstract to determine the position of an individual paper wrt AGW. It appears, given the information available, that the abstracts alone give a very similar level of consensus to that achieved by looking at the ratings given by the paper authors. Hence, one could conclude that rating abstracts is a reasonable way in which to determine the level of consensus in the literature, but is not a good way to determine the position (wrt AGW) of an individual paper.

  344. Tom
    If you slow down a moment, you wouldn’t have written some of the above.

    The rating exercise produced two ratings for each abstract from two different people. From the paper: “Each abstract was categorized by two independent, anonymized raters”. Only discrepancies between two ratings in the initial rating exercise are meaningful as errors because the further reconciliation steps are carried out by different methods. The first rating exercise is the only reflection of raters applying the pre-determined endorsement/rejection system to the abstracts. Whatever you do after that, you are losing information.

    Secondly, since the error likely maximally affects the large abstracts with no position category, any error ‘resolution’ you might carry out on this group would only consist of moving abstracts from ‘implicits’ to ‘neutrals’ or vice versa. The end result is just as arbitrary. It would just mean picking one of the two ratings and going with it. Unlike other wrong error pairs, there is no gold standard (i.e., information in the abstract text) to fall back on, in errors between ‘3’ and ‘4’. This is by definition. In other words, the error resolution mechanism does not work the same way across all categories. It is meaningless when it comes to categories ‘3’ and ‘4’.

    It follows that attempting to use error figures to modify the ‘97%’ is also meaningless. It would be circular.

  345. Tol,
    “A disagreement rate of 0.33 implies 0.33=2p(1-p) or p=20.8% or 79.2%.

    If there are 7 classes, there is a 5 in 6 chance that 2 incorrectly rated abstract will be rated differently.”

    There are 7 classes. But several error pairs are unlikely compared to others. For instance, pairs 1-3,4,5,6,7 are unlikely as errors. 5-6 is likely, but 5-7 is not. Resolution is possible in all error pairs except in 3-4 and 4-3, where argument can be made for one just as easily as the other. This implies ‘3’ is nothing but a lump that can be shelled out arbitrarily. If you tell raters that a ‘3’ exists, they will pull it out. If you’d not told them that ‘3’ exists, there would be no ‘3’.

    This is not true of other categories, including ‘4’. ‘1’ exists, or ‘1+2’ exists, regardless of any classification scheme. ‘1’ and ‘2’ can be collapsed, ‘6’ and ‘7’ can be collapsed maybe even with some ‘5’, but they would be within their own broad endorsements. In the case of ‘3’, it crosses boundaries. It is taken out of a group of abstracts with no stated position and put into ‘endorsers’. The impact of ‘3’ on endorsers is numerically much higher than ‘5’ on its cousins, ‘6’ and ‘7’ because the latter are a ratty little group anyway.

  346. “I don’t see how we can actually define an error in the final ratings of the abstracts.”

    I would agree.

    If there is such a thing as a true rating (gold standard), against which Cook’s final ratings can be compared, that would be the absolute error. If a gold standard were to exist, there would be at least four sets of ratings to contend with: Rating1, Rating2, Final, GoldStandard. (There are more but they don’t concern us here).

    Of these, Final vs GoldStandard can tell us the absolute number of abstracts that were wrongly classified by Cook’s team.

    Only Rating1 vs Rating2 will provide information on the error/discrepancy rate *associated with the classification exercise* itself.

    We are in a situation where no gold standard exists (i.e., some system which Cook and everyone else agree upon). We don’t have Rating1 and Rating2, but we have the total error between the two.

    Yes, I agree, although we do have two possible ways of defining the errors. One is based on the initial two ratings; the other is through a comparison with the author ratings. These aren’t the same, and so we should, in my view, be careful about confusing the two. Also, it’s not clear to me what an error estimate based on the discrepancy in the initial two ratings actually tells us about the final ratings. I appreciate that you have responded to Tom with an explanation of how these errors might be interpreted, but it does seem to be based on a bunch of assumptions that you’ve made and – as such – could be questioned.

  348. Tom Curtis says:

    Richard, thanks for correcting my mathematical error (which I attribute to disorientation due to having a cancer cut out of my face this afternoon). I do not understand the basis for your second formula (2(1-p) + 5/6(1-p)^2), which yields values greater than 1 for some values of p between 0 and 1. I assume, therefore, that it is an approximation. Do you have a more exact formula? Also (and to my surprise) I find myself agreeing with Shub that it is unlikely that error rates between differing ratings are identical, which would (I believe) make your second formula dubious.

  349. > Resolution is possible in all error pairs except in 3-4 and 4-3, where argument can be made for one just as easily as the other.

    Let’s hope for Shub that this argument will be made: without this argument, his FUD about the artificiality of 3 might very well collapse.

    Also note that an error from 3 to 4 is not the same as an error from 4 to 3.

  350. [Scratch that last comment. It belongs elsewhere. WP and Twitter do not seem to like each other.]

    > [S]ince the error likely maximally affects the large abstracts with no position category, any error ‘resolution’ you might carry out on this group would only consist of moving abstracts from ‘implicits’ to ‘neutrals’ or vice versa. In other words, the error resolution mechanism does not work the same way across all categories. It is meaningless when it comes to categories ’3′ and ’4′.

    In the first sentence, the word “likely” likely introduces a speculative comment.

    The first sentence also takes for granted that Shub knows which ABSTRACTS belong to 4, when the initial problem is a bunch of ABSTRACTS that could belong to either 3, 4 or 5. Again, errors between categories are not symmetrical. There’s also no distinction between training error and systematic error.

    The second sentence is a non sequitur. Repeating it, as Shub does, indicates that this claim might very well be the FUD objective. The error mechanism has not been created for 1-2 or 6-7, but for 3-4-5 anyway.

    The third sentence might also be a non sequitur, but it’s tough to know if it’s purported to be an inference, or is just the usual editorial fall. Shub again channels his inner Chewbacca. Anything can make sense when one puts enough effort not to understand it.

    If there was a “gold standard” to classify ABSTRACTS, we’d shovel the problem to machine learning gurus.

  351. Okay, I’ve removed that last comment of yours. I have been getting some duplicate comments from you, some of which end up in my spam queue. I’ve been trying to avoid both ending up approved.

  352. > Only Rating1 vs Rating2 will provide information on the error/discrepancy rate *associated with the classification exercise* itself.

    This is true only if by “the classification exercise”, Shub refers to the classification of ABSTRACTS. If “the classification exercise” applies both to ABSTRACTS and PAPERS, I see no reason why looking at the classification by the authors themselves would not help. See for instance how Richard classified his own PAPERS:

    http://bybrisbanewaters.blogspot.ca/2013/05/tols-gaffe.html

    Inter-rater reliability applies to every rater.

    Authors are no gold standards.

  353. Thanks, Wott. Sorry about the trouble. Twitter now seems to have returned. Last week, I was getting a systematic “your login has expired”.

    Let’s also issue this erratum:

    > Anything can make no sense when one puts enough effort not to understand it.

    I can’t say I’m not having problems with negation.

  354. Indeed, that’s the problem as far as I can see. However you determine the error, it is based on some assumption about some kind of gold standard which doesn’t really exist.

    It appears that Richard’s paper has been desk-rejected by ERL so maybe all of this discussion is – for the moment at least – moot.

  355. dana1981 says:

    If we take this argument to its logical conclusion, we should only include detection & attribution papers. For example, Richard Tol is certainly not an expert in the causes of global warming, so we should exclude his papers as well. I’ve never seen a D&A paper attribute less than 50% (or less than 100% for that matter) of GW over the past 50 years to anthropogenic effects, so now the consensus has likely risen from 97% to 100%.

  356. Yes, given that Richard is an economist, I did wonder how he was going to make this interpretation consistent with his earlier criticism that the Cook et al. search ignored most of his papers.

  357. @Dana
    Had you studied your subject, you would have known that I published 4 detection and attribution papers.

  358. @shub
    Agreed.

    I lazily assume that errors are random.

  359. Reich.Eschhaus says:

    @Shub

    I invited you to start a new thread here, and my last comment to you came when it was quite late for me and my concentration was not that good anymore. Unexpectedly, I then had to leave town for several days and couldn’t react. However, I see that you have got quite some discussion out of your position, so that is good, I guess! Not sure if I have anything to add to what has already been said, but maybe I will consider it tomorrow if I find the time to read all the new comments here (if the thread is not dead by then).

  360. Pingback: Watt about the 97% consensus, again? | Wotts Up With That Blog

  361. Pingback: Richard Tol and the 97% consensus – again! | And Then There's Physics

  362. Pingback: Richard Tol's 97% Scientific Consensus Gremlins » Real Sceptic

  363. Pingback: Bravo, Richard Tol, Bravo! | And Then There's Physics

  364. Pingback: Same ol’ same ol’ | …and Then There's Physics

  365. Pingback: More nonsense – sorry, nonsensus – from Richard Tol | …and Then There's Physics

  366. Pingback: Scientists Respond To Tol's Misrepresentation Of Their Consensus Research - Real Skeptic

  367. Pingback: Devastating Reply To Richard Tol's Nonsensus In Peer-Reviewed Journal - Real Skeptic

  368. Pingback: It’s settled: 90–100% of climate experts agree on human-caused global warming | Dana Nuccitelli – Enjeux énergies et environnement

  369. Pingback: It’s settled: 90–100% of climate experts agree on human-caused global warming | Dana Nuccitelli | My Blog

  370. Pingback: Consensus – Blue Ridge Leader

  371. Pingback: Polar Bears – a rebuttal | …and Then There's Physics

  372. Pingback: Facebook video spreads climate denial misinformation to 5 million users | Dana Nuccitelli | Environment - 7newsplus.com | Latest World News

  373. Pingback: Facebook video spreads climate denial misinformation to 5 million users | Dana Nuccitelli | Environment | Daily Green World
