Evaluation Failures: 22 Tales of Mistakes Made and Lessons Learned

Edited by: Kylie Hutchinson – Community Solutions, Vancouver, Canada. 2018 Published by Sage. https://us.sagepub.com/en-us/nam/evaluation-failures/book260109

But $30 for 184-page paperback is going to limit its appeal! The electronic version is similarly expensive, more like the cost of a hardback. Fortunately, two example chapters (1 and 8) are available as free pdfs, see below. Reading those two chapters makes me think the rest of the book would also be well worthwhile reading. It is not ofter you see anything written at length about evaluation failures. Perhaps we should set up an online-confessional, where we can line up to anonymously confess our un/professional sins. I will certainly be one of those needing to join such a queue! :)

Chapter 2. The Scope Creep Train Wreck: How Responsive Evaluation Can Go Off the Rails
Chapter 3. The Buffalo Jump: Lessons After the Fall
Chapter 4. Evaluator Self-Evaluation: When Self-Flagellation Is Not Enough
Chapter 5. That Alien Feeling: Engaging All Stakeholders in the Universe
Chapter 6. Seeds of Failure: How the Evaluation of a West African
Chapter 7. I Didn’t Know I Would Be a Tightrope Walker Someday: Balancing Evaluator Responsiveness and Independence
Chapter 9. Stars in Our Eyes: What Happens When Things Are Too Good to Be True
Chapter 10. A “Failed” Logic Model: How I Learned to Connect With All Stakeholders
Chapter 11. Lost Without You: A Lesson in System Mapping and Engaging Stakeholders
Chapter 12. You Got to Know When to Hold ’Em: An Evaluation That Went From Bad to Worse
Chapter 13. The Evaluation From Hell: When Evaluators and Clients Don’t Quite Fit
Chapter 14. The Best Laid Plans of Mice and Evaluators: Dealing With Data Collection Surprises in the Field
Chapter 15. Are You My Amigo, or My Chero? The Importance of Cultural Competence in Data Collection and Evaluation
Chapter 16. OMG, Why Can’t We Get the Data? A Lesson in Managing Evaluation Expectations
Chapter 17. No, Actually, This Project Has to Stop Now: Learning When to Pull the Plug
Chapter 18. Missing in Action: How Assumptions, Language, History, and Soft Skills Influenced a Cross-Cultural Participatory Evaluation
Chapter 19. “This Is Highly Illogical”: How a Spock Evaluator Learns That Context and Mixed Methods Are Everything
Chapter 20. The Ripple That Became a Splash: The Importance of Context and Why I Now Do Data Parties
Chapter 21. The Voldemort Evaluation: How I Learned to Survive Organizational Dysfunction, Confusion, and Distrust
Chapter 22. The Only Way Out Is Through




Free Coursera online course: Qualitative Comparative Analysis (QCA)

Highly recommended! A well organised and very clear and systematic exposition. Available at: https://www.coursera.org/learn/qualitative-comparative-analysis

About this Course

Welcome to this massive open online course (MOOC) about Qualitative Comparative Analysis (QCA). Please read the points below before you start the course. This will help you prepare well for the course and attend it properly. It will also help you determine if the course offers the knowledge and skills you are looking for.

What can you do with QCA?

  • QCA is a comparative method that is mainly used in the social sciences for the assessment of cause-effect relations (i.e. causation).
  • QCA is relevant for researchers who normally work with qualitative methods and are looking for a more systematic way of comparing and assessing cases.
  • QCA is also useful for quantitative researchers who like to assess alternative (more complex) aspects of causation, such as how factors work together in producing an effect.
  • QCA can be used for the analysis of cases on all levels: macro (e.g. countries), meso (e.g. organizations) and micro (e.g. individuals).
  • QCA is mostly used for research of small- and medium-sized samples and populations (10-100 cases), but it can also be used for larger groups. Ideally, the number of cases is at least 10.
  • QCA cannot be used if you are doing an in-depth study of one case

What will you learn in this course?

  • The course is designed for people who have no or little experience with QCA.
  • After the course you will understand the methodological foundations of QCA.
  • After the course you will know how to conduct a basic QCA study by yourself.

How is this course organized?

  • The MOOC takes five weeks. The specific learning objectives and activities per week are mentioned in appendix A of the course guide. Please find the course guide under Resources in the main menu.
  • The learning objectives with regard to understanding the foundations of QCA and practically conducting a QCA study are pursued throughout the course. However, week 1 focuses more on the general analytic foundations, and weeks 2 to 5 are more about the practical aspects of a QCA study.
  • The activities of the course include watching the videos, consulting supplementary material where necessary, and doing assignments. The activities should be done in that order: first watch the videos; then consult supplementary material (if desired) for more details and examples; then do the assignments. • There are 10 assignments. Appendix A in the course guide states the estimated time needed to make the assignments and how the assignments are graded. Only assignments 1 to 6 and 8 are mandatory. These 7 mandatory assignments must be completed successfully to pass the course. • Making the assignments successfully is one condition for receiving a course certificate. Further information about receiving a course certificate can be found here: https://learner.coursera.help/hc/en-us/articles/209819053-Get-a-Course-Certificate

About the supplementary material

  • The course can be followed by watching the videos. It is not absolutely necessary yet recommended to study the supplementary reading material (as mentioned in the course guide) for further details and examples. Further, because some of the covered topics are quite technical (particularly topics in weeks 3 and 4 of the course), we provide several worked examples that supplement the videos by offering more specific illustrations and explanation. These worked examples can be found under Resources in the main menu. •
  • Note that the supplementary readings are mostly not freely available. Books have to be bought or might be available in a university library; journal publications have to be ordered online or are accessible via a university license. •
  • The textbook by Schneider and Wagemann (2012) functions as the primary reference for further information on the topics that are covered in the MOOC. Appendix A in the course guide mentions which chapters in that book can be consulted for which week of the course. •
  • The publication by Schneider and Wagemann (2012) is comprehensive and detailed, and covers almost all topics discussed in the MOOC. However, for further study, appendix A in the course guide also mentions some additional supplementary literature. •
  • Please find the full list of references for all citations (mentioned in this course guide, in the MOOC, and in the assignments) in appendix B of the course guide.



Five ways to ensure that models serve society: A manifesto

Saltelli, A., Bammer, G., Bruno, I., Charters, E., Fiore, M. D., Didier, E., Espeland, W. N., Kay, J., Piano, S. L., Mayo, D., Jr, R. P., Portaluri, T., Porter, T. M., Puy, A., Rafols, I., Ravetz, J. R., Reinert, E., Sarewitz, D., Stark, P. B., … Vineis, P. (2020). Five ways to ensure that models serve society: A manifesto. Nature, 582(7813), 482–484. https://doi.org/10.1038/d41586-020-01812-9

The five ways:

    1. Mind the assumptions
      • “One way to mitigate these issues is to perform global uncertainty and sensitivity analyses. In practice, that means allowing all that is uncertain — variables, mathematical relationships and boundary conditions — to vary simultaneously as runs of the model produce its range of predictions. This often reveals that the uncertainty in predictions is substantially larger than originally asserted”
    2. Mind the hubris
      • Most modellers are aware that there is a tradeoff between the usefulness of a model and the breadth it tries to capture. But many are seduced by the idea of adding complexity in an attempt to capture reality more accurately. As modellers incorporate more phenomena, a model might fit better to the training data, but at a cost. Its predictions typically become less
    3. Mind the framing
      • “Match purpose and context. Results from models will at least partly reflect the interests, disciplinary orientations and biases of the developers. No one model can serve all purposes. accurate”
    4. Mind the consequences
      • Quantification can backfire. Excessive regard for producing numbers can push a discipline away from being roughly right towards being precisely wrong. Undiscriminating use of statistical tests can substitute for sound judgement. By helping to make risky financial products seem safe, models contributed to derailing the global economy in 2007–08 (ref. 5).”
    5. Mind the unknowns
      • Acknowledge ignorance. For most of the history of Western philosophy, self-awareness of ignorance was considered a virtue, the worthy object of intellectual pursuit”

“Ignore the five, and model predictions become Trojan horses for unstated
interests and values”

“Models’ assumptions and limitations must be appraised openly and honestly. Process and ethics matter as much as intellectual prowess”

“Mathematical models are a great way to explore questions. They are also a dangerous way to assert answers. Asking models for  certainty or consensus is more a sign of the  difficulties in making controversial decisions  than it is a solution, and can invite ritualistic use of quantification”

Evaluating the Future

A blog posting and (summarising) podcast, produced for the EU Evaluation Support Services, by Rick Davies, June 2020

The podcast is available here, on the Capacity4Dev website

The blog posting full text is here as a pdf

    • Limitations of common evaluative thinking
    • Scenario planning
    • Risk vs uncertainty
    • Additional evaluation criteria
    • Meaningful differences
    • Other information sources

Story Completion exercises: An idea worth borrowing?

Yesterday, TheoNabben, a friend and colleague of mine and an MSC trainer, sent me a link to a webpage full of information about a method called Story Completion: https://www.psych.auckland.ac.nz/en/about/story-completion.html 


Story Completion is a qualitative research method first developed in the field of psychology but subsequently taken up primarily by feminist researchers. It was originally of interest as a method of enquiring about psychological meanings particularly those that people could not or did not want to explicitly communicate. However, it was subsequently re-conceptualised as a valuable method of accessing and investigating social discourses. These two different perspectives have been described as essentialist versus social constructionist.

Story completion is a useful tool for accessing meaning-making around a particular topic of interest. It is particularly useful for exploring (dominant) assumptions about a topic. This type of research can be framed as exploring either perceptions and understandings or social/discursive constructions of a topic.

This 2019 paper by Clarke et al. provides a good overview and is my main source of comments and explanations on this page

How It Works

The researcher provides the participant with the beginning of the story, called the stem. Typically this is one sentence long but can be longer. For example…

“Catherine has decided that she needs to lose weight. Full of enthusiasm, and in order to prevent her from changing her mind, she is telling her friends in the pub about her plans and motivations.”

The participant is then asked by the researcher to extend that story, by explaining – usually in writing – what happens next. Typically this storyline is about a third person (e.g. a Catherine), not about the participant themselves.

In practice, this form of enquiry can take various forms as suggested by Figure 1 below.

Figure 1: Four different versions of a Story Completion inquiry

Analysis of responses can be done in two ways: (a) horizontally – comparisons across respondents, (B) vertically – changes over time within the narratives.

Here is a good how-to-do-it  introduction to Story Completion: http://blogs.brighton.ac.uk/sasspsychlab/2017/10/15/story-completion/ 

And here is an annotated bibliography that looks very useful: https://cdn.auckland.ac.nz/assets/psych/about/our-research/documents/Resources%20for%20qualitative%20story%20completion%20(July%202019).pdf

How it could be useful for monitoring and evaluation purposes

Story Completion exercises could be a good way of identifying different stakeholders views of the possible consequences of an intervention. Variations in the text of the story stem could allow the exploration of consequences that might vary across gender or other social differences. Variations in the respondents being interviewed would allow exploration of differences in perspective on how a specific intervention might have consequences.

Of course, these responses will need interpretation and would benefit from further questioning. Participatory processes could be designed to enable this type of follow-up. Rather than simply relying on third parties (e.g. researchers), as informed as they might be.

Variations could be developed where literacy is likely to be a problem. Voice recordings could be made instead, and small groups could be encouraged to collectively develop a response to the stem. There would seem to be plenty of room for creativity here.


There is a considerable overlap between the Story Completion method and how the ParEvo participatory scenario planning process works.

The commonality of the two methods is that they are both narrative-based. They both start with a story stem/seed designed by the researcher/Facilitator. Then the respondent/participants add an extension onto that story stem describing what happens next. Both methods are future-orientated and largely other-orientated, in other words not about the storyteller themselves. And both processes pay quite a lot of attention after the narratives are developed, to how those narratives can be analysed and compared.

Now for some key differences. With ParEvo the process of narrative development involves multiple people rather than one person. This means multiple alternative storylines can develop, some of which die out, some which continue, and some of which branch into multiple variants. The other difference, already implied, is that the ParEvo process goes through multiple iterations, where is the Story Completion process has only one iteration. So in the case of ParEvo the storylines accumulate multiple segments of text, with a new segment added with each iteration.  Content analysis can be carried out with the results of Story Completion and ParEvo exercises. But in the case of ParEvo it is also possible to analyse the structure of people’s participation and how it relates to the contents of the storylines.


Brian Castellani’s Map of the Complexity Sciences

I have limited tolerance for “complexity babble” That is, people talking about complexity in abstract and ungrounded, and in effect, practically inconsequential terms. Also in ways that give no acknowledgement to the surrounding history of ideas.

So, I really appreciate the work Brian has put into his “Map of the Complexity Sciences” produced in 2018. And thought it deserves wider circulation. Note that this is one of a number of iterations and more iterations are likely in the future. Click on the image to go to a bigger copy.

And please note: when you get taken to the bigger copy and when you click on any node a hypertext link there, this will take you to another web page providing detailed information about that concept or person. A lot of work has gone into the construction of this map, which deserves recognition.

Here is a discussion of an earlier iteration: https://www.theoryculturesociety.org/brian-castellani-on-the-complexity-sciences/

Process Tracing as a Practical Evaluation Method: Comparative Learning from Six Evaluations

By Alix Wadeson, Bernardo Monzani and Tom Aston
March 2020. pdf available here

Rick Davies comment: This is the most interesting and useful paper I have seen yet written on process tracing and its use for evaluation purposes. A good mix of methodology discussion, practical examples and useful recommendations.

The Power of Experiments: Decision Making in a Data-Driven World

By Michael Luca and Max H. Bazerman, March 2020. Published by MIT Press

How organizations—including Google, StubHub, Airbnb, and Facebook—learn from experiments in a data-driven world.


Have you logged into Facebook recently? Searched for something on Google? Chosen a movie on Netflix? If so, you’ve probably been an unwitting participant in a variety of experiments—also known as randomized controlled trials—designed to test the impact of changes to an experience or product. Once an esoteric tool for academic research, the randomized controlled trial has gone mainstream—and is becoming an important part of the managerial toolkit. In The Power of Experiments: Decision-Making in a Data Driven World, Michael Luca and Max Bazerman explore the value of experiments and the ways in which they can improve organizational decisions. Drawing on real world experiments and case studies, Luca and Bazerman show that going by gut is no longer enough—successful leaders need frameworks for moving between data and decisions. Experiments can save companies money—eBay, for example, discovered how to cut $50 million from its yearly advertising budget without losing customers. Experiments can also bring to light something previously ignored, as when Airbnb was forced to confront rampant discrimination by its hosts. The Power of Experiments introduces readers to the topic of experimentation and the managerial challenges that surround them. Looking at experiments in the tech sector and beyond, this book offers lessons and best practices for making the most of experiments.

In The Power of Experiments: Decision-Making in a Data Driven World, Michael Luca and Max Bazerman explore the value of experiments, and the ways in which they can improve organizational decisions. Drawing on real world experiments and case studies, Luca and Bazerman show that going by gut is no longer enough—successful leaders need frameworks for moving between data and decisions. Experiments can save companies money—eBay, for example, discovered how to cut $50 million from its yearly advertising budget without losing customers. Experiments can also bring to light something previously ignored, as when Airbnb was forced to confront rampant discrimination by its hosts.

The Power of Experiments introduces readers to the topic of experimentation and the managerial challenges that surround them. Looking at experiments in the tech sector and beyond, this book offers lessons and best practices for making the most of experiments.

See also a World bank blog review by David McKenzie

The impact of impact evaluation

WIDER Working Paper 2020/20. Richard Manning, Ian Goldman, and Gonzalo Hernández Licona. PDF copy available

Abstract: In 2006 the Center for Global Development’s report ‘When Will We Ever Learn? Improving lives through impact evaluation’ bemoaned the lack of rigorous impact evaluations. The authors of the present paper researched international organizations and countries including Mexico, Colombia, South Africa, Uganda, and Philippines to understand how impact evaluations and systematic reviews are being implemented and used, drawing out the emerging lessons. The number of impact evaluations has risen (to over 500 per year), as have those of systematic reviews and other synthesis products, such as evidence maps. However, impact evaluations are too often donor-driven, and not embedded in partner governments. The willingness of politicians and top policymakers to take evidence seriously is variable, even in a single country, and the use of evidence is not tracked well enough. We need to see impact evaluations within a broader spectrum of tools available to support policymakers, ranging from evidence maps, rapid evaluations, and rapid synthesis work, to formative/process evaluations and classic impact evaluations and systematic reviews.

Selected quotes

4.1 Adoption of IEs On the basis of our survey, we feel that real progress has been made since 2006 in the adoption of IEs to assess programmes and policies in LMICs. As shown above, this progress has not just been in terms of the number of IEs commissioned, but also in the topics covered, and in the development of a more flexible suite of IE products. There is also some evidence, though mainly anecdotal, 89 that the insistence of the IE community on rigour has had some effect both in levering up the quality of other forms of evaluation and in gaining wider acceptance that ‘before and after’ evaluations with no valid control group tell one very little about the real impact of interventions. In some countries, such as South Africa, Mexico, and Colombia, institutional arrangements have favoured the use of evaluations, including IEs, although more uptake is needed.

There is also perhaps a clearer understanding of where IE techniques can or cannot usefully be applied, or combined with other types of evaluation.

At the same time, some limitations are evident. In the first place, despite the application of IE techniques to new areas, the field remains dominated by medical trials and interventions in the social sectors. Second, even in the health sector, other types of evaluation still account for the bulk of total evaluations, whether by donor agencies or LMIC governments.

Third, despite the increase in willingness of a few LMICs to finance and commission their own IEs, the majority of IEs on policies and programmes in such countries are still financed and commissioned by donor agencies, albeit in some cases with the topics defined by the countries, such as in 3ie’s policy windows. In quite a few cases, the prime objectives of such IEs are domestic accountability and/or learning within the donor agency. We believe that greater local ownership of IEs is highly desirable. While there is much that could not have been achieved without donor finance and commissioning, our sense is that—as with other forms of evaluation—a more balanced pattern of finance and commissioning is needed if IEs are to become a more accepted part of national evidence systems.

Fourth, the vast majority of IEs in LMICs appear to have ‘northern’ principal investigators. Undoubtedly, quality and rigour are essential to IEs, but it is important that IEs should not be perceived as a supply-driven product of a limited number of high-level academic departments in, for the most part, Anglo-Saxon universities, sometimes mediated through specialist consultancy firms. Fortunately, ‘southern’ capacity is increasing, and some programmes have made significant investments in developing this. We take the view that this progress needs to be ramped up very considerably in the interests of sustainability, local institutional development, and contributing over time to the local culture of evidence.

Fifth, as pointed out in Section 2.1, the financing of IEs depends to a troubling extent on a small body of official agencies and foundations that regard IEs as extremely important products. Major shifts in policy by even a few such agencies could radically reduce the number of IEs being financed.

Finally, while IEs of individual interventions are numerous and often valuable to the programmes concerned, IEs that transform thinking about policies or broad approaches to key issues of development are less evident. The natural tools for such results are more often synthesis products than one-off IEs, and to these we now turn

4.2 Adoption of synthesis products (building body of evidence)

Systematic reviews and other meta-analyses depend on an adequate underpinning of well structured IEs, although methodological innovation is now using a more diverse set of sources. 91 The take-off of such products therefore followed the rise in the stock of IEs, and can be regarded as a further wave of the ‘evidence revolution’, as it has been described by Howard White (2019). Such products are increasingly necessary, as the evidence from individual IEs grows.

As with IEs, synthesis products have diversified from full systematic reviews to a more flexible suite of products. We noted examples from international agencies in Section 2.1 and to a lesser extent from countries in Section 3, but many more could be cited. In several cases, synthesis products seek to integrate evidence from quasi-experimental evaluations (e.g. J-PAL’s Policy Insights) or other high-quality research and evaluation evidence.

The need to understand what is now available and where the main gaps in knowledge exist has led in recent years to the burgeoning of evidence maps, pioneered by 3ie but now produced by a variety of institutions and countries. The example of the 500+ evaluations in Uganda cited earlier shows the range of evidence that already exists, which should be mapped and used before new evidence is sought. This should be a priority in all countries.

The popularity of evidence maps shows that there is now a real demand to ‘navigate’ the growing body of IE-based evidence in an efficient manner, as well as to understand the gaps that still exist. The innovation happening also in rapid synthesis shows the demand for synthesis products—but more synthesis is still needed in many sectors and, bearing in mind the expansion in IEs, should be increasingly possible.