Random Control Trials for charities - another false dawn for evaluation?

The RCT is often considered the gold standard for clinical trials - but can the methodology be applied successfully to impact measurement and evaluation in the charity sector?
Joe Saxton

Random Control Trials are the next big thing for impact measurement and evaluation in charities. They are portrayed as the gold standard - or so the propaganda goes. In reality, they represent a kind of evaluation fantasy: presenting the exceptional and atypical as if it were within the grasp of all charities.

Now that you know what I think, let me properly explain what RCTs are, and why the hyperbole about them is so misleading.
 
A Random Control Trial tests an action or intervention against a control group (which should receive no intervention), with participants assigned to each group at random (i.e. the users have no say in which group they end up in).
 
RCTs originate from the world of medicine and drug trials. In drug trials, they are typically double-blind random control trials, meaning neither the person taking the drug nor the person administering it knows who gets what. The control group gets a harmless placebo which looks like the drug being tested (but isn't).
 
An RCT works best when a single variable is being tested, and when that single variable can be independently measured. In drug trials, the drug itself is the single variable and the patients' symptoms provide the measurement.
 
RCTs are being implemented in the charity (and other) sectors – for example, in this 2015 support programme for ex-service personnel. Yet the method is fraught with problems and complexity (the NESTA guide to RCTs runs to 78 pages).
 

Here are five key challenges I’ve identified:

1. Nobody mentions double-blind

The best trials of an intervention are those where neither the test subject nor the person administering the test knows who gets which intervention. This is to stop either the user or the evaluator adding their own bias to the observations. I have yet to see any mention of double-blinding in the charity context. So when people talk about RCTs being the gold standard – they clearly aren't. At best RCTs are the silver standard, unless they are double-blind.

2. Is random feasible or ethical?

How can a charity create a randomised control trial ethically? What are the ethics of giving one group of beneficiaries nothing and the other group an intervention? If an intervention were based in a community, would that mean only half the community might get support? And how would they divide the group randomly?

At nfpSynergy, we did a five-year study into the impact of outdoor education on a south London school. In any one year, the vast majority of students went on three outdoor residential visits, but about 15-20 pupils each year didn't. It would be easy to see these pupils as the control group, but of course they weren't random. Those who didn't go stayed behind for financial, cultural, personality or domestic reasons - they weren't like those who went. Getting a genuinely random control group is very hard.

3. Charity interventions are rarely single variables, and often hard to measure.


In the world of drug trials, a single variable is relatively easy to achieve. In the world of charities, it is much harder, and deciding what to measure is equally difficult. Even in an education setting, testing a different teaching approach on class A against class B would require the two classes to have been chosen at random (i.e. not streamed by ability), and that the pupils in the two classes didn't talk to each other about what they were doing (as might well happen in sex or sports education, or other topics).
More complex interventions, such as those in mental health, would require both a control group (as opposed to a pre- and post-intervention study) and a method that allowed objective measurement of benefit.

Many charities rely on self-reported results. There is plenty of evidence that subjects in such studies want to get better, or to please the researcher. Even more likely is that if the people who set up and run the interventions also do the testing, another bias will creep in.

4. External researchers are usually needed.

For an RCT to be taken seriously, it probably needs to be carried out by independent researchers - not by people employed by the body carrying out the intervention. This is to ensure that those who run the interventions do not consciously or unconsciously bias the results. Even a researcher employed by the same organisation would have a strong conflict of interest if they knew their employer would get significant extra funding if the results of an RCT were positive. Indeed, even an external researcher would feel a degree of pressure to find positive results if they knew this might lead to further evaluation work or even just a happy client. Nonetheless, independent research is the best (if not perfect) way to carry out an RCT. However, it considerably increases the costs.

5. The sample size needs to be pretty big.

Imagine that you are carrying out an intervention that improves exam results from 20% of pupils getting an A* to 24%. Even if the test group and the control group each had 250 pupils in them, a difference of that size would not reach conventional statistical significance - you would need several hundred pupils per group before it did. The Warrior Programme RCT had under 30 in each group. To conduct an RCT at that scale while ensuring that the issue being tested was the only variable and the groups were truly random would be a mammoth undertaking.
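To see why, here is a minimal sketch of the standard two-proportion z-test on those numbers (assuming Python with SciPy; the 20%, 24% and 250-pupil figures are simply the ones above):

```python
# Two-proportion z-test: 20% vs 24% A* rates with 250 pupils per group.
from math import sqrt
from scipy.stats import norm

n = 250                                   # pupils in each group
p_control, p_test = 0.20, 0.24            # 50 vs 60 pupils getting an A*

pooled = (p_control * n + p_test * n) / (2 * n)
se = sqrt(pooled * (1 - pooled) * (2 / n))
z = (p_test - p_control) / se
p_value = 2 * (1 - norm.cdf(z))

print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")   # roughly z = 1.08, p = 0.28
```

A p-value of around 0.28 is nowhere near the conventional 0.05 threshold, which is why groups in the high hundreds are needed to detect an effect of this size.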

All these factors mean that carrying out a valid RCT for a charity or other non-profit is a pretty tall order. Statistical significance is another challenge. For a drug trial, it would be mandatory to show that any benefit of a drug is statistically significant, because of the cost of the drug and the risk to human health. But if a new intervention showed a benefit to participants (say, a way of teaching) which had no additional costs and no potential downside, then is statistical significance important? If all other things are equal, going with the intervention that appears to work better can be a legitimate approach, even if the results are not statistically significant.

RCTs and volunteering

The government is currently funding some RCTs on volunteering for the over-50s. The criteria for the trials demonstrate just some of the points I have been making. They want volunteering events which bring in around 1,000 people (or at least 500+), who are then split into groups of 500 to have the different interventions inflicted on them. This type of trial is the tail wagging the dog in almost every sense. Very few volunteer programmes look to recruit 500+ volunteers 'in a single day' (to quote the criteria). Most volunteer programmes work on a drip, drip, drip of volunteers. A volunteering event of this size could only be run by a large organisation (so small organisations are penalised yet again). And even if these trials all deliver, how replicable will the results be for ordinary organisations recruiting ordinary volunteers?

Fundraisers have effectively used RCTs for many years.

The irony is that fundraisers have used RCTs for years, and never bragged about it. Any direct marketer worth their salary will be continuously testing different interventions on split tests or samples of their database. Many years ago, I helped the RSPCA test whether a request for £8, £10, £12 or £15 as a donation prompt raised the most money (£8 generated the highest response rate, but around £12 the highest income). Perhaps the most bizarre test result I ever got was that a reply envelope with a window showing the reply address on the donation form generated a third more income than one with no window. Don't ask me why.

In summary, my concern about the current vogue for RCTs is threefold:

1. There are few situations where a genuine RCT will work given all the necessary criteria I have spelt out. Why create scenarios purely for the benefit of having an RCT? Why have an evaluation standard that is applicable to very few of the interventions that charities make?

2. It makes evaluation even harder and more expensive, putting it outside the price bracket of small charities. It is, in effect, yet another way of making it harder for small charities to compete in a big-charity world.

3. It may mean that good interventions don’t happen. If we set the bar for a successful intervention as being a statistically significant result in an RCT, then some successful powerful interventions will not get over that bar - and that would be the greatest tragedy of all.

Submitted by Calum (not verified) on 30 Nov 2016

This seems massively overly critical, unnecessarily dismissive of what could be a great tool for improved program delivery if widely adopted by the NGO community, and frequently factually inaccurate.

The first mistake is that RCTs are not considered the 'gold standard' in evaluation, certainly not in the medical world (where they come from) anyway. The gold standard is a systematic review and meta-analysis of randomised controlled trials. This is a process of identifying and combining the results of all RCTs ever done on a particular subject, which gives you a much more accurate result than a single RCT ever could.

Secondly, in the world of medicine non-blinded RCTs are relatively common. This is because not everything can be effectively concealed from the person giving or receiving a particular intervention. It's just not really possible, for example, to 'blind' a surgeon or their patient to whether or not someone has received or conducted an operation. Non-blinded RCTs are the appropriate go-to experiment for a new intervention when blinding is not possible. If blinding is possible then of course an RCT should do it.

Thirdly, there are lots of alternatives to simply giving one group of people nothing and the others something (point two in Mr Saxton's article). You can, for example, conduct your RCT according to a step-wedge design. This simply means that you give the intervention to different groups at different times. You compare those who have received the intervention to those who haven't yet, and that way you still have a control group. Another alternative, if you are delivering a new (hopefully better) intervention for a problem which already has workable solutions, is to compare your new idea to current best practice rather than to nothing or a placebo. This is what is done in medical trials. You couldn't very well test a new antibiotic by giving one group of patients nothing if an effective cure already existed, but you may well want to see if your new antibiotic is better than existing ones.

In the author's third point, they claim that charity interventions are often too complex to assess as there are multiple variables, and that it can be difficult to find control groups and ensure the groups do not contaminate each other - by participants sharing what they have learnt in a class, for example. Again, there are plenty of ways of designing an RCT to account for this. The entire point of random allocation to a particular treatment arm is that, given a sufficiently large sample, it's very likely that the only significant difference between people in each arm is the intervention being delivered (or the program the charity is enrolling them in). You can assess the synergistic effect of your entire program, or you can assess the separate effects of each part of your program by choosing what you do with which group. A common way of dealing with contamination between arms is to run a cluster RCT. What this means is that instead of randomising individual people to each arm, you randomise groups of people - a school, for example. Everyone in the cluster receives the same intervention, but different clusters receive different interventions (or the intervention at different times if doing a step-wedge cluster RCT). This eliminates the problem of people from different classes sharing what they have learnt with each other. There are also simple statistical tests that can be done to assess how much of an impact inter-cluster correlation (when people from different clusters share information) has had on your results.
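To make the cluster idea concrete, here is a minimal sketch of how such an allocation might be drawn (in Python; the school names and the 50/50 split are purely illustrative):

```python
# Cluster randomisation: whole schools, not individual pupils, are assigned to
# an arm, so classmates all receive the same intervention and cannot
# "contaminate" the other arm by sharing what they have learnt.
import random

schools = [f"School {chr(65 + i)}" for i in range(10)]   # School A .. School J
random.seed(42)                                          # fixed seed so the draw is reproducible
random.shuffle(schools)

allocation = {"intervention": schools[:5], "control": schools[5:]}
for arm, members in allocation.items():
    print(arm, members)
```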

The author is right that external and independent researchers would be a bonus, but this is no different to traditional M&E. It's even easier for an NGO to subconsciously or otherwise influence the results of other, less scientifically rigorous M&E techniques to suit their needs. Besides, partnering with an external organisation with research expertise need not be a negative. In my experience, it has brought new perspectives, increased manpower, and new sources of funding which more than pay for the extra cost.

The final point about statistical significance is the most incorrect and by far the most dangerous. A statistically significant result is one where you are confident that the reason for the effect you have found is something other than just chance. If you flip a coin 10 times in each hand there's a reasonable chance you'll find you get heads more often with one hand than the other, but this is far more likely than not to just be due to random variation rather than any difference in technique. If you are not able to demonstrate a statistically significant difference between the outcomes in your intervention and control group, this means that it is reasonably likely that the difference you have found is just due to chance and not due to your intervention. This is not a 'successful powerful intervention', or at least it hasn't been shown to be.
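A quick simulation makes the coin-flip point vivid (a sketch assuming Python with NumPy; the 10-flips-per-hand setup simply mirrors the example above):

```python
# With only 10 fair flips per hand, the two hands will usually show different
# head counts purely by chance - no difference in "technique" required.
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000
left = rng.binomial(10, 0.5, trials)     # heads from 10 flips with the left hand
right = rng.binomial(10, 0.5, trials)    # heads from 10 flips with the right hand

print(f"Hands disagree in {np.mean(left != right):.0%} of trials")   # roughly 82%
```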

All this being said, RCTs are only appropriate if you are doing something new, or something old in a new context. If you are delivering an intervention which already has good evidence to show it works (as is often the case for an NGO), then doing an RCT is not only unethical as you are delaying delivery of an intervention with a proven benefit for no good reason, but also a huge waste of time. In this context, observational data demonstrating you are achieving an impact in line with what the existing evidence predicts would be far more appropriate.

As a parting note, conducting an RCT need not be any more expensive than delivering your project already will be. It is just a matter of properly planning your intervention delivery and M&E data collection in a way that means you can generate the best quality data to assess your impact. Perhaps you're planning on delivering a new style of sex education in a group of 20 schools. Instead of delivering the intervention at any old time, why not randomly allocate different schools to receive the intervention at different times and plan to collect baseline and follow-up data in a way that gives you a control group? This way of thinking will greatly increase your ability to measure your impact and shouldn't cost a penny more (in fact, a stronger evidence base to support your programs will make winning grants a whole lot easier, so you should come out in a better position financially than when you went in).
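For instance, a stepped roll-out across those 20 schools can be randomised in a few lines (a sketch in Python; the term labels and the batch size of five are assumptions for illustration, not part of the comment above):

```python
# Step-wedge roll-out: every school eventually receives the intervention, but
# the start term is randomised, so schools still waiting act as the control
# group at each data-collection point.
import random

schools = [f"School {i:02d}" for i in range(1, 21)]      # the 20 schools
terms = ["Term 1", "Term 2", "Term 3", "Term 4"]         # illustrative start points

random.seed(1)
random.shuffle(schools)
rollout = {term: schools[i * 5:(i + 1) * 5] for i, term in enumerate(terms)}

for term, batch in rollout.items():
    print(term, batch)   # collect baseline and follow-up data from every school each term
```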

Submitted by Caroline Fiennes (not verified) on 5 Dec 2016

Calum and Toby do a good job here.

Most charities shouldn't do evaluations of their own work. {I'm using 'evaluation' as the National Audit Office uses it: "a serious attempt to establish causation".} They don't have the skills, the sample size, or the incentive to do so well. Rather than *producing* research, better if they *used* existing research. I've written about this quite a bit, e.g. www.giving-evidence.com/m&e.

The skills matter: using the wrong method of randomisation, for example, has been shown to swing the answer sometimes by >40%. {The source study is in the CONSORT long-form statement.}

Having worked for a year for the bit of MIT which does RCTs of anti-poverty programmes, and for four years for the org spun out of Yale which does the same, I can tell you that for 90% of the calls they get from charities wanting their programme RCT'd, the answer is that it's not right for them: the programme is too small, the cost would never outweigh the benefit, and/or the answer is already known.

So *sometimes* a charity's programme should be evaluated in an RCT, but almost never an RCT run by that charity itself.

To spell out the sample size issue: most social programmes have a small effect (this is just a feature of our universe). To distinguish the effect from random noise, the smaller the effect, the bigger the sample needed. Most charities' programmes are way too small to allow the effect to be distinguished from the noise.
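As a rough illustration (a back-of-envelope sketch assuming Python with SciPy; the 20% baseline and the three effect sizes are invented purely for illustration), the standard sample-size formula for comparing two proportions shows how quickly the numbers grow as the effect shrinks:

```python
# Approximate per-group sample size for 80% power at the two-sided 5% level.
from scipy.stats import norm

def n_per_group(p_control, p_treat, alpha=0.05, power=0.80):
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return (z_a + z_b) ** 2 * variance / (p_control - p_treat) ** 2

for lift in (0.08, 0.04, 0.02):   # 8-, 4- and 2-point gains on a 20% baseline
    print(f"+{lift:.0%} effect: about {n_per_group(0.20, 0.20 + lift):,.0f} per group")
# prints roughly 440, 1,700 and 6,500 per group respectively
```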

A tenet of clinical research from Professor Sir Richard Peto: "ask an important question and answer it reliably". Most charities aren't able to "investigate causation" reliably, so better to use existing sound research - and if that doesn't exist, then commission some proper research, but only if the benefit of answering that question reliably would outweigh the cost.

Also, on a small (but telling?) point, they're normally called randomised controlled trials.

Submitted by Sohini Bhattacharya (not verified) on 6 Dec 2016

We are running two RCTs in our project areas on gender norm change - a difficult and rare area for an RCT to evaluate. Our experience is that RCTs are expensive and need organisations to completely rethink the depth and reach of their strategies - they are also human-resource heavy. But if you are going to use the results to convince governments/donors to adopt a certain intervention over others to solve a particular problem, they are good tools to use. They show clearly the advantage of using x over y and, as Calum says, create a strong evidence base to convince anyone. The step-wedge design suggested by Calum is a way of ensuring that everyone receives the intervention at some point in time, which is good. We ourselves use a cluster method to avoid contamination.

Submitted by Ian Leggett (not verified) on 7 Dec 2016

Good article… it's time someone questioned the assumptions underpinning RCTs. Certainly for international NGOs, where we are so often addressing multiple problems that are a combination of economic, social, cultural and political forces, the simplicity of RCTs is simply pie in the sky. And that simplicity of course comes at great cost – a point you make very well.

I accept we need to get better at evaluation and learning, but the RCT tool is unlikely to be of much use.

Submitted by Toby Blume (not verified) on 7 Dec 2016

Are we assuming all RCTs are the same?
I agree with the suggestion that large-scale, expensive and complex RCTs are beyond the scope of most charities. Even if they were attainable they may well be completely disproportionate and therefore wholly inappropriate. Is that not the same as, say, longitudinal studies and a host of other research methods?
Just because a method isn't suitable for every set of circumstances doesn't mean it should be dismissed. After all, 'if you judge a fish by its ability to climb a tree it'll live its whole life believing that it's stupid.'
But not all trials are the same. The RCTs I work on are deliberately designed to be small, simple(ish), incremental changes to see how we can improve communication in order to change behaviour. They may still be beyond the scope of some charities - particularly the smallest community groups - but to suggest they are not a realistic option for most is simply not true.
The other thought is that perhaps we ought to talk about experimental methods and an organisational culture of experimentation. Even when full-blown trials are not feasible, there is often considerable scope for using experimental or quasi-experimental methods and a 'test, learn, adapt' methodology.
