1. Background

Estimating the effectiveness of alternative healthcare interventions is at the core of both clinical and economic technology assessments, including those commissioned by the National Institute for Health and Clinical Excellence (NICE) and used to make recommendations for the NHS of England and Wales.

It is generally accepted that data from well conducted randomized, controlled trials (RCTs) are the most reliable for informing such assessments,[1] and identification of such evidence should be achieved through the use of transparent systematic review methods.[2] For example, the handbook of the Cochrane Collaboration[3] provides detailed information on how to conduct systematic reviews of RCTs in healthcare. Meta-analysis is a routine component of a systematic review, and is used to quantitatively combine multiple studies in order to obtain overall pooled estimates of effectiveness. Well conducted meta-analyses of RCTs are often considered to be at the top of the hierarchy of evidence when considering the effectiveness of clinical interventions.[1]

Reviews within the Cochrane Database of Systematic Reviews and those published in academic journals often make comparisons between two specific interventions. This is a natural and logical way of summarizing RCTs, since trials themselves are often designed with two arms, making a single comparison between two alternative interventions. Pair-wise meta-analyses are also routinely used to summarize evidence on clinical effectiveness in NICE technology appraisals, and the estimates of effect they produce are often used to inform associated economic analyses. However, there are limitations in considering individual pairs of treatments independently. These are considered briefly below and in more detail throughout the paper.

First, it is not uncommon for two technologies being evaluated never to have been trialled against each other (e.g. new drugs are often compared with placebo or standard care in trials designed to support drug licensing approval, but not against each other). In order to answer the relevant policy question of interest (i.e. which is the best treatment?), it is still necessary to assess the comparative effectiveness of the two technologies. A further scenario is when the policy question of interest requires a decision to be made regarding more than two alternative interventions. As we shall see, there are inadequacies in the use of pair-wise meta-analysis methods in these and other contexts.

While extensions to meta-analysis that allow more than two treatments to be simultaneously compared have been developed,[4-7] until recently, they have seldom been used in practice. These methods are often referred to as ‘mixed treatment comparison’ (MTC) models,[8] although the name ‘network meta-analysis’ has also been used[5] (modelling indirect comparisons[9] forms a special case of this more general methodology, as explained fully in section 3). Although such models can be viewed as a logical and coherent extension of standard pair-wise meta-analysis, their increased complexity raises some unique issues with far-reaching implications concerning how we use data in technology assessment. The identification and consideration of these issues during the process of updating the NICE 2004 Guide to the Methods of Technology Appraisal[10] is the primary focus of this article.

The structure of the remainder of this article is as follows: section 2 presents a nontechnical account of the use of meta-analysis in technology appraisal; and section 3 presents a detailed, but, again, nontechnical, account of indirect and MTC approaches, carefully detailing the assumptions these methods make and how these assumptions relate to those made in standard pair-wise meta-analysis. In these sections, we forgo algebra for a more intuitive exposition utilizing network diagrams, but citations are provided throughout to sources where statistically rigorous elaborations can be found, and a more technical review has recently been made available elsewhere.[11] Section 4 outlines the issues identified as being relevant to updating NICE’s 2004 Methods Guide with respect to indirect and MTC approaches.[10] The conclusions (section 5) end the paper.

2. Systematic Review and Meta-Analysis in Technology Appraisal

To appreciate the assumptions that underlie indirect and MTC modelling and the potential threats to their validity, it is crucial to identify those assumptions that exist in the standard meta-analysis context, since all of these are also relevant to the MTC context.

Figure 1 presents a number of network diagrams representing different evidence structures. Consider panel (a) of the figure, which contains two circles representing two interventions, A and B. Both circles are shaded, which is our convention for indicating that both treatments are being considered as potential alternatives for adoption in the assessment in question (i.e. the relevant technologies and comparators defined by NICE in the scope for the appraisal). The line between them indicates there are one or more studies that estimate this comparison directly (i.e. there are direct head-to-head trials of A vs B). We define the comparative effect of A versus B by dAB, and each of the A versus B trials provides an estimate of this. This is the simplest situation, and one for which standard meta-analysis models[12] are often utilized for synthesis if multiple trials exist.
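To make the notation concrete, the following minimal sketch (not taken from the paper; the numbers are hypothetical) shows the simplest fixed effect synthesis for panel (a): each A versus B trial supplies an estimate of dAB with a standard error, and the pooled estimate is the inverse-variance weighted average.

```python
# Minimal sketch of fixed effect inverse-variance pooling for the A vs B
# comparison in panel (a). The log odds ratios and standard errors below are
# hypothetical and stand in for the estimates of d_AB from individual trials.
import math

log_or = [-0.35, -0.20, -0.50]   # hypothetical per-trial estimates of d_AB
se = [0.15, 0.25, 0.30]          # hypothetical standard errors

weights = [1.0 / s ** 2 for s in se]               # inverse-variance weights
d_ab = sum(w * y for w, y in zip(weights, log_or)) / sum(weights)
se_pooled = math.sqrt(1.0 / sum(weights))

print(f"pooled d_AB = {d_ab:.3f} (SE {se_pooled:.3f})")
print(f"95% CI: {d_ab - 1.96 * se_pooled:.3f} to {d_ab + 1.96 * se_pooled:.3f}")
```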

Fig. 1

Network diagrams for evidence structures: (a) standard pair-wise meta-analysis scenario; (b) indirect comparison; (c) mixed treatment comparison (MTC); (d) MTC in which all treatments are being appraised, so that whether a given comparison is direct or indirect depends on which pair-wise comparison is being considered; (e) disconnected evidence network; and (f) connecting the network by extending it with a common comparator to ‘span’ the discontinuity. Shading indicates the comparators being assessed. A line between two alternatives indicates that one or more studies estimate that comparison directly.

Of course, this does not mean that meta-analysis is appropriate just because multiple trials exist. Although trials may compare broadly the same interventions, they may differ in important ways: for example, populations in the trials may differ, as may details of the intervention (e.g. different doses or timings of delivery of drugs). It is perhaps manifestations of this general issue that have caused the majority of debate and controversy[13] over the use of meta-analysis for the past 2 decades or so. Deciding when it is, and is not, appropriate to combine trials is inevitably a subjective decision, and debates rumble on between ‘lumpers’ and ‘splitters’. What is important, but sometimes overlooked, is that the question the meta-analysis is trying to answer should have a bearing on these decisions. This matters because, in technology assessment, questions will often be highly focused, whereas published meta-analyses, which are not geared to informing a specific policy decision, often aim to answer broader questions or simply to summarize a (possibly disparate) evidence base.

For example, weight loss interventions have been evaluated for reducing the risk of onset of type 2 diabetes mellitus. A meta-analysis has been published[14] addressing the broad question of whether lowering weight, by any means, reduces the onset of diabetes. This meta-analysis included trials evaluating (different) drug and diet interventions. It would probably be too general for technology assessment, where decisions would usually relate to specific weight-lowering interventions, perhaps being as specific as a certain dose of a particular drug, or a particular dietary advice package. Furthermore, the policy question may relate to specific patient populations (e.g. individuals of different ages/starting body mass indexes), which may or may not match up with those used in the trials.

Clearly, these issues will influence which evidence is considered in the assessment. It may even mean that the analyst wants to include only a subset of data from some trials; for example, a trial may contain both ‘young’ and ‘old’ patients, but only the results for the ‘young’ patients are sought. Often it will be necessary to obtain the individual patient data (IPD) and conduct further analysis[15] to make this possible (the article on subgroups in this issue of PharmacoEconomics considers related issues in more detail[16]). While this can be a time-consuming and expensive process, it may eventually turn out to be the only reliable way of conducting meta-analyses to answer specific policy questions.

This example demonstrates the point that there is no one ‘correct’ way to combine data from related trials, but the ‘correct’ analysis depends on the question of interest, which itself will depend directly on the decision question. This potentially conflicts with the notion of being able to produce one definitive summary of trial data, which would appear to be an aim of groups such as the Cochrane Collaboration.

While the above discussion considers the inclusion and exclusion of trial data for a specific synthesis, in reality the decision is rarely a straightforward dichotomy. Rarely will large amounts of evidence relating specifically to the decision context in question be available (and sometimes there will not be any). There is a hope (and evidence)[17] that relative treatment effects are reasonably constant across variability in trial characteristics. Therefore, there would appear to be a general (if somewhat implicit) inclusiveness regarding data selection for meta-analysis in that trials are included, despite known differences between them and the scenario relevant to the policy question of interest, unless a good reason is identified why such trials are not combinable.

Alternative meta-analysis models have been developed to account for differences between trials (making the same broad comparison). The inclusion of random effect terms in meta-analysis models[18] provides a way of allowing for variability between the treatment effects estimated by individual studies. Such models assume there is not one but a distribution of underlying treatment effects (compared with the single value estimated by the simpler fixed effect method). While this may appear to provide an elegant ‘one stop’ solution to the problem of synthesizing heterogeneous data, with a few notable exceptions,[19] little thought has gone into the interpretation of such syntheses. Attention is usually focused exclusively on the estimate of the central location of the distribution of effects, thus ignoring the implications regarding variability indicated by the associated distribution. A recent paper considering the use of random effect models for decision making explores this issue in depth.[20,21]

Thus, if meta-analysis can ‘accommodate’ heterogeneous data, does this mean we do not have to worry too much about which trials we include in our synthesis? The answer would seem to depend on whether we can relate the estimated distribution of effects to the decision context of interest.[20] For example, is the mean of the distribution a good estimate of the treatment effect across all populations for which we wish to make a decision? Or would we expect variations in effect across geographical regions similar to the differences in effect observed in the different trials? Or is one (or more) particular study in the meta-analysis more representative of the context the decision is being made for than the other trials, such that we should expect the treatment effect in practice to be closer to that estimated by this specific study than to those from the others in the meta-analysis? Clearly, the answers to these questions should influence the estimate(s) of effectiveness used in any economic modelling. This issue is only beginning to be explored, but if this cannot be done reliably, then perhaps this is an argument against the use of random effect models in technology assessment.
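To illustrate what a random effects synthesis actually estimates, the following minimal sketch (hypothetical numbers, not from the paper) uses the common DerSimonian-Laird method: as well as the mean of the distribution of effects, it yields a between-study standard deviation, and an approximate range for the effect expected in a new setting is wider than the confidence interval for the mean.

```python
# Minimal sketch of a DerSimonian-Laird random effects meta-analysis on
# hypothetical log odds ratios. The output is a *distribution* of effects
# (mean mu and between-study SD tau), not a single common effect.
import math

y = [-0.35, -0.20, -0.50, 0.05]   # hypothetical per-trial log odds ratios
se = [0.15, 0.25, 0.30, 0.20]     # hypothetical standard errors

w = [1.0 / s ** 2 for s in se]                        # fixed effect weights
y_fe = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Cochran's Q and the method-of-moments estimate of tau^2
q = sum(wi * (yi - y_fe) ** 2 for wi, yi in zip(w, y))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(y) - 1)) / c)

w_re = [1.0 / (s ** 2 + tau2) for s in se]            # random effects weights
mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
se_mu = math.sqrt(1.0 / sum(w_re))

print(f"mu = {mu:.3f} (SE {se_mu:.3f}), tau = {math.sqrt(tau2):.3f}")
# Crude range for the effect expected in a new population/setting:
half_width = 1.96 * math.sqrt(tau2 + se_mu ** 2)
print(f"approximate predictive range: {mu - half_width:.3f} to {mu + half_width:.3f}")
```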

When a component of the heterogeneity is predictable, a further alternative is to include study-level co-variates in the synthesis to not only accommodate, but also to explain between-study variability. Such an analysis has been called ‘meta-regression’.[22] While such an analysis is conceptually appealing, its practical implementation is hampered by issues relating to low power and potential aggregation biases, thus strengthening the argument for the use of IPD meta-analyses for exploring patient-level characteristics.[23]

In summary, while meta-analysis has gained acceptance as a valid tool for combining data across RCTs, and the procedure is systematic and transparent, there are still subjective components in its conduct. The first is the way the protocol for study inclusion and exclusion is framed. This is mainly driven by the target patient population, which is usually clearly defined in a decision-making context. However, subjective judgements may be made regarding the similarity of the patients in certain trials to the target population (although methods are emerging that adjust a meta-analysis when trials have varying external validity).[24] A second area where judgement is required is in deciding whether differences between study populations might impact on either the absolute or the relative treatment effects. A third subjective element arises from seeing evidence synthesis as a complex model-fitting exercise, rather than simply a way of ‘summarizing data’; judgement is then required to select a model, for example, fixed or random effects.

Understanding how the meta-analysis conducted relates to the decision in question is a difficult but important and under-appreciated step that needs to be considered in technology appraisal. Some of the issues that will need to be addressed to do this successfully will require expert clinical, as well as statistical, input, although it will often be unreasonable to expect expert opinion to be able to inform synthesis decisions without introducing elements of uncertainty that should be acknowledged. Section 3 considers the additional issues/assumptions of the use of indirect and MTC modelling in technology assessment.

3. Indirect and Mixed Treatment Comparisons (MTCs) in Technology Appraisal

3.1 Indirect Comparisons

Figure 1b contains three treatments: A, B and C. A and B are shaded, indicating that the decision relates to the adoption of these treatments; common reasons why C is not being considered in the decision include (i) it is a placebo or standard care intervention that would not be acceptable in current practice; or (ii) it may not be licensed. There is no direct evidence, since no direct line connects A and B. However, A and B are linked through C, which is a common comparator (despite C not being of direct interest in the assessment). An ‘indirect comparison’ of A and B (represented subsequently as ‘AB’, etc.) can be calculated using the AC and BC trials. Using the notation introduced in section 2, an indirect estimate of dAB can be obtained using the AC and BC trials, since dAC − dBC = dAB. This is sometimes referred to as an ‘adjusted’ indirect comparison.[9] There are instances in previous literature where an ‘unadjusted’ indirect estimate of dAB is obtained by simply comparing the results in the A arms of the AC trials with the B arms of the BC trials. This is strongly discouraged because it ignores the randomization; it is no more valid than a comparison based on single-arm (i.e. observational) studies.

In this simple context, it is possible to use standard meta-analyses to initially estimate dBC and dAC (with associated uncertainty) and use these estimates to calculate dAB (again, with associated uncertainty).[25] This is powerful because it produces a comparative estimate of effect where no head-to-head evidence exists. Natural questions to ask are: (i) what assumptions does this analysis make? and (ii) how reliable are the estimates it produces? These are considered in turn below.
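Before turning to those questions, a minimal numerical sketch of the ‘adjusted’ indirect estimate may help (the pooled values below are hypothetical): the point estimate is simply dAC − dBC, and the variances of the two direct estimates add, so the indirect estimate is necessarily less precise than either of the meta-analyses that feed into it.

```python
# Minimal sketch of an 'adjusted' indirect comparison with hypothetical inputs:
# pooled estimates of d_AC and d_BC (log odds ratios) from two separate
# pair-wise meta-analyses are combined as d_AB = d_AC - d_BC.
import math

d_ac, se_ac = -0.40, 0.12   # hypothetical pooled A vs C estimate
d_bc, se_bc = -0.15, 0.10   # hypothetical pooled B vs C estimate

d_ab = d_ac - d_bc
se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)   # variances add

print(f"indirect d_AB = {d_ab:.3f} (SE {se_ab:.3f})")
print(f"95% CI: {d_ab - 1.96 * se_ab:.3f} to {d_ab + 1.96 * se_ab:.3f}")
```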

Since the method uses meta-analysis, all the issues raised in section 2 are relevant and, as in standard meta-analysis, randomization is not broken, since comparative estimates are derived from each trial prior to synthesis. However, a further assumption is that there is consistency across the evidence, such that, if an A arm had been included in the BC trials, the estimate of dAC would be consistent (i.e. the underlying effects are assumed to be identical or exchangeable, depending on whether fixed or random effects are assumed) with that produced by the AC trials. (This is equivalent to assuming that, if a B arm had been included in the AC trials, the estimate of dBC would be consistent, within statistical sampling error, with that produced by the BC trials, etc.) At first sight, these seem to be large assumptions, but they are in fact exactly the assumptions made in a standard pair-wise meta-analysis, namely that all the trials estimate identical (fixed) or exchangeable (random) treatment effects.

Ascertaining the reliability of the method in practice is difficult to do definitively. An empirical study[26] has been carried out comparing direct and indirect estimates from collections of trials on topics in which both types of evidence exist. Perhaps inevitably, different levels of agreement were observed across topic areas. In order to make such comparisons, it was assumed that the direct evidence was the gold standard and gave the ‘correct’ answer. This is, of course, a strong assumption in itself, since each meta-analysis used in the investigation is subject to all the complicating issues relating to individual meta-analyses considered in section 2, including heterogeneity. This means that all three meta-analyses (the direct one and the two making the indirect comparison) may, for example, be estimating effects that depend on the populations in which the trials were conducted. Thus, it may be wrong to think of any of them as being biased; rather, they are simply inconsistent estimates, and it should not be assumed that the direct evidence provides the most appropriate estimate for a particular decision context. Indeed, a recent paper considering a number of case studies argues that, in some instances, the indirect estimates may, in some sense, be more reliable than direct ones.[27] We return to the notion of inconsistency in section 3.3.

In a sense, the mechanisms by which an indirect comparison can give the ‘wrong’ answer are very similar to those in which meta-analysis can give the wrong answer, but since an indirect comparison requires two meta-analyses to be ‘correct’, rather than one, there is, arguably, more scope for error.

In summary, estimation of indirect comparisons is a statistically valid idea if the assumptions that the approach makes hold, and these assumptions are not inconsistent with those made in pair-wise meta-analysis. It must be acknowledged that knowing when such assumptions are reasonable may be difficult, and they are largely untestable. However, in the decision context in which NICE operates, the target population is clearly defined, and the protocol for study inclusion and exclusion should result in an evidence base in which a degree of homogeneity of effect can often reasonably be expected. Further, if a decision is required in the absence of direct evidence, the approach is more transparent and explicit than any less formal alternative.

3.2 MTCs

Figure 1c contains a network of treatments, A, B and C, in which, as in the indirect comparison example, A and B are the treatments relevant to the decision being made. Now, however, there is both direct and indirect evidence on A versus B. This is the simplest example of an MTC network.

Models have been developed[4] to simultaneously synthesize all available evidence relating to such a network (i.e. AB, AC, BC and ABC [i.e. three-arm] trials) using an extended meta-analysis model without breaking randomization. More recent work has demonstrated that this approach generalizes immediately to situations in which there are more than three technologies and more complex evidence structures.[7,8]

It is important to note that, in panel (c) of figure 1, treatment C is not shaded and therefore is not an option in the decision being addressed. This is an important issue because, for the first time, there are now options available regarding the type of evidence that could be used to estimate the comparative treatment effect (dAB) of interest. We could use any of the three evidence structures represented in figure 1a–c; that is, use just the direct evidence (a), use just the indirect evidence (b), or use the direct and indirect evidence simultaneously (c).

Currently, common practice would be to use just the direct evidence and conduct a standard meta-analysis. This has been justified by claiming that the indirect evidence is less reliable and that conducting the MTC would produce a ‘weaker’ analysis. There would seem to be little justification for using only the indirect evidence (unless the direct evidence were believed to be unsuitable for the decision context). Two advantages of considering all the evidence in an MTC model are (i) it allows the inclusion of all the evidence, which will reduce the uncertainty in the pooled estimate of interest (dAB); and (ii) it allows us to formally check the consistency of the evidence,[28] i.e. we can formally measure the fit of a model that assumes the direct and indirect evidence fit together.

Therefore, there are two separate motivations for considering indirect and mixed comparisons.

  • Indirect comparisons allow us to estimate treatment comparisons that have not been trialled head-to-head without breaking randomization.

  • Extending networks to include both direct and indirect comparisons can reduce uncertainty in the comparisons of interest and provide an opportunity for formally assessing the consistency of the evidence.

A further factor that, it has been argued, could influence the choice of whether to use the direct evidence only or in combination with the indirect evidence is the amount of direct evidence that exists (i.e. if there is a ‘lot’ of direct evidence, there is no need to consider the indirect). The problem with such a notion is that, unless the conditions are made explicit, the selection of evidence across assessments loses a degree of transparency. Logically, it is hard to see why the validity of one piece of evidence should depend on the existence of another.

There is a concern that, where a complex evidence network could be constructed from multiple treatment comparisons, analysts might select a particular set of contrasts that give favourable results. The only effective, appropriate protection against this is an explicit and transparent protocol for study inclusion/exclusion that is open to discussion and debate.

The assumptions required for an MTC analysis are essentially the same as those relating to an (individual) indirect comparison analysis and concern consistency across the comparisons. Perhaps the clearest way to conceive of this for the general MTC case is to imagine that each trial in the synthesis contained an arm for every treatment regime in the network. Then, to reduce the dataset down to the comparisons that actually exist in reality, arms must be assumed to have been removed at random from each of the studies. If this is not tenable, the analysis may be invalid. Factors such as comparative treatment effects varying with disease severity may be a reason to consider such an analysis invalid (e.g. if the BC trials are undertaken in patient populations with higher/lower baseline risks than the AB and AC trials, and the treatments interact with baseline risk, the evidence will be inconsistent). This should be seen not as a limitation of the methodology, but as a limitation of the data since, for a defined population, the underlying model is consistent by definition.
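For a single loop such as that in figure 1c, an informal check of this consistency assumption can be sketched as follows (hypothetical numbers; more formal, model-based approaches are discussed in section 3.3): the ‘inconsistency’ is the gap between the direct estimate of dAB and the indirect estimate dAC − dBC, and a large standardized gap suggests that the two sources of evidence do not fit together.

```python
# Minimal sketch of an informal consistency check in one A-B-C loop, using
# hypothetical pooled estimates. A large |z| suggests the direct and indirect
# evidence on d_AB may be inconsistent.
import math

d_ab_direct, se_direct = -0.30, 0.14   # hypothetical pooled AB estimate
d_ac, se_ac = -0.40, 0.12              # hypothetical pooled AC estimate
d_bc, se_bc = -0.15, 0.10              # hypothetical pooled BC estimate

d_ab_indirect = d_ac - d_bc
se_indirect = math.sqrt(se_ac ** 2 + se_bc ** 2)

gap = d_ab_direct - d_ab_indirect
se_gap = math.sqrt(se_direct ** 2 + se_indirect ** 2)
print(f"inconsistency = {gap:.3f} (SE {se_gap:.3f}), z = {gap / se_gap:.2f}")
```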

Figure 1d is the same as figure 1c except that treatment C is now also included in the decision options. Although the MTC statistical model that could combine data in such a structure would be exactly the same as for scenario 1c, this scenario has interesting implications for what we label as direct and indirect evidence.

Since all three treatments are now relevant to the decision, dAB, dAC and dBC are all of interest; therefore, the two-arm AB, AC and BC trials each provide direct evidence on the corresponding comparison. However, AB and AC trials also contribute indirect evidence to the dBC estimate, AB and BC trials contribute indirect evidence to the dAC estimate, and so on. Moreover, none of these trials provides direct evidence on all comparisons of interest; a three-arm ABC trial would be required for that. This illustrates that our notion of direct and indirect evidence becomes difficult to define when more than two treatment options are of interest. An implication of this is that, if you wished to include only direct evidence in your synthesis, you would include only three-arm trials. Given that this ruling would exclude the majority of the evidence, we suspect many would object to it and lobby for the use of the two-arm trials also. But if this is done, it is important to note that it would be inconsistent with a ruling that excludes the indirect evidence (i.e. the AC and BC trials) in the scenario associated with figure 1c, an issue that we believe is not widely appreciated.

Indirect comparisons, and the combination of direct and indirect evidence (i.e. MTC) for three treatments, are possible using simple manipulations of pair-wise results,[25,29] and such an approach can be extended to more complex evidence structures by constructing an MTC model using the pair-wise meta-analytic summaries for every randomized comparison. However, a study-level MTC extension of the standard (fixed or random) effects meta-analysis model is usually used.

This is perhaps most straightforward using the Bayesian WinBUGS software[30] (for which code is available),[8] although fixed and random effect MTC models have been fitted using classical methods in R[5] and could probably be fitted in other packages. Using Bayesian Markov chain Monte Carlo (MCMC) methods, it is possible to rank all three treatments and produce a probability that any one treatment is the ‘best’,[31] as sketched below. This is a powerful illustration of the ability of Bayesian statistical methods to make direct probability statements about quantities of central interest to a decision, although it can be approximated within a classical framework through bootstrap methods.[32] In this context, a further advantage of MTC methodology exists:

  • MTC methodology allows more than two treatments to be compared simultaneously, using one consistent evidence base to inform all treatment comparisons, so that the comparisons are mutually coherent.
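As a minimal sketch of how such ranking probabilities are computed, the following uses simulated draws in place of genuine posterior (MCMC) or bootstrap output; the treatment labels, effect scale and numbers are all hypothetical. The probability that a treatment is ‘best’ is simply the proportion of draws in which it has the most favourable value.

```python
# Minimal sketch: P(best) from samples of each treatment's effect (here,
# log odds of mortality versus a common reference, so lower is better).
# The draws are simulated stand-ins for MCMC or bootstrap output.
import random

random.seed(1)
n_draws = 10_000
draws = {
    "A": [random.gauss(-0.30, 0.10) for _ in range(n_draws)],
    "B": [random.gauss(-0.25, 0.08) for _ in range(n_draws)],
    "C": [random.gauss(-0.40, 0.20) for _ in range(n_draws)],
}

best_counts = {t: 0 for t in draws}
for i in range(n_draws):
    best = min(draws, key=lambda t: draws[t][i])  # lowest value is 'best'
    best_counts[best] += 1

for treatment, count in best_counts.items():
    print(f"P({treatment} is best) = {count / n_draws:.3f}")
```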

Hence, the potential benefits of extending meta-analysis to indirect and MTC approaches for technology assessment are multifaceted, which we believe has contributed to some confusion about the scenarios in which the approach offers benefits over standard meta-analysis.

An MTC cannot synthesize all evidence structures, since it requires a connected network to be valid.[8] That is, for each treatment, there must be a chain of pair-wise comparisons connecting it to every other treatment. Figure 1e gives an example where this is not the case. Here, treatment D is relevant to the decision but has not been compared directly with any of the other treatments, hence creating a disconnected network. Because of this discontinuity, no estimate of the effect of D compared with A, B or C can be obtained. It may be that treatment D is new and has not yet been evaluated in any trials. It is possible to envisage that MTC methods could be extended to include disconnected networks or, in effect, to include one-arm observational studies.[33] However, such an extension introduces additional types of uncertainty, in the form of both random and systematic error. No research has been done to establish how this would be done or whether, given the degree of uncertainty, such evidence could ever make a material contribution to the decision.
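A minimal sketch of checking connectivity before attempting an MTC is given below: treatments are nodes, each trial contributes an edge between the arms it compares, and a simple breadth-first search reveals any treatments (such as D in figure 1e) that cannot be reached. The treatment labels and comparisons are hypothetical.

```python
# Minimal sketch: is the evidence network connected? Treatments are nodes and
# each (hypothetical) trial comparison is an undirected edge.
from collections import defaultdict, deque

treatments = {"A", "B", "C", "D"}                 # D has no trials in the network
trial_comparisons = [("A", "C"), ("B", "C"), ("A", "B")]

graph = defaultdict(set)
for x, y in trial_comparisons:
    graph[x].add(y)
    graph[y].add(x)

start = "A"                      # start the search from any treatment
seen, queue = {start}, deque([start])
while queue:
    node = queue.popleft()
    for neighbour in graph[node]:
        if neighbour not in seen:
            seen.add(neighbour)
            queue.append(neighbour)

print("network connected:", seen == treatments)   # False: D is unreachable
print("unreachable treatments:", treatments - seen)
```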

However, if treatment D has been trialled, but against treatments not currently in the network, it may be possible to expand the network to include further treatments, enabling the discontinuities to be spanned. In figure 1f, treatment E is introduced because it has been trialled against both treatments D and B and hence spans the network.

Even when networks do not contain discontinuities and all treatments of relevance to the decision are connected, it may be possible to extend the network further, and in some therapeutic fields, highly extended networks may be possible. The advantages of expanding networks are that doing so can reduce uncertainty in the comparisons of interest and allow further opportunities for checking consistency (every closed loop of connections in a network provides an opportunity for checking consistency).[28] Ascertaining the likely gains in precision from expanding networks is complex, but there are diminishing returns the more distant the network additions are from the comparisons of interest for the decision question. Since the resources required for study identification and data extraction will also increase with the size of the network, there may be a point beyond which it will not be cost effective to expand a network, although no formal methods exist for indicating when this is reached. It also means that, as the number of treatments increases, the number of alternative networks increases, and with it the potential for selectivity of reporting. This further reinforces the need for a protocol to define the evidence base in advance of the assessment.

Although not empirically proven, it would seem reasonable that the more ‘distant’ the extensions to an existing network, the more likely it is that inconsistencies are introduced into the network. This is argued because, in many contexts, treatments ‘distant’ from those in the decision question may be more likely to be supported by older evidence and thus to have been evaluated under clinical and methodological conditions different from the current climate. Given these considerations, it would seem sensible to keep the protocol tightly focused on the relevant patient population (the definition of which may, of course, include an element of subjectivity), rather than trying to create the biggest network possible, although this is an under-researched topic.

3.3 Extensions to MTC Methodology

MTC methodology is still very much in development, and several extensions to the basic models have been considered. A common assumption in such modelling is that the degree of between-study heterogeneity is the same for every pair-wise comparison (i.e. one heterogeneity parameter is used across all comparisons).[34] However, this assumption can be relaxed if necessary. For example, where inconsistency is identified, it is possible to include further parameters in the model in order to account for such inconsistency (Lu G, Ades AE, unpublished data).[5,28] It should be noted that, like heterogeneity in meta-analysis, inconsistency in MTC is a ‘double-edged sword’. While inconsistency may be difficult to deal with, an MTC analysis does provide a formal framework for assessing collections of trials and exploring why results are inconsistent.

It is possible to include study-level co-variates in an MTC analysis[35] to address inconsistency between comparisons and heterogeneity within comparisons, in the same way as co-variates are included to address heterogeneity[22] in a standard meta-analysis. For example, if interactions between baseline risk and treatment were suspected, baseline risk could be included as a co-variate in the model to adjust for this and remove inconsistency. However, the problems of low power that affect meta-regression[23] will also apply in this context, and therefore such a solution will only ever be partial. To date, IPD MTC analyses have rarely been undertaken, but these may overcome some of the limitations of an analysis based on summary data. There may be differences in the outcome data reported across the different studies that need accounting for in the analysis; models are being developed to address these. For example, patient responses may have been reported at different and multiple timepoints,[34] or the outcome definition may have differed between studies, requiring the relationship between outcomes to be modelled simultaneously in the synthesis.[36]

4. Relevant Issues Relating to Updating the 2004 Methods Guide with Respect to the Use of Indirect and MTC Approaches

This section highlights (i) what the 2004 Methods Guide recommended with respect to indirect and MTC analyses; (ii) aspects of the guide that could be affected by changes to those recommendations; (iii) questions relating to the role of indirect comparisons and/or MTC in NICE appraisals (some of which were presented for discussion at a workshop on MTC held by NICE as part of the process of updating its guidance);[10] and (iv) the authors’ own opinions on some of these issues.

The NICE Methods Guide 2004[10] does not mention the possibility of an MTC analysis at all, although it does briefly mention indirect comparisons. It is therefore key to establish in which contexts (if any) indirect and MTC approaches should be recommended as the synthesis method of choice for estimating comparative treatment effects. This could range from always, to only those instances where standard pair-wise meta-analysis is considered inadequate; for example, where more than two treatment options are being considered, or where direct data do not exist. Of course, placing such restrictions on when MTC could be used would mean the evidence used in an appraisal would be conditional on the number of comparators considered and the type of evidence available, which could raise issues of inconsistency across assessment topics.

The reliability, and hence desirability, of indirect and MTC analyses can perhaps be gauged by placing such analyses within pre-existing hierarchies of evidence for treatment effectiveness.[1] Would such analyses be placed alongside meta-analyses of RCTs at the top of such hierarchies, or take a lower ranking? If they take a lower ranking, how much lower would this ranking be? For example, would indirect comparisons from RCTs rank higher than direct estimates from observational studies? There are concerns that the assumptions of MTC analyses may be valid in some contexts and not others.[11] Can contexts in which these assumptions are not valid be identified ahead of time (e.g. when there are large known differences in the populations recruited to the trials for different comparisons), or can this only be assessed through statistical analyses of the (in)consistency?

The definition and use of comparator technologies have to be carefully considered in any technology assessment. The 2004 guidance stated a “strong preference for evidence from ‘head-to-head’ RCTs that directly compare the technology and the appropriate comparator. Where no head-to-head trials are available, indirect comparisons can be considered, subject to careful and fully described analysis and interpretation [emphasis added].” Related to this is the specific issue of ‘class effects’. The 2004 guide stated “A group of related technologies, whether or not they are formally identified as part of a recognised ‘class’, might have similar but not necessarily identical effects. Where the Institute is appraising a number of related technologies within a single appraisal, both separate and combined analysis of the benefits of the individual technologies should be undertaken.” Clearly, if indirect or MTC analyses are to be used, how class effects are treated will have considerable influence on the structure of the evidence network.

Once the comparators of interest have been defined, the associated evidence base needs defining. Recall that, in an MTC analysis, treatments can be included that are not defined as comparators in the assessment. The 2004 guidance stated that analysis should include, “… data from all relevant studies.” Clearly, the data that are considered relevant will depend on the synthesis model used. There are many issues surrounding how the network is defined. It would seem sensible that any network should be specified prior to analysis to prevent selectivity of reporting following analyses of multiple network variations. But how extensive should the network be? Should interventions be added to strengthen the evidence base or to build connections between isolated comparators? Should we go as far as insisting that networks should be exhaustive in the treatments they include? It would seem general principles need outlining here to offer transparency to the process. These could be based on expert clinical opinion and/or statistical/economic considerations of efficiency/cost effectiveness regarding the pay-offs of expanding the network.

A related and recurring issue in technology assessments considered by NICE is that the evidence base on which they rest is often far from ideal. Use of indirect and MTC methods touches on the broader issues relating to study quality/validity. Currently, such issues are not explicitly factored into the synthesis models in common usage. In certain circumstances, no randomized evidence may exist at all, and estimates of effectiveness rely on observational studies. There is a pressing need to develop methods of incorporating information on the validity of evidence into any synthesis models used in technology appraisal (whether they are standard meta-analyses or MTCs, etc.).[24] The authors believe that only after this is done can the use of an MTC be fully built into a coherent context within the appraisal process.

There is a need to consider details of the statistical analysis used to synthesize the defined evidence base. Although the statistical models involved for MTC can be seen as a natural extension of standard fixed and random effect meta-analysis, they appear to introduce many new complexities with far-reaching practical implications, which we have highlighted in this article. To a large extent, these complexities are less to do with the technique itself than with the principles under which evidence from many RCTs is selected and assembled. Nevertheless, if NICE does embrace the use of such MTC methods, there will be implications in terms of further expertise and resources required in conducting a technology assessment.

The need to assess heterogeneity between study results before pooling was highlighted in the 2004 Methods Guide.[10] Also supported is the use of random effect models, as well as meta-regression and subgroup analysis, in circumstances where heterogeneity is present. If MTC methods are used, these and further issues around methods to assess between-comparison consistency need considering. If such assessments indicate that the assumptions of the models do not hold (including the need for a connected network), what should be done? Since MTC methods are presently in their infancy, do we know enough about the performance of such methods to recommend them for routine practice? Slightly different model specifications exist in the literature[5,7] for conducting MTC analysis; is more methodological research required before a specific modelling approach can be recommended? If so, what? It has been concluded by others[11] that more work is needed in this area, including the development of more user-friendly software to fit MTC models.

Until now, we have not considered the reporting and presentation of indirect and MTC analyses. While guidelines for reporting pair-wise meta-analyses exist,[37] would it be helpful/essential to have a similar document for indirect and MTC approaches? There are several ways in which the trial evidence can be presented. Is presenting a network diagram a good idea (or should it be compulsory?), perhaps including the number of comparisons on the connecting lines (as shown in figure 2, an MTC of treatments for atrial fibrillation [AF])?[38] Is a tabular representation of the comparisons that exist helpful, as shown in table I for early thrombolysis for acute myocardial infarction (MI)? Should trial-level data be included, as shown in table II for fictitious data based around panel (f) of figure 1 (with a C vs E trial added)? Given the more extensive output of an MTC compared with a pair-wise meta-analysis, presenting clear results is a non-trivial task. Should results of all pair-wise comparisons be presented? If so, is this most clearly done in tabular form, as in table III for treatments for acute MI, or in graphical form, as illustrated in figure 3 for treatments for AF? Should these results be compared, where possible, with the results using only the direct evidence, as done in the upper triangle of table III? Is it useful to present the probability that each technology is best, as in table IV for treatments for MI? There are further technical details that need to be considered, such as (i) how to specify the statistical model used; (ii) how to report the goodness of fit of the model; and (iii) how any heterogeneity and inconsistency were explored and described.

Fig. 2

Network diagram of treatments for non-rheumatic atrial fibrillation. The numerals on the connecting lines represent the number of comparisons (reproduced from Cooper et al.[38] with permission. © 2006, American Medical Association. All rights reserved).

Table I

Summary of comparisons made in the randomized, controlled trials (RCTs) of early thrombolysis for acute myocardial infarction (reproduced from Caldwell et al.[8])

Table II

Structure of table for presenting data in a mixed treatment comparison (MTC) analysis. Data are fictitious, based around the data structure in panel (f) of figure 1, with the addition of a C versus E trial. The data in the columns representing treatments A, B and D are italicized, indicating that these treatments are the interventions being evaluated in the technology assessment

Table III

Results (odds ratios for mortality) from fixed effects mixed treatment comparison (MTC) analysis and standard pair-wise meta-analysis for treatments for acute myocardial infarction (upper triangle = direct comparisons; lower triangle [italic] = MTC). Reproduced from Caldwell et al.[8]

Fig. 3

Pair-wise comparisons of all different treatments for prevention of ischaemic stroke in atrial fibrillation patients. Adverse event outcome: major or fatal bleed (reproduced from Cooper et al.[38] with permission. © 2006, American Medical Association. All rights reserved). ASP = aspirin; cf = compared with; FDW = fixed-dose warfarin; FDWA = fixed-dose warfarin and aspirin; IBF = indobufen; LDW = low-dose warfarin; PLC = placebo; WFN = warfarin; XML = ximelagatran.

Table IV

Probability that each treatment is best for acute myocardial infarction example (reproduced from Caldwell et al.[8])

An implication of using MTCs is that, unlike for regulatory approval, the results of one company’s trial may influence the estimated relative effectiveness of another company’s product, even when the other product is not used as a trial comparator. As well as affecting the design of future studies, this may have implications for which commercially in-confidence data are made available by companies for NICE appraisals.

Finally, it is helpful to consider the question: what alternatives do we have for making decisions regarding interventions that have not been compared directly? We believe that if we do not accept this type of analysis, we render ourselves unable to make decisions, which may itself be seen as a good or bad thing. (We are also unable to explore issues such as consistency of evidence across comparisons, and in this respect MTC should be seen as an opportunity, in the same way that exploration of between-study heterogeneity can result in new insights in meta-analysis.) It is important to think about the downstream implications of accepting indirect (and mixed) treatment comparison analyses. If the methodology is considered acceptable by decision makers, decisions regarding future trials should be made on the basis of the uncertainty remaining given the current (MTC) analysis. This may mean less insistence on direct head-to-head trials in the future, with which some may take issue (although it may also mean that future head-to-head trials could be more precisely targeted and designed more efficiently if the totality of existing evidence is considered in an MTC and used at the design stage).

5. Conclusions

Although the methodology required to conduct indirect and MTC analyses has been in place for some time, it is only in very recent years that its use has started to become widespread. It is a particularly important methodological development in technology appraisal because it potentially offers a powerful solution to synthesis in contexts where individual or pair-wise meta-analyses of trials do not provide coherent estimates of all the effectiveness parameters, as are often required to inform associated economic decision models. This is why NICE needs to consider its position on the use of the methods at this relatively early stage in their development.

It is because such methodology is at this early stage of development, and because it is necessarily more complex than standard meta-analysis, that we believe drawing up guidance that is logical, transparent, explicit and fair across different technology appraisals is particularly challenging. Despite the challenges ahead, MTC methods are perhaps the most important development in evidence synthesis in recent years, and their potential for use in technology assessment is considerable.