A comparison of various aggregation functions in multi-criteria decision analysis for drug benefit–risk assessment

Multi-criteria decision analysis is a quantitative approach to drug benefit–risk assessment that allows for consistent comparisons by summarising all benefits and risks in a single score. Multi-criteria decision analysis consists of several components, one of which is the utility (or loss) score function that defines how benefits and risks are aggregated into a single quantity. While a linear utility score is one of the most widely used approaches in benefit–risk assessment, it is recognised that it can result in counter-intuitive decisions, for example, recommending a treatment with extremely low benefits or high risks. To overcome this problem, alternative approaches to score construction, namely the product, multi-linear and Scale Loss Score models, have been suggested. However, to date, the majority of arguments concerning the differences implied by these models are heuristic. In this work, we consider four models for calculating the aggregated utility/loss scores and compare their performance in an extensive simulation study over many different scenarios, and in a case study. We find that the product and Scale Loss Score models provide more intuitive treatment recommendation decisions in the majority of scenarios compared to the linear and multi-linear models, and are more robust to correlation between the criteria.

(a utility or loss score) for a treatment, which summarises all the benefits and risks induced by the treatment in question. These scores are then used to compare the treatments and to guide the recommendation of some therapies over others. Mussen et al. 7 proposed using a linear aggregation model in the MCDA, which takes into account all main benefits and risks associated with a treatment (as well as their relative importance) to generate a treatment utility score by taking a linear combination of all criteria. This utility score is then compared against the utility score of a competing treatment, and the treatment with the highest score is recommended. This model was appealing for numerous reasons, one of which was its simplicity. The proposed method, however, was deterministic: point estimates of the benefit and risk criteria were used, and no uncertainty around these estimates was considered. Yet uncertainty and variance are expected in treatments' performances, and must therefore be accounted for in the decision making.
To resolve this shortcoming, probabilistic MCDA (pMCDA) 11 that accounts for the variability of the criteria through a Bayesian approach was proposed. Generalisations of pMCDA for the case of uncertainty in the relative importance of the criteria were developed, named stochastic multi-criteria acceptability analysis (SMAA) 12 or Dirichlet SMAA. 13 However, it was acknowledged that by accounting for several sources of uncertainty, these models become more complex and should be used primarily for the sensitivity analysis.
All the works discussed above concern a linear model for aggregation of the criteria, which is thought to be primarily due to its wider application in practice rather than its properties. One argument against the linear model is that a treatment which has either no benefit or extreme risk could be recommended over other alternatives without such extreme characteristics. 14-16 In addition, linearity implies that the relative tolerance in the toxicity increase is constant for all levels of benefit, which might not be the case in a number of clinical settings. To address these points, a Scale Loss Score (SLoS) model was developed. This model makes it impossible for treatments with no benefit or extremely high risk to be recommended. It also incorporates a decreasing level of risk tolerance relative to the benefits: an increase in risk is more tolerated when benefit improves from 'very low' to 'moderate' than when it improves from 'moderate' to 'very high'. The SLoS model resulted in similar recommendations to the linear MCDA model when one treatment is strictly preferred to another (i.e. has both lower risk and higher benefit), but resulted in more intuitive recommendations if one of the treatments has either extremely low benefit or extremely high risk.
Whilst other methods are discussed in the literature, the only application of a non-linear BRA model to the medical field is that of Saint-Hilary et al., 14 and it compares only the linear and SLoS models. This paper builds on that comparison by introducing further aggregation models (AM) and analysing how each performs relative to the others in the medical field. An extensive simulation study over a number of clinical scenarios allows an informed decision to be made as to which model should be used. We also use a case study to demonstrate the implications of the choice of AM for actual decision making using the MCDA.
The rest of the paper proceeds as follows. The general MCDA methodology, the four different aggregation models considered, linear, product, multi-linear and SLoS, and the choice of the weights for them are given in Section 2. In Section 3, we revisit a case study conducted by Nemeroff 17 looking at the effects of Venlafaxine, Fluoxetine and a placebo on depression, applying the various aggregation models to a given dataset. In Section 4, a comprehensive simulation study comparing the four aggregation models in many different scenarios is presented, as well as the effects any correlation between criteria may have. We conclude with a discussion in Section 5.

Methodology
All of the aggregation models (referred to as 'models' below) considered in this work belong to the MCDA family: they aggregate the information about benefits and risks in a single (utility or loss) score. Therefore, we refer to each approach by the model it uses to compute the score. Below, we outline the general MCDA framework for the construction of a score using an arbitrary model. We consider the MCDA that takes into account the variability of estimates, pMCDA. 11

Setting
Consider $m$ treatments (indexed by $i$) which are assessed on $n$ criteria (indexed by $j$). To ensure continuity, we use the same notation as Saint-Hilary et al. 14:
• $\xi_{ij}$ is the performance of treatment $i$ on criterion $j$, so that treatment $i$ is characterised by a vector showing how it performs on each criterion: $\xi_i = (\xi_{i1}, \ldots, \xi_{in})$.
• The monotonically increasing partial value functions $0 \le u_j(\cdot) \le 1$ are used to normalise the criterion performances. Let $\xi'_j$ and $\xi''_j$ be the most and the least preferable values, then $u_j(\xi''_j) = 0$ and $u_j(\xi'_j) = 1$. The inequality $u_j(\xi_{ij}) > u_j(\xi_{hj})$ indicates that the performance of treatment $i$ is preferred to the performance of treatment $h$ on criterion $j$. In this work, we focus on linear partial value functions, one of the most common choices in treatment BRA, 5,7,12,18,11 which can be written as
$$u_j(\xi_{ij}) = \frac{\xi_{ij} - \xi''_j}{\xi'_j - \xi''_j}. \tag{1}$$
• The weights indicating the relative importance of the criteria are known constants denoted by $w_j$. The vector of weights used for the analysis is denoted by $w = (w_1, \ldots, w_n)$.
• The MCDA utility or loss scores of treatment $i$ are obtained as $u(\xi_i, w) := u(w_j, u_j(\xi_{ij}),\ j = 1, \ldots, n)$ and $l(\xi_i, w) := l(w_j, u_j(\xi_{ij}),\ j = 1, \ldots, n)$, respectively, where $u(\cdot)$ and $l(\cdot)$ are the functions specifying how the criteria should be summarised in a single score, and are referred to as 'aggregation models'. The impact of the choice of this model on the performance of treatment recommendation is the focus of this work. The higher the utility score, or the lower the loss score, the more preferable the benefit-risk ratio.
Then, the comparison of treatments $i$ and $h$ is based on
$$\Delta u(\xi_i, \xi_h, w) = u(\xi_i, w) - u(\xi_h, w) \tag{2}$$
or
$$\Delta l(\xi_i, \xi_h, w) = l(\xi_i, w) - l(\xi_h, w). \tag{3}$$
Within a Bayesian approach, the utility score $u(\xi_i, w)$ and the loss score $l(\xi_i, w)$ are random variables having a prior distribution. Given observed outcomes $x_i = (x_{i1}, \ldots, x_{in})$ and $x_h = (x_{h1}, \ldots, x_{hn})$ (corresponding to treatment performances $\xi_i$ and $\xi_h$, respectively) for $i$ and $h$, one can obtain the posterior distribution of $\Delta u(\xi_i, \xi_h, w)$ or $\Delta l(\xi_i, \xi_h, w)$, respectively.
The inference is based on the complete posterior distribution, and the conclusion on the benefit-risk balance is supported by the probability that treatment $i$ has a greater utility score (or smaller loss score) than treatment $h$:
$$P^{ih}_u = P\left(\Delta u(\xi_i, \xi_h, w) > 0 \mid x_i, x_h\right) \tag{4}$$
or
$$P^{ih}_l = P\left(\Delta l(\xi_i, \xi_h, w) < 0 \mid x_i, x_h\right). \tag{5}$$
The probabilities (4) or (5) are used to guide a decision on taking or dropping a treatment. A possible way to formalise the decision based on this probability is to compare it to a threshold confidence level $0.5 \le \psi \le 1$. Then, $P^{ih}_u > \psi$ (or $P^{ih}_l > \psi$) would mean that one has enough evidence to say that treatment $i$ has a better benefit-risk balance than $h$ with a level of confidence $\psi$. Note that $P^{ih}_u = 0.5$ (and $P^{ih}_l = 0.5$) corresponds to the case where the benefit-risk profiles of $i$ and $h$ are equal according to the corresponding MCDA model.
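As a rough illustration of how the probability in (4) might be approximated from posterior samples, the following Python sketch normalises draws with a linear partial value function and counts the share of paired draws in which one treatment scores higher. The function names, the clipping of values to [0, 1], the single-criterion setting and the Beta parameters are all assumptions made for illustration; they are not taken from the analysis in this paper.

```python
import numpy as np

def linear_pvf(xi, most_pref, least_pref):
    """Linear partial value: 0 at least_pref, 1 at most_pref.
    Clipping to [0, 1] is an assumption for values outside that range."""
    u = (np.asarray(xi, dtype=float) - least_pref) / (most_pref - least_pref)
    return np.clip(u, 0.0, 1.0)

def prob_better(score_i, score_h, higher_is_better=True):
    """Share of paired posterior draws in which treatment i beats treatment h."""
    diff = np.asarray(score_i, dtype=float) - np.asarray(score_h, dtype=float)
    better = diff > 0 if higher_is_better else diff < 0
    return float(np.mean(better))

rng = np.random.default_rng(1)
# Hypothetical posterior draws of a response probability for two treatments.
benefit_i = rng.beta(60, 40, size=10_000)
benefit_h = rng.beta(45, 55, size=10_000)
u_i = linear_pvf(benefit_i, most_pref=0.8, least_pref=0.2)
u_h = linear_pvf(benefit_h, most_pref=0.8, least_pref=0.2)
# With a single criterion the utility score is just the partial value.
p_ih = prob_better(u_i, u_h)
print(f"Estimated P(i better than h): {p_ih:.3f}")
```

For a loss score, the same helper can be called with `higher_is_better=False`, which corresponds to the probability in (5).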

Aggregation models
Below, we consider four specific forms of aggregation models, namely linear, product, multi-linear and SLoS, which various authors have proposed for use in the MCDA to support decision making.

Linear model
A linear aggregation of a treatment's effects on benefits and risks remains the most common choice in treatment development. 7,12,19,18,13 Under the linear model, the utility score is computed as
$$u^L(\xi_i, w^L) = \sum_{j=1}^{n} w^L_j\, u_j(\xi_{ij}), \tag{6}$$
where $w^L_j > 0\ \forall j$ and $\sum_{j=1}^{n} w^L_j = 1$, with the superscript $L$ referring to the linear model. The expression (6) is used in equations (2) and (4) to compare the associated linear scores for a pair of treatments.
As an illustration of all considered aggregation models, we will use the following example with two criteria: one benefit indexed by 1 and one risk indexed by 2. The linear utility score for treatment $i$ at fixed parameter values $\theta_{i1}, \theta_{i2}$ takes the form
$$u^L(\theta_{i1}, \theta_{i2}, w^L) = w^L u_1(\theta_{i1}) + (1 - w^L)\, u_2(\theta_{i2}). \tag{7}$$
As the values $u_1(\theta_{i1}), u_2(\theta_{i2}) \in (0, 1)$, one can interpret $u_1(\theta_{i1})$ as a probability of benefit and $1 - u_2(\theta_{i2})$ as a probability of risk. This utility score can be transformed into a loss score by subtracting it from one:
$$l^L(\theta_{i1}, \theta_{i2}, w^L) = 1 - u^L(\theta_{i1}, \theta_{i2}, w^L) = w^L \left(1 - u_1(\theta_{i1})\right) + (1 - w^L) \left(1 - u_2(\theta_{i2})\right). \tag{8}$$
We do this as, historically, the concept of a loss function is preferred both in statistical decision theory and in Bayesian analysis for parameter estimation. 20 The contours of equal linear loss score for all values of $u_1(\theta_{i1})$ and $1 - u_2(\theta_{i2})$ are given in panel A of Figure 1 using $w^L = 0.5$ (top row) and $w^L = 0.25$ (bottom row). The contours represent the loss score for each benefit-risk pair. Lower values of $l^L(\theta_{i1}, \theta_{i2}, w^L)$ correspond to better treatment benefit-risk profiles. The loss is minimised (bottom right corner) when the maximum possible benefit is reached ($u_1(\theta_{i1}) = 1$) with no risk ($1 - u_2(\theta_{i2}) = 0$). The contours are linear, with a constant slope $w^L/(1 - w^L)$. This implies that if one treatment has an increased probability of risk of $x\%$ compared to another, its benefit probability should be increased by $(1 - w^L)/w^L \times x\%$ to have the same utility score, and this holds for all values of benefit and risk. The figure illustrates how the different benefit-risk values are penalised and allows treatments with different criterion values to be compared visually. For example, any two treatments whose benefit-risk pairs lie on the same contour line are seen as equal.
The major advantage of the linear model is its intuitive interpretation: poor efficacy can be compensated by good safety, and vice versa. However, the linear utility score can result in the recommendation of a highly unsafe or poorly effective treatment 21,6 and, consequently, in a counter-intuitive conclusion. Moreover, linearity implies that the relative tolerance in the toxicity increase is constant for all levels of benefit. 14 These pitfalls could be avoided (or at least reduced) by using non-linear models. 6,22 Specifically, Saint-Hilary et al. 14 advocated two principles that a desirable benefit-risk aggregation model should satisfy:
1. One is not interested in treatments with extremely low levels of benefit or extremely high levels of risk (regardless of how the treatment performs on other criteria).
2. Decreasing level of risk tolerance relative to benefits: an increase in risk could be more tolerated when benefit improves from 'very low' to 'moderate' than when it improves from 'moderate' to 'very high'.
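The counter-intuitive behaviour of the linear score can be seen in a small numerical sketch. The code below assumes the two-criterion form in which the weight `w` multiplies the benefit partial value `u1` and `1 - w` multiplies the risk partial value `u2` (where `u2 = 1` means no risk); the two treatments are hypothetical and chosen only for illustration.

```python
def linear_utility(u1, u2, w):
    """Two-criterion linear utility: w weights the benefit value u1,
    (1 - w) weights the risk partial value u2 (u2 = 1 means no risk)."""
    return w * u1 + (1.0 - w) * u2

def linear_loss(u1, u2, w):
    """Linear loss score, obtained by subtracting the utility from one."""
    return 1.0 - linear_utility(u1, u2, w)

# Hypothetical treatments: the first has no benefit at all but also no risk,
# the second is moderate on both criteria.
no_benefit = linear_utility(u1=0.0, u2=1.0, w=0.5)
moderate = linear_utility(u1=0.45, u2=0.5, w=0.5)
print(no_benefit, moderate)
```

With equal weights, the treatment with no benefit obtains the higher linear utility (0.5 against 0.475), so it would be preferred even though it cannot help the patient.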
Below, we consider three models having one or both of these properties.

Product model
A multiplicative aggregation (known as a product model) is an alternative method of combining a treatment's effects on benefits and risks. 23 Under the product model, the utility score is computed as
$$u^P(\xi_i, w^P) = \prod_{j=1}^{n} u_j(\xi_{ij})^{w^P_j}, \tag{9}$$
where the superscript $P$ refers to the product model. The expression (9) is used in equations (2) and (4) to compare the associated product scores for a pair of treatments. The product utility score for treatment $i$ with two criteria at fixed parameter values $\theta_{i1}, \theta_{i2}$ takes the form
$$u^P(\theta_{i1}, \theta_{i2}, w^P) = u_1(\theta_{i1})^{w^P}\, u_2(\theta_{i2})^{1 - w^P}. \tag{10}$$
As for the linear model, this utility score can be transformed into a loss score by subtracting it from one:
$$l^P(\theta_{i1}, \theta_{i2}, w^P) = 1 - u_1(\theta_{i1})^{w^P}\, u_2(\theta_{i2})^{1 - w^P}. \tag{11}$$
The contours of equal product loss score for all values of $u_1(\theta_{i1})$ and $1 - u_2(\theta_{i2})$ are given in panel B of Figure 1 using $w^P = 0.5$ (top row) and $w^P = 0.25$ (bottom row).
One advantage the product model has over the linear model is that it cannot recommend treatments with either zero benefit or extreme risk. Either of these characteristics results in a utility score of zero, which makes it impossible for such a treatment to be recommended. The contour lines in panel B of Figure 1 demonstrate how the product model penalises undesirable values compared to the linear model. These contours are curved, and are bunched together most tightly where benefit values are low and where risk values are high. This shows how the penalisation distinguishes this model from the linear model: under the linear model, an increase or decrease in benefit or risk is treated equally regardless of the marginal values of these criteria, whereas under the product model the values of these criteria affect the decision.
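A minimal sketch of this penalisation, assuming the two-criterion product form in which the benefit partial value `u1` is raised to the power `w` and the risk partial value `u2` to the power `1 - w`. The two treatments are hypothetical.

```python
def product_utility(u1, u2, w):
    """Two-criterion product utility: u1**w * u2**(1 - w)."""
    return (u1 ** w) * (u2 ** (1.0 - w))

def product_loss(u1, u2, w):
    """Product loss score, obtained by subtracting the utility from one."""
    return 1.0 - product_utility(u1, u2, w)

# Hypothetical treatments: no benefit and no risk versus moderate on both.
no_benefit = product_utility(u1=0.0, u2=1.0, w=0.5)
moderate = product_utility(u1=0.45, u2=0.5, w=0.5)
print(no_benefit, moderate)
```

Here the treatment with no benefit receives a utility of exactly zero, so any treatment with strictly positive partial values on both criteria is ranked above it.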

Multi-linear model
A multi-linear model for the aggregation of treatments' benefits and risks provides one more alternative for the comparison of two treatments. 22 This model can be seen as an attempt to combine the linear and product models. Under the multi-linear model, the utility score is computed as
$$u^{ML}(\xi_i, w^{ML}) = \sum_{j=1}^{n} w^{ML}_j u_j(\xi_{ij}) + \sum_{j<k} w^{ML}_{j,k}\, u_j(\xi_{ij})\, u_k(\xi_{ik}) + \cdots + w^{ML}_{1,\ldots,n} \prod_{j=1}^{n} u_j(\xi_{ij}), \tag{12}$$
where the superscript $ML$ refers to the multi-linear model, and $w^{ML}_{j,k,\ldots}$ is the weight given to the interaction term between criteria $j, k, \ldots$ We require all the weights in the ML model to sum to 1. The expression (12) is used in equations (2) and (4) to compare the associated multi-linear scores for a pair of treatments.
Considering the example with two criteria, the multi-linear utility score for treatment $i$ at fixed parameter values $\theta_{i1}, \theta_{i2}$ takes the form
$$u^{ML}(\theta_{i1}, \theta_{i2}, w^{ML}) = w^{ML}_1 u_1(\theta_{i1}) + w^{ML}_2 u_2(\theta_{i2}) + w^{ML}_{1,2}\, u_1(\theta_{i1})\, u_2(\theta_{i2}). \tag{13}$$
Note that even under the constraint that the weights sum to one, there is one more weight parameter than for the linear and product models. This can immediately make the weight elicitation procedure more involved for all stakeholders.
To link the weights of the ML model with the rest of the competing approaches (see more details in Section 2.3), we set up one more constraint, so that the number of weight parameters is the same in all considered models (for the purpose of the comparison in this manuscript). Specifically, we fix $w^{ML}_{1,2} = c$ where $0 \le c \le 1$, implying that we fix the effect of the interaction term. As for the linear and product models, this utility score can be transformed into a loss score by subtracting it from one:
$$l^{ML}(\theta_{i1}, \theta_{i2}, w^{ML}) = 1 - u^{ML}(\theta_{i1}, \theta_{i2}, w^{ML}). \tag{14}$$
The contours of equal multi-linear loss score for all values of $u_1(\theta_{i1})$ and $1 - u_2(\theta_{i2})$, with $c = 0.20$, are given in panel C of Figure 1 using $w^{ML}_1 = 0.40$ (top row) and $w^{ML}_1 = 0.15$ (bottom row). The contour lines demonstrate an almost linear trade-off between benefit and risk, with a slight curvature (which becomes more prominent further away from the more desirable values), indicating a moderate penalisation of extreme values. This shows that while this model attempts to penalise the undesirable criterion values, the effect is not as strong as in the product model, admittedly due to the chosen value of the weight $w^{ML}_{1,2}$ given to the interaction term. With this moderate level of penalisation, treatments with no benefit or extreme risk can still be recommended, as is the case in the linear model. The greater the weight of the interaction term, the less likely this is to happen.
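The role of the interaction weight can be sketched in a few lines of Python. The code assumes the two-criterion form with main weights `w1` and `w2 = 1 - c - w1` and an interaction weight `c`, so that all weights sum to one; the function name and the treatment values are hypothetical.

```python
def multilinear_utility(u1, u2, w1, c):
    """Two-criterion multi-linear utility with interaction weight c.
    The second main weight is set so that w1 + w2 + c = 1."""
    w2 = 1.0 - c - w1
    return w1 * u1 + w2 * u2 + c * u1 * u2

# With c = 0 the score is the linear one.
linear_case = multilinear_utility(0.3, 0.7, w1=0.4, c=0.0)

# With c = 0.2, a hypothetical treatment with no benefit and no risk still
# scores higher than one with low benefit and moderate risk.
no_benefit = multilinear_utility(0.0, 1.0, w1=0.4, c=0.2)
low_moderate = multilinear_utility(0.3, 0.5, w1=0.4, c=0.2)
print(linear_case, no_benefit, low_moderate)
```

In this example the zero-benefit treatment keeps a utility of about 0.4 against about 0.35 for the comparator, which illustrates why a moderate interaction weight does not rule out such recommendations.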

SLoS model
An alternative to the models proposed above is the Scale Loss Score (SLoS) model, which was proposed by Saint-Hilary et al. 14 to satisfy the two desirable properties for an aggregation method. In contrast to the three models above, SLoS considers a loss score, rather than a utility score, as the output. Therefore, lower values are more desirable. Under the SLoS model, the loss score is computed as
$$l^S(\xi_i, w^S) = \sum_{j=1}^{n} \left(\frac{1}{u_j(\xi_{ij})}\right)^{w^S_j}, \tag{15}$$
where the superscript $S$ refers to the SLoS model. The expression (15) is used in equations (3) and (5) to compare the associated SLoSs for a pair of treatments. The loss score could theoretically be transformed into a utility score as $u^S(\xi_i, w^S) = -l^S(\xi_i, w^S)$. However, this form is usually not used because it provides negative utility values, which is not intuitive for a utility concept.
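Coming back to the example with two criteria, the loss score for treatment $i$ at fixed parameter values $\theta_{i1}, \theta_{i2}$ takes the form
$$l^S(\theta_{i1}, \theta_{i2}, w^S) = \left(\frac{1}{u_1(\theta_{i1})}\right)^{w^S} + \left(\frac{1}{u_2(\theta_{i2})}\right)^{1 - w^S}. \tag{16}$$
The contours of equal scale loss score for all values of $u_1(\theta_{i1})$ and $1 - u_2(\theta_{i2})$ are given in panel D of Figure 1 using $w^S = 0.5$ (top row) and $w^S = 0.25$ (bottom row).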
As is the case with the product model, the SLoS penalisation makes it impossible for treatments with either no benefit or extreme risk to be recommended over other potential treatments, in contrast to the linear and multi-linear models (which can recommend such treatments). This is because a treatment with either of these characteristics would return a loss score of infinity (regardless of the values of any other criteria) and would therefore be non-recommendable. On the figure, the white colour at extreme undesirable values (either very low benefit or very high risk) corresponds to very high or infinite loss scores and demonstrates the penalisation effect.
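The infinite penalty can be reproduced with a short sketch. It assumes the two-criterion SLoS form in which the loss is the sum of `(1 / u1) ** w` and `(1 / u2) ** (1 - w)`; NumPy is used so that a partial value of zero yields `inf` rather than raising an error, and the treatment values are hypothetical.

```python
import numpy as np

def slos_loss(u1, u2, w):
    """Two-criterion Scale Loss Score: (1/u1)**w + (1/u2)**(1 - w).
    A partial value of zero gives an infinite loss."""
    u1 = np.asarray(u1, dtype=float)
    u2 = np.asarray(u2, dtype=float)
    with np.errstate(divide="ignore"):  # silence the warning raised at u = 0
        return np.power(u1, -w) + np.power(u2, -(1.0 - w))

best_possible = slos_loss(1.0, 1.0, w=0.5)  # maximum benefit, no risk
no_benefit = slos_loss(0.0, 1.0, w=0.5)     # hypothetical zero-benefit treatment
moderate = slos_loss(0.45, 0.5, w=0.5)      # hypothetical moderate treatment
print(best_possible, no_benefit, moderate)
```

Under this form the smallest attainable loss with two criteria is 2, reached at maximum benefit and no risk, and any finite loss is preferred to the infinite loss of a treatment with no benefit.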
Of note, Figure 1 displays the contours of equal loss score for all the models, so all the plots in this figure can be interpreted in the same way, with lower scores (in blue) corresponding to more desirable benefit-risk profiles. Even when these contour plots use the same numerical values of the weights, the weights themselves are different in each model (as indicated by the different superscripts). Therefore, to provide a fair comparison of these models, it is important to ensure that the models carry (approximately) the same relative importance of the criteria, defined through the slope of the contour lines. We propose an approach to match the relative importance across the models below.

Weight elicitation and mapping
Methods for quantifying subjective preferences, for example, Discrete Choice Experiment and Swing-Weighting, have been widely studied in the literature. 6,7,24,25 Applied to drug BRA, the majority of the weight elicitation methods concern the linear model. In the linear model framework, the weight assigned to one criterion is interpreted as a scaling factor which relates one increment on this criterion to increments on all other criteria.
Note that each of the aggregation models uses its own weights, $w^L$, $w^P$, $w^{ML}$ and $w^S$. However, in the actual analysis, regardless of the aggregation model used, one can expect only one underlying level of the relative importance of the considered benefit and risk criteria, as the stakeholders' preferences between the criteria should not depend on the methodology used for the decision making. Therefore, when applying different models to the same problem, it is crucial to make sure that they reflect the same stakeholders' preferences. We adapt the approach proposed by Saint-Hilary et al. 14 to achieve that. Since comprehensive work has been published and is currently being continued on weight elicitation for the linear model, we map the weights $w^L_j$ (hypothetically) elicited for the linear model to the weights $w^P$, $w^{ML}$ and $w^S$ such that they reflect the same trade-off preferences between the criteria.

Mapping for two criteria
As described in Saint-Hilary et al., 14 the trade-off between the criteria can formally be represented by the slope of the tangent to the contour line at the point (0.5, 0.5) (see the red lines in the contour plots of panels B to D in Figure 1). Therefore, the expressions for the mapping of the linear weight to the competing models are found by equating the slopes of the tangents to the corresponding contour lines.
We start from the setting with two criteria. As stated above, even in the two-criteria setting, the multi-linear model requires one more weight to be specified. Therefore, we impose a constraint on the weight corresponding to the interaction term to obtain a unique solution for the mapped weight $w^{ML}$, specifically $w^{ML}_{1,2} = c$. Note that for $c = 0$, the multi-linear model reduces to the linear one, and for $c = 1$ it becomes the product of the two criteria values.
Using the utility/loss scores $z^P$, $z^{ML}$, $z^S$ obtained at point $(u_1(\theta_{i1}), u_2(\theta_{i2}))$, the expressions of the equality of the tangents with two criteria take the form of equation (17), where the slope for the linear model is given on the left-hand side, and the slopes for the product, multi-linear and SLoS models are given on the right-hand side, respectively. Note, however, that the slope of the tangent to the contours for the linear model is constant for all values of the parameters and is defined by the weights $w^L_j$ only, while the slopes for the competing models change with the values of the criteria. For the purpose of the weight mapping, we interpret $w^L_j$ as an average relative importance of each criterion over the others, and match the slopes of the tangents to the corresponding contours at the middle point, $u_1(\theta_{i1}) = u_2(\theta_{i2}) = 0.5$. 14 Then, the equalities above reduce to
$$w^P = w^L, \qquad w^{ML}_1 = w^L - \frac{c}{2}, \qquad \frac{w^S}{1 - w^S}\, 2^{\,2 w^S - 1} = \frac{w^L}{1 - w^L}. \tag{18}$$
Therefore, the product weight coincides with the linear weight at the given middle mapping point. For the SLoS model, the weight mapping does not have an analytical solution, but the approximate value of $w^S$ can be obtained by line search. Figure 2 shows the mapping from the linear model to the multi-linear and SLoS models. It demonstrates how the value for the linear model (x-axis) can be used to find the respective weights for the multi-linear and SLoS models on the y-axes.
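The numerical search for the SLoS weight can be sketched as follows. The sketch assumes the two-criterion SLoS form with loss `(1 / u1) ** w + (1 / u2) ** (1 - w)`, for which the contour slope at the midpoint (0.5, 0.5) works out to `w / (1 - w) * 2 ** (2 * w - 1)`; this slope expression is derived here under that assumption. Bisection is used as one simple stand-in for the line search mentioned in the text, and all function names are hypothetical.

```python
def slos_slope(w_s):
    """Contour slope of the assumed two-criterion SLoS loss at (0.5, 0.5)."""
    return w_s / (1.0 - w_s) * 2.0 ** (2.0 * w_s - 1.0)

def slos_to_linear(w_s):
    """Back-transform a SLoS weight to the linear weight with the same slope."""
    s = slos_slope(w_s)
    return s / (1.0 + s)

def linear_to_slos(w_lin, tol=1e-10):
    """Find the SLoS weight whose midpoint slope equals w_lin / (1 - w_lin).
    slos_slope is increasing on (0, 1), so bisection converges."""
    target = w_lin / (1.0 - w_lin)
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slos_slope(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(linear_to_slos(0.25), slos_to_linear(0.3))
```

Under these assumptions a linear weight of 0.25 maps to a SLoS weight of roughly 0.30, and equal linear weights map to equal SLoS weights.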
One can note that for the multi-linear model, the proposed mapping process may result in negative mapped weights. This follows from the form of the weight mapping function in the two-criteria case: if the value of a weight under the linear model is less than half the value of $c$, then it will map to a negative value (which, in theory, gives the criterion a negative importance, which is impossible) to reflect the same relative importance as induced by the linear model. Intuitively, if the interaction term already contributes more to the importance of one of the criteria in the interaction than intended, the model needs to subtract the 'excessive' importance from the weight corresponding to this criterion standing alone. Whilst this effect can be avoided by setting an upper limit on the values $c$ can take, this in turn limits the effect of the interaction terms, and can make the model more similar to the linear model. This is demonstrated in Figure 2 for $c = 0.2$, where any weights for the linear model that are given a value of 0.1 or less are mapped to 0 in the multi-linear model, rather than to a negative value.
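The two-criterion mapping for the multi-linear model and its truncation at zero can be sketched as below. The sketch assumes the mapped main weight is the linear weight minus half of the interaction weight `c`, as implied by the discussion above; the function name and the `truncate` flag are hypothetical.

```python
def linear_to_multilinear(w_lin, c, truncate=True):
    """Map a linear weight to the main multi-linear weight for two criteria.
    Assumes w_ML = w_lin - c / 2; negative results are optionally set to 0."""
    w_ml = w_lin - c / 2.0
    if truncate:
        return max(w_ml, 0.0)
    return w_ml

for w_lin in (0.5, 0.25, 0.05):
    print(w_lin, linear_to_multilinear(w_lin, c=0.2),
          linear_to_multilinear(w_lin, c=0.2, truncate=False))
```

With `c = 0.2`, linear weights of 0.5 and 0.25 map to about 0.40 and 0.15, while a linear weight of 0.05 would map to a negative value and is therefore set to 0 when truncation is applied.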
The mapped weights for the multi-linear and SLoS models do not have a direct interpretation, and should be back-transformed to linear weights to be interpreted. For instance, with two criteria, a weight of 0.3 for SLoS corresponds to a weight of 0.25 for the linear model. Therefore, it still means that the risk criterion (linear weight 0.75) is three times as important, on average, as the benefit criterion (linear weight 0.25). Since the values are different but the underlying interpretation remains the same, mapping the weights provides the fairest comparison between the models.
Proof for the above workings is given in the Supplemental Material.

Mapping for setting with more than two criteria
The derivation above concerns the setting with two criteria only but can be directly extended for the product and SLoS models. Specifically, one can apply the proposed mapping function marginally to each of the weights in the setting with more than two criteria. This would imply that the weights are mapped with respect to the importance of all other criteria rather than a single benefit (or risk). 14 The extension for the multi-linear model, however, is less straightforward. Generally, it would be a much more involved procedure to elicit weights for all the interaction terms, as their number increases noticeably if more than two criteria are considered. Specifically, in the case study considered in Section 3, there are four criteria, resulting in 11 interaction terms. Following the two-criteria setting, we suggest fixing the total weight attributed to all the interactions to be equal to $c = 0.2$. Then, the ML model for the setting with $n = 4$ criteria takes the form
$$u^{ML}(\xi_i, w^{ML}) = \sum_{j=1}^{4} w^{ML}_j u_j(\xi_{ij}) + \frac{c}{2^n - n - 1} \left( \sum_{j<k} u_j(\xi_{ij})\, u_k(\xi_{ik}) + \sum_{j<k<l} u_j(\xi_{ij})\, u_k(\xi_{ik})\, u_l(\xi_{il}) + \prod_{j=1}^{4} u_j(\xi_{ij}) \right),$$
where the fraction $c/(2^n - n - 1)$ ensures that the weights of all the interaction terms sum to $c$ and that this total is split equally between all interaction terms. To calculate the individual weights $w^{ML}_j$, $j = 1, \ldots, n$, a mapping from the linear weights can again be used. In order for the weights to sum to 1, the transformation $w^{ML}_j = w^L_j - c/n$ could be applied. For $n = 2$, this translates into the corresponding mapping in equation (18). While this procedure does not guarantee the equality of the slopes of the tangents, it emphasises the potential challenge associated with the use of the multi-linear model, which should be taken into account when considering it.
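The equal split of the interaction weight can be sketched for an arbitrary number of criteria. The code assumes that every interaction term (every subset of two or more criteria) receives the weight `c / (2**n - n - 1)` and that the main weights are the linear weights minus `c / n`; no truncation of negative main weights is applied, and the function name is hypothetical.

```python
from itertools import combinations
from math import prod

def multilinear_utility_n(u, w_lin, c=0.2):
    """Multi-linear utility for n criteria with total interaction weight c
    split equally over all 2**n - n - 1 interaction terms."""
    n = len(u)
    n_interactions = 2 ** n - n - 1
    main_weights = [w - c / n for w in w_lin]  # assumed mapping w_lin - c/n
    main = sum(w * x for w, x in zip(main_weights, u))
    interactions = sum(
        prod(u[j] for j in subset)
        for size in range(2, n + 1)
        for subset in combinations(range(n), size)
    )
    return main + (c / n_interactions) * interactions

# Four criteria with equal linear weights; partial values are hypothetical.
print(multilinear_utility_n([0.6, 0.8, 0.9, 0.7], [0.25] * 4))
```

A quick sanity check of the construction is that a treatment with all partial values equal to one obtains a utility of one, since the main weights sum to `1 - c` and the interaction weights sum to `c`.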

Case study
In this section, the performance of the four aggregation models is illustrated in the setting of an actual case study. This provides an insight into how the various models perform, and what difference in decision making they induce when applied to real-life data. The case study analyses the effects of two treatments for depression (Venlafaxine and Fluoxetine) compared to a placebo. It uses data from Nemeroff, 17 and expands on the studies conducted by Tervonen et al. 12 and Saint-Hilary et al. 13 Here, the benefit criterion is the treatment response (an increase from baseline score of Hamilton Depression Rating Scale of at least 50%), and the three risk criteria are nausea, insomnia and anxiety. Table 1 shows the outcomes of the trial for the two treatments and the placebo. For all criteria, we approximate the distributions of the event probabilities by Beta distributions $B(a, b)$, with $a$ = number of occurrences and $b$ = (number of patients − number of occurrences) of the considered event (response or adverse event), assuming Beta(0,0) priors. We generated 100,000 samples from each distribution. These samples are then used to approximate the distributions of the linear partial value functions (PVFs) as defined in equation (1) for all criteria and all treatment arms, with the following most and least preferred probabilities of occurrence $\xi'_j$ and $\xi''_j$:
• Most and least preferable values $\xi'_j = 0.8$ and $\xi''_j = 0.2$ for the response.
• Most and least preferable values $\xi'_j = 0$ and $\xi''_j = 0.5$ for the adverse events.
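The sampling step described above can be sketched as follows. The event counts are hypothetical placeholders, not the trial data in Table 1, and the clipping of partial values to [0, 1] is an assumption for draws that fall outside the most and least preferred values.

```python
import numpy as np

rng = np.random.default_rng(2024)

def posterior_pvf_draws(events, patients, most_pref, least_pref, size=100_000):
    """Draw event probabilities from Beta(events, patients - events), i.e. the
    posterior under a Beta(0, 0) prior, and apply a linear partial value
    function. Requires 0 < events < patients."""
    draws = rng.beta(events, patients - events, size=size)
    u = (draws - least_pref) / (most_pref - least_pref)
    return np.clip(u, 0.0, 1.0)  # clipping is an assumption

# Hypothetical counts for one treatment arm (NOT the data in Table 1).
u_response = posterior_pvf_draws(events=50, patients=100,
                                 most_pref=0.8, least_pref=0.2)
u_nausea = posterior_pvf_draws(events=20, patients=100,
                               most_pref=0.0, least_pref=0.5)
print(u_response.mean(), u_nausea.mean())
```

The resulting arrays of partial values can then be passed, draw by draw, to any of the aggregation models to obtain posterior samples of the utility or loss score.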
This case study considers three different weighting combinations, which were used under the linear model by Saint-Hilary et al. 13 These sets of weights correspond to three different scenarios of the relative importance of the criteria for the stakeholders. The first scenario reflects the case when all four criteria are equally important. The second scenario corresponds to the benefit criterion having more relative importance than all risk criteria together. The third scenario can be considered a 'safety first' scenario, in which each risk criterion has a higher weight than the benefit criterion. As discussed in Section 2.3, the weights of the criteria for the product, multi-linear and SLoS models are obtained by mapping. Note, again, that while the multi-linear model might not induce exactly the same average relative importance of the criteria, the proposed procedure controls the contribution of the interaction terms to the decision at the given level of $c = 0.20$, and is therefore used for the sake of simplicity. The mapped weights for each of the three scenarios are presented in Table 3.
Three pairwise comparisons are made: Venlafaxine against Fluoxetine, Venlafaxine against Placebo and Fluoxetine against Placebo. We consider that one treatment is recommended over another if the probabilities defined in (4) or (5) are greater than ψ = 0.8. The probabilities of recommendations under all three scenarios and for each aggregation model are given in Table 4.
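The decision rule can be written as a small helper. The function name and the three-way outcome are hypothetical; the sketch also assumes that ties between scores have probability zero, so that the probability of treatment h being better than treatment i is one minus the probability of i being better than h.

```python
def recommendation(p_ih, psi=0.8):
    """Return 'i' if treatment i is recommended over h, 'h' if h is
    recommended over i, and 'none' if neither probability exceeds psi."""
    if not 0.5 <= psi <= 1.0:
        raise ValueError("psi must lie in [0.5, 1]")
    if p_ih > psi:
        return "i"
    if 1.0 - p_ih > psi:  # assumes no ties between the two scores
        return "h"
    return "none"

# Fluoxetine (i) against placebo (h) in the first scenario: 0.473 is the SLoS
# probability reported in the text, 0.10 is an illustrative value at the upper
# end of the range reported for the linear and multi-linear models.
print(recommendation(0.473), recommendation(0.10))
```

With the threshold of 0.8, a probability near 0.5 leads to no recommendation in either direction, whereas a probability of 0.10 leads to the comparator being recommended.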
Under the first scenario, with equal weights for all criteria, the treatment with preferable risk criteria values was more likely to be recommended, as the three risk criteria together have a greater weight than the one benefit criterion. For the comparison between Venlafaxine and Fluoxetine, the probability that Venlafaxine has better benefit-risk characteristics is around 1.7%-1.8% under all four models. For the comparison between Venlafaxine and the placebo, there is only a minor difference in the probability that Venlafaxine has better benefit-risk characteristics (<0.1% in the linear and multi-linear models, 1.6% in the product model and 3.7% in the SLoS model), not enough of a difference to change the recommendation. However, when comparing Fluoxetine to the placebo, a notable difference is observed. Under the linear and multi-linear models, the probability of Fluoxetine having the better benefit-risk characteristics is around 7%-10% (suggesting the placebo is much more preferable), whilst this rises to 37% under the product model and 47.3% under the SLoS model (suggesting near-parity of treatments). This occurs due to the penalisation of low benefit criterion values for the placebo, where the 95% credible interval includes values close to zero (in bold in Table 2). These low values are harshly penalised under the product and SLoS models, as they suggest that the placebo induces no treatment benefit with a non-negligible probability. The linear model does not account for this and strongly favours the placebo, while the multi-linear model does not penalise these values strongly.
Under the second scenario, the treatment response is considered the most important factor, and is given a weighting greater than that of the three risk criteria combined. For the comparison between Venlafaxine and Fluoxetine, both the product and SLoS models indicate that Venlafaxine has inferior benefit-risk characteristics (42.6% and 36.6% probability of being better, respectively). Results closer to parity are observed with the linear model, which gives a probability of 48.0%, and the multi-linear model, which gives a probability of 46.3%. Again, the difference between the probability under the linear model and those under the product and SLoS models is due to the penalising effects of the latter. This occurs because the interval for the nausea risk criterion contains zero for Venlafaxine (in bold in Table 2), which causes the product and SLoS models to recommend Fluoxetine more often than Venlafaxine, despite the weighting giving preference to the treatment response (which is greater with Venlafaxine). With the multi-linear model, the penalisation of the undesirable nausea criterion is not as strong as in the product or SLoS models, as the weight mapping induces a drop from 0.11 to 0.06 in the weight given to the corresponding individual term, and the effect of the interaction terms is not enough to overcome this.
For the comparison between Venlafaxine and the placebo, the probability that Venlafaxine has better benefit-risk characteristics is between 63% and 75% across the four models. The product and SLoS models both penalise the low benefit value of the placebo, which is why they are both more likely to recommend Venlafaxine than the other two models. Additionally, the product and SLoS models both also penalise the nausea criterion value of Venlafaxine, and due to the increased weighting given to it by the SLoS model mapping, this causes the product model to be more likely to recommend Venlafaxine than the SLoS model. For the comparison between Fluoxetine and the placebo, the probability that Fluoxetine has better benefit-risk characteristics is around 65%-80% under all four models, with the probability of Fluoxetine being preferable increasing as the methods increase the penalisation applied to the placebo's lack of benefit effect. The stronger penalisation occurs under the product and SLoS models, which is why they are both more likely to recommend Fluoxetine.
Table 4. Probability of a treatment being recommended as the best treatment against another for the three pairwise comparisons (Venlafaxine over Fluoxetine, Venlafaxine over Placebo, Fluoxetine over Placebo), using each of the four aggregation models, for each of the three weighting scenarios.
Across all three comparisons, the multi-linear model is always slightly less likely to recommend the treatment with the greater benefit value than the linear model. As this is the scenario where the benefit criterion is considered to be the most important, this shows that the weight splitting with the multi-linear model induces a loss of the preferences that were given when the weights were originally set out for the linear model, illustrating some of the problems theorised in the methods section.
Under the third scenario, a 'safety first' approach is adopted, giving the risk factors a higher weighting. The probability that Venlafaxine has better benefit-risk characteristics is around 0.5%-0.6% when it is compared to Fluoxetine and around 0%-0.6% when it is compared to the placebo, under all four models. For the comparison between Fluoxetine and the placebo, the probability that Fluoxetine has better benefit-risk characteristics is around 2.1%-3.0% for the linear and multi-linear models, whilst this increases to 18.5% under the product model and 30.1% under the SLoS model. This increase occurs for the same reason as outlined for the same comparison in scenario 1: the penalisation of the benefit criterion for the placebo, whose 95% credible interval includes low values (in bold in Table 2). The linear model does not account for this and strongly favours the placebo, while the multi-linear model does not penalise these values sufficiently and still favours the placebo.
Overall, this case study provides a number of important observations that shed light on the differences in performance between the aggregation models. Firstly, the effects of extremely undesirable outcomes (those highlighted in bold in Table 2) are penalised more strongly and more consistently in the product and SLoS models (the penalisation is stronger in the SLoS model than in the product model, although they give the same recommendation for every comparison). Secondly, these examples show that the models provide similar recommendations when one treatment is clearly preferable to its competitor. Lastly, the weight splitting in the multi-linear model induces a change in the relative importance of the criteria, so that it may not reflect the chosen weights as well as the other models do, as highlighted in scenario 2. This makes it less appealing than the other models.
To draw further conclusions regarding the differences between models, we conduct a comprehensive simulation study under various scenarios and under their many different realisations.

Data generation and comparison procedure
The following Bayesian procedure is used for the simulation study:
• Step 1: Simulate randomised clinical trials with two treatments T 1 and T 2 , each with two uncorrelated criteria, and a sample size of N = 100 in each treatment arm.
• Step 2: Derive the posterior distributions using the simulated data assuming a degenerate prior, Beta(0,0), to reduce the influence of the prior distribution. Draw 2000 samples from each posterior distribution of the criteria and obtain the corresponding empirical distribution for the PVF.
• Step 3: Use the posterior distributions of the PVF in each of the aggregation models as given in equations 2 and 3 to compute the probability in equations 4 and 5 that treatment T 1 has the better benefit-risk profile, P 1,2 X (for some model X ), and compare it to the threshold value ψ = 0.8. If P 1,2 X > 0.8, then treatment T 1 is recommended. If P 1,2 X < 0.2, then treatment T 2 is recommended. If 0.2 ≤ P 1,2 X ≤ 0.8, then neither treatment is recommended.
• Step 4: Repeat steps 1-3 for 2500 simulated trials.
• Step 5: Estimate the probability that each treatment is recommended (P(P 1,2 X > 0.8)) by its proportion over the 2500 simulated trials.
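The steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes binary outcomes for each criterion, a linear partial value function (benefit valued at θ, risk at 1 − θ) and only the linear aggregation with equal weights, since the other scores are defined by equations 2 and 3, which are not reproduced here. The function names, the default weights and the small `eps` guard (needed because a Beta(0,0) prior gives an improper posterior when an arm has zero or N events) are assumptions made for illustration.

```python
import numpy as np

def posterior_draws(events, n, size, rng, eps=1e-6):
    """Draw from the Beta(events, n - events) posterior implied by a Beta(0,0) prior.

    eps is an illustrative guard (an assumption): the posterior is improper
    when events is 0 or n, so both parameters are floored at eps.
    """
    return rng.beta(max(events, eps), max(n - events, eps), size=size)

def linear_score(benefit, risk, w_benefit=0.5, w_risk=0.5):
    """Linear utility with assumed linear PVFs: benefit -> theta, risk -> 1 - theta."""
    return w_benefit * benefit + w_risk * (1.0 - risk)

def simulate_trial(theta1, theta2, rng, n=100, draws=2000, score=linear_score):
    """Steps 1-3 for one trial: estimate P(score(T1) > score(T2))."""
    scores = []
    for benefit_p, risk_p in (theta1, theta2):
        # Step 1: two uncorrelated binary criteria in each arm.
        benefit_events = rng.binomial(n, benefit_p)
        risk_events = rng.binomial(n, risk_p)
        # Step 2: posterior samples of each criterion.
        b = posterior_draws(benefit_events, n, draws, rng)
        r = posterior_draws(risk_events, n, draws, rng)
        scores.append(score(b, r))
    # Step 3: proportion of posterior draws in which T1 scores higher.
    return float(np.mean(scores[0] > scores[1]))

def recommend(p12, psi=0.8):
    """Decision rule of step 3; 1 - psi equals 0.2 for psi = 0.8."""
    if p12 > psi:
        return "T1"
    if p12 < 1.0 - psi:
        return "T2"
    return "neither"

def recommendation_rates(theta1, theta2, n_trials=2500, seed=0, **kwargs):
    """Steps 4-5: proportion of simulated trials ending in each recommendation."""
    rng = np.random.default_rng(seed)
    decisions = [recommend(simulate_trial(theta1, theta2, rng, **kwargs))
                 for _ in range(n_trials)]
    return {k: decisions.count(k) / n_trials for k in ("T1", "T2", "neither")}
```

Each treatment is passed as a (benefit probability, risk probability) pair. Supplying a different `score` function would give the corresponding estimate for another aggregation model, provided that a higher score means a better profile.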
The aggregation models will be compared using P(P 1,2 X > 0.8), which is the probability that the model X recommends T 1 over T 2 , and ϕ X −Y = P(P 1,2 X > 0.8) − P(P 1,2 Y > 0.8), which is the difference between the probability that the model X recommends T 1 and the probability that the model Y recommends T 1 . The value of ϕ represents a difference between two probabilities, and can therefore take the range of values −1 ≤ ϕ ≤ 1. If 0 < ϕ ≤ 1, then the model X recommends T 1 more often than model Y . If −1 ≤ ϕ < 0, then the model Y recommends T 1 more often than model X . If ϕ = 0, then the two models make the recommendations with the same probability. Note that, for the ML model, we adopt c = 0.20 as in the case study above.
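As a small worked illustration of ϕ, the sketch below computes the pairwise differences from the proportions of trials recommending T 1 that are reported in the Results for scenario 1 with θ 2,1 = 0.2 and θ 2,2 = 0.1 (2% linear, 70% product, 8% multi-linear, 90% SLoS). The function and variable names are hypothetical.

```python
from itertools import combinations

def phi(p_x, p_y):
    """phi_{X-Y}: difference in the probabilities that models X and Y recommend T1."""
    if not (0.0 <= p_x <= 1.0 and 0.0 <= p_y <= 1.0):
        raise ValueError("both arguments must be probabilities in [0, 1]")
    return p_x - p_y

# Proportions of trials recommending T1, as reported in the Results for
# scenario 1 with theta_{2,1} = 0.2 and theta_{2,2} = 0.1.
p_recommend_t1 = {"linear": 0.02, "product": 0.70, "multi-linear": 0.08, "SLoS": 0.90}

def pairwise_phi(probs):
    """Return phi for every unordered pair of models (X, Y), keyed as 'X-Y'."""
    return {f"{x}-{y}": phi(probs[x], probs[y]) for x, y in combinations(probs, 2)}

if __name__ == "__main__":
    for pair, value in pairwise_phi(p_recommend_t1).items():
        favoured = "first" if value > 0 else "second" if value < 0 else "neither"
        print(f"phi[{pair}] = {value:+.2f} ({favoured} model recommends T1 more often)")
```

For these values, ϕ for the product model against the linear model is 0.70 − 0.02 = 0.68, so the product model recommends T 1 more often.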

Results
The results are presented in Figures 4 and 5. The first seven scenarios referred to above for treatment T 1 are presented in the rows labelled 1-7. Each graph corresponds to fixed expected probabilities of event for treatment T 1 , and each cell corresponds to a combination of expected probabilities of benefit and risk for T 2 . When reference is made to the 'diagonal', this refers to the diagonal line that runs from the bottom left corner of the graph to the top right. In all scenarios, all models agree to recommend T 1 when it is undoubtedly better than T 2 , i.e., when T 1 is more effective and less harmful than T 2 (or to recommend T 2 when T 1 is indisputably worse, i.e., less effective and more toxic). For this reason, the results for scenarios 8 and 9 are not presented here, but are included in the Supplemental Material for completeness.
In Figure 5, the colour of a cell indicates which of the two aggregation models being compared recommends treatment T 1 with the higher probability. For instance, red cells in the first column of Figure 5 (showing ϕ P−L ) indicate that, when the characteristics of T 2 take the corresponding values, the linear model recommends T 1 more often than the product model.
Figure 4. Probability that the model recommends T 1 over T 2 , P(P 1,2 > 0.8), for scenarios 1 to 7 for the linear (red), product (purple), multi-linear (orange) and SLoS (blue) models.
In scenario 1, the four models agree in recommending T 1 when T 2 has less benefit and more risk. On the diagonal, the product and SLoS models both favour T 1 over T 2 when T 2 has either extremely high benefit and risk (top right corner) or extremely low benefit and risk (bottom left corner), compared to either the linear or multi-linear models. This occurs due to the penalisation of extremely low benefit and extremely high risk by the product and SLoS models. Comparing the product and SLoS models for these values of benefit and risk, SLoS favours T 1 over T 2 more often for low but not boundary values of the criteria. This occurs because the SLoS model penalises the undesirable qualities more than the product model does (this is similar to the trends observed in the case study). Compared to the linear model, the multi-linear model recommends T 1 over T 2 with higher probability when T 2 has either higher benefit and higher risk, or lower benefit and lower risk, due to the interaction term providing mild penalisation of extremely high risk or extremely low benefit. However, in most cases, the difference between the SLoS and product models is larger than that between the linear and multi-linear models.
For example, when T 2 has criteria values θ 2,1 = 0.2 (benefit), θ 2,2 = 0.1 (risk) (lower benefit, lower risk), T 1 is recommended in 2% of the trials under the linear model, in 70% under the product model, in 8% under the multi-linear model and in 90% under the SLoS model. This tells us that the product and SLoS models do not consider the decrease in risk to be worth the accompanying decrease in benefit (the SLoS model more so than the product model), whilst the linear and multi-linear models both consider it acceptable. Considering the case when θ 2,1 = 0.7, θ 2,2 = 0.7 (higher benefit and higher risk compared to T 1 ), T 1 is recommended in 20% of the trials under the linear model compared to 49% for the product model, 25% for the multi-linear model and 61% for the SLoS model. This tells us that the product and SLoS models do not consider the increase in benefit to be worth the accompanying increase in risk (again, this effect is stronger in the SLoS model than in the product model), whilst the linear and multi-linear models both consider it acceptable (the linear model more so than the multi-linear model). Similar observations can be made in scenarios 2 and 3.
However, a distinguishing difference between the models under scenario 1 can be found when T 2 has the criteria θ 2,1 = 0.9, θ 2,2 = 0.7. In this comparison, T 1 is recommended in 0% of the trials under the linear model compared to 11% for the product model, 0% for the multi-linear model and 30% for the SLoS model. Meanwhile, T 2 is recommended in 92% of the trials under the linear model compared to 32% for the product model, 84% for the multi-linear model and 13% for the SLoS model. This shows that the linear, product and multi-linear models are all more likely to recommend T 2 , whilst only the SLoS model is more likely to recommend T 1 . This occurs due to the different strengths of penalisation between the models: only the SLoS model does not consider this an acceptable trade-off. This shows that the product model and the SLoS model do not always make the same recommendations, and that these differences can sometimes be quite large.
In scenario 4, where T 1 has extremely low benefit and risk, it is very rarely recommended by either the product or SLoS models, whereas it is recommended by both the linear and multi-linear models in cases where T 2 has some increase in benefit but a larger increase in risk. This occurs because the SLoS and product models penalise extremely low benefit so severely that the level of risk has almost no impact on the recommendation. The multi-linear model also penalises the extremely low benefit, but on a much smaller scale. For example, for T 2 with criteria values θ 2,1 = 0.6, θ 2,2 = 0.7, T 1 is recommended with probability 68% under the linear model and 41% under the multi-linear model, and is never recommended under the product or SLoS models. This shows that the product and SLoS models reflect the desirable property outlined above, namely that we are not interested in the risk criterion value of a treatment if the benefit criterion value is small or zero, whilst the linear and multi-linear models do not reflect this (although the multi-linear model does penalise it somewhat). Similar results are observed in scenario 5, where T 1 has extreme risk and extreme benefit. The SLoS and product models will recommend T 2 if it has lower risk than T 1 as long as it has some benefit, whereas the linear model and the multi-linear model will recommend T 1 over T 2 if the benefit of T 2 decreases by a greater amount than the risk.
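The contrast between additive and multiplicative aggregation can be illustrated numerically. The sketch below compares a weighted arithmetic mean (a linear aggregation) with a weighted geometric mean, which is used here only as a stand-in for a multiplicative aggregation; it is an assumption made for illustration and is not claimed to match the product or SLoS formulas given in the methods section. Converting risk to a value via 1 − θ is also an assumption, and both treatments are hypothetical.

```python
def linear_aggregate(benefit, risk, w=0.5):
    """Weighted arithmetic mean of the benefit value and the risk value (1 - risk)."""
    return w * benefit + (1.0 - w) * (1.0 - risk)

def multiplicative_aggregate(benefit, risk, w=0.5):
    """Weighted geometric mean; an illustrative stand-in, not the article's formula."""
    return (benefit ** w) * ((1.0 - risk) ** (1.0 - w))

# Hypothetical treatments: (probability of benefit, probability of risk).
no_benefit_safe = (0.0, 0.1)
moderate = (0.4, 0.6)

for name, aggregate in (("linear", linear_aggregate),
                        ("multiplicative", multiplicative_aggregate)):
    a = aggregate(*no_benefit_safe)
    b = aggregate(*moderate)
    print(f"{name}: no-benefit treatment {a:.2f}, moderate treatment {b:.2f}")
```

Under these assumed forms, the linear score ranks the treatment with no benefit above the moderate one, whereas the multiplicative score is zero for the treatment with no benefit whatever its risk.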
It should be noted that poor recommendations can be made under the product and SLoS models if both T 1 and T 2 have a risk criterion value of 0.9, as the strength of the penalisation of the undesirable criteria overpowers the effect of the benefit. For example, in scenario 5 where T 2 has criteria values θ 2,1 = 0.8, θ 2,2 = 0.9 (the same risk criterion value as T 1 but a lower benefit criterion value), T 1 is recommended with probability 75% under the linear model, 27% under the product model, 68% under the multi-linear model and 23% under the SLoS model (this effect is stronger in the SLoS model than in the product model due to its harsher penalisation of the undesirable criteria). The product and SLoS models recommended T 2 with probabilities 13% and 17%, respectively, showing that they still recommend the better treatment T 1 more often than T 2 , but that these two models hardly discriminate between very unsafe drugs (for comparison, the linear and multi-linear models each recommended T 2 with probability 1%).
In scenarios 6 and 7, all AM recommend T 1 over T 2 when T 2 is unarguably worse (similarly, they all recommend T 2 over T 1 when T 1 is unarguably worse). Along the diagonal, the SLoS model recommends T 1 over T 2 more often than the other AM when T 2 has either extremely low benefit and extremely low risk, or extremely high benefit and extremely high risk, compared to T 1 (although the product model recommends T 1 only slightly less often than the SLoS model). Again, this is the result of the penalisation of extremely low benefit or extremely high risk criteria. Similarly, the multi-linear model recommends T 1 over T 2 more often than the linear model in the same circumstances. For example, in scenario 6, when T 2 has criteria values θ 2,1 = 0.2, θ 2,2 = 0.2 (lower benefit and lower risk), T 1 is recommended with probability 21% under the linear model, 59% under the product model, 28% under the multi-linear model and 68% under the SLoS model. This shows how the different levels of penalisation affect the recommendations: the stronger the penalisation of the undesirable low benefit criterion value, the more likely an AM is to recommend T 1 . This is the reason for the large difference between the recommendations of the linear and SLoS models.
Overall, the simulation study has shown that, for the two criteria having an equal relative importance, SLoS penalises extremely low benefit and extremely high risk criteria the most, whilst the product model penalises these moderately, acting as a sort of middle ground between the linear and SLoS models. The multi-linear model offers a small amount of penalisation (less than the product model), but due to the added complexity of this model when more criteria are added, it should not be recommended over either the SLoS model or the product model. The linear and multi-linear models both recommend treatments with no benefit/high risk over other viable alternatives, which contradicts conditions set out by Saint-Hilary et al. 14 Therefore we can provisionally conclude that the two models that appeal most at this point are the product and SLoS models.

Sensitivity analysis: Correlated criteria
The results above concerned the case in which the two criteria are uncorrelated. However, it might be reasonable to assume that the criteria for one treatment are correlated. In this section, we study how robust the recommendations of each of the four models are to correlation between the benefit and risk criteria. We consider two cases: a strong positive correlation (ρ = 0.8) and a strong negative correlation (ρ = −0.8) between the criteria. The correlated outcomes were generated using a procedure laid out in Mozgunov et al. 26 We study how likely the correlated outcomes are to change the final recommendation of one of the treatments. Specifically, we study the proportion of cases under each of the scenarios in which the probability of recommending treatment T 1 , P(P 1,2 X > 0.8), changes by at least 2.5% and by at least 5%.
Table 5 shows the number of cases (out of 81) under each of the nine scenarios in which the probability of recommending T 1 over T 2 changes by at least 2.5% and by at least 5% when the positively correlated criteria are compared with the uncorrelated criteria. The case investigating the effects of negative correlation shows similar results to those presented here, and is included in the Supplemental Material. For example, the first entry in Table 5 shows that in 37% of cases under scenario 1, the probability of recommending T 1 changes by at least 2.5% if the linear model is used. Table 5 shows that all four models are most affected by correlation under scenario 1, in which the characteristics of T 1 are in the middle of the unit interval. This effect is, however, less prominent for the product and SLoS models. At the same time, under scenarios 2 to 7, the correlation has a larger effect on the linear and multi-linear models than on the other two models. Scenarios 8 and 9 are hardly affected by any correlation, and the effect is similar across all four models.
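One generic way to simulate a pair of correlated binary outcomes for this kind of sensitivity analysis is sketched below. It is not the procedure of Mozgunov et al. 26, which is not reproduced here. It assumes that the joint distribution of two Bernoulli variables is fixed by their marginal probabilities and their Pearson correlation ρ, and it raises an error when the requested correlation cannot be attained for the given marginals. All function names are hypothetical.

```python
import numpy as np

def joint_probabilities(p_benefit, p_risk, rho):
    """Cell probabilities (p11, p10, p01, p00) of two Bernoulli variables.

    Uses P(both = 1) = p1 * p2 + rho * sqrt(p1 (1 - p1) p2 (1 - p2)), which
    follows from the definition of the Pearson correlation.
    """
    sd = np.sqrt(p_benefit * (1 - p_benefit) * p_risk * (1 - p_risk))
    p11 = p_benefit * p_risk + rho * sd
    cells = np.array([p11, p_benefit - p11, p_risk - p11,
                      1 - p_benefit - p_risk + p11])
    if np.any(cells < -1e-12):
        raise ValueError("correlation not attainable for these marginals")
    # Remove tiny negative values caused by rounding, then renormalise.
    cells = np.clip(cells, 0.0, None)
    return cells / cells.sum()

def simulate_arm(p_benefit, p_risk, rho, n, rng):
    """Simulate n patients; returns (number of benefit events, number of risk events)."""
    n11, n10, n01, _ = rng.multinomial(n, joint_probabilities(p_benefit, p_risk, rho))
    return n11 + n10, n11 + n01
```

The resulting event counts could replace the independently simulated counts in step 1 of the simulation procedure. Note that ρ = ±0.8 is not attainable for every pair of marginal probabilities under this construction, which is why the feasibility check is included.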
Overall, the SLoS model is the least affected by correlation between the criteria, the product model is the second least affected whereas the multi-linear (for the threshold 2.5%) and the linear model (for the threshold 5%) are the most affected ones.

Discussion
In this article, four potential AM are investigated for use in benefit-risk analyses: the linear model, the product model, the multi-linear model and the SLoS model. The differences between these models were highlighted in a case study and a simulation study.
In most clear cases (i.e. when one treatment has more benefit and less risk than the competitor), all AM gave similar recommendations. However, in cases where one treatment had either no benefit or extreme risk, the models which penalised undesirable values more (the product and SLoS models) gave more desirable recommendations: non-effective or extremely unsafe treatments are never recommended. Furthermore, with these models, more risk is accepted in order to increase benefit when the amount of benefit is small than when it is high (or less benefit is desirable to reduce risk when the amount of risk is high than when it is small), which is consistent with the well established assumption of nonlinearity of human preferences. 20 It should be noted that these models hardly discriminate two treatments that slightly differ but have both extremely undesirable properties. However, in this case, none of the treatments should be recommended anyway.
The effects of correlations between criteria were also investigated in this study. The overall effect of correlations was small to negligible in the product and SLoS models, showing that these AM are not much affected by correlations between the criteria. However, the linear and multi-linear models were more likely to see a 2.5% or 5% change in the probability of recommending one treatment over another, showing that they are more affected by correlations between the criteria.
Table 5. Number of times (%) when the difference in recommending T 1 changes by at least 2.5% or 5% between the positively correlated criteria and the non-correlated criteria, for the linear, product, multi-linear and SLoS models.
A simple mapping was applied to obtain multi-linear and SLoS weights from linear weights, so that the models could be fairly compared while preserving the weight interpretation. However, since the mapping is not far from an identity transformation, omitting it would not have a major impact on the results, as demonstrated in Saint-Hilary et al. 2018 14 for SLoS.
Overall, the two models to recommend from this investigation are the product model and the SLoS model, depending on how severely the decision maker wishes to penalise treatments with either no benefit or extreme risk (moderate penalisation: product model, strong penalisation: SLoS model). The multi-linear model, whilst acting as a middle ground between the linear model and the product and SLoS models in the simulation study, involves increased complexity: additional terms must be added, and the weight mapping becomes more difficult. This model also struggled to truly reflect the weightings given in the case study, especially in scenario 2. Because of this, we do not recommend this AM over the product or SLoS models. Additionally, the linear and multi-linear models should not be recommended, as neither of these models has the two desirable properties outlined in Saint-Hilary et al. 14 : that treatments with no benefit or extreme risk should not be recommended, and that a larger increase in risk is accepted in order to increase the benefit if the benefit is small than if the benefit is high, both of which are present in the product and SLoS models.