Additional background to the matter arising
The sponsor has said “The analysis in the table (presented on page 7.8. ESC ADV 4) is the same post-hoc analysis that was presented in the previous submission. In relative terms, the results are no longer different between the SQ3 sub-group and the non-SQ3 sub-group (7.8. ESC ADV 4). We are again frustrated that a post-hoc analysis completed during the PES evaluation of the previous submission which was judged to be ambiguous during that evaluation is again presented in the ESC advice as unequivocal. This analysis is based on summary data. We have not replicated this analysis in the submission as we have already presented a treatment effect analysis conducted using the patient level data. These data presented in both the presented and previous submission, also presented during the hearing for the previous submission, demonstrated a significant treatment effect. The Cochran-Armitage test for trend demonstrated that baseline severity of fracture was a predictor of subsequent fracture in the placebo group, this effect was not significant for the teriparatide treated group demonstrating that teriparatide breaks the cycle of accelerating fracture risk associated with fracture severity. For teriparatide 20 ug/day, the Armitage Trend Test Statistic = 0.7909, p-value = 0.429, for patients treated with placebo the Armitage Trend Test Statistic = -3.6764, p-value = 0.002.”
In the re-submission for consideration at the March 2006 meeting of the PBAC, the sponsor states “In discussions with the Chair of the PBAC following the rejection of the previous submission, it was suggested that the PBAC appreciate the lack of ideal comparative data for the proposed listing, but would consider a scientific and clinically valid chain of logic for the proposed positioning and using alendronate as the comparator. This logic is critiqued in the PES commentary at 7.8 PES COM.19 and therefore this response focuses primarily on the response to this critique.
The sub-group analysis was not pre-specified in the statistical plan and the statistical evidence for interaction is weak. For instance, an earlier submission (March 2004) presented data on teriparatide treatment effects in the full trial population stratified by baseline vertebral fractures. The evidence for a trend to greater protective effect with higher numbers of baseline fractures is unconvincing and if anything the trend is in the opposite direction when non-vertebral fractures are the outcome.”
The following is the sponsor response “Eli Lilly Australia agrees that the analysis of fractures by baseline severity was not pre-specified in the statistical analysis plan written in 1994. The GHAC trial was essentially developed to fulfill regulatory purposes. Semiquantitative (SQ) analysis of fractures was not common clinical practice nor was it required as standard by the regulatory agencies. However, SQ grading was included in all protocols for osteoporosis agents (including alendronate, raloxifene and risedronate) but to date only analysed and published for raloxifene. Due to evidence gained since the study of the new treatments, SQ grading of fractures has become more common in research and clinical practice and while low BMD and number of baseline fractures are recognized as independent risk factors for new fracture, evidence from analyses of the placebo group from the Multiple Outcomes of Raloxifene Evaluation (MORE) study demonstrated that baseline fracture severity is the most important predictor of incident fracture.”
Our analysis of the baseline characteristics of SQ3 patients in both the GHAC and MORE studies, as well as the characteristics of SQ3 fracture patients in a large scale European epidemiological study, demonstrates that these patients can be characterized as having both low BMD (within the osteoporotic range) and multiple vertebral fractures at baseline (see Section 2.5.3 of the submission). In the GHAC trial, these two distinct analyses were pre-specified and demonstrate that as the severity of disease increases (based on the risk factors of BMD and number of vertebral fractures), the reduction in risk of new vertebral fracture associated with teriparatide is improved.
Furthermore, the incidence of new moderate (SQ2) or severe (SQ3) fractures was a pre-specified analysis. No new SQ3 fractures were recorded during the GHAC trial, the reduction in risk of new moderate or severe fractures for the ITT sample of the GHAC trial is 0.096 (0.035, 0.266) (see Table 37 of the submission). That is, a 90% reduction in risk of new SQ3 fractures.
The proposed listing is for patients with severe vertebral fractures and focuses on these outcomes. The GHAC study was conducted over a median 19 months compared with 36 months for the FIT-VFA. Due to the low incidence of non-vertebral fractures over the 19 months of the trial the GHAC is not adequately powered for measuring trend in non-vertebral fractures by SQ sub-groups.”
Reviewer’s understanding of the matter involved
A further issue of concern identified in the July 2005 and March 2006 submissions was the validity of the subgroup analysis in the pivotal teriparatide trial, GHAC. The subgroup consists of patients who had a severe (SQ3) prevalent vertebral fracture at baseline, meaning at least a 40% reduction in vertebral height. This subgroup comprised 20.3% (181 of 892 patients) of the total evaluable trial cohort (i.e. those patients with paired radiographs). Subgroup analyses based on risk factors for future fracture were included as a priori analyses. The sponsor identified patients with SQ3 grade fractures using a combination of baseline risk factors, and then performed treatment-by-subgroup interaction analyses (post-hoc). The criteria for patient identification are objective and can be documented using radiological evaluation. The results from the post-hoc subgroup analysis of patients from the GHAC study have been published by Gallagher et al (2005).
An additional issue in the fourth submission is the disparity between the proposed listing for teriparatide and the direct trial data from the patient subgroup analysis. The proposed listing states that the qualifying fracture needs to be painful and appears to imply that all severe vertebral fractures are symptomatic. However, the GHAC trial used new morphometric (not necessarily clinically apparent) vertebral fractures as its primary outcome. The sponsor states in its latest submission that this patient subgroup with severe osteoporosis is readily identifiable from a clinical perspective, presenting with characteristic compression, crush or wedge fractures of the spine, and presumably pain. However, this assumption is not supported by data that accurately quantify the proportion of clinically apparent patients in this particular subgroup.
In the sensitivity analysis, the sponsor has given a figure of up to 70% of patients with vertebral fractures being asymptomatic. This is accurate for a patient population with any grade of vertebral fracture but is not specific to patients with SQ3 grade vertebral fractures. The methods used to adjust for the proportion of painful fractures (as per the requested listing) in the sensitivity analysis are discussed at a later stage of this review (refer to Matter 3, page 34). In addition, it remains unclear what proportion of patients in the severe subgroup would be accurately identified on radiological grounds. In the GHAC study itself, the investigative site and the central radiographic reader disagreed on the presence of any grade of prevalent vertebral fracture in 8.4% (91/1085) of patients. It is reasonable to expect that, outside the controlled environment of a formal trial, discrepancy rates in the identification of baseline radiological criteria would at best be equivalent to those seen in the trial.
Reviewer’s opinion
Subgroup analyses attempt to identify a subset of patients who derive a greater or lesser benefit from therapies than does the average trial patient. Because the sponsor in this matter is requesting a restricted listing in a patient subgroup thought to derive the most benefit from teriparatide, it is appropriate for a subgroup analysis to be considered as part of the submissions. Moreover, I believe it is justifiable because of the plausibility of an increased treatment effect with severity of disease and stage in the natural history of osteoporosis. This particular issue of potential heterogeneity of treatment effect in relation to pathophysiology will be discussed further in Matter 2.
However, the methodology of the subgroup analysis needs to be scrutinized to ensure the validity of the process, appropriate interpretation, and application to everyday clinical settings. Several prominent authors on analysis of clinical trials (Oxman and Guyatt, page 49) have proposed criteria that should be satisfied before accepting results from subgroup analyses. The criteria are outlined below and are relevant in assessing this particular matter:
- Clinically significant: the magnitude of the differences between treatment groups is clinically important and would lead to different decisions for different subgroups.
- Statistically significant: the differences remain statistically significant after formally testing for treatment-subgroup interactions using appropriate statistical methods.
- A priori hypothesis: the hypothesis of subgroup differences preceded rather than followed the analysis (i.e. a priori hypothesis pre-specified in the trial protocol, not a discovery made from post-hoc analyses).
- Limited number of comparisons: the subgroup analysis in question was one of a small number of hypotheses tested to minimize the number of seemingly significant differences (i.e. interactions) that could simply occur by chance.
- Within study comparisons: subgroup differences were suggested by comparisons within studies (i.e. direct comparisons) rather than between studies (indirect comparisons).
- Reproducibility: the subgroup difference is reproducible in other studies that have adequate power and are of similar design in terms of patient characteristics, co-interventions and outcome measures.
- Supporting evidence: the subgroup difference is biologically compelling and consistent with current understanding of biologic mechanisms of disease.
When the data presented in the submissions and derived from the GHAC trial are evaluated against the above criteria, the subgroup analysis arguably satisfies some but clearly not all of the seven elements. Without doubt the criteria of a priori hypothesis and reproducibility have not been met, and these criteria should be fundamental to the scientific strength of the data supporting the proposed PBS listing.
The sponsor’s submissions of July 2005 and March 2006 use a post-hoc subgroup analysis of the GHAC trial data. Although the PBAC does not currently have submission guidelines that transparently state the weight it gives to the validity of certain types of clinical data analyses, it is clear from the medical literature that post-hoc subgroup analyses, in general, are relatively poor options in terms of scientific rigor and validity. Post-hoc analyses of certain subgroups are exposed to multiple hazards, especially the play of chance and uncontrollable confounders. They are particularly prone to false findings. Indeed, the medical literature has numerous examples from randomized trials in which an apparently important differential response to therapy suggested by subgroup analysis generated a hypothesis that was subsequently refuted in a trial designed to test that hypothesis (see, for example, Parker, Pfeffer, Rothwell and Scott, page 49). In addition, respected authors on the topic of subgroup analysis such as Rothwell report that simulations of randomized controlled trials powered to determine the overall effect of treatment suggest that false subgroup treatment effects occur by chance in 7%-21% of analyses. The same author states “post-hoc observations should be treated with skepticism irrespective of their significance” and that “no test of significance is reliable in this situation.”
The sponsor did perform an appropriate statistical test (i.e. to test for a subgroup-treatment effect interaction) when presenting the subgroup data. However, the best test for validity of subgroup-treatment effect interactions is not statistical significance but reproducibility in other trials. The Cochran-Armitage trend test is sensitive to linearity between response and covariates. It tests for trends in binomial proportions across levels of either a single factor or a covariate. In the subgroup data analysis, the sponsor applied this test to determine the relationship between prevalent vertebral fracture grade and incident vertebral fractures in both the placebo and teriparatide treated groups. In the submissions, the sponsor refers to this analysis (i.e. a test for trend) to support its claim for treatment effect modification with teriparatide.
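To make the mechanics of the trend test concrete, the following is a minimal sketch of a two-sided Cochran-Armitage test across ordered fracture grades. The function name, the integer scores and the counts are assumptions chosen for illustration; the counts are not GHAC data, and the sponsor's exact implementation (for example its choice of scores or variance convention) is not known from the submissions.

```python
import math

def cochran_armitage(events, totals, scores):
    """Two-sided Cochran-Armitage test for a linear trend in proportions.

    events[i] / totals[i] is the event proportion in ordered group i,
    and scores[i] is the numeric score assigned to that group.
    """
    n_total = sum(totals)
    p_bar = sum(events) / n_total  # pooled event proportion under no trend
    # Score-weighted sum of (observed - expected) events
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, events, totals))
    s1 = sum(n * s for s, n in zip(scores, totals))
    s2 = sum(n * s * s for s, n in zip(scores, totals))
    variance = p_bar * (1.0 - p_bar) * (s2 - s1 * s1 / n_total)
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p_value

# Hypothetical counts for illustration only (NOT taken from the GHAC trial):
# incident fractures by baseline grade SQ1, SQ2, SQ3 in a single treatment arm.
events = [5, 10, 20]
totals = [100, 100, 100]
scores = [1, 2, 3]

z, p = cochran_armitage(events, totals, scores)
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

Note that the test is run within one treatment arm at a time. A significant trend in the placebo arm and a non-significant trend in the active arm do not, by themselves, constitute a formal test that the two trends differ.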
The data for the paired radiograph population and the SQ3 subgroup cohort of the GHAC trial is summarized in Table 3 (page 16) and the test for heterogeneity for effect sizes of subgroups is presented in Table 4 (page 17). In particular, it is noteworthy that the PBAC tested for heterogeneity (treatment effect modification) by performing a Cochran Q test. Statistical tests for heterogeneity seek to determine whether or not there are genuine differences underlying the results of studies. This analysis is usually evaluated by Cochran’s Q test.
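As an illustration of how such a heterogeneity test operates on two subgroups, the sketch below computes Cochran's Q on log relative risks with inverse-variance weights. The helper names and all counts are hypothetical and are not the figures in Table 4; with only two subgroups the statistic has one degree of freedom, which is the only case this sketch handles.

```python
import math

def log_relative_risk(events_trt, n_trt, events_ctl, n_ctl):
    """Log relative risk and its large-sample variance."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    variance = 1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl
    return math.log(rr), variance

def cochran_q_two_groups(est_a, var_a, est_b, var_b):
    """Cochran's Q for two subgroup estimates (1 degree of freedom)."""
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    pooled = (w_a * est_a + w_b * est_b) / (w_a + w_b)  # inverse-variance mean
    q = w_a * (est_a - pooled) ** 2 + w_b * (est_b - pooled) ** 2
    p_value = math.erfc(math.sqrt(q / 2.0))  # upper tail of chi-square, 1 df
    return q, p_value

# Hypothetical counts for illustration only (NOT the Table 4 data):
# (treated events, treated n, placebo events, placebo n)
severe = log_relative_risk(6, 90, 24, 90)
non_severe = log_relative_risk(20, 350, 40, 350)

q, p = cochran_q_two_groups(*severe, *non_severe)
print(f"Q = {q:.3f}, p = {p:.4f}")
```

The same function can be applied to risk differences instead of log relative risks by passing in the subgroup risk differences and their variances, which is why a dataset can show heterogeneity on the absolute scale but not on the relative scale.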
Treatment effect in terms of new vertebral fractures within the subgroups was assessed by the sponsor using a likelihood ratio test based on a logistic regression model. In the fourth submission, the sponsor reports the analysis of the GHAC data demonstrating the following treatment interaction effects via logistic regression:
- Treatment (20 ug/day vs placebo), SQ grade (1,2,3), and the interaction (saturated model): p=0.2239 (i.e. not statistically significant)
- Treatment (20 ug/day vs placebo), SQ grade (1 and 2 combined vs 3), and the interaction (saturated model): p=0.0865 (i.e. “significant at the 0.10 level”)
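The second comparison above (SQ3 versus SQ1 and SQ2 combined) can be sketched as a one-degree-of-freedom likelihood ratio test: a main-effects logistic model is fitted and its deviance is compared with the saturated model, whose deviance is zero. This is a minimal sketch on grouped data, assuming binary treatment and severity indicators; the function names and the counts are hypothetical, and it is not the sponsor's patient-level analysis.

```python
import math

def solve_linear(a, b):
    """Solve a small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_main_effects(cells, max_iter=50, tol=1e-10):
    """Newton-Raphson fit of logit(p) = b0 + b1*treat + b2*severe."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(max_iter):
        grad = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for treat, severe, events, n in cells:
            x = (1.0, float(treat), float(severe))
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(3):
                grad[i] += x[i] * (events - n * p)
                for j in range(3):
                    info[i][j] += n * p * (1.0 - p) * x[i] * x[j]
        step = solve_linear(info, grad)
        beta = [b + d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return beta

def _x_log_y(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def interaction_lrt(cells):
    """Likelihood ratio test of the treatment-by-severity interaction (1 df)."""
    b0, b1, b2 = fit_main_effects(cells)
    g = 0.0
    for treat, severe, events, n in cells:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * treat + b2 * severe)))
        g += 2.0 * (_x_log_y(events, events / (n * p))
                    + _x_log_y(n - events, (n - events) / (n * (1.0 - p))))
    g = max(g, 0.0)  # guard against tiny negative rounding error
    return g, math.erfc(math.sqrt(g / 2.0))

# Hypothetical cells for illustration only (NOT GHAC data):
# (treated 0/1, severe SQ3 0/1, incident fractures, patients)
cells = [(0, 0, 40, 350), (1, 0, 20, 350), (0, 1, 24, 90), (1, 1, 6, 90)]
g, p = interaction_lrt(cells)
print(f"LR statistic = {g:.3f}, p = {p:.4f}")
```

The three-grade comparison in the first bullet would need two interaction terms and a two-degree-of-freedom reference distribution, which this sketch does not cover.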
The statistical analyses are complex and some aspects of both the analysis and results are open to varying interpretation. However, in my opinion, the analysis of the data presented in Table 4 (page 17) is the most scientifically valid means of assessing a treatment-subgroup interaction. Testing for heterogeneity in terms of treatment effect modification is optimally evaluated by the Cochran Q test, but all such tests are statistically underpowered in this subgroup analysis dataset. As such, any results suggesting a treatment effect modification are only weakly supported by the data. For the SQ3 subgroup, the test for heterogeneity on the risk difference is statistically significant in terms of further fractures (i.e. 22.6% versus 5.7% respectively, p=0.0027). Hence, there is a statistically significant difference in absolute risk between the high risk subgroup and the remainder of the trial population. This is an expected finding given the different background prognostic factors for future fracture events between the subgroup populations. There is convincing evidence in the medical literature that prior osteoporotic fractures (both number and severity) independently predict the risk for new vertebral fractures. Moreover, the background (i.e. placebo treatment group) rates of fracture in the SQ3 subgroup populations are quite different for the two studies. For the placebo treated SQ3 subgroup in the GHAC study, the rate of incident vertebral fracture was 28.4% (27/95) and for the placebo treated SQ3 subgroup in the FIT-VFA, the rate of incident vertebral fracture was 15% (145/965); refer to section 3 of Table 5 (page 19).
However, the key result for data interpretation of treatment effect was the absence of a statistically significant difference (p=0.1433) for the relative risk between the two baseline fracture subgroups of severe (SQ3) and the “non-severe” (non-SQ3) subgroups on the formal test for heterogeneity.
Guidelines on the analysis and reporting of subgroup analyses recommend reporting both absolute and relative risk reductions, with tests for subgroup-treatment effect interaction also to be performed. In this particular matter, given that the heterogeneity of treatment effect is plausibly related to varying absolute risks (with or without treatment) within the different sub-populations at baseline, relative risk reductions with treatment are the most informative means of assessing treatment effect across populations. As such, in my opinion, the relative risk reduction of the overall trial population should be applied to the different baseline risk of the SQ3 subgroup in deriving the clinical outcome data.
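The arithmetic of applying an overall relative risk to a subgroup's baseline risk can be sketched as follows. The baseline risk of 27/95 is the placebo SQ3 rate from the GHAC trial quoted in this review; the relative risk of 0.5 is a hypothetical placeholder chosen only to show the calculation and is not a trial result, and the function name is likewise an assumption.

```python
def absolute_effect(baseline_risk, relative_risk):
    """Apply a relative risk to a baseline risk.

    Returns the treated risk, the absolute risk reduction (ARR) and the
    number needed to treat (NNT = 1 / ARR, infinite if there is no benefit).
    """
    treated_risk = baseline_risk * relative_risk
    arr = baseline_risk - treated_risk
    nnt = 1.0 / arr if arr > 0 else float("inf")
    return treated_risk, arr, nnt

baseline_sq3 = 27 / 95   # placebo SQ3 incident fracture rate in GHAC (28.4%)
assumed_rr = 0.5         # HYPOTHETICAL overall relative risk, not a trial result

treated, arr, nnt = absolute_effect(baseline_sq3, assumed_rr)
print(f"treated risk = {treated:.3f}, ARR = {arr:.3f}, NNT = {nnt:.1f}")
```

Under this approach a higher-risk subgroup still shows a larger absolute benefit than the overall population, but only because its baseline risk is higher, not because the relative treatment effect is assumed to differ.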
In addition, potential sources of bias related to methodology and patient disposition were also observed as per the study report published by Gallagher et al. These have already been outlined in Matter 1A (page 19).
The pivotal GHAC trial was conducted in 1996-1997 and the analysis of fractures by baseline severity was not pre-specified in the statistical analysis plan, which was written in 1994. Unfortunately there have been no randomized controlled trials of teriparatide performed since that time to directly assess the validity of the proposed listing. The medical literature indicates that genuine unanticipated subgroup-treatment effect interactions are uncommon. In addition, part of the expected process in designing clinical trials for regulatory purposes is to consult expert clinical opinion in the development phase with the aim of pre-defining potentially important subgroups. Furthermore, if relative treatment effect is likely to be related to baseline fracture risk, then the analysis plan should include stratification by predicted risk and be statistically powered to determine any such treatment effect.
Reviewer’s summary
The result derived from post-hoc subgroup analysis of the GHAC trial, suggesting a possible but highly contestable and unclear benefit with teriparatide treatment in patients with more severe forms of vertebral osteoporosis (SQ3) at baseline, should be interpreted as informative and hypothesis-generating. The current dataset and analysis are deficient in several measures of scientific rigor, are open to varying interpretation and may be affected by confounding factors. As such, the hypothesis that teriparatide has a superior treatment effect in patients with severe pre-existing vertebral osteoporosis is not adequately supported by the current submissions. This matter is fundamental to the proposed listing. Confirmation in subsequent well designed clinical trials is recommended before the proposed listing is accepted on the basis of a scientifically robust treatment effect.