Last week I was teaching at a summer school of the IBS German and Austrian-Swiss Region in Strobl at the beautiful Wolfgangsee. The theme of the summer school was “Time-to-Event Analysis”. Thanks again to the organizers for inviting me!
One of the topics I discussed was non-proportional hazards (NPH) in drug development. Of course much can be said about this topic, so this post can only be a start.
Before discussing NPH it may be useful to think about why we actually assume *PH* when designing and analyzing trials.
- Assuming PH allows summary of the treatment effect over time in *one number*, the hazard ratio (HR).
- Assuming PH implies that the number of events required by the logrank test depends only on alpha, power, and an assumed HR (and the randomization ratio). Nothing else!
- The logrank test has maximal power against PH alternatives.
- The canonical estimand corresponding to the logrank test is the HR, so that we naturally align hypothesis test and effect quantification.
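To make the "nothing else" point concrete, here is a minimal sketch of the required number of events under PH. It assumes Schoenfeld's approximation and a two-sided alpha; the function name `schoenfeld_events` and its parameters are my own choices for illustration, not taken from any particular package.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hr, alpha=0.05, power=0.8, ratio=1.0):
    """Approximate number of events for a two-sided logrank test under PH.

    Assumption: Schoenfeld's approximation,
        d = (z_{1-alpha/2} + z_{power})^2 / (p * (1 - p) * log(hr)^2),
    where p is the proportion of patients randomized to the treatment arm.
    `ratio` is the randomization ratio treatment:control.
    """
    if hr <= 0 or hr == 1:
        raise ValueError("hr must be positive and different from 1")
    p = ratio / (1.0 + ratio)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    d = (z_alpha + z_power) ** 2 / (p * (1 - p) * math.log(hr) ** 2)
    return math.ceil(d)


if __name__ == "__main__":
    # Only alpha, power, HR and the randomization ratio enter the calculation.
    print(schoenfeld_events(hr=0.75, alpha=0.05, power=0.8, ratio=1.0))
    print(schoenfeld_events(hr=0.75, alpha=0.05, power=0.8, ratio=2.0))
```

With HR = 0.75, two-sided alpha = 0.05, 80% power and 1:1 randomization this gives 380 events. Note that neither the baseline hazard nor the censoring pattern appears anywhere; they only affect how long it takes to observe these events.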
Now, if we anticipate NPH during trial design things get a bit more complicated:
- The HR is time-varying, so it is not obvious how to quantify the effect in one number. We typically have several aspects that need to be looked at, e.g. delayed effect or cure.
- The logrank test remains valid in the sense that it always protects type I error. Under H0 there cannot be NPH!
- However, the logrank test is no longer power-optimal. Other tests, e.g. weighted logrank or MaxCombo, may offer better power *against specific alternatives*. They may have other issues though (choice of weight function, unclear corresponding estimand, type I error violation).
- When designing a trial anticipating NPH the first question to ask is: How do we quantify the effect? HR still? Restricted mean survival time? Difference at milestone? Difference at a quantile? Accelerated failure time? A trial can be designed using each of these, but each has pros and cons.
- How does the hypothesis test relate to the effect measure of interest?
- The power of a hypothesis test now depends on many more quantities than just alpha, the number of events, and the effect size, e.g. also on the values of the survival functions, the censoring distribution, etc. This implies that one needs to make quite precise assumptions when evaluating the operating characteristics of a design, and that these are only accurate if the assumptions hold true in the actual data.
- Trials are typically designed through simulations.
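As an illustration of designing through simulation, the sketch below estimates the power of the unweighted logrank test under a delayed treatment effect. Everything in it is an assumption made for the example: exponential control arm, a treatment arm whose hazard only drops after `delay`, administrative censoring at a fixed `followup`, a one-sided test at alpha/2, and all function and parameter names. It uses only the Python standard library and is not meant as a validated design tool.

```python
import math
import random
from statistics import NormalDist


def delayed_effect_time(rng, base_hazard, hr_late, delay):
    """Event time with hazard base_hazard up to `delay`, base_hazard * hr_late after."""
    t = rng.expovariate(base_hazard)
    if t <= delay:
        return t
    # Memorylessness: beyond `delay` the remaining time follows the late hazard.
    return delay + rng.expovariate(base_hazard * hr_late)


def logrank_z(times, events, groups):
    """Standardized logrank statistic; negative values favour group 1 (treatment)."""
    data = sorted(zip(times, events, groups))
    n_risk, n1_risk = len(data), sum(groups)
    o_minus_e, var, i = 0.0, 0.0, 0
    while i < len(data):
        t = data[i][0]
        d = d1 = left = left1 = 0
        while i < len(data) and data[i][0] == t:  # handle ties at time t
            _, ev, g = data[i]
            d, d1 = d + ev, d1 + ev * g
            left, left1 = left + 1, left1 + g
            i += 1
        if d > 0 and n_risk > 1:
            p1 = n1_risk / n_risk
            o_minus_e += d1 - d * p1  # observed minus expected in group 1
            var += d * p1 * (1 - p1) * (n_risk - d) / (n_risk - 1)
        n_risk, n1_risk = n_risk - left, n1_risk - left1
    return o_minus_e / math.sqrt(var) if var > 0 else 0.0


def simulate_power(n_per_arm, base_hazard, hr_late, delay, followup,
                   alpha=0.05, n_sim=500, seed=1):
    """Share of simulated trials in which the logrank test rejects in favour of treatment."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        times, events, groups = [], [], []
        for g in (0, 1):
            for _ in range(n_per_arm):
                if g == 0:
                    t = rng.expovariate(base_hazard)
                else:
                    t = delayed_effect_time(rng, base_hazard, hr_late, delay)
                times.append(min(t, followup))         # administrative censoring
                events.append(1 if t <= followup else 0)
                groups.append(g)
        if logrank_z(times, events, groups) < -z_crit:
            rejections += 1
    return rejections / n_sim


if __name__ == "__main__":
    common = dict(n_per_arm=150, base_hazard=0.1, hr_late=0.6, followup=24.0)
    print("PH (no delay):   ", simulate_power(delay=0.0, **common))
    print("6-month delay:   ", simulate_power(delay=6.0, **common))
```

The same late HR of 0.6 yields quite different power depending on the delay, the baseline hazard and the follow-up, which is exactly why the assumptions behind such simulations need to be stated and stress-tested.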
What do regulators say?
- In Europe regulators have repeatedly stated that for them, it would be fine to run a logrank test and then potentially quantify the effect as appropriate. Having said that, I am not aware of a precedent in a label for such a decoupling of the hypothesis test and effect measure.
- In the US, the FDA has clearly voiced a preference for the (unweighted) logrank test, see paper linked in comments. I am not aware of a comment or position with respect to the effect measure.
#biostatistics #survival #estimand #drugdevelopment