hedgehog (Site Admin)
Posted: Fri Apr 14, 2006 3:06 pm
Chapter 32 addresses the traditional approach to designing clinical trials, particularly as regards sample size. The first step in planning a trial from a Bayesian perspective is to assess the available evidence regarding the hypotheses and parameters of interest. The designer then considers whether to use this information in a prior distribution or to incorporate it in a hierarchical model along with the results of the trial being planned.
At the planning stage it is important to consider the possible state of affairs when the trial is over. One consideration is the set of implications and consequences of each possible result. Another is the predictive probability of each possible result. The previous section presents an approach in which utilities are assessed for the former and weighed with respect to the latter. The present section deals with designs that are more flexible than the ones in the previous section. Although the designs in this section are not based on an explicit consideration of utilities, the goals are efficient learning and effective treatment of patients. For an explicit decision-analytic generalization of some parts of this section, see Bandit Problems: Sequential Allocation of Experiments.16
Consider a trial having a particular design. Calculating the predictive probabilities of the trial's results is always possible, even for the most complicated of designs (although the most complicated designs require simulations). These calculations allow for finding a variety of the design's attributes, including the probability of achieving a statistically significant benefit of one therapy over another, the expected number of patients in the trial, and the expected number of patients in the trial who successfully respond to their assigned treatment. Comparing calculations for different designs facilitates choosing one design over another.
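As a minimal sketch of such a calculation, the simulation below estimates predictive attributes of a fixed two-arm design with a binary response. The Beta priors on each arm's response rate, the pooled two-proportion z-test, and the function names are all illustrative assumptions, not a method taken from the text. Each iteration draws the unknown rates from the prior and then a trial's outcomes, so the averages are predictive probabilities rather than probabilities under a single fixed hypothesis.

```python
import math
import random

def significant_difference(s_a, s_b, n, z_crit=1.96):
    """Two-sided pooled two-proportion z-test, n patients per arm."""
    p_pool = (s_a + s_b) / (2 * n)
    if p_pool in (0.0, 1.0):
        return False  # no variability in outcomes: treat as not significant
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    return abs(s_b / n - s_a / n) / se > z_crit

def design_attributes(n_per_arm, prior_a=(1, 1), prior_b=(1, 1), n_sims=5000, seed=1):
    """Predictive (prior-averaged) attributes of a fixed two-arm design."""
    rng = random.Random(seed)
    n_significant = 0
    n_responders = 0
    for _ in range(n_sims):
        # Draw the unknown response rates from the prior, then one trial's outcomes.
        p_a = rng.betavariate(*prior_a)
        p_b = rng.betavariate(*prior_b)
        s_a = sum(rng.random() < p_a for _ in range(n_per_arm))
        s_b = sum(rng.random() < p_b for _ in range(n_per_arm))
        n_significant += significant_difference(s_a, s_b, n_per_arm)
        n_responders += s_a + s_b
    return {
        "p_significant": n_significant / n_sims,
        "expected_patients": 2 * n_per_arm,  # fixed in a static design
        "expected_responders": n_responders / n_sims,
    }
```

Running `design_attributes` for two or more candidate designs (for example, different values of `n_per_arm`) gives the side-by-side comparison the paragraph describes; an adaptive design would need its own simulator, but the averaging logic is the same.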
Designs of clinical trials are usually static in the sense that the sample size and any prescription for assigning treatment, including for randomization protocols, are fixed in advance. Results observed during the trial are not used to guide its course. There are exceptions. Some Phase II cancer trials have two stages, with stopping after the first stage possible if the results are not sufficiently promising. And most Phase III protocols specify interim analyses that determine whether the trial should be stopped early for sufficiently strong evidence of a difference between competing treatment arms. However, traditional early stopping criteria are very conservative and so few trials stop early.
The simplicity of trials with static designs makes them solid inferential tools. Their sample sizes tend to be large, at least in comparison with alternatives to be discussed in this section. And they usually consider two therapeutic strategies, or arms, thus enabling straightforward treatment comparisons. I do not mean that static trials always give clear answers as to whether one arm is better than the other, but only that they usually allow for an unambiguous quantification of the uncertainty regarding whether one arm is better.
Despite their virtues, static trials result in slow and unnecessarily costly drug development. Hundreds of millions of dollars and many years can be expended in developing a single cancer drug, one that may not make it to market. For a company developing a moderate number of drugs (say 20 or more), this circumstance is tolerated because costs are balanced by profits from other drugs. Smaller companies are at the mercy of the prevailing attitudes toward drug development and risk going belly up.
The tradition of drug development is one at a time. The number of cancer drugs available for development is increasing exponentially. It is inefficient to focus on a single drug while a gazillion others are sitting on the sidelines waiting to be evaluated. The standard types of errors in drug development are false positives and false negatives. These errors apply to drugs actually being tested. Another kind of error applies to drugs not under investigation: Every such drug is a false neutral. A drug not being developed has no chance of helping anyone. Finite resources limit the ability of the medical establishment to develop therapies. But when resources are limited we should approach their allocation in a more rational way. And what makes sense today may well be different from the ways of the past.
Pharmaceutical companies and medical researchers generally must be able to consider hundreds or thousands of drugs for development at the same time. Static trials inhibit the simultaneous processing of many drugs. And they cannot efficiently address dose-response questions when many drugs are under consideration. Dynamic designs that are integrated with the drug development process are necessary for reasonable progress in medical research.
The focus of this section is a family of designs that are dynamic in the sense that observations made during the trial can affect the subsequent course of the trial. The general class of designs is adaptive or sequential. The focus is clinical trials, but the ideas apply at least as forcefully in the preclinical setting. A main bottleneck of the drug development process occurs at the level of the preclinical animal toxicity/carcinogenicity studies. There are many opportunities for using adaptive designs in the preclinical area that will efficiently identify the best drugs to move forward in trials for humans.
Using an adaptive design means examining the accumulating data periodically, or even continually, with the goal of modifying the trial's design. These modifications depend on what the data show about the unknown hypotheses. Among the possible modifications are stopping early, restricting eligibility criteria, expanding accrual to additional sites, extending accrual beyond the trial's original sample size if its conclusion is still not clear, dropping arms or doses, and adding arms or doses. All of these possibilities are considered in the light of the accumulating information. Adaptive designs also include unbalanced randomization in which the degree of imbalance depends on the accumulating data. For example, arms that give more information about the hypothesis in question or that are performing better than other arms can be weighted more heavily.16
Adaptation is not limited to the data accumulating in the trial. Information that is reported from other ongoing trials can also be used. This is easier to effect if one takes a Bayesian approach, possibly using hierarchical modeling as described in the previous section.
Adaptive designs are increasingly being used in cancer trials. This is true for trials sponsored by pharmaceutical companies, and more generally. For example, a variety of trials at The University of Texas M. D. Anderson Cancer Center (MDACC) are prospectively adaptive. I will describe some of them here.
Continuous Reassessment Method (CRM) in Phase I
As indicated in Chapter 32, the purpose of Phase I cancer trials is to identify the maximum tolerated dose (MTD). The most commonly used Phase I designs are variants of the so-called “3+3” design. Patients are admitted in groups of 3. If none of the 3 experiences toxicity then the dose is increased one level for the next group of 3. If 2 or 3 of the 3 experience toxicity then the next lower dose is the MTD. If 1 of the 3 experiences toxicity then 3 more patients are added at the same dose level. If 2 or more of the 6 patients at that dose experience toxicity then again the next lower dose is the MTD.17
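The rules just listed can be simulated directly, which is one way to see how crude the design's adaptation is. The sketch below follows the description above literally (when 2 or more toxicities occur, the next lower dose is declared the MTD without treating more patients there, which some 3+3 variants require); the function names and the example toxicity probabilities are hypothetical.

```python
import random

def three_plus_three(tox_probs, rng):
    """One simulated 3+3 trial following the rules described above.

    tox_probs: true toxicity probability at each dose level (lowest first).
    Returns (index of selected MTD, patients treated); index -1 means even
    the lowest dose was too toxic.
    """
    level, n_treated = 0, 0
    while True:
        tox = sum(rng.random() < tox_probs[level] for _ in range(3))
        n_treated += 3
        if tox == 1:                      # 1 of 3: treat 3 more at the same dose
            tox += sum(rng.random() < tox_probs[level] for _ in range(3))
            n_treated += 3
        if tox >= 2:                      # 2+ of 3, or 2+ of 6: next lower dose is the MTD
            return level - 1, n_treated
        if level == len(tox_probs) - 1:   # top dose tolerated
            return level, n_treated
        level += 1                        # 0 of 3 or 1 of 6: escalate

def selection_frequencies(tox_probs, n_trials=10000, seed=0):
    """Monte Carlo distribution of the selected MTD and mean sample size."""
    rng = random.Random(seed)
    counts, total_n = {}, 0
    for _ in range(n_trials):
        mtd, n = three_plus_three(tox_probs, rng)
        counts[mtd] = counts.get(mtd, 0) + 1
        total_n += n
    return {k: v / n_trials for k, v in sorted(counts.items())}, total_n / n_trials
```

For example, `selection_frequencies([0.05, 0.10, 0.20, 0.35, 0.50])` shows how often each dose level is selected under one assumed toxicity curve.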
This design is adaptive, but its adaptation is very crude. Such a design is likely to assign low doses and to select an MTD that is ineffective. Moreover, such a design ignores important information that is available in the trial. In particular, dose assignments are not based on sufficient statistics.18 An alternative approach uses Bayesian updating: the continual reassessment method (CRM) of O'Quigley and colleagues.19 Updating takes place assuming a particular model of the relationship between dose and toxicity (such as the logistic function). The CRM, too, is adaptive. Each patient is assigned to the dose whose probability of toxicity is closest to some predetermined target value. This is the Bayesian posterior probability calculated from the data available up to that point (and so it is based on sufficient statistics).
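A minimal sketch of one CRM dose assignment follows. It assumes the one-parameter "power" (empiric) model, P(toxicity at dose i) = skeleton[i] ** exp(a), with a normal prior on `a`; this is a common CRM choice, swapped in here for the logistic model the text mentions. The prior standard deviation, the skeleton, the target of 0.25, and all names are illustrative assumptions. The posterior is computed numerically on a grid.

```python
import math

def crm_next_dose(skeleton, doses_given, toxicities, target=0.25,
                  prior_sd=math.sqrt(1.34), grid_size=2001):
    """Dose whose posterior mean toxicity is closest to `target`.

    Assumed model: P(tox at dose i) = skeleton[i] ** exp(a),
    prior a ~ Normal(0, prior_sd ** 2); posterior computed on a grid.
    """
    grid = [-4.0 + 8.0 * i / (grid_size - 1) for i in range(grid_size)]
    log_post = []
    for a in grid:
        lp = -0.5 * (a / prior_sd) ** 2                    # log prior (unnormalized)
        for d, y in zip(doses_given, toxicities):
            p = skeleton[d] ** math.exp(a)
            lp += math.log(p) if y else math.log1p(-p)     # Bernoulli log likelihood
        log_post.append(lp)
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]      # stabilized posterior weights
    total = sum(weights)
    post_tox = [sum(w * s ** math.exp(a) for w, a in zip(weights, grid)) / total
                for s in skeleton]
    best = min(range(len(skeleton)), key=lambda i: abs(post_tox[i] - target))
    return best, post_tox
```

Practical implementations usually add safeguards this sketch omits, such as forbidding skipped dose levels during escalation.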
The CRM finds the MTD more effectively than does the 3+3 design. The CRM is the standard design used in Phase I trials at MDACC. But it is rather crude and we are improving it in a number of ways. One of these ways is based on the fundamental principle that ignoring information is wrong. (A catch, of course, is that taking information into account is work, and it can require modeling.) Some information about efficacy accrues in a Phase I trial. This information is limited, especially regarding the dose-efficacy relationship. But at a minimum, in proceeding to Phase II with a particular dose (usually the MTD), one should use the efficacy information from those patients in Phase I who were assigned to that dose. This notion leads to using a Phase I/II design that addresses safety and efficacy simultaneously, or one that shifts its focus to efficacy after an initial focus on toxicity. Such an approach is efficient from the perspective of both time and patient resources.
One way in which both 3+3 and CRM designs are crude is the need to pause accrual while waiting for toxicity information.20,21 Such pauses are inefficient and they cause logistical problems. Trials should be paused or stopped if there are safety concerns, not because the design cannot get out of its own way. In getting information about toxicity (or efficacy), there is seldom a magical dose that the next patient must get. All doses are potentially informative. Rather than stopping, one should use a design that models dose-response (toxicity and efficacy) and is able to assign a next dose even though patients previously treated are not yet fully evaluable.
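One published way to assign a next dose before everyone is fully evaluable is the time-to-event CRM (TITE-CRM). It is offered here as an illustration, not necessarily as the approach of the cited references. A patient without toxicity contributes the likelihood term 1 - w * p, where w = min(follow-up / window, 1) is the fraction of the observation window completed. The sketch below assumes the same power model and grid posterior as a basic CRM. The `window` length, the skeleton, and the function name are hypothetical.

```python
import math

def tite_crm_next_dose(skeleton, patients, window, target=0.25,
                       prior_sd=math.sqrt(1.34), grid_size=1001):
    """patients: list of (dose_index, had_toxicity, follow_up_time).

    Assumed model: P(tox at dose i) = skeleton[i] ** exp(a), a ~ Normal(0, prior_sd**2).
    Non-toxic patients are down-weighted by the fraction of the window observed.
    """
    grid = [-4.0 + 8.0 * i / (grid_size - 1) for i in range(grid_size)]
    log_post = []
    for a in grid:
        lp = -0.5 * (a / prior_sd) ** 2
        for d, tox, t in patients:
            p = skeleton[d] ** math.exp(a)
            if tox:
                lp += math.log(p)                   # toxicity observed: full weight
            else:
                w = min(t / window, 1.0)            # partial credit for partial follow-up
                lp += math.log1p(-w * p)
        log_post.append(lp)
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    post_tox = [sum(w * s ** math.exp(a) for w, a in zip(weights, grid)) / total
                for s in skeleton]
    best = min(range(len(skeleton)), key=lambda i: abs(post_tox[i] - target))
    return best, post_tox
```

A patient enrolled a moment ago (follow-up 0) leaves the posterior unchanged, and the weight grows as follow-up accumulates, so accrual never has to pause just to wait for a complete window.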
Another way in which both 3+3 and CRM designs are crude is the assumption that toxicity is dichotomous. A better approach, again because it uses all available information, would be to account for the severity of toxicity. Better still would be to consider both severity of toxicity and efficacy in a Phase I/II design.22 Assigning utilities to the various possible health states would lead to weighing these two conflicting desiderata in a decision analysis.
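To make the decision-analytic idea concrete, the sketch below scores each dose by its expected utility over health states defined by toxicity severity and response. The utility values, the per-dose probabilities, and the independence of severity and response at a given dose are all made-up assumptions for illustration. In a real design the probabilities would be posterior predictive quantities from a dose-response model.

```python
# Hypothetical utilities for each (toxicity severity, response) health state.
UTILITY = {
    ("none", True): 1.0, ("none", False): 0.3,
    ("moderate", True): 0.8, ("moderate", False): 0.2,
    ("severe", True): 0.3, ("severe", False): 0.0,
}

def expected_utility(p_tox, p_resp):
    """p_tox: {severity: probability}; p_resp: probability of response.
    Assumes severity and response are independent at a given dose."""
    return sum(
        p_sev * (p_resp * UTILITY[(sev, True)] + (1 - p_resp) * UTILITY[(sev, False)])
        for sev, p_sev in p_tox.items()
    )

def best_dose(dose_probs):
    """dose_probs: {dose: (p_tox, p_resp)}; returns the dose maximizing expected utility."""
    return max(dose_probs, key=lambda d: expected_utility(*dose_probs[d]))

# Illustrative (made-up) probabilities for three doses.
example = {
    10: ({"none": 0.90, "moderate": 0.08, "severe": 0.02}, 0.20),
    20: ({"none": 0.75, "moderate": 0.18, "severe": 0.07}, 0.45),
    40: ({"none": 0.50, "moderate": 0.30, "severe": 0.20}, 0.55),
}
print(best_dose(example))  # 20
```

With these numbers the middle dose wins (expected utilities of roughly 0.42, 0.56, and 0.53), even though the highest dose has the best response rate. Its extra severe toxicity costs more than its extra efficacy gains.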
Adaptive Dose-finding in Phase II
In many diseases, the standard Phase II dose-finding design is to allocate a fixed number of patients to each of a number of doses in a grid. Such questions are generally of less interest in cancer because of the MTD mentality: administer as much drug as the patient can tolerate. But with the increasing interest in biological agents, dose finding for efficacy is becoming important in cancer research.
After seeing the results of a dose-finding trial, the investigators usually wish they had assigned patients in some other fashion. Perhaps the dose-response curve was shifted more to the left or right than anticipated. If so, then assigning the bulk of patients at one end or the other was wasteful. Or perhaps the slope of the dose-response curve was greater than anticipated, and patients assigned to the flat regions of the curve would have been more informative had they received doses in the region where the slope is apparently greatest. Or perhaps results for the early patients made it clear that the dose-response curve was flat over the entire range and therefore the trial could have stopped earlier. Or perhaps the results of the trial show that the standard deviation of the outcome of interest is greater or less than anticipated and so the trial should have been larger or could have been smaller.
The approach of Berry and colleagues23 is to proceed sequentially, analyzing the data as they accumulate (see also Malakoff5). There are two stages of the trial, first dose ranging (Phase II) and then confirmatory (Phase III), if the latter is warranted. The dose-ranging stage continues until a decision is made that the drug is not sufficiently effective to pursue future development or that the optimal dose for the confirmatory Phase III trial is sufficiently well known. (Switches to Phase III can be effected seamlessly and without stopping accrual, as described below, even if the endpoint of interest is delayed, such as time to progression.) The example trial of Berry and colleagues23 involves a biological neuroprotective agent for stroke. But the same principles of trial design apply in cancer. Each entering patient is assigned the dose (one of 16, including placebo) that maximizes information about the dose-response relationship, given the results observed so far. This dose could be in the region of greatest apparent slope, or it could be placebo or a high dose. But future patients are not assigned to doses in any region where accumulating evidence suggests that the dose-response curve is flat.
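The sketch below illustrates "assign the dose that maximizes information" under a deliberately simple model. It is not the model used in the stroke trial. The model is a quadratic Bayesian linear regression f(d) = t0 + t1*d + t2*d^2 with a Gaussian prior and known noise variance, and the doses are rescaled so that 16 levels lie on [0, 1] with placebo at 0. Information is measured as the reduction in the posterior variance of the curve, averaged over the dose grid. Every name and parameter here is a hypothetical choice.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def features(d):
    """Hypothetical quadratic dose-response basis: f(d) = t0 + t1*d + t2*d^2."""
    return [1.0, d, d * d]

def precision(doses, prior_var, noise_var):
    """Posterior precision of (t0, t1, t2) after observing patients at `doses`."""
    A = [[(1.0 / prior_var if i == j else 0.0) for j in range(3)] for i in range(3)]
    for d in doses:
        x = features(d)
        for i in range(3):
            for j in range(3):
                A[i][j] += x[i] * x[j] / noise_var
    return A

def avg_curve_variance(A, grid):
    """Posterior variance of f(d), averaged over the dose grid."""
    total = 0.0
    for d in grid:
        x = features(d)
        v = solve(A, x)  # v = A^{-1} x, so x . v = Var[f(d)]
        total += sum(xi * vi for xi, vi in zip(x, v))
    return total / len(grid)

def next_dose(doses_so_far, grid, prior_var=10.0, noise_var=1.0):
    """Dose whose next observation most reduces the average curve variance."""
    def score(candidate):
        A = precision(doses_so_far + [candidate], prior_var, noise_var)
        return avg_curve_variance(A, grid)
    return min(grid, key=score)

grid = [i / 15 for i in range(16)]  # 16 doses on [0, 1]; 0 plays the role of placebo
```

One limitation is worth flagging. In a linear-Gaussian model with known variance, the posterior covariance does not depend on the observed responses, so this criterion looks only at where patients have already been treated. Avoiding doses where the curve appears flat, as the text describes, requires a richer model (for example, a nonlinear curve or a target such as the dose achieving most of the maximal effect) in which the data also shape the information criterion.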
In the dose-ranging stage, neither the number of patients assigned to any particular dose nor the total number of patients assigned in this stage is fixed in advance. The dose-ranging sample size will be large when the data suggest that the drug has moderate benefit, when the dose-response curve is gently sloping, or when the standard deviation of the responses is moderately large. It will tend to be small if the drug has substantial benefit, if the drug has no benefit, if the dose-response curve rises over a narrow range of doses, or if the standard deviation of the responses turns out to be small. (In addition, and somewhat non-intuitively, the dose-ranging stage will be small if the standard deviation of responses is very large. The reason is that a sufficiently large standard deviation means that a very large sample size would be required to demonstrate a beneficial drug effect. The required sample size may be so large that it would be impossible to study the drug, and so the trial stops in the dose-ranging phase before substantial resources go down the drain.)
In the stroke trial considered by Berry and colleagues,23 the ultimate endpoint is improvement in stroke scale from baseline to 13 weeks. If the accrual rate is large then the benefit of adaptive assignment can be limited by delays in obtaining endpoint information. To minimize the effects of delayed information, each patient's stroke scale is assessed weekly between baseline and week 13. Within-patient measurements are correlated, with correlations greater if they are closer together in time. We incorporate a longitudinal model into the analysis of the trial and carry out Bayesian predictions of ultimate endpoint based on current patient-specific information, and we update probability distributions of treatment effect accordingly.
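As a minimal sketch of how an early weekly assessment can predict the week-13 endpoint, assume (purely for illustration) that a patient's deviations from the arm's mean trajectory are Gaussian with constant standard deviation and correlation rho ** |s - t| between weeks s and t. This AR(1) structure is one simple way to make correlations larger for measurements closer in time. Under it, only the most recent assessment matters for prediction. The arm means are treated as known here. In the trial's analysis they are themselves uncertain and updated, and these patient-level predictions feed into the posterior of the treatment effect.

```python
def predict_final(observed, means, sd, rho, final_week=13):
    """Predictive mean and variance of the week-`final_week` outcome.

    observed: {week: value} for assessments so far (may be empty).
    means: {week: expected value for the patient's arm}, assumed known here.
    Assumes Gaussian AR(1) deviations: corr(Y_s, Y_t) = rho ** abs(s - t).
    """
    if not observed:
        return means[final_week], sd ** 2      # no data yet: marginal distribution
    k = max(observed)                          # latest assessment (Markov property)
    r = rho ** (final_week - k)
    mean = means[final_week] + r * (observed[k] - means[k])
    var = sd ** 2 * (1 - r * r)
    return mean, var
```

For a patient assessed at week 12 who is 3 points above the arm mean, with rho = 0.9, the prediction is the arm's week-13 mean plus 2.7, and the variance shrinks to 19% of its marginal value. A week-1 assessment would carry far less information, since 0.9 ** 12 is about 0.28.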
Adaptive dosing is more effective than the standard design at identifying the right dose, and it usually does so with a smaller sample size than fixed dose assignments require. Another advantage is that many more doses can be considered in an adaptive design. (Some doses will be little used and some might never be used, but which ones cannot be predicted in advance.) An adaptive design therefore has some ability to distinguish responses at adjacent doses and to estimate nuances of the dose-response curve.
The circumstances of the stroke trial are similar to those in many other types of trials. Finding the right dose is a ubiquitous problem in pharmaceutical development, and it is seldom done well or efficiently. The adaptive nature of the stroke trial would be less advantageous if we had not exploited early endpoints. Cancer too is characterized by the availability of information about a patient's performance (local control of the disease, biomarkers, etc.) before reaching the primary endpoint. Finally, the possibility of moving seamlessly into Phase III depending on the Phase II results exists for many types of drugs. That issue leads naturally to the subject of the next section.
Seamless Phases II and III
The convention of categorizing drug development into phases is unfortunate. We proceed from one phase to the next when we think we know something: for example, the MTD from Phase I, or that a drug's effect on a Phase II endpoint, at the Phase II dose, will translate into a benefit in Phase III. In the Bayesian approach one never takes a quantity to be perfectly known. Instead, the Bayesian perspective is to carry along uncertainty with whatever knowledge is available. Phases of drug development are arbitrary labels that describe a process that is, or should be, continuous.
One of the consequences of partitioning drug development into phases is that there are delays between phases. For example, there is a pause between phases II and III to set up one or more pivotal studies. As mentioned above, the design of the stroke trial avoids such a hiatus. At each time point, say weekly, the algorithm that guides the conduct of the trial carries out a decision analysis and recommends one of three actions: (1) continue the dose-ranging stage of the trial, (2) stop the trial for lack of efficacy (inadequate slope of the dose-response curve or, more accurately, evidence of a positive dose-response that is insufficient to justify continuing the trial), or (3) shift into a confirmatory trial. This shift can be made seamlessly, with no break in accrual. Indeed, it is even theoretically possible to effect such a shift in a double-blind trial without informing the investigators: they simply continue to randomize doses, but unbeknownst to them, the only two being assigned are the Phase III dose and placebo.
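The trial's weekly recommendation comes from a full decision analysis. The sketch below is a much simpler stand-in for illustration only. It summarizes the posterior of the effect (best dose minus placebo) by a normal approximation and maps the posterior probability of a clinically worthwhile effect to the same three actions. The thresholds `futility` and `go` and the worthwhile effect size `delta` are hypothetical and would in practice be tuned by simulation.

```python
from statistics import NormalDist

def weekly_decision(post_mean, post_sd, delta, futility=0.05, go=0.90):
    """Hypothetical posterior-threshold proxy for the weekly decision analysis.

    post_mean, post_sd: normal approximation to the posterior of the effect
    (best dose minus placebo); delta: smallest clinically worthwhile effect.
    """
    p_worthwhile = 1.0 - NormalDist(post_mean, post_sd).cdf(delta)
    if p_worthwhile < futility:
        return "stop for lack of efficacy"
    if p_worthwhile > go:
        return "shift to confirmatory stage"
    return "continue dose ranging"
```

A utility-based analysis would instead weigh the costs of further patients and delay against the value of a correct decision, which is what allows it to stop even when the evidence for an effect is positive but insufficient.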
We designed a trial at MDACC24 that encompasses both phases II and III. If there is a switch to Phase III, this switch is seamless. The anticipated effect of the drug is on local control. We model survival as it depends on local control and as it depends on treatment. (Though the possibility is remote, we allow for the experimental drug to have a beneficial effect on survival that is not mediated by local control.) So local control is a surrogate endpoint in a way similar to the way early stroke scale assessments are surrogate endpoints in the stroke trial. But the clear focus is on survival as the main endpoint, and the utility of the surrogate endpoint must be demonstrated by the results actually observed in the trial. We exploit any relationships that exist, but do not assume such relationships. We analyze the data in the trial frequently and adapt to the accruing evidence.
The seamless aspect is as follows. Initially, only MDACC patients are accrued to the trial. Think of this as Phase II. If the accumulating data are sufficiently strong in suggesting that the drug has no effect on local control or survival, then the trial stops. If the data suggest that the drug may have an impact on local control and that this impact translates into a survival benefit, then the trial will be expanded to include other centers and the accrual rate will increase accordingly. During such an expansion, patients continue to accrue at MDACC so that there is no down time in local accrual while other centers gear up for joining the trial. This is efficient use of patient resources because the patients accrued early at MDACC contribute to the eventual inferences about survival. These patients are the most informative of all those enrolled because their follow-up times are the longest.
The trial continues until stopping occurs because (1) continuing would be futile, judged by predictive probabilities, (2) the maximum sample size is reached, or (3) the predictive probability of eventually achieving statistical significance becomes sufficiently large. Should the third event occur, accrual ceases and the pharmaceutical company prepares a marketing application.
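The three stopping rules can be sketched with predictive probabilities. To keep the sketch self-contained it swaps the trial's survival endpoint for a binary response, assumes uniform Beta(1, 1) priors and a one-sided pooled z-test with critical value 1.96, and uses hypothetical thresholds of 0.90 and 0.05. "Eventual" significance is evaluated at the current enrollment, simulating the outcomes still pending for patients already accrued. Futility is judged at the maximum sample size. All function names are illustrative.

```python
import math
import random

def z_stat(s_c, n_c, s_t, n_t):
    """One-sided pooled z statistic for treatment minus control response rate."""
    p = (s_c + s_t) / (n_c + n_t)
    if p in (0.0, 1.0):
        return 0.0
    se = math.sqrt(p * (1 - p) * (1 / n_c + 1 / n_t))
    return (s_t / n_t - s_c / n_c) / se

def predictive_prob(obs, n_final, z_crit=1.96, n_sims=4000, seed=0):
    """P(final z > z_crit | data) when each arm ends with n_final patients.

    obs = {"control": (successes, n_observed), "treat": (successes, n_observed)}.
    Uniform Beta(1, 1) priors are an assumption of this sketch.
    """
    rng = random.Random(seed)
    (s_c, n_c), (s_t, n_t) = obs["control"], obs["treat"]
    hits = 0
    for _ in range(n_sims):
        p_c = rng.betavariate(1 + s_c, 1 + n_c - s_c)
        p_t = rng.betavariate(1 + s_t, 1 + n_t - s_t)
        # Simulate outcomes still to come (pending and not-yet-accrued patients).
        f_c = s_c + sum(rng.random() < p_c for _ in range(n_final - n_c))
        f_t = s_t + sum(rng.random() < p_t for _ in range(n_final - n_t))
        hits += z_stat(f_c, n_final, f_t, n_final) > z_crit
    return hits / n_sims

def interim_decision(obs, n_enrolled, n_max, success=0.90, futility=0.05):
    """n_enrolled: patients per arm already accrued (some outcomes may be pending)."""
    if predictive_prob(obs, n_enrolled) > success:
        return "stop accrual: significance expected"
    if predictive_prob(obs, n_max) < futility:
        return "stop for futility"
    if n_enrolled >= n_max:
        return "stop: maximum sample size reached"
    return "continue accrual"
```

With a time-to-event endpoint, the "pending" outcomes would be simulated survival times for patients still in follow-up, but the structure of the rule is the same.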
The sample size of a conventional Phase III trial with the desired operating characteristics is 900. We take this to be the maximum sample size in the seamless design as well. Actual accrual is very likely to be much less than this maximum sample size, and on average it will be about half as large. On the other hand, incorporating the same number of interim analyses in a conventional design using a conventional type of stopping boundary allows for only a slight decrease in average sample size. Under any hypothesis, null or alternative, the Bayesian design occasionally leads to a relatively large trial (close to 900 patients). However, a pleasant aspect of the design is that the sample size is large precisely when a large trial is necessary. Conventional trials may well (and sometimes do!) come to their predetermined end with an ambiguous conclusion. In a Bayesian approach one may choose to continue such a trial to resolve the ambiguity, and this option has substantial utility. (Carrying this argument to the maximum sample size, there may be times for which stopping at 900 is ill advised, but for logistical reasons we specified a maximum size.)
Reductions in sample size result from two characteristics of the seamless design described above. The first is the use of frequent analyses to assess the predictive probability of eventual statistical significance. The second is the explicit modeling of the possible relationship between local control and survival. Of the two, the second is more important.
A conventional drug development strategy involves running a Phase II trial that addresses local control, digesting the results, and if the results are positive, starting to develop Phase III trials with survival as the primary endpoint. As indicated above, in comparison with such a strategy, a seamless approach can greatly reduce sample size. In addition, a seamless design minimizes pauses between phases and so the total drug development time is greatly shortened.
Adaptive Allocation
The adaptive designs discussed so far are motivated by the desire to learn efficiently and as rapidly as possible. Another kind of adaptive design aims to treat patients in the trial as effectively as possible. These designs use adaptive allocation in which patients are more likely to be assigned to therapies that are performing better. In addition to making clinical trials more attractive to patients and thereby increasing participation in clinical trials, such strategies have the important side benefit of being efficient and so they result in rapid learning.
More than a dozen trials at MDACC have been designed and are being conducted using adaptive allocation. Our standard approach is to randomize treatment assignment, but we shift the weights toward better performing arms as the trial proceeds and the results accumulate. Many of these trials have more than two arms. The arms are sometimes distinct therapies, and sometimes they are closely related. An example of the latter is an MDACC trial involving five doses (including 0) of a drug (pentostatin). The goal is to inhibit graft-versus-host disease (GVHD) in leukemia patients who are receiving bone marrow transplants. The problem is that the drug may inhibit successful engraftment of the transplant, which is necessary for survival. Such inhibition may be related to dose. We use a combination endpoint: survival at 100 days free of GVHD. The conflict between engraftment and freedom from GVHD means that the dose-response curve may not be monotone. In particular, it may increase for small doses and then decrease. Initially we assign doses in a graduated fashion, climbing the dose ladder slowly. But as doses become admissible, we assign patients to those that have been performing well.
Consider a patient who qualifies for the trial. To decide which pentostatin dose to assign we calculate the current (Bayesian) probabilities that each admissible dose is better than placebo. This calculation uses all information from patients treated to date. We allocate doses randomly, with weights proportional to these probabilities. We consider other allocation algorithms, including assigning in proportion to powers of these probabilities. The assignments involve some degree of randomization, but all patients are more likely to receive doses that are performing better. Doses that are doing sufficiently poorly become inadmissible in the sense that their assignment weight becomes 0. When and if we learn that the drug is effective, we stop the trial. When and if we learn that the drug is ineffective, then again we stop the trial. Patients in the trial benefit from data collected in the trial. The explicit goal is to treat patients more effectively, but in addition we learn efficiently. We evaluate each design's frequentist operating characteristics using Monte Carlo simulation, possibly modifying the parameters of the assignment algorithm to achieve desired characteristics.
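The allocation rule just described can be sketched as follows for a binary endpoint (such as 100-day GVHD-free survival). The sketch makes several assumptions. It uses independent Beta(1, 1) priors per arm, which ignore any dose-response modeling. It estimates the probability that each dose beats control by Monte Carlo. It applies a hypothetical admissibility cutoff of 0.05 and a fixed share of patients for the control arm, since the text does not say how control's own share is set. The initial graduated climb up the dose ladder is not modeled.

```python
import random

def prob_better(post, control, rng, n_draws):
    """Monte Carlo P(arm rate > control rate) from Beta posteriors."""
    a_c, b_c = post[control]
    return {
        arm: sum(rng.betavariate(a, b) > rng.betavariate(a_c, b_c)
                 for _ in range(n_draws)) / n_draws
        for arm, (a, b) in post.items() if arm != control
    }

def allocation_weights(successes, failures, control=0, control_share=0.2, power=1.0,
                       drop_below=0.05, n_draws=5000, seed=0):
    """Randomization weights proportional to P(dose beats control) ** power."""
    rng = random.Random(seed)
    post = {arm: (1 + successes[arm], 1 + failures[arm]) for arm in successes}
    probs = prob_better(post, control, rng, n_draws)
    # Inadmissible doses (too unlikely to beat control) get weight 0.
    raw = {arm: (p ** power if p >= drop_below else 0.0) for arm, p in probs.items()}
    total = sum(raw.values())
    if total == 0:  # every dose inadmissible: in practice the trial would stop
        return {control: 1.0, **{arm: 0.0 for arm in raw}}
    weights = {arm: (1 - control_share) * w / total for arm, w in raw.items()}
    weights[control] = control_share
    return weights

def assign(weights, rng):
    """Randomly assign the next patient according to the weights."""
    arms = list(weights)
    return rng.choices(arms, weights=[weights[a] for a in arms])[0]
```

Raising `power` above 1 pushes assignments more aggressively toward the apparent leader. The text's point is that choices like this are settled by simulating the design's operating characteristics, not by intuition.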
Process or Trial? Evaluating Many Drugs Simultaneously Using Adaptive Allocation
The greatest need for innovation and the greatest room for improving drug development is effectively dealing with the enormous numbers of potential drugs that are available for development. The notion of developing drugs one at a time is part of the pharmaceutical culture. It will change. Companies that are able to screen many drugs simultaneously and do so effectively will survive and others will not.
Many different drugs should be evaluated in the same preclinical experiment or collection of experiments. Information should be updated frequently or even continually. The extent to which any particular drug is used and the order of drugs used will depend on the available data. Drugs that are apparently more promising will move faster through the preclinical setting. Drugs that give disappointing data will languish. And the sample sizes of drugs whose promises and toxicities are not clear will tend to be large so as to enable resolving uncertainties.
These ideas and imperatives apply as well to drugs' clinical development. As an example, at MDACC we are building the foundation for a Phase II trial for evaluating drugs that is more a process than a trial. The idea is an extension of the adaptive assignment strategies described in the previous section. We start with a number of treatment arms plus a control—possibly a standard therapy. We randomize to the arms and learn about their relative efficacy as the trial proceeds. Arms that perform better get used more often. An arm that performs sufficiently poorly gets dropped. An arm that does well enough graduates to Phase III, and if it does sufficiently well it might even replace the control. As more arms become available, they are added to the mix.
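One periodic review step of such a process might look like the sketch below. Each experimental arm is compared with the control using Beta posteriors for a binary endpoint. An arm that is sufficiently unlikely to beat control is dropped, an arm that is very likely to beat it graduates, and newly available arms join the mix. The cutoffs 0.05 and 0.95, the uniform priors, and the function names are hypothetical. Replacing the control with a graduating arm is left out for brevity.

```python
import random

def p_beats(arm_data, control_data, rng, n_draws):
    """Monte Carlo P(arm response rate > control rate), Beta(1, 1) priors."""
    (s_a, f_a), (s_c, f_c) = arm_data, control_data
    return sum(rng.betavariate(1 + s_a, 1 + f_a) > rng.betavariate(1 + s_c, 1 + f_c)
               for _ in range(n_draws)) / n_draws

def review_arms(active, data, control, new_arms=(), drop_below=0.05,
                graduate_above=0.95, n_draws=4000, seed=0):
    """One review of the process.

    active: arm names currently randomized (including control).
    data: {arm: (successes, failures)}.
    Returns (arms still active, dropped arms, graduated arms).
    """
    rng = random.Random(seed)
    kept, dropped, graduated = [], [], []
    for arm in active:
        if arm == control:
            continue
        p = p_beats(data[arm], data[control], rng, n_draws)
        if p < drop_below:
            dropped.append(arm)
        elif p > graduate_above:
            graduated.append(arm)          # candidate for Phase III
        else:
            kept.append(arm)
    kept.extend(a for a in new_arms if a not in active)   # newly available arms join
    return [control] + kept, dropped, graduated
```

Between reviews, patients would be randomized among the active arms, for example with weights that favor better performers as in the previous section.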
The result is that better arms move through quickly and poorer arms get dropped. Patients in the trial are provided with better treatment (when the arms are not equally good). Patients outside the trial get access to better drugs more rapidly.
Extraim Analyses
A common circumstance is that a clinical trial ends without a clear conclusion. For example, a statistical significance level of 5% in the primary endpoint may be required for drug registration and the p-value turns out to be 6%. The regulatory agency suggests that the trial was “under-powered” and that the company should carry out another trial. It would be much more efficient to simply increase the sample size in the present trial. The problem is that the possibility of such an extension increases the type I error rate. The principle is identical to that for interim analyses.
The solution is to build into the design the possibility of continuing the trial depending on the results, suitably adjusting the significance levels. In contrast to adjustments for interim analysis, the adjustments for “extra-im” (extraim) analyses are reversed, with much of the overall significance level “spent” at the originally planned sample size. For example, taking equal significance levels at each possible termination point is preferable to O'Brien-Fleming stopping boundaries because the latter are too conservative for extraim analyses. Allowing for extending the trial increases the maximal sample size and also the average sample size. But a modest increase in average sample size (such as 20%) comes with a substantial increase in statistical power (such as 80% increasing to 95%). The reason for this beneficial trade-off is that the trial is extended only when such an extension is worthwhile.
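For a design with the planned analysis at n1 and one possible extension to n2, "equal significance levels at each possible termination point" means using one common critical value c for both looks. The sketch below calibrates c under a standard normal approximation, in which Z1 and Z2 are bivariate normal under the null with correlation sqrt(n1/n2). Because the null can be rejected only if Z1 > c or Z2 > c, setting P(Z1 > c or Z2 > c) = alpha controls the type I error whatever rule decides when to extend. This makes it a conservative calibration, and a specific extension rule could permit a smaller c. The one-sided alpha and the function names are illustrative.

```python
from statistics import NormalDist

STD = NormalDist()

def prob_no_rejection(c, rho, steps=2000):
    """P(Z1 <= c, Z2 <= c) for standard bivariate normal with correlation rho (Simpson's rule)."""
    lo = -8.0
    h = (c - lo) / steps
    scale = (1 - rho * rho) ** 0.5
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        f = STD.pdf(z) * STD.cdf((c - rho * z) / scale)  # P(Z2 <= c | Z1 = z) * density
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * f
    return total * h / 3

def common_critical_value(alpha, n1, n2):
    """Common one-sided critical value for looks at n1 and n2 (n2 > n1)."""
    rho = (n1 / n2) ** 0.5
    lo = STD.inv_cdf(1 - alpha)        # no penalty: too small once extension is allowed
    hi = STD.inv_cdf(1 - alpha / 2)    # Bonferroni split: always large enough
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - prob_no_rejection(mid, rho) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

The nominal level at each look is then `1 - STD.cdf(c)`. Because the two statistics are highly correlated when the extension is modest, this level is only slightly below alpha, which is why much of the overall significance level is effectively spent at the originally planned sample size.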
The “penalty” in significance level can be either partially or fully offset by including futility analyses as part of the design. Namely, the trial would be stopped for sufficiently negative results at preset interim time points. The reason such analyses offset the penalty for extraim analyses is that the null hypothesis is never rejected when the trial stops for futility. Decreasing the opportunity for a type I error also decreases the power of the trial. However, this decrease is usually quite modest and in any case is more than compensated by the increase in power due to the extraim analyses.
The increment in sample size depends on the available data at the time the decision is made to continue accrual. It also depends on the number of possible extensions. In trials I have designed, I base each extension on predictive power. The usual definition of power assumes a particular value of the parameter of interest, say r. Predictive power considers all possible values of r. The data available at the time of the extraim analysis play two roles. First, they count in the final results of the trial. Second, they are used to update the (Bayesian) probability distribution of r. Fix the total sample size n and calculate the power for detecting each possible value of r. Average this power with respect to the probability distribution of r to give predictive power for sample size n. Extend accrual by the minimum sample size that gives a total sample size having pre-specified predictive power. If there is no such value of n, then continuing accrual may be unwise.
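Here is a minimal sketch of this calculation for one concrete case, assuming a single-arm trial with a binary response rate r, a final exact one-sided binomial test of r <= r0, and a Beta prior on r. The null value r0 = 0.2, alpha = 0.05, the uniform prior, the 80% target, and the cap n_max are illustrative assumptions. With a Beta posterior, averaging the power over r has a closed form: the future responses follow a beta-binomial distribution. The current responses x count toward the final test, and the same data update the distribution of r, matching the two roles described above.

```python
import math

def critical_count(n, r0, alpha):
    """Smallest k with P(Binomial(n, r0) >= k) <= alpha (exact one-sided test)."""
    tail = 0.0
    for k in range(n, -1, -1):
        tail += math.comb(n, k) * r0 ** k * (1 - r0) ** (n - k)  # tail = P(X >= k)
        if tail > alpha:
            return k + 1
    return 0

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binom_pmf(j, m, a, b):
    """P(j responses among m future patients) when r ~ Beta(a, b)."""
    return math.comb(m, j) * math.exp(log_beta(j + a, m - j + b) - log_beta(a, b))

def predictive_power(x, m, n_total, r0=0.2, alpha=0.05, a0=1.0, b0=1.0):
    """Power at total size n_total, averaged over the posterior of r given x of m."""
    a, b = a0 + x, b0 + m - x
    need = critical_count(n_total, r0, alpha) - x   # further responses required
    remaining = n_total - m
    if need <= 0:
        return 1.0                                  # already significant at n_total
    return sum(beta_binom_pmf(j, remaining, a, b) for j in range(need, remaining + 1))

def minimal_extension(x, m, target=0.80, n_max=400, **kwargs):
    """Fewest additional patients giving predictive power >= target, else None."""
    for n_total in range(m, n_max + 1):
        if predictive_power(x, m, n_total, **kwargs) >= target:
            return n_total - m
    return None  # no extension up to n_max reaches the target
```

A `None` result corresponds to the last sentence above: no affordable extension is likely enough to succeed, so continuing accrual may be unwise.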
There is an aspect of the above development that may be unrealistic. Namely, it assumes that endpoints for those patients treated in the trial so far are available at the time of the extraim analysis. Even if the endpoint is tumor response, there is a delay in obtaining this information. There is no need to stop the trial just because some of the endpoint information is unavailable. Rather, these data can be predicted along with those from patients not yet accrued. If there is some early information (biomarkers, performance status, etc.) that is correlated with the endpoint of interest then this can be used to inform the prediction. A special and important case is when the endpoint is time to event. The fact that a patient has not yet reached an event is useful information in predicting the time to that event. But if there is no patient-specific early information, then patients treated but not yet assessed for response are treated in the same way as patients not yet treated. (These issues are sufficiently important to deserve separate treatment; see the next section.)
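For the time-to-event case, here is a minimal sketch of how being event-free so far enters the prediction. It assumes an exponential event time with a Gamma(shape, rate) prior on the hazard, pooled over patients in an arm via the number of events and the total follow-up time ("exposure"). The prior values and the time units are illustrative assumptions. The closed form uses E[exp(-hazard * s)] = (b / (b + s)) ** a when the hazard is Gamma(a, b).

```python
import random

def prob_event_by(horizon, time_on_study, events, exposure, a0=1.0, b0=1.0):
    """Predictive P(event by `horizon` | patient event-free at `time_on_study`).

    Assumes an exponential event time with a Gamma(a0, b0) prior (shape, rate)
    on the hazard; `events` and `exposure` (total follow-up time) are pooled
    over patients in the arm.
    """
    a, b = a0 + events, b0 + exposure
    remaining = max(horizon - time_on_study, 0.0)
    # E[exp(-hazard * remaining)] under Gamma(a, b) is (b / (b + remaining)) ** a.
    return 1.0 - (b / (b + remaining)) ** a

def impute_event_time(time_on_study, events, exposure, rng, a0=1.0, b0=1.0):
    """Draw one plausible event time for a patient still event-free."""
    hazard = rng.gammavariate(a0 + events, 1.0 / (b0 + exposure))  # scale = 1 / rate
    return time_on_study + rng.expovariate(hazard)  # memoryless: add residual time
```

A patient who has been event-free for 10 of 12 months is predicted to be less likely to have an event by month 12 than one followed for only 2 months. Repeated imputations of this kind let a simulation "complete" the trial's data at an interim or extraim analysis.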
The above process is complicated. But it can be completely and precisely described. That means it can be simulated. The simulations can be carried out under various assumptions about the parameter of interest. In particular, the false-positive rate can be calculated. If there is a target significance level, such as 5%, then the various inputs into the design (number and type of extraim analyses, number and type of futility analyses, etc.) can be varied until achieving that target. An advantage of simulations is that each iteration provides a fully accrued trial. So it is possible to check any characteristics of interest regarding the trial's design by calculating the proportion of the trials that have that characteristic. Characteristics of interest include power, actual sample size and the probability of extending accrual.
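The sketch below simulates a hypothetical design with one futility look at n_fut, the planned analysis at n1, and one extraim extension to n2 when the result at n1 is ambiguous. It uses a simplified normal model in which each unit contributes an N(theta, 1) observation. For a two-arm comparison a "unit" would correspond to a pair of patients. It then calibrates the common critical value c by bisection so that the simulated false-positive rate meets a target. All sample sizes, thresholds, and names are illustrative. The three random increments are drawn for every simulated trial whatever c is (common random numbers), so the simulated rejection rate is monotone in c and bisection is well behaved.

```python
import random

def simulate(theta, c, n_fut=100, n1=200, n2=300, z_fut=0.0, z_ext=1.0,
             n_sims=20000, seed=0):
    """Operating characteristics of a hypothetical design with one futility look
    (n_fut), a planned final analysis (n1), and one possible extension (n2)."""
    rng = random.Random(seed)
    rejections = extensions = 0
    total_n = 0
    for _ in range(n_sims):
        # Common random numbers: the same three increments whatever c is.
        e1, e2, e3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        s = theta * n_fut + e1 * n_fut ** 0.5          # sum of n_fut N(theta, 1) outcomes
        if s / n_fut ** 0.5 < z_fut:                    # futility stop: H0 never rejected
            total_n += n_fut
            continue
        s += theta * (n1 - n_fut) + e2 * (n1 - n_fut) ** 0.5
        z1 = s / n1 ** 0.5
        if z1 > c or z1 < z_ext:                        # clear result at planned size
            rejections += z1 > c
            total_n += n1
            continue
        extensions += 1                                 # ambiguous: extraim extension
        s += theta * (n2 - n1) + e3 * (n2 - n1) ** 0.5
        rejections += s / n2 ** 0.5 > c
        total_n += n2
    return {"reject": rejections / n_sims, "mean_n": total_n / n_sims,
            "p_extend": extensions / n_sims}

def calibrate(alpha=0.025, **design):
    """Bisection on the common critical value c until simulated type I error <= alpha."""
    lo, hi = 1.5, 3.5
    for _ in range(30):
        mid = (lo + hi) / 2
        if simulate(0.0, mid, **design)["reject"] > alpha:
            lo = mid
        else:
            hi = mid
    return hi
```

After calibrating with `c = calibrate()`, calls such as `simulate(0.2, c)` report power, mean sample size, and the probability of extending accrual under an assumed effect. Removing the futility look (for example `z_fut=-10`) and recalibrating shows how much of the extraim "penalty" the futility analysis offsets.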
© 2003 BC Decker Inc