


Statistical Design and Analytic Considerations for N-of-1 Trials (Chapter 4)

Research Report Feb 12, 2014


This is a chapter from Design and Implementation of N-of-1 Trials: A User's Guide. The full report can be downloaded from the Overview page.


In this chapter, we discuss key statistical issues for n-of-1 trials: trials of one patient treated multiple times with two or more treatments, usually in a randomized order, with the design under the control of the patient and his or her clinician. The issues discussed include special features of experimental design, data collection strategies, and statistical analysis. For simplicity, we will focus on the two-treatment, block pair design in which patients receive each of two treatments in every consecutive pair of periods with separate treatment assignments within each block of two periods, either randomized or in a systematic, balanced design. Extensions are straightforward to other designs such as K treatments (K > 2) assigned in blocks of size K, randomization schemes with differently sized blocks (e.g., block sizes equal to a multiple of the number of treatments), or unblocked assignment schemes, requiring no changes in the fundamental principles we outline. The basic design principles include randomization and counterbalancing, replication and blocking, the number of crossovers needed to optimize statistical power, and the choice of outcomes of interest to the patient and clinician. Analyses must contend with the scale of the outcomes (continuous, categorical, or count data), changes over time independent of treatment, carryover of treatment effects from one period into the next, (auto)correlation of measurements, premature end-of-treatment periods, and modes of inference (Bayesian or frequentist). All of these complexities exist within an experimental environment that is not nearly as carefully regulated as that of the usual randomized clinical trial, and so they require an appreciation of the special difficulties of gathering data in an n-of-1 trial.

Experimental Design

One appealing feature of the n-of-1 trial is that it allows the patient and clinician to devise an individualized trial, with idiosyncratic treatments and outcomes, run in real-world settings. As a result, n-of-1 designs may vary substantially and reflect great creativity. On the other hand, they often involve clinicians who are unfamiliar with the principles and practice of clinical trials and who may not have access to the resources common in research settings. Because many n-of-1 trials will be carried out in nonresearch medical office or outpatient clinic environments, it is important to ensure that proper experimental standards are maintained while allowing designs to remain flexible and easy to implement. One way to ensure such standards is to establish a centralized service responsible for crucial study tasks such as providing properly randomized or balanced/counterbalanced treatment sequences to the patient-clinician pair when they are designing the trial. We next discuss common clinical crossover trial standards that continue to apply in n-of-1 studies.


After choosing the identity and duration of the treatments to be given, the patient and his/her clinician must be given a sequence of treatments in such a way that the validity of the experimental process is maintained. The sequence can be either randomized or generated in a systematic counterbalanced design, such as ABBA.1,2 In the standard two-treatment n-of-1 trial, the assignments are made within blocks of two time periods. With randomization, the first time period in each block is assigned randomly to one of the two treatments, say, A; the second time period is then assigned to the other treatment, say, B. With a counterbalanced design, the assignments alternate between AB and BA in a systematic manner that is intended to minimize possible confounding with time trend. For example, each two blocks can be assigned as AB (first block) BA (second block) to eliminate possible confounding with a linear time trend.
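The two assignment schemes described above can be sketched in a few lines of Python. This is a minimal illustration, not a validated randomization service; the function names and the `seed` parameter are assumptions introduced here.

```python
import random

def randomized_blocks(n_blocks, treatments=("A", "B"), seed=None):
    """Randomize the treatment order independently within each block."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        block = list(treatments)
        rng.shuffle(block)  # every ordering of the block is equally likely
        sequence.extend(block)
    return sequence

def counterbalanced_blocks(n_blocks, first=("A", "B")):
    """Alternate the block order systematically: AB, BA, AB, BA, ..."""
    sequence = []
    for k in range(n_blocks):
        block = first if k % 2 == 0 else first[::-1]
        sequence.extend(block)
    return sequence

print("".join(counterbalanced_blocks(2)))     # ABBA
print("".join(randomized_blocks(3, seed=1)))  # one random blocked sequence
```

Either function returns one treatment label per period, so a trial with three blocks yields a six-period sequence in which each treatment appears exactly once per block.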

An important requirement for a good experimental design is to balance treatment assignments, especially for potential confounding factors, so that the treatments are compared fairly. Making assignments in blocks of size two ensures that each patient receives each treatment with the same frequency at a comparable set of times, to avoid poorly balanced designs such as AABA and AABB.

Randomization and counterbalancing attempt to balance treatments both within and across blocks. Randomization achieves balance, on expectation, when averaged across a large number of blocks and/or a large number of n-of-1 trials. However, for each individual n-of-1 trial, exact balance might not be achieved. For example, if patient outcomes are deteriorating gradually over time, inducing a time trend, the ABAB design would not be well balanced, as B is always delivered after A. The design itself may induce inferior outcomes for B due to the time trend when the two treatments are actually equivalent. For a four-period trial randomized in blocks of size two, there is a 50-percent chance that randomization will yield such an unbalanced design, either ABAB or BABA (and a 50% chance of a design that is well balanced against the linear time trend, either ABBA or BAAB). Counterbalancing, on the other hand, can be more effective at achieving exact or nearly exact balance for the potential confounding factor(s) designed explicitly to be balanced, for example, the ABBA design achieves exact balance for linear time trend.
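The effect of a linear time trend on each four-period design can be checked numerically. The sketch below assumes the two treatments are truly equivalent and that the outcome is a pure linear trend with no noise; `trend_bias` is a hypothetical helper name.

```python
def trend_bias(sequence, slope=1.0):
    """Mean outcome on B minus mean outcome on A when the only signal is
    a linear trend (outcome = slope * period number)."""
    outcomes = [slope * t for t in range(1, len(sequence) + 1)]
    a = [y for y, trt in zip(outcomes, sequence) if trt == "A"]
    b = [y for y, trt in zip(outcomes, sequence) if trt == "B"]
    return sum(b) / len(b) - sum(a) / len(a)

for design in ("ABAB", "BABA", "ABBA", "BAAB"):
    print(design, trend_bias(design))
# ABAB 1.0
# BABA -1.0
# ABBA 0.0
# BAAB 0.0
```

Under these assumptions ABAB and BABA show a spurious difference of one slope unit, while ABBA and BAAB show none, which matches the 50-percent split described above.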

While randomization can be less effective than counterbalancing in distributing known confounding factor(s) in a balanced way across treatment periods, randomization has an important advantage in its ability to balance (on average) all potential confounding factors, both known and unknown. Counterbalancing, on the other hand, can perform poorly if the explicit scheme chosen leads to imbalance with respect to an unknown confounding factor.

In addition to reducing but perhaps not completely eliminating the risk of bias induced by time trends, blocked assignment also provides two other important benefits. It minimizes the consequences of early termination from the trial that might otherwise lead to an unbalanced number of observations in the two treatment arms. Within-block assignment also reduces the chances that unknown confounders may bias the estimate of within-patient variation, which would invalidate appropriate statistical inference.

To summarize, we recommend that a blocked scheme for treatment assignment be used for n-of-1 trials. We also recommend that users make a careful choice between randomization and counterbalancing. If there is good information on the most important potential confounding factor (such as the linear time trend) and if the total number of blocks is small, say, less than four, counterbalancing can be more effective. Otherwise, randomization would be a more robust choice. The end of the next section, Blinding, has some further discussion.

Blinding


To the extent possible, patients and clinicians should remain blinded to the treatment assigned, particularly when patient-reported or other subjectively ascertained outcomes are used. While blinding is desirable in all clinical trials, it may be particularly important with n-of-1 trials because of the individualized crossover nature of the study. Patients may (and probably will) try to guess which treatment they received in each period. Because they are so invested in the research and so desirous of a positive outcome, it is natural that their reported outcome measures are affected by knowledge of the treatment received, for example, in favor of the direction that confirms any preexisting expectations they might have (the expectancy effect).3,4 Potential bias might also ensue from the motivation for the trial if, for example, patients were compelled to enter an n-of-1 trial to prove that a more expensive treatment was really indicated and should be reimbursed. On the other hand, patients' self-interest might also drive them to report as objectively as possible, particularly if they enter the trial without any preconceived preferences, because they themselves will bear the consequences of a bad treatment decision based on biased outcome reports.

In the absence of blinding, other features related to treatment administration might influence outcomes, but in such a way that they should actually be incorporated into the treatment decision, if it is reasonable to expect the same effect will persist beyond the end of the trial. It was noted in the section "Blinding" in Chapter 1: "Patients and clinicians participating in n-of-1 trials are likely interested in the net benefits of treatment overall, including both specific and nonspecific effects." For example, if the patient prefers one pill to the other because of its color or texture during the trial (a nonspecific effect), and this effect is sustained, it is a real effect for this patient and should be part of the treatment decision. In a parallel group trial where the intent is to generalize beyond the patients in the trial, such a preference should be considered a bias, because future patients to be treated according to the findings from the trial might not have the same preference.

In addition to the potential effect on reported outcomes, knowledge of treatment identity may lead some to end a treatment period early if the measured outcomes support the treatment expectation. Even if the treatment assignment is blinded, superior results in one or more periods may induce patients to ask to unblind the trial to confirm whether their hunches are correct. Such unblinding will stop the trial and may result in an inconclusive result.

For blinded n-of-1 trials with treatments assigned in small blocks such as blocks of size two, there is sometimes a concern that some users (patients and/or clinicians) might learn during the course of the trial that the second treatment in the block is predetermined by the first; therefore, the outcome for the second treatment might be affected by expectancy. When this is an important concern, one could use a block size that is a multiple of the number of treatments or randomize the block sizes in different multiples of the number of treatments. This strategy minimizes the chance for the user to figure out the treatment in any given period. On the other hand, this strategy may also increase the risk of bias if time trends are present or dropout occurs.


Because only one patient is involved in an n-of-1 trial, the number of measurements taken on each individual determines the sample size of the study. The total number of measurements is determined, in turn, by two components: the number of periods and the number of measurements per period. For instance, a pain outcome measured daily over six 14-day treatment periods will have 84 observed data points. These repeated measurements enable estimation of between- and within-period variances, both crucial for proper statistical modeling. Larger sample sizes can be achieved by increasing the number of treatment periods, increasing the length of each period, or increasing the frequency of measurements within each period. These alternative strategies have different analytic implications because they affect different components of the study variance. It is important to carefully choose both the number of crossover periods and the number of measurements taken per period to enhance the efficiency of the study design. More data will improve the precision of the treatment effect estimate, but the optimal allocation to more treatment periods or more measurements per period depends upon statistical considerations such as the expected size of each variance component and its influence on the precision of the effect of interest and the minimum effect size of interest, as well as on practical considerations related to feasibility and type of measurement. Such considerations include patients' ability to record data more than once a day, the validity of measures on different time scales, increased likelihood of dropout with longer trials, and the tendency for patients to become less careful in following treatment protocols over time. Outcomes with substantial measurement variation such as quality-of-life measures will need to be collected more frequently in order to precisely estimate the variance.
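The tradeoff between more periods and more measurements per period can be illustrated with a rough calculation. The sketch below assumes a two-treatment blocked design, independent within-period measurement errors (no autocorrelation), and known between-period and within-period variances. Under those assumptions the variance of the estimated treatment difference is 2 × (between-period variance + within-period variance / measurements per period) / number of blocks. The function names and the numerical variance values are illustrative assumptions.

```python
def total_measurements(n_periods, per_period):
    """Sample size of the trial: periods times measurements per period."""
    return n_periods * per_period

def effect_variance(n_blocks, per_period, var_between, var_within):
    """Variance of the estimated A-minus-B difference in a blocked
    two-treatment design, assuming independent within-period errors."""
    return 2.0 * (var_between + var_within / per_period) / n_blocks

print(total_measurements(6, 14))            # 84, as in the pain example
print(effect_variance(3, 14, 1.0, 14.0))    # baseline design: 3 blocks, 14 per period
print(effect_variance(6, 14, 1.0, 14.0))    # doubling the number of blocks
print(effect_variance(3, 28, 1.0, 14.0))    # doubling measurements per period
```

With these made-up variances, doubling the number of blocks halves the variance, whereas doubling the measurements per period shrinks only the within-period component, so the gain is smaller.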


Carryover, the tendency for treatment effects to linger beyond the crossover (when one treatment is stopped and the next one started), threatens the validity of the comparison between treatments in crossover studies, including n-of-1 trials. While statistical models may attempt to accommodate carryover, they rely on assumptions about the nature of the carryover that may be difficult to test or even control. In the extreme, carryover may extend throughout all or most of the next treatment period, contaminating many of the outcome measurements.

Inserting a washout period in which no treatment is given between consecutive treatment periods is the most common method to reduce or even eliminate the effect of carryover by design. The goal of a washout period is to provide time for each patient to return to the baseline disease state, unaffected by preceding treatment. Deciding whether to include a washout period depends both on clinical judgment about the durability of the treatment effect (e.g., from the pharmacokinetics of a drug treatment) and on practical and ethical considerations related to the study design's implications for the satisfaction and welfare of end-users (patient and clinician).

An important clinical consideration for the washout is to avoid adverse interaction between the treatment conditions. This is mainly an issue for active-control studies, with an active treatment (the standard treatment) used as the control condition to evaluate the comparative effectiveness of an alternative treatment. If the two active treatments being compared are not compatible with each other, it would be necessary to impose a washout period to eliminate the first agent before starting the second agent.

When adverse interaction can be ruled out, the inclusion of a washout period can be problematic for active-control studies, both in terms of satisfaction for the end-users (patient and clinician) and in terms of clinical ethics. The washout period introduces a third treatment condition: the absence of either active treatment. Even a patient managing the disease condition adequately with current treatment might undertake the n-of-1 trial to test the possibility that the alternative treatment might be better. It is undesirable, and perhaps even unethical, for the patient to be forced into a period of no treatment that is likely to be inferior to the current treatment. The use of washout in such studies might reduce a patient's willingness to undertake the n-of-1 trial and increase the chances of early termination from the trial. The ethical dilemma here is that, when adverse interaction can be ruled out, there is no obvious clinical rationale to withhold both active treatments from the patient during the washout period, other than to make a short-term sacrifice in exchange for a better chance to improve the therapeutic precision at the end of the trial.

Conversely, not using a washout might compromise the validity of the estimated treatment effect and lead to biased estimates for treatment effects. Therefore users need to determine whether the likelihood of a substantial bias warrants the drawbacks of the washout.

In some cases, the effect of the washout can be accomplished analytically without including any period during which treatments are withheld. More specifically, any effect of carryover can be dealt with analytically by eliminating, discarding, or downweighting observations taken at the beginning of a new treatment period. It is also possible to incorporate all observations by introducing a smooth transient function that drifts toward zero gradually over time and reflects the time to respond to the carryover effect. Such a function would reduce the influence of potentially contaminated observations early in the period. It contrasts with discrete functions that either accept or discard early observations. This approach can also help to maintain the integrity of the trial by reducing the chance that the patient will drop out and that observations will be contaminated by carryover.
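The contrast between discarding early observations and downweighting them smoothly can be sketched as follows. The exponential form of the weight, the time constant `tau`, the `burn_in` length, and the sample outcomes are all assumptions made for illustration; the appropriate function depends on the treatments involved.

```python
import math

def discard_weight(day_in_period, burn_in=3):
    """Discrete rule: ignore the first `burn_in` days, keep the rest."""
    return 0.0 if day_in_period < burn_in else 1.0

def smooth_weight(day_in_period, tau=2.0):
    """Smooth rule: the assumed carryover transient exp(-day/tau) decays
    toward zero, so the weight 1 - exp(-day/tau) rises toward one."""
    return 1.0 - math.exp(-day_in_period / tau)

def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical daily outcomes in one period; the first days are contaminated.
outcomes = [8, 7, 6, 5, 5, 5, 5]
days = range(len(outcomes))

plain = sum(outcomes) / len(outcomes)
discarded = weighted_mean(outcomes, [discard_weight(d) for d in days])
smoothed = weighted_mean(outcomes, [smooth_weight(d) for d in days])
print(plain, discarded, smoothed)
```

In this example the smoothed period mean falls between the unweighted mean and the mean after discarding, because early observations still contribute but with reduced influence.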

While carryover concerns how the effects of the previous treatment might linger after the completion of the previous treatment period, another important transition issue is the onset of the new treatment. Some treatments, such as selective serotonin reuptake inhibitors (SSRIs), may take some time to reach full effectiveness. Slow onset provides another reason to reduce the influence of potentially contaminated data at the beginning of a period; it introduces a natural washout, particularly if the time for one drug to wear off is no greater than the time for the next drug to take effect.

It should be noted that a washout period does not directly mitigate the problem of slow onset. On the contrary, a washout period further extends the transition between the two treatments, because the onset for the new treatment does not begin until the end of the washout period. As an example, assume that treatment A takes 3 days to wash out, and treatment B takes 2 days to reach its full effectiveness. If a washout period of 3 days is used after a period of treatment A, then treatment B begins on day 4 and reaches its full effectiveness on day 6. Therefore, a total of 5 days are lost to the transition between the two treatments. On the other hand, if a washout period is not used (under the assumption that there is no adverse interaction between the two treatments), the transition is 3 days only: by day 3, treatment B has reached its full effectiveness; by day 4, the carryover effect for treatment A has disappeared. Therefore only 3, instead of 5, days of treatment do not reflect full treatment effects.
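The arithmetic in this example generalizes simply. The sketch below assumes the washout period, when used, lasts exactly as long as the first treatment takes to wash out, and that the two treatments do not interact; `transition_days` is a hypothetical helper name.

```python
def transition_days(washout_days, onset_days, use_washout_period):
    """Days that do not reflect the full effect of the new treatment."""
    if use_washout_period:
        # Onset of the new treatment starts only after the washout ends.
        return washout_days + onset_days
    # Washout of the old treatment and onset of the new one overlap.
    return max(washout_days, onset_days)

print(transition_days(3, 2, use_washout_period=True))   # 5
print(transition_days(3, 2, use_washout_period=False))  # 3
```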

If a washout period is included in the study design, its length needs to be chosen carefully, taking into consideration treatment interactions, medical ethics, drug half-lives, and onset of efficacy. Longer washout periods decrease the likelihood of carryover but increase the length of the study and time spent off treatment, and also delay the onset of the full effectiveness of the next treatment. Making washout periods too short confounds treatment effects with carryover effects and might result in biased estimates of treatment effects. In summary, one needs to define treatment periods sufficiently long to manifest the intended treatment effect and overcome transient effects such as carryover and onset, but short enough to allow enough crossovers within a reasonable total duration for the study.


While a fixed trial design is the norm, adaptive trial designs offer the chance to modify the design of an ongoing trial in order to make it more efficient or to fix problems that may have arisen.5 Some adaptations occur naturally, as when a patient and clinician decide to stop a trial because one treatment appears to be more effective or end a treatment period early because of an adverse event. It is important in such circumstances that blinding be maintained if it is already part of the study design. For instance, it would not be proper to unblind a treatment period in order to stop one treatment, but not the other. Other adaptations could include extending the length of the trial to more treatment periods if treatment differences appear to be small or instigating play-the-winner designs,6,7 in which the treatment that appears to be more effective is given more frequently. Such designs are generally easier to implement when the data are analyzed using Bayesian methods without tests of hypothesis whose properties depend on prespecified design plans. If frequentist inference (i.e., p-values) is used, sequential design with explicit stopping rules is necessary to protect the overall type I error rate. In some cases, decisions to adapt a design may arise from experience with similar patients. For the implementation of adaptive and sequential designs, it is important that these procedures be built into the informatics system to allow for automation of these design features. In order to ensure high-quality performance of the automated procedure, we recommend that these procedures should be reviewed periodically and calibrated as needed.

Multiple Outcomes

The personalized nature of n-of-1 trials and their focus on making a treatment decision for an individual patient require outcomes to be carefully chosen so as to reflect the measures of most importance to the patient's well-being. Often, more than one outcome is of interest to the patient (perhaps obtaining relief from pain and sleeping better), and so the effect of treatment on both needs to be considered in the choice of treatment at the end of the trial. This contrasts with most clinical trials, which often focus on one particular average treatment effect in the population. Thus, although almost all clinical trials collect data on at least several, if not many, outcomes of interest, they typically focus on a primary outcome and so use statistical methods for a single outcome variable.

A common technique when multiple outcomes are of interest is to form a composite variable such as MACE in cardiovascular trials, which counts the number of major adverse cardiac events (e.g., acute myocardial infarction, ischemic stroke, coronary arterial occlusion, and death), and then analyze it by univariate methods. Composite outcomes are not as popular in n-of-1 studies because they do not allow the patient or clinician to see the effect on each distinct outcome separately. Often the outcomes differ so fundamentally that forming a composite becomes difficult. Returning to a previous example, how might one combine a pain scale and the number of nights of good sleep over a fortnight? One could express both as a percentage of relief compared to a baseline level and then average the two percentages, but this would assume that both outcomes were of equal importance and that both outcome scales were linear. Alternatively, one could choose one outcome as primary and the other as secondary, but if the patient were concerned with both, this would be unlikely to work well. Another approach would be to form a weighted composite scale, with weights accommodating patient and clinician preferences or utilities.
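A weighted composite of the kind described above might be computed as in the following sketch. The baseline and observed values, the weights, and the function names are hypothetical; the calculation also inherits the assumption noted above that each outcome scale is linear.

```python
def percent_improvement(baseline, observed, higher_is_better):
    """Improvement relative to baseline, as a percentage."""
    change = observed - baseline if higher_is_better else baseline - observed
    return 100.0 * change / baseline

def weighted_composite(improvements, weights):
    """Weighted average of per-outcome improvements."""
    total_weight = sum(weights[name] for name in improvements)
    weighted_sum = sum(weights[name] * improvements[name] for name in improvements)
    return weighted_sum / total_weight

# Hypothetical patient: pain score falls from 6 to 3; nights of good
# sleep per fortnight rise from 4 to 7.
improvements = {
    "pain": percent_improvement(6.0, 3.0, higher_is_better=False),
    "good_sleep_nights": percent_improvement(4.0, 7.0, higher_is_better=True),
}
equal = weighted_composite(improvements, {"pain": 1.0, "good_sleep_nights": 1.0})
pain_heavy = weighted_composite(improvements, {"pain": 0.7, "good_sleep_nights": 0.3})
print(equal, pain_heavy)
```

Changing the weights changes the composite, which is why eliciting them from the patient and clinician before the trial makes the eventual decision more transparent.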

To reflect the patient's true decisionmaking state, one might instead analyze each outcome separately and report a measure of the treatment's effectiveness for each, letting the patient and clinician weight them on their own. One could argue, however, that explicitly specifying the weights up front is more scientific and transparent than having the patient and clinician implicitly weighting separate outcomes post hoc in trying to make a treatment decision. In the end, this is a decision problem, and it is worth exploring methods of decision analysis to improve decisionmaking for n-of-1 trials. Both approaches may be useful.

Because the focus is on the immediate decision of which treatment to take, it is not important to protect against a false-positive decision, as in the standard test of hypotheses commonly employed in clinical trials. One is not choosing to report a statistically significant finding for one outcome among many, so multiple testing is not an issue. Instead, one provides the decisionmaker with all the information required in a format that facilitates decisionmaking.

Multiple Subjects Designs

Several publications have described an n-of-1 service in which many patients are offered the opportunity of carrying out studies. Such services offer several advantages: economies of scale in research infrastructure, clinicians experienced in n-of-1 trials, and the chance to use information gained from other patients. Multiple n-of-1 trials may be combined in a common statistical model both to estimate the average treatment effect and to improve individual treatment-effect estimates by borrowing strength from the information provided by other similar patients. As more patients accrue, not only does the precision with which the next patient can be evaluated improve, but the estimates for previous patients, who might even have finished their studies, may also change as a result of information gathered from later patients. Multiple-subject designs increase the complexity of sample size choices, because they permit manipulation of the number of subjects as well as the number of measurements on each. Balancing these two numbers requires knowledge of the relevant within- and between-patient variances.8 Ethical considerations may also arise from multiple n-of-1 trials if one treatment appears to be working better and clinicians become reluctant to continue randomizing patients due to lack of equipoise.

Data Collection

The lack of research infrastructure for the single clinician running an n-of-1 trial may have a serious detrimental effect on data collection. Typically, research studies initiate elaborate procedures to ensure that data are collected in a timely, efficient, accurate fashion. Forms are tested and standardized; research assistants are hired and trained to help collect data from patients either at patient visits or remotely via mail, telephone, or Internet connections; data are checked and rechecked by trial personnel and external monitors; and missing items are followed up. Many of these options are not available to the typical clinician running a trial outside of an established n-of-1 service. Conversely, patients in n-of-1 trials are usually extremely motivated, because the trial is being done for them and by them, so they may be more committed to data collection and therefore less likely to miss visits and fail to complete forms accurately. Missing items can be particularly costly in an n-of-1 study because of the small number of observations.

Clinicians undertaking n-of-1 trials must be aware that each trial is unique, with its own protocol and its own set of outcomes. This multiplicity of designs can complicate data collection, even if a centralized support service is available. Multiple data collection forms may be needed, and personalized user interfaces may be valuable ways to collect data. Reminders are important to provide, and interim feedback can maintain the patient’s enthusiasm.

Statistical Models and Analytics

The unique design features of n-of-1 trials, including a multiple-period crossover design, multiple patient-selected outcomes, and focus on individual treatment effects, motivate statistical models for these trials. Data resemble a time series in that they are autocorrelated measurements on a single experimental unit. Unlike classic time series, however, the measurements are structured by the randomized design, and so statistical models also have features like those for longitudinal data with a time-varying covariate (the treatment condition). The main goal is to compare the observations made under the two treatment conditions, adjusting for any carryover effects, while accommodating the randomized block structure.

Constructing such models is difficult, especially when few measurements are taken. One review of the n-of-1 literature in medicine, in fact, found that many studies used no formal statistical model at all to compare treatments, opting instead for eyeball tests based on a graph of the data or simple nonparametric tests such as the proportion of paired treatment periods in which A outperformed B.9 When the data are simple and treatment differences are clear, such simple methods work well; graphs are always informative, and plots of the measurements provide good ways to understand the data. But when the number of measurements gets large or when differences are small, graphs will not be sufficient to properly distinguish the treatment effects.

The basic data from an n-of-1 design consist of measurements taken over time while on different treatments. The fundamentals of the statistical analysis can be most easily understood by focusing on the two-treatment design, in which treatments are randomized in blocks of size two, each treatment appearing once in each block. Each treatment period consists of one or more measurement times.

Nonparametric Tests

The earliest n-of-1 trials in medicine used a simple nonparametric test called the sign test. First, within each block, one calculates the difference between the outcome on treatment A and the outcome on treatment B. If the difference is positive (A is better than B), one counts this as a success. A negative difference counts as a failure. (The choice of which difference is defined to be a success is, of course, arbitrary.) The number of successes, that is, the number of blocks in which A outperforms B, is now compared to the number expected if the treatments were the same: N/2, where N is the number of blocks. Since the number of successes is assumed to follow a binomial distribution, one calculates the probability of a result at least as extreme as the one observed under the null hypothesis that the true success probability is 50 percent. For example, if there were three blocked comparisons and in each A was better than B, the probability would be ½ × ½ × ½ = 1/8. This is then a (one-sided) p-value for testing whether A was better than B. This procedure ignores the actual size of the differences and thus ignores potentially important information. Instead, one might use the Wilcoxon signed-rank test on the ranked differences.
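The one-sided sign-test p-value can be computed directly from the binomial distribution. This is a minimal sketch; `sign_test_p_value` is a name introduced here, and ties (blocks with zero difference) are assumed to have been dropped beforehand.

```python
from math import comb

def sign_test_p_value(successes, n_blocks):
    """One-sided p-value: P(X >= successes) for X ~ Binomial(n_blocks, 1/2)."""
    tail = sum(comb(n_blocks, k) for k in range(successes, n_blocks + 1))
    return tail / 2 ** n_blocks

print(sign_test_p_value(3, 3))  # 0.125
print(sign_test_p_value(5, 5))  # 0.03125
```

With only three blocks, the smallest attainable one-sided p-value is 1/8, so the sign test cannot reach conventional significance thresholds in very short trials regardless of how large the differences are.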

While these simple nonparametric tests are easy to use, they ignore important features of the time-series data, particularly their autocorrelation, time trends, and repeated measurements within periods. As a consequence, it is usually worth constructing a proper statistical model that incorporates these features along with an estimation of treatment effect.

Models for Continuous Outcomes

A variety of different models can be constructed when the outcomes are continuous variables, depending on whether they are considered random measurements within each treatment period or vary systematically with time.

First, consider a model in which time may be indexed within treatment periods inside blocks. Notationally, let $y_{ijkl}$ represent the outcome measured at time $i$ within treatment period $j$ within block $k$ while on treatment $l$:

Model 1: $y_{ijkl} = \alpha + \beta_l + \gamma_k + \delta_{j(k)} + \varepsilon_{i(j(k))}$.

Model 1 assumes a fixed treatment effect $\beta_l$, random block effects $\gamma_k \sim N(0, \sigma_\gamma^2)$, within-block random period effects $\delta_{j(k)} \sim N(0, \sigma_\delta^2)$, and within-period random errors $\varepsilon_{i(j(k))} \sim N(0, \sigma^2)$, where the notation $N(\mu, \sigma^2)$ indicates a normal distribution with mean $\mu$ and variance $\sigma^2$. The constant term is used to avoid oversaturation of model terms. Usually, one block is chosen as the reference (e.g., set $\gamma_1 = 0$), and period within-block effects may be expressed so that the difference between the first and second period is assumed the same in each block. This model assumes no time trend and no carryover. The model may be simplified if observations within one treatment period or block are uncorrelated with those in another. In that case, the model becomes a simple two-mean model with random errors:

Model 2: $y_{ijkl} = \beta_l + \varepsilon_{i(j(k))}$

A common scenario for this model is when each treatment period has only one observation (perhaps at the end) to minimize the possibility of carryover.
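One simple way to estimate the treatment effect under Model 1 is to average the measurements within each period and then analyze the within-block differences of period means, in which the block effects cancel. The sketch below takes that approach rather than fitting the full mixed model; the data layout, function name, and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def block_difference_estimate(blocks):
    """Estimate the A-minus-B effect from within-block differences.

    `blocks` is a list with one dict per block, mapping each treatment
    label to the list of measurements taken in that treatment period.
    """
    diffs = [mean(block["A"]) - mean(block["B"]) for block in blocks]
    estimate = mean(diffs)
    std_error = stdev(diffs) / sqrt(len(diffs))  # paired-difference standard error
    return estimate, std_error

# Hypothetical three-block trial with three measurements per period.
blocks = [
    {"A": [4, 5, 6], "B": [6, 7, 8]},
    {"A": [3, 4, 5], "B": [7, 7, 7]},
    {"A": [5, 5, 5], "B": [6, 6, 6]},
]
estimate, std_error = block_difference_estimate(blocks)
print(estimate, std_error)
```

Because the standard error is based on only as many differences as there are blocks, it is imprecise in short trials; a mixed-model fit would use the within-period measurements more fully.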

Modeling Effects Depending on Time

Another class of models pertains to occasions when outcomes vary systematically with time. Causes for such variation include time trends that might describe a disease course or calendar effects that arise from seasonal variation in severity, for instance, in asthma patients whose health is affected by hay fever. Measuring such time effects requires that the study duration and measurement frequency be sufficient to differentiate the trends from noise. It is then easiest to express the model in terms of the measurement $y_t$ taken at time $t$. If the trend is linear, we have

Model 3: $y_t = \alpha + \beta t + \gamma X_t + \varepsilon_t$,

in which $\beta$ is the slope of the time trend, $X_t$ is an indicator for the treatment received at time $t$, $\gamma$ is the treatment effect, and $\varepsilon_t$ are the residual errors, possibly correlated over time. Other time effects can be introduced by modifying the specification for the model, for example, adding nonlinear terms such as quadratic terms to capture possible nonlinear trends. As another example, a seasonal effect could be introduced by adding a dummy variable $Z_t$ taking the value of 1 during the season and 0 outside it. It is important to recognize that the true functional form for time trend is usually unknown; therefore the specification of time effects is usually exploratory.
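As a minimal sketch of how Model 3 might be fitted, the following uses ordinary least squares on the design matrix with columns 1, t, and X_t. This ignores any correlation in the residual errors, so it is only a starting point. The function names and the noise-free example data are hypothetical and chosen so that the true coefficients are recovered exactly.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_trend_model(times, treatment, outcomes):
    """Least-squares estimates (alpha, beta, gamma) for y_t = alpha + beta*t + gamma*X_t."""
    design = [[1.0, float(t), float(x)] for t, x in zip(times, treatment)]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(design, outcomes)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data: 12 single-measurement periods in 6 blocks,
# with treatment order AB BA AB BA AB BA (X_t = 1 on treatment B).
times = list(range(1, 13))
treatment = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
outcomes = [30.0 + 0.5 * t + 4.0 * x for t, x in zip(times, treatment)]

alpha_hat, beta_hat, gamma_hat = fit_trend_model(times, treatment, outcomes)
```

With real data the residuals from such a fit would then be inspected for remaining autocorrelation before the estimate of the treatment effect is interpreted.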

When each period has a single measurement, the time variable can be replaced by an indicator variable for period. If the effect of treatment is expected to vary with time (e.g., because of higher efficacy during periods of greater disease severity), one can include a time-by-treatment interaction effect in the model.


Measurements in a time series typically are not independent, exhibiting some form of autocorrelation that represents the relationship between one measurement and the next in the series. Such autocorrelation arises from time trends or treatment carryover that causes individuals to tend to respond more similarly at times that are closer to each other. Model 3 presents one method of detrending the time series by fitting a model that is linear in time. Such detrending often removes substantial amounts of observed autocorrelation, but some may remain as a consequence of features such as carryover or delayed uptake. Carryover may cause the response to be greater than it should be, if both treatments being compared are active and beneficial. Delayed uptake applies if the full effect of a treatment is not felt at the start of the measurement of the outcome. It will work in the opposite direction, initially depressing the response. The effect of each, however, is to induce correlation between consecutive outcome measurements.

Models that adjust for autocorrelation take two main forms. The first, often called an autoregressive or serial correlation model, expresses the residual error at a given time as a function of the error at one or more previous times, that is, $\varepsilon_t = \delta \varepsilon_{t-1} + u_t$. In this model, $\delta$ is the correlation between consecutive errors $\varepsilon_t$ and $\varepsilon_{t-1}$. Additional lagged errors of the form $\varepsilon_{t-k}$ can be added to the model to represent more complex autocorrelation. The second form, called by some a dynamic model,10 places the autocorrelation on the outcomes themselves so that the response at time $t$ is a function of the response at time $t-1$ (and perhaps earlier times). A dynamic form for a model with one fixed treatment effect, for instance, would be $y_t = \delta y_{t-1} + \gamma X_t + \varepsilon_t$. The dynamic model induces a dependence of the current outcome on previous values of the predictors in the model. One can also explicitly introduce this dependence in the form of lagged predictors. It is important to recognize the different interpretation of predictors in a dynamic model resulting from the need to condition on the previous outcome, that is, $\gamma$ is the treatment effect conditioning on $y_{t-1}$.
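For the first (autoregressive) form, a rough estimate of the lag-1 coefficient can be obtained from the residuals of a fitted model. The sketch below regresses each residual on its predecessor through the origin, which is one simple estimator among several. The function name and the example residuals are hypothetical.

```python
def lag1_autocorrelation(residuals):
    """Estimate delta in e_t = delta * e_{t-1} + u_t by least squares through the origin."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Pair each residual with the one immediately before it.
    numerator = sum(cur * prev for prev, cur in zip(residuals, residuals[1:]))
    denominator = sum(prev * prev for prev in residuals[:-1])
    return numerator / denominator


# Hypothetical residuals that halve at every step (u_t = 0), so delta is 0.5.
decaying = [1.0, 0.5, 0.25, 0.125]
delta_hat = lag1_autocorrelation(decaying)

# Residuals that flip sign at every step give a strongly negative estimate.
alternating = [1.0, -1.0, 1.0, -1.0]
delta_alt = lag1_autocorrelation(alternating)
```

An estimate near zero suggests that detrending has removed most of the serial dependence. With the short series typical of an n-of-1 trial, any such estimate will be imprecise.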


Carryover is a special type of autocorrelation common to crossover trials. As stated earlier, it occurs when the time between treatment periods is insufficient for the effect of the previous treatment to end before the next treatment is started. This is common with pharmacological treatments when the drug continues to exert effects in the body after the patient stops taking it. If not controlled for, carryover may lead to bias in the estimated treatment effects, with a tendency to magnify observed treatment effects during transitions from a less effective (but still effective) treatment to a more effective treatment, and conversely to shrink effects during transitions from a more effective to a less effective treatment.

Both design and analytic approaches can address carryover. Designing washout periods long enough for the prior treatment's effect to disappear by the beginning of the next treatment period eliminates any potential correlation across periods. An analytic approach downweights, disregards, or simply does not collect outcomes at the beginning of a treatment period, thus creating an analytic washout period.11 This analytic approach is also helpful when treatments take time to reach their full effect and one desires to account for the reduced effect at the beginning of the period.

Zucker12 used an extreme version of this approach in a series of n-of-1 trials for patients with fibromyalgia tested on amitriptyline or amitriptyline plus fluoxetine. Treatment periods were 6 weeks long, and the primary outcome was the score on the patient-reported Fibromyalgia Impact Questionnaire. Only the report from the end of each treatment period was analyzed. While this almost certainly eliminated carryover, and in fact autocorrelation, it did have the drawback of giving only one measurement per treatment period. In some studies, however, these choices may be unavailable if each treatment period is short or treatment half-life is very long.

Various approaches to estimating carryover have been proposed. As Senn13 points out, all rely on restrictive modeling assumptions and are inferior to designing a proper washout (which also may rely on assumptions about pharmacologic or similar properties of the treatments). The discussion above points to autocorrelation models as one method to handle carryover, although they assume correlations over time unrelated to when treatment is changed or introduced. In principle, one could design an autocorrelation structure that varied with time since introduction of treatment. But this would need to assume characteristics of the nature of the carryover that might not be well supported.

A simple check for carryover when the analyst has a sufficient number of observations taken over time within each treatment period is to compare results using all measurements to results after discarding those at the beginning of the period that might be affected by carryover. The model with more measurements should return more precise estimates but at the risk of some bias from the carryover. If the estimates are similar, carryover is not likely to be an issue.
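The check described above can be sketched as follows: compute the treatment difference once with every measurement and once after dropping the first few measurements of each period. The data layout (a list of labeled periods), the function name, and the example values are all hypothetical. Here the early measurements in each period after the first are constructed to still reflect the previous treatment.

```python
from statistics import mean


def treatment_difference(periods, n_discard=0):
    """Mean outcome on B minus mean outcome on A, skipping the first
    n_discard measurements of every treatment period."""
    kept = {"A": [], "B": []}
    for label, measurements in periods:
        kept[label].extend(measurements[n_discard:])
    return mean(kept["B"]) - mean(kept["A"])


# Hypothetical trial: 4 periods with 4 measurements each.
periods = [
    ("A", [30.0, 30.0, 30.0, 30.0]),
    ("B", [32.0, 34.0, 36.0, 36.0]),  # response still rising at start of B
    ("A", [34.0, 32.0, 30.0, 30.0]),  # response still falling at start of A
    ("B", [32.0, 34.0, 36.0, 36.0]),
]

diff_all = treatment_difference(periods)                   # uses every measurement
diff_trimmed = treatment_difference(periods, n_discard=2)  # drops 2 per period
```

In this constructed example the two estimates differ noticeably (3.75 versus 6.0), which would point toward carryover or delayed uptake. Similar estimates would suggest that neither is a major concern.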

Another form of carryover that one might be able to examine is the effect of treatment sequence when the response is different depending on the order of the treatments given. Treatment A may have a bigger effect if given after treatment B. This might manifest itself through responses that are higher for treatment A when it follows B than when it follows another period of A. One can examine a sequence effect by adding a variable that codes for sequence, for example, a dummy variable that equals 1 in periods where A follows B and 0 otherwise. Of course, if treatment effects are wearing off, it would not be appropriate to code every measurement in the A period with the sequence effect.

Discrete Outcomes

In each of the models presented, we have assumed a continuous outcome with normally distributed measurement error. Many outcomes that might be used in n-of-1 trials, however, may use categorical scales, event counts, or binary indicators of health status. For example, Guyatt14 and Larson15 both used Likert scales with ratings from 1–7 to measure patient outcomes. Models for such outcomes require different formulations that do not rely on the assumption of normality.

Generally, one needs to formulate such models as generalized linear models.16 Binary outcomes use logistic regression; count outcomes use Poisson regression; and categorical outcomes use categorical logistic regression. The generalized linear model has the same form as the linear model on the right-hand sides of the models above, but expresses the left-hand side in terms of a (link) function of the mean of the probability distribution for the outcomes. For example, with a binary outcome, events occur according to a Bernoulli distribution, and the mean of that distribution is the probability of an event. The link function used in logistic regression is the logit function, $\mathrm{logit}(p) = \log_e(p/(1-p))$. In Poisson regression, the link function is log. For categorical regression, various link functions can be used depending on how one wants to model the data. A common link function for an ordered outcome such as a preference scale is the cumulative logit.17

Although the generalized linear models use different estimation algorithms and take different functional forms, model construction does not differ conceptually in any fundamental way from the normal linear models, so we will say no more about them here, but refer the interested reader to the many textbooks that treat them.16,17


The simplest approach to estimating the treatment effect is based on the model that ignores any potential effects of time, autocorrelation, or carryover and simply compares the average response when the patient is on each treatment. If the design is blocked, one can take the difference between outcomes within each block and then simply average the differences, computing the appropriate standard error. This corresponds to a paired t-test. If no blocking is used, the analysis is an unpaired t-test.13
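For a blocked design with one outcome per treatment per block, the paired analysis described above might be sketched as follows. The function name and the outcome values are hypothetical. The sketch returns the t statistic and its degrees of freedom, and leaves the lookup of a p-value or interval to a statistics package.

```python
import math
from statistics import mean, stdev


def paired_block_analysis(outcomes_a, outcomes_b):
    """Mean within-block difference (B minus A), its standard error,
    the paired t statistic, and the degrees of freedom."""
    if len(outcomes_a) != len(outcomes_b):
        raise ValueError("each block needs one outcome per treatment")
    diffs = [b - a for a, b in zip(outcomes_a, outcomes_b)]
    n_blocks = len(diffs)
    mean_diff = mean(diffs)
    # Standard error of the mean difference from the sample SD of differences.
    std_err = stdev(diffs) / math.sqrt(n_blocks)
    t_stat = mean_diff / std_err
    return mean_diff, std_err, t_stat, n_blocks - 1


# Hypothetical outcomes for 6 blocks, one measurement per treatment per block.
treatment_a = [32.0, 33.5, 34.0, 35.5, 37.0, 36.0]
treatment_b = [39.0, 38.0, 38.5, 37.5, 37.0, 38.5]

mean_diff, std_err, t_stat, df = paired_block_analysis(treatment_a, treatment_b)
```

With only six blocks the t statistic has 5 degrees of freedom, which illustrates how little information a single short trial carries about the variance.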

In general, one can use likelihood methods that incorporate the necessary correlation structures and interaction terms to fit the models. Likelihood-based methods typically rely on large samples to validate their assumptions of normal distributions of the resulting model estimates. Because the amount of data collected on any single outcome in an n-of-1 study is small, such assumptions may not be appropriate.

Bayesian inference combines this likelihood with prior information to form a posterior distribution describing the probability that a model parameter takes a given value. The prior information is expressed through a probability distribution describing our degree of belief about model parameters before observing the data. Bayesian inference is natural for clinicians making decisions such as a differential diagnosis, because it expresses the way that they combine new information (such as a diagnostic test result) to update their previous beliefs.18 In an n-of-1 trial, the prior may be based on a population average effect or may be individualized to reflect patient-specific characteristics. The use of prior information also permits the analysis to incorporate patient preferences and beliefs.

Specification of a complete prior distribution for all model parameters can be difficult, particularly for those, such as correlations or variance components, about which not much may be known. One common simplification assumes that very little is known about some or all of the parameters and uses prior distributions that do not favor any values over others. Probabilistically, this corresponds to a uniform (flat) distribution. Such priors are referred to as noninformative. Conversely, knowledge of certain parameters such as the expected treatment effect may be available, and so informative priors may be chosen. For example, for a pain scale outcome the average pain reduction that one can expect over a 2-week course of therapy may be approximately known in the population, or one may be able to bound the maximum amount. It is also possible to construct an approximate prior distribution by eliciting its key parameters, such as its mean and standard deviation, or its percentiles.19

The posterior distribution is formed by calculating the conditional probability distribution of each parameter given the observed data and the specified prior distribution. Its mean is essentially a weighted average of the observed treatment effect mean and the hypothesized prior mean. The weights are supplied by the relative information about the two expressed through the precision with which each is known. One can use the posterior distribution to make statements about the probability that the parameters take on different values. For instance, one might conclude that the chance that treatment A reduces pain more than treatment B as measured on a specific pain scale is 75 percent, or one might say that there is a 50-percent chance that the reduction is at least 10 points on the scale. Statements like this can be made for each outcome, allowing the patient and clinician to weigh them and determine which treatment is working better. Bayesian inference leads to statements about the probability of different hypotheses given the data observed; non-Bayesian, or frequentist, inference leads to statements about the probability of the data given the null hypothesis.
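The weighted-average description can be made concrete in one special case, assumed here purely for illustration: a normal prior for the treatment effect $\gamma$ with mean $\mu_0$ and variance $\sigma_0^2$, and an observed mean treatment difference $\bar{d}$ with known sampling variance $s^2$. The symbols $\mu_0$, $\sigma_0^2$, $\bar{d}$, $s^2$, and $w$ are introduced for this sketch and do not appear elsewhere in the chapter.

```latex
\begin{aligned}
E[\gamma \mid \text{data}] &= w\,\bar{d} + (1 - w)\,\mu_0,
\qquad w = \frac{1/s^2}{1/s^2 + 1/\sigma_0^2}, \\
\operatorname{Var}[\gamma \mid \text{data}] &= \left(\frac{1}{s^2} + \frac{1}{\sigma_0^2}\right)^{-1}.
\end{aligned}
```

For example, with a prior mean of 5 and prior standard deviation of 10 (precision 0.01), and an observed difference of 10 with standard error 5 (precision 0.04), the weight on the data is $w = 0.8$, the posterior mean is $0.8 \times 10 + 0.2 \times 5 = 9$, and the posterior variance is $1/0.05 = 20$. A flat prior corresponds to letting $\sigma_0^2$ grow without bound, so that $w$ approaches 1 and the posterior mean approaches the observed difference.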

Local Knowledge and Statistical Methods

The personalized nature of n-of-1 trials indicates that the primary use for the knowledge produced in each individual trial is to inform clinical decisionmaking for the specific patient, that is, the knowledge produced is used locally or internally within the patient-clinician team that produced it. This paradigm is crucially different from the situation in standard parallel group randomized controlled trials (RCTs), in which the primary use of the knowledge produced in an RCT is to inform clinical decisionmaking for future patients, rather than for the patients participating in the RCT. In fact, for double-blinded RCTs, the patients and their clinicians do not know what treatment the patient actually received until the RCT is unblinded. Given this fundamental difference between the two paradigms, the appropriate statistical method also differs. While significance testing is the usual statistical method for the standard parallel group RCTs, the same method might be less pertinent for n-of-1 trials. Instead, one provides the decisionmaker with all the information required in a format that facilitates decisionmaking.

Presentation of Results

In order to make a correct decision, it is important that the patient and clinician not only have the right information, but that it be presented to them in a format that is easy to understand. The results of a trial are complex, and data are collected on multiple outcomes at many times under different treatment conditions. Many of the models we have discussed describe complicated phenomena such as autocorrelation that may confound facile interpretation of the data. Further complications might be present in skewed and/or heteroskedastic data (such as lognormal data and Poisson-distributed count data) that might indicate transformation to a different scale for graphic presentation and statistical modeling. Good graphics can help explain the data and the results to all parties involved.

Results should always be accompanied by the simplest possible graph, plotting each outcome over time separately in the treatment and control groups. A variety of different approaches are possible. One could overlay or stack two line plots, matching by block pairs. This reveals within-block differences as well as time trends and potential autocorrelation. One could add the sequence order by separately coloring within each block the first sequence in one color and the second in another (as in Figure 4–1). Such displays of raw data provide important information on the relationship of outcomes to treatment. They may also be shown in a blinded fashion (without identification of treatment group) to the patient during the trial as a form of feedback to motivate adherence.

Kratochwill et al.20 describe a process for using figures to evaluate the success of the intervention. After establishing a predictable baseline pattern of data, one examines the data within each phase to assess the performance and potentially to extrapolate to the next phase. Assessment involves: (1) level, the mean for the data in a phase; (2) trend, the slope of the line of best fit in a phase; (3) variability of the data around this line; (4) immediacy of the effect, the change in level between the end of one phase and the next; (5) overlap, the degree to which the data at the end of one period resemble those at the beginning of the next; and (6) consistency of data patterns in similar phases. More consistency, separation between phases, and strong patterns suggest a real effect. Once each phase is assessed, results from successive phases are compared to determine if the intervention had an effect by changing the outcome from phase to phase. Finally, one integrates the information across phases to see if the effects are consistent. A similar scheme is given in Janovsky.21

Figure 4–1. Data from simulated n-of-1 trial

Figure 4-1 plots data from a simulated N-of-1 trial. Two line plots (one indicated by a solid line and the other by a dotted line) describe outcomes for two treatments measured at each of 6 blocks. The vertical axis shows the outcome score and the horizontal axis shows the block (from 1-6). On each line, points are labeled as either red or blue. Patients receive each treatment in each block with the point labeled in red taken first. The outcome for the treatment indicated by the dotted line is generally greater than that associated with the solid line. The outcomes for the dotted line start at about 39 and stay between 37 and 39. The outcomes for the solid line start at 32 and rise to 37 at block 5 before dropping to 36 at block 6. The lines cross briefly at block 5.

Note: Two line plots (solid and dotted) show outcomes for 2 treatments measured within each of 6 blocks. Patients receive each treatment in each block, with the point labeled in orange taken first.

Treatment differences in such figures may, however, be camouflaged by other features of the data such as autocorrelation and time trends. Figure 4–1 shows simulated data in which treatment B (dotted line) typically produces better outcomes than treatment A (solid line). Responses appear to be increasing with time on treatment A, but not B, suggesting a potential treatment-by-block interaction. Because only one measurement is recorded in each treatment period, we cannot distinguish time effects from effects by block. The overall impression from the picture is that B may be better than A, but that this advantage wears off over time. In fact, the data are simulated with a fixed treatment difference and with a trend over time, but no treatment-by-block interaction; the apparent interaction occurs by chance. The right answer (discernible through use of appropriate statistical analysis) is that B is better than A and that all patient responses are increasing with time. Therefore, the plot is somewhat misleading and may prompt the wrong decision. As a general rule, unless treatment effects are large or specific, plots will provide necessary but not sufficient information to make appropriate decisions. It is therefore important to supplement graphs with appropriate statistical analysis and present the information in the clearest way possible.

One should use the statistic provided by the modeling process that relates directly to the measured treatment difference. In the Bayesian framework, this is the posterior probability; in the non-Bayesian, or frequentist, framework, this is typically a p-value. We recommend the Bayesian approach because it provides more value to the patient. The p-value describes the likelihood of obtaining the actual data (or more extreme data) under a specific null hypothesis. For example, a p-value of 0.05 for a test of the null hypothesis of no difference in treatments means that if the two treatments had the same effect, one would have observed the difference found (or a more extreme difference) 1 time in 20 times under repeated sampling. Putting aside the irrelevancy of the repeated sampling assumption (since the experiment will not be repeated), one is left with the observation that it is unlikely that the treatments have the same effect, but one does not know the likelihood of any other effect.

Contrast this with the Bayesian interpretation, which gives the full posterior probability distribution of the treatment effect under the model chosen. From this posterior distribution, one can make probabilistic statements about the likelihood of any size of treatment effect, for example, the likelihood that the treatment effect is at least 10, or between 5 and 15. In essence, this approach focuses on estimation of the magnitude of the effect, rather than on hypothesis testing. As a result of the focus on estimation instead of hypothesis testing, power analysis is of less concern. Zucker et al.,8 quoted in Duan et al.,22 show that for a study with $M$ patients and $N$ paired-time periods, study precision is $M/(\tau^2 + 2\sigma^2/N)$, where $\tau^2$ is the between-patient variance and $\sigma^2$ is the within-patient variance. This provides a way to calculate the tradeoff in sample size between the patients and time periods.
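The tradeoff implied by the precision expression can be explored numerically. The sketch below simply evaluates M/(τ² + 2σ²/N) for a few designs. The function name and the variance values are hypothetical.

```python
def study_precision(n_patients, n_pairs, tau_sq, sigma_sq):
    """Precision M / (tau^2 + 2*sigma^2 / N) for M patients and N paired periods."""
    return n_patients / (tau_sq + 2.0 * sigma_sq / n_pairs)


# Hypothetical variances: between-patient tau^2 = 4, within-patient sigma^2 = 16.
tau_sq, sigma_sq = 4.0, 16.0

baseline = study_precision(10, 2, tau_sq, sigma_sq)       # 10 patients, 2 pairs
more_patients = study_precision(20, 2, tau_sq, sigma_sq)  # double the patients
more_periods = study_precision(10, 4, tau_sq, sigma_sq)   # double the pairs
```

Doubling the number of patients doubles the precision, whereas doubling the number of paired periods gives a smaller gain, because adding periods cannot reduce the between-patient term τ². However many periods are run, precision cannot exceed M/τ².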

This focus can be particularly informative when multiple outcomes are of interest to the patient, and one wants to balance different objectives. As an alternative to the use of the composite scale discussed previously, one could formulate a joint posterior distribution to make probabilistic statements about the joint probabilities attached to combinations of the outcomes, if one were prepared to make some assumptions about their relationships. As an example, assume that the users (the patient and clinician) specified a performance target for the new treatment, A, to improve pain by at least 10 percent and increase sleep by at least 1 hour per night compared to the current treatment, B. In the simple (and perhaps unrealistic) case that the outcomes are independent, the probability for the joint outcome is the product of the probabilities of each separate outcome. So, if the probability that A improved pain by 10 percent was 0.3 and the probability that A increased sleep by 1 hour was 0.2, then the probability that both would happen would be 0.06.

Such probabilities can be expressed by a distribution function of the likelihood of each gain or by a cumulative distribution. As an example, assume that the posterior distribution of treatment benefit on A compared to B for outcome 1 expressed as a difference in percent change from baseline was normally distributed with mean 10 percent and standard deviation 5 percent. Therefore, there is roughly a 97.5-percent probability that A has bigger benefit than B, since 0 change is about two standard deviations below the mean. Likewise, assume the benefit for the second outcome is smaller but more uncertain, normally distributed with mean 5 and standard deviation 10. Figure 4–2 (top row) plots the resulting posterior probability distributions of treatment effect for each outcome together. One might also be interested in their cumulative distributions, or more likely, the probability of observing an improvement at least as big as a certain size. These graphs appear in the middle row of the figure. Using the dotted lines on the graph, we can see that the probability of at least a 10-percent improvement is slightly higher with outcome 1 than with outcome 2 since its mean is higher, but that the situation reverses for the probability of at least a 20-percent improvement because of the greater uncertainty associated with outcome 2. The bottom row of the figure gives the probability that both outcomes are improved by a given amount. This probability is smaller than for either outcome alone and (for this example) is roughly the product of the two individual probabilities, because the two outcomes were simulated independently. In practice, these joint probabilities may be quite similar to or quite different from their components, depending upon the correlation between the outcomes.
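The exceedance probabilities in this example can be computed directly from the two normal posteriors. The sketch below uses Python's standard-library NormalDist and, as in the text, treats the outcomes as independent when forming the joint probability. The helper name is hypothetical. Because the chapter's figures come from simulation, its tabulated values can differ slightly from these analytic ones.

```python
from statistics import NormalDist


def exceedance_probability(posterior, threshold):
    """Posterior probability that the treatment effect exceeds the threshold."""
    return 1.0 - posterior.cdf(threshold)


# Posteriors from the example: outcome 1 ~ N(10, 5^2), outcome 2 ~ N(5, 10^2).
outcome_1 = NormalDist(mu=10.0, sigma=5.0)
outcome_2 = NormalDist(mu=5.0, sigma=10.0)

rows = []
for threshold in (0, 5, 10, 15, 20):
    p1 = exceedance_probability(outcome_1, threshold)
    p2 = exceedance_probability(outcome_2, threshold)
    # Under independence the joint probability is the product.
    rows.append((threshold, p1, p2, p1 * p2))

for threshold, p1, p2, joint in rows:
    print(f"Probability > {threshold:>2}: {p1:.2f}  {p2:.2f}  {joint:.2f}")
```

If the outcomes were correlated, the product in the last column would no longer apply and the joint probability would have to come from the joint posterior distribution.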

While plots like those in Figure 4–2 display the entire distribution of effect sizes along with our uncertainty in estimating them, some may prefer a simpler display with less total information, but perhaps in a format that is easier to understand. The distributions in the top row of the figure may be collapsed into a median and a central interval displaying the values most likely to occur with a given amount of probability, often 95 percent. One may also choose one or more increments of improvement for which to display probabilities. Figure 4–3 displays the median and 95-percent central interval (from the 2.5 to the 97.5 percentile) for the treatment effect for each outcome. The probabilities associated with improvement of at least 0, 5, 10, 15, and 20 percent for each outcome and both outcomes together can be displayed as in Table 4–1. The users should be able to specify the exact outcome levels for which they want probabilities computed. These may correspond, for instance, to clinically relevant values as determined by the patient and clinician in collaboration.

Figure 4–2. Percent improvement and probability of improvement for two outcomes

Figure 4-2 shows five plots arranged in three rows, with two plots in each of the top two rows and one in the bottom row. The plots in the top row (titled Outcome 1 and Outcome 2) describe posterior distributions in percent improvement (treatment effect) for two outcomes. For outcome 1, the curve is centered at 10 percent improvement and most of the distribution is between about 0 and 20 percent improvement. It corresponds to a normal distribution with mean 10 and standard deviation 5. For outcome 2, the curve is centered at 5 with most of the distribution between -20 and 30. It corresponds to a normal distribution with mean 5 and standard deviation 10. The plots in the middle row show the cumulative probability (y-axis) that each outcome improves by at least the percentage amount on the horizontal axis. The curve for outcome 1 starts at a probability of 1 and begins to decline when the percent improvement on the x-axis is -10. It declines in a reverse S-shape until it reaches 0 probability at about 20. Two dotted lines are drawn: one at y = 0.5 continuing to where the curve is at x = 10 and the other near y = 0 continuing over to where the curve is at x = 20. The second plot is of a similar form. The curve begins to decline from a probability of 1 at x = -30 and reaches y = 0 when x = 30. The dotted lines are drawn at about y = 0.3 and y = 0.05. The plot in the bottom row shows a line depicting the probability that both outcomes improve by at least the amount on the horizontal axis. The line begins at y = 0.7 when x = 0 and declines to y = 0 at about x = 15. Two horizontal dotted lines are drawn at y = 0.18 and y = 0 over to where the curve is at x = 10 and x = 20, as in the middle row plots.

Note: Top row: Posterior distributions in percent improvement (treatment effect) for 2 outcomes; Middle row: Probability that outcome improves by at least amount on horizontal axis for each outcome; Bottom row: Probability that both outcomes improve by at least amount on horizontal axis.

Figure 4–3. Posterior median and 95-percent central posterior density interval for two outcomes

Figure 4-3 gives the posterior median and 95-percent central posterior density interval for each outcome. The vertical axis shows the treatment effect and runs from -10 to 20. For outcome 1, at the left of the x-axis, there is a vertical line with a midpoint at y = 10 and ends at y = 0 and y = 20. For outcome 2, at the right of the x-axis, there is a vertical line with a midpoint at y = 5 and ends at y = -15 and y = 25.

Table 4–1. Probability that a given outcome or two outcomes together have a treatment effect greater than a given amount

Exceedance probability | Outcome 1 | Outcome 2 | Outcomes 1 and 2
Probability > 0 | 0.97 | 0.69 | 0.67
Probability > 5 | 0.86 | 0.50 | 0.43
Probability > 10 | 0.51 | 0.31 | 0.17
Probability > 15 | 0.17 | 0.16 | 0.02
Probability > 20 | 0.02 | 0.07 | 0.00

Some users may prefer to consider results as odds, rather than probabilities. Others may prefer metrics other than treatment effects. A flexible environment in which the user can request results in the way that is most comfortable and personally informative is a desired feature of any n-of-1 analytic module.

Combining N-of-1 Studies

Although n-of-1 studies are designed for single patients working with a single clinician to make a single treatment decision, many n-of-1 studies may be similar enough to inform others. Furthermore, the small number of crossovers used in many n-of-1 studies may increase the need to combine the index patient's data with data from other patients who participated in similar n-of-1 trials to increase the statistical precision available for making individual treatment decisions.

Such similarity may arise from the same clinician testing the same treatments with patients having the same condition; similar patients testing the same treatments with different clinicians; clinicians within the same clinic practicing in similar ways; or examining a common set of treatments in different combinations. In each case, we may think of the set of n-of-1 studies as forming a meta-analysis and attempt to combine them using techniques from meta-analysis such as multilevel random-effects models, regression, and networks. As an added bonus, combining the results can help estimate the average treatment effect in the population as well as the individual treatment effects for single patients. We give a brief introduction here, but refer the interested reader to related treatments in Zucker23 and Duan, Kravitz, and Schmid.22

To extend the previous models to multiple patients with n-of-1 studies, we assume that the same outcome measure, y, is used across patients to be combined, and consider

Model 1a: $y_{mijkl} = \alpha_m + \beta_l + \gamma_k + \delta_{j(k)} + \varepsilon_{i(j(k(m)))}$

where $m$ indexes the patient, $\alpha_m \sim N(0, \sigma_\alpha^2)$ is the random effect for the patient, and the error term indicates the variability within observations taken within a treatment period within a block within a patient. The time-trend model,

Model 3a: $y_{it} = \alpha_i + \beta t + \gamma X_t + \varepsilon_{it}$,

changes only by having a random intercept $\alpha_i \sim N(0, \sigma_\alpha^2)$ for the $i$-th patient. These models may be easily extended to encompass interactions between patients and other factors that would indicate variation across patients. In particular, patient characteristics may explain some of the between-patient variance $\sigma_\alpha^2$.

If we assume all within-block measurements are exchangeable, that is, all block-specific treatment effect estimates are similar and can be considered replicates of each other, we can combine results across patients quite simply. First, estimate the treatment effect for patient $i$ within block $b$ as the difference in the outcomes between treatment 1 and treatment 0, $d_{ib} = y_{ib1} - y_{ib0}$. The block-specific treatment effect estimates can then be aggregated across blocks to form the individual treatment effect (ITE) estimate, $\hat{\varphi}_i = \frac{1}{B_i} \sum_{b=1}^{B_i} d_{ib}$, where $B_i$ is the number of blocks for patient $i$. It is possible to extend this approach to a regression estimate under the broader assumption that allows observed differences across blocks, such as a period effect. The observed ITEs $\hat{\varphi}_i$ are unbiased for the true ITE $\varphi_i$, so that $\hat{\varphi}_i \sim N(\varphi_i, s_i^2)$, where $s_i^2$ is the within-patient variance. The within-patient variance is assumed to be known and allowed to be specific to each patient (as in a meta-analysis treating each patient as a study). This permits capture of variation in design or implementation of the studies, such as the variation in the number of blocks across patients. For instance, one could assume that $s_i^2 = \sigma^2/B_i$, the common within-block variance $\sigma^2$ scaled by the number of blocks. If the full Model 1 is used, then $s_i^2$ is estimated from the within-block measurements.

The true ITEs are assumed to be drawn from a random-effects distribution, $\varphi_i \sim N(\varphi_0, \tau^2)$, where $\varphi_0$ denotes the overall mean treatment effect for the population, and $\tau^2$ denotes between-patient variance in the individual mean treatment effects. Prior distributions are placed on the parameters $\varphi_0$, $\tau^2$, and $\sigma^2$ to represent what is known about these parameters prior to the study. The overall mean treatment effect $\varphi_0$ and the individual mean treatment effects $\varphi_i$ are estimated using the posterior distribution for each parameter.

The posterior distribution of the patient's ITE, $\phi_i$, provides an opportunity to obtain a more informative estimate of the ITE than is available in a single n-of-1 trial because of the opportunity to borrow strength from the population mean $\phi_0$. Recall that, in a normal model, the posterior mean is a precision-weighted average of the sample mean and the prior mean. In this situation, the prior mean $\phi_0$ is the external information coming from other patients, and the patient's own ITE estimate, $\hat{\phi}_i$, is the information coming from the patient. If the patient is like the others, the posterior mean will be close to the average.

The relationship between the individual treatment effect, $\phi_i$, and the overall treatment effect, $\phi_0$, depends on the balance between the between-patient variance, $\tau^2$, and the within-patient variance, $s_i^2$.24 When between-patient variance is small compared to within-patient variance (i.e., little or no heterogeneity of treatment effects), the patient-specific mean treatment effects, $\phi_i$, are estimated to be very similar to one another and close to the overall mean effect, $\phi_0$. Alternatively, if between-patient variance is large compared to within-patient variance (i.e., strong heterogeneity of treatment effects), each $\phi_i$ would be estimated to be close to the patient-specific treatment effect estimate, $\hat{\phi}_i$, with little or no "borrowing of strength." In a sense, because the "strength" (population information) to be borrowed does not provide strong statistical information, within-patient information dominates between-patient information.
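This balance has a simple closed form under the simplifying assumption that the overall mean, the between-patient variance, and the within-patient variance are all treated as known (in the full model they carry prior distributions). The weight $w_i$ is notation introduced here for illustration and does not appear in the original text.

```latex
\begin{aligned}
E[\phi_i \mid \hat{\phi}_i] &= w_i\,\hat{\phi}_i + (1 - w_i)\,\phi_0,
  \qquad w_i = \frac{\tau^2}{\tau^2 + s_i^2}, \\
\operatorname{Var}[\phi_i \mid \hat{\phi}_i] &= w_i\, s_i^2 .
\end{aligned}
```

For example, with $s_i^2 = 1$, a small between-patient variance of $\tau^2 = 0.25$ gives $w_i = 0.2$, so the posterior mean sits mostly at the overall mean. A large between-patient variance of $\tau^2 = 9$ gives $w_i = 0.9$, so the patient's own estimate dominates.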

The model for multiple patients may be extended by considering the model as comprising two parts, within-patient and between-patient. The models for the single n-of-1 trial describe the within-patient parts. The between-patient parts describe factors that vary among patients, as in any statistical model with patient units. These include patient characteristics such as comorbidity, demographics, and socioeconomic status. They may also include study and health care structure such as the nesting of patients within providers and providers within organizations. Each level in the nested structure is represented by a random effect, in addition to the patient-level random effect $\phi_i$. For example, the model that accommodates a nested structure with patients nested within practices will have a random effect for practices in addition to a random effect for patients: $\phi_{pi} \sim N(\theta_p, \tau_p^2)$ with $\theta_p \sim N(\theta_0, \omega^2)$, where $\phi_{pi}$ denotes the individual mean treatment effect for the ith patient in the pth practice, $\theta_p$ denotes the mean treatment effect among patients in the pth practice, $\tau_p^2$ denotes the within-practice variance among patients in the pth practice, $\theta_0$ denotes the overall mean treatment effect across practices, and $\omega^2$ denotes the variance across practices. Again, covariates at the practice level can also be incorporated into the model to evaluate the heterogeneity of treatment effects (HTE) associated with these covariates.
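The nested structure is easiest to read as a generative recipe: draw a mean effect for each practice, then draw each patient's effect around that practice mean. The simulation below is a hypothetical sketch of that recipe only. The function name, argument names, and parameter values are invented for illustration, and standard deviations rather than variances are passed in.

```python
import random


def simulate_nested_effects(theta0, omega, tau_by_practice,
                            patients_per_practice, seed=0):
    """Simulate individual treatment effects for patients nested in practices.

    theta0: overall mean treatment effect across practices.
    omega: standard deviation of practice means around theta0.
    tau_by_practice: within-practice standard deviation for each practice.
    Returns a dict mapping practice index to a list of patient effects.
    """
    rng = random.Random(seed)
    effects = {}
    for practice, tau_p in enumerate(tau_by_practice):
        # Practice-level random effect.
        theta_p = rng.gauss(theta0, omega)
        # Patient-level random effects within the practice.
        effects[practice] = [rng.gauss(theta_p, tau_p)
                             for _ in range(patients_per_practice)]
    return effects


# Two hypothetical practices with three patients each.
sim = simulate_nested_effects(theta0=1.0, omega=0.5,
                              tau_by_practice=[0.2, 0.3],
                              patients_per_practice=3, seed=42)
for practice, patient_effects in sim.items():
    print(practice, [round(e, 2) for e in patient_effects])
```

Fitting the model reverses this recipe: the observed ITE estimates inform the patient effects, which in turn inform the practice means and the overall mean.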

In addition to improving estimates of a patient's ITE through borrowing strength from other studies, one also obtains an estimate of the overall treatment effect across patients either as a single mean or as a regression. These population effects can be used to inform treatment decisions for similar patients who did not participate in n-of-1 trials.

Finally, when n-of-1 trials with different treatment comparisons are combined across patients, it is possible to consider a network meta-analysis of the n-of-1 trials. Models for network meta-analysis25,26 incorporate all the pairwise comparisons into a single model for simultaneous estimation. Under assumptions of consistency27 and similarity,25,28 direct comparisons of treatments A and B, A and C, and B and C may be combined so as to incorporate both their direct estimates and indirect estimates. (AC is estimated indirectly through the sum of AB and BC.) Such models make optimal use of all the treatment data, leading to more precision in effect estimates as well as the ability to rank treatments. These models hold even when studies do not compare all treatments, but only a subset. For example, a study comparing A and B may be combined with one comparing B and C to get an indirect estimate of A and C. Studies with more than two arms not only fit into the model structure, but actually provide additional information, because their direct and indirect estimates obtained from the same study must be consistent.
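The arithmetic behind an indirect comparison can be sketched as follows. This is a deliberately simplified illustration, not a network meta-analysis model: it assumes consistency, assumes the contributing estimates are independent so that their variances add, and pools direct and indirect evidence with a fixed-effect inverse-variance average. The function names and numbers are hypothetical.

```python
def indirect_estimate(d_ab, var_ab, d_bc, var_bc):
    """A-versus-C effect implied by A-versus-B and B-versus-C estimates.

    Under consistency the effects add. Variances add only if the two
    estimates come from independent sources, which is assumed here.
    """
    return d_ab + d_bc, var_ab + var_bc


def pool_estimates(direct, var_direct, indirect, var_indirect):
    """Inverse-variance weighted average of a direct and an indirect estimate."""
    w_direct = 1.0 / var_direct
    w_indirect = 1.0 / var_indirect
    total = w_direct + w_indirect
    pooled = (w_direct * direct + w_indirect * indirect) / total
    return pooled, 1.0 / total


# Hypothetical effect estimates and variances.
d_ac, var_ac = indirect_estimate(d_ab=1.0, var_ab=0.5, d_bc=2.0, var_bc=0.5)
pooled, pooled_var = pool_estimates(direct=2.0, var_direct=1.0,
                                    indirect=d_ac, var_indirect=var_ac)
print(d_ac, var_ac, pooled, pooled_var)
```

The pooled variance is smaller than either input variance, which is the gain in precision the paragraph describes. A large gap between the direct and indirect estimates would instead call the consistency assumption into question.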

Automation of Statistical Modeling and Analysis Procedures

The statistical modeling and analysis procedures, both those for a single n-of-1 trial and those for combining n-of-1 trials across patients, should be built into the informatics system so that they can be automated, with periodic review and calibration of the procedures for continuous quality improvement. Such automation is particularly important because most clinician/patient pairs will have neither the time nor the expertise to evaluate the statistical models, and rarely will a statistician be available to do the analysis in real time. Instead, it will be necessary to have the statistical modeling, including model selection, model checking, and model interpretation, built into the informatics system for presentation when a treatment decision is to be made.


N-of-1 data offer rich possibilities for statistical analysis of individual treatment effects. The more data that are available both within and across patients, the more flexibility models have. This richness does come at the price of the need for careful model exploration and checking. Many errors can be avoided with good study design that respects standard experimental principles and minimizes the risk of complexity caused by autocorrelation, for example by including washout periods to minimize carryover. Such design and modeling expertise is probably not within the realm of the average clinician and patient undertaking an n-of-1 study. Thus, it is crucial that standard protocols and analyses be available, especially in an automated and computerized format that promotes ease of use and robust designs and models.


Guidance Key Considerations Check
Treatment assignment needs to be balanced across treatment conditions, using either randomization or counterbalancing, along with blocking
  • Design needs to eliminate or mitigate potential confounding effects such as a time trend.
  • Pros and cons of randomization versus counterbalancing need to be considered carefully, and the approach selected accordingly. Counterbalancing is more effective if there is good information on a critical confounding effect, for example, a linear time trend. Randomization is more robust against unknown sources of confounding.
  • Blocking helps mitigate potential confounding with time trend, especially when early termination occurs.
Blind treatment assignment when feasible
  • Blinding of patients and clinicians, to the extent feasible, is particularly important for n-of-1 trials, especially with self-reported outcomes, when it is deemed necessary to eliminate or mitigate nonspecific effects ancillary to treatment.
  • Some nonspecific effects might continue beyond the end of trial within the individual patient, and therefore should be considered part of the treatment effect instead of a source of confounding.
Invoke appropriate measures to deal with potential bias due to carryover and slow onset effects
  • A washout period is commonly used to mitigate carryover effect.
  • Adverse interaction among treatments being compared indicates the need for a washout period.
  • Absence of active treatment during a washout period might pose an ethical dilemma and diminish user acceptance for active control trials.
  • Washout does not deal with slow onset of new treatment and might actually extend the duration of transition between treatment conditions.
  • Analytic methods can be useful for dealing with carryover and slow onset effects when repeated assessments are available within treatment periods.
Perform multiple assessments within treatment periods
  • Repeated assessments within treatment periods can enhance statistical information (precision of estimated treatment effect) and facilitate statistical approaches to address carryover and slow onset effects.
  • The costs and respondent burden need to be taken into consideration in decisions regarding frequency of assessments.
Consider adaptive trial designs and sequential stopping rules
  • Adaptive trial designs and sequential stopping rules can help improve trial efficiency and reduce patients' exposure to the inferior treatment condition.
Use appropriate statistical methods to analyze outcome data, taking into consideration important features of time-series data, including autocorrelation, time trend, and repeated measures within treatment periods
  • Mixed-effect models, autoregressive models, and dynamic models can be used to analyze time-series data from n-of-1 trials.
  • Nonparametric tests are easy to use but might not fully capture time-series features.
  • Significance testing is less pertinent for n-of-1 trials than the provision of the information needed for the users to make decisions for future treatments.
Use appropriate methods to handle multiple outcomes
  • Separate analyses and reporting of trial findings for multiple outcomes can accommodate the patient-centered nature of n-of-1 trials.
  • Explicit prespecification of weights across outcomes is preferable to post hoc weighting.
  • A composite index or scale can effectively synthesize information across related outcomes and reduce the burden on users to digest trial results across multiple outcomes.
Present results of statistical analysis in an informative and user-friendly manner
  • Customize the format of presentation to accommodate the needs and preferences of individual users.
  • Graphical presentation of trial results is easy to comprehend but might be complicated by autocorrelation, time trend, etc.
  • Posterior probabilities or odds based on a Bayesian framework are more interpretable for users than p-values based on a frequentist framework.
Borrow strength
  • Bayesian methods can be used to combine data across individuals participating in similar n-of-1 trials, to provide more precise estimates for individual treatment effects, and also to provide estimates for average treatment effects in the population to inform treatment decisions for patients not in the trials.
  • Network meta-analysis can be used to incorporate information from patients whose trials compared related, but not identical, sets of treatment conditions.


  1. Campbell DT, Stanley JC. Experimental and quasi-experimental designs for research. Chicago: Rand McNally; 1963.
  2. Shadish WR, Cook TD, Campbell DT. Experimental and quasi-experimental designs for generalized causal inference. Belmont, CA: Wadsworth; 2002.
  3. Ross M, Olson JM. An expectancy-attribution model of the effects of placebos. Psychol Rev. 1981;88(5):408-437.
  4. Rutherford BR, Marcus SM, Wang P, et al. A randomized, prospective pilot study of patient expectancy and antidepressant outcome. Psychol Med. 2013;43(5):975-982.
  5. Berry SM, Carlin BP, Lee JJ, et al. Bayesian Adaptive Methods for Clinical Trials. New York: CRC Press; 2010.
  6. Zelen M. Play the winner rule and the controlled clinical trial. J Am Stat Assoc. 1969;64(325):131-146.
  7. Wei LJ, Durham S. The randomized play-the-winner rule in medical trials. J Am Stat Assoc. 1978;73(364):840-843.
  8. Zucker DR, Ruthazer R, Schmid CH. Individual (N-of-1) trials can be combined to give population comparative treatment effect estimates: methodologic considerations. J Clin Epidemiol. 2010;63(12):1312-1323.
  9. Gabler NB, Duan N, Vohra S, et al. N-of-1 trials in the medical literature: a systematic review. Med Care. 2011;49(8):761-768.
  10. Schmid CH. Marginal and dynamic regression models for longitudinal data. Stat Med. 2001;20(21):3295-3311.
  11. Hogben L, Sim M. The self-controlled and self-recorded clinical trial for low-grade morbidity. Br J Prev Soc Med. 1953;7(4):163-179. Reprinted in Int J Epidemiol. 2011;40(6):1438-1454.
  12. Zucker DR, Ruthazer R, Schmid CH, et al. Lessons learned combining N-of-1 trials to assess fibromyalgia therapies. J Rheumatol. 2006;33(10):2069-2077.
  13. Senn S. Cross-Over Trials in Clinical Research, 2nd ed. Hoboken, NJ: Wiley; 2002.
  14. Guyatt GH, Keller JL, Jaeschke R, et al. The n-of-1 randomized controlled trial: clinical usefulness. Our 3-year experience. Ann Intern Med. 1990;112(4):293-299.
  15. Larson EB, Ellsworth AJ, Oas J. Randomized clinical-trials in single patients during a 2-year period. JAMA. 1993;270(22):2708-2712.
  16. McCullagh P, Nelder J. Generalized Linear Models, 2nd ed. New York: Chapman and Hall; 1989.
  17. Agresti A. Categorical Data Analysis, 2nd ed. Hoboken, NJ: Wiley; 2002.
  18. Gill CJ, Savin L, Schmid CH. Why clinicians are natural Bayesians. BMJ. 2005;330(7499):1080-1083.
  19. Chaloner K, Church T, Louis TA, et al. Graphical elicitation of a prior distribution for a clinical-trial. J R Stat Soc Series D Statis. 1993;42(4):341-353.
  20. Kratochwill TR, Hitchcock J, Horner RH, et al. Single-case design technical documentation. What Works Clearinghouse; 2010.
  21. Janosky JE, Leininger SL, Hoerger MP, et al. Single Subject Designs in Biomedicine. Springer; 2009.
  22. Duan N, Kravitz R, Schmid C. Single-patient (n-of-1) trials: a pragmatic clinical decision methodology for patient-centered comparative effectiveness research. J Clin Epidemiol. 2013;66(Suppl 1):S21-S28.
  23. Zucker DR, Schmid CH, McIntosh MW, et al. Combining single patient (N-of-1) trials to estimate population treatment effects and to evaluate individual patient responses to treatment. J Clin Epidemiol. 1997;50(4):401-410.
  24. Schmid CH, Brown EN. Bayesian hierarchical models. Methods Enzymol. 2000;321:305-330.
  25. Salanti G. Indirect and mixed-treatment comparison, network, or multiple-treatments meta-analysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Res Synth Method. 2012;3(2):80-97.
  26. Higgins J, Jackson D, Barrett J, et al. Consistency and inconsistency in network meta-analysis: concepts and models for multi-arm studies. Res Synth Method. 2012;3(2):98-110.
  27. Lu GB, Ades AE. Assessing evidence inconsistency in mixed treatment comparisons. J Am Stat Assoc. 2006;101(474):447-459.
  28. Jansen JP, Schmid CH, Salanti G. Directed acyclic graphs can help understand bias in indirect and mixed treatment comparisons. J Clin Epidemiol. 2012;65(7):798-807.


Schmid CH, Duan N, the DEcIDE Methods Center N-of-1 Guidance Panel. Statistical Design and Analytic Considerations for N-of-1 Trials. In: Kravitz RL, Duan N, eds, and the DEcIDE Methods Center N-of-1 Guidance Panel (Duan N, Eslick I, Gabler NB, Kaplan HC, Kravitz RL, Larson EB, Pace WD, Schmid CH, Sim I, Vohra S). Design and Implementation of N-of-1 Trials: A User’s Guide. AHRQ Publication No. 13(14)-EHC122-EF. Rockville, MD: Agency for Healthcare Research and Quality; January 2014: Chapter 4, pp. 33-53.

Project Timeline

Design and Implementation of N-of-1 Trials: A User's Guide

May 20, 2013
Topic Initiated
Feb 12, 2014
Research Report
Page last reviewed August 2019
Page originally created November 2017

Internet Citation: Research Report: Statistical Design and Analytic Considerations for N-of-1 Trials (Chapter 4). Content last reviewed August 2019. Effective Health Care Program, Agency for Healthcare Research and Quality, Rockville, MD.
