Critically Reading the Literature and Understanding Clinical Research
A primer on how to determine what medical literature is valuable and what can be filed in the garbage can.
By J. Robert Mannino, DO, PhD, FACOFP
Literature is the bulletin board of scientific endeavor. It is through the
literature that the rest of the world is informed of discoveries and
techniques that have the ability to alter or save lives. Yet not all of the
literature is of equal quality. In the course of one month, at least 100
articles of varying worth will cross the average practitioner’s desk. In this
day and age of publish or perish, less-than-stellar work appears in print. It
has been facetiously said that 75 percent of the literature is not worth the
paper on which it is written.
How does one separate the wheat from the chaff? Is the article worth reading?
Can I reasonably place any confidence in the conclusions drawn? Is the conclusion
based on fact or is it merely the opinion or supposition of the author? How
does one critically read the literature?
To look critically at the literature, it is incumbent upon the physician to
understand the basic differences in the literature published. Not all literature
has the altruistic purpose of disseminating new information. At the zenith
of excellence is the category of journal that is peer reviewed and refereed;
the nadir is the journal that is a captive publication, neither peer reviewed
nor refereed, with the authors paid to produce articles on a theme. There are
all manner of gradations from top to bottom; obviously, the best journals are
those that are both peer reviewed and refereed. The easiest way to ascertain
this is to look in the Index Medicus: if the journal is abstracted in that
publication, then it is both peer reviewed and refereed.
What about articles in an unrefereed journal? Should you automatically discount
them? No, but one must be wary of anything that is in an unrefereed journal.
Usually there is a basic flaw in the article that prevents publication in a
refereed journal. This flaw may be something simple like inadequate numbers
in a study (a study of one) or major like drawing improper or premature conclusions
from limited data.
Usually, articles in unrefereed journals are anecdotal opinion, unsubstantiated ‘research,’ or
written by staff on the basis of interviews. The unrefereed journal can be
a good source of practice and/or patient tips (how to build a practice or how
to remove a fishhook, respectively). In general, anything that remotely smacks
of research that appears in an unrefereed journal should be approached with
extreme caution.
Even within the category of peer reviewed and refereed journal articles, one
must be critical of what was done and how it was done. In general, the obvious
foibles will be attended to by the review process. These include, but are not
limited to, quality and quantity of literature cited, use of textbooks as
references, self reference, faulty thought processes, faulty experimental
design, and unsubstantiated claims. What will not be addressed are the
conclusions drawn from the data presented, incomplete studies (a portion of an
ongoing study is presented, but the total study result is years away), and
extrapolation based on presented data sets.
For this evaluation, a rudimentary knowledge of statistics and its application
to medicine is necessary. There is a vocabulary that must be understood. These
specialized terms include: bias, clinically significant result, contributory
cause, control(s), crossover study, dependent variable(s), double blind, experimental
design, extraneous variable(s), free extrapolation, independent variable(s),
p value, paired study, prospective study, reliability, retrospective study,
sample, sensitivity, sham, significance, single blind, specificity, standard
deviation, statistically significant result, trend(s), variable(s), variance,
and validity. In addition to this vocabulary, the statistical techniques commonly
encountered in the literature include: chi-square, Student’s t-test, analysis
of variance, specificity and sensitivity, linear coefficients, linear regression,
multiple regression, matrix analysis, and sequential analysis.
Vocabulary
While it is not the intention of this discourse either to turn the reader off
or to make a statistician out of the reader, it is the intent to enable the
reader to intelligently evaluate what is read. The following is not designed
to be all-inclusive or to take into account ‘what if’ scenarios;
rather, it is intended to address the commonly used terms and methods of
clinically relevant statistics and their use in the literature. First, let
us define the vocabulary in terms of working definitions:
Bias: Judgment or opinion formed before fact. The exclusion of women in a hypertension study has the potential of biasing the results.
Clinically significant result: The results obtained have an important impact on patient care.
Contributory cause: The cause precedes the effect and altering the cause alters the effect.
Control(s): Part of an experimental design that serves as the frame of reference and does not participate in the active or experimental treatment.
Crossover study: A type of experimental design in which the active or experimental treatment is switched for the control and vice versa.
Dependent variable(s): The variable of interest; the outcome that is measured.
Double blind: A type of experimental design in which neither the investigator nor the subject knows who is receiving the active or experimental treatment.
Experimental design: The way a study is structured, ideally so that it does not contain bias and the results obtained are statistically significant and have validity and reliability.
Extraneous variable(s): The item or items that can affect the outcome of a study, but are not being studied.
Free extrapolation: Drawing conclusions for one population based on the results of another non-linked population.
Independent variable(s): The item or items that affect the dependent variable and are varied by the experimenter.
Meta analysis: A systematic method that uses statistical analysis to integrate the data from a number of independent studies.
P value: The probability of obtaining results at least as extreme as those observed if chance (sampling variation) alone were at work. The smaller the p value, the stronger the support for the hypothesis.
Paired study: An experimental design in which all study groups are paired, as closely as possible, for known variables.
Prospective study: A study in which treatment and outcome begin after the start of, and because of, the study.
Reliability: The results obtained are consistent and repeatable.
Retrospective study: A study in which treatment and outcome have occurred, or began, prior to the onset of the study.
Sample: The tested population. The statistically valid small sample is 30.
Sensitivity: In diagnostic tests, the proportion of diseased subjects who have a positive test.
Sham: The process of mimicking a procedure without performing the actual definitive procedure.
Significance: The extent to which the results support the hypothesis.
Single blind: A type of experimental design in which only the investigator knows who is receiving the active or experimental treatment.
Specificity: In diagnostic tests, the proportion of disease-free subjects who have a negative test.
Standard deviation: The square root of variance.
Statistically significant result: The results obtained support the hypothesis and are unlikely to have arisen by random chance alone.
Trend(s): The general direction of the data, which does not meet all the rigid criteria of statistical significance.
Variable(s): The item or items in a study that are measured.
Variance: A measure of how far individual values are spread around the value of the center (the mean).
Validity: The experimental design measures what it purports to measure.
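A few of these working definitions reduce to simple arithmetic, and it may help to see them written out. The following is only a sketch: the function names, the screening-test counts, and the blood pressure readings are hypothetical and are not taken from any study discussed here.

```python
import math

def sensitivity(true_pos, false_neg):
    """Proportion of diseased subjects who have a positive test."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of disease-free subjects who have a negative test."""
    return true_neg / (true_neg + false_pos)

def sample_variance(values):
    """Average squared deviation from the mean, using n - 1 in the denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def standard_deviation(values):
    """The square root of variance."""
    return math.sqrt(sample_variance(values))

# Hypothetical screening test: 100 diseased and 200 disease-free subjects.
sens = sensitivity(true_pos=90, false_neg=10)    # 90 of 100 diseased test positive
spec = specificity(true_neg=160, false_pos=40)   # 160 of 200 disease-free test negative

# Hypothetical systolic pressures in mm Hg (mean 124).
pressures = [118.0, 122.0, 130.0, 126.0, 124.0]
var = sample_variance(pressures)      # 80 / 4 = 20.0
sd = standard_deviation(pressures)    # square root of 20, about 4.47

print(f"sensitivity={sens:.2f} specificity={spec:.2f} variance={var:.1f} sd={sd:.2f}")
```

Note that the variance here divides by n - 1, the usual convention for a sample; dividing by n instead describes a complete population.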
Now let us define statistical techniques and when they are to be used. Again,
this is not all-inclusive and does not address the exceptions; rather, it is
aimed at the majority.
Used with one variable:
Chi-square: This test examines the association between a single independent variable and a dependent variable.
Student’s t-test: This test is used for measured variables, in comparing two means.
Used with two or more variables:
Analysis of variance: This test allows comparison between more than two sample means.
Linear coefficients: The slope of the straight line produced by a linear regression.
Linear regression: A statistical treatment of data by which two continuous variables are fitted to a straight line.
Matrix analysis: A mathematical treatment of large quantities of variables in an attempt to ascertain which are independent and which are dependent variables.
Multiple regression: A statistical treatment of data by which several independent variables can be used to predict a dependent variable.
Sequential analysis: A form of integrated experimental design and statistical analysis that allows for adjustment to the effect of repeated significance testing.
Specificity and sensitivity: This type of analysis, used primarily with diagnostic testing, allows adjustment for false negatives versus false positives.
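To make three of these techniques concrete, here is a minimal sketch of the test statistic behind chi-square, Student’s t-test, and analysis of variance. The function names and all of the numbers are hypothetical, chosen only so the arithmetic can be followed by hand. Each function returns the statistic and its degrees of freedom; converting that to a p value requires the matching reference distribution (a table or a statistics package), which is deliberately left out here.

```python
import math
from statistics import mean, variance

def chi_square(table):
    """Chi-square statistic for a table of counts (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def student_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    std_error = math.sqrt(pooled * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / std_error, na + nb - 2

def anova_f(groups):
    """One-way analysis of variance: F statistic and its two degrees of freedom."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

# Hypothetical counts: rows are treated / untreated, columns are improved / not improved.
chi2 = chi_square([[30, 20], [10, 40]])

# Hypothetical measurements for two groups, then for three groups.
t_stat, t_df = student_t([5.0, 6.0, 7.0, 8.0, 9.0], [3.0, 4.0, 5.0, 6.0, 7.0])
f_stat, df_b, df_w = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])

print(f"chi-square={chi2:.2f}  t={t_stat:.2f} (df={t_df})  F={f_stat:.2f} (df={df_b},{df_w})")
```

The pattern is the same in all three: a difference between what was observed and what chance alone would predict, scaled by the variation in the data.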
Clinical Research
Having given working definitions for the terms and techniques commonly used,
what are the pitfalls to look for in clinical research? They basically fall
into three categories:
The number of subjects in a study
The experimental design
The statistical method used to ascertain the worth of the results
Perhaps the most flagrant abuse committed under the rubric of research is the
lack of sufficient numbers in a study. In general, the greater the number in a
study, the greater the likelihood of usable results. Obviously, it would be
great if all clinical research were conducted with a minimum of 10,000
subjects. This is neither feasible nor desirable.
The Declaration of Helsinki requires that a research protocol be terminated if a part
of the protocol proves to be detrimental to the subjects. Moreover, if a part
of a research treatment is obviously superior in terms of patient safety, comfort,
or morbidity, the same Declaration requires that the project be terminated. Due
to the difficulty in amassing large numbers for clinical trials, most studies
are done with the statistically significant small population, which is 30.
This means that in a three-compartment (three-group) study, there must be a
minimum of 90 participants. Clearly, this is not always the case, and there are
mathematical adjustments that can be made.
In general, the more variation in the results, the greater the number necessary
for significance, reliability, and validity. Bottom line: If there are fewer
than 30 subjects in a study, be wary of the results and how they are interpreted.
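The link between variation and the number of subjects can be put in numbers. The sketch below uses a common normal-approximation formula for comparing two group means; the formula, the function name, and the default alpha of 0.05 and power of 0.80 are assumptions brought in for illustration and are not taken from this article. It is not a derivation of the figure of 30.

```python
import math
from statistics import NormalDist

def subjects_per_group(sd, difference, alpha=0.05, power=0.80):
    """Approximate subjects needed in each of two groups to detect a given
    difference in means, assuming a common standard deviation `sd`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)            # desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / difference) ** 2)

# Hypothetical target: detect a 5 mm Hg difference in blood pressure.
n_low_variation = subjects_per_group(sd=10, difference=5)    # 63 per group
n_high_variation = subjects_per_group(sd=20, difference=5)   # 252 per group

print(n_low_variation, n_high_variation)
```

Doubling the standard deviation roughly quadruples the number of subjects required, which is why a small study of a highly variable outcome deserves particular suspicion.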
Experimental design revolves around how the study was conducted. Ideally, this
is an a priori function. The researcher makes every attempt possible to design
a study that has only one variable. In basic research, this is relatively simple,
because genetically homogeneous subjects that can be completely controlled are
available. In general, clinical research cannot be reduced to one variable.
There are many reasons for this. Suffice it to say that since the genetic and
environmental makeup of individual patients is different, the impact of treatment
is different and cannot necessarily be isolated to the treatment given. This
mandates the need for multivariable experimental design in most clinical research.
One of the most difficult types of clinical research to evaluate is the retrospective
study. The reason for this difficulty is that the data are collected after the
fact and the usual safeguards to ensure random sampling are absent. Not only
are the data in this type of study potentially biased, but the conclusions drawn
may well suffer from free extrapolation.
Even under the best of circumstances and design, the results of a retrospective
study should be used only to identify trends and areas for definitive research.
Prospective studies, on the other hand, are conducted with all proper research
safeguards in place.
EXEMPLI GRATIA: [Royal College of General Practitioners’ Oral Contraception
Study, J. Coll. Gen. Pract. 13(5): 267. 1967]. The thrust of this study was
a finding that oral contraception use was associated with a sixfold to tenfold
increase in the incidence of thrombophlebitis in women. Furthermore, there
was an increased mortality associated with oral contraceptive usage.
FLAWS: This was a retrospective study, i.e., records were tabulated post facto
and conclusions drawn on the basis of those results.
Basically, the study looked at all cases of thromboembolic phenomena and death
in women in the United Kingdom over a ten-year period and whether or not the
individual had ever taken oral contraceptives. If one looks at the data carefully,
it becomes readily apparent that oral contraceptive use was not the study focus;
rather, the focus was the incidence of oral contraceptive use among those with
a malady versus those without that malady. Moreover, free extrapolation was
used in the conclusions. This immediately brings into play the concept of cause
and effect relationships and the introduction of bias.
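A retrospective comparison of this kind, exposure among those with a malady versus exposure among those without it, is usually summarized as an odds ratio. The sketch below shows the arithmetic; the odds ratio itself, the function names, and the counts are all assumptions for illustration. The counts are hypothetical and are not the data of the British study.

```python
import math

def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

def odds_ratio_interval(exposed_cases, unexposed_cases,
                        exposed_controls, unexposed_controls, z=1.96):
    """Approximate 95 percent confidence interval, computed on the log scale."""
    log_or = math.log(odds_ratio(exposed_cases, unexposed_cases,
                                 exposed_controls, unexposed_controls))
    std_error = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                          + 1 / exposed_controls + 1 / unexposed_controls)
    return math.exp(log_or - z * std_error), math.exp(log_or + z * std_error)

# Hypothetical chart review: 20 women with the malady, 100 without.
counts = dict(exposed_cases=12, unexposed_cases=8,
              exposed_controls=20, unexposed_controls=80)
ratio = odds_ratio(**counts)                 # (12 * 80) / (8 * 20) = 6.0
lower, upper = odds_ratio_interval(**counts)

print(f"odds ratio={ratio:.1f}, interval {lower:.1f} to {upper:.1f}")
```

Even a sixfold ratio from such a table shows association only. With so few subjects the interval is wide, and nothing in the arithmetic establishes that the exposure preceded or produced the malady.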
To check on the possibility of an actual association of oral contraception
use with an increased mortality and morbidity, a prospective, multifaceted
study was begun. This study was designed to accurately test all of the questions
raised by the above-mentioned British study. The study is known as the Walnut
Creek Study and the results have been published. [Walnut Creek Contraceptive
Drug Study, J. Reprod. Med. 25(6 Suppl): 345-72 1980 and Long-term Follow-up
of Women in the Walnut Creek Study, Obstet Gynecol 70(3 Pt 1): 289-93 1987].
The results of the Walnut Creek Study refute all of the blanket condemnations
of oral contraceptive usage. In fact, in the non-smoker, the use of oral contraceptives
has a salutary effect. The only verification of the British findings is in
a subset of users who are over the age of 35 and also smoke.
The blinding of a protocol is a little understood and often overlooked element
of clinical research, but it is integral to a good experimental design. The
goal of blinding is to eliminate bias and the so-called placebo effect. Blinding
may take many forms. The simplest form is a single blind study in which the
subject does not know whether or not he is receiving the active or the control
treatment.
Next comes the double blind study in which neither the subject nor the operator
knows who is receiving the active or the control treatment. To these may be
added a sham study and/or a crossover study. Blinding is one of the more difficult
elements of experimental design to implement and is the principal reason that
there have been no long-term studies on the efficacy of osteopathic manipulation.
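One way to picture how blinding is put into practice is the allocation list. In the sketch below, subjects are randomly assigned an opaque code, and the key linking each code to a treatment is kept apart from everyone who deals with the subjects. The function name, the codes ‘A’ and ‘B’, and the seed are hypothetical; this is an illustration of the idea, not a validated randomization procedure.

```python
import random

def blinded_allocation(subject_ids, seed=None):
    """Return (assignments, key) for a two-arm blinded study.

    assignments maps each subject to an opaque code, 'A' or 'B'. In a double
    blind design this is all that the subject and the investigator see.
    key maps each code to the real arm and is held by a third party until
    the study is unblinded.
    """
    rng = random.Random(seed)
    codes = ["A", "B"]
    rng.shuffle(codes)                       # hide which letter means "active"
    key = {codes[0]: "active", codes[1]: "control"}
    # Alternate the codes so the arms stay balanced, then shuffle the order.
    sequence = [codes[i % 2] for i in range(len(subject_ids))]
    rng.shuffle(sequence)
    assignments = dict(zip(subject_ids, sequence))
    return assignments, key

# Hypothetical study of 30 subjects numbered 1 through 30.
assignments, key = blinded_allocation(list(range(1, 31)), seed=2024)
print(sum(code == "A" for code in assignments.values()), "subjects received code A")
```

Adding a sham simply means that the control arm receives a mimicked procedure, so that the code on the list is the only thing distinguishing the two groups.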
In its simplest terms, experimental design is aimed at using a particular statistical
treatment. The choice of analysis is dependent on, among other things, the
number of subjects, the number of variables, the length of the study, and the
distribution of results. Ordinarily, only one method of analysis will be used;
however, as clinical research becomes more sophisticated or involved, especially
in multicenter studies, multiple methods may have to be used. Next are the guidelines
for the statistical methods used.
If chi-square or Student’s t-test is the statistical method of a clinical research
article, reject the work. These statistical methods are used primarily in basic,
not clinical, research, where the experimental design can be limited to one variable.
This type of analysis can ascertain a probable trend, but not significance,
in multivariable trials.
A p of 0.025 or less is considered significant for biological systems; however,
this level of significance does not address the inherent variability of living
systems. The ‘gold standard’ for statistical evaluation of good
clinical data is either analysis of variance or sequential analysis. A p of
0.01 or less assures that the results obtained are the product of the treatment
tested and not due to sampling variation (chance). Caveat: The clinical implication
of a treatment is a clinical and not a statistical decision; therefore, if
the difference in outcome of a treatment is imperceptible, no level of p can
alter that fact.
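The caveat can be shown with arithmetic. The sketch below applies a simple two-sample z test with an assumed known standard deviation; the test, the function name, and the numbers are hypothetical and serve only to show how the same imperceptible difference behaves at two study sizes.

```python
import math
from statistics import NormalDist

def two_sided_p(difference, sd, n_per_group):
    """Two-sided p value for a difference in means between two equal groups,
    assuming a known common standard deviation (z test)."""
    std_error = sd * math.sqrt(2 / n_per_group)
    z = difference / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical: a 0.5 mm Hg difference in blood pressure, standard deviation 10.
p_small_study = two_sided_p(difference=0.5, sd=10, n_per_group=30)      # far from significant
p_huge_study = two_sided_p(difference=0.5, sd=10, n_per_group=20000)    # far below 0.01

print(f"n=30: p={p_small_study:.3f}   n=20000: p={p_huge_study:.1e}")
```

The difference is the same half a millimeter of mercury in both cases. Only the number of subjects changed, and no patient would notice the benefit in either study.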
If the data do not meet the criteria for these gold standard tests, then
lesser statistical methods may be used, and these may yield significant results.
This significance, however, does not assure that the association observed is
the result of the treatment and not sampling variation.
Among these statistical modalities are linear coefficients, linear regressions,
multiple regressions, etc. Be wary of coefficients of correlation less than
0.75 (ideal is 1.0; if 0.85 to 1.0 then a gold standard test would have been
used). As the correlation becomes lower than 0.85, p must be less than 0.001
to ensure that, based on the sample data, the observations are the result of
the treatment and not sample variation.
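For readers who want to see where a linear coefficient and a coefficient of correlation come from, the following is a minimal least-squares sketch. The function name and the five data points are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares straight line through paired values.

    Returns (slope, intercept, r), where slope is the linear coefficient
    and r is the coefficient of correlation.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical dose (x) and response (y) for five subjects.
dose = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, intercept, r = fit_line(dose, response)   # slope 0.6, intercept 2.2, r about 0.77

print(f"response = {intercept:.1f} + {slope:.1f} * dose, r = {r:.2f}")
```

Here r comes out at about 0.77, barely above the 0.75 threshold named above, and it rests on only five subjects, which is exactly the kind of result that calls for caution.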
When multiple factors are involved, matrix analysis is the statistical treatment
of choice. This method demonstrates trends only.
It does not take any great intellect to ascertain that an article depicting
a clinical study of 10 male patients with disease ‘A’, subjected
to treatment ‘B’, analyzed by Student’s t-test, and touted as
proper for the general population is not good research and should not cause
the reader to depart from accepted norms of treatment. Remember: just because
it is in print does not necessarily make it true!
EXEMPLI GRATIA: [A Difference in Hypothalamic Structure Between Heterosexual
and Homosexual Men. Simon LeVay Science 253(35): 1034-1037. 1991]. This study
purports to have found an area in the hypothalamus that is different in the
homosexual versus that of the heterosexual. On the basis of this finding, it
is proposed that there is a genetic/anatomic reason for homosexuality. FLAWS:
The evidence was garnered on the basis of autopsy. There are very few subjects
in each group of the study, well below the statistically significant small number of 30.
All of the homosexual men (19) in the study died of acquired immunodeficiency
syndrome (AIDS), which is caused by a retrovirus that has the capability of
altering neuronal structure. Six of the sixteen presumed heterosexual men died
of AIDS.
The findings in healthy individuals may be significantly different. The heterosexuals
in the study were deemed heterosexual only because the chart lacked sexual preference
data. There is significant variation in both groups, leading one to wonder
whether or not the results obtained were due to inherent variation or actual
difference. Women, due to the way the clinical material was obtained, were
excluded from the study. Although there are problems with this study, it may
have identified trends that deserve further, definitive investigation.
Armed with this knowledge, one may now critically look at the literature and
make informed decisions as to whether or not what is read is worth incorporating
into patient care. Moreover, when encountering ‘This was in the literature’ one
can now ask the pertinent questions about the worth of the information.
J. Robert Mannino, DO, PhD, FACOFP,
practices in Coral Springs, Florida and can be contacted via e-mail
at jrmannino@worldnet.att.net.
Selected Readings
Intuitive Biostatistics, H. Motulsky, Oxford University Press, UK, 1995.
Statistics Applied to Clinical Trials, T. F. Cleophas, et al., Kluwer Academic Publishers, New York, NY, 2002.
Practical Statistics for Medical Research, D. G. Altman, CRC Press, Boca Raton, FL, 1990.
Clinical Trials and Human Research, F. A. Rozovsky and R. K. Adams, Jossey-Bass Publisher, San Francisco, CA, 2003.
Design and Analysis of Experiments, 5th Edition, D. C. Montgomery, John Wiley & Sons, Hoboken, NJ, 2000.
Using Multivariate Statistics, 4th Edition, B. G. Tabachnick, et al., Pearson Allyn & Bacon, Upper Saddle River, NJ, 2000.
Handbook of Parametric and Nonparametric Statistical Procedures, 2nd Edition, D. J. Sheskin, CRC Press, Boca Raton, FL, 2000.
Handbook of Statistical Analyses Using SAS, 2nd Edition, G. Der and B. S. Everitt, CRC Press, Boca Raton, FL, 2001.