INTRODUCTION

Despite the evidence that supports the role of the physical examination (PE) in the assessment of a patient's disease1,2, there is considerable controversy as to whether the PE has outlived its usefulness. Some suggest that imaging may provide a more direct view into the body and prevent errors3. Others argue that the value of ancillary testing is overrated4; that it cannot replace a physician's ability to recognize familiar patterns of disease5,6; and that the failure to perceive the importance of the PE is due to poor teaching and learning of basic clinical skills5–8.

This latter criticism has led to major changes in the way the skills of patient interviewing9,10 and diagnostic reasoning11,12 are taught. However, teaching of the PE has remained unchanged. From the 1950s13 to the present time12, textbooks of the PE continue to offer a comprehensive and unselective compilation of signs, which include those no longer considered to be useful14, while often excluding important PE signs15. Some medical schools do encourage students to form diagnostic hypotheses early on while listening to the patient’s narrative and to conduct a targeted PE aimed at confirming or refuting these hypotheses8. However, we suspect that even in these schools, this educational approach is not reinforced during the subsequent clinical undergraduate training. During their clerkship and internship, students are expected to perform a complete PE along a predetermined sequence, which begins with the patient’s appearance and vital signs, and moves on to an examination from head to toe. Consequently, students often perform hasty PEs with frequent shortcuts that are unlikely to detect physical findings. This may explain the observed decline of students' breast examination skills over the course of their training16 and students’ reliance on “hard” laboratory and imaging data rather than on the clinical assessment of their patients4.

We believe that, in order to enhance students' appreciation of the PE, its teaching should focus on selected and important PE skills8 rather than overwhelm the learners with an all-inclusive list of signs. One possible way to discriminate between important and less important components of the PE would be to select for signs with proven diagnostic accuracy. Indeed, the call by Sackett and Rennie17 for a rational approach to the clinical examination triggered a series of publications (e.g.,18), reviews (e.g.,14) and a textbook19 dealing with the evidence base of the PE. Most of them emphasized the need for additional studies of the reliability and validity of PE signs. However, we know of no suggestions for reconciling the teaching of the PE with the paucity of evidence on the diagnostic accuracy of most PE signs.

The objective of this paper is to update the presently available evidence, or lack thereof, for the reliability and validity of PE signs, and suggest a modified approach to their teaching to undergraduate medical students. We chose the PE of the respiratory system because its importance has been debated ever since the advent of chest radiography20.

METHODS

We used Paper Chase21 to search Medline and Old Medline between 1966 and June 2009. An effort was made to identify all original published studies of the diagnostic accuracy of specific respiratory PE signs. We excluded reviews and studies of the diagnostic value of combinations of PE signs. First, we used the terms ['physical examination'] and ['respiratory' or 'pulmonary diseases' or 'lung diseases']. Of the 392 hits, only 9 were original studies of the reliability and validity of respiratory PE signs. Second, we used the terms ['physical examination' or 'auscultation' or 'percussion'] and ['pneumonia' or 'pleural effusion' or 'airway obstruction' or 'pneumothorax' or 'asthma' or 'pulmonary embolism'] and obtained 701 hits, which included 13 additional studies. Finally, we searched the reference sections of all relevant studies and identified another 38, some of which were published before 1966. After excluding 17 studies of the validity and 3 studies of the reliability of respiratory signs in children and infants, we were left with a total of 40 studies: 13 of the reliability22–34, 20 of the validity35–54 and 7 of both55–61.

Other search strategies, such as those using the terms ['physical examination'] and ['diagnostic accuracy'], or ['physical examination'] and ['sensitivity' or 'specificity' or 'reliability'] and [respiratory], did not identify additional studies. Our failure to capture most studies of the diagnostic accuracy of PE signs using conventional literature searches may have been due to less than optimal indexing.

Studies of diagnostic accuracy are subject to various sources of bias that may result from their design, selection of patients, performance of the test and analysis of data. The recognition that the quality of reporting of these studies is often deficient led to the development of a 25-item list of Standards for Reporting of Diagnostic Accuracy (STARD)62. Like the experts who developed this list, we realize that the methodology for designing and conducting studies of diagnostic accuracy requires further development. However, at present, these criteria are the best available measure of the value of such studies, including those of the diagnostic accuracy of the PE63. Therefore, we chose to evaluate the studies included in the present review by the degree of their adherence to the STARD checklist.

We reviewed all 40 publications and tabulated (1) the degree of their adherence to the STARD criteria62, (2) the reliability of the various respiratory PE signs and (3) their validity for detecting defined disorders. We use the term "reliability" interchangeably with the reproducibility of the findings that are obtained when the same PE is repeated on the same patient by the same or different examiners. Most commonly, reliability was reported as a kappa statistic on a scale from -1 (complete disagreement) through 0 (chance agreement) to +1 (perfect agreement). Values between 0 and 0.4 are commonly accepted as indicating low agreement; those between 0.4 and 0.6, fair agreement; between 0.6 and 0.8, good agreement; and between 0.8 and 1, perfect agreement64. Less frequently, reliability was presented as agreement rates between examiners (e.g.,22), coefficients of correlation (e.g.,29) or the "standard deviation agreement index" (e.g.,25), which, like the kappa statistic, measures inter-examiner agreement beyond that expected by chance. "Validity" refers to the ability of a PE sign to discriminate between patients with and without the disease under consideration, and it is expressed as sensitivity and specificity relative to an agreed-upon gold standard. The phrase "diagnostic accuracy," as used here, refers to the contribution of a given PE sign to establishing a diagnosis and to its usefulness in clinical practice.
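To make the kappa scale concrete, the following is a minimal sketch of how Cohen's kappa could be computed for two examiners who each record a PE sign as present (1) or absent (0) in the same patients. The function names, the example ratings and the handling of band boundaries are illustrative assumptions and are not taken from any of the reviewed studies; the band labels follow the cutoffs quoted above.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two examiners beyond chance."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of patients on whom the examiners agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each examiner's marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Map kappa to the labels used in the text (boundary handling assumed)."""
    if kappa < 0:
        return "worse than chance"
    if kappa < 0.4:
        return "low"
    if kappa < 0.6:
        return "fair"
    if kappa < 0.8:
        return "good"
    return "perfect"  # label used in the text for 0.8 to 1


# Hypothetical findings for one sign in ten patients (1 = present, 0 = absent).
examiner_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
examiner_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

kappa = cohen_kappa(examiner_a, examiner_b)
print(f"kappa = {kappa:.2f} ({agreement_band(kappa)} agreement)")
```

In this invented example the examiners agree on 8 of 10 patients, yet kappa is only about 0.58, because roughly half of that agreement would be expected by chance alone.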

RESULTS

(1) Adherence to the STARD criteria for diagnostic accuracy

Most of the reviewed studies were published before the development of the STARD criteria in 2003, and none of them adhered to all of these criteria. Most studies complied with the following criteria: they were prospective (33/40), provided an explicit or implicit definition of the study objectives (37/40) and data on the study populations (39/40), methods of patient recruitment (27/40) and methods of presentation of results (38/40). All validity studies presented their findings relative to a gold standard of diagnosis. In most studies, the test results were dichotomous (PE signs present or absent); in some studies, such as those of respiratory rates, the results were presented as above or below cutoff values. The examiners in 16 of the 20 reliability studies were blinded to the findings of other examiners, and those in 20 of the 27 validity studies were blinded to the results of the gold standard (data not shown in table format).

However, only a few of the reviewed studies adhered to the following criteria: only 2 of the 20 reliability studies and 4 of the 27 validity studies reported data on the severity of disease of the participating patients; only 10 reliability studies and 14 validity studies reported on attempts to use standardized PE procedures in order to enhance the consistency of the examiners' PE techniques; only 1 reliability study and 4 validity studies presented data on the number of qualifying patients who did not participate in the study; and only 7 validity studies reported the reliability of the various PE signs (data not shown in table format).

(2) Reliability (reproducibility) of respiratory PE signs (Tables 1, 2)

    Table 1 Reliability of Respiratory Physical Examination Signs Elicited by Inspection, Palpation and Percussion
    Table 2 Reliability of Respiratory Physical Examination Signs Elicited by Auscultation

The only study of the intra-examiner reproducibility that we know of found that the examiners disagreed with themselves in 11–26% of the cases and that pulmonary specialists were significantly less self-consistent than medical students32. Of the 20 reliability studies, only 4 studies32,34,60,61 reported inter-examiner agreement rates above k = 0.6 for one or more PE signs. Another five studies24,25,29,56,57 reported more than 90% or "almost total" inter-examiner agreement for at least one PE sign.

The following PE signs were reported by some, but not most, studies to have reliabilities of kappa = 0.6–1.0 or disagreement rates of 10% or less: chest movements, clubbing, vocal fremitus, dullness on percussion and reduced auscultatory percussion (Table 1); breath sound intensity, crepitations, vocal resonance (diminished) and wheezes (Table 2). Low to fair inter-examiner agreement rates, i.e., kappa = 0.0–0.6, were consistently reported in one or more studies for the following PE signs: cyanosis, respiratory rate, crico-sternal distance, deformities of the thorax, respiratory distress, position of trachea, hyperresonance on percussion, diaphragmatic expansion and cardiac dullness (Table 1); bronchial breathing, pectoriloquy, bronchophony, egophony, rhonchi, vocal resonance (increased), prolonged expiratory phase and pleural friction rub (Table 2).

(3) Sensitivity, specificity and likelihood ratios of respiratory PE signs for defined diseases (Tables 3, 4, 5)

    Table 3 Sensitivity, Specificity and Likelihood Ratios of Respiratory PE Signs Elicited by Inspection, Palpation and Percussion
    Table 4 Sensitivity, Specificity and Likelihood Ratios of Respiratory PE Signs Elicited by Auscultation
    Table 5 Clinical Contexts and Respiratory Physical Examination Signs with Probable Diagnostic Value

The vast majority of studies found sensitivity values of 0.5 or less (Tables 3, 4). High sensitivity values, with negative likelihood ratios (LR-) of 0.2 or less, were reported only for dullness on percussion in diagnosing pleural effusion among inpatients with respiratory symptoms (Tables 3, 5), for a forced expiratory time of more than 6 s in diagnosing obstructive airway disease (OAD) in asymptomatic plumbers screened for lung diseases, and for diminished breath sounds in diagnosing hemo-pneumothorax in the context of trauma (Tables 4, 5).

On the other hand, high specificity values with LR+ of 4.0 or more have been reported for PE signs, such as those of pulmonary consolidation, in as many as 17 of the 27 validity studies (Tables 3, 4). Table 5 lists the settings and clinical contexts in which respiratory PE signs may be useful for increasing or reducing the post-test odds of a specific diagnosis.
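For readers who wish to relate these measures to one another, the sketch below derives sensitivity, specificity, LR+ and LR- from a 2×2 table of a PE sign against a gold standard. The function name and the counts are hypothetical and chosen only for illustration; they do not come from Tables 3 to 5.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity, specificity, LR+ and LR- from 2x2 counts.

    tp: sign present, disease present    fp: sign present, disease absent
    fn: sign absent,  disease present    tn: sign absent,  disease absent
    """
    diseased = tp + fn
    healthy = fp + tn
    if diseased == 0 or healthy == 0:
        raise ValueError("need at least one patient with and one without disease")
    sensitivity = tp / diseased
    specificity = tn / healthy
    false_positive_rate = fp / healthy
    false_negative_rate = fn / diseased
    # LR+ = sensitivity / (1 - specificity); infinite if there are no false positives.
    lr_pos = sensitivity / false_positive_rate if fp else float("inf")
    # LR- = (1 - sensitivity) / specificity; infinite if there are no true negatives.
    lr_neg = false_negative_rate / specificity if tn else float("inf")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
    }


# Hypothetical study: 100 patients with and 100 without the disease.
result = diagnostic_accuracy(tp=40, fp=10, fn=60, tn=90)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts the sign has a sensitivity of only 0.40 but a specificity of 0.90, giving an LR+ of 4.0 and an LR- of about 0.67. This is the pattern described above: a sign that helps to confirm a diagnosis when present but does little to exclude it when absent.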

DISCUSSION

Two main findings emerge from the present review. First, most studies have found low to fair reliability values for respiratory PE signs. This finding detracts from the credibility of the validity studies and is consistent with the view that "clinical skills textbooks fail evidence-based examination"65. Second, since none of the reviewed studies complied with all of the STARD criteria62, their design may have been flawed. Consequently, the reported reliability (Tables 1, 2) and sensitivity (Tables 3, 4) values should be interpreted with caution.

A finding of poor reliability (i.e., high examiner variability) of a PE sign may indicate either that it has poor diagnostic accuracy or that some of the examiners had deficient PE skills. This latter possibility is consistent with the reported lack of improvement, or even deterioration, of respiratory6,32, breast16 and cardiac66 PE skills with seniority and experience. The possibility that examiners differed in their PE skills is also suggested by the reported variability in the approach to the respiratory PE of 403 members of the British Thoracic Society7. A lack of adherence to appropriate PE technique may also explain the low reliability found in studies of tachypnea25,33: examiners appear to rely on their subjective impression rather than on counting67. On the other hand, adherence to technique may explain the high reliabilities of respiratory PE signs that have been reported by some authors (e.g.,61).

The reported low to moderate sensitivity of respiratory PE signs may have been similarly due to deficient examination skills: an examiner with poor skills is likely to miss a PE finding. Alternatively, the reported sensitivities may have been confounded by differences in the severity of the diseases of the examined patients. Indeed, it has been reported that the sensitivity of reduced breath sounds55, pulsus paradoxus36 and Hoover's sign (an inward motion of the lower lateral rib cage with inspiration)51 increases with the severity of airway obstruction.

These two main flaws in the design of most studies, namely their failure to control for disease severity and examiners' skills, have probably led to underestimates of the reported reliability, sensitivity and specificity values. Future studies may avoid these biases by adhering to the STARD requirements. However, pending the publication of properly controlled studies, the reliability and validity of most respiratory PE signs remain uncertain. In 1986, Mulrow et al.32 concluded that "despite their routine use, most physical examination techniques, including pulmonary auscultation and percussion, are poorly standardized and of uncertain [diagnostic] value." This conclusion is also pertinent today.

The uncertain sensitivity of the respiratory PE argues against its utility for screening of asymptomatic persons. Screening for disease requires that the test used be highly sensitive, well above 0.7, which is not the case for most respiratory PE signs (Tables 3, 4). However, their low or uncertain reliability and validity do not preclude their usefulness in patients with suspected respiratory diseases for two reasons. First, the observed reliabilities and sensitivities may have been confounded by flaws in the study design. Second, the possible bias produced by these flaws would be toward underestimating the diagnostic accuracy of the various respiratory PE signs. Therefore, the high specificities and high LRs+ reported for some of these signs appear to be credible and to indicate that they may be useful in specific clinical contexts. For example, assuming that the pretest probability of pneumonia in outpatients with acute cough is 10% (odds 1:9)38, a finding of asymmetric expansion of the chest would increase the odds to 8:9, i.e., increase the post-test probability to 47%. Assuming a pretest probability of pneumonia of 12–30% (odds 1:9–3:7) among emergency room patients with fever and acute respiratory symptoms41, a finding of pleural friction rub would increase the odds of pneumonia to 5:9–15:7 or to a post-test probability of 36–68%. Future studies should explore whether the various respiratory PE signs provide independent information. In other words, it is at present uncertain whether a combination of PE signs (e.g., of pulmonary consolidation such as dullness on percussion, bronchial breathing and egophony) allows the multiplication of each of their LRs in order to assess the post-test probability of pneumonia.
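The arithmetic behind these examples can be sketched as follows. The calculation starts from the pretest odds quoted above and uses the likelihood ratios of 8 and 5 that are implied by the quoted change in odds (1:9 to 8:9, and 1:9 to 5:9). The function name is an illustrative assumption.

```python
from fractions import Fraction


def post_test_probability(pretest_odds, likelihood_ratio):
    """Post-test odds = pretest odds x LR; probability = odds / (1 + odds)."""
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Outpatients with acute cough: pretest odds 1:9, asymmetric chest expansion (LR 8).
p_cough = post_test_probability(Fraction(1, 9), 8)
print(f"{float(p_cough):.0%}")  # 47%

# Emergency room patients: pretest odds 1:9 to 3:7, pleural friction rub (LR 5).
p_low = post_test_probability(Fraction(1, 9), 5)
p_high = post_test_probability(Fraction(3, 7), 5)
print(f"{float(p_low):.0%} to {float(p_high):.0%}")  # 36% to 68%
```

Applying the function repeatedly, once for each sign in a combination, would amount to multiplying the LRs. That step is justified only if the signs provide independent information, which, as noted above, has not been established.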

Therefore, we believe that a meticulously performed respiratory PE, which aims to explore a diagnostic hypothesis, as opposed to a PE that aims to detect a disease in an asymptomatic person, remains a cornerstone of clinical practice. We propose that teaching of the PE should not discriminate among respiratory signs according to their presently uncertain reliabilities and sensitivities, but rather according to the importance of the disease under consideration and to their specificity.

The most important PE signs are, first, those of life-threatening conditions in any clinical context. For example, a patient who presents with any degree of respiratory abnormality (tachypnea, bradypnea, apnea, labored breathing, stridor, accessory muscle recruitment or paradoxical breathing) is in respiratory distress. Its detection mandates immediate treatment with oxygen and a sustained effort to establish the cause by looking for stridor (croup, epiglottitis), wheezes (bronchial asthma, bronchitis), reduced breath sounds and changes in percussion note (pneumothorax or pleural effusion), and signs suggesting pulmonary emboli. Second, we believe that teaching should emphasize the respiratory signs that have been reported to have high positive or low negative LRs (Table 5).

At the other end of the spectrum, the least important respiratory PE signs are those that are no longer employed in clinical practice because of the availability of more easily performed ancillary tests. For example, hand-held spirometry provides an easier and more precise assessment of obstructive airway disease than Hoover's sign and pulsus paradoxus. Spirometry may also alert physicians to the possibility of mild pulmonary disorders, and it may be used for monitoring patients with conditions such as asthma and cystic fibrosis. Similarly, pulse oximetry may detect reduced blood oxygenation at earlier stages than central cyanosis68. Therefore, we join the calls to incorporate pulse oximetry and spirometry into the PE, and to add hand-held oximeters and spirometers to the stethoscope, sphygmomanometer and reflex hammer that a doctor already uses during patient examination69.