Theory and Observation in Science
First published Tue Jan 6, 2009; substantive revision Mon Jun 14, 2021
Scientists obtain a great deal of the evidence they use by collecting and producing empirical results. Much of the standard philosophical literature on this subject comes from 20th century logical empiricists, their followers, and critics who embraced their issues while objecting to some of their aims and assumptions. Discussions about empirical evidence have tended to focus on epistemological questions regarding its role in theory testing. This entry follows that precedent, even though empirical evidence also plays important and philosophically interesting roles in other areas including scientific discovery, the development of experimental tools and techniques, and the application of scientific theories to practical problems.
The logical empiricists and their followers devoted much of their attention to the distinction between observables and unobservables, the form and content of observation reports, and the epistemic bearing of observational evidence on theories it is used to evaluate. Philosophical work in this tradition was characterized by the aim of conceptually separating theory and observation, so that observation could serve as the pure basis of theory appraisal. More recently, the focus of the philosophical literature has shifted away from these issues, and their close association with the languages and logics of science, to investigations of how empirical data are generated, analyzed, and used in practice. With this shift, we also see philosophers largely setting aside the aspiration of a pure observational basis for scientific knowledge and instead embracing a view of science in which the theoretical and empirical are usefully intertwined. This entry discusses these topics under the following headings:
2. Observation and data
2.1 Traditional empiricism
2.2 The irrelevance of observation per se
2.3 Data and phenomena
3. Theory and value ladenness
3.2 Assuming the theory to be tested
4. The epistemic value of empirical evidence
4.2 Saving the phenomena
4.3 Empirical adequacy
Other Internet Resources
1. Introduction
Philosophers of science have traditionally recognized a special role for observations in the epistemology of science. Observations are the conduit through which the ‘tribunal of experience’ delivers its verdicts on scientific hypotheses and theories. The evidential value of an observation has been assumed to depend on how sensitive it is to whatever it is used to study. But this in turn depends on the adequacy of any theoretical claims its sensitivity may depend on. For example, we can challenge the use of a particular thermometer reading to support a prediction of a patient’s temperature by challenging theoretical claims having to do with whether a reading from a thermometer like this one, applied in the same way under similar conditions, should indicate the patient’s temperature well enough to count in favor of or against the prediction. At least some of those theoretical claims will be such that regardless of whether an investigator explicitly endorses, or is even aware of them, her use of the thermometer reading would be undermined by their falsity. All observations and uses of observational evidence are theory laden in this sense (cf. Chang 2005, Azzouni 2004). As the example of the thermometer illustrates, analogues of Norwood Hanson’s claim that seeing is a theory laden undertaking apply just as well to equipment generated observations (Hanson 1958, 19). But if all observations and empirical data are theory laden, how can they provide reality-based, objective epistemic constraints on scientific reasoning?
Recent scholarship has turned this question on its head. Why think that theory ladenness of empirical results would be problematic in the first place? If the theoretical assumptions with which the results are imbued are correct, what is the harm of it? After all, it is in virtue of those assumptions that the fruits of empirical investigation can be ‘put in touch’ with theorizing at all. A number scribbled in a lab notebook can do a scientist little epistemic good unless she can recruit the relevant background assumptions to even recognize it as a reading of the patient’s temperature. But philosophers have embraced an entangled picture of the theoretical and empirical that goes much deeper than this. Lloyd (2012) advocates for what she calls “complex empiricism” in which there is “no pristine separation of model and data” (397). Bogen (2016) points out that “impure empirical evidence” (i.e. evidence that incorporates the judgements of scientists) “often tells us more about the world than it could have if it were pure” (784). Indeed, Longino (2020) has urged that “[t]he naïve fantasy that data have an immediate relation to phenomena of the world, that they are ‘objective’ in some strong, ontological sense of that term, that they are the facts of the world directly speaking to us, should be finally laid to rest” and that “even the primary, original, state of data is not free from researchers’ value- and theory-laden selection and organization” (391).
There is not widespread agreement among philosophers of science about how to characterize the nature of scientific theories. What is a theory? According to the traditional syntactic view, theories are considered to be collections of sentences couched in logical language, which must then be supplemented with correspondence rules in order to be interpreted. Construed in this way, theories include maximally general explanatory and predictive laws (Coulomb’s law of electrical attraction and repulsion, and Maxwellian electromagnetism equations for example), along with lesser generalizations that describe more limited natural and experimental phenomena (e.g., the ideal gas equations describing relations between temperatures and pressures of enclosed gasses, and general descriptions of positional astronomical regularities). In contrast, the semantic view casts theories as the space of states possible according to the theory, or the set of mathematical models permissible according to the theory (see Suppe 1977). However, there are also significantly more ecumenical interpretations of what it means to be a scientific theory, which include elements of diverse kinds. To take just one illustrative example, Borrelli (2012) characterizes the Standard Model of particle physics as a theoretical framework involving what she calls “theoretical cores” that are composed of mathematical structures, verbal stories, and analogies with empirical references mixed together (196). This entry aims to accommodate all of these views about the nature of scientific theories.
In this entry, we trace the contours of traditional philosophical engagement with questions surrounding theory and observation in science that attempted to segregate the theoretical from the observational, and to cleanly delineate between the observable and the unobservable. We also discuss the more recent scholarship that supplants the primacy of observation by human sensory perception with an instrument-inclusive conception of data production and that embraces the intertwining of theoretical and empirical in the production of useful scientific results. Although theory testing dominates much of the standard philosophical literature on observation, much of what this entry says about the role of observation in theory testing applies also to its role in inventing, and modifying theories, and applying them to tasks in engineering, medicine, and other practical enterprises.
2. Observation and data
2.1 Traditional empiricism
Reasoning from observations has been important to scientific practice at least since the time of Aristotle, who mentions a number of sources of observational evidence including animal dissection (Aristotle(a), 763a/30–b/15; Aristotle(b), 511b/20–25). Francis Bacon argued long ago that the best way to discover things about nature is to use experiences (his term for observations as well as experimental results) to develop and improve scientific theories (Bacon 1620, 49ff). The role of observational evidence in scientific discovery was an important topic for Whewell (1858) and Mill (1872) among others in the 19th century. But philosophers didn’t talk about observation as extensively, in as much detail, or in the way we have become accustomed to, until the 20th century when logical empiricists transformed philosophical thinking about it.
One important transformation, characteristic of the linguistic turn in philosophy, was to concentrate on the logic of observation reports rather than on objects or phenomena observed. This focus made sense on the assumption that a scientific theory is a system of sentences or sentence-like structures (propositions, statements, claims, and so on) to be tested by comparison to observational evidence. It was assumed that the comparisons must be understood in terms of inferential relations. If inferential relations hold only between sentence-like structures, it follows that theories must be tested, not against observations or things observed, but against sentences, propositions, etc. used to report observations (Hempel 1935, 50–51; Schlick 1935). Theory testing was treated as a matter of comparing observation sentences describing observations made in natural or laboratory settings to observation sentences that should be true according to the theory to be tested. This was to be accomplished by using laws or lawlike generalizations along with descriptions of initial conditions, correspondence rules, and auxiliary hypotheses to derive observation sentences describing the sensory deliverances of interest. This makes it imperative to ask what observation sentences report.
According to what Hempel called the phenomenalist account, observation reports describe the observer’s subjective perceptual experiences.
… Such experiential data might be conceived of as being sensations, perceptions, and similar phenomena of immediate experience. (Hempel 1952, 674)
This view is motivated by the assumption that the epistemic value of an observation report depends upon its truth or accuracy, and that with regard to perception, the only thing observers can know with certainty to be true or accurate is how things appear to them. This means that we cannot be confident that observation reports are true or accurate if they describe anything beyond the observer’s own perceptual experience. Presumably one’s confidence in a conclusion should not exceed one’s confidence in one’s best reasons to believe it. For the phenomenalist, it follows that reports of subjective experience can provide better reasons to believe claims they support than reports of other kinds of evidence.
However, given the expressive limitations of the language available for reporting subjective experiences, we cannot expect phenomenalistic reports to be precise and unambiguous enough to test theoretical claims whose evaluation requires accurate, fine-grained perceptual discriminations. Worse yet, if experiences are directly available only to those who have them, there is room to doubt whether different people can understand the same observation sentence in the same way. Suppose you had to evaluate a claim on the basis of someone else’s subjective report of how a litmus solution looked to her when she dripped a liquid of unknown acidity into it. How could you decide whether her visual experience was the same as the one you would use her words to report?
Such considerations led Hempel to propose, contrary to the phenomenalists, that observation sentences report ‘directly observable’, ‘intersubjectively ascertainable’ facts about physical objects
… such as the coincidence of the pointer of an instrument with a numbered mark on a dial; a change of color in a test substance or in the skin of a patient; the clicking of an amplifier connected with a Geiger counter; etc. (ibid.)
That the facts expressed in observation reports be intersubjectively ascertainable was critical for the aims of the logical empiricists. They hoped to articulate and explain the authoritativeness widely conceded to the best natural, social, and behavioral scientific theories in contrast to propaganda and pseudoscience. Some pronouncements from astrologers and medical quacks gain wide acceptance, as do those of religious leaders who rest their cases on faith or personal revelation, and leaders who use their political power to secure assent. But such claims do not enjoy the kind of credibility that scientific theories can attain. The logical empiricists tried to account for the genuine credibility of scientific theories by appeal to the objectivity and accessibility of observation reports, and the logic of theory testing. Part of what they meant by calling observational evidence objective was that cultural and ethnic factors have no bearing on what can validly be inferred about the merits of a theory from observation reports. So conceived, objectivity was important to the logical empiricists’ criticism of the Nazi idea that Jews and Aryans have fundamentally different thought processes such that physical theories suitable for Einstein and his kind should not be inflicted on German students. In response to this rationale for ethnic and cultural purging of the German educational system, the logical empiricists argued that because of its objectivity, observational evidence (rather than ethnic and cultural factors) should be used to evaluate scientific theories (Galison 1990). In this way of thinking, observational evidence and its subsequent bearing on scientific theories are objective also in virtue of being free of non-epistemic values.
Ensuing generations of philosophers of science have found the logical empiricist focus on expressing the content of observations in a rarefied and basic observation language too narrow. Search for a suitably universal language as required by the logical empiricist program has come up empty-handed and most philosophers of science have given up its pursuit. Moreover, as we will discuss in the following section, the centrality of observation itself (and pointer readings) to the aims of empiricism in philosophy of science has also come under scrutiny. However, leaving the search for a universal pure observation language behind does not automatically undercut the norm of objectivity as it relates to the social, political, and cultural contexts of scientific research. Pristine logical foundations aside, the objectivity of ‘neutral’ observations in the face of noxious political propaganda was appealing because it could serve as shared ground available for intersubjective appraisal. This appeal remains alive and well today, particularly as pernicious misinformation campaigns are again formidable in public discourse (see O’Connor and Weatherall 2019). If individuals can genuinely appraise the significance of empirical evidence and come to well-justified agreement about how the evidence bears on theorizing, then they can protect their epistemic deliberations from the undue influence of fascists and other nefarious manipulators. However, this aspiration must face subtleties arising from the social epistemology of science and from the nature of empirical results themselves. In practice, the appraisal of scientific results can often require expertise that is not readily accessible to members of the public without the relevant specialized training. Additionally, precisely because empirical results are not pure observation reports, their appraisal across communities of inquirers operating with different background assumptions can require significant epistemic work.
The logical empiricists paid little attention to the distinction between observing and experimenting and its epistemic implications. For some philosophers, to experiment is to isolate, prepare, and manipulate things in hopes of producing epistemically useful evidence. It had been customary to think of observing as noticing and attending to interesting details of things perceived under more or less natural conditions, or by extension, things perceived during the course of an experiment. To look at a berry on a vine and attend to its color and shape would be to observe it. To extract its juice and apply reagents to test for the presence of copper compounds would be to perform an experiment. By now, many philosophers have argued that contrivance and manipulation influence epistemically significant features of observable experimental results to such an extent that epistemologists ignore them at their peril. Robert Boyle (1661), John Herschel (1830), Bruno Latour and Steve Woolgar (1979), Ian Hacking (1983), Harry Collins (1985), Allan Franklin (1986), Peter Galison (1987), Jim Bogen and Jim Woodward (1988), and Hans-Jörg Rheinberger (1997) are some of the philosophers and philosophically-minded scientists, historians, and sociologists of science who gave serious consideration to the distinction between observing and experimenting. The logical empiricists tended to ignore it. Interestingly, the contemporary vantage point that attends to modeling, data processing, and empirical results may suggest a re-unification of observation and intervention under the same epistemological framework. When one no longer thinks of scientific observation as pure or direct, and recognizes the power of good modeling to account for confounds without physically intervening on the target system, the purported epistemic distinction between observation and intervention loses its bite.
2.2 The irrelevance of observation per se
Observers use magnifying glasses, microscopes, or telescopes to see things that are too small or far away to be seen, or seen clearly enough, without them. Similarly, amplification devices are used to hear faint sounds. But if to observe something is to perceive it, not every use of instruments to augment the senses qualifies as observational.
Philosophers generally agree that you can observe the moons of Jupiter with a telescope, or a heartbeat with a stethoscope. The van Fraassen of The Scientific Image is a notable exception, for whom to be ‘observable’ meant to be something that, were it present to a creature like us, would be observed. Thus, for van Fraassen, the moons of Jupiter are observable “since astronauts will no doubt be able to see them as well from close up” (1980, 16). In contrast, microscopic entities are not observable on van Fraassen’s account because creatures like us cannot strategically maneuver ourselves to see them, present before us, with our unaided senses.
Many philosophers have criticized van Fraassen’s view as overly restrictive. Nevertheless, philosophers differ in their willingness to draw the line between what counts as observable and what does not along the spectrum of increasingly complicated instrumentation. Many philosophers who don’t mind telescopes and microscopes still find it unnatural to say that high energy physicists ‘observe’ particles or particle interactions when they look at bubble chamber photographs—let alone digital visualizations of energy depositions left in calorimeters that are not themselves inspected. Their intuitions come from the plausible assumption that one can observe only what one can see by looking, hear by listening, feel by touching, and so on. Investigators can neither look at (direct their gazes toward and attend to) nor visually experience charged particles moving through a detector. Instead they can look at and see tracks in the chamber, in bubble chamber photographs, calorimeter data visualizations, etc.
In more contentious examples, some philosophers have moved to speaking of instrument-augmented empirical research as more like tool use than sensing. Hacking (1981) argues that we do not see through a microscope, but rather with it. Daston and Galison (2007) highlight the inherent interactivity of a scanning tunneling microscope, in which scientists image and manipulate atoms by exchanging electrons between the sharp tip of the microscope and the surface to be imaged (397). Others have opted to stretch the meaning of observation to accommodate what we might otherwise be tempted to call instrument-aided detections. For instance, Shapere (1982) argues that while it may initially strike philosophers as counter-intuitive, it makes perfect sense to call the detection of neutrinos from the interior of the sun “direct observation.”
The variety of views on the observable/unobservable distinction hints that empiricists may have been barking up the wrong philosophical tree. Many of the things scientists investigate do not interact with human perceptual systems as required to produce perceptual experiences of them. The methods investigators use to study such things argue against the idea, however plausible it may once have seemed, that scientists do or should rely exclusively on their perceptual systems to obtain the evidence they need. Thus Feyerabend proposed as a thought experiment that if measuring equipment was rigged up to register the magnitude of a quantity of interest, a theory could be tested just as well against its outputs as against records of human perceptions (Feyerabend 1969, 132–137). Feyerabend could have made his point with historical examples instead of thought experiments. A century earlier Helmholtz estimated the speed of excitatory impulses traveling through a motor nerve. To initiate impulses whose speed could be estimated, he implanted an electrode into one end of a nerve fiber and ran a current into it from a coil. The other end was attached to a bit of muscle whose contraction signaled the arrival of the impulse. To find out how long it took the impulse to reach the muscle he had to know when the stimulating current reached the nerve. But
[o]ur senses are not capable of directly perceiving an individual moment of time with such small duration …
and so Helmholtz had to resort to what he called ‘artificial methods of observation’ (Olesko and Holmes 1994, 84). This meant arranging things so that current from the coil could deflect a galvanometer needle. Assuming that the magnitude of the deflection is proportional to the duration of current passing from the coil, Helmholtz could use the deflection to estimate the duration he could not see (ibid.). This sense of ‘artificial observation’ is not to be confused with, e.g., using magnifying glasses or telescopes to see tiny or distant objects. Such devices enable the observer to scrutinize visible objects. The minuscule duration of the current flow is not a visible object. Helmholtz studied it by cleverly concocting circumstances so that the deflection of the needle would meaningfully convey the information he needed. Hooke (1705, 16–17) argued for and designed instruments to execute the same kind of strategy in the 17th century.
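The inferential step in Helmholtz's method can be illustrated with a minimal sketch. It assumes, as the passage states, that needle deflection is proportional to the duration of the current. The calibration procedure, the function names, and all of the numbers are hypothetical illustrations and are not drawn from Helmholtz's records.

```python
def calibrate(durations_s, deflections):
    """Least-squares slope k (through the origin) for deflection = k * duration.

    Assumes proportionality between deflection and current duration;
    that assumption is a theoretical claim the inference depends on.
    """
    numerator = sum(t * d for t, d in zip(durations_s, deflections))
    denominator = sum(t * t for t in durations_s)
    return numerator / denominator


def estimate_duration(deflection, k):
    """Invert the proportionality to recover a duration too brief to perceive."""
    return deflection / k


# Hypothetical calibration pulses of known duration (seconds) and the
# needle deflections (arbitrary scale units) they are imagined to produce.
cal_durations = [0.001, 0.002, 0.004]
cal_deflections = [2.0, 4.0, 8.0]

k = calibrate(cal_durations, cal_deflections)
unknown_duration = estimate_duration(5.0, k)  # deflection read off the dial
print(f"k = {k:.1f} units/s, estimated duration = {unknown_duration:.4f} s")
```

The point of the sketch is that what the investigator perceives is the deflection, while the quantity of interest is obtained only by way of the proportionality assumption. If that assumption were false, the estimate would be undermined whether or not the investigator had explicitly endorsed it.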
It is of interest that records of perceptual observation are not always epistemically superior to data collected via experimental equipment. Indeed, it is not unusual for investigators to use non-perceptual evidence to evaluate perceptual data and correct for its errors. For example, Rutherford and Pettersson conducted similar experiments to find out if certain elements disintegrated to emit charged particles under radioactive bombardment. To detect emissions, observers watched a scintillation screen for faint flashes produced by particle strikes. Pettersson’s assistants reported seeing flashes from silicon and certain other elements. Rutherford’s did not. Rutherford’s colleague, James Chadwick, visited Pettersson’s laboratory to evaluate his data. Instead of watching the screen and checking Pettersson’s data against what he saw, Chadwick arranged to have Pettersson’s assistants watch the screen while unbeknownst to them he manipulated the equipment, alternating normal operating conditions with a condition in which particles, if any, could not hit the screen. Pettersson’s data were discredited by the fact that his assistants reported flashes at close to the same rate in both conditions (Stuewer 1985, 284–288).
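The logic of Chadwick's check can be sketched as a comparison of reporting rates across the two conditions. The counts, the function names, and the cutoff `max_ratio` below are hypothetical and chosen only for illustration; Stuewer's account reports that the rates were close to the same, not these particular numbers.

```python
def report_rate(flash_counts, minutes):
    """Average number of reported flashes per minute."""
    return sum(flash_counts) / minutes


def reports_track_particles(rate_open, rate_blocked, max_ratio=0.2):
    """Crude check: if reports were caused by particle strikes, the rate with
    the particle path blocked should be a small fraction of the normal rate.

    max_ratio is an arbitrary illustrative cutoff, not a historical value.
    """
    if rate_open == 0:
        return False
    return rate_blocked / rate_open <= max_ratio


# Hypothetical per-minute counts reported by observers who do not know
# which condition is in force.
open_counts = [12, 9, 11]      # normal operating conditions
blocked_counts = [10, 11, 10]  # particles, if any, cannot reach the screen

rate_open = report_rate(open_counts, minutes=3)
rate_blocked = report_rate(blocked_counts, minutes=3)
verdict = reports_track_particles(rate_open, rate_blocked)
print(f"open: {rate_open:.2f}/min, blocked: {rate_blocked:.2f}/min, "
      f"reports track particles: {verdict}")
```

With rates this similar, the check fails, which mirrors the way the perceptual reports were discredited by non-perceptual evidence about the experimental arrangement.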
When the process of producing data is relatively convoluted, it is even easier to see that human sense perception is not the ultimate epistemic engine. Consider functional magnetic resonance images (fMRI) of the brain decorated with colors to indicate magnitudes of electrical activity in different regions during the performance of a cognitive task. To produce these images, brief magnetic pulses are applied to the subject’s brain. The magnetic force coordinates the precessions of protons in hemoglobin and other bodily stuffs to make them emit radio signals strong enough for the equipment to respond to. When the magnetic force is relaxed, the signals from protons in highly oxygenated hemoglobin deteriorate at a detectably different rate than signals from blood that carries less oxygen. Elaborate algorithms are applied to radio signal records to estimate blood oxygen levels at the places from which the signals are calculated to have originated. There is good reason to believe that blood flowing just downstream from spiking neurons carries appreciably more oxygen than blood in the vicinity of resting neurons. Assumptions about the relevant spatial and temporal relations are used to estimate levels of electrical activity in small regions of the brain corresponding to pixels in the finished image. The results of all of these computations are used to assign the appropriate colors to pixels in a computer generated image of the brain. In view of all of this, functional brain imaging differs, e.g., from looking and seeing, photographing, and measuring with a thermometer or a galvanometer in ways that make it uninformative to call it observation. And similarly for many other methods scientists use to produce non-perceptual evidence.
The role of the senses in fMRI data production is limited to such things as monitoring the equipment and keeping an eye on the subject. Their epistemic role is limited to discriminating the colors in the finished image, reading tables of numbers the computer used to assign them, and so on. While it is true that researchers typically use their sense of sight to take in visualizations of processed fMRI data—or numbers on a page or screen for that matter—this is not the primary locus of epistemic action. Researchers learn about brain processes through fMRI data, to the extent that they do, primarily in virtue of the suitability of the causal connection between the target processes and the data records, and of the transformations those data undergo when they are processed into the maps or other results that scientists want to use. The interesting questions are not about observability, i.e. whether neuronal activity, blood oxygen levels, proton precessions, radio signals, and so on, are properly understood as observable by creatures like us. The epistemic significance of the fMRI data depends on their delivering us the right sort of access to the target, but observation is neither necessary nor sufficient for that access.
Following Shapere (1982), one could respond by adopting an extremely permissive view of what counts as an ‘observation’ so as to allow even highly processed data to count as observations. However, it is hard to reconcile the idea that highly processed data like fMRI images record observations with the traditional empiricist notion that calculations involving theoretical assumptions and background beliefs must not be allowed (on pain of loss of objectivity) to intrude into the process of data production. Observation garnered its special epistemic status in the first place because it seemed more direct, more immediate, and therefore less distorted and muddled than (say) detection or inference. The production of fMRI images requires extensive statistical manipulation based on theories about the radio signals, and a variety of factors having to do with their detection along with beliefs about relations between blood oxygen levels and neuronal activity, sources of systematic error, and more. Insofar as the use of the term ‘observation’ connotes this extra baggage of traditional empiricism, it may be better to replace observation-talk with terminology that is more obviously permissive, such as that of ‘empirical data’ and ‘empirical results.’
2.3 Data and phenomena
Deposing observation from its traditional perch in empiricist epistemologies of science need not estrange philosophers from scientific practice. Terms like ‘observation’ and ‘observation reports’ do not occur nearly as much in scientific as in philosophical writings. In their place, working scientists tend to talk about data. Philosophers who adopt this usage are free to think about standard examples of observation as members of a large, diverse, and growing family of data production methods. Instead of trying to decide which methods to classify as observational and which things qualify as observables, philosophers can then concentrate on the epistemic influence of the factors that differentiate members of the family. In particular, they can focus their attention on what questions data produced by a given method can be used to answer, what must be done to use that data fruitfully, and the credibility of the answers they afford (Bogen 2016).
Satisfactorily answering such questions warrants further philosophical work. As Bogen and Woodward (1988) have argued, there is often a long road from obtaining a particular dataset, replete with idiosyncrasies born of unspecified causal nuances, to any claim about the phenomenon ultimately of interest to the researchers. Empirical data are typically produced in ways that make it impossible to predict them from the generalizations they are used to test, or to derive instances of those generalizations from data and non ad hoc auxiliary hypotheses. Indeed, it is unusual for many members of a set of reasonably precise quantitative data to agree with one another, let alone with a quantitative prediction. That is because precise, publicly accessible data typically cannot be produced except through processes whose results reflect the influence of causal factors that are too numerous, too different in kind, and too irregular in behavior for any single theory to account for them. When Bernard Katz recorded electrical activity in nerve fiber preparations, the numerical values of his data were influenced by factors peculiar to the operation of his galvanometers and other pieces of equipment, variations among the positions of the stimulating and recording electrodes that had to be inserted into the nerve, the physiological effects of their insertion, and changes in the condition of the nerve as it deteriorated during the course of the experiment. There were variations in the investigators’ handling of the equipment. Vibrations shook the equipment in response to a variety of irregularly occurring causes ranging from random error sources to the heavy tread of Katz’s teacher, A.V. Hill, walking up and down the stairs outside of the laboratory. That’s a short list. To make matters worse, many of these factors influenced the data as parts of irregularly occurring, transient, and shifting assemblies of causal influences.
The effects of systematic and random sources of error are typically such that considerable analysis and interpretation are required to take investigators from data sets to conclusions that can be used to evaluate theoretical claims. Interestingly, this applies as much to clear cases of perceptual data as to machine produced records. When 19th and early 20th century astronomers looked through telescopes and pushed buttons to record the time at which they saw a star pass a crosshair, the values of their data points depended, not only upon light from that star, but also upon features of perceptual processes, reaction times, and other psychological factors that varied from observer to observer. No astronomical theory has the resources to take such things into account.
Instead of testing theoretical claims by direct comparison to the data initially collected, investigators use data to infer facts about phenomena, i.e., events, regularities, processes, etc. whose instances are uniform and uncomplicated enough to make them susceptible to systematic prediction and explanation (Bogen and Woodward 1988, 317). The fact that lead melts at temperatures at or close to 327.5 °C is an example of a phenomenon, as are widespread regularities among electrical quantities involved in the action potential, the motions of astronomical bodies, etc. Theories that cannot be expected to predict or explain such things as individual temperature readings can nevertheless be evaluated on the basis of how useful they are in predicting or explaining phenomena. The same holds for the action potential as opposed to the electrical data from which its features are calculated, and the motions of astronomical bodies in contrast to the data of observational astronomy. It is reasonable to ask a genetic theory how probable it is (given similar upbringings in similar environments) that the offspring of a parent or parents diagnosed with alcohol use disorder will develop one or more symptoms the DSM classifies as indicative of alcohol use disorder. But it would be quite unreasonable to ask the genetic theory to predict or explain one patient’s numerical score on one trial of a particular diagnostic test, or why a diagnostician wrote a particular entry in her report of an interview with an offspring of one such parent (see Bogen and Woodward, 1988, 319–326).
Leonelli has challenged Bogen and Woodward’s (1988) claim that data are, as she puts it, “unavoidably embedded in one experimental context” (2009, 738). She argues that when data are suitably packaged, they can travel to new epistemic contexts and retain epistemic utility—it is not just claims about the phenomena that can travel, data travel too. Preparing data for safe travel involves work, and by tracing data ‘journeys,’ philosophers can learn about how the careful labor of researchers, data archivists, and database curators can facilitate useful data mobility. While Leonelli’s own work has often focused on data in biology, Leonelli and Tempini (2020) contains many diverse case studies of data journeys from a variety of scientific disciplines that will be of value to philosophers interested in the methodology and epistemology of science in practice.
The fact that theories typically predict and explain features of phenomena rather than idiosyncratic data should not be interpreted as a failing. For many purposes, this is the more useful and illuminating capacity. Suppose you could choose between a theory that predicted or explained the way in which neurotransmitter release relates to neuronal spiking (e.g., the fact that on average, transmitters are released roughly once for every 10 spikes) and a theory which explained or predicted the numbers displayed on the relevant experimental equipment in one, or a few single cases. For most purposes, the former theory would be preferable to the latter at the very least because it applies to so many more cases. And similarly for theories that predict or explain something about the probability of alcohol use disorder conditional on some genetic factor or a theory that predicted or explained the probability of faulty diagnoses of alcohol use disorder conditional on facts about the training that psychiatrists receive. For most purposes, these would be preferable to a theory that predicted specific descriptions in a single particular case history.
However, there are circumstances in which scientists do want to explain data. In empirical research it is often crucial to getting a useful signal that scientists deal with sources of background noise and confounding signals. This is part of the long road from newly collected data to useful empirical results. An important step on the way to eliminating unwanted noise or confounds is to determine their sources. Different sources of noise can have different characteristics that can be derived from and explained by theory. Consider the difference between ‘shot noise’ and ‘thermal noise,’ two ubiquitous sources of noise in precision electronics (Schottky 1918; Nyquist 1928; Horowitz and Hill 2015). ‘Shot noise’ arises in virtue of the discrete nature of a signal. For instance, light collected by a detector does not arrive all at once or in perfectly continuous fashion. Photons rain onto a detector shot by shot on account of being quanta. Imagine building up an image one photon at a time—at first the structure of the image is barely recognizable, but after the arrival of many photons, the image eventually fills in. In fact, the contribution of noise of this type goes as the square root of the signal. By contrast, thermal noise is due to non-zero temperature—thermal fluctuations cause a small current to flow in any circuit. If you cool your instrument (which very many precision experiments in physics do) then you can decrease thermal noise. Cooling the detector is not going to change the quantum nature of photons though. Simply collecting more photons will improve the signal to noise ratio with respect to shot noise. Thus, determining what kind of noise is affecting one’s data, i.e. explaining features of the data themselves that are idiosyncratic to the particular instruments and conditions prevailing during a specific instance of data collection, can be critical to eventually generating a dataset that can be used to answer questions about phenomena of interest. 
In using data that require statistical analysis, it is particularly clear that “empirical assumptions about the factors influencing the measurement results may be used to motivate the assumption of a particular error distribution”, which can be crucial for justifying the application of methods of analysis (Woodward 2011, 173).
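The contrast drawn above between shot noise and thermal noise can be made concrete with a short numerical sketch. It is an illustration only. The function names and the example values (the photon counts, a 10 kΩ resistor, a 10 kHz bandwidth) are assumptions introduced here, and the thermal noise expression is the standard Johnson-Nyquist formula for a resistor, not a formula taken from the passage.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant in SI units


def shot_noise_snr(mean_photon_count):
    """Signal-to-noise ratio when only counting (shot) noise is present.

    The noise goes as the square root of the signal, so SNR = N / sqrt(N).
    """
    noise = math.sqrt(mean_photon_count)
    return mean_photon_count / noise


def thermal_noise_vrms(temperature_k, resistance_ohm, bandwidth_hz):
    """Johnson-Nyquist RMS noise voltage: sqrt(4 * k_B * T * R * bandwidth)."""
    return math.sqrt(
        4.0 * BOLTZMANN_J_PER_K * temperature_k * resistance_ohm * bandwidth_hz
    )


if __name__ == "__main__":
    # Collecting more photons improves the shot-noise-limited SNR.
    for n in (100, 400, 10_000):  # hypothetical photon counts
        print(f"photons={n:>6}  shot-noise SNR={shot_noise_snr(n):.1f}")

    # Cooling reduces thermal noise; it has no effect on the function above.
    for t in (300.0, 75.0):  # hypothetical detector temperatures in kelvin
        v = thermal_noise_vrms(t, resistance_ohm=1e4, bandwidth_hz=1e4)
        print(f"T={t:>5} K  thermal noise={v:.3e} V rms")
```

On these assumptions, quadrupling the photon count doubles the shot-noise-limited signal to noise ratio, while cooling the resistor to a quarter of its temperature halves the thermal noise voltage. The two remedies act on different terms, which is why identifying the kind of noise matters.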
There are also circumstances in which scientists want to provide a substantive, detailed explanation for a particular idiosyncratic datum, and even circumstances in which procuring such explanations is epistemically imperative. Ignoring outliers without good epistemic reasons is just cherry-picking data, one of the canonical ‘questionable research practices.’ Allan Franklin has described Robert Millikan’s convenient exclusion of data he collected from observing the second oil drop in his experiments of April 16, 1912 (1986, 231). When Millikan initially recorded the data for this drop, his notebooks indicate that he was satisfied his apparatus was working properly and that the experiment was running well—he wrote “Publish” next to the data in his lab notebook. However, after he had later calculated the value for the fundamental electric charge that these data yielded, and found it aberrant with respect to the values he calculated using data collected from other good observing sessions, he changed his mind, writing “Won’t work” next to the calculation (ibid., see also Woodward 2010, 794). Millikan not only never published this result, he never published why he failed to publish it. When data are excluded from analysis, there ought to be some explanation justifying their omission over and above lack of agreement with the experimenters’ expectations. Precisely because they are outliers, some data require specific, detailed, idiosyncratic causal explanations. Indeed, it is often in virtue of those very explanations that outliers can be responsibly rejected. Some explanation of data rejected as ‘spurious’ is required. Otherwise, scientists risk biasing their own work.
Thus, while in transforming data as collected into something useful for learning about phenomena, scientists often account for features of the data such as different types of noise contributions, and sometimes even explain the odd outlying data point or artifact, they simply do not explain every individual teensy tiny causal contribution to the exact character of a data set or datum in full detail. This is because scientists cannot discover such causal minutiae, nor would invoking them be necessary for typical research questions. The fact that it may sometimes be important for scientists to provide detailed explanations of data, and not just claims about phenomena inferred from data, should not be confused with the dubious claim that scientists could ‘in principle’ detail every causal quirk that contributed to some data (Woodward 2010; 2011).
In view of all of this, together with the fact that a great many theoretical claims can only be tested directly against facts about phenomena, it behooves epistemologists to think about how data are used to answer questions about phenomena. Lacking space for a detailed discussion, the most this entry can do is to mention two main kinds of things investigators do in order to draw conclusions from data. The first is causal analysis carried out with or without the use of statistical techniques. The second is non-causal statistical analysis.
First, investigators must distinguish features of the data that are indicative of facts about the phenomenon of interest from those which can safely be ignored, and those which must be corrected for. Sometimes background knowledge makes this easy. Under normal circumstances investigators know that their thermometers are sensitive to temperature, and their pressure gauges, to pressure. An astronomer or a chemist who knows what spectrographic equipment does, and what she has applied it to, will know what her data indicate. Sometimes it is less obvious. When Santiago Ramón y Cajal looked through his microscope at a thin slice of stained nerve tissue, he had to figure out which, if any, of the fibers he could see at one focal length connected to or extended from things he could see only at another focal length, or in another slice. Analogous considerations apply to quantitative data. It was easy for Katz to tell when his equipment was responding more to Hill’s footfalls on the stairs than to the electrical quantities it was set up to measure. It can be harder to tell whether an abrupt jump in the amplitude of a high frequency EEG oscillation was due to a feature of the subject’s brain activity or an artifact of extraneous electrical activity in the laboratory or operating room where the measurements were made. The answers to questions about which features of numerical and non-numerical data are indicative of a phenomenon of interest typically depend at least in part on what is known about the causes that conspire to produce the data.
Statistical arguments are often used to deal with questions about the influence of epistemically relevant causal factors. For example, when it is known that similar data can be produced by factors that have nothing to do with the phenomenon of interest, Monte Carlo simulations, regression analyses of sample data, and a variety of other statistical techniques sometimes provide investigators with their best chance of deciding how seriously to take a putatively illuminating feature of their data.
But statistical techniques are also required for purposes other than causal analysis. To calculate the magnitude of a quantity like the melting point of lead from a scatter of numerical data, investigators throw out outliers, calculate the mean and the standard deviation, etc., and establish confidence and significance levels. Regression and other techniques are applied to the results to estimate how far from the mean the magnitude of interest can be expected to fall in the population of interest (e.g., the range of temperatures at which pure samples of lead can be expected to melt).
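The kind of calculation just described can be sketched in a few lines. Everything specific here is an assumption made for illustration: the ten temperature readings are invented, the outlier rule (a modified z-score based on the median absolute deviation, with a cutoff of 3.5) is one common convention among several, and the 95% interval uses the normal approximation factor 1.96, where a t-distribution would be more appropriate for so small a sample.

```python
import math
import statistics

# Hypothetical thermometer readings (degrees C) for melting lead.
# The last value stands in for a reading spoiled by some unknown disturbance.
READINGS = [327.1, 327.9, 327.4, 327.6, 327.3, 327.8, 327.5, 327.2, 327.7, 341.0]


def reject_outliers(values, threshold=3.5):
    """Keep values whose modified z-score (median/MAD based) is within threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread to judge outliers against
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]


def summarize(values):
    """Return mean, sample standard deviation, and an approximate 95% interval."""
    kept = reject_outliers(values)
    mean = statistics.mean(kept)
    sd = statistics.stdev(kept)
    sem = sd / math.sqrt(len(kept))  # standard error of the mean
    return {
        "n_kept": len(kept),
        "mean": mean,
        "sd": sd,
        "ci95": (mean - 1.96 * sem, mean + 1.96 * sem),
    }


if __name__ == "__main__":
    print(summarize(READINGS))
```

Note that the rule here discards the aberrant reading purely on statistical grounds. As the discussion of Millikan above suggests, a responsible analysis would also ask what caused it.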
The fact that little can be learned from data without causal, statistical, and related argumentation has interesting consequences for received ideas about how the use of observational evidence distinguishes science from pseudoscience, religion, and other non-scientific cognitive endeavors. First, scientists are not the only ones who use observational evidence to support their claims; astrologers and medical quacks use it too. To find epistemically significant differences, one must carefully consider what sorts of data they use, where the data come from, and how they are employed. The virtues of scientific as opposed to non-scientific theory evaluations depend not only on their reliance on empirical data, but also on how the data are produced, analyzed and interpreted to draw conclusions against which theories can be evaluated. Secondly, it does not take many examples to refute the notion that adherence to a single, universally applicable ‘scientific method’ differentiates the sciences from the non-sciences. Data are produced and used in far too many different ways to be treated informatively as instances of any single method. Thirdly, it is usually, if not always, impossible for investigators to draw conclusions to test theories against observational data without explicit or implicit reliance on theoretical resources.
Bokulich (2020) has helpfully outlined a taxonomy of various ways in which data can be model-laden to increase their epistemic utility. She focuses on seven categories: data conversion, data correction, data interpolation, data scaling, data fusion, data assimilation, and synthetic data. Of these categories, conversion and correction are perhaps the most familiar. Bokulich reminds us that even in the case of reading a temperature from an ordinary mercury thermometer, we are ‘converting’ the data as measured, which in this case is the height of the column of mercury, to a temperature (ibid., 795). In more complicated cases, such as processing the arrival times of acoustic signals in seismic reflection measurements to yield values for subsurface depth, data conversion may involve models (ibid.). In this example, models of the composition and geometry of the subsurface are needed in order to account for differences in the speed of sound in different materials. Data ‘correction’ involves common practices we have already discussed like modeling and mathematically subtracting background noise contributions from one’s dataset (ibid., 796). Bokulich rightly points out that involving models in these ways routinely improves the epistemic uses to which data can be put. Data interpolation, scaling, and ‘fusion’ are also relatively widespread practices that deserve further philosophical analysis. Interpolation involves filling in missing data in a patchy data set, under the guidance of models. Data are scaled when they have been generated in a particular scale (temporal, spatial, energy) and modeling assumptions are recruited to transform them to apply at another scale. Data are ‘fused,’ in Bokulich’s terminology, when data collected in diverse contexts, using diverse methods, are combined or integrated together. For instance, data from ice cores, tree rings, and the historical logbooks of sea captains may be merged into a joint climate dataset. Scientists must take care in combining data of diverse provenance, and model new uncertainties arising from the very amalgamation of datasets (ibid., 800).
Bokulich contrasts ‘synthetic data’ with what she calls ‘real data’ (ibid., 801–802). Synthetic data are virtual, or simulated data, and are not produced by physical interaction with worldly research targets. Bokulich emphasizes the role that simulated data can usefully play in testing and troubleshooting aspects of data processing that are to eventually be deployed on empirical data (ibid., 802). It can be incredibly useful for developing and stress-testing a data processing pipeline to have fake datasets whose characteristics are already known in virtue of having been produced by the researchers, and being available for their inspection at will. When the characteristics of a dataset are known, or indeed can be tailored according to need, the effects of new processing methods can be more readily traced than without. In this way, researchers can familiarize themselves with the effects of a data processing pipeline, and make adjustments to that pipeline in light of what they learn by feeding fake data through it, before attempting to use that pipeline on actual science data. Such investigations can be critical to eventually arguing for the credibility of the final empirical results and their appropriate interpretation and use.
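A toy version of this practice might look as follows. The sketch is hypothetical throughout: the 'true' signal level, the noise level, the injected spikes, and the two-step pipeline (a three-point median filter followed by averaging) are all invented for illustration and are not drawn from Bokulich's examples. The point is only that, because the researchers built the fake dataset themselves, they can check whether the pipeline recovers what they put in.

```python
import random
import statistics


def make_synthetic(n=200, true_level=5.0, noise_sd=0.5,
                   spike_at=(20, 90, 150), spike_size=50.0, seed=0):
    """Build fake data with known properties: a constant level, Gaussian
    noise, and a few large spikes standing in for instrumental artifacts."""
    rng = random.Random(seed)
    data = [true_level + rng.gauss(0.0, noise_sd) for _ in range(n)]
    for i in spike_at:
        data[i] += spike_size
    return data


def median_filter(data, half_width=1):
    """Replace each point by the median of its neighborhood (clipped at the ends)."""
    filtered = []
    for i in range(len(data)):
        window = data[max(0, i - half_width): i + half_width + 1]
        filtered.append(statistics.median(window))
    return filtered


def pipeline(data):
    """Hypothetical processing pipeline: suppress spikes, then average."""
    return statistics.mean(median_filter(data))


if __name__ == "__main__":
    fake = make_synthetic()
    print("known input level:      5.0")
    print("naive mean of raw data:", round(statistics.mean(fake), 3))
    print("pipeline estimate:     ", round(pipeline(fake), 3))
```

Comparing the pipeline's output with the known input shows directly what the spike-suppression step contributes, which would be much harder to establish on empirical data whose true characteristics are unknown.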
Data assimilation is perhaps a less widely appreciated aspect of model-based data processing among philosophers of science, excepting Parker (2016; 2017). Bokulich characterizes this method as “the optimal integration of data with dynamical model estimates to provide a more accurate ‘assimilation estimate’ of the quantity” (2020, 800). Thus, data assimilation involves balancing the contributions of empirical data and the output of models in an integrated estimate, according to the uncertainties associated with these contributions.
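The balancing described here can be illustrated in its simplest form by combining a single model estimate with a single observation, weighting each by the inverse of its variance. This is a deliberately minimal sketch under stated assumptions: the errors are taken to be independent and Gaussian, the numbers are invented, and the function name is hypothetical. Operational data assimilation, for instance in weather forecasting, applies far more elaborate, high-dimensional versions of the same idea.

```python
import math


def assimilate(model_value, model_sd, obs_value, obs_sd):
    """Inverse-variance weighted combination of a model estimate and an observation.

    Returns the combined estimate and its standard deviation. The less
    uncertain input receives the larger weight.
    """
    w_model = 1.0 / model_sd ** 2
    w_obs = 1.0 / obs_sd ** 2
    estimate = (w_model * model_value + w_obs * obs_value) / (w_model + w_obs)
    variance = 1.0 / (w_model + w_obs)
    return estimate, math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical numbers: a model forecast of 15.0 (sd 2.0) and a
    # measurement of 17.0 (sd 1.0) of the same quantity.
    estimate, sd = assimilate(15.0, 2.0, 17.0, 1.0)
    print(f"assimilation estimate: {estimate:.2f} (sd {sd:.2f})")
```

With these inputs the combined estimate lies closer to the observation, which is the more precise of the two, and its uncertainty is smaller than that of either input taken alone.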
Bokulich argues that the involvement of models in these various aspects of data processing does not necessarily lead to better epistemic outcomes. Done wrong, integrating models and data can introduce artifacts and make the processed data unreliable for the purpose at hand (ibid., 804). Indeed, she notes that “[t]here is much work for methodologically reflective scientists and philosophers of science to do in string out cases in which model-data symbiosis may be problematic or circular” (ibid.).
3. Theory and value ladenness
Empirical results are laden with values and theoretical commitments. Philosophers have raised and appraised several possible kinds of epistemic problems that could be associated with theory and/or value-laden empirical results. They have worried about the extent to which human perception itself is distorted by our commitments. They have worried that drawing upon theoretical resources from the very theory to be appraised (or its competitors) in the generation of empirical results yields vicious circularity (or inconsistency). They have also worried that contingent conceptual and/or linguistic frameworks trap bits of evidence like bees in amber so that they cannot carry on their epistemic lives outside of the contexts of their origination, and that normative values necessarily corrupt the integrity of science. Do the theory and value-ladenness of empirical results render them hopelessly parochial? That is, when scientists leave theoretical commitments behind and adopt new ones, must they also relinquish the fruits of the empirical research imbued with their prior commitments too? In this section, we discuss these worries and responses that philosophers have offered to assuage them.
If you believe that observation by human sense perception is the objective basis of all scientific knowledge, then you ought to be particularly worried about the potential for human perception to be corrupted by theoretical assumptions, wishful thinking, framing effects, and so on. Daston and Galison recount the striking example of Arthur Worthington’s symmetrical milk drops (2007, 11–16). Working in 1875, Worthington investigated the hydrodynamics of falling fluid droplets and their evolution upon impacting a hard surface. At first, he had tried to carefully track the drop dynamics with a strobe light to burn a sequence of images into his own retinas. The images he drew to record what he saw were radially symmetric, with rays of the drop splashes emanating evenly from the center of the impact. However, when Worthington transitioned from using his eyes and capacity to draw from memory to using photography in 1894, he was shocked to find that the kind of splashes he had been observing were irregular splats (ibid., 13). Even curiouser, when Worthington returned to his drawings, he found that he had indeed recorded some unsymmetrical splashes. He had evidently dismissed them as uninformative accidents instead of regarding them as revelatory of the phenomenon he was intent on studying (ibid.). In attempting to document the ideal form of the splashes, a general and regular form, he had subconsciously down-played the irregularity of individual splashes. If theoretical commitments, like Worthington’s initial commitment to the perfect symmetry of the physics he was studying, pervasively and incorrigibly dictated the results of empirical inquiry, then the epistemic aims of science would be seriously undermined.
The perceptual psychologists Bruner and Postman found that subjects who were briefly shown anomalous playing cards, e.g., a black four of hearts, reported having seen their normal counterparts, e.g., a red four of hearts. It took repeated exposures to get subjects to say the anomalous cards didn’t look right, and eventually, to describe them correctly (Kuhn 1962, 63). Kuhn took such studies to indicate that things don’t look the same to observers with different conceptual resources. (For a more up-to-date discussion of theory and conceptual perceptual loading see Lupyan 2015.) If so, black hearts didn’t look like black hearts until repeated exposures somehow allowed subjects to acquire the concept of a black heart. By analogy, Kuhn supposed, when observers working in conflicting paradigms look at the same thing, their conceptual limitations should keep them from having the same visual experiences (Kuhn 1962, 111, 113–114, 115, 120–1). This would mean, for example, that when Priestley and Lavoisier watched the same experiment, Lavoisier should have seen what accorded with his theory that combustion and respiration are oxidation processes, while Priestley’s visual experiences should have agreed with his theory that burning and respiration are processes of phlogiston release.
The example of Pettersson’s and Rutherford’s scintillation screen evidence (above) attests to the fact that observers working in different laboratories sometimes report seeing different things under similar conditions. It is plausible that their expectations influence their reports. It is plausible that their expectations are shaped by their training and by their supervisors’ and associates’ theory driven behavior. But as happens in other cases as well, all parties to the dispute agreed to reject Pettersson’s data by appealing to results that both laboratories could obtain and interpret in the same way without compromising their theoretical commitments. Indeed, it is possible for scientists to share empirical results, not just across diverse laboratory cultures, but even across serious differences in worldview. Much as they disagreed about the nature of respiration and combustion, Priestley and Lavoisier gave quantitatively similar reports of how long their mice stayed alive and their candles kept burning in closed bell jars. Priestley taught Lavoisier how to obtain what he took to be measurements of the phlogiston content of an unknown gas. A sample of the gas to be tested is run into a graduated tube filled with water and inverted over a water bath. After noting the height of the water remaining in the tube, the observer adds “nitrous air” (we call it nitric oxide) and checks the water level again. Priestley, who thought there was no such thing as oxygen, believed the change in water level indicated how much phlogiston the gas contained. Lavoisier reported observing the same water levels as Priestley even after he abandoned phlogiston theory and became convinced that changes in water level indicated free oxygen content (Conant 1957, 74–109).
A related issue is that of salience. Kuhn claimed that if Galileo and an Aristotelian physicist had watched the same pendulum experiment, they would not have looked at or attended to the same things. The Aristotelian’s paradigm would have required the experimenter to measure
… the weight of the stone, the vertical height to which it had been raised, and the time required for it to achieve rest (Kuhn 1962, 123)
and ignore radius, angular displacement, and time per swing (ibid., 124). These last were salient to Galileo because he treated pendulum swings as constrained circular motions. The Galilean quantities would be of no interest to an Aristotelian who treats the stone as falling under constraint toward the center of the earth (ibid., 123). Thus Galileo and the Aristotelian would not have collected the same data. (Absent records of Aristotelian pendulum experiments we can think of this as a thought experiment.)
Interests change, however. Scientists may eventually come to appreciate the significance of data that had not originally been salient to them in light of new presuppositions. The moral of these examples is that although paradigms or theoretical commitments sometimes have an epistemically significant influence on what observers perceive or what they attend to, it can be relatively easy to nullify or correct for their effects. When presuppositions cause epistemic damage, investigators are often able to eventually make corrections. Thus, paradigms and theoretical commitments actually do influence saliency, but their influence is neither inevitable nor irremediable.
3.2 Assuming the theory to be tested
Thomas Kuhn (1962), Norwood Hanson (1958), Paul Feyerabend (1959) and others cast suspicion on the objectivity of observational evidence in another way by arguing that one cannot use empirical evidence to test a theory without committing oneself to that very theory. This would be a problem if it led to dogmatism, but assuming the theory to be tested is often benign and even necessary.
For instance, Laymon (1988) demonstrates the manner in which the very theory that the Michelson-Morley experiments are considered to test is assumed in the experimental design, but that this does not engender deleterious epistemic effects (250). The Michelson-Morley apparatus consists of two interferometer arms at right angles to one another, which are rotated in the course of the experiment so that, on the original construal, the path length traversed by light in the apparatus would vary according to alignment with or against the Earth’s velocity (carrying the apparatus) with respect to the stationary aether. This difference in path length would show up as displacement in the interference fringes of light in the interferometer. Although Michelson’s intention had been to measure the velocity of the Earth with respect to the all-pervading aether, the experiments eventually came to be regarded as furnishing tests of the Fresnel aether theory itself. In particular, the null results of these experiments were taken as evidence against the existence of the aether. Naively, one might suppose that whatever assumptions were made in the calculation of the results of these experiments, it should not be the case that the theory under the gun was assumed nor that its negation was.
Before Michelson’s experiments, the Fresnel aether theory did not predict any sort of length contraction. Although Michelson assumed no contraction in the arms of the interferometer, Laymon argues that he could have assumed contraction, with no practical impact on the results of the experiments. The predicted fringe shift, which is calculated from the anticipated difference in the distance traveled by light in the two arms, is the same on either assumption when higher order terms are neglected. Thus, in practice, the experimenters could assume either that the contraction thesis was true or that it was false when determining the length of the arms. Either way, the results of the experiment would be the same. After Michelson’s experiments returned no evidence of the anticipated aether effects, Lorentz-Fitzgerald contraction was postulated precisely to cancel out the expected (but not found) effects and save the aether theory. Morley and Miller then set out specifically to test the contraction thesis, and still assumed no contraction in determining the length of the arms of their interferometer (ibid., 253). Thus Laymon argues that the Michelson-Morley experiments speak against the tempting assumption that “appraisal of a theory is based on phenomena which can be detected and measured without using assumptions drawn from the theory under examination or from competitors to that theory” (ibid., 246).
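One simplified way to see why the choice makes no practical difference is to compare the predicted fringe shift computed with two values for the arm length. The sketch below is not Laymon's own derivation. It uses the textbook second-order formula for the shift expected on rotating the apparatus through 90 degrees, and it applies the contraction factor only to the arm length fed into that formula, not to the calculation of the light paths themselves. The numerical values (an effective path of 11 m, a wavelength of 550 nm, and a ratio of the Earth's orbital speed to the speed of light of 1e-4) are rounded figures assumed for illustration.

```python
import math


def predicted_fringe_shift(arm_length_m, wavelength_m, beta):
    """Second-order aether-theory prediction for a 90 degree rotation:
    shift = 2 * L * beta**2 / wavelength, where beta = v / c."""
    return 2.0 * arm_length_m * beta ** 2 / wavelength_m


# Illustrative, rounded values (assumptions, see the lead-in above).
ARM_LENGTH_M = 11.0
WAVELENGTH_M = 5.5e-7
BETA = 1.0e-4

# Arm length taken at face value versus shortened by sqrt(1 - beta**2).
shift_no_contraction = predicted_fringe_shift(ARM_LENGTH_M, WAVELENGTH_M, BETA)
shift_contraction = predicted_fringe_shift(
    ARM_LENGTH_M * math.sqrt(1.0 - BETA ** 2), WAVELENGTH_M, BETA
)

if __name__ == "__main__":
    print("predicted shift, no contraction assumed:", shift_no_contraction)
    print("predicted shift, contraction assumed:   ", shift_contraction)
    print("difference:", shift_no_contraction - shift_contraction)
```

On these assumptions the two predictions differ only at fourth order in v/c, by a few billionths of a fringe, which is far below anything the apparatus could register.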
Epistemological hand-wringing about the use of the very theory to be tested in the generation of the evidence to be used for testing seems to spring primarily from a concern about vicious circularity. How can we have a genuine trial, if the theory in question has been presumed innocent from the outset? While it is true that there would be a serious epistemic problem in a case where the use of the theory to be tested conspired to guarantee that the evidence would turn out to be confirmatory, this is not always the case when theories are invoked in their own testing. Woodward (2011) summarizes a tidy case:
For example, in Millikan’s oil drop experiment, the mere fact that theoretical assumptions (e.g., that the charge of the electron is quantized and that all electrons have the same charge) play a role in motivating his measurements or a vocabulary for describing his results does not by itself show that his design and data analysis were of such a character as to guarantee that he would obtain results supporting his theoretical assumptions. His experiment was such that he might well have obtained results showing that the charge of the electron was not quantized or that there was no single stable value for this quantity. (178)
For any given case, determining whether the theoretical assumptions being made are benign or are straitjacketing the results that it will be possible to obtain will require investigating the particular relationships between the assumptions and results in that case. When data production and analysis processes are complicated, this task can get difficult. But the point is that merely noting the involvement of the theory to be tested in the generation of empirical results does not by itself imply that those results cannot be objectively useful for deciding whether the theory to be tested should be accepted or rejected.
Kuhn argued that theoretical commitments exert a strong influence on observation descriptions, and what they are understood to mean (Kuhn 1962, 127ff; Longino 1979, 38–42). If so, proponents of a caloric account of heat won’t describe or understand descriptions of observed results of heat experiments in the same way as investigators who think of heat in terms of mean kinetic energy or radiation. They might all use the same words (e.g., ‘temperature’) to report an observation without understanding them in the same way. This poses a potential problem for communicating effectively across paradigms, and similarly, for attributing the appropriate significance to empirical results generated outside of one’s own linguistic framework.
It is important to bear in mind that observers do not always use declarative sentences to report observational and experimental results. Instead, they often draw, photograph, make audio recordings, etc. or set up their experimental devices to generate graphs, pictorial images, tables of numbers, and other non-sentential records. Obviously investigators’ conceptual resources and theoretical biases can exert epistemically significant influences on what they record (or set their equipment to record), which details they include or emphasize, and which forms of representation they choose (Daston and Galison 2007, 115–190, 309–361). But disagreements about the epistemic import of a graph, picture or other non-sentential bit of data often turn on causal rather than semantical considerations. Anatomists may have to decide whether a dark spot in a micrograph was caused by a staining artifact or by light reflected from an anatomically significant structure. Physicists may wonder whether a blip in a Geiger counter record reflects the causal influence of the radiation they wanted to monitor, or a surge in ambient radiation. Chemists may worry about the purity of samples used to obtain data. Such questions are not, and are not well represented as, semantic questions to which semantic theory loading is relevant. Late 20th century philosophers may have ignored such cases and exaggerated the influence of semantic theory loading because they thought of theory testing in terms of inferential relations between observation and theoretical sentences.
Nevertheless, some empirical results are reported as declarative sentences. Looking at a patient with red spots and a fever, an investigator might report having seen the spots, or measles symptoms, or a patient with measles. Watching an unknown liquid dripping into a litmus solution, an observer might report seeing a change in color, a liquid with a pH of less than 7, or an acid. The appropriateness of a description of a test outcome depends on how the relevant concepts are operationalized. What justifies an observer in reporting a case of measles according to one operationalization might, according to another, require her to say no more than that she had observed measles symptoms, or just red spots.
In keeping with Percy Bridgman’s view that
… in general, we mean by a concept nothing more than a set of operations; the concept is synonymous with the corresponding sets of operations (Bridgman 1927, 5)
one might suppose that operationalizations are definitions or meaning rules such that it is analytically true, e.g., that every liquid that turns litmus red in a properly conducted test is acidic. But it is more faithful to actual scientific practice to think of operationalizations as defeasible rules for the application of a concept such that both the rules and their applications are subject to revision on the basis of new empirical or theoretical developments. So understood, to operationalize is to adopt verbal and related practices for the purpose of enabling scientists to do their work. Operationalizations are thus sensitive and subject to change on the basis of findings that influence their usefulness (Feest 2005).
Definitional or not, investigators in different research traditions may be trained to report their observations in conformity with conflicting operationalizations. Thus instead of training observers to describe what they see in a bubble chamber as a whitish streak or a trail, one might train them to say they see a particle track or even a particle. This may reflect what Kuhn meant by suggesting that some observers might be justified or even required to describe themselves as having seen oxygen, transparent and colorless though it is, or atoms, invisible though they are (Kuhn 1962, 127ff). To the contrary, one might object that what one sees should not be confused with what one is trained to say when one sees it, and therefore that talking about seeing a colorless gas or an invisible particle may be nothing more than a picturesque way of talking about what certain operationalizations entitle observers to say. Strictly speaking, the objection concludes, the term ‘observation report’ should be reserved for descriptions that are neutral with respect to conflicting operationalizations.
If observational data are just those utterances that meet Feyerabend’s decidability and agreeability conditions, the import of semantic theory loading depends upon how quickly, and for which sentences, reasonably sophisticated language users who stand in different paradigms can non-inferentially reach the same decisions about what to assert or deny. Some would expect enough agreement to secure the objectivity of observational data. Others would not. Still others would try to supply different standards for objectivity.
With regard to sentential observation reports, the significance of semantic theory loading is less ubiquitous than one might expect. The interpretation of verbal reports often depends on ideas about causal structure rather than the meanings of signs. Rather than worrying about the meaning of words used to describe their observations, scientists are more likely to wonder whether the observers made up or withheld information, whether one or more details were artifacts of observation conditions, whether the specimens were atypical, and so on.
Note that the worry about semantic theory loading extends beyond observation reports of the sort that occupied the logical empiricists and their close intellectual descendants. Combining results of diverse methods for making proxy measurements of paleoclimate temperatures in an epistemically responsible way requires careful attention to the variety of operationalizations at play. Even if no ‘observation reports’ are involved, the sticky question about how to usefully merge results obtained in different ways in order to satisfy one’s epistemic aims remains. Happily, the remedy for the worry about semantic loading in this broader sense is likely to be the same: investigating the provenance of those results and comparing the variety of factors that have contributed to their causal production.
Kuhn placed too much emphasis on the discontinuity between evidence generated in different paradigms. Even if we accept a broadly Kuhnian picture, according to which paradigms are heterogeneous collections of experimental practices, theoretical principles, problems selected for investigation, approaches to their solution, etc., connections between components are loose enough to allow investigators who disagree profoundly over one or more theoretical claims to nevertheless agree about how to design, execute, and record the results of their experiments. That is why neuroscientists who disagreed about whether nerve impulses consisted of electrical currents could measure the same electrical quantities, and agree on the linguistic meaning and the accuracy of observation reports including such terms as ‘potential’, ‘resistance’, ‘voltage’ and ‘current’. As we discussed above, the success that scientists have in repurposing results generated by others for different purposes speaks against the confinement of evidence to its native paradigm. Even when scientists working with radically different core theoretical commitments cannot make the same measurements themselves, with enough contextual information about how each conducts research, it can be possible to construct bridges that span the theoretical divides.
One could worry that the intertwining of the theoretical and empirical would open the floodgates to bias in science. Human cognizing, both historical and present day, is replete with disturbing commitments including intolerance and narrow mindedness of many sorts. If such commitments are integral to a theoretical framework, or endemic to the reasoning of a scientist or scientific community, then they threaten to corrupt the epistemic utility of empirical results generated using their resources. The core impetus of the ‘value-free ideal’ is to maintain a safe distance between the appraisal of scientific theories according to the evidence on one hand, and the swarm of moral, political, social, and economic values on the other. While proponents of the value-free ideal might admit that the motivation to pursue a theory or the legal protection of human subjects in permissible experimental methods involve non-epistemic values, they would contend that such values ought not enter into the constitution of empirical results themselves, nor the adjudication or justification of scientific theorizing in light of the evidence (see Intemann 2021, 202).
As a matter of fact, values do enter into science at a variety of stages. Above we saw that ‘theory-ladenness’ could refer to the involvement of theory in perception, in semantics, and in a kind of circularity that some have worried begets unfalsifiability and thereby dogmatism. Like theory-ladenness, values can and sometimes do affect judgments about the salience of certain evidence and the conceptual framing of data. Indeed, on a permissive construal of the nature of theories, values can simply be understood as part of a theoretical framework. Intemann (2021) highlights a striking example from medical research where key conceptual resources include notions like ‘harm,’ ‘risk,’ ‘health benefit,’ and ‘safety.’ She refers to research on the comparative safety of giving birth at home and giving birth at a hospital for low-risk parents in the United States. Studies reporting that home births are less safe typically attend to infant and birthing parent mortality rates—which are low for these subjects whether at home or in hospital—but leave out of consideration rates of c-section and episiotomy, which are both relatively high in hospital settings. Thus, a value-laden decision about whether a possible outcome counts as a harm worth considering can influence the outcome of the study—in this case tipping the balance towards the conclusion that hospital births are more safe (ibid., 206).
Note that the birth safety case differs from the sort of cases at issue in the philosophical debate about risk and thresholds for acceptance and rejection of hypotheses. In accepting an hypothesis, a person makes a judgement that the risk of being mistaken is sufficiently low (Rudner 1953). When the consequences of being wrong are deemed grave, the threshold for acceptance may be correspondingly high. Thus, in evaluating the epistemic status of an hypothesis in light of the evidence, a person may have to make a value-based judgement. However, in the birth safety case, the judgement comes into play at an earlier stage, well before the decision to accept or reject the hypothesis is to be made. The judgement occurs already in deciding what is to count as a ‘harm’ worth considering for the purposes of this research.
The fact that values do sometimes enter into scientific reasoning does not by itself settle the question of whether it would be better if they did not. In order to assess the normative proposal, philosophers of science have attempted to disambiguate the various ways in which values might be thought to enter into science, and the various referents that get crammed under the single heading of ‘values.’ Anderson (2004) articulates eight stages of scientific research where values (‘evaluative presuppositions’) might be employed in epistemically fruitful ways. In paraphrase: 1) orientation in a field, 2) framing a research question, 3) conceptualizing the target, 4) identifying relevant data, 5) data generation, 6) data analysis, 7) deciding when to cease data analysis, and 8) drawing conclusions (Anderson 2004, 11). Similarly, Intemann (2021) lays out five ways “that values play a role in scientific reasoning” with which feminist philosophers of science have engaged in particular:
(1) the framing [of] research problems, (2) observing phenomena and describing data, (3) reasoning about value-laden concepts and assessing risks, (4) adopting particular models, and (5) collecting and interpreting evidence. (208)
Ward (2021) presents a streamlined and general taxonomy of four ways in which values relate to choices: as reasons motivating or justifying choices, as causal effectors of choices, or as goods affected by choices. By investigating the role of values in these particular stages or aspects of research, philosophers of science can offer higher resolution insights than just the observation that values are involved in science at all and untangle crosstalk.
Similarly, fine points can be made about the nature of values involved in these various contexts. Such clarification is likely important for determining whether the contribution of certain values in a given context is deleterious or salutary, and in what sense. Douglas (2013) argues that the ‘value’ of internal consistency of a theory and of the empirical adequacy of a theory with respect to the available evidence are minimal criteria for any viable scientific theory (799–800). She contrasts these with the sort of values that Kuhn called ‘virtues,’ i.e. scope, simplicity, and explanatory power that are properties of theories themselves, and unification, novel prediction and precision, which are properties a theory has in relation to a body of evidence (800–801). These are the sort of values that may be relevant to explaining and justifying choices that scientists make to pursue/abandon or accept/reject particular theories. Moreover, Douglas (2000) argues that what she calls “non-epistemic values” (in particular, ethical value judgements) also enter into decisions at various stages “internal” to scientific reasoning, such as data collection and interpretation (565). Consider a laboratory toxicology study in which animals exposed to dioxins are compared to unexposed controls. Douglas discusses researchers who want to determine the threshold for safe exposure. Admitting false positives can be expected to lead to overregulation of the chemical industry, while false negatives yield underregulation and thus pose greater risk to public health. The decision about where to set the unsafe exposure threshold, that is, set the threshold for a statistically significant difference between experimental and control animal populations, involves balancing the acceptability of these two types of errors. 
According to Douglas, this balancing act will depend on “whether we are more concerned about protecting public health from dioxin pollution or whether we are more concerned about protecting industries that produce dioxins from increased regulation” (ibid., 568). That scientists do as a matter of fact sometimes make such decisions is clear. They judge, for instance, a specimen slide of a rat liver to be tumorous or not, and whether borderline cases should count as benign or malignant (ibid., 569–572). Moreover, in such cases, it is not clear that the responsibility of making such decisions could be offloaded to non-scientists.
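The trade-off Douglas describes can be made concrete with a small numerical sketch. It is not drawn from her paper: it assumes a one-sided test on a standardized test statistic, and the names `error_rates`, `cutoff`, and `true_effect`, as well as the effect size of 2.0, are hypothetical choices made only for illustration. The point is simply that moving the significance cutoff lowers one error rate at the cost of raising the other, so choosing the cutoff amounts to weighing the two kinds of error.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
STANDARD_NORMAL = NormalDist()


def error_rates(cutoff, true_effect):
    """Return (false_positive_rate, false_negative_rate) for a one-sided test.

    Assumption (illustrative only): the test statistic is standard normal
    when exposure has no effect, and normal with mean `true_effect` when
    exposure is harmful. The substance is declared harmful when the
    statistic exceeds `cutoff`.
    """
    # No real effect, but the statistic still exceeds the cutoff.
    false_positive = 1.0 - STANDARD_NORMAL.cdf(cutoff)
    # Real effect, but the statistic falls below the cutoff.
    false_negative = STANDARD_NORMAL.cdf(cutoff - true_effect)
    return false_positive, false_negative


# A stricter cutoff means fewer false positives and more false negatives.
for cutoff in (1.0, 1.645, 2.326):
    fp, fn = error_rates(cutoff, true_effect=2.0)
    print(f"cutoff={cutoff:.3f}  false positives={fp:.3f}  false negatives={fn:.3f}")
```

Nothing in the arithmetic says which cutoff is correct. On Douglas's account, that choice is where the judgement about public health versus regulatory burden enters.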
Many philosophers accept that values can contribute to the generation of empirical results without spoiling their epistemic utility. Anderson’s (2004) diagnosis is as follows:
Deep down, what the objectors find worrisome about allowing value judgments to guide scientific inquiry is not that they have evaluative content, but that these judgments might be held dogmatically, so as to preclude the recognition of evidence that might undermine them. We need to ensure that value judgements do not operate to drive inquiry to a predetermined conclusion. This is our fundamental criterion for distinguishing legitimate from illegitimate uses of values in science. (11)
Data production (including experimental design and execution) is heavily influenced by investigators’ background assumptions. Sometimes these include theoretical commitments that lead experimentalists to produce non-illuminating or misleading evidence. In other cases they may lead experimentalists to ignore, or even fail to produce useful evidence. For example, in order to obtain data on orgasms in female stumptail macaques, one researcher wired up females to produce radio records of orgasmic muscle contractions, heart rate increases, etc. But as Elisabeth Lloyd reports, “… the researcher … wired up the heart rate of the male macaques as the signal to start recording the female orgasms. When I pointed out that the vast majority of female stumptail orgasms occurred during sex among the females alone, he replied that yes he knew that, but he was only interested in important orgasms” (Lloyd 1993, 142). Although female stumptail orgasms occurring during sex with males are atypical, the experimental design was driven by the assumption that what makes features of female sexuality worth studying is their contribution to reproduction (ibid., 139). This assumption influenced experimental design in such a way as to preclude learning about the full range of female stumptail orgasms.
Anderson (2004) presents an influential analysis of the role of values in research on divorce. Researchers committed to an interpretive framework rooted in ‘traditional family values’ could conduct research on the assumption that divorce is mostly bad for spouses and any children that they have (ibid., 12). This background assumption, which is rooted in a normative appraisal of a certain model of good family life, could lead social science researchers to restrict the questions with which they survey their research subjects to ones about the negative impacts of divorce on their lives, thereby curtailing the possibility of discovering ways that divorce may have actually made the ex-spouses’ lives better (ibid., 13). This is an example of an epistemically detrimental influence that values can have on the nature of the results that research ultimately yields. In this case, the values in play biased the research outcomes to preclude recognition of countervailing evidence. Anderson argues that the problematic influence of values comes when research “is rigged in advance” to confirm certain hypotheses—when the influence of values amounts to incorrigible dogmatism (ibid., 19). “Dogmatism” in her sense is unfalsifiability in practice, “their stubbornness in the face of any conceivable evidence” (ibid., 22).
Fortunately, such dogmatism is not ubiquitous and when it occurs it can often be corrected eventually. Above we noted that the mere involvement of the theory to be tested in the generation of an empirical result does not automatically yield vicious circularity—it depends on how the theory is involved. Furthermore, even if the assumptions initially made in the generation of empirical results are incorrect, future scientists will have opportunities to reassess those assumptions in light of new information and techniques. Thus, as long as scientists continue their work there need be no time at which the epistemic value of an empirical result can be established once and for all. This should come as no surprise to anyone who is aware that science is fallible, but it is no grounds for skepticism. It can be perfectly reasonable to trust the evidence available at present even though it is logically possible for epistemic troubles to arise in the future. A similar point can be made regarding values (although cf. Yap 2016).
Moreover, while the inclusion of values in the generation of an empirical result can sometimes be epistemically bad, values properly deployed can also be harmless, or even epistemically helpful. As in the cases of research on female stumptail macaque orgasms and the effects of divorce, certain values can sometimes serve to illuminate the way in which other epistemically problematic assumptions have hindered potential scientific insight. By valuing knowledge about female sexuality beyond its role in reproduction, scientists can recognize the narrowness of an approach that only conceives of female sexuality insofar as it relates to reproduction. By questioning the absolute value of one traditional ideal for flourishing families, researchers can garner evidence that might end up destabilizing the empirical foundation supporting that ideal.
Empirical results are most obviously put to epistemic work in their contexts of origin. Scientists conceive of empirical research, collect and analyze the relevant data, and then bring the results to bear on the theoretical issues that inspired the research in the first place. However, philosophers have also discussed ways in which empirical results are transferred out of their native contexts and applied in diverse and sometimes unexpected ways (see Leonelli and Tempini 2020). Cases of reuse, or repurposing of empirical results in different epistemic contexts raise several interesting issues for philosophers of science. For one, such cases challenge the assumption that theory (and value) ladenness confines the epistemic utility of empirical results to a particular conceptual framework. Ancient Babylonian eclipse records inscribed on cuneiform tablets have been used to generate constraints on contemporary geophysical theorizing about the causes of the lengthening of the day on Earth (Stephenson, Morrison, and Hohenkerk 2016). This is surprising since the ancient observations were originally recorded for the purpose of making astrological prognostications. Nevertheless, with enough background information, the records as inscribed can be translated, the layers of assumptions baked into their presentation peeled back, and the results repurposed using resources of the contemporary epistemic context, the likes of which the Babylonians could have hardly dreamed.
Furthermore, the potential for reuse and repurposing feeds back on the methodological norms of data production and handling. In light of the difficulty of reusing or repurposing data without sufficient background information about the original context, Goodman et al. (2014) note that “data reuse is most possible when: 1) data; 2) metadata (information describing the data); and 3) information about the process of generating those data, such as code, are all provided” (3). Indeed, they advocate for sharing data and code in addition to results customarily published in science. As we have seen, the loading of data with theory is usually necessary to putting that data to any serious epistemic use: theory-loading makes theory appraisal possible. Philosophers have begun to appreciate that this epistemic boon does not necessarily come at the cost of rendering data “tragically local” (Wylie 2020, 285, quoting Latour 1999). But it is important to note that the useful travel of data between contexts is significantly aided by foresight, curation, and management for that aim.
In light of the mediated nature of empirical results, Boyd (2018) argues for an “enriched view of evidence,” in which the evidence that serves as the ‘tribunal of experience’ is understood to be “lines of evidence” composed of the products of data collection and all of the products of their transformation on the way to the generation of empirical results that are ultimately compared to theoretical predictions, considered together with metadata associated with their provenance. Such metadata includes information about theoretical assumptions that are made in data collection, processing, and the presentation of empirical results. Boyd argues that by appealing to metadata to ‘rewind’ the processing of assumption-imbued empirical results and then by re-processing them using new resources, the epistemic utility of empirical evidence can survive transitions to new contexts. Thus, the enriched view of evidence supports the idea that it is not despite the intertwining of the theoretical and empirical that scientists accomplish key epistemic aims, but often in virtue of it (ibid., 420). In addition, it makes the epistemic value of metadata encoding the various assumptions that have been made throughout the course of data collection and processing explicit.
The desirability of explicitly furnishing empirical data and results with auxiliary information that allows them to travel can be appreciated in light of the ‘objectivity’ norm, construed as accessibility to interpersonal scrutiny. When data are repurposed in novel contexts, they are not only shared between subjects, but can in some cases be shared across radically different paradigms with incompatible theoretical commitments.
4. The epistemic value of empirical evidence
One of the important applications of empirical evidence is its use in assessing the epistemic status of scientific theories. In this section we briefly discuss philosophical work on the role of empirical evidence in confirmation/falsification of scientific theories, ‘saving the phenomena,’ and in appraising the empirical adequacy of theories. However, further philosophical work ought to explore the variety of ways that empirical results bear on the epistemic status of theories and theorizing in scientific practice beyond these.
It is natural to think that computability, range of application, and other things being equal, true theories are better than false ones, good approximations are better than bad ones, and highly probable theoretical claims are better than less probable ones. One way to decide whether a theory or a theoretical claim is true, close to the truth, or acceptably probable is to derive predictions from it and use empirical data to evaluate them. Hypothetico-Deductive (HD) confirmation theorists proposed that empirical evidence argues for the truth of theories whose deductive consequences it verifies, and against those whose consequences it falsifies (Popper 1959, 32–34). But laws and theoretical generalizations seldom if ever entail observational predictions unless they are conjoined with one or more auxiliary hypotheses taken from the theory they belong to. When the prediction turns out to be false, HD has trouble explaining which of the conjuncts is to blame. Conversely, if a theory entails a true prediction, it will continue to do so in conjunction with arbitrarily selected irrelevant claims, and HD has trouble explaining why the prediction does not confirm the irrelevancies along with the theory of interest.
Another approach to confirmation by empirical evidence is Inference to the Best Explanation (IBE). The idea is roughly that an explanation of the evidence that exhibits certain desirable characteristics with respect to a family of candidate explanations is likely to be the true one (Lipton 1991). On this approach, it is in virtue of their successful explanation of the empirical evidence that theoretical claims are supported. Naturally, IBE advocates face the challenges of defending a suitable characterization of what counts as the ‘best’ and of justifying the limited pool of candidate explanations considered (Stanford 2006).
Bayesian approaches to scientific confirmation have garnered significant attention and are now widespread in philosophy of science. Bayesians hold that the evidential bearing of empirical evidence on a theoretical claim is to be understood in terms of likelihood or conditional probability. For example, whether empirical evidence argues for a theoretical claim might be thought to depend upon whether it is more probable (and if so how much more probable) than its denial conditional on a description of the evidence together with background beliefs, including theoretical commitments. But by Bayes’ Theorem, the posterior probability of the claim of interest (that is, its probability given the evidence) is proportional to that claim’s prior probability. How to justify the choice of these prior probability assignments is one of the most notorious points of contention arising for Bayesians. If one makes the assignment of priors a subjective matter decided by epistemic agents, then it is not clear that they can be justified. Once again, one’s use of evidence to evaluate a theory depends in part upon one’s theoretical commitments (Earman 1992, 33–86; Roush 2005, 149–186). If one instead appeals to chains of successive updating using Bayes’ Theorem based on past evidence, one has to invoke assumptions that generally do not obtain in actual scientific reasoning. For instance, to ‘wash out’ the influence of priors a limit theorem is invoked wherein we consider very many updating iterations, but much scientific reasoning of interest does not happen in the limit, and so in practice priors hold unjustified sway (Norton 2021, 33).
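The dependence of posteriors on priors, and the sense in which priors ‘wash out’ only after many updates, can be illustrated with a minimal sketch. It assumes a single binary hypothesis and a stream of identical, independent pieces of evidence, each somewhat more probable if the hypothesis is true than if it is false. The function names, the two starting priors, and the likelihood values 0.6 and 0.4 are hypothetical choices for illustration and are not taken from any of the works cited.

```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """Apply Bayes' theorem once for a binary hypothesis H.

    P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not-H) P(not-H)]
    """
    numerator = p_e_given_h * prior
    total_evidence = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / total_evidence


def posterior_after(prior, n_updates, p_e_given_h=0.6, p_e_given_not_h=0.4):
    """Posterior after n_updates identical, independent pieces of evidence.

    The likelihoods are illustrative assumptions, not empirical values.
    """
    probability = prior
    for _ in range(n_updates):
        probability = update(probability, p_e_given_h, p_e_given_not_h)
    return probability


# Two agents who see exactly the same evidence but start from different priors.
skeptic_prior, enthusiast_prior = 0.05, 0.95
for n in (0, 3, 30):
    skeptic = posterior_after(skeptic_prior, n)
    enthusiast = posterior_after(enthusiast_prior, n)
    print(f"updates={n:2d}  skeptic={skeptic:.4f}  enthusiast={enthusiast:.4f}")
```

Under these assumptions the two agents remain far apart after a handful of updates and come close to agreement only after many. This is one way of picturing the worry that, short of the limit, the choice of priors continues to shape the verdict.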
Rather than attempting to cast all instances of confirmation based on empirical evidence as belonging to a universal schema, a better approach may be to ‘go local’. Norton’s material theory of induction holds that inductive support arises from background knowledge, that is, from material facts that are domain specific. Norton argues that, for instance, the induction from “Some samples of the element bismuth melt at 271°C” to “all samples of the element bismuth melt at 271°C” is admissible not in virtue of some universal schema that carries us from ‘some’ to ‘all’ but in virtue of matters of fact (Norton 2003). In this particular case, the fact that licenses the induction is a fact about elements: “their samples are generally uniform in their physical properties” (ibid., 650). This is a fact pertinent to chemical elements, but not to samples of material like wax (ibid.). Thus Norton repeatedly emphasizes that “all induction is local”.
Still, there are those who may be skeptical about the very possibility of confirmation or of successful induction. Insofar as the bearing of evidence on theory is never totally decisive, and insofar as there is no single trusty universal schema that captures empirical support, perhaps the relationship between empirical evidence and scientific theory is not really about support after all. Giving up on empirical support would not automatically mean abandoning any epistemic value for empirical evidence. Rather than confirm theory, the epistemic role of evidence could be to constrain, for example by furnishing phenomena for theory to systematize or to adequately model.
4.2 Saving the phenomena
Theories are said to ‘save’ observable phenomena if they satisfactorily predict, describe, or systematize them. How well a theory performs any of these tasks need not depend upon the truth or accuracy of its basic principles. Thus according to Osiander’s preface to Copernicus’ On the Revolutions, a locus classicus, astronomers “… cannot in any way attain to true causes” of the regularities among observable astronomical events, and must content themselves with saving the phenomena in the sense of using
… whatever suppositions enable … [them] to be computed correctly from the principles of geometry for the future as well as the past … (Osiander 1543, XX)
Theorists are to use those assumptions as calculating tools without committing themselves to their truth. In particular, the assumption that the planets revolve around the sun must be evaluated solely in terms of how useful it is in calculating their observable relative positions to a satisfactory approximation. Pierre Duhem’s Aim and Structure of Physical Theory articulates a related conception. For Duhem a physical theory
… is a system of mathematical propositions, deduced from a small number of principles, which aim to represent as simply and completely, and exactly as possible, a set of experimental laws. (Duhem 1906, 19)
‘Experimental laws’ are general, mathematical descriptions of observable experimental results. Investigators produce them by performing measuring and other experimental operations and assigning symbols to perceptible results according to pre-established operational definitions (Duhem 1906, 19). For Duhem, the main function of a physical theory is to help us store and retrieve information about observables we would not otherwise be able to keep track of. If that is what a theory is supposed to accomplish, its main virtue should be intellectual economy. Theorists are to replace reports of individual observations with experimental laws and devise higher level laws (the fewer, the better) from which experimental laws (the more, the better) can be mathematically derived (Duhem 1906, 21ff).
A theory’s experimental laws can be tested for accuracy and comprehensiveness by comparing them to observational data. Let EL be one or more experimental laws that perform acceptably well on such tests. Higher level laws can then be evaluated on the basis of how well they integrate EL into the rest of the theory. Some data that don’t fit integrated experimental laws won’t be interesting enough to worry about. Other data may need to be accommodated by replacing or modifying one or more experimental laws or adding new ones. If the required additions, modifications or replacements deliver experimental laws that are harder to integrate, the data count against the theory. If the required changes are conducive to improved systematization the data count in favor of it. If the required changes make no difference, the data don’t argue for or against the theory.
4.3 Empirical adequacy
On van Fraassen’s (1980) semantic account, a theory is empirically adequate when the empirical structure of at least one model of that theory is isomorphic to what he calls the “appearances” (45). In other words, when the theory “has at least one model that all the actual phenomena fit inside” (12). Thus, for van Fraassen, we continually check the empirical adequacy of our theories by seeing if they have the structural resources to accommodate new observations. We’ll never know that a given theory is totally empirically adequate, since for van Fraassen, empirical adequacy obtains with respect to all that is observable in principle to creatures like us, not all that has already been observed (69).
The primary appeal of dealing in empirical adequacy rather than confirmation is its appropriate epistemic humility. Instead of claiming that confirming evidence justifies belief (or boosted confidence) that a theory is true, one is restricted to saying that the theory continues to be consistent with the evidence as far as we can tell so far. However, if the epistemic utility of empirical results in appraising the status of theories is just to judge their empirical adequacy, then it may be difficult to account for the difference between adequate but unrealistic theories, and those equally adequate theories that ought to be taken seriously as representations. Appealing to extra-empirical virtues like parsimony may be a way out, but one that will not appeal to philosophers skeptical of the connection thereby supposed between such virtues and representational fidelity.
On an earlier way of thinking, observation was to serve as the unmediated foundation of science—direct access to the facts upon which the edifice of scientific knowledge could be built. When conflict arose between factions with different ideological commitments, observations could furnish the material for neutral arbitration and settle the matter objectively, in virtue of being independent of non-empirical commitments. According to this view, scientists working in different paradigms could at least appeal to the same observations, and propagandists could be held accountable to the publicly accessible content of theory and value-free observations. Despite their different theories, Priestley and Lavoisier could find shared ground in the observations. Anti-Semites would be compelled to admit the success of a theory authored by a Jewish physicist, in virtue of the unassailable facts revealed by observation.
This version of empiricism with respect to science does not accord well with the fact that observation per se plays a relatively small role in many actual scientific methodologies, and the fact that even the most ‘raw’ data are often already theoretically imbued. The strict contrast between theory and observation in science is more fruitfully supplanted by inquiry into the relationship between theorizing and empirical results.
Contemporary philosophers of science tend to embrace the theory ladenness of empirical results. Instead of seeing the integration of the theoretical and the empirical as an impediment to furthering scientific knowledge, they see it as necessary. A ‘view from nowhere’ would not bear on our particular theories. That is, it is impossible to put empirical results to use without recruiting some theoretical resources. In order to use an empirical result to constrain or test a theory it has to be processed into a form that can be compared to that theory. To get stellar spectrograms to bear on Newtonian or relativistic cosmology, they need to be processed—into galactic rotation curves, say. The spectrograms by themselves are just artifacts, pieces of paper. Scientists need theoretical resources in order to even identify that such artifacts bear information relevant for their purposes, and certainly to put them to any epistemic use in assessing theories.
This outlook does not render contemporary philosophers of science all constructivists, however. Theory mediates the connection between the target of inquiry and the scientific worldview; it does not sever it. Moreover, vigilance is still required to ensure that the particular ways in which theory is ‘involved’ in the production of empirical results are not epistemically detrimental. Theory can be deployed in experiment design, data processing, and presentation of results in unproductive ways, for instance by determining whether the results will speak for or against a particular theory regardless of what the world is like. Critical appraisal of the roles of theory is thus important for genuine learning about nature through science. Indeed, it seems that extra-empirical values can sometimes assist such critical appraisal. Instead of viewing observation as theory-free, and for that reason as furnishing the content with which to appraise theories, we might attend to the choices and mistakes that can be made in collecting and generating empirical results with the help of theoretical resources, and endeavor to make choices conducive to learning and to correct mistakes as we discover them.
Recognizing the involvement of theory and values in the constitution and generation of empirical results does not undermine the special epistemic value of empirical science in contrast to propaganda and pseudoscience. In cases where the influence of cultural, political, and religious values hinders scientific inquiry, it often does so by limiting or determining the nature of the empirical results. Yet by working to make the assumptions that shape results explicit, we can examine their suitability for our purposes and attempt to restructure inquiry as necessary. When disagreements arise, scientists can attempt to settle them by appealing to the causal connections between the research target and the empirical data. The tribunal of experience speaks through empirical results, but it does so only through careful fashioning with theoretical resources.