
Hypothesis

The philosophical foundations of science are not often addressed by basic scientists; yet they were the bedrock of scientific exploration long before the modern forms of research we are familiar with. This text is a discourse on the formation of hypotheses, along with some historical context for the dichotomous approach to questioning particular theories. I am not the only researcher who has asked, "what is the best approach to forming a hypothesis?", and what follows is a discussion of exactly that question. I emphasize five core features that I view as criteria for well-formed hypotheses.


Method

Testing a well-founded hypothesis is no easy task, and strategic approaches to experimental design vary a great deal. Careful planning depends on an adequate understanding of the nature of variables, sample analysis, and validating controls. The appropriate design of an experiment depends in many ways on the research question asked, but there are common features of experimental design with widely applicable value. I emphasize core features that are indicative of an optimal experimental method.


 

On Hypothesis Formation

A comparison of dialectic approaches in classical philosophy and modern approaches to scientific research reveals a consistent theme of dueling opposition as the leading approach to understanding science. A common method of experimental design hinges upon testing hypotheses by building and testing distinct, largely mutually exclusive dualistic models to produce a binary output. For example, the ancient Greek philosopher Democritus was a proponent of the atom hypothesis, which put forth the idea that all matter is composed of physically separate atoms, implying an underlying mechanistic explanation of the nature of reality. In direct opposition to this idea was Aristotle's belief that all matter consisted of four core elements (earth, air, water, and fire). Both of these ancient Greek approaches to understanding matter incorporated a reductionist view, breaking matter down into its core components, but they are ultimately at odds with one another regarding the details of these components, and thus they could be described as in dueling opposition.

Modern science has emerged from this classical tradition of diametric opposition, and the generation of testable hypotheses has largely followed a similar format. While the presence of technology influences the results of scientific experimentation, it does not appear to influence the philosophical approach to discovery. For example, Aristotle emphasized that "credit must be given to observation rather than to theories, and to theories only if what they affirm agrees with the observed facts" (Adler 769). Yet at that point in history it was not possible to observe or quantify the atomic interactions that ultimately validated Democritus' theories. A similar impasse was reached thousands of years later, when particle physicists debated the merits of the Standard Model versus Higgsless models to explain the forces governing elementary subatomic particles. Despite substantial advances in technology, the modern philosophical approach to scientific research is highly reminiscent of the dialectic method of the ancient Greeks: one of pitting diametrically opposed ideas into direct competition, to be resolved by emerging evidence.

In this way, the modern philosophical approach to science has roots in competitive dualism as the predominant method of obtaining scientific progress. Indeed, during my own undergraduate education, most faculty sought to simplify complex scientific concepts into generalizations of two dueling hypotheses in order to understand them. As the French physicist Poincaré pointed out, "every generalization is a hypothesis" (Adler 329). For example, the nature versus nurture debate regarding adolescent psychological development seeks to attribute the major influences on human behavior to either experience-dependent learning or innate inherited predisposition. The broader scientific community would likely concede that both categories influence behavioral psychology in different ways in different contexts. However, such oversimplifications persist within scientific research because of their utility in the early educational training of scientists. Scientific research which relies on the formation of dueling hypotheses as a way of understanding the underlying truth is often subject to oversimplification and generalization, which inherently lacks a full contextual perspective.

The classical approach of dueling opposition attempts to favor utility of outcome by inherently limiting the number of possible conclusions of a proposed experimental design. As Thomas Kuhn puts it, for any given problem, "There must also be rules that limit both the nature of acceptable solutions and the steps by which they are to be obtained" (Kuhn 38). When one designs an experiment anticipating a binary outcome, the results are expected either to support the hypothesis and thereby the model, or to refute the hypothesis and thereby the model, but this is not necessarily the case in real-world experimentation. For example, Camillo Golgi proposed that the entire nervous system is one continuous network of fibers, while his detractor Ramón y Cajal supported the neuron doctrine, in which cells are individual units. Careful dissection and microscopic assessment produced evidence to support both hypotheses. Neither Golgi nor Cajal produced results to support one standalone model, and as a result they shared the 1906 Nobel Prize in Physiology or Medicine, much to Golgi's chagrin. Dual opposition as an experimental design does not ensure binary results that confirm a specific scientific theory or hypothesis.

Furthermore, even binary results within scientific experiments do not necessarily affirm the underlying model or bring one closer to a natural law. Binary results which confirm or deny a specific hypothesis are limited by the scope of the knowledge underlying the formation of a given model. For example, the Danish astronomer Tycho Brahe produced a geocentric model of the solar system from careful astronomical observation, while Galileo Galilei drew on much the same body of Copernican-era observation to support a heliocentric model in which the earth orbits the sun, thereby producing a separate model from similar observations. Another example is the false belief that stomach ulcers were predominantly caused by stress, based on correlative data from patient interviews. The 2005 Nobel Prize in Physiology or Medicine was awarded to Barry Marshall and Robin Warren for identifying the bacterium Helicobacter pylori as the underlying cause of stomach ulcers (Warren and Marshall 1983). Both examples illustrate how results and data can be interpreted in favor of an incorrect scientific model. A dualistic approach to experimental design does not necessarily ensure a definitive binary output of results in support of a more correct scientific model.

No direct relationship between forming dueling hypotheses and obtaining increased validity of scientific results has been demonstrated thus far. Admittedly, the dominance of the dualistic approach throughout history has prevented this comparison, but there are also examples of successful scientific theories developed in domains without competing models. Perhaps the most famous example of scientific theory as an emergent phenomenon comes from Charles Darwin's famous expedition on the Beagle. Darwin writes, "Natural selection acts only by the preservation and accumulation of small inherited modifications, each profitable to the preserved being; and as modern geology has almost banished such views as the excavation of a great valley by a single diluvial wave, so will natural selection banish the belief of the continued creation of new organic beings, or of any great and sudden modification of their structure" (Darwin 1859). At the time, the theory of natural selection was formed from a vast array of observations from many different scientists as an observed consensus, rather than as a model specifically designed to compete with an accepted model. In modern times, the argument has morphed into a debate between evolution and creationism regarding the origin of man; however, the original theory of natural selection was far less entrenched in disproving a creationist model. Although modern theories, like the theory of natural selection, necessarily contradict or call into question common assumptions within the scientific community, the hypothesis was not generated specifically to contradict an existing model.

An alternative to addressing dueling models by hypothesis is an approach which emphasizes the concatenation of unbiased observations to inform experimental design. Although it may resemble hypothesis formation based on testing dominant models, this view is less restrictive in scope and is often referred to as "hypothesis-free" science or "unbiased" experimentation. An excellent example was the Human Genome Project, a massive collaborative international effort bridging the private sector and government-funded labs to gain insight into the genetic sequence of Homo sapiens. The rippling effects of these discoveries and observations have provided tremendous insight into various forms of cancer as well as a diverse array of rare diseases. However, these discoveries were not based on ruling out one model versus its opposite. Michael Faraday described the value of forming hypotheses from observations rather than models when he wrote, "As an experimentalist, I feel bound to let experiment guide me into any train of thought which it may justify, being satisfied that experiment, like analysis, must lead to strict truth if rightly interpreted, and believing also that it is in its nature far more suggestive of new trains of thought and new conditions of natural power" (Adler 768). Experimental data acquisition can produce useful scientific results without necessarily depending on the formation of dueling hypothetical models.

Importantly, and perhaps ironically, alternative approaches of observation and hypothesis-free testing can also produce novel hypotheses and models for future testing. Provided a sufficient accumulation of observations, one can more easily identify relationships between variables. The modern study of statistics builds on the assumption that any variable with a normal (i.e. Gaussian) distribution has a finite mean (µ) and finite standard deviation (σ). As the number of observations (N) approaches infinity, the sample mean converges upon the true mean and the uncertainty of that estimate shrinks. In a simpler context: if one wanted to know the average age of humans, the most accurate measure would be to collect every single person's age on earth. The mean calculated from the first two individuals will likely be inaccurate, but concatenation of all the data will converge on the true mean and has the added benefit of providing more information for multiple different tests. More importantly, even in the absence of a well-constructed hypothesis one can still use a large dataset to test existing alternative hypotheses, which is exactly what happened with the 1000 Genomes Project. The large amount of DNA sequenced gave researchers the ability to identify differential rates of disease and heredity in different populations. Large-scale hypothesis-free testing can produce results in high enough abundance that they give rise to new testable hypotheses within the large dataset.
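As a rough illustration of this convergence, consider the following sketch in Python; the "age" distribution parameters are invented for the example and are not drawn from any real dataset. The sample mean of a normally distributed variable drifts toward the true mean, and its standard error shrinks, as N grows.

```python
# Sketch: convergence of the sample mean as N grows (law of large numbers).
# The "true" mean and SD below are hypothetical, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_sd = 38.0, 22.0   # invented parameters for a hypothetical "age" variable

for n in (2, 100, 10_000, 1_000_000):
    sample = rng.normal(true_mean, true_sd, size=n)
    sem = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    print(f"N={n:>9,}  sample mean={sample.mean():6.2f}  SEM={sem:.3f}")
```

Running this shows the estimate from N = 2 wandering far from the true mean, while the million-observation estimate sits essentially on top of it with a vanishingly small standard error.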

More importantly, in the absence of an established model describing some natural phenomenon, the curious scientist must generate one without competing models to react against. Every historical example of competing scientific models at some point originated from observations made in an environment devoid of proposed models. For example, Richard Feynman's parton model built upon Albert Einstein's theories of relativity, which built upon Sir Isaac Newton's theory of gravity, which depended on Johannes Kepler's laws of planetary motion, which depended on Copernicus's heliocentric theory, which has been attributed in part to Aristarchus in ancient Greece, and so on ad nauseam. At a certain point in tracing the historical origins of modern scientific theory, one reaches an event horizon at which there must presumably have been innovation of thought based solely on observation rather than a formal response to a developed model. One illustration of a scientific model developed independently from other models of its era is the ancient Mayan calendar, based on astronomical observations using azimuths and the tracking of celestial bodies. These observations resulted in a working model built on measured observation of the natural world, independent of the models identified by Ptolemy. In the absence of an established, falsifiable scientific model, it becomes necessary to construct one's own model from accumulated scientific observations.

Testing a specific model via a firmly entrenched hypothesis puts a scientist at risk of becoming highly dependent on a specific outcome or anticipated result. A hypothesis that there is a specific causal relationship between two variables biases the researcher toward anticipating, and relying upon, that result coming true for scientific progress to move forward rapidly. An example of this approach would be testing either Gene A or Gene B as the hypothesized underlying cause of a disease state. If one believed a particular gene (Gene A) causes a specific disease, then one could analyze Gene A to see how it changes in a healthy individual versus a diseased one. This method will confirm or deny the hypothesis, but one result is clearly a more beneficial outcome than the other. First, let us assume the result is that the level of Gene A is greatly changed in the disease state compared to the healthy state. This is a highly desirable result and entices the researcher to investigate Gene A and its associated biology more deeply. Alternatively, let us assume that the researcher finds no change in Gene A. This result is not appealing to collaborators, grant reviewers, or the scientist in terms of collecting accolades, funding, or, most importantly, scientific insight. Now the researcher is faced with returning to test Gene B, Gene C, and so on ad nauseam, with only an incremental increase in the likelihood of achieving the desirable result: the identification of a causative gene. Restricting scientific experimentation to ruling between dichotomous hypothetical models tends to put the researcher at risk of implicit bias toward the more favorable result because it greatly restricts the scope of experimental results.

Designing experiments with a hypothesis that favors a particular model can put a scientist at increased risk of Type I errors. A Type I error is the rejection of a true null hypothesis (a false positive). A real-world example of this concept is the early miasma theory regarding the cause of bubonic plague in Europe. The miasma theory hypothesized that the black plague epidemic spread from "bad air" produced by rotting organic matter, with the null hypothesis that it was not caused by airborne spread. To test this idea, one might remove rotting organic matter, find that rates of infection fall, and conclude that the miasma theory is correct, but this is a Type I error. It is easy to imagine that less rotting organic matter would mean fewer animals such as rats scavenging food waste, which means fewer fleas, which in turn means less bacterial infection from flea bites. The miasma theory was eventually replaced by germ theory when it was discovered that the Black Death was spread by the bacterium Yersinia pestis through flea bites. By specifically favoring results that bolster the miasma model, one risks concluding a false positive: that organic decay causes bubonic plague. Perhaps a more contemporary example is the perceived correlation between vaccines and autism. Vaccines are typically introduced to children in the first year of life. Abnormal behavior or differences in the sensory perception of autistic infants typically become observable during those same first years of life. The timing of diagnosis shortly after vaccination has led to the erroneous conclusion that these two events are somehow causally related based solely on a temporal relationship, but by the same token the use of diapers or infant formula could be blamed for autism. Over-reliance on one specific model leads to interpretation of results in favor of the hypothesis underlying that model; this is commonly referred to as confirmation bias. Interpreting data to specifically address one hypothesis over another tends to bias the interpreter toward a particular result, or set of results, and thus increases the risk of Type I errors.
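To make the false-positive risk concrete, here is a minimal simulation sketch in Python; the group sizes, the absence of any true effect, and the significance threshold are illustrative assumptions rather than values from the essay. When two groups are drawn from the same distribution, a t-test at α = 0.05 still "rejects" the true null about 5% of the time by chance alone.

```python
# Sketch: Type I error rate when the null hypothesis is actually true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_trials, false_positives = 0.05, 10_000, 0

for _ in range(n_trials):
    control = rng.normal(0.0, 1.0, size=30)   # hypothetical measurements, no true effect
    treated = rng.normal(0.0, 1.0, size=30)   # drawn from the same distribution: null is true
    _, p = stats.ttest_ind(control, treated)
    false_positives += p < alpha              # counted as a "significant" result by chance

print(f"False positive rate: {false_positives / n_trials:.3f}  (expected ~{alpha})")
```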

Similarly, reliance upon a specific model when forming a hypothesis can also put the researcher at higher risk of Type II errors. A Type II error is the acceptance of a false null hypothesis (a false negative), which leads the researcher to abandon what would otherwise be the correct course of action. For example, triple-negative breast cancer involves the growth of tumors which do not express estrogen receptors, progesterone receptors, or HER2 receptors. While one could conclude that a tumor is benign because it is negative for these markers, in some cases these cancerous tumors continue to grow and proliferate. A clinician, pathologist, or other diagnostician may favor the outcome that their patient does not have breast cancer but instead suffers from a different ailment, and in some rare cases interpret the triple-negative result incorrectly, thereby producing a Type II error. Another proclivity for Type II error can occur in genetic knockout phenotyping in neurobiology. Suppose a scientist makes a transgenic animal lacking a specific gene (Gene A), which a competing lab has proposed to cause blindness. The scientist examines all of the rod and cone cells of the retina and concludes that the eye is normal and that Gene A does not cause blindness. To be sure of this, the scientist would also have to examine all of the amacrine cells, the retinal ganglion cells, the Müller glia, the vasculature, the tectum, the visual cortex, and so on. In both examples, continued testing requires time, energy, and money, and the motivation to continue may be curtailed by premature acceptance of the null hypothesis. Typically, favoring one theory over another leads to favoring one hypothesis over another, which in turn can lead to favoring one interpretation of results over another, which can lead to a Type II error, also known as acceptance of a false negative.
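A companion sketch for the false-negative case, again with an invented effect size and invented sample sizes: a real but modest difference between wild-type and knockout groups is routinely missed at small N, which is exactly the premature acceptance of the null described above.

```python
# Sketch: Type II error rate when a real but modest effect exists.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
effect, alpha, n_trials = 0.4, 0.05, 5_000   # hypothetical 0.4 SD shift caused by the knockout

for n in (10, 30, 100):
    misses = 0
    for _ in range(n_trials):
        wild_type = rng.normal(0.0, 1.0, size=n)
        knockout = rng.normal(effect, 1.0, size=n)   # a true difference exists
        _, p = stats.ttest_ind(wild_type, knockout)
        misses += p >= alpha                          # the real effect went undetected
    print(f"n={n:>3}  Type II error rate: {misses / n_trials:.2f}")
```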

Given the tendency of researchers to become biased toward particular theories, recent support for hypothesis-free testing has attempted to rectify this tendency toward error, but the approach is limited by its inability to directly refute potential outcomes. Despite the ability to examine tens of thousands of genes from thousands of organisms simultaneously, Nobel laureate Sydney Brenner once referred to this approach as "low input, high throughput, no output" science because of its inability to address specific theories (Friedberg 2008). For example, the field of developmental biology often favors single-cell RNA sequencing to assess changes in gene expression over time, with the idea that given a large enough dataset, the relevant cell types will cluster together. The vast cell number and implicit heterogeneity of a dataset with limited depth of transcript detection, coupled with machine-learning techniques that function independently of hypotheses, produce clusters which can be classified as different cell types. However, the relevance of an ever-increasing number of cell types to important questions in developmental biology remains unclear. In many cases, identifying heterogeneity within a cell population does not explain the function of those cells, their origin, or how they contribute to the broader organization of a tissue or the body plan of an organism. Hypothesis-free testing and high-throughput data clustering can produce results which often fail to address scientific paradigms within a research field, and such results ultimately fade into obscurity due to an overly methodological approach which fails to accept or reject a model.

Hypothesis-free testing also faces a challenge similar to the Type I error, a false-positive risk stemming from the inability to direct research with a specific underlying hypothesis. Large-scale data acquisition will, ipso facto, produce statistically significant differences between conditions given enough observations to compare. For example, in recent years the field of experimental psychology has relied heavily on functional magnetic resonance imaging (fMRI), which can collect massive amounts of data from the entire brain with spatial and temporal resolution that is impressive from a historical perspective. Furthermore, collections from tens of thousands of healthy individuals across international human brain mapping collaborations produce unfathomable amounts of data. Given the vast scope of such data, it is relatively easy to produce statistically significant differences between brain regions or to identify correlated activation in regions which may not be relevant. This has led both to a series of published works which are irreproducible and to conclusions which fail to address the crux questions of human consciousness because they do not originate from a falsifiable hypothesis. Given enough data, conclusions can easily be reached which, while not Type I errors in the strict sense, still serve as red herrings for a field and ultimately prolong stasis of scientific understanding.
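The point about large datasets manufacturing significance can be sketched directly; the "voxel" count, subject count, and thresholds below are arbitrary stand-ins, not figures from any brain-mapping study. Correlating thousands of pure-noise time courses against a random behavioral score yields hundreds of nominally significant hits, essentially none of which survive a standard multiple-comparisons correction.

```python
# Sketch: spurious "significant" correlations arising from many comparisons on pure noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_subjects, n_voxels = 40, 10_000
behavior = rng.normal(size=n_subjects)              # random "behavioral score"
voxels = rng.normal(size=(n_voxels, n_subjects))    # random "voxel" signals, no real effect

p_values = np.array([stats.pearsonr(v, behavior)[1] for v in voxels])
print(f"Voxels 'significant' at p < 0.05 uncorrected: {(p_values < 0.05).sum()} of {n_voxels}")
print(f"Surviving Bonferroni correction: {(p_values < 0.05 / n_voxels).sum()}")
```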

Supporters of hypothesis-free testing often tout hypothesis generation as a strength of the approach, while detractors claim that it fails to resolve disagreements between theories. Thomas Kuhn writes, "We often hear that they are found by examining measurements undertaken for their own sake and without theoretical commitment. But history offers no support for so excessively Baconian a method [...], laws have often been correctly guessed with the aid of a paradigm years before apparatus could be designed for their experimental determination" (Kuhn 29). However, notable breakthrough hypotheses have been credited to inductive reasoning independent of an established dichotomy of theories. For example, the origin of the structure of the benzene ring has been associated with August Kekulé's dream, as he describes it: "I was sitting writing on my textbook, but the work did not progress; my thoughts were elsewhere. I turned my chair to the fire and dozed. Again the atoms were gamboling before my eyes. This time the smaller groups kept modestly in the background. My mental eye, rendered more acute by the repeated visions of the kind, could now distinguish larger structures of manifold conformation; long rows sometimes more closely fitted together all twining and twisting in snake-like motion. But look! What was that? One of the snakes had seized hold of its own tail, and the form whirled mockingly before my eyes. As if by a flash of lightning I awoke; and this time also I spent the rest of the night in working out the consequences of the hypothesis" (Olah 2001). While this hypothesis was groundbreaking in organic chemistry, critics have noted that Kekulé may have concocted the dream explanation to detract from the contributions of Alexander Butlerov and Archibald Couper, other chemists of the time (Browne 1988; Sorrell 1998). Hypotheses can be generated independently of the recent approach of unbiased high-throughput screening, and hypotheses can be generated independently of dueling theories within a field of interest.

Another criticism of unbiased or hypothesis-free scientific questions is that they fail to address the outstanding questions that predominate within a given field. In this sense, generating a question with an unresolved answer remains as problematic as coming up with an answer without any resolved question. For example, the wide application of single-cell transcriptomics has enabled numerous researchers to claim that they have identified unique cell types. In this case, the answer researchers found is that there are more cell types (i.e. more tissue heterogeneity) than we previously thought. The implied question would then be: are there more cell types in a given tissue? Perhaps such a question is not particularly relevant to advancing a field, while a more relevant scientific question might be, "how does a tissue organize and develop over time?" One risk that hypothesis-free scientific approaches face is failing to adequately address unresolved questions in scientific research.

The danger of not directly addressing predominant theories in the field is that it can diminish the impact of a study by exempting it from overt criticism. In many ways, the hypercompetitive nature of modern scientific research has created problems for aspiring researchers attempting to obtain funding, but one benefit of highly contested fields may be the abundance of skeptics monetarily incentivized to identify flaws in their competitors' arguments. In this sense, directly addressing a hotly contested area of research with a novel hypothesis places the impetus on the researcher to triple-check their own assumptions when generating predictions for future research. On the other hand, if a researcher avoids such confrontation through hypothesis-free testing, they may put themselves at a disadvantage by reducing the number of interested, skeptical competitors who will review their work. For example, within the field of inner ear physiology, a hotly contested topic has been how hair cells of the inner ear engage in mechanotransduction. Vigorous debate ensued between two camps over whether a specific MET channel was responsible for depolarizing hair cells and enabling downstream neurotransmitter release to generate action potentials in afferent neurons projecting to the brain. In this example, claims about the MET channel had to be carefully tempered, and hypothesis testing had to anticipate and address potential claims from detractors in order to be taken seriously. Thus, the controversy and competition generated by detractors of a hypothesis can serve as an important check that balances ongoing research, which is important for scientific progress. Insufficient conflict over a novel hypothesis can lead to half-hearted scientific debates that are no more than two ships passing in the night, leaving results without scrutiny.

Novel hypothesis generation from the so-called unbiased or hypothesis-free perspective can also fail to generate interest in a given area of research. Avoiding conflict not only results in diminished scrutiny but can also fail to further a scientific discipline when the work falls too far beyond the scope of a particular area of study. As an illustration, look to the findings of Gregor Mendel in his study of heredity using pea plants. While these observations proved foundational to the field of genetics, their inapplicability to the theories of the time meant they were largely overlooked. For this reason, hypotheses which fail to address the preeminent dogma of the day, or which do not directly conflict with competing findings, often fail to stir the interest necessary for longstanding consideration by the scientific community.

The arguments presented thus far have described various pitfalls associated with different strategies for hypothesis formation, but what, then, are common criteria for hypothesis formation that are likely to yield success? While there is a high degree of unexpected finding in science, there are some universalities that can be gleaned from a rich history of scientific research to help trainees develop hypotheses skillfully. As Louis Pasteur put it, "fortune favors the prepared mind". The task of generating an all-encompassing list of how to construct a perfect hypothesis is a daunting one, but, "If the coherence of the research tradition is to be understood in terms of rules, some specification of common ground in the corresponding area is needed" (Kuhn 44). Thus, here I put forth a few criteria for a sound hypothesis which are hopefully independent of a trainee's methodological constraints and pedigree with respect to established models or theories.

A hypothesis ought to directly address outstanding unanswered questions within the field of research. As previously mentioned, a hypothesis which fails to engage with the prevailing scientific thought of its field is at high risk of becoming obsolete. This obsolescence can stem from perceived inapplicability of the question at hand, from insufficient scrutiny by peers, or from results which leave researchers indifferent. For example, the hypothesis that two different species of rodents are capable of visually discriminating similar types of objects fails to address larger questions in the field of visual neuroscience. Instead, the question likely needs to focus more specifically on areas of active pursuit. Currently, mechanisms of synaptic plasticity or systems-level processing between different visual areas that facilitates encoding are more favorable areas of pursuit than the previously mentioned example. A hypothesis needs to directly address open questions within the desired field of study.

A hypothesis ought to be testable and falsifiable. While one can imagine an almost infinite number of questions surrounding a phenomenon, the inability to test such ideas makes the process of infinite hypothesis generation overly speculative. Researchers need to strike a balance between inductive empirical logic and the pragmatism of results-based observation in order to make predictions about future events which can be tested with some finality. For instance, hypotheses which seek to negate an idea within a field without positive data to support an alternative are often doomed to fail due to the difficulty of "proving a negative". A bad hypothesis in this vein would be: DNA sequence is not the most important aspect of gene expression. In this example, "importance" is vague and difficult to quantify, and even an additional finding pointing to other phenomena would not necessarily negate DNA's role in gene expression. A more direct, testable hypothesis might be: DNA can be modified by methylation in order to silence gene expression. The implication is similar, but there is now a way to measure and test, via expression levels and methylation levels, whether the hypothesis is supported or refuted. The strongest hypotheses are often only validated in retrospect through interpretation of results, but upon initial formulation a good hypothesis should be phrased definitively enough that there is no risk of arriving at an unresolvable answer.

A hypothesis ought to be independent of a specific desired result. Implicit bias toward bolstering support for a given theory or outcome is widespread across many disciplines, but it behooves the emerging scientist to formulate hypotheses which contribute important information regardless of the outcome. For example, an unresolved question in a given field may be, "what role does Gene A play in Alzheimer's disease etiology?" A bad hypothesis would be: Gene A causes Alzheimer's disease, because the threshold for proving causative effects of Gene A in an already well-studied and genetically complex disease is quite high. Anything short of groundbreaking results that change the field of neurodegenerative research would be a failure. For instance, assume the results show a correlative increase in Alzheimer's within a subpopulation of patients. In this case, the hypothesis as stated would not be supported and would instead require revision. A better hypothesis would be: Gene A significantly increases the risk of developing Alzheimer's disease in population A. In the second example, the question of Gene A's role in Alzheimer's disease is addressed more specifically and the burden of proof is lowered so that the researcher does not depend entirely on one all-or-none outcome. Careful drafting of hypotheses can ensure that the scientist is not paving a path to failure by remaining overly expectant of one outcome.

A well-formed hypothesis should anticipate outcomes which generate additional hypotheses. One trait of effective scientific research is that it invariably inspires further novel research without remaining entrenched in overly cyclical iterations of the same questions. While drafting a hypothesis a researcher may anticipate several outcomes, and in the best-constructed hypotheses all of these outcomes should produce additional novel questions of interest. For example, assume that a researcher is examining the mechanism of a drug which reduces symptoms associated with chronic depression. A good hypothesis might anticipate several different intracellular signaling pathways, and for each pathway the researcher's imagination should ignite with other hypothetical interactions. However, if the anticipated outcomes produce a definitive answer which leaves the scientist uninterested in further pursuit, then it is best to restructure or reconsider the course of research. In this sense, although outcomes are often unpredictable, the scientist should not be naïve about the consequences of the outcomes and should anticipate their relative importance as a useful exercise.

A well-formed hypothesis strategically anticipates the consequential impacts of potential results. While "fortune favors the prepared mind", it also favors the bold, and the more noteworthy scientific advances tend to be those involving significant changes to the experimental paradigms of an era, which at times come with some risk to the scientist's career and potentially their safety. While anticipating the potential experimental outcomes of a hypothesis, scientists also benefit from conceptualizing the potential consequences of such research on a societal level. Notably, the example given earlier of H. pylori as the cause of stomach ulcers was in part demonstrated by consumption of the bacteria by its discoverer, quite a committed gesture for a man with a clear stake in the results of the research. Likewise, Charles Darwin carefully weighed the consequences of publishing On the Origin of Species for a long time before deciding to make his work public. In contrast, Albert Einstein was much more concerned with understanding mass and energy than with the inevitable application of such discoveries. After witnessing the vast destruction wrought by the atomic bomb, Einstein famously remarked, "If I had known, I should have become a watchmaker". The relative scientific and societal impact, whether large or small, should be carefully considered during the formulation of a hypothesis.

Works Cited

Adler, M.J. (1992) The Great Ideas: A Lexicon of Western Thought. Macmillan Publishing Company.

Aristotle, De Generatione et Corruptione, translated as On Generation and Corruption by H. H. Joachim in W. D. Ross, ed., The Works of Aristotle, vol. 2 (Oxford: Oxford, 1930)

Browne, M. (1988) The Benzene Ring: Dream Analysis New York Times August 16, 1988, Section C, Page 10.

Darwin, C. (1859) On the Origin of Species

Darwin, C. (1871) The Descent of Man

Friedberg, E.C. (2008) An Interview With Sydney Brenner Nature Reviews January 2008.

Kuhn, T. (1996) The Structure of Scientific Revolutions The University of Chicago Press

Olah, G. (2001) A Life of Magic Chemistry: Autobiographical Reflections of a Nobel Prize Winner, p. 54.

Sorrell, T. (1998) Organic Chemistry University Science Books

Warren, J.R. and Marshall, B. (1983) Unidentified Curved Bacilli on Gastric Epithelium in Active Chronic Gastritis. Lancet 1983 Jun 4;1(8336):1273-5.

 

 

On Experimental Methodology

Scientific experimental methodology often conflicts directly with intuitive or holistic approaches to understanding truth because the scientist intentionally attempts to undermine the legitimacy of claims. The scientific method uses deductive logic to test individual variables and assess the accuracy of a given claim in an infinitely reiterative attempt to reach an unobtainable truth. "We certainly can't believe ingenuously that a scientific theory, I mean in the field of natural sciences, could be something like 'true'. Not because of some radical skepticism toward the sciences, but rather by virtue of the very process of science. In the course of its history, this process showed an extraordinary inventiveness in ceaselessly destroying its own theories including the most fundamental ones, replacing them with paradigms whose novelty was so extreme that nobody could anticipate the beginning of their configuration" (Meillassoux 2014). The scientific method in this sense is autophagous, even cannibalistic at times, because it seeks an ever more precise approach which limits variation and eliminates claims which cannot be reliably observed.

During the optimal design of a scientific study, the experimenter searches for holes or inconsistencies in a theory or commonly held observation in order to better understand some phenomenon. Attempts to understand the nature of reality which rely entirely on intuition, without attempting to falsify any claims, often lack depth of understanding because they preclude hypothesis testing in favor of faith-based acceptance of a claim. Descartes writes, "As for the false sciences, I saw no need to learn more about them in intellectual self-defence: I thought I already knew their worth well enough not to be open to deception by the promises of an alchemist or the predictions of an astrologer, the tricks of a magician, or the frauds and boasts of those who profess to know more than they do" (Descartes 4). In this case, a "false science" would be an explanation of the natural world which ignores data deduced from scientific experiments. Take, for example, an individual who declares themselves an ideological supporter of homeopathic medicine and joins an anti-vaccination movement championing the belief that vaccines cause Autism Spectrum Disorder (ASD), a claim that they trust intuitively. The individual in this example reasons that the symptoms of ASD appear in infants after they receive a vaccine, so it makes intuitive sense that there is a correlation between vaccination and ASD onset. Note that this approach lacks skepticism, as it does not seek to test the supposed correlation, which is what makes it an intuitive approach lacking in depth: correlation does not mean causation. Scientific studies, including some funded by anti-vaccine groups, have rigorously tested this hypothesis and found no correlation between vaccines and ASD (Gadad et al. 2015; Hasegawa et al. 2018; Curtis et al. 2015). Persistent ideological belief based on intuition and emotion, exempt from scrutiny, is unscientific, and researchers ought to avoid such approaches when designing studies.

Rather than circumventing personal biases, an optimal experimental design ought to include a way to directly test the presuppositions held by the experimenter conducting the study. An ideal researcher is capable of playing devil's advocate by stepping into an impartial mindset and attacking their own claims, and then testing the validity of those attacks experimentally. For example, if a researcher has strong data suggesting that their new antibiotic is the most effective way to treat an ear infection, then it is incumbent upon the researcher to validate that claim through rigorous testing. The researcher would benefit from strategically testing alternative treatments that may show similar or improved efficacy, cheaper production costs, or reduced side effects. In this sense, good experimental design is like establishing fair competition in sports. Ideally, you want to normalize the playing field to ensure that competitors begin without a drastic advantage beyond participant skill level. Normalization is why there are weight classes in professional fighting as well as distinctions between professional and amateur sports leagues. At the experimental design level, a good researcher seeks to normalize the playing field so that their own hypothesis competes with other ideas and the most likely explanation can be identified.

Typically, effective testing of a hypothesis requires an experimental design which focuses on a specific variable at the expense of others, and it therefore behooves the research scientist to simplify the experimental design as much as possible. René Descartes realized early on that personally testing every variable would be unrealistic within a scientist's lifetime: "Every day increases my awareness of how my project of self-instruction is being held back by the need for innumerable experiments that I need and can't possibly conduct without the help of others" (Bennett 29). For example, when examining the causal relationship of the gene ApoE4 to Alzheimer's disease, the best experiment should attempt to examine the effect of ApoE4 directly by examining the consequences of a loss-of-function mutation (i.e. a gene knockout). It may be tempting to include additional variables, such as increasing the expression of ApoE4, knocking out ApoE4 and ApoE2 together, or knocking the gene out at different times, but these additional variables add a great deal of complexity. Each measurement carries a level of uncertainty, and the overall likelihood of experimental error increases with the number of variables tested. Thus, the most straightforward way to test a hypothesis effectively is to identify the key variable involved in the observations and isolate it to limit complexity.

Moreover, the larger the number of variables examined in a scientific experiment, the higher the degree of complexity involved in the study design. The simplest experiment tests a single independent variable, which avoids added complexity and allows multiple internal controls. A commonplace illustration of how complexity grows is the password used to log into an encrypted webserver. A longer password is more secure because of the larger number of character combinations it permits. A brute-force attack obtains a password by iteratively guessing every possible combination of characters, and the size of the search space can be modeled as n^r, where n is the number of possible characters and r is the length of the password. A single-character password drawn from the 94 printable ASCII characters would be identified in at most 94 attempts, but a 10-character password has 94^10 (roughly 5 × 10^19) possible combinations, which would take a great deal of processing time to exhaust. Consider a similar relationship applied to an experiment, where more variables and more conditions per variable result in greater experimental complexity. In such a case, the research scientist attempting to crack a chemical or biological code benefits from reducing the number of variables, and thus the number of combinations to be tested, to as few as possible, as sketched below.
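The arithmetic behind this comparison can be made explicit with a short sketch; the character counts are the usual conventions of 62 alphanumeric and 94 printable ASCII characters, and are not figures taken from the essay.

```python
# Sketch: brute-force search space grows as n**r (alphabet size n, password length r).
ALPHANUMERIC = 62        # a-z, A-Z, 0-9
PRINTABLE_ASCII = 94     # alphanumeric plus common punctuation/symbols

def search_space(alphabet_size: int, length: int) -> int:
    """Number of candidate passwords a brute-force attack must consider."""
    return alphabet_size ** length

for length in (1, 4, 10):
    print(f"length {length:>2}: "
          f"{search_space(ALPHANUMERIC, length):,} alphanumeric candidates, "
          f"{search_space(PRINTABLE_ASCII, length):,} full-ASCII candidates")
```

The same exponential blow-up applies, loosely, to an experiment in which every added variable multiplies the number of conditions that must be tested.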

The greater the degree of variable complexity, the greater the processing-time demands within the experimental design. Adding levels of abstraction or larger numbers of variables in real time has been shown to increase the amount of processing time needed to complete a specific task, even for the human brain. For example, reading aloud a list of color words printed in matching ink colors is easier than reading the same words printed in mismatched ink colors, due to the incongruency between the perceived color and the meaning of each word. This phenomenon, known as the Stroop effect, causes a delay in reaction time, and the same could be said of the added steps required to unpack visual cues for colorblind or dyslexic individuals. Although the emergence of quantum computers has allowed the solution of computational problems in cryptography which would otherwise be time-prohibitive by conventional computation, there remains a direct relationship between complexity and processing time. Scientists attempting to interpret higher degrees of complexity or abstraction within an experimental design are likely to incur a cost in the form of slower progress, and therefore simplicity in experimental design is an ideal strategy.

As briefly mentioned, an increase in experimental complexity also increases the likelihood of error. "In science, the word error does not carry the usual connotations of the terms mistake or blunder. Error in scientific measurement means the inevitable uncertainty that attends all measurements. As such, errors are not mistakes; you cannot eliminate them by being very careful. The best you can do is ensure that errors are as small as reasonably possible and to have a reliable estimate of how large they are" (Taylor 1997). On a long enough timeline, the likelihood of an error occurring by chance grows, with the probability approaching 1. For example, a scientist may measure the length of a bone in millimeters using a ruler that has a standard error of 1 mm. That same error would be unacceptable if the scientist were measuring a cell of 10 micrometers, and thus the researcher must rely on a microscope with a different standard error. Many labs attempt to test multiple variables at a time by employing a very large staff using multiple different methods of measurement. However, this strategy contributes to the propagation of error, as each uncertainty is carried through to the interpretation of results. Each potential source of error compounds the others, increasing the overall likelihood of a false positive or false negative as described in the section on hypothesis formation; a rough sketch of this compounding appears below. Increasing the number of variables increases the complexity, the time to completion, and ultimately the likelihood of error for the overall study.
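A minimal sketch of that compounding, following the standard rule that independent uncertainties add in quadrature; the 1 mm figures simply reuse the ruler example above as a hypothetical.

```python
# Sketch: propagation of independent measurement uncertainties by addition in quadrature.
import math

def combined_uncertainty(uncertainties):
    """Quadrature sum of independent standard uncertainties."""
    return math.sqrt(sum(u ** 2 for u in uncertainties))

# One measured variable with 1 mm uncertainty versus five variables, each with 1 mm:
print(combined_uncertainty([1.0]))       # 1.0 mm
print(combined_uncertainty([1.0] * 5))   # ~2.24 mm: combined uncertainty grows with each variable
```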

Experimental complexity can also increase the probability of error in the interpretation of results. More complex results tend to obscure the causal relationships between independent variables because of the increased number of conditions needed to test each variable. A clear example of this effect is epidemiological study design in human populations with respect to nutrition, where it is impossible to control for all of the variables examined. In such studies, an attempt to identify a simple causal relationship between a proposed independent variable, such as sugar intake (g), and a dependent variable measured as body mass (kg) becomes difficult. While the experimental design seems simple, interpreting the result can become quite complicated due to the many unexpected variables within the study design (age, sex, exercise level, fat intake, protein intake) which may partially explain the perceived causal relationship, or lack thereof, between an individual eating 10 g more sugar and gaining 10 kg of body mass. At best, many epidemiological studies can make strong correlative claims without a strong underlying causal basis; at worst, they run the risk of falsely identifying a correlation which is more strongly dependent on an underlying, unseen variable that the experimentalist has not yet recognized. Experimental complexity is by no means limited to epidemiological studies, and any study design which incorporates too many variables runs a higher risk of erroneous interpretation.

Both realized and unrealized experimental complexity can increase the likelihood of an erroneous conclusion, and it is necessary to address the apparent complexity of a problem prior to experimental design to avoid such pitfalls. While it is important to limit the number of variables being tested within an individual experiment, it is also important to recognize the limitations of the experimental paradigm being used and to anticipate criticisms of anticipated results. As Albert Einstein suggests, "Everything should be made as simple as possible, but not simpler" (Calaprice 2011). Many classic experiments involved very simple experimental variables. For example, Otto Loewi originally identified neurotransmitters by taking fluid from an electrically stimulated frog heart and using it to drive beating in a different frog heart without electrical stimulation. Fascinatingly, the chemically stimulated heart responded in proportion to the amount of neurotransmitter that was added to it. The experiment showed what was not obvious at the time: that the contractility of heart muscle is driven by secreted chemical components rather than by electrical stimulation alone, while leaving open which chemicals were driving the effect. "In my opinion, these observations prove the onset or offset of the various effects are a function of the concentration of the same vagus substance" [translated from German] (Loewi 1924). Loewi could have made grandiose claims about which components of the vagus substance drove contractility, but this would have added the need to specify additional variables for testing, and the somewhat broad conclusion is more accurate and less complex to explain overall. Simplicity of experimental design avoids the tendency toward errors in experimental execution and has the added benefit of simplicity in results, by limiting the number of variables with which the experimenter needs to wrestle.

Controls also reduce variability by contextualizing results and increasing the precision and accuracy of analysis. Although including controls in a study might seem to add a layer of complexity, their presence often simplifies the overall study design by reducing the total number of variables at the analysis stage. Internal controls can reduce the number of variables by eliminating potential causative variables, using an internal reference against which values are compared and normalized. In the example of the epidemiological sugar study, one way to implement an internal control would be to normalize for anticipated variables by only comparing subjects with similar age, sex, body weight, and so on, allowing only sugar intake to vary among subjects. Although this is often referred to as "controlling for the variables", it is not a true internal control. In this same example, a true internal control would be a known subject that is positive or negative for sugar intake: the positive control would be a person who eats only sugar, and the negative control would be someone who does not eat any sugar at all. Verified controls are especially useful for providing context when interpreting complex results, by establishing a range at either value extreme which contextualizes the rest of the dataset. Positive and negative controls can thus serve as Boolean values (0 or 1), against which the presence or absence of a potential causal variable can be observed with better scrutiny.

Another important function of internal controls is to ensure precision and accuracy of measurement. Using certified reference materials as controls allows a scientist to examine the quantitative relationship between two variables with more precision than a Boolean 0 or 1. For example, a patient's blood sample being tested for methylmercury is typically extracted and analyzed alongside a standard curve, because the clinician wants to know how much methylmercury is in the blood, not merely whether it is present or absent. A typical approach includes a standard curve consisting of many different concentrations of a certified reference material prepared by an impartial third-party source. The standard preparations are diluted to encompass the range of values predicted for a typical patient. For example, if most patient samples contain between 1 and 100 parts per billion (ppb), then the standard curve might contain 0.1, 1, 25, 50, 75, 100, and 1,000 ppb of methylmercury from a certified reference material. A calibration curve bracketing the samples assists analysis by providing a sense of the precision of measurement at different concentrations, as sketched below.
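A minimal sketch of that workflow, with invented instrument responses for the standard concentrations listed above: fit a line to the standards, then back-calculate an unknown sample's concentration from its signal.

```python
# Sketch: linear calibration from certified reference standards and back-calculation of an unknown.
import numpy as np

standards_ppb = np.array([0.1, 1, 25, 50, 75, 100, 1000])   # certified reference concentrations
signal = np.array([0.4, 4.1, 101, 198, 305, 402, 4010])     # hypothetical instrument responses

slope, intercept = np.polyfit(standards_ppb, signal, deg=1)  # linear calibration fit

def back_calculate(sample_signal: float) -> float:
    """Convert an instrument response to a concentration using the calibration line."""
    return (sample_signal - intercept) / slope

print(f"Sample reading 160 -> {back_calculate(160):.1f} ppb methylmercury")
```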

The use of experimental controls often provides important context for the interpretation of experimental results. Controls establish a functional range which extends above and below the sample values for the experimental variables being tested, and they describe the background levels of analyte present in the sample matrix. For example, in patients' blood there is some amount of analyte to be assessed (i.e. between 0 and ∞), and many different values may be identified. The range of control values ought to span orders of magnitude above and below the anticipated mean of the samples. With respect to the instrument's dynamic range, controls define the lower and upper limits of detection (LOD) and thus provide context for what exactly a "0" or "non-detect" means (Buckingham 458); one common convention for estimating an LOD is sketched after the figure below. In the curve below, 0 is considered the LOD, but in some cases an instrument may not be able to distinguish between 10 units and 0 units, and thus the LOD would be 10. The minimum and maximum limits of detection and quantitation, and the range of control values, must be established experimentally for each analyte. For example, assume that the orange line represents the detection of a specific protein, while perfect linearity of detection is represented by the blue line, where the expected value is always measured. In the case of the orange protein, the linearity is only predictable and consistent between 0 and 600 units, and beyond that point the measurement becomes nonlinear. This result would likely encourage the researcher to dilute their samples into the dynamic range of the instrument for the analyte of interest. Experimental controls thus provide the background against which the researcher interprets measured values.

[Figure: Calibration curves. The blue line shows an ideal linear response; the orange line shows measured detection of a protein analyte, which is linear up to roughly 600 units and nonlinear beyond that point.]
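One common convention for estimating a limit of detection, roughly three standard deviations of repeated blank measurements converted to concentration units via the calibration slope, can be sketched as follows; the blank readings and slope are invented, and the essay itself does not prescribe this particular formula.

```python
# Sketch: estimating an LOD as ~3 standard deviations of blank replicates, scaled by the calibration slope.
import numpy as np

blank_signal = np.array([0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2])  # hypothetical repeated blank readings
calibration_slope = 4.0                                        # hypothetical signal units per ppb

lod_ppb = 3 * blank_signal.std(ddof=1) / calibration_slope     # smallest concentration distinguishable from blank
print(f"Estimated LOD: {lod_ppb:.2f} ppb")
```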

For some variables tested, the experimental conditions may exhibit a great deal of variance, which can be tracked across different analyses using CCVs, duplicates, and replicates. The dynamic range of detection can vary for many reasons, including instrument type, temperature, length of the experiment, reagent lot or batch purity, working-solution preparation, and more. The variance inherent to a given experimental analysis often merits the inclusion of a continuing calibration verification (CCV) to help avoid these pitfalls. A CCV is often prepared in a large amount to allow repeated measurements with each analytical run, which adds confidence for the researcher interpreting the results. CCVs also provide the added benefit of giving a sense of the rate of sample degradation, instrumental variance over time, and unexpected contamination of a commercial reagent. An experienced researcher may also choose to include duplicate preparations and/or replicate analyses to assess variation in the experimenter's preparation or drift by the instrument over the length of a specific run. A duplicate is a single sample prepared two times for two separate analyses, to check repeatability of preparation and analysis. In contrast, a replicate is prepared only once but analyzed two times, to check the repeatability of the analytical instrumentation independently of the preparation. CCVs, duplicates, and replicates are useful controls and provide context for variability within and between samples; two of these routine checks are sketched below.
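Two of those routine checks reduce to one-line calculations. The concentrations below are invented, and the formulas are the conventional ones: percent recovery against the known CCV concentration, and relative percent difference between duplicate results.

```python
# Sketch: CCV recovery and duplicate relative percent difference (RPD) as simple QC checks.
def ccv_recovery(measured: float, known: float) -> float:
    """CCV recovery as a percentage of the known concentration."""
    return 100.0 * measured / known

def relative_percent_difference(a: float, b: float) -> float:
    """RPD between duplicate results, relative to their mean."""
    return 100.0 * abs(a - b) / ((a + b) / 2)

print(f"CCV recovery: {ccv_recovery(48.5, 50.0):.1f}%")              # hypothetical 50 ppb CCV read as 48.5 ppb
print(f"Duplicate RPD: {relative_percent_difference(21.0, 19.5):.1f}%")
```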

Furthermore, the inclusion of controls allows an audience to assess the data in greater detail when interpreting results independently of the experimenter. Controls provide a measure of an experimenter's accuracy, precision, and bias. "Precision provides a measure of the random, or indeterminate, error of analysis. Figures of merit for precision include absolute standard deviation, relative standard deviation, standard error of the mean, coefficient of variation, and variance" (Skoog et al. 2005). A researcher does not necessarily need to achieve low standard errors, low standard deviations, or high correlation coefficients to be a successful scientist. However, a successful scientist must recognize how these figures affect the precision of an analysis and what the limitations of their method of generating data are. While precision addresses variability, accuracy relates to how far a measurement deviates from the true underlying value. For example, a precise game of darts will have all the darts in a tight cluster somewhere on the board, whereas an accurate shot would be a single dart hitting the absolute center of the bullseye. A cluster of darts landing to the left side of the target reveals a leftward bias, which would be important to know for a player trying to improve their game.
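A short sketch of those figures of merit, computed for an invented set of replicate measurements against a hypothetical certified reference value: standard deviation and coefficient of variation describe precision, while bias (mean minus reference) describes accuracy.

```python
# Sketch: precision (SD, CV, SEM) and accuracy (bias) for replicate measurements of a known reference.
import numpy as np

replicates = np.array([9.8, 10.1, 9.9, 10.3, 10.0])   # hypothetical repeated measurements
reference = 10.5                                        # hypothetical certified "true" value

mean = replicates.mean()
sd = replicates.std(ddof=1)
cv_percent = 100.0 * sd / mean          # coefficient of variation (relative standard deviation)
sem = sd / np.sqrt(len(replicates))     # standard error of the mean
bias = mean - reference                 # systematic deviation from the reference value

print(f"mean={mean:.2f}  SD={sd:.2f}  CV={cv_percent:.1f}%  SEM={sem:.2f}  bias={bias:+.2f}")
```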

In addition to calibration curves, CCVs, duplicates, and replicates, the use of spike-in controls can also provide important information regarding sample matrix variability. To make a spike-in control, an experimenter adds a known amount of certified reference material to a sample preparation while blinded to the original concentration of the sample. This is commonly known as a "spike" or "spike-in" and should return the original value of the sample plus the added reference amount. For instance, adding 10 ppb methylmercury to a 40 ppb sample should return 50 ppb in the final analysis. If the analysis returns 48 ppb, then the spike recovery is 80%, because only 8 of the 10 ppb added was recovered upon detection and analysis. Incorporating certified reference materials from a third party lends reliability to the accuracy and precision of measurement, and spikes are an effective way to test detection within the actual sample matrix.
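The recovery arithmetic from the example above can be expressed in a few lines; the function name is illustrative, and the numbers are those used in the text.

```python
def spike_recovery(unspiked, spiked, spike_added):
    """Percent of the added reference material actually recovered in the spiked analysis."""
    return (spiked - unspiked) / spike_added * 100

# The example from the text: 10 ppb methylmercury spiked into a 40 ppb sample,
# with the spiked analysis returning 48 ppb instead of the expected 50 ppb.
print(f"Recovery: {spike_recovery(unspiked=40.0, spiked=48.0, spike_added=10.0):.0f}%")  # 80%
```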

Common inductive logical fallacies are associated with a lack of empirical testing, a lack of deductive reasoning, and a reliance on premature interpretation of limited data. These include hasty generalizations, unrepresentative samples, false analogies, slothful inductions, and fallacies of exclusion, all of which lead to faulty conclusions. Take the example of hasty generalization, which draws conclusions from insufficient observations or small sample sizes. Assuming that one misdiagnosis means all doctors are incompetent is a conclusion drawn from a sample size of N=1 that ignores the cases which were not observed. Similarly, this doctor may have a unique training deficiency compared to fellow doctors, and thus may be an unrepresentative sample. Ignoring the unique training of this doctor could also be considered a fallacy of exclusion. Studies that rely on unrepresentative samples and hastily generalize are often accused of "cherry picking," because only the so-called "ripe" results are selected and the others are cast aside as unwanted fruit.

In other cases, intuitive logical leaps are a product of emotional fervor, which blurs the lines of reason by relying heavily on observer bias to draw faulty conclusions from results. A false analogy occurs when one assumes that because object/event/observation A is similar to B, both A and B must share a property P. For example, uninformed medics in World War II reasoned that because white soldiers (A) and black soldiers (B) differ in skin melanin, they must also have distinct blood types (P). This resulted in the unnecessary segregation of donor blood for injured soldiers and arose from the racist segregationist beliefs running strong within the USA at the time. Dr. Charles Drew, often called the father of the blood bank, remarked, "Whenever, however, one breaks out of this rather high-walled prison of the 'Negro problem' by virtue of some worthwhile contribution, not only is he himself allowed more freedom, but part of the wall crumbles. And so it should be the aim of every student in science to knock down at least one or two bricks of that wall by virtue of his own accomplishment" (Drew 1947). A proper experimental study not only avoids assumptions of shared properties in favor of empirically tested observations as a basis for hypothesis testing, but actively seeks to expose and overturn such assumptions in order to approach truth.

Similarly, slothful induction is a fallacy in which a conclusion is maintained despite data that contradict it. Slothful induction has become increasingly popular among ideologues who choose to believe a particular hypothesis without any empirical testing or data because of strong emotional, cultural, or social support for the proposition. Modern conspiracy theories, including flat-earth beliefs, QAnon, and claims of reptilian humanoids, are examples: in each case the propositions put forward directly contradict abundant evidence to the contrary. By ignoring contradictory data, a scientist ultimately falls victim to slothful induction and fails to use the scientific method effectively.

Optimal experimental methods exhibit simplicity in design. Resolution of competing hypotheses and strong scientific conclusions depend on clarity throughout the experimental design. Reducing the number of variables requiring testing decreases the complexity of the experiment and the time to completion. This is achieved in part by including only the necessary positive and negative controls and excluding excessive numbers of variables that would require additional internal controls. Ultimately, simplicity of design also makes the conclusions regarding independent and dependent variables easier to understand and describe.

An optimal experimental methodology will test the researcher's own presuppositions deductively. Testing one's implicit presuppositions provides confidence in observations by revealing areas of relevant uncertainty. Typically, great progress in science follows from unearthing explanations of previously misunderstood phenomena in a way that changes the approach to a particular problem for an entire field. Classic examples include the geocentric versus heliocentric model of the solar system and the revelation of DNA, rather than protein, as the principal hereditary molecule. Scientific research methodology does well to incorporate the testing of inherent presuppositions as a way of approaching the ever-moving target of certainty.

Ideal methods strive to innovate novel ways of measurement and observation. Innovative methods of observation allow researchers to circumvent the limitations of less accurate means of measuring data. Any method of observation ultimately has limits of detection and quantification, and increasing the precision and accuracy of measurement through innovation increases the depth of observation. Antonie van Leeuwenhoek's microscope is the classic example: it allowed unicellular organisms to be visualized for the first time, when previous methods of observation lacked the resolution to reveal such structures, and it ultimately gave birth to the field of microbiology. The unknown may defy existing means of observation, and innovative measurement techniques can be incorporated into study designs that remain simple and test presuppositions in tandem.

An optimal experimental methodology includes positive and negative controls. Experimental results are trusted most when they come from a source that can demonstrate accuracy and precision in measurement, and this is accomplished through the inclusion of third-party standardized reference materials or, alternatively, a novel but appropriate control that has been independently observed in some way by other researchers. A method that fails to reproduce a positive or negative control value can hardly be trusted to produce accurate values for samples, since it cannot reach consensus with the observations of others in the field. Similarly, the exclusion of positive and negative controls from a methodology indicates a major oversight in the researcher's analytical approach, and its results should be taken with a grain of salt or distrusted entirely.
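One way to operationalize this principle is a simple acceptance check that refuses to report sample values unless both controls behave as expected. The limits, tolerances, and values below are assumptions made for illustration, not standard criteria.

```python
# A minimal (assumed) acceptance check: report sample results only when both the
# positive and negative controls fall within independently established limits.
def run_is_valid(neg_control, pos_control,
                 neg_limit=2.0,          # assumed: negative control must stay near background
                 pos_expected=100.0,     # assumed: certified value of the reference material
                 pos_tolerance=0.15):    # assumed: 15% acceptance window
    neg_ok = neg_control <= neg_limit
    pos_ok = abs(pos_control - pos_expected) / pos_expected <= pos_tolerance
    return neg_ok and pos_ok

samples = [12.4, 87.9, 3.3]   # hypothetical sample readings
if run_is_valid(neg_control=0.8, pos_control=96.5):
    print("Controls in range; report sample values:", samples)
else:
    print("Control failure; results should be distrusted or the run repeated")
```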

An optimal experimental methodology attempts to place variables on an equal playing field through normalization. Every method of analysis comes with some variability between samples, and an optimal experimental design seeks to remove unwanted variables in favor of a more specific focus on the main variable in question. Normalizing sample variability by creating eligibility criteria for which samples are included versus excluded is one way the researcher can separate out confounding variables. Oftentimes standard deviation is used as a criterion for the removal of outliers to ensure a normal distribution; it is also possible to remove unwanted variability by normalizing to an internal control within the sample matrix. This is the equivalent of the statement "all things being equal," in that internal variability has been adjusted as a proportion of a constant internal value so the data can be compared. Optimal methods interrogate the specific variables in question by normalizing the distribution of variability in a way that favors comparison of only the desired variables rather than unwanted ones.
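A minimal sketch of both strategies, normalizing to an internal control and removing outliers beyond two standard deviations, is given below with assumed values; the 2-SD criterion is one common choice rather than a universal rule.

```python
import statistics

# Hypothetical raw analyte readings and matched internal-control readings
# from the same sample matrix (all numbers assumed for illustration).
analyte  = [120.0, 95.0, 102.0, 99.0, 118.0, 96.0, 410.0, 101.0]
internal = [100.0, 80.0, 85.0, 83.0, 98.0, 80.0, 100.0, 85.0]

# Normalize each analyte value to its internal control so samples can be
# compared "all things being equal".
normalized = [a / i for a, i in zip(analyte, internal)]

# Simple outlier rule: drop values more than 2 standard deviations from the mean.
mean = statistics.mean(normalized)
sd = statistics.stdev(normalized)
kept = [x for x in normalized if abs(x - mean) <= 2 * sd]

print("normalized:", [round(x, 2) for x in normalized])
print("kept after 2-SD filter:", [round(x, 2) for x in kept])
```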

Works Cited

Bennett, J. (2017) Discourse on the Method of Rightly Conducting One's Reason and Seeking Truth in the Sciences.

Buckingham, L. (2012) Molecular Diagnostics: Fundamentals, Methods, and Clinical Applications, 2nd ed.

Calaprice, A. (2011) The Ultimate Quotable Einstein. Princeton University Press.

Curtis, B. (2015) Examination of the safety of pediatric vaccine schedules in a non-human primate model: assessments of neurodevelopment, learning, and social behavior. Environ Health Perspect 123(6):579-89.

Descartes, R. (2006) A Discourse on Method. Trans. Ian Maclean. Oxford University Press.

Drew, C. (1947) "Charles R. Drew to Mrs. J.F. Bates, a Fort Worth, Texas schoolteacher, January 27, 1947." US National Library of Medicine, <https://profiles.nlm.nih.gov/spotlight/bg/feature/biographical-overview>, accessed Jan 20, 2021.

Gadad, B.S. et al. (2015) Administration of thimerosal-containing vaccines to infant rhesus macaques does not result in autism-like behavior or neuropathology. PNAS.

Hasegawa, Y. et al. (2018) Microbial structure and function in infant and juvenile rhesus macaques are primarily affected by age, not vaccination status. Sci Rep 8(1):15867.

Loewi, O. (1921) "Über humorale Übertragbarkeit der Herznervenwirkung. I." Pflügers Archiv 189:239–242.

Meillassoux, Q. (2014) Time Without Becoming.

Skoog, D., Holler, F.J., and Crouch, S.R. (2005) Principles of Instrumental Analysis, 6th ed. Thomson Brooks/Cole, p. 14.

Taylor, J.R. (1997) An Introduction to Error Analysis, 2nd ed. University Science Books, pp. 3, 7.