1. Overview
John Wilder Tukey was an American mathematician and statistician whose work profoundly influenced statistical analysis and what later became data science. He is best known for co-developing the Fast Fourier Transform (FFT) algorithm and for pioneering exploratory data analysis (EDA), an approach that shifted the focus from rigid statistical confirmation to flexible exploration of data. Tukey also left a mark on the English lexicon: he coined the term 'bit' and is credited with the first published use of 'software'. His emphasis on EDA, and on combining it with computer science and computer graphics, aimed to make complex data more accessible and understandable, so that people could derive insights directly from data, thereby contributing to the democratization of information.
2. Biography
John Wilder Tukey's life was dedicated to advancing mathematics and statistics, leaving an indelible mark on how data is understood and analyzed.
2.1. Early Life and Education
Tukey was born on June 16, 1915, in New Bedford, Massachusetts. His father was a Latin teacher, and his mother was a private tutor. He received the majority of his early education at home, primarily taught by his mother, only attending regular classes for specific subjects such as French. He began his higher education at Brown University, where he earned a Bachelor of Arts degree in chemistry in 1936, followed by a Master of Science degree in the same field in 1937. Subsequently, he pursued a PhD in mathematics at Princeton University, completing his doctoral dissertation, "On denumerability in topology", in 1939.
2.2. World War II and Early Career
During World War II, Tukey contributed his expertise to the war effort, working at the Fire Control Research Office. In this role, he collaborated with notable statisticians Samuel S. Wilks and William Gemmell Cochran. It is claimed that he played a part in the design of the U-2 spy plane. After the war concluded, Tukey returned to Princeton University, where he balanced his time between his academic duties at the university and his research endeavors at AT&T Bell Laboratories. He was elected to the American Philosophical Society in 1962.
3. Professional Career and Affiliations
Tukey's career was marked by significant contributions across academia, industry, and government, solidifying his reputation as a versatile and influential figure.
3.1. Princeton University
Returning to Princeton University after World War II, Tukey advanced quickly in his academic career. He achieved the rank of full professor at the age of 35. In 1965, he became the founding chairman of Princeton's statistics department, a testament to his leadership and vision in establishing statistics as a distinct and vital academic discipline within the university.
3.2. Bell Laboratories
Simultaneously with his academic work at Princeton, Tukey maintained an extensive research affiliation with AT&T Bell Laboratories. His work there was instrumental in developing statistical methods for computers. It was during his tenure at Bell Labs in 1947 that he coined the term 'bit', a fundamental unit of information theory. He is also credited with the first published use of the word 'software'.
3.3. Consulting and Advisory Roles
Beyond his primary academic and research positions, Tukey served in various influential consulting and advisory capacities. From 1960 to 1980, he played a crucial role in designing the NBC television network polls, which were used for election prediction and analysis. He also provided his expertise as a consultant to several prominent organizations, including the Educational Testing Service, the Xerox Corporation, and Merck & Company. During the 1970s and early 1980s, Tukey significantly contributed to the design and conduct of the National Assessment of Educational Progress, a program that evaluates the academic performance of students in the United States. Additionally, he served on a committee of the American Statistical Association that produced a critical report on the statistical methodology employed in the Kinsey Report, titled Statistical Problems of the Kinsey Report on Sexual Behavior in the Human Male. This report contained a notable critique, summarizing that "A random selection of three people would have been better than a group of 300 chosen by Mr. Kinsey".
3.4. Awards and Honors
Tukey's extensive contributions were recognized with numerous prestigious awards and honors throughout his career. In 1973, he was awarded the National Medal of Science by President Richard Nixon. He received the IEEE Medal of Honor in 1982, specifically "For his contributions to the spectral analysis of random processes and the fast Fourier transform (FFT) algorithm".
4. Major Scientific Contributions
Tukey's scientific work spans statistics, computer science, and data analysis, introducing concepts and tools that became foundational to modern quantitative research.
4.1. Fast Fourier Transform (FFT) Algorithm
One of Tukey's most significant contributions to scientific computing was his co-development, with James Cooley, of the Fast Fourier Transform (FFT) algorithm. This algorithm, published in 1965, provided a highly efficient method for computing the discrete Fourier transform and its inverse. The FFT algorithm revolutionized fields such as signal processing, image processing, and scientific computing by dramatically reducing the computational time required for such operations.
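The divide-and-conquer idea behind the algorithm can be illustrated with a minimal sketch. The code below is a textbook recursive radix-2 variant, not the formulation in the 1965 paper; the function names `fft` and `naive_dft` are illustrative, and the input length is assumed to be a power of two.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT sketch; len(x) must be a power of two."""
    n = len(x)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    # Transform the even-indexed and odd-indexed samples separately.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor that combines the two half-size transforms.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def naive_dft(x):
    """Direct evaluation of the DFT definition, for comparison."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

For an input of length N, the direct sum takes on the order of N^2 operations, while the recursive split takes on the order of N log N, which is the source of the speed-up described above.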
4.2. Exploratory Data Analysis (EDA) and Statistical Techniques
Tukey is widely celebrated for developing and popularizing exploratory data analysis (EDA), a set of approaches and techniques for analyzing data sets to summarize their main characteristics, often with visual methods. In his seminal 1977 book, "Exploratory Data Analysis," he introduced the box plot, a simple yet powerful graphical method for displaying distributions of numerical data.
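As a rough illustration of what a box plot summarizes, the sketch below computes the median, the two hinges, the whisker ends, and any outlying points. It assumes one common convention (hinges taken as medians of the lower and upper halves, and fences placed 1.5 hinge-spreads beyond the hinges); other quartile definitions give slightly different numbers. The function name `box_plot_summary` and the parameter `k` are illustrative.

```python
from statistics import median


def box_plot_summary(values, k=1.5):
    """Return the quantities a box plot displays (one common convention)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    # For odd n the overall median is included in both halves.
    half = (n + 1) // 2
    lower_hinge = median(data[:half])
    upper_hinge = median(data[n - half:])
    spread = upper_hinge - lower_hinge
    low_fence = lower_hinge - k * spread
    high_fence = upper_hinge + k * spread
    # Whiskers extend to the most extreme values still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median(data),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }
```

Called on the values 1 through 9 plus a single 40, this sketch reports hinges of 3 and 8, whiskers at 1 and 9, and flags 40 as an outlier.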
Beyond the box plot, Tukey developed or significantly contributed to numerous other statistical methods and tools:
- The jackknife method, sometimes referred to as the Quenouille-Tukey jackknife, to which he contributed significantly in 1970. It is a resampling technique used for bias and variance estimation.
- Tukey's range test, also known as Tukey's Honest Significant Difference (HSD) test, used in ANOVA to find means that are significantly different from each other.
- The Tukey lambda distribution, a flexible probability distribution useful in modeling and data analysis.
- The Tukey test of additivity, used to assess whether a linear model fits the data adequately or if there is a non-additive relationship.
- The Teichmüller-Tukey lemma in set theory, relating to maximal elements in partially ordered sets.
- Less known but still impactful methods include the trimean, an alternative measure of central tendency, and the median-median line, a simpler alternative to linear regression for fitting a line to data points.
- In 1974, in collaboration with Jerome H. Friedman, Tukey developed the concept of projection pursuit, a statistical technique used to find interesting projections of high-dimensional data.
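Two of the methods listed above are simple enough to sketch in a few lines. The code below shows a leave-one-out jackknife for bias correction and standard error, and the trimean computed from hinges. It is a minimal illustration under stated assumptions: the function names are hypothetical, the hinge convention (medians of the lower and upper halves) is one of several in use, and the formulas are the standard textbook ones rather than anything quoted from Tukey's papers.

```python
from statistics import mean, median


def jackknife(data, estimator):
    """Leave-one-out jackknife: (bias-corrected estimate, standard error)."""
    data = list(data)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    full = estimator(data)
    # Recompute the estimator with each observation left out in turn.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = mean(leave_one_out)
    bias = (n - 1) * (loo_mean - full)
    variance = (n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out)
    return full - bias, variance ** 0.5


def trimean(values):
    """Trimean: (lower hinge + 2 * median + upper hinge) / 4."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    half = (n + 1) // 2  # the median joins both halves when n is odd
    lower_hinge = median(data[:half])
    upper_hinge = median(data[n - half:])
    return (lower_hinge + 2 * median(data) + upper_hinge) / 4
```

For example, `jackknife(sample, mean)` reproduces the usual standard error of the mean, and `trimean` is much less affected by a single extreme value than the arithmetic mean is.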
4.3. Foundations of Data Science
John Tukey is widely regarded as a pioneer and, by some, even the "father" of modern data science. He challenged the prevailing dominance of what he termed "confirmatory data analysis" (CDA) during the 1960s, which relied on rigid mathematical configurations and hypothesis testing. Instead, Tukey advocated for a more flexible and inquisitive approach, which he called "exploratory data analysis" (EDA). EDA emphasized the importance of exploring data carefully to uncover hidden structures and information, serving as a crucial precursor to the field of data science.
Tukey also recognized the integral role that computer science would play in EDA. While much of his initial work focused on static visual displays, such as box plots, that could be drawn by hand, he foresaw that computer graphics would offer far more effective means of studying complex multivariate data. This vision led to PRIM-9, the first program for viewing multivariate data, which he helped design in the early 1970s. This coupling of data analysis with computer science, especially through interactive visualization, is a defining feature of the modern discipline of data science. His philosophy of EDA aimed to make data more transparent and understandable to a broader audience, fostering a greater capacity for informed decision-making and contributing to the democratization of information.
5. Coining Key Terms
John Wilder Tukey significantly contributed to the lexicon of computing and information theory by introducing two fundamental terms.
While collaborating with John von Neumann on early computer designs, Tukey coined the word 'bit' as a portmanteau of "binary digit." The term first appeared in print in Claude Shannon's influential 1948 article, "A Mathematical Theory of Communication", and quickly became the standard unit of information in digital computing.
Additionally, Tukey is credited with the first published use of the term 'software'. Although Paul Niquette claimed to have created the term in 1953, its earliest known appearance in print is in Tukey's 1958 paper in the journal American Mathematical Monthly. The term was subsequently widely adopted and became part of the standard vocabulary of computer science.
6. Philosophy and Approach to Data Analysis
Tukey's methodological views were deeply influential, particularly his advocacy for exploratory data analysis (EDA) as distinct from confirmatory data analysis (CDA). He believed that while both were valuable, too much statistical methodology focused solely on the latter. He emphasized that EDA should be treated as a separate, crucial phase in data analysis, allowing for discovery and pattern recognition without being constrained by predefined hypotheses.
He also introduced the concept of "uncomfortable science" to describe situations, particularly in natural science, where the clear separation between exploratory and confirmatory analysis becomes problematic. These are contexts where the iterative nature of scientific discovery blurs the lines between exploring data and confirming hypotheses.
A. D. Gordon summarized Tukey's key principles for statistical practice:
- The acknowledgement of both the usefulness and limitations of mathematical statistics.
- The importance of developing statistical analysis methods that are robust to violations of underlying assumptions.
- The necessity of accumulating experience regarding the behavior of specific analysis methods to guide their appropriate use.
- The critical principle of allowing the data itself to influence the choice of analytical methods.
- The call for statisticians to reject the role of "guardian of proven truth" and to resist attempts to provide definitive, over-unified solutions to problems.
- The recognition of the iterative nature of data analysis.
- Consideration of the implications arising from the increasing power, availability, and affordability of computing facilities.
- The importance of the proper training of statisticians.
Tukey's lectures were known for their unusual and contemplative style. Peter McCullagh described a lecture given by Tukey in London in 1977, noting that Tukey, "a great bear of a man dressed in baggy pants and a black knitted shirt," slowly and deliberately chalked headings on a blackboard. McCullagh recounted that Tukey's words came "like overweight parcels, delivered at a slow unfaltering pace." After completing his list, Tukey turned to the audience and asked, "Comments, queries, suggestions?", then "clambered onto the podium and manoeuvred until he was sitting cross-legged facing the audience," creating an atmosphere of anticipation.
7. Death
John Wilder Tukey retired in 1985. He died on July 26, 2000, in New Brunswick, New Jersey, at the age of 85.
8. Legacy and Evaluation
John Wilder Tukey's legacy is immense: his work shaped the trajectory of statistics and helped give rise to the field of data science.
8.1. Positive Assessment and Achievements
Tukey's work is celebrated for its profound and lasting impact across various scientific disciplines. His co-development of the Fast Fourier Transform (FFT) revolutionized signal processing and computational methods, becoming one of the most cited algorithms in scientific literature. His pioneering of exploratory data analysis (EDA) fundamentally changed how data is approached, emphasizing visualization, flexibility, and discovery over rigid hypothesis testing. This shift helped researchers uncover insights that might otherwise have remained hidden and made data analysis more intuitive and accessible. Furthermore, his introduction of widely used tools like the box plot and his contributions to methods such as the jackknife and the various tests bearing his name provided practical and robust solutions for data practitioners. The term 'bit', which he coined, and the term 'software', whose first published use is credited to him, are now ubiquitous in the digital world, underscoring his foresight into the future of computing. His advocacy for a more holistic and less constrained approach to data analysis laid the groundwork for modern data science.
8.2. Criticism and Controversy
While largely celebrated, Tukey's work and affiliations occasionally drew scrutiny. A notable instance was his involvement with a committee of the American Statistical Association that critiqued the statistical methodology of the Kinsey Reports. The committee's report, Statistical Problems of the Kinsey Report on Sexual Behavior in the Human Male, contained a pointed criticism of the sampling methods used, asserting that "A random selection of three people would have been better than a group of 300 chosen by Mr. Kinsey". This illustrates his role in critically assessing the statistical rigor of influential studies and advocating for sound methodology.
8.3. Impact on Later Generations
Tukey's methodologies, ideas, and tools have had a profound and enduring influence on subsequent generations of scientists and researchers. His emphasis on EDA as a distinct phase of data analysis encouraged a more hands-on, iterative, and visual approach to data that became a cornerstone of modern data science. By integrating computer science and computer graphics with statistical analysis, he foresaw and facilitated the development of computational statistics and data visualization, shaping how large and complex datasets are now explored and understood. His philosophical distinction between exploratory and confirmatory analysis continues to guide methodological choices in research. Many of the statistical techniques bearing his name remain standard tools in various fields, from engineering to social sciences, demonstrating the practical and theoretical robustness of his contributions. Tukey's legacy is most visible in the current prevalence of data-driven decision-making and the widespread adoption of visual analytics, which directly trace back to his pioneering efforts to make data more interpretable and actionable.
9. Publications
John Wilder Tukey was a prolific author and editor, contributing numerous books and influential papers that shaped the fields of statistics and data analysis. His published works include:
- Andrews, David F.; Bickel, Peter J.; Hampel, Frank R.; Huber, Peter J.; Rogers, W. H.; and Tukey, John Wilder. Robust estimates of location: survey and advances. Princeton University Press, 1972.
- Basford, Kaye E. and Tukey, John Wilder. Graphical Analysis of Multiresponse Data. Chapman & Hall/CRC Press, 1998.
- Blackman, R. B. and Tukey, John Wilder. The measurement of power spectra from the point of view of communications engineering. Dover Publications, 1959.
- Cochran, William Gemmell; Mosteller, Charles Frederick; and Tukey, John Wilder. Statistical problems of the Kinsey report on sexual behavior in the human male. Journal of the American Statistical Association, 1953.
- Cooley, James W. and Tukey, John W. "An algorithm for the machine calculation of complex Fourier series". Math. Comput. 19 (90), 297-301, 1965.
- Hoaglin, David C.; Mosteller, Charles Frederick; and Tukey, John Wilder (editors). Understanding Robust and Exploratory Data Analysis. Wiley, 1983.
- Hoaglin, David C.; Mosteller, Charles Frederick; and Tukey, John Wilder (editors). Exploring Data Tables, Trends and Shapes. Wiley, 1985.
- Hoaglin, David C.; Mosteller, Charles Frederick; and Tukey, John Wilder (editors). Fundamentals of exploratory analysis of variance. Wiley, 1991.
- Morgenthaler, Stephan and Tukey, John Wilder (editors). Configural polysampling: a route to practical robustness. Wiley, 1991.
- Mosteller, Charles Frederick and Tukey, John Wilder. Data analysis and regression: a second course in statistics. Addison-Wesley, 1977.
- Tukey, John Wilder. Convergence and Uniformity in Topology. Princeton University Press, 1940.
- Tukey, John Wilder. Exploratory Data Analysis. Addison-Wesley, 1977.
- Tukey, John Wilder; Ross, Ian C.; and Bertrand, Verna. Index to statistics and probability. R & D Press, 1973.
The Collected Works of John W. Tukey, edited by William S. Cleveland:
- Brillinger, David R. (editor). Volume I: Time series, 1949-1964. Wadsworth, Inc., 1984.
- Brillinger, David R. (editor). Volume II: Time series, 1965-1984. Wadsworth, Inc., 1985.
- Jones, Lyle V. (editor). Volume III: Philosophy and principles of data analysis, 1949-1964. Wadsworth & Brooks/Cole, 1985.
- Jones, Lyle V. (editor). Volume IV: Philosophy and principles of data analysis, 1965-1986. Wadsworth & Brooks/Cole, 1986.
- Cleveland, William S. (editor). Volume V: Graphics, 1965-1985. Wadsworth & Brooks/Cole, 1988.
- Mallows, Colin L. (editor). Volume VI: More mathematical, 1938-1984. Wadsworth & Brooks/Cole, 1990.
- Cox, David R. (editor). Volume VII: Factorial and ANOVA, 1949-1962. Wadsworth & Brooks/Cole, 1992.
- Braun, Henry I. (editor). Volume VIII: Multiple comparisons, 1949-1983. Chapman & Hall/CRC Press, 1994.