Introduction to Statistical Analysis: a regression-from-the-outset approach
2020-07-19
Preface
The order in which we are proceeding with the writing, and what we have completed thus far
We are proceeding somewhat non-linearly in the writing. After completing a first draft of Chapter 2 (an overview of statistical parameters, parameter contrasts and parameter functions) we began on the mother-of-all-regressions chapter for a single mean (ch. 5), and planned to move on to the corresponding ones for a single proportion (ch. 6) and event-rate (ch. 7). However, we realized that to get through these, students would need the concept of interval estimates of parameters, and these in turn need other preliminaries. So we have completed first drafts of chapters on probability and random variables, but we are hoping to motivate the various statistical concepts and laws by simulation, and to use the parallel sessions on computing in R to make these laws more concrete, and less like products of calculus and mathematical statistics. This way we hope to get to serious regression applications sooner than most introductory courses do. And we are also hoping to have the parsimony of a regression approach become evident much sooner than in the traditional 't-test, chi-square, regression' sequence that only reaches regression 3 weeks before the end of the course.
We have completed first drafts of computing chapters 1, 2 and 3, each with dual learning objectives (computing, and statistical concepts). Having completed the theoretical chapter on random variables, we are now preparing a 4th computing chapter that emphasizes simulations.
Once these are done, we will move back to interval estimation, and then back to regression models.
In practice, we plan to introduce regression early, and use the computing to reinforce/(?replace) the theoretical pre-requisite chapters that (traditionally) seem to hold up so many introductory statistics courses.
So we see part II as a sidebar^ for part I -- i.e., to be consulted as we go along.
^ "secondary article accompanying a larger one in a newspaper," 1948, from side (adj.) + bar (n.1).
^ a typographically distinct section of a page, as in a book or magazine, that amplifies or highlights the main text.
^ a conference between the judge and lawyers out of the presence of the jury.
^ a subordinate or incidental issue, remark, activity, etc.
Comments welcomed at any stage!
0.1 Target
The target is graduate students in population health sciences in their first year. Concurrently, they take their first courses on epidemiologic methods. The department is known for its emphasis on quantitative methods, and students' ability to carry out their own quantitative work. Since most of the data they will deal with are non-experimental, there is a strong emphasis on multivariable regression. While some students will have had some statistical courses as undergraduates, the courses start at the beginning, and are pitched at the Master's level.
In the last decade, the incoming classes have become more diverse, both in their backgrounds and in their career plans. Some of those in the recently begun MScPH program plan to be consumers rather than producers of research; previously, the majority of students pursued a thesis-based Master's that involved considerable statistical analyses to produce new statistical evidence.
0.2 Topics/textbooks
For the first term course 607, recent choices have been The Practice of Statistics in the Life Sciences by Baldi and Moore, and Stats by de Veaux, Velleman and Bock. Others that have been recommended are the older texts by Pagano and Gauvreau, and by Rosner. Some of us have also drawn on material in Statistics by Freedman, Pisani, Purves and Adhikari, and Statistical Methods in Medical Research, 4th Edition, by Armitage, Berry, and Matthews.
The newer books have tried to teach the topic more engagingly, by starting with where data come from, and (descriptively) displaying single distributions, or relationships between variables. They and the many others then typically move on to Probability; Random Variables; Sampling Distributions; Confidence intervals and Tests of Hypotheses; Inference about/for a single Mean/Proportion/Rate and a difference of two Means/Proportions/Rates; Chi-square Tests for 2 way frequency tables; Simple Correlation and Regression. Most include a (or point to an online) chapter on Non-Parametric Tests. They typically end with tables of probability tail areas, and critical values.
Bradford Hill's Principles of Medical Statistics followed the same sequence 80 years ago, but in black type in a book that measured 6 inches by 9 inches by 1 inch, and weighed less than a pound. Today's multi-colour texts are 50% longer, 50% wider, and twice as thick, and weigh 5 pounds or more.
The topics to be covered in the second term course include multiple regression involving Gaussian, Binomial, and Poisson variation, as well as (possibly censored) time-durations -- or their reciprocals, event rates. Here it is more difficult to point to one modern comprehensive textbook. There is pressure to add even more topics, such as correlated data, missing data, measurement error, etc., to the second statistics course.
0.3 Regression from the outset
It is important to balance the desire to cover more of these regression-based topics with having a good grounding, from the first term, in the basic concepts that underlie all statistical analyses.
The first term epidemiology course deals with proportions and rates (risks and hazards) and -- at the core of epidemiology -- comparisons involving these. Control for confounding is typically via odds/risk/rate differences/ratios obtained by standardization or Mantel-Haenszel-type summary measures. Teachers are reluctant to spend the time to teach the classical confidence intervals for these, as they are not that intuitive and -- once students have covered multiple regression -- are superseded by model-based intervals.
One way to synchronize with epidemiology is to teach the six separate topics (Mean/Proportion/Rate and differences of two Means/Proportions/Rates) in a more unified way: by embedding all 6 in a regression format right from the outset, using generalized linear models, and focusing on all-or-none contrasts, represented by binary 'X' values.
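As a minimal sketch of this embedding for the simplest case (a difference of two means), the Python snippet below fits E[Y|X] = b0 + b1 X by least squares to made-up data with a binary X. The function name `fit_binary_x` and the data are hypothetical, chosen only for illustration; the point is that the fitted intercept is the mean of the X = 0 group, and the fitted slope is the difference of the two group means.

```python
from statistics import mean

def fit_binary_x(y, x):
    """Least-squares fit of E[Y|X] = b0 + b1*X for a single regressor X."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Made-up responses: x = 0 marks the reference group, x = 1 the index group.
y = [4.0, 5.0, 6.0, 7.0, 9.0, 11.0]
x = [0, 0, 0, 1, 1, 1]

b0, b1 = fit_binary_x(y, x)
mean0 = mean(yi for yi, xi in zip(y, x) if xi == 0)
mean1 = mean(yi for yi, xi in zip(y, x) if xi == 1)

print("intercept:", b0, " mean of x=0 group:", mean0)
print("slope:", b1, " difference of means:", mean1 - mean0)
```

The same layout carries over to proportions and rates once the Gaussian least-squares fit is replaced by the corresponding generalized linear model, which this sketch does not attempt.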
This would have other benefits. As of now, a lot of time in 607 is spent on 1-sample and 2-sample methods (and chi-square tests) that don't lead anywhere (i.e., don't generalize). Ironically, the first-term concerns with equal and unequal variance tests are no longer raised, or obsessed about, in the multiple regression framework in second term.
The teaching/learning of statistical concepts/techniques is greatly enriched by real-world applications from published reports of public health and epidemiology research. In 1980, a first course in statistics provided access to 80% of the articles in NEJM. This large dividend is no longer available -- and even less so for journals that report on non-experimental research. The 1-sample and 2-sample methods, and chi-square tests that have been the core of first statistics courses are no longer the techniques that underlie the reported summaries in the abstracts and in the full text. The statistical analysis sections of many of these articles do still start off with descriptive statistics and a perfunctory list of parametric and non-parametric 1- and 2-sample tests, but most then describe the multivariable techniques used to produce the reported summaries. [Laboratory sciences can still get by with t-tests and 'anova's -- and the occasional 'ancova'; studies involving intact human beings in free-living populations cannot.] Thus, if the first statistical course is to get the same 'understanding' dividend from research articles as the introductory epidemiology course does, that first statistical course needs to teach the techniques that produce the results in the abstracts. Even if it can only go so far, such an approach can promote a regression approach right from week one, and build on it each week, rather than introduce it for the first time at week 9 or 10, when the course is already beginning to wind down, and assignments from other courses are piling up.
0.4 Parameters first, data later
When many teachers and students think of regression, they imagine a cloud of points in x-y space, and the least squares fitting of a regression line. They start with thinking about the data.
A few teachers, when they introduce regression, do so by describing/teaching it as an equation that connects parameters, constructed in such a way that the parameter-contrast of interest is easily and directly visible. Three such teachers are Clayton and Hills 1995, Miettinen 1985, and Rothman 2012. In each case, their first chapter on regression is limited to the parameters and to understanding what they mean; data only appear in the next chapter.
There is a lot to commend this approach. It reminds epidemiologists -- and even statisticians -- that statistical inference is about parameters. Before addressing data and data-summaries, we need to specify what the estimands are -- i.e., what parameter(s) we are pursuing.
Fisher, when introducing Likelihood in 1922, was one of the earliest statisticians to distinguish parameters from statistics. He decried 'the obscurity which envelops the theoretical bases of statistical methods' and ascribed it to two considerations. (emphasis ours)
In the first place, it appears to be widely thought, or rather felt, that in a subject in which all results are liable to greater or smaller errors, precise definition of ideas or concepts is, if not impossible, at least not a practical necessity.
In the second place, it has happened that in statistics a purely verbal confusion has hindered the distinct formulation of statistical problems; for it is customary to apply the same name, mean, standard deviation, correlation coefficient, etc., both to the true value which we should like to know, but can only estimate, and to the particular value at which we happen to arrive by our methods of estimation. [R. A. Fisher. On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, Vol. 222 (1922), pp. 309-368]
It is easy and tempting to start with data, since the form of the summary statistic is usually easy to write down directly. It can also be used to motivate a definition: for example, we could define an odds ratio by its empirical computational form ad/bc. However, this 'give me the answer first, and the question later' approach comes up short as soon as one asks how statistically stable this estimate is. To derive a standard error or confidence interval, one has to appeal to a sampling distribution. To do this, one needs to identify the random variables involved, and the parameters that determine/modulate their statistical distributions.
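To make the distinction concrete, here is a small simulation sketch in Python. It assumes, purely for illustration, two independent binomial counts governed by proportion parameters `p1` and `p0`; these names, the sample sizes, and the helper functions are all hypothetical. The odds ratio parameter is fixed by `p1` and `p0`, whereas the empirical ad/bc varies from one simulated 2x2 table to the next, and it is that variation that a standard error tries to describe.

```python
import random
from statistics import median

def odds_ratio(a, b, c, d):
    """Empirical odds ratio ad/bc: a, b = events, non-events in the index
    group; c, d = events, non-events in the reference group."""
    return (a * d) / (b * c)

def simulate_odds_ratios(p1, p0, n1, n0, n_sims, seed=1):
    """Draw n_sims pairs of binomial counts and return the empirical odds ratios."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        a = sum(rng.random() < p1 for _ in range(n1))  # events, index group
        c = sum(rng.random() < p0 for _ in range(n0))  # events, reference group
        b, d = n1 - a, n0 - c
        # Simplification for this sketch: skip tables with an empty cell,
        # where ad/bc is zero or undefined.
        if min(a, b, c, d) > 0:
            estimates.append(odds_ratio(a, b, c, d))
    return estimates

# Assumed parameter values (not from any real study).
p1, p0 = 0.50, 0.25
true_or = (p1 / (1 - p1)) / (p0 / (1 - p0))   # the parameter being pursued

sims = simulate_odds_ratios(p1, p0, n1=100, n0=100, n_sims=1000)
print("odds ratio parameter:", true_or)
print("median of the estimates:", median(sims))
print("smallest and largest estimate:", min(sims), max(sims))
```

The spread of the simulated estimates around the parameter value is what the sampling distribution describes; the formula ad/bc on its own says nothing about it.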
Once students master the big picture (the parameter(s) being pursued), the task of estimating them by fitting these equations to data is considerably simplified, and becomes more generic. In this approach more upfront thinking is devoted to the parameters -- to what Miettinen calls the design of the study object -- with the focus on a pre-specified 'deliverable.'
0.5 Let's switch to "y-bar", and drop "x-bar".
The prevailing practice, when introducing descriptive statistics, and even 1- and 2-sample procedures, is to use the term x-bar (\(\bar{x}\)) for an arithmetic mean (one notable exception is de Veaux et al.). This misses the chance to prepare students for regression, where E[Y|X] is the object of interest, and the X-conditional Y's are the random variables. Technically speaking, the X's are not even considered random variables: they are merely the X locations/profiles at which Y was measured/recorded. Elevating the status of the Y's, and explaining the role of the X's and the impact of the X distribution on precision, might also cut down on the practice of checking the normality of the X's. When possible, the X distribution should be determined by the investigators, so as to give more precise and less correlated estimates of the parameters being pursued. Switching from \(\bar{x}\) to \(\bar{y}\) is a simple yet meaningful step in this direction. JH made this switch about 10 years ago.
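One standard result shows how the X distribution affects precision. Assume, for illustration only, the simple linear model \(E[Y|X] = \beta_0 + \beta_1 X\) with independent Y's and a constant variance \(\sigma^2\) (the symbols \(\beta_0\), \(\beta_1\) and \(\sigma^2\) are introduced here and are not used elsewhere in this preface). Then the least-squares estimator of the slope has variance
\[\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum (x - \bar{x})^2},\]
which involves the X's only through their spread, and not through the shape of their distribution. The more widely the investigators space the X values, the more precise the estimated slope.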
0.6 Computing from the outset
In 1980, most calculations in the first part of 607 were by hand calculator. Computing summary statistics by hand was seen as a way to help students understand the concepts involved, and the absence of automated rapid computation was not considered a drawback. However, doing so did not always help students understand the concept of a standard deviation or a regression slope, since these formulae were designed to minimize the number of keystrokes, rather than to illuminate the construct involved. For example, it was common to rescale and relocate data to cut down on the numbers of digits entered, to group values into bins, and use midpoints and frequencies. It was also common to use the computationally-economical 1-pass-through-the-data formula for the sample variance \[s^2 = \frac{ \sum y^2 - \frac{(\sum y)^2}{n}}{n-1},\]
even though the definitional formula is
\[s^2 = \frac{\sum(y - \bar{y})^2}{n-1}.\]
The latter (definitional) one was considered too long, even though having to first compute \(\bar{y}\) and then go back and compute (and square) each \(y - \bar{y}\) would have helped students to internalize what a sample variance is.
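The two formulae give the same answer. Expanding the square and using \(\bar{y} = \sum y / n\), the numerator of the definitional formula becomes
\[\sum (y - \bar{y})^2 = \sum y^2 - 2\bar{y}\sum y + n\bar{y}^2 = \sum y^2 - \frac{(\sum y)^2}{n},\]
which is the numerator of the 1-pass formula. The algebra saves keystrokes, but it hides the deviations from \(\bar{y}\) that give the variance its meaning.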
When spreadsheets arrived in the early 1980s, students could use the built-in mean formula to compute and display \(\bar{y}\), another formula to compute and display a new column of the deviations from \(\bar{y}\), another to compute and display a new column of the squares of these deviations, another to count the number of deviations, and a final formula to arrive at \(s^2.\) The understanding comes from coding the definitional formula, and from noticing whether each component looks reasonable; the spreadsheet simply and speedily carries out the calculations, allowing the user to see all of the relevant components. Ultimately, once students master the concept, they could move on to built-in formulae that hide (or themselves avoid) the intermediate quantities.
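The same 'column by column' steps can be sketched in a few lines of code. The Python snippet below uses made-up values and is meant only as an illustration of the idea, not as course material: it builds each intermediate quantity of the definitional formula, then compares the result with the 1-pass formula and with a built-in function (`statistics.variance` from the Python standard library).

```python
from statistics import variance

y = [2.0, 4.0, 4.0, 5.0, 10.0]   # made-up values

n = len(y)                                  # number of observations
y_bar = sum(y) / n                          # the mean
deviations = [yi - y_bar for yi in y]       # "column" of deviations from the mean
squared = [dev ** 2 for dev in deviations]  # "column" of squared deviations
s2_definitional = sum(squared) / (n - 1)    # definitional formula

# The keystroke-saving 1-pass formula, for comparison.
s2_one_pass = (sum(yi ** 2 for yi in y) - sum(y) ** 2 / n) / (n - 1)

print("mean:", y_bar)
print("deviations:", deviations)
print("squared deviations:", squared)
print("s^2 (definitional, 1-pass, built-in):",
      s2_definitional, s2_one_pass, variance(y))
```

Displaying the deviations and their squares plays the role of the intermediate spreadsheet columns: a reader can check that the deviations sum to zero and see which observation contributes most to \(s^2\).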
Few teachers actually encouraged the use of spreadsheets, and instead promoted commercial statistical packages such as SAS, SPSS and Stata. Thus, the opportunity to learn to 'walk first, run later' afforded by spreadsheets was not fully exploited.
RStudio is an integrated environment for R, a free software environment for statistical computing and graphics that runs on a wide variety of platforms. Just as with spreadsheet software, one can use R not just as a calculator, but as a programmable calculator, and by programming the calculations, learn the concepts before moving on to the built-in functions. There is a large user-community and a tradition of sharing information and ways of doing things. The graphics language contains primitive functions that allow customization, as well as higher-level functions, and is tightly integrated with the statistical routines and data frame functions. R Markdown helps to foster reproducible research. Shiny apps allow interactivity and visualization, a bit like 'what-ifs' with a spreadsheet.
It takes practice to become comfortable with R. For those less mathematical, it is somewhat more cryptic than, and not quite as intuitive as, other packages. For the last several years, the department has offered a 13-hour introductory course on R in first term. Initially the aim was to prepare students for using it in course 621 in second term, but in the Fall 2018 and 2019 offerings of course 607, computing with R and use of RStudio became mandatory. Just as the epidemiology material in the Fall is shared between 2 courses (601 and 602), the aim will be to also spread the statistics material over 607 and 613, and to integrate the two more tightly. As an example, the material on 'descriptive' (i.e., not model-based) statistics and graphical displays will be covered in 613, while 607 will begin with parameters and models. Rather than treat computing as a separate activity, exercises based on 607 material will be carried out as part of 613 classes/tutorials. The statistical material will be used to motivate the computer tasks.
0.7 Appendix:
[Still rough] History of current introductory biostatistics courses
The senior author first taught a 2-course sequence for first year graduate students in epidemiology in 1980, using Colton's Statistics in Medicine as the text for the introductory course (607). He developed his own notes for the second course, which covered multiple regression for quantitative responses. Over the next 10 years, he continued to teach the first course -- first from Colton, but latterly from Moore and McCabe (an undergraduate text), with epi statistics from Armitage and Berry and some other fundamentals from Freedman (Statistics). Stan S taught the second (621 Data Analysis in the Health Sciences), mostly from Kleinbaum's Applied Regression Analysis and Other Multivariable Methods.
In the 1990s, Lawrence J taught 607, and Michal A 621. Neither used a required textbook. LJ developed an extensive set of written notes (still available on his website), and contributed a chapter, Introduction to Biostatistics: Describing and Drawing Inferences from Data, to the book Surgical Arithmetic, while MA used transparencies that were widely photocopied. 2000s Robert P 621? LJ 621
Meanwhile, JH taught summer students (mostly medical residents and fellows) two courses: 607 and a second course (678, Analysis of Multivariable Data). Both sets of content are available on JH's website. He last taught the Fall version of 607 in 2001, when LJ was on sabbatical.
607: Tina 2006 - 201x Erica M; 20xx - Paramita SC. 2018, 2019 Sahir B 621: Aurelie Alexandra 2020 Shirin