# College Math Teaching

## April 5, 2019

Let’s start with an example from sports: basketball free throws. At certain times in a game, a player is awarded a free throw, where the player stands 15 feet away from the basket and is allowed to shoot to make a basket, which is worth 1 point. In the NBA, a player will take 2 or 3 shots; the rules are slightly different for college basketball.

Each player will have a “free throw percentage” which is the number of made shots divided by the number of attempts. For NBA players, the league average is .672 with a variance of .0074.

Now suppose you want to determine how well a player will do, given, say, a sample of the player’s data. Under classical (aka “frequentist”) statistics, one looks at how well the player has done, calculates the percentage ($\hat{p}$) and then determines a confidence interval for the true $p$: using the normal approximation to the binomial distribution, this works out to $\hat{p} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$


Yes, I know..for someone who has played a long time, one has career statistics ..so imagine one is trying to extrapolate for a new player with limited data.

That seems straightforward enough. But what if one samples the player’s shooting during an unusually good or unusually bad streak? Example: former NBA star Larry Bird once made 71 straight free throws…if that were the sample, $\hat{p} = 1$ with variance zero! Needless to say that trend is highly unlikely to continue.
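To see the degenerate interval concretely, here is a minimal Python sketch of the normal-approximation interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$. The function name `wald_interval` and the choice of a 90 percent level ($z \approx 1.645$) are mine, just for illustration.

```python
import math

def wald_interval(makes, attempts, z=1.645):
    """Normal-approximation confidence interval for a binomial proportion.

    z = 1.645 corresponds to a two-sided 90 percent interval (an assumed level).
    """
    p_hat = makes / attempts
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / attempts)
    return p_hat - half_width, p_hat + half_width

# Larry Bird's streak taken as the whole sample: 71 makes in 71 attempts
print(wald_interval(71, 71))
```

With 71 makes in 71 attempts both endpoints come out as 1, which is exactly the zero-width interval complained about above.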

Classical frequentist statistics doesn’t offer a way out but Bayesian Statistics does.

This is a good introduction:

But here is a simple, “rough and ready” introduction. Bayesian statistics uses not only the observed sample, but a proposed distribution for the parameter of interest (in this case, p, the probability of making a free throw). The proposed distribution is called a prior distribution or just prior. That is often labeled $g(p)$

We are dealing with what amounts to 71 Bernoulli trials, each with the same unknown $p$ (the league average is .672), so the random variable describing the outcome of each individual shot has probability mass function $p^{y_i}(1-p)^{1-y_i}$ where $y_i = 1$ for a make and $y_i = 0$ for a miss.

Our goal is to calculate what is known as a posterior distribution (or just posterior) which describes $g$ after updating with the data; we’ll call that $g^*(p)$.

How we go about it: use the principles of joint distributions, likelihood functions and marginal distributions to calculate $g^*(p|y_1, y_2...,y_n) = \frac{L(y_1, y_2, ..y_n|p)g(p)}{\int^{\infty}_{-\infty}L(y_1, y_2, ..y_n|p)g(p)dp}$

The denominator “integrates out” p to turn that into a marginal; remember that the $y_i$ are set to the observed values. In our case, all are 1 with $n = 71$.

What works well is to use the beta distribution for the prior. Note: the pdf is $\frac{\Gamma (a+b)}{\Gamma(a) \Gamma(b)} x^{a-1}(1-x)^{b-1}$ and if one uses $p = x$, this works very well. Now because the mean will be $\mu = \frac{a}{a+b}$ and $\sigma^2 = \frac{ab}{(a+b)^2(a+b+1)}$ given the required mean and variance, one can work out $a, b$ algebraically.

Now look at the numerator which consists of the product of a likelihood function and a density function: up to constant $k$, if we set $\sum^n_{i=1} y_i = y$ we get $k p^{y+a-1}(1-p)^{n-y+b-1}$
The denominator: same thing, but $p$ gets integrated out and the constant $k$ cancels; basically the denominator is what makes the fraction into a density function.

So, in effect, we have $kp^{y+a-1}(1-p)^{n-y+b-1}$ which is just a beta distribution with new $a^* =y+a, b^* =n-y + b$.

So, I will spare you the calculation except to say that the NBA prior with $\mu = .672, \sigma^2 =.0074$ leads to $a = 19.355, b= 9.447$

Now the update: $a^* = 71+19.355 = 90.355, b^* = 9.447$.
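For anyone who wants to redo the algebra, here is a rough Python sketch of both steps: solving the mean and variance formulas for $a, b$ and then applying the update $a^* = y + a, b^* = n - y + b$. The function names are mine, and because of rounding the prior parameters it prints agree with the figures quoted above only to about two decimal places.

```python
def beta_from_moments(mean, var):
    """Solve mean = a/(a+b), var = ab/((a+b)^2 (a+b+1)) for (a, b)."""
    total = mean * (1.0 - mean) / var - 1.0   # this quantity is a + b
    return mean * total, (1.0 - mean) * total

def update(a, b, makes, attempts):
    """Beta(a, b) prior plus Bernoulli data gives a Beta(a*, b*) posterior."""
    return a + makes, b + (attempts - makes)

a, b = beta_from_moments(0.672, 0.0074)            # league-wide prior
a_star, b_star = update(a, b, makes=71, attempts=71)
print(round(a, 3), round(b, 3))
print(round(a_star, 3), round(b_star, 3))
```

Since every shot in the sample was a make, only $a$ moves; $b$ is left alone.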

What does this look like? (I used this calculator)

That is the prior. Now for the posterior:

Yes, shifted to the right..very narrow as well. The information has changed..but we avoid the absurd contention that $p = 1$ with a confidence interval of zero width.

We can now calculate a “credible interval” of, say, 90 percent, to see where $p$ most likely lies: use the cumulative distribution function to find this out:

And note that $P(p < .85) = .042, P(p < .95) = .958 \rightarrow P(.85 < p < .95) = .916$. In fact, Bird’s lifetime free throw shooting percentage is .882, which is well within this 91.6 percent credible interval, based on sampling from this one freakish streak.
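If you don’t have a beta calculator handy, the two probabilities can be approximated with nothing but the Python standard library. This is a sketch under my own assumptions: I integrate the posterior density numerically with Simpson’s rule (a dedicated statistics package would do this better), and the helper names `beta_pdf` and `beta_cdf` are made up for this post.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density, computed in logs to avoid overflow in the gamma terms."""
    if x <= 0.0 or x >= 1.0:
        return 0.0  # correct at the endpoints here because a > 1 and b > 1
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_cdf(x, a, b, steps=2000):
    """P(p < x) by composite Simpson's rule on [0, x]; steps must be even."""
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

a_star, b_star = 90.355, 9.447            # posterior parameters from the post
low = beta_cdf(0.85, a_star, b_star)
high = beta_cdf(0.95, a_star, b_star)
print(round(low, 3), round(high, 3), round(high - low, 3))
```

The difference `high - low` is the posterior probability that $p$ lies between .85 and .95.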

## March 16, 2019

### The beta function integral: how to evaluate it

My interest in “beta” functions comes from their utility in Bayesian statistics. A nice 78 minute introduction to Bayesian statistics and how the beta distribution is used can be found here; you need to understand basic mathematical statistics concepts such as “joint density”, “marginal density”, “Bayes’ Rule” and “likelihood function” to follow the youtube lecture. To follow this post, one should know the standard “3 semesters” of calculus and know what the gamma function is (the extension of the factorial function to the real numbers); previous exposure to the standard “polar coordinates” proof that $\int^{\infty}_{-\infty} e^{-x^2} dx = \sqrt{\pi}$ would be very helpful.

So, what is the beta function? It is $\beta(a,b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)}$ where $\Gamma(x) = \int_0^{\infty} t^{x-1}e^{-t} dt$. Note that $\Gamma(n+1) = n!$ for integers $n \geq 0$. The gamma function is the unique “logarithmically convex” extension of the factorial function to the positive real line, where “logarithmically convex” means that the logarithm of the function is convex; that is, the second derivative of the log of the function is positive. Roughly speaking, this means that the function behaves like $e^{g(x)}$ for some convex $g$; $e^{x^2}$ is a typical example.

Now it turns out that the beta density function is defined as follows: $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1}(1-x)^{b-1}$ for $0 < x < 1$. For this to be a density we need $\int_0^1 x^{a-1}(1-x)^{b-1} dx = \beta(a,b)$, and that is what I’ll show; note that the integral is proper when $a \geq 1, b \geq 1$ and is a convergent improper integral when $0 < a < 1$ or $0 < b < 1$.

I'll do this in two steps. Step one will convert the beta integral into an integral involving powers of sine and cosine. Step two will be to write $\Gamma(a) \Gamma(b)$ as a product of two integrals, do a change of variables and convert to an improper integral on the first quadrant. Then I'll convert to polar coordinates to show that this integral is equal to $\Gamma(a+b) \beta(a,b)$

Step one: converting the beta integral to a sine/cosine integral. Limit $t \in [0, \frac{\pi}{2}]$ and then do the substitution $x = sin^2(t), dx = 2 sin(t)cos(t) dt$. Then the beta integral becomes: $\int_0^1 x^{a-1}(1-x)^{b-1} dx = 2\int_0^{\frac{\pi}{2}} (sin^2(t))^{a-1}(1-sin^2(t))^{b-1} sin(t)cos(t)dt = 2\int_0^{\frac{\pi}{2}} (sin(t))^{2a-1}(cos(t))^{2b-1} dt$

Step two: transforming the product of two gamma functions into a double integral and evaluating using polar coordinates.

Write $\Gamma(a) \Gamma(b) = \int_0^{\infty} x^{a-1} e^{-x} dx \int_0^{\infty} y^{b-1} e^{-y} dy$

Now do the conversion $x = u^2, dx = 2udu, y = v^2, dy = 2vdv$ to obtain:

$\int_0^{\infty} 2u^{2a-1} e^{-u^2} du \int_0^{\infty} 2v^{2b-1} e^{-v^2} dv$ (there is a tiny amount of algebra involved)

From which we now obtain

$4\int^{\infty}_0 \int^{\infty}_0 u^{2a-1}v^{2b-1} e^{-(u^2+v^2)} dudv$

Now we switch to polar coordinates, remembering the $rdrd\theta$ that comes from evaluating the Jacobian of $u = rcos(\theta), v = rsin(\theta)$

$4 \int^{\frac{\pi}{2}}_0 \int^{\infty}_0 r^{2a +2b -1} (cos(\theta))^{2a-1}(sin(\theta))^{2b-1} e^{-r^2} dr d\theta$

This splits into two integrals:

$2 \int^{\frac{\pi}{2}}_0 (cos(\theta))^{2a-1}(sin(\theta))^{2b-1} d \theta \cdot 2\int^{\infty}_0 r^{2a +2b -1}e^{-r^2} dr$

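The first of these integrals is just the beta integral from step one, $\int_0^1 x^{a-1}(1-x)^{b-1} dx$ (the roles of sine and cosine are swapped, but the substitution $\theta \rightarrow \frac{\pi}{2} - \theta$ takes care of that), which I’ll write as $\beta(a,b)$, so now we have: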

$\Gamma(a) \Gamma(b) = \beta(a,b) 2\int^{\infty}_0 r^{2a +2b -1}e^{-r^2} dr$

The second integral: we just use $r^2 = x \rightarrow 2rdr = dx \rightarrow \frac{1}{2}\frac{1}{\sqrt{x}}dx = dr$ to obtain:

$2\int^{\infty}_0 r^{2a +2b -1}e^{-r^2} dr = \int^{\infty}_0 x^{a+b-\frac{1}{2}} e^{-x} \frac{1}{\sqrt{x}}dx = \int^{\infty}_0 x^{a+b-1} e^{-x} dx =\Gamma(a+b)$ (yes, I cancelled the 2 with the 1/2)

And so the result follows.
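A numerical sanity check never hurts. The sketch below compares a midpoint-rule estimate of $\int_0^1 x^{a-1}(1-x)^{b-1} dx$ with $\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ for a few values with $a, b \geq 1$ (so the integral is proper). The function names and the test values are my own choices.

```python
import math

def beta_integral(a, b, steps=20000):
    """Midpoint-rule estimate of the integral of x^(a-1) (1-x)^(b-1) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                      # midpoint of the i-th subinterval
        total += x ** (a - 1) * (1 - x) ** (b - 1)
    return h * total

def beta_gamma(a, b):
    """The same quantity written as a ratio of gamma functions."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

for a, b in [(2.0, 3.0), (2.5, 3.5), (1.0, 4.0)]:
    print(a, b, beta_integral(a, b), beta_gamma(a, b))
```

The two columns should agree to many decimal places; for instance $a = 2, b = 3$ gives $\int_0^1 x(1-x)^2 dx = \frac{1}{12}$ either way.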

That seems complicated for a simple little integral, doesn’t it?

## March 14, 2019

### Sign test for matched pairs, Wilcoxon Signed Rank test and Mann-Whitney using a spreadsheet

Filed under: statistics, Uncategorized — collegemathteaching @ 10:33 pm

Our goal: perform non-parametric statistical tests for two samples, both paired and independent. We only assume that both samples come from similar distributions, possibly shifted.

I’ll show the steps with just a bit of discussion of what the tests are doing; the texts I am using are Mathematical Statistics (with Applications) by Wackerly, Mendenhall and Scheaffer (7th ed.) and Mathematical Statistics and Data Analysis by John Rice (3rd ed.).

First the data: 56 students took a final exam. The professor gave some questions and a committee gave some questions. Student performance was graded as a “percent out of 100” on each set of questions (the committee graded their own questions, the professor graded his questions).

The null hypothesis: student performance was the same on both sets of questions. Yes, this data was close enough to being normal that a paired t-test would have been appropriate and one was done for the committee. But because I am teaching a section on non-parametric statistics, I decided to run a paired sign test and a Wilcoxon signed rank test (and then, for the heck of it, a Mann-Whitney test which assumes independent samples..which these were NOT (of course)). The latter was to demonstrate the technique for the students.

There were 56 exams and “pi” was the score on my questions, “pii” the score on committee questions. The screen shot shows a truncated view.

The sign test for matched pairs.
The idea behind this test: take each pair and score it +1 if sample 1 is larger and score it -1 if the second sample is larger. Throw out ties (use your head here; too many ties means we can’t reject the null hypothesis ..the idea is that ties should be rare).

Now set up a binomial experiment where $n$ is the number of pairs. We’d expect that if the null hypothesis is true, $p = .5$ where $p$ is the probability that the pair gets a score of +1. So the expectation would be $np = \frac{n}{2}$ and the standard deviation would be $\frac{1}{2} \sqrt{n}$, that is, $\sqrt{npq}$

This is easy to do in a spreadsheet. Just use the difference in rows:

Now use the “sign” function to return a +1 if the entry from sample 1 is larger, -1 if the entry from sample 2 is larger, or 0 if they are the same.

I use “copy, paste, delete” to store the data from ties, which show up very easily.

Now we need to count the number of “+1”. That can be a tedious, error prone process. But the “countif” command in Excel handles this easily.

Now it is just a matter of either using a binomial calculator or just using the normal approximation (I don’t bother with the continuity correction)

Here we reject the null hypothesis that the scores are statistically the same.
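For those who would rather script it than use a spreadsheet, here is a rough Python version of the same steps: take signs, throw out ties, count the +1 entries, then compute both an exact binomial p-value and the normal approximation without continuity correction. The scores below are made up purely for illustration (they are NOT the exam data), and the function name is mine.

```python
import math

def sign_test(sample1, sample2):
    """Matched-pairs sign test; returns (n, plus, exact two-sided p-value, z)."""
    signs = [(x > y) - (x < y) for x, y in zip(sample1, sample2)]
    signs = [s for s in signs if s != 0]          # throw out ties
    n, plus = len(signs), signs.count(1)          # like COUNTIF on the +1 entries
    # Exact two-sided p-value under Binomial(n, 1/2)
    k = min(plus, n - plus)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    # Normal approximation: mean n/2, standard deviation sqrt(n)/2
    z = (plus - n / 2) / (0.5 * math.sqrt(n))
    return n, plus, p_exact, z

# Hypothetical scores, for illustration only
pi_scores = [72, 85, 90, 64, 78, 88, 95, 70, 81, 77, 69, 84]
pii_scores = [65, 80, 92, 60, 78, 79, 90, 66, 75, 70, 71, 80]
print(sign_test(pi_scores, pii_scores))
```

In this toy data one pair is tied and dropped, leaving 11 pairs of which 9 favor the first sample.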

Of course, this matched pairs sign test does not take magnitude of differences into account but rather only the number of times sample 1 is bigger than sample 2…that is, only “who wins” and not “by what score”. Clearly, the magnitude of the difference could well matter.

That brings us to the Wilcoxon signed rank test. Here we list the differences (as before) but then use the “absolute value” function to get the magnitudes of such differences.

Now we need to do an “average rank” of these differences (throwing out a few “zero differences” if need be). By “average rank” I mean the following: if “k” tied entries would occupy ranks n, n+1, n+2, ..n+k-1, then each of these gets the rank $\frac{n + (n+1) + (n+2) +...+ (n+k-1)}{k} = n + \frac{(k-1)}{2}$

(use $\sum^n_{k=1} k = \frac{n(n+1)}{2}$ to work this out).

Needless to say, this can be very tedious. But the “rank.avg” function in Excel really helps.

Example: rank.avg(di, $d$2:$d$55, 1) does the following: it ranks the entry in cell di versus the cells in d2: d55 (the dollar signs make the cell addresses “absolute” references, so this doesn’t change as you move down the spreadsheet) and the “1” means you rank from lowest to highest.

Now the test works in the following manner: if the populations are roughly the same, the larger or smaller ranked differences will each come from the same population roughly half the time. So we denote by $T^{-}$ the sum of the ranks of the negative differences (in this case, where “pii” is larger) and by $T^{+}$ the sum of the ranks of the positive differences.

One easy way to tease this out: $T^{+} + T^{-} = \frac{1}{2}n(n+1)$ and $T^{+} - T^{-}$ can be computed by summing the ranks after giving a negative sign to the ranks of the differences in which “pii” is larger. This is easily done by multiplying the rank of the absolute value of each difference by the sign of that difference. Now note that $\frac{1}{2}((T^{+} + T^{-}) + (T^{+} - T^{-})) = T^{+}$ and $\frac{1}{2}((T^{+} + T^{-}) - (T^{+} - T^{-})) = T^{-}$

One can use a T table (this is a different T than “student T”) or one can use the normal approximation (if n is greater than, say, 25) with
$E(T^{+}) = \frac{n(n+1)}{4}, V(T^{+}) = \frac{n(n+1)(2n+1)}{24}$.

How these are obtained: the expectation is merely one half the sum of all the ranks (what one would expect if the distributions were the same) and the variance comes from writing $T^{+} = \sum^n_{k=1} k I_k$ with $n$ independent Bernoulli random variables $I_k$ (one for each pair) with $p = \frac{1}{2}$, so the variance is $\frac{1}{4} \sum^n_{k=1} k^2 = \frac{1}{4} \frac{n(n+1)(2n+1)}{6}$
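Here is a sketch of the signed rank computation in Python, mirroring the spreadsheet steps: differences, absolute values, average ranks (what “rank.avg” does), then $T^{+}$, $T^{-}$ and the normal approximation with mean $\frac{n(n+1)}{4}$ and variance $\frac{n(n+1)(2n+1)}{24}$. The data is made up for illustration, the function names are mine, and I do not apply any tie correction to the variance.

```python
import math

def average_ranks(values):
    """Rank from lowest to highest; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # average of ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_signed_rank(sample1, sample2):
    """Returns (n, T+, T-, z) using the normal approximation, no tie correction."""
    diffs = [x - y for x, y in zip(sample1, sample2) if x != y]   # drop zeros
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return n, t_plus, t_minus, (t_plus - mean) / math.sqrt(var)

# Hypothetical scores, for illustration only (n is too small for real use)
pi_scores = [72, 85, 90, 64, 78, 88, 95, 70, 81, 77, 69, 84]
pii_scores = [65, 80, 92, 60, 78, 79, 90, 66, 75, 70, 71, 80]
print(wilcoxon_signed_rank(pi_scores, pii_scores))
```

A quick check on any run: $T^{+} + T^{-}$ should equal $\frac{1}{2}n(n+1)$.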

Here is a nice video describing the process by hand:

Mann-Whitney test
This test doesn’t apply here as the populations are, well, anything but independent, but we’ll pretend so we can crunch this data set.

Here the idea is very well expressed:

Do the following: label where the data comes from, and rank it all together. Then add the ranks of, say, the first sample. If the populations are the same (and the samples are the same size), the sums of the ranks should be roughly the same for both samples.

Again, do a “rank average” and yes, Excel can do this over two different columns of data, while keeping the ranks themselves in separate columns.

And one can compare, using either column’s rank sum: the expectation would be $E = \frac{n_1(n_1 +n_2 + 1)}{2}$ and variance would be $V = \frac{n_1n_2(n_1+n_2+1)}{12}$

Where this comes from: this is really a random sample of size $n_1$ drawn without replacement from a population of integers $1, 2, ... n_1+n_2$ (all possible ranks; which ranks we get depends on how the pooled data is ordered). The expectation is $n_1 \mu$ and the variance is $n_1 \sigma^2 \frac{n_1+n_2-n_1}{n_1+n_2 -1}$ where $\mu = \frac{n_1+n_2 +1}{2}, \sigma^2 = \frac{(n_1+n_2)^2-1}{12}$ (should remind you of the uniform distribution). The rest follows from algebra.
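The pooled ranking is also easy to script. In this sketch I rank both samples together (average ranks for ties), add up the ranks belonging to the first sample, and compare with $E = \frac{n_1(n_1+n_2+1)}{2}$ and $V = \frac{n_1 n_2(n_1+n_2+1)}{12}$. The two small samples are invented for the example and the function names are mine.

```python
import math

def average_ranks(values):
    """Rank from lowest to highest; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # average of ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(sample1, sample2):
    """Returns (rank sum of sample1, its expectation, its variance, z)."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))   # rank it all together
    w1 = sum(ranks[:n1])                                   # ranks from sample 1
    mean = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    return w1, mean, var, (w1 - mean) / math.sqrt(var)

# Hypothetical independent samples, for illustration only
first = [12, 15, 9, 20, 17]
second = [8, 11, 14, 7, 10, 13]
print(rank_sum_test(first, second))
```

Here the first sample’s rank sum is 39 against an expectation of 30, so it sits about 1.6 standard deviations high.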

So this is how it goes:

Note: I went ahead and ran the “matched pairs” t-test to contrast with the matched pairs sign test and Wilcoxon test, and the “two sample t-test with unequal variances” to contrast to the Mann-Whitney test..use the “unequal variances” assumption as the variance of sample pii is about double that of pi (I provided the F-test).

## February 18, 2019

### An easy fact about least squares linear regression that I overlooked

The background: I was making notes about the ANOVA table for “least squares” linear regression and reviewing how to derive the “sum of squares” equality:

Total Sum of Squares = Sum of Squares Regression + Sum of Squares Error or…

If $y_i$ is the observed response, $\bar{y}$ the sample mean of the responses, and $\hat{y}_i$ are the responses predicted by the best fit line (simple linear regression here) then:

$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i -\bar{y})^2+ \sum (y_i - \hat{y}_i)^2$ (where each sum is $\sum^n_{i=1}$ for the n observations. )

Now for each $i$ it is easy to see that $(y_i - \bar{y}) = (\hat{y}_i -\bar{y}) + (y_i - \hat{y}_i)$ but the equation still holds when these terms are squared, provided you sum them up!

And it was going over the derivation of this that reminded me about an important fact about least squares that I had overlooked when I first presented it.

If you go in to the derivation and calculate: $\sum ( (\hat{y}_i -\bar{y}) + (y_i - \hat{y}_i))^2 = \sum ((\hat{y}_i -\bar{y})^2 + (y_i - \hat{y}_i)^2 +2 (\hat{y}_i -\bar{y})(y_i - \hat{y}_i))$

Which equals $\sum (\hat{y}_i -\bar{y})^2 + \sum (y_i - \hat{y}_i)^2 + 2\sum (\hat{y}_i -\bar{y})(y_i - \hat{y}_i)$ and the proof is completed by showing that:

$\sum (\hat{y}_i -\bar{y})(y_i - \hat{y}_i) = \sum \hat{y}_i(y_i - \hat{y}_i) - \sum \bar{y}(y_i - \hat{y}_i)$ and that BOTH of these sums are zero.

But why?

Let’s go back to how the least squares equations were derived:

Given that $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

$\frac{\partial}{\partial \hat{\beta}_0} \sum (\hat{y}_i -y_i)^2 = 2\sum (\hat{y}_i -y_i) =0$ yields that $\sum (\hat{y}_i -y_i) =0$. That is, under the least squares equations, the sum of the residuals is zero.

Now $\frac{\partial}{\partial \hat{\beta}_1} \sum (\hat{y}_i -y_i)^2 = 2\sum x_i(\hat{y}_i -y_i) =0$ which yields that $\sum x_i(\hat{y}_i -y_i) =0$

That is, the sum of the residuals, weighted by the corresponding x values (inputs) is also zero. Note: this holds with multilinear regression as well.

Really, that is what the least squares process does: it sets the sum of the residuals and the sum of the weighted residuals equal to zero.

Yes, there is a linear algebra formulation of this.

Anyhow returning to our sum:

$\sum \bar{y}(y_i - \hat{y}_i) = \bar{y}\sum(y_i - \hat{y}_i) = 0$ Now for the other term:

$\sum \hat{y}_i(y_i - \hat{y}_i) = \sum (\hat{\beta}_0+\hat{\beta}_1 x_i)(y_i - \hat{y}_i) = \hat{\beta}_0\sum (y_i - \hat{y}_i) + \hat{\beta}_1 \sum x_i (y_i - \hat{y}_i)$

Now $\hat{\beta}_0\sum (y_i - \hat{y}_i) = 0$ as it is a constant multiple of the sum of residuals and $\hat{\beta}_1 \sum x_i (y_i - \hat{y}_i) = 0$ as it is a constant multiple of the weighted sum of residuals..weighted by the $x_i$.
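All of this can be checked numerically. The sketch below fits a least squares line to a small made-up data set (the numbers are mine, not from any class) and verifies the two facts used above, namely $\sum (y_i - \hat{y}_i) = 0$ and $\sum x_i(y_i - \hat{y}_i) = 0$, along with the sum of squares equality.

```python
def fit_line(xs, ys):
    """Simple least squares; returns (intercept b0, slope b1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]
y_bar = sum(ys) / len(ys)

sum_resid = sum(resid)                                   # should be ~0
sum_weighted = sum(x * e for x, e in zip(xs, resid))     # should be ~0
sst = sum((y - y_bar) ** 2 for y in ys)                  # total sum of squares
ssr = sum((f - y_bar) ** 2 for f in fitted)              # regression sum of squares
sse = sum(e ** 2 for e in resid)                         # error sum of squares
print(sum_resid, sum_weighted, sst - (ssr + sse))
```

All three printed numbers should be zero up to floating point roundoff.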

That was pretty easy, wasn’t it?

But the role that the basic least squares equations played in this derivation went right over my head!

## November 29, 2016

### Facebook data for a statistics class

Filed under: statistics — collegemathteaching @ 6:04 pm

I have to admit that teaching statistics has kind of ruined me. I find myself seeking patterns and data sets everywhere.

Now a national election does give me some data to play with; I used 2012 data for those purposes a few years ago.

But now I have Facebook. And I have a very curious Facebook friendship (I won’t embarrass the person by naming the person).

She became my FB friend in January of 2014. Lately, we’ve been talking a lot, mostly about the 2016 general election. But we went a long time without conversing via “private message”.

I noticed in the first 560 days of our FB “friendship” we exchanged 30 private messages. Then we started to talk more and more. $t$ is time in days since we started to talk (March 2014) and $NMSG$ is the cumulative number of private messages that we exchanged:

So I figured: this has to be an example of an exponential situation, so I ran a regression (which gave $r^2 \geq 0.99$) and got: $N = .1248e^{.010835 t}$ where $N$ is the number of messages and $t$ is the time in days.
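One common way to fit such a curve is to run ordinary least squares on $(t, \ln N)$, since $\ln N = \ln A + kt$ is linear in $t$. The sketch below shows that approach; it is an assumption on my part that this matches what the regression tool does internally. The points are synthetic, generated from the fitted curve itself rather than from the actual message counts, so the fit simply recovers the two constants.

```python
import math

def fit_exponential(ts, ns):
    """Least squares line through (t, ln N); returns (A, k) for N = A e^(k t)."""
    logs = [math.log(n) for n in ns]
    m = len(ts)
    t_bar, l_bar = sum(ts) / m, sum(logs) / m
    k = (sum((t - t_bar) * (l - l_bar) for t, l in zip(ts, logs))
         / sum((t - t_bar) ** 2 for t in ts))
    return math.exp(l_bar - k * t_bar), k

# Synthetic points generated from the fitted curve, NOT the real message counts
ts = [500, 600, 700, 800, 900]
ns = [0.1248 * math.exp(0.010835 * t) for t in ts]

A, k = fit_exponential(ts, ns)
print(A, k)
```

With real (noisy) counts the recovered constants would of course only be approximate.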

Of course, practically speaking, this can’t continue but this “virtually zero” for a long time followed by an “explosion” is a classical exponential phenomenon.

## November 1, 2016

### A test for the independence of random variables

Filed under: algebra, probability, statistics — collegemathteaching @ 10:36 pm

We are using Mathematical Statistics with Applications (7th Ed.) by Wackerly, Mendenhall and Scheaffer for our calculus based probability and statistics course.

They present the following Theorem (5.5 in this edition)

Let $Y_1$ and $Y_2$ have a joint density $f(y_1, y_2)$ that is positive if and only if $a \leq y_1 \leq b$ and $c \leq y_2 \leq d$ for constants $a, b, c, d$ and $f(y_1, y_2)=0$ otherwise. Then $Y_1, Y_2$ are independent random variables if and only if $f(y_1, y_2) = g(y_1)h(y_2)$ where $g(y_1), h(y_2)$ are non-negative functions of $y_1, y_2$ alone (respectively).

Ok, that is fine as it goes, but then they apply the above theorem to the joint density function: $f(y_1, y_2) = 2y_1$ for $(y_1,y_2) \in [0,1] \times [0,1]$ and 0 otherwise. Do you see the problem? Technically speaking, the theorem doesn’t apply as $f(y_1, y_2)$ is NOT positive on all of a closed rectangle: it vanishes along the edge $y_1 = 0$ of $[0,1] \times [0,1]$.

It isn’t that hard to fix, I don’t think.

Now there is the density function $f(y_1, y_2) = y_1 + y_2$ on $[0,1] \times [0,1]$ and zero elsewhere. Here, $Y_1, Y_2$ are not independent.

But how does one KNOW that $y_1 + y_2 \neq g(y_1)h(y_2)$?

I played around a bit and came up with the following:

Statement: $\sum^{n}_{i=1} a_i(x_i)^{r_i} \neq f_1(x_1)f_2(x_2).....f_n(x_n)$ (note: assume $n \geq 2, r_i \in \{1,2,3,....\}, a_i \neq 0$)

Proof of the statement: substitute $x_2 =x_3 = x_4....=x_n = 0$ into both sides to obtain $a_1 x_1^{r_1} = f_1(x_1)(f_2(0)f_3(0)...f_n(0))$ Now none of $f_2(0), f_3(0), ...f_n(0)$ can be zero, else the right hand side would be identically zero while the left hand side is not. The same argument shows that $a_2 x_2^{r_2} = f_2(x_2)f_1(0)f_3(0)f_4(0)...f_n(0)$ with none of these $f_k(0) = 0$; in particular $f_1(0) \neq 0$.

Now substitute $x_1=x_2 =x_3 = x_4....=x_n = 0$ into both sides and get $0 = f_1(0)f_2(0)f_3(0)f_4(0)...f_n(0)$ but no factor on the right hand side can be zero.
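For two variables the argument boils down to an identity one can test by machine: if $f(x, y) = g(x)h(y)$ then $f(x,0)f(0,y) = g(x)h(0)g(0)h(y) = f(x,y)f(0,0)$ for all $x, y$. The sketch below hunts for a point where that identity fails. It is only a necessary condition, so passing the check proves nothing; the function name and the grid are mine.

```python
def violates_product_identity(f, points, tol=1e-12):
    """If f(x, y) = g(x) h(y), then f(x, 0) f(0, y) = f(x, y) f(0, 0) everywhere.

    Return the first (x, y) in points where the identity fails, or None.
    """
    for x, y in points:
        if abs(f(x, 0.0) * f(0.0, y) - f(x, y) * f(0.0, 0.0)) > tol:
            return (x, y)
    return None

grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]

# y1 + y2 cannot factor: the identity breaks as soon as both coordinates are nonzero
sum_witness = violates_product_identity(lambda y1, y2: y1 + y2, grid)
# 2*y1 does factor (g = 2*y1, h = 1), so no violation turns up
product_witness = violates_product_identity(lambda y1, y2: 2 * y1, grid)
print(sum_witness, product_witness)
```

For $y_1 + y_2$ the left side is $y_1 y_2$ while the right side is $(y_1 + y_2) \cdot 0 = 0$, which is the same contradiction as in the proof.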

This is hardly profound but I admit that I’ve been negligent in pointing this out to classes.

## March 21, 2014

### Projections, regressions and Anscombe’s quartet…

Data and its role in journalism is a hot topic among some of the bloggers that I regularly follow. See: Nate Silver on what he hopes to accomplish with his new website, and Paul Krugman’s caveats on this project. The debate is, as I see it, about the role of data and the role of having expertise in a subject when it comes to providing the public with an accurate picture of what is going on.

Then I saw this meme on a Facebook page:

These two things (the discussion and meme) led me to make this post.

First the meme: I thought of this meme as a way to explain volume integration by “cross sections”. 🙂 But for this post, I’ll focus on this meme showing an example of a “projection map” in mathematics. I can even provide some equations: imagine the following set in $R^3$ described as follows: $S= \{(x,y,z) | (y-2)^2 + (z-2)^2 \le 1, 1 \le x \le 2 \}$ Now the projection map to the $y-z$ plane is given by $p_{yz}(x,y,z) = (0,y,z)$ and the image set is $S_{yz} = \{(0,y,z)| (y-2)^2 + (z-2)^2 \le 1 \}$ which is a disk (in the yellow).

The projection onto the $x-z$ plane is given by $p_{xz}(x,y,z) = (x,0,z)$ and the image is $S_{xz} = \{(x,0,z)| 1 \le x \le 2, 1 \le z \le 3 \}$ which is a rectangle (in the blue).

The issue raised by this meme is that neither projection, in and of itself, determines the set $S$. In fact, both of these projections, taken together, do not determine the object. For example: the “hollow can” in the shape of our $S$ would have the same projections; there are literally uncountably many such sets. Example: imagine a rectangle in the shape of the blue projection joined to one end disk parallel to the yellow plane.

Of course, one can put some restrictions on candidates for $S$ (the pre image of both projections taken together); say one might want $S$ to be a manifold of either 2 or 3 dimensions, or some other criteria. But THAT would be adding more information to the mix and thereby, in a sense, providing yet another projection map.

Projections, by design, lose information.

In statistics, a statistic, by definition, is a type of projection. Consider, for example, linear regression. I discussed linear regressions and using “fake data” to teach linear regression here. But the linear regression process inputs data points and produces numbers including the mean and standard deviations of the $x, y$ values as well as the correlation coefficient and the regression coefficients.

But one loses information in the process. A good demonstration of this comes from Anscombe’s quartet: one has 4 very different data sets producing (nearly) identical regression coefficients (and yes, correlation coefficients, confidence intervals, etc). Here are the plots of the data:

And here is the data:

The Wikipedia article I quoted is pretty good; they even provide a link to a paper that gives an algorithm to generate different data sets with the same regression values (and yes, the paper defines what is meant by “different”).
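The coincidence is easy to check for yourself. Below is a short Python sketch that fits the least squares line to each of the four sets. I typed in the quartet’s values as they are usually tabulated (Anscombe, 1973), so compare against the Wikipedia table before relying on them; the helper name `fit_line` is mine.

```python
def fit_line(xs, ys):
    """Simple least squares; returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]      # shared by sets I, II, III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (intercept, slope) in fits.items():
    print(name, round(intercept, 2), round(slope, 3))
```

Each set comes out at roughly $y = 3.00 + 0.500x$ even though the scatter plots look nothing alike.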

Moral: when one crunches data, one has to be aware of the loss of information that is involved.

## July 23, 2013

### Nate Silver’s Book: The signal and the noise: why so many predictions fail but some don’t

Filed under: books, elementary mathematics, science, statistics — collegemathteaching @ 4:10 pm

Reposted from my personal blog and from my Daily Kos Diary:

Quick Review
Excellent book. There are a few tiny technical errors (e. g., “non-linear” functions include exponential functions, but not all non-linear phenomena are exponential (e. g. power, root, logarithmic, etc.)).
Also, experts have some (justified) quibbles with the book; you can read some of these concerning his chapter on climate change here and some on his discussion of hypothesis testing here.

But, aside from these, it is right on. Anyone who follows the news closely will benefit from it; I especially recommend it to those who closely follow science and politics and even sports.

It is well written and is designed for adults; it makes some (but reasonable) demands on the reader. The scientist, mathematician or engineer can read this at the end of the day but the less technically inclined will probably have to be wide awake while reading this.

Details
Silver sets you up by showing examples of failed predictions; perhaps the worst of the lot was the economic collapse in the United States prior to the 2008 general elections. Much of this was due to the collapse of the real estate market and falling house/property values. Real estate was badly overvalued, and financial firms made packages of investments whose soundness was based on many mortgages NOT defaulting at the same time; it was determined that the risk of that happening was astronomically small. That was wrong of course; one reason is that the risk of such an event is NOT described by the “normal” (bell shaped) distribution but rather by one that allows for failure with a higher degree of probability.

There were more things going on, of course; and many of these things were difficult to model accurately just due to complexity. Too many factors make a model unusable; too few mean that the model is worthless.

Silver also talks about models providing probabilistic outcomes: for example, saying that the GDP will be X in year Y is unrealistic; what we really should say is that the probability of the GDP being X plus/minus “E” is Z percent.

Next Silver takes on pundits. In general: they don’t predict well; they are more about entertainment than anything else. Example: look at the outcome of the 2012 election; the nerds were right; the pundits (be they NPR or Fox News pundits) were wrong. NPR called the election “razor tight” (it wasn’t); Fox called it for the wrong guy. The data was clear and the sports books knew this, but that doesn’t sell well, does it?

Now Silver looks at baseball. Of course there are a ton of statistics here; I am a bit sorry he didn’t introduce Bayesian analysis in this chapter though he may have been setting you up for it.

Topics include: what does raw data tell you about a player’s prospects? What role does a talent scout’s input have toward making the prediction? How does a baseball player’s hitting vary with age, and why is this hard to measure from the data?

The next two chapters deal with predictions: earthquakes and weather. Bottom line: we have statistical data on weather and on earthquakes, but in terms of making “tomorrow’s prediction”, we are much, much, much further along in weather than we are on earthquakes. In terms of earthquakes, we can say stuff like “region Y has a X percent chance of an earthquake of magnitude Z within the next 35 years” but that is about it. On the other hand, we are much better about, say, making forecasts of the path of a hurricane, though these are probabilistic:

In terms of weather: we have many more measurements.

But there IS the following: weather is a chaotic system; a small change in initial conditions can lead to a large change in long term outcomes. Example: one can measure a temperature at time t, but only to a certain degree of precision. The same holds for pressure, wind vectors, etc. Small perturbations can lead to very different outcomes. Solutions aren’t stable with respect to initial conditions.

You can see this easily: try to balance a pen on its tip. Physics tells us there is a precise position at which the pen is at equilibrium, even on its tip. But that equilibrium is so unstable that a small vibration of the table or even small movement of air in the room is enough to upset it.

In fact, some gambling depends on this. For example, consider a coin toss. A coin toss is governed by Newton’s laws for classical mechanics, and in principle, if you could get precise initial conditions and environmental conditions, the outcome shouldn’t be random. But it is…for practical purposes. The same holds for rolling dice.

Now what about dispensing with models and just predicting based on data alone (not regarding physical laws and relationships)? One big problem: data is noisy and is prone to be “overfitted” by a curve (or surface) that exactly matches prior data but is of no predictive value. Think of it this way: if you have n data points in the plane, there is a polynomial of degree n-1 that will fit the data EXACTLY, but in most cases have a very “wiggly” graph that provides no predictive value.
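Here is a small illustration of that extreme case, using Lagrange interpolation to build the degree n-1 polynomial through n points. The eight “readings” are invented: they wobble around the line $y = x$. The function name is mine.

```python
def lagrange(points):
    """Return the polynomial of degree <= n-1 through n points with distinct x values."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)   # Lagrange basis factor
            total += term
        return total
    return p

# Made-up noisy readings scattered around the line y = x
data = [(0, 0.2), (1, 0.8), (2, 2.3), (3, 2.7), (4, 4.4), (5, 4.6), (6, 6.3), (7, 6.8)]
p = lagrange(data)

print(max(abs(p(x) - y) for x, y in data))   # essentially zero: an EXACT fit
print(p(8))                                  # one step past the data; the trend says about 8
```

The fit on the eight points is perfect, yet the “prediction” at $x = 8$ lands nowhere near the trend line.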

Of course that is overfitting in the extreme. Hence, most use the science of the situation to posit the type of curve that “should” provide a rough fit and then use some mathematical procedure (e. g. “least squares”) to find the “best” curve that fits.

The book goes into many more examples, such as the flu epidemic. Here one finds the old tug between models that are too simplistic to be useful for forecasting and those that are too complicated to be used.

There are interesting sections on poker and chess and the role of probability is discussed as well as the role of machines. The poker chapter is interesting; Silver describes his experience as a poker player. He made a lot of money when poker drew lots of rookies who had money to spend; he didn’t do as well when those “bad” players left and only the most dedicated ones remained. One saw that really bad players lost more money than the best players won (not that hard to understand). He also talked about how hard it was to tell if someone was really good or merely lucky; sometimes this wasn’t perfectly clear after a few months.

Later, Silver discusses climate change and why the vast majority of scientists see it as being real and caused (or made substantially worse) by human activity. He also talks about terrorism and enemy sneak attacks; sometimes there IS a signal out there but it isn’t detected because we don’t realize that there IS a signal to detect.

However the best part of the book (and it is all pretty good, IMHO), is his discussion of Bayes’ law and Bayesian versus frequentist statistics. I’ve talked about this.

I’ll demonstrate Bayesian reasoning in a couple of examples, and then talk about Bayesian versus frequentist statistical testing.

Example one: back in 1999, I went to the doctor with chest pains. The doctor, based on my symptoms and my current activity level (I still swam and ran long distances with no difficulty) said it was reflux and prescribed prescription antacids. He told me this about a possible stress test: “I could stress test you but the probability of any positive being a false positive is so high, we’d learn nothing from the test”.

Example two: suppose you are testing for a drug that is not widely used; say 5 percent of the population uses it. You have a test that is 95 percent accurate in the following sense: if the person is really using the drug, it will show positive 95 percent of the time, and if the person is NOT using the drug, it will show positive only 5 percent of the time (false positive).

So now you test 2000 people for the drug. If Bob tests positive, what is the probability that he is a drug user?

Answer: There are 100 actual drug users in this population, so you’d expect 100*.95 = 95 true positives. There are 1900 non-users, so you’d expect 1900*.05 = 95 false positives. So there are as many false positives as true positives! The probability that someone who tests positive is really a user is 50 percent.
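The same head count can be sketched in a few lines of Python. The function name `prob_user_given_positive` is my own label; the numbers are the ones from the example.

```python
def prob_user_given_positive(prevalence, hit_rate, false_positive_rate):
    """Probability that a person who tests positive really is a user."""
    true_positives = prevalence * hit_rate
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


population = 2000
expected_true_positives = population * 0.05 * 0.95    # users who test positive
expected_false_positives = population * 0.95 * 0.05   # non-users who test positive

bob = prob_user_given_positive(prevalence=0.05, hit_rate=0.95,
                               false_positive_rate=0.05)
print(expected_true_positives, expected_false_positives, bob)
```

Note that the population size drops out of the final ratio; only the three rates matter.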

Now how does this apply to “hypothesis testing”?

Consider basketball. You know that a given player took 10 free throws and made 4. You wonder: what is the probability that this player is a competent free throw shooter (with competence defined to be, say, 70 percent)?

If you just go by the numbers that you see (true: n = 10 is a pathetically small sample; in real life you’d never infer anything), well, the test would be: given that the probability of making a free throw is 70 percent, what is the probability that you’d see 4 (or fewer) made free throws out of 10?

Using a binomial probability calculator, we’d say there is a 4.7 percent chance we’d see 4 or fewer free throws made if the person shooting was a 70 percent shooter. That is the “frequentist” way.

But suppose you found out one of the following:
1. The shooter was me (I played one season in junior high and some pick up ball many years ago…infrequently) or
2. The shooter was an NBA player.

If 1 was true, you’d believe the result or POSSIBLY say “maybe he had a good day”.
If 2 was true, then you’d say “unless this player was chosen from one of the all time worst NBA free throw shooters, he probably just had a bad day”.

Bayesian hypothesis testing gives us a way to make an informed guess. We’d ask: what is the probability that the hypothesis is true given the data that we see (asking the reverse of what the frequentist asks)? But to do this, we’d have to guess: if this person is an NBA player, what is the probability, PRIOR to this 4 for 10 shooting, that this person was a 70 percent or better shooter (the NBA average is about 75 percent)? For the sake of argument, assume that there is a 60 percent chance that this person came from the 70 percent or better category (one could estimate this by looking at the percentage of NBA players shooting 70 percent or better). Take a “bad” shooter to be a 50 percent shooter (based on the worst NBA free throw shooters); the probability of 4 or fewer made free throws out of 10 for a 50 percent free throw shooter is .377.

Then we’d use Bayes law: (.0473*.6)/(.0473*.6 + .377*.4) = .158. So it IS possible that we are seeing a decent free throw shooter having a bad day.
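For those who want to reproduce the numbers, here is a small sketch using only the standard library. The helper `binom_cdf` is my own; the 60 percent prior and the 50 percent “bad shooter” are the same for-the-sake-of-argument assumptions as above.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(Y <= k) when Y counts successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Likelihood of "4 or fewer makes out of 10" under each kind of shooter.
like_good = binom_cdf(4, 10, 0.7)   # the 70 percent shooter (about .0473)
like_bad = binom_cdf(4, 10, 0.5)    # the assumed 50 percent shooter (about .377)

prior_good = 0.6                    # assumed prior for "70 percent or better"
posterior_good = (like_good * prior_good) / (
    like_good * prior_good + like_bad * (1 - prior_good)
)
print(round(like_good, 4), round(like_bad, 4), round(posterior_good, 3))
```

Changing `prior_good` is a quick way to see how much the conclusion leans on the prior guess.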

This has profound implications in science. For example, if one is trying to study genes versus the propensity for a given disease, there are a LOT of genes. Say one runs a study that tests 1000 genes in people who had a certain type of cancer. If we accept a p = .05 (5 percent) chance of a false positive on each test, we are likely to have about 50 false positives out of this study. So, given a positive correlation between a given allele and this disease, what is the probability that this is a false positive? That is, how many true positives are we likely to have?
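To get a feel for the scale of the problem, here is a rough sketch. Two of the inputs are pure assumptions of mine, made up for illustration: that 10 of the 1000 genes really are associated with the disease, and that the study detects a real association 80 percent of the time.

```python
n_genes = 1000
alpha = 0.05             # per-test false positive rate, from the text
n_truly_linked = 10      # ASSUMPTION: number of genes with a real association
detection_rate = 0.80    # ASSUMPTION: chance a real association tests positive

expected_false_positives = (n_genes - n_truly_linked) * alpha
expected_true_positives = n_truly_linked * detection_rate
share_false = expected_false_positives / (
    expected_false_positives + expected_true_positives
)

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"expected true positives:  {expected_true_positives:.1f}")
print(f"share of positives that are false: {share_false:.1%}")
```

Under these made-up inputs, the large majority of “hits” would be false positives, which is the worry raised above.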

This is a case in which we can use the science of the situation and perhaps limit our study to genes that have some reasonable expectation of actually causing this malady. Then if we can “preassign” a probability, we might get a better feel if a positive is a false one.

Of course, this technique might induce a “user bias” into the situation from the very start.

The good news is that, given enough data, the frequentist and the Bayesian techniques converge to “the truth”.

Summary: Nate Silver’s book is well written, informative and fun to read. I can recommend it without reservation.

## July 12, 2013

### An example to apply Bayes’ Theorem and multivariable calculus

I’ve thought a bit about the breast cancer research results and found a nice “application” exercise that might help teach students about Bayes Theorem, two-variable maximizing, critical points, differentials and the like.

I’ve been interested in the mathematics and statistics of the breast cancer screening issue mostly because it provided a real-life application of statistics and Bayes’ Theorem.

So right now, for women between 40-49, traditional mammograms are about 80 percent accurate in the sense that, if a woman who really has breast cancer gets a mammogram, the test will catch it about 80 percent of the time. The false positive rate is about 8 percent in that: if 100 women who do NOT have breast cancer get a mammogram, 8 of the mammograms will register a “positive”.
Since the breast cancer rate for women in this age group is about 1.4 percent, there will be many more false positives than true positives; in fact, with these rates, a woman in this age group who gets a “positive” first mammogram has about a 12 percent chance of actually having breast cancer. I talk about these issues here.

So, suppose you desire a “more accurate test” for breast cancer. The question is this: what do you mean by “more accurate”?

1. If “more accurate” means “giving the right answer more often”, then that is pretty easy to do.
Current testing is wrong some of the time. If C means “has cancer”, N means “doesn’t have cancer”, P means “positive test” and M means “negative test”, then the probability of being wrong is:
$P(M|C)P(C) + P(P|N)P(N) = .2(.014) + .08(.986) = .08168$. On the other hand, if you just declared EVERYONE to be “cancer free”, you’d be wrong only 1.4 percent of the time! Clearly that does not work either: the “false negative” rate is 100 percent, though the “false positive” rate is 0.

On the other hand if you just told everyone “you have it”, then you’d be wrong 98.6 percent of the time, but you’d have zero “false negatives”.

So being right more often isn’t what you want to maximize, and trying to minimize the false positives or the false negatives doesn’t work either.
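The three “how often are we wrong” figures above can be checked with a short sketch; the variable names are mine, and the rates are the ones quoted in this post.

```python
p_cancer = 0.014             # breast cancer rate in this age group
detection_rate = 0.80        # P(positive | cancer)
false_positive_rate = 0.08   # P(positive | no cancer)

# Real test: wrong on the missed cancers plus the false alarms.
error_real_test = (1 - detection_rate) * p_cancer + false_positive_rate * (1 - p_cancer)

# "Test" that tells everyone they are cancer free: wrong only on the actual cancers.
error_all_negative = p_cancer

# "Test" that tells everyone they have it: wrong on everyone without cancer.
error_all_positive = 1 - p_cancer

print(f"real test:         wrong {error_real_test:.3%} of the time")
print(f"everyone negative: wrong {error_all_negative:.1%} of the time")
print(f"everyone positive: wrong {error_all_positive:.1%} of the time")
```

The useless “everyone is cancer free” rule wins on raw error rate, which is exactly why raw error rate is the wrong thing to optimize here.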

2. So what about “detecting more of the cancer that is there”? Well, that is where this article comes in. Switching to digital mammograms does increase detection rate but also increases the number of false positives:

The authors note that for every 10,000 women 40 to 49 who are given digital mammograms, two more cases of cancer will be identified for every 170 additional false-positive examinations.

So, what one sees is that if a woman gets a positive reading, she now has an 11 percent chance of actually having breast cancer, though a few more cancers would be detected.

Is this progress?

My whole point: saying one test is “more accurate” than another test isn’t well defined, especially in a situation where one is trying to detect something that is relatively rare.
Here is one way to look at it: let the probability of breast cancer be $a$, the probability of detection of a cancer be given by $x$ and the probability of a false positive be given by $y$. Then the probability of a person actually having breast cancer, given a positive test is given by:
$B(x,y) =\frac{ax}{ax + (1-a)y}$; this gives us something to optimize. The partial derivatives are:
$\frac{\partial B}{\partial x}= \frac{(a)(1-a)y}{(ax+ (1-a)y)^2},\frac{\partial B}{\partial y}=\frac{(-a)(1-a)x}{(ax+ (1-a)y)^2}$. Note that $1-a$ is positive since $a$ is less than 1 (in fact, it is small). Both numerators vanish only at $x = y = 0$, where $B$ itself is undefined, so there is no interior critical point; the best possible value, $B = 1$, sits on the boundary at $x = 1, y = 0$. That is a bit of a “duh”: find a single test that gives no false positives and no false negatives. The signs of the partial derivatives also show us that our predictions will be better if $y$ goes down (fewer false positives) and if $x$ goes up (fewer false negatives). None of that is a surprise.

What is of interest is the amount of change. The denominators of the two partial derivatives are identical. The coefficients in the numerators have the same magnitude but opposite signs. So the rate of improvement of the predictive value depends on the relative magnitudes of $x$, which is $.8$ for us, and $y$, which is $.08$. Note that $x$ is much larger than $y$ and that $x$ occurs in the numerator of $\frac{\partial B}{\partial y}$. Hence an increase in the accuracy of the $y$ factor (a decrease in the false positive rate) will have a greater effect on the predictive value of the test than a similar decrease in the “false negative” rate.
Using the concept of differentials, we expect a change of $\Delta x = .01$ to lead to an improvement of about .00136 (substitute $x = .8, y = .08$ into the expression for $\frac{\partial B}{\partial x}$ and multiply by $.01$). Similarly, a decrease of $\Delta y = -.01$ leads to an improvement of about .013609.

You can “verify” this by playing with some numbers:

Current ($x = .8, y = .08$) we get $B = .1243$. Now let’s change: $x = .81, y = .08$ leads to $B = .125693$
Now change: $x = .8, y = .07$ we get $B = .139616$
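If you’d rather let a machine play with the numbers, here is a small sketch that compares the differential estimates with the actual changes in $B$. The function names are my own; $a = .014$ is the breast cancer rate used in this post.

```python
A = 0.014  # probability of breast cancer in this age group


def B(x, y, a=A):
    """Probability of cancer given a positive test."""
    return a * x / (a * x + (1 - a) * y)


def dB_dx(x, y, a=A):
    return a * (1 - a) * y / (a * x + (1 - a) * y) ** 2


def dB_dy(x, y, a=A):
    return -a * (1 - a) * x / (a * x + (1 - a) * y) ** 2


x0, y0 = 0.80, 0.08

# Differential (linear) estimates of the improvement.
est_gain_x = dB_dx(x0, y0) * 0.01       # detection rate up by .01
est_gain_y = dB_dy(x0, y0) * (-0.01)    # false positive rate down by .01

# Actual improvements from recomputing B.
actual_gain_x = B(0.81, y0) - B(x0, y0)
actual_gain_y = B(x0, 0.07) - B(x0, y0)

print(f"x: estimate {est_gain_x:.6f}, actual {actual_gain_x:.6f}")
print(f"y: estimate {est_gain_y:.6f}, actual {actual_gain_y:.6f}")
```

The estimate for $x$ is very close to the actual change. The estimate for $y$ is rougher, since a change of .01 is large relative to $y = .08$, but both agree that the false positive rate has roughly ten times the leverage.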

Bottom line: the best way to increase the predictive value of the test is to reduce the number of false positives, while keeping the “false negative” rate the same (or lowering it). As things sit, the false positive rate is the bigger factor affecting predictive value.

### Hypothesis Testing: Frequentist and Bayesian

Filed under: science, statistics — Tags: , — collegemathteaching @ 4:24 pm

I was working through Nate Silver’s book The Signal and the Noise and got to his chapter about hypothesis testing. It is interesting reading and I thought I would expand on that by posing a couple of problems.

Problem one: suppose you knew that someone attempted some basketball free throws.
If they made 1 of 4 shots, what would the probability be that they were really, say, a 75 percent free throw shooter?
Or, what if they made 5 of 20 shots?

Problem two: Suppose a woman aged 40-49 got a digital mammogram and got a “positive” reading. What is the probability that she indeed has breast cancer, given that the test catches 80 percent of the breast cancers (note: 20 percent is one estimate of the “false negative” rate) and that the false positive rate is 7.8 percent? The answer, computed from these rates, might surprise you: it is about 13 percent.

I’ll talk about problem two first, as this will limber the mind for problem one.

So, you are a woman between 40-49 years of age and go into the doctor and get a mammogram. The result: positive.

So, what is the probability that you, in fact, have cancer?

Think of it this way: out of 10,000 women in that age bracket, about 143 have breast cancer and 9857 do not.
So, the number of false positives is 9857*.078 = 768.846; we’ll keep the decimal for the sake of calculation;
The number of true positives is: 143*.8 = 114.4.
The total number of positives is therefore 883.246.

The proportion of true positives is $\frac{114.4}{883.246} = .1295$. So about 87.05 percent of the positive readings are false positives.

It turns out that data have shown that 80-90 percent of positives in women in this age bracket are “false positives”, and our calculation is in line with that.

I want to point out that this example is designed to warm the reader up to Bayesian thinking; the “real life” science/medicine issues are a bit more complicated than this. That is why the recommendations for screening include criteria as to age, symptoms vs. asymptomatic, family histories, etc. All of these factors affect the calculations.

For example: using digital mammograms with this population of 10,000 women in this age bracket adds 2 more “true” detections and adds 170 more false positives. So now our calculation would be $\frac{116.4}{1055.25} = .1103$, so while the true detections go up, the false positives also go up!
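A quick sketch of that bookkeeping, using the head counts from above. The last line is my own addition: it looks only at the 172 extra positives that the digital test produces.

```python
# Head counts per 10,000 women aged 40-49, taken from the calculation above.
true_positives = 143 * 0.80        # 114.4 cancers caught by the traditional test
false_positives = 9857 * 0.078     # 768.846 false alarms from the traditional test

# Digital mammograms: 2 more true detections, 170 more false positives.
digital_true = true_positives + 2
digital_false = false_positives + 170

share_true_digital = digital_true / (digital_true + digital_false)

# Of the extra positives alone, what share are real cancers?
share_true_among_extra = 2 / (2 + 170)

print(f"digital: {share_true_digital:.4f} of positives are true")
print(f"extra positives only: {share_true_among_extra:.4f} are true")
```

Seen this way, only about 1.2 percent of the additional positives correspond to an actual cancer.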

Our calculation, while specific to this case, generalizes. The formula comes from Bayes Theorem which states:
$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|not(A))P(not(A))}$. Here: $P(A|B)$ is the probability of event A occurring given that B occurs and $P(A)$ is the probability of event A occurring. So in our case, we were answering the question: given a positive mammogram, what is the probability of actually having breast cancer? This is denoted by $P(A|B)$. We knew $P(B|A)$, which is the probability of having a positive reading given that one has breast cancer, and $P(B|not(A))$, which is the probability of getting a positive reading given that one does NOT have cancer. So for us: $P(B|A) = .8, P(B|not(A)) = .078$ and $P(A) = .0143, P(not(A)) = .9857$.

The bottom line: If you are testing for a condition that is known to be rare, even a reasonably accurate test will deliver a LOT of false positives.

Here is a warm up (hypothetical) example. Suppose a drug test is 99 percent accurate in that it will detect that a certain drug is there 99 percent of the time (if it is really there) and only yield a false positive 1 percent of the time (gives a positive result even if the person being tested is free of this drug). Suppose the drug use in this population is “known” to be, say 5 percent.

Given a positive test, what is the probability that the person is actually a user of this drug?

Answer: $\frac{.99*.05}{.99*.05+.01*.95} = .839$ . So, in this population, about 16.1 percent of the positives will be “false positives”, even though the test is 99 percent accurate!
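To see how much the answer depends on how rare the drug use is, here is a sketch that keeps the same hypothetical 99 percent test and varies the usage rate. The usage rates other than 5 percent are values I picked for illustration, and the function name is my own.

```python
def prob_user_given_positive(usage_rate, hit_rate=0.99, false_positive_rate=0.01):
    """Bayes' Theorem: P(user | positive test)."""
    true_pos = hit_rate * usage_rate
    false_pos = false_positive_rate * (1 - usage_rate)
    return true_pos / (true_pos + false_pos)


# 5 percent is the rate from the example; the others are illustrative.
for usage_rate in (0.001, 0.01, 0.05, 0.20):
    p = prob_user_given_positive(usage_rate)
    print(f"usage rate {usage_rate:>5.1%}: P(user | positive) = {p:.3f}")
```

When only 1 percent of the population uses the drug, a positive from this “99 percent accurate” test is a coin flip.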

Now that you are warmed up, let’s proceed to the basketball question:

Question: suppose someone (that you don’t actually see) shoots free throws.

Case a) the player makes 1 of 4 shots.
Case b) the player makes 2 of 8 shots.
Case c) the player makes 5 of 20 shots.

Now you’d like to know: what is the probability that the player in question is really a 75 percent free throw shooter? (I picked 75 percent as the NBA average for last season is 75.3 percent).

Now suppose you knew NOTHING else about this situation; you know only that someone attempted free throws and you got the following data.

The traditional “hypothesis test” uses the “frequentist” model: you would say: if the hypothesis that the person really is a 75 percent free throw shooter is true, what is the probability that we’d see this data?

So one would use the formula for the binomial distribution and use n = 4 for case A, n = 8 for case B and n = 20 for case C and use p = .75 for all cases.

In case A, we’d calculate the probability that the number of “successes” (made free throws) is less than or equal to 1; 2 for case B and 5 for case C.

For you experts: the null hypothesis in each case is $p = .75$, and the p-values for the various cases are $P(Y \le 1 | p = .75), P(Y \le 2 | p = .75), P(Y \le 5 | p = .75)$ respectively, where the probability mass function is adjusted for the different values of $n$.

We could do the calculations by hand, or rely on this handy calculator.

Case A: .0508
Case B: .0042
Case C: .0000 ($3.81 \times 10^{-6}$)
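If you don’t have a calculator handy, the three values can be reproduced with a few lines of standard-library Python. The helper `binom_cdf` is my own name for the cumulative binomial probability.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(Y <= k) for a binomial random variable with n trials and success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# (makes, attempts) for each case; the null hypothesis is p = .75 throughout.
cases = {"A": (1, 4), "B": (2, 8), "C": (5, 20)}
p_values = {name: binom_cdf(makes, attempts, 0.75)
            for name, (makes, attempts) in cases.items()}

for name, value in p_values.items():
    print(f"Case {name}: {value:.3g}")
```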

By traditional standards: in Case A we would be on the verge of rejecting the null hypothesis that p = .75, and we’d easily reject the null hypothesis in cases B and C. (The usual standard for life science and political science is p = .05.)

(for a refresher, go here)

So that is that, right?

Well, what if I told you more of the story?

Suppose now, that in each case, the shooter was me? I am not a good athlete and I played one season in junior high, and rarely, some pickup basketball. I am a terrible player. Most anyone would happily reject the null hypothesis without a second thought.

But now: suppose I tell you that I took these performances from NBA box scores? (the first one was taken from one of the Spurs-Heat finals games; the other two are made up for demonstration).

Now, you might not be so quick to reject the null hypothesis. You might reason: “well, he is an NBA player and were he always as bad as the cases show, he wouldn’t be an NBA player. This is probably just a bad game.” In other words, you’d be more open to the possibility that this is a false positive.

Now you don’t know this for sure; this could be an exceptionally bad free throw shooter (Ben Wallace shot 41.5 percent, Shaquille O’Neal shot 52.7 percent) but unless you knew that, you’d be at least reasonably sure that this person, being an NBA player, is probably a 70-75 percent shooter, at worst.

So “how” sure might you be? You might look at NBA statistics and surmise that, say (I am just making this up), 68 percent of NBA players shoot between 72-78 percent from the line. So, you might say that, prior to this guy shooting at all, the probability of the hypothesis being true is about 70 percent (say). Yes, this is a prior judgement but it is a reasonable one. Now you’d use Bayes law:

$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|not(A))P(not(A))}$

Here: A represents the “75 percent shooter” hypothesis being actually true, and B is the event that we actually get the data. Note the difference in outlook: in the first case (the “frequentist” method), we wondered “if the hypothesis is true, how likely is it that we’d see data like this”. In this case, called the Bayesian method, we are wondering: “if we have this data, what is the probability that the null hypothesis is true”. It is a reverse statement, of sorts.

Of course, we have $P(A) = .7, P(not(A)) = .3$ and we’ve already calculated P(B|A) for the various cases. We need to make a SECOND assumption: what does event not(A) mean? Given what I’ve said, one might say not(A) is someone who shoots, say, 40 percent (to make him among the worst possible in the NBA). Then for the various cases, we calculate $P(B|not(A)) = .4752, .3154, .1256$ respectively.

So, we now calculate using the Bayesian method:

Case A, the shooter made 1 of 4: .1996. The frequentist p-value was .0508
Case B, the shooter made 2 of 8: .0301. The frequentist p-value was .0042
Case C, the shooter made 5 of 20: 7.08 x 10^-5 The frequentist p-value was 3.81 x 10^-6
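Here is a sketch that reproduces these Bayesian figures from scratch. It uses the same two assumptions as above (a prior of .7 for the “75 percent shooter” hypothesis, and a 40 percent shooter as the alternative); `binom_cdf` and `posterior` are my own helper names. The last digit can differ slightly from the figures above because the code does not round the intermediate probabilities.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(Y <= k) for a binomial random variable with n trials and success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def posterior(makes, attempts, prior=0.7, p_null=0.75, p_alt=0.40):
    """P(75 percent shooter | data) via Bayes law, under the assumptions above."""
    like_null = binom_cdf(makes, attempts, p_null)   # P(B | A)
    like_alt = binom_cdf(makes, attempts, p_alt)     # P(B | not(A))
    return like_null * prior / (like_null * prior + like_alt * (1 - prior))


cases = {"A": (1, 4), "B": (2, 8), "C": (5, 20)}
posteriors = {name: posterior(makes, attempts)
              for name, (makes, attempts) in cases.items()}

for name, value in posteriors.items():
    print(f"Case {name}: {value:.3g}")
```

Try a different `prior` or `p_alt` to see how sensitive each case is to those two guesses; the small-sample case A moves the most.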

We see the following:
1. The Bayesian method is less likely to produce a “false positive”.
2. As n, the number of data points, grows, the Bayesian conclusion and the frequentist conclusions tend toward “the truth”; that is, if the shooter shoots enough foul shots and continues to make 25 percent of them, then the shooter really becomes a 25 percent free throw shooter.

So to sum it up:
1. The frequentist approach relies on fewer prior assumptions and is computationally simpler. But it doesn’t include extra information that might make it easier to distinguish false positives from genuine positives.
2. The Bayesian approach takes in more available information. But it is a bit more prone to the user’s preconceived notions and is harder to calculate.

How does this apply to science?
Well, suppose you wanted to do an experiment that tried to find out which human gene alleles correspond to a certain human ailment. A brute force experiment in which every human gene is examined and statistically tested for correlation with the given ailment, with a null hypothesis of “no correlation”, would be a LOT of statistical tests; tens of thousands, at least. And at a p-value threshold of .05 (we are willing to risk a false positive rate of 5 percent), we will get a LOT of false positives. On the other hand, if we applied a bit of science prior to the experiment and were able to assign higher prior probabilities to the genes “more likely” to be influential and lower prior probabilities to those unlikely to have much influence, our false positive rates will go down.

Of course, none of this eliminates the need for replication, but Bayesian techniques might cut down the number of experiments we need to replicate.
