# College Math Teaching

## April 5, 2019

Let’s start with an example from sports: basketball free throws. At certain times in a game, a player is awarded a free throw, where the player stands 15 feet away from the basket and is allowed to shoot to make a basket, which is worth 1 point. In the NBA, a player will take 2 or 3 shots; the rules are slightly different for college basketball.

Each player will have a “free throw percentage” which is the number of made shots divided by the number of attempts. For NBA players, the league average is .672 with a variance of .0074.

Now suppose you want to determine how well a player will do, given, say, a sample of the player’s data. Under classical (aka “frequentist”) statistics, one looks at how well the player has done, calculates the sample percentage ($\hat{p}$) and then determines a confidence interval for the true $p$: using the normal approximation to the binomial distribution, this works out to $\hat{p} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$


Yes, I know..for someone who has played a long time, one has career statistics ..so imagine one is trying to extrapolate for a new player with limited data.

That seems straightforward enough. But what if one samples the player’s shooting during an unusually good or unusually bad streak? Example: former NBA star Larry Bird once made 71 straight free throws…if that were the sample, $\hat{p} = 1$ with variance zero! Needless to say that trend is highly unlikely to continue.
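To see the degenerate interval concretely, here is a minimal sketch of the normal-approximation interval. The function name `wald_interval` and the 90 percent level ($z \approx 1.645$) are my choices for illustration; the second call uses a made-up sample of 62 makes in 71 attempts purely for contrast.

```python
import math

def wald_interval(makes, attempts, z=1.645):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = makes / attempts
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / attempts)
    return p_hat - half_width, p_hat + half_width

# Bird's streak: 71 makes in 71 attempts collapses the interval to a point.
print(wald_interval(71, 71))

# A hypothetical, less extreme sample gives an interval of positive width.
print(wald_interval(62, 71))
```

With 71 makes in 71 attempts the estimated standard error is zero, so the interval has zero width no matter which confidence level is chosen.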

Classical frequentist statistics doesn’t offer a way out but Bayesian Statistics does.

This is a good introduction:

But here is a simple, “rough and ready” introduction. Bayesian statistics uses not only the observed sample, but a proposed distribution for the parameter of interest (in this case, p, the probability of making a free throw). The proposed distribution is called a prior distribution or just prior. That is often labeled $g(p)$

We are dealing with what amounts to 71 Bernoulli trials with success probability $p$ (the league average is .672), so the random variable describing the outcome of each individual shot has probability mass function $p^{y_i}(1-p)^{1-y_i}$ where $y_i = 1$ for a make and $y_i = 0$ for a miss.

Our goal is to calculate what is known as a posterior distribution (or just posterior) which describes $g$ after updating with the data; we’ll call that $g^*(p)$.

How we go about it: use the principles of joint distributions, likelihood functions and marginal distributions to calculate $g^*(p|y_1, y_2...,y_n) = \frac{L(y_1, y_2, ..y_n|p)g(p)}{\int^{\infty}_{-\infty}L(y_1, y_2, ..y_n|p)g(p)dp}$

The denominator “integrates out” p to turn that into a marginal; remember that the $y_i$ are set to the observed values. In our case, all are 1 with $n = 71$.

What works well is to use the beta distribution for the prior. Note: the pdf is $\frac{\Gamma (a+b)}{\Gamma(a) \Gamma(b)} x^{a-1}(1-x)^{b-1}$ and if one uses $p = x$, this works very well. Now because the mean will be $\mu = \frac{a}{a+b}$ and $\sigma^2 = \frac{ab}{(a+b)^2(a+b+1)}$ given the required mean and variance, one can work out $a, b$ algebraically.
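For the record, here is one way the algebra can go (a sketch; $\mu$ and $\sigma^2$ are the required mean and variance). Substituting $a = \mu(a+b)$ and $b = (1-\mu)(a+b)$ into the variance formula gives

$$\sigma^2 = \frac{\mu(1-\mu)}{a+b+1} \quad \Rightarrow \quad a+b = \frac{\mu(1-\mu)}{\sigma^2} - 1,$$

and then

$$a = \mu\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right), \qquad b = (1-\mu)\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right).$$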

Now look at the numerator which consists of the product of a likelihood function and a density function: up to constant $k$, if we set $\sum^n_{i=1} y_i = y$ we get $k p^{y+a-1}(1-p)^{n-y+b-1}$
The denominator: same thing, but $p$ gets integrated out and the constant $k$ cancels; basically the denominator is what makes the fraction into a density function.

So, in effect, we have $kp^{y+a-1}(1-p)^{n-y+b-1}$ which is just a beta distribution with new $a^* =y+a, b^* =n-y + b$.

So, I will spare you the calculation except to say that the NBA prior with $\mu = .672, \sigma^2 =.0074$ leads to $a = 19.355, b= 9.447$

Now the update: $a^* = 71+19.355 = 90.355, b^* = 9.447$.
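The prior fit and the update can be sketched in a few lines of Python. The function names below are hypothetical, and the fit simply matches the mean and variance of the beta distribution; it reproduces the quoted $a, b$ only approximately, presumably because of rounding in the inputs.

```python
def beta_from_moments(mean, var):
    """Solve mean = a/(a+b), var = ab/((a+b)^2 (a+b+1)) for a and b."""
    total = mean * (1 - mean) / var - 1  # this is a + b
    return mean * total, (1 - mean) * total

def update(a, b, makes, misses):
    """Conjugate update: a beta prior plus Bernoulli data gives a beta posterior."""
    return a + makes, b + misses

a, b = beta_from_moments(0.672, 0.0074)
a_post, b_post = update(a, b, makes=71, misses=0)

print("prior:", a, b)
print("posterior:", a_post, b_post)
print("posterior mean:", a_post / (a_post + b_post))
```

The posterior mean lands near .905: pulled well above the league average by the streak, but nowhere near 1.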

What does this look like? (I used this calculator)

That is the prior. Now for the posterior:

Yes, shifted to the right..very narrow as well. The information has changed..but we avoid the absurd contention that $p = 1$ with a confidence interval of zero width.

We can now calculate a “credible interval” of, say, 90 percent, to see where $p$ most likely lies: use the cumulative distribution function to find this out:

And note that $P(p < .85) = .042, P(p < .95) = .958 \rightarrow P(.85 < p < .95) = .916$. In fact, Bird’s lifetime free throw shooting percentage is .882, which is well within this 91.6 percent credible interval, based on sampling from this one freakish streak.
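If you don’t have a beta calculator handy, a rough check is possible with nothing but the Python standard library. This is only a sketch: it integrates the beta density numerically with Simpson’s rule, the function names are mine, and it assumes the posterior parameters $a^* = 90.355, b^* = 9.447$ from above. An online calculator presumably uses a more refined routine.

```python
import math

def beta_pdf(p, a, b):
    """Beta density; returning 0 at the endpoints is valid only for a > 1, b > 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def beta_cdf(x, a, b, steps=20000):
    """P(p < x) by composite Simpson's rule on [0, x]; steps must be even."""
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

a_post, b_post = 90.355, 9.447
lower = beta_cdf(0.85, a_post, b_post)
upper = beta_cdf(0.95, a_post, b_post)
print("P(p < .85) =", lower)
print("P(p < .95) =", upper)
print("P(.85 < p < .95) =", upper - lower)
```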

## August 28, 2018

### Conditional Probability in the news..

Filed under: probability — Tags: , — collegemathteaching @ 1:11 am

I am going to stay in my lane here and not weigh in on a social science issue. But I will comment on this article, which I was alerted to here. This is from the Atlantic article:

When the ACLU report came out in 2017, Dyer told the Fresno Bee the findings of racial disparities were “without merit” but also said that the disproportionate use of force corresponds with high crime populations. At the end of our conversation, Dyer pointed to a printout he brought with him, a list of the department’s “most wanted” people. “We can’t plug in a bunch of white guys,” he said. “You know who’s shooting black people? Black people. It’s black-on-black crime.”

But so-called “black-on-black crime” as an explanation for heightened policing of black communities has been widely debunked. A recent study by the U.S. Department of Justice found that, overwhelmingly, violent crimes are committed by people who are the same race as their victims. “Black-on-black” crime rates, the study found, are comparable to “white-on-white” crime rates.

So, just what did that “recent study” find? I put a link to it, but basically, it said that most white crime victims were the victim of a white criminal and that most black victims were the victim of a black criminal. THAT is their “debunking”. That is a conditional probability: GIVEN that you were a crime victim to begin with, then the perpetrator was probably of the same race.

That says nothing about how likely a white or a black person was to be a crime victim to begin with. From the blog post critiquing the Atlantic article:

What the rest of us mean by “black-on-black crime rate” is the overall rate at which blacks victimize others or the rate at which they are victimized themselves––which, for homicide, has ranged from 6 to 8 times higher than for whites in recent decades. Homicide is the leading cause of death for black boys/men aged 15-19, 20-24, and 25-34, according to the CDC. That fact cannot be said about any other ethnicity/age combination. Blacks only make up 14% of the population. But about half of the murdered bodies that turn up in this country are black bodies (to use a phrase in vogue on the identitarian Left), year in and year out.

In short, blacks are far more likely to be crime victims too. Even the study that the Atlantic article linked to shows this.

Anyhow, that is a nice example of conditional probability.

## November 1, 2016

### A test for the independence of random variables

Filed under: algebra, probability, statistics — Tags: , — collegemathteaching @ 10:36 pm

We are using Mathematical Statistics with Applications (7th Ed.) by Wackerly, Mendenhall and Scheaffer for our calculus based probability and statistics course.

They present the following Theorem (5.5 in this edition)

Let $Y_1$ and $Y_2$ have a joint density $f(y_1, y_2)$ that is positive if and only if $a \leq y_1 \leq b$ and $c \leq y_2 \leq d$ for constants $a, b, c, d$ and $f(y_1, y_2)=0$ otherwise. Then $Y_1, Y_2$ are independent random variables if and only if $f(y_1, y_2) = g(y_1)h(y_2)$ where $g(y_1), h(y_2)$ are non-negative functions of $y_1, y_2$ alone (respectively).

Ok, that is fine as far as it goes, but then they apply the above theorem to the joint density function: $f(y_1, y_2) = 2y_1$ for $(y_1,y_2) \in [0,1] \times [0,1]$ and 0 otherwise. Do you see the problem? Technically speaking, the theorem doesn’t apply: $f(y_1, y_2) = 0$ along the edge $y_1 = 0$ of the square, so it is NOT true that $f(y_1, y_2)$ is positive if and only if $(y_1, y_2)$ is in some closed rectangle.

It isn’t that hard to fix, I don’t think.

Now there is the density function $f(y_1, y_2) = y_1 + y_2$ on $[0,1] \times [0,1]$ and zero elsewhere. Here, $Y_1, Y_2$ are not independent.

But how does one KNOW that $y_1 + y_2 \neq g(y_1)h(y_2)$?

I played around a bit and came up with the following:

Statement: $\sum^{n}_{i=1} a_i(x_i)^{r_i} \neq f_1(x_1)f_2(x_2).....f_n(x_n)$ (note: assume $n \geq 2$, $r_i \in \{1,2,3,....\}$, $a_i \neq 0$)

Proof of the statement: suppose equality held. Substitute $x_2 =x_3 = x_4....=x_n = 0$ into both sides to obtain $a_1 x_1^{r_1} = f_1(x_1)(f_2(0)f_3(0)...f_n(0))$. The left hand side is not identically zero, so none of $f_2(0), f_3(0), ..., f_n(0)$ can be zero; otherwise function equality would be impossible. The same argument with $x_1 = x_3 = x_4....=x_n = 0$ shows that $a_2 x_2^{r_2} = f_2(x_2)f_1(0)f_3(0)f_4(0)...f_n(0)$, so $f_1(0) \neq 0$ as well.

Now substitute $x_1=x_2 =x_3 = x_4....=x_n = 0$ into both sides and get $0 = f_1(0)f_2(0)f_3(0)f_4(0)...f_n(0)$ but no factor on the right hand side can be zero.
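A quick numerical sanity check is also possible, using a different (and standard) observation rather than the proof above: if $f(y_1, y_2) = g(y_1)h(y_2)$, then $f(s,t)f(u,v) = f(s,v)f(u,t)$ for every pair of points. The helper name below is hypothetical, and passing the check on a finite grid does not prove that a function factors; only a failure is conclusive.

```python
from itertools import product

def passes_rectangle_test(f, points, tol=1e-12):
    """Check f(s,t) f(u,v) == f(s,v) f(u,t) for all pairs of sample points."""
    for (s, t), (u, v) in product(points, repeat=2):
        if abs(f(s, t) * f(u, v) - f(s, v) * f(u, t)) > tol:
            return False
    return True

# A 5 x 5 grid of sample points on the unit square.
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]

print(passes_rectangle_test(lambda y1, y2: 2 * y1, grid))   # product form
print(passes_rectangle_test(lambda y1, y2: y1 + y2, grid))  # not a product
```

For $y_1 + y_2$ the points $(0,0)$ and $(1,1)$ already give $0 \cdot 2$ on one side and $1 \cdot 1$ on the other, which mirrors the substitution of zeros in the proof.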

This is hardly profound but I admit that I’ve been negligent in pointing this out to classes.

## January 6, 2016

### On all but a set of measure zero

Filed under: analysis, physics, popular mathematics, probability — Tags: — collegemathteaching @ 7:36 pm

This blog isn’t about cosmology or about arguments over religion. But it is unusual to hear “on all but a set of measure zero” in the middle of a pop-science talk: (2:40-2:50)

## September 2, 2014

### Using convolutions and Fourier Transforms to prove the Central Limit Theorem

Filed under: probability — Tags: , , — collegemathteaching @ 5:40 pm

I’ve used the presentation in our Probability and Statistics text; it is appropriate given that many of our students haven’t seen the Fourier Transform. But this presentation is excellent.

Upshot: use the convolution to derive the density function for $S_n = X_1 + X_2 + ....X_n$ (independent, identically distributed random variables of finite variance), assume mean is zero, variance is 1 and divide $S_n$ by $\sqrt{n}$ to obtain the variance of the sum to be 1. Then use the Fourier transform on the whole thing (the normalized version) to turn convolution into products, use the definition of Fourier transform and use the Taylor series for the $e^{i 2 \pi x \frac{s}{\sqrt{n}}}$ terms, discard the high order terms, take the limit as $n$ goes to infinity and obtain a Gaussian, which, of course, inverse Fourier transforms to another Gaussian.
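Here is a small numerical illustration of the limit step. Two assumptions that are mine and not from the linked presentation: I use the characteristic function convention $E[e^{itX}]$ (no $2\pi$ in the exponent), and I take the $X_i$ uniform on $[-\sqrt{3}, \sqrt{3}]$ so that the mean is 0 and the variance is 1. Since the transform turns the convolution into a product, the transform of $S_n/\sqrt{n}$ is just the $n$th power of a rescaled single transform.

```python
import math

def uniform_cf(t):
    """E[exp(i t X)] for X uniform on [-sqrt(3), sqrt(3)]: sin(sqrt(3) t) / (sqrt(3) t)."""
    x = math.sqrt(3) * t
    return 1.0 if x == 0 else math.sin(x) / x

def normalized_sum_cf(t, n):
    """Transform of S_n / sqrt(n): convolution becomes an n-fold product."""
    return uniform_cf(t / math.sqrt(n)) ** n

gaussian_limit = math.exp(-0.5)  # exp(-t^2 / 2) evaluated at t = 1
for n in (1, 10, 100, 1000):
    print(n, normalized_sum_cf(1.0, n), gaussian_limit)
```

The printed values approach $e^{-1/2}$ as $n$ grows, which is the Gaussian transform showing up once the high order Taylor terms stop mattering.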

## May 22, 2013

### In the news….and THINK before you reply to an article. :-)

Ok, a mathematician who is known to be brilliant self-publishes (on the internet) a dense, 512 page proof of a famous conjecture. So what happens?

The Internet exploded. Within days, even the mainstream media had picked up on the story. “World’s Most Complex Mathematical Theory Cracked,” announced the Telegraph. “Possible Breakthrough in ABC Conjecture,” reported the New York Times, more demurely.

On MathOverflow, an online math forum, mathematicians around the world began to debate and discuss Mochizuki’s claim. The question which quickly bubbled to the top of the forum, encouraged by the community’s “upvotes,” was simple: “Can someone briefly explain the philosophy behind his work and comment on why it might be expected to shed light on questions like the ABC conjecture?” asked Andy Putman, assistant professor at Rice University. Or, in plainer words: I don’t get it. Does anyone?

The problem, as many mathematicians were discovering when they flocked to Mochizuki’s website, was that the proof was impossible to read. The first paper, entitled “Inter-universal Teichmuller Theory I: Construction of Hodge Theaters,” starts out by stating that the goal is “to establish an arithmetic version of Teichmuller theory for number fields equipped with an elliptic curve…by applying the theory of semi-graphs of anabelioids, Frobenioids, the etale theta function, and log-shells.”

This is not just gibberish to the average layman. It was gibberish to the math community as well.

[…]

Here is the deal: reading a mid level mathematics research paper is hard work. Refereeing it is even harder work (really checking the proofs) and it is hard work that is not really going to result in anything positive for the person doing the work.

Of course, if you referee for a journal, you do your best because you want YOUR papers to get good refereeing. You want them fairly evaluated and if there is a mistake in your work, it is much better for the referee to catch it than to look like an idiot in front of your community.

But this work was not submitted to a journal. Interesting, no?

Of course, were I to do this, it would be ok to dismiss me as a crank since I haven’t given the mathematical community any reason to grant me the benefit of the doubt.

And speaking of idiots; I made a rather foolish remark in the comments section of this article by Edward Frenkel in Scientific American. The article itself is fine: it is about the Abel prize and the work by Pierre Deligne which won this prize. The work deals with what one might call the geometry of number theory. The idea: if one wants to look for solutions to an equation, say, $x^2 + y^2 = 1$ one gets different associated geometric objects which depend on “what kind of numbers” we allow for $x, y$. For example, if $x, y$ are integers, we get a 4 point set. If $x, y$ are real numbers, we get a circle in the plane. Then Frenkel remarked:

such as $x^2 + y^2 = 1$, we can look for its solutions in different domains: in the familiar numerical systems, such as real or complex numbers, or in less familiar ones, like natural numbers modulo N. For example, solutions of the above equation in real numbers form a circle, but solutions in complex numbers form a sphere.

The comment that I bolded didn’t make sense to me; I did a quick look up and reviewed that $|z_1|^2 + |z_2|^2 = 1$ actually forms a 3-sphere which lives in $R^4$. Note: I added in the “absolute value” signs which were not there in the article.

This is easy to see: if $z_1 = x_1 + y_1 i, z_2 = x_2 + y_2i$ then $|z_1|^2 + |z_2|^2 = 1$ implies that $x_1^2 + y_1^2 + x_2^2 + y_2^2 = 1$. But that isn’t what was in the article.

Frenkel made a patient, kind response …and as soon as I read “equate real and imaginary parts” I winced with self-embarrassment.

Of course, he admits that the complex version of this equation really yields a PUNCTURED sphere; basically a copy of $R^2$ in $R^4$.

Just for fun, let’s look at this beast.

Real part of the equation: $x_1^2 + x_2^2 - (y_1^2 + y_2^2) = 1$
Imaginary part: $x_1y_1 + x_2y_2 = 0$ (for you experts: this is a real algebraic variety in 4-space).

Now let’s look at the intersection of this surface in 4 space with some coordinate planes:
Clearly this surface misses the $x_1=x_2 = 0$ plane (look at the real part of the equation).
Intersection with the $y_1 = y_2 = 0$ plane yields $x_1^2+ x_2^2 = 1$ which is just the unit circle.
Intersection with the $y_1 = x_2 = 0$ plane yields the hyperbola $x_1^2 - y_2^2 = 1$
Intersection with the $y_2 = x_1 = 0$ plane yields the hyperbola $x_2^2 - y_1^2 = 1$
Intersection with the $x_1 = y_1 = 0$ plane yields two isolated points: $x_2 = \pm 1$
Intersection with the $x_2 = y_2 = 0$ plane yields two isolated points: $x_1 = \pm 1$
(so we know that this object is non-compact; this is one reason the “sphere” remark puzzled me)
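The non-compactness is easy to check numerically. In this sketch (function names are mine) I parametrize the hyperbola in the $y_1 = x_2 = 0$ plane by $x_1 = \cosh t, y_2 = \sinh t$, that is, $z_1 = \cosh t$ and $z_2 = i \sinh t$, and confirm that these points satisfy $z_1^2 + z_2^2 = 1$ while running off to infinity.

```python
import math

def on_curve(z1, z2, tol=1e-9):
    """True if (z1, z2) satisfies z1^2 + z2^2 = 1 up to rounding error."""
    return abs(z1 * z1 + z2 * z2 - 1) < tol

def hyperbola_point(t):
    """Point with y_1 = x_2 = 0, x_1 = cosh t, y_2 = sinh t."""
    return complex(math.cosh(t), 0.0), complex(0.0, math.sinh(t))

for t in (0, 1, 2, 5):
    z1, z2 = hyperbola_point(t)
    print(t, on_curve(z1, z2), abs(z1))
```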

Science and the media
This Guardian article points out that it is hard to do good science reporting that goes beyond information entertainment. Of course, one of the reasons is that many “groundbreaking” science findings turn out to be false, even if the scientists in question did their work carefully. If this sounds strange, consider the following “thought experiment”: suppose that there are, say, 1000 factors that one can study and only 1 of them is relevant to the issue at hand (say, one place on the genome might indicate a genuine risk factor for a given disease, and it makes sense to study 1000 different places). So you take one at random, run a statistical test at $p = .05$ and find statistical significance at $p = .05$. So, if we get a “positive” result from an experiment, what is the chance that it is a true positive? (assume 95 percent accuracy)

So let P represent a positive outcome of a test, N a negative outcome, T means that this is a genuine factor, and F that it isn’t.
Note: $P(T) = .001, P(F) = .999$, $P(P|T) = .95, P(N|T) = .05, P(P|F) = .05, P(N|F) = .95$. It follows $P(P) = P(T)P(P|T) + P(F)P(P|F) = (.001)(.95) + (.999)(.05) = .0509$

So we seek: the probability that a result is true given that a positive test occurred: we seek $P(T|P) =\frac{P(P|T)P(T)}{P(P)} = \frac{(.95)(.001)}{.0509} = .018664$. That is, given a test is 95 percent accurate, if one is testing for something very rare, there is only about a 2 percent chance that a positive test is from a true factor, even if the test is done correctly!
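The same computation as a short Python sketch, in case you want to play with the base rate; the function and parameter names are mine.

```python
def posterior_true_given_positive(prior_true, sensitivity, false_positive_rate):
    """Bayes' rule: P(T|P) = P(P|T) P(T) / P(P)."""
    p_positive = prior_true * sensitivity + (1 - prior_true) * false_positive_rate
    return prior_true * sensitivity / p_positive

# 1 relevant factor in 1000, a test that is right 95 percent of the time.
print(posterior_true_given_positive(0.001, 0.95, 0.05))

# Same test, but 1 relevant factor in 10.
print(posterior_true_given_positive(0.1, 0.95, 0.05))
```

Raising the prior is what rescues the test: the accuracy never changed, only the rarity of what is being tested for.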

## March 18, 2013

### Odds and the transitive property

Filed under: media, movies, popular mathematics, probability — Tags: — collegemathteaching @ 9:51 pm

I got this from Mano Singham’s blog: he is a physics professor who mostly writes about social issues. But on occasion he writes about physics and mathematics, as he does here. In this post, he talks about the transitive property.

Most students are familiar with this property; roughly speaking it says that if one has a partially ordered set and $a \le b$ and $b \le c$ then $a \le c$. Those who have studied the real numbers might be tempted to greet this concept with a shrug. However in more complicated cases, the transitive property simply doesn’t hold, even when it makes sense to order things. Here is an example: consider the following sets of dice:

What we have going here: Red beats green 4 out of 6 times. Green beats blue 4 out of 6 times. Blue beats red 4 out of 6 times. All the colored dice tie the “normal” die. Yet, the means of the numbers are all the same.

Note: that this can happen is probably not a surprise to sports fans; for example, in boxing: Ken Norton beat Muhammad Ali (the first time), George Foreman destroyed Ken Norton, and Ali beat Foreman in a classic. Of course things like this happen in sports like basketball too, but there a team doesn’t always play its best or its worst.

But this dice example works so beautifully because it is theoretically impossible, by design, for the dice to obey a transitive ordering relation.
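If you want to check a set of dice yourself, exhaustive enumeration over the 36 face pairs is enough. The faces below are an assumption on my part: a commonly cited nontransitive set in which every die has mean 3.5, not necessarily the dice described above, so the win rates it produces need not match the “4 out of 6” figures.

```python
from fractions import Fraction
from itertools import product

def win_probability(die_a, die_b):
    """Exact probability that die_a shows a strictly larger face than die_b."""
    wins = sum(1 for a, b in product(die_a, die_b) if a > b)
    return Fraction(wins, len(die_a) * len(die_b))

# Hypothetical faces; each die sums to 21, like a normal die.
red = [1, 4, 4, 4, 4, 4]
green = [3, 3, 3, 3, 3, 6]
blue = [2, 2, 2, 5, 5, 5]
normal = [1, 2, 3, 4, 5, 6]

print("red beats green:", win_probability(red, green))
print("green beats blue:", win_probability(green, blue))
print("blue beats red:", win_probability(blue, red))
for name, die in (("red", red), ("green", green), ("blue", blue)):
    print(name, "vs normal:", win_probability(die, normal), win_probability(normal, die))
```

Each die in the cycle wins more than half the time against the next one, and each colored die wins and loses equally often against the normal die.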

Movies
Since the wife has been gone on a trip, I’ve watched some old movies at night. One of them was the Cincinnati Kid, which features this classic scene:

Basically, the Kid has a full house, but ends up losing to a straight flush. Yes, the odds of the ten cards (in stud poker) ending up in “one hand a full house, the other a straight flush” are extremely remote. I haven’t done the calculations but this assertion seems plausible:

Holden states that the chances of both such hands appearing in one deal are “a laughable” 332,220,508,619 to 1 (more than 332 billion to 1 against) and goes on: “If these two played 50 hands of stud an hour, eight hours a day, five days a week, the situation would arise about once every 443 years.”

But there is one remark from this Wikipedia article that seems interesting:

The unlikely nature of the final hand is discussed by Anthony Holden in his book Big Deal: A Year as a Professional Poker Player, “the odds against any full house losing to any straight flush, in a two-handed game, are 45,102,781 to 1,”

I haven’t done the calculation but that seems plausible. But, here is the real point to the final scene: the Kid knows that he has a full house but The Man is showing 8, 9, 10, Q of diamonds. He knows that the only “down” card that can beat him is the J of diamonds but he knows that he has 3 10’s, 2 A’s. So there are, to his knowledge, $52 - 9 = 43$ cards out, and only 1 that can beat him. So the Kid’s probability of winning is $\frac{42}{43}$ which are pretty strong odds, but they are not of the “million to one” variety.

## March 3, 2013

### Mathematics, Statistics, Physics

Filed under: applications of calculus, media, news, physics, probability, science, statistics — collegemathteaching @ 11:00 pm

This is a fun little post about the interplay between physics, mathematics and statistics (Brownian Motion)

Here is a teaser video:

The article itself has a nice animation showing the effects of a Poisson process: one will get some statistical clumping in areas rather than uniform spreading.

Treat yourself to the whole article; it is entertaining.

## January 17, 2013

### Enigma Machines: some of the elementary math

Note: this type of cipher is really an element of the group $S_{26}$, the symmetric group on 26 letters. Never allowing a letter to go to itself reduced the possibilities to products of cycles that covered all of the letters.

### Math and Probability in Pinker’s book: The Better Angels of our Nature

Filed under: elementary mathematics, media, news, probability, statistics — Tags: , — collegemathteaching @ 1:01 am

I am reading The Better Angels of our Nature by Steven Pinker. Right now I am a little over 200 pages into this 700 page book; it is very interesting. The idea: Pinker is arguing that humans, over time, are becoming less violent. One interesting fact: right now, a random human is less likely to die violently than ever before. Yes, the last century saw astonishing genocides and two world wars. But: when one takes into account how many people there are in the world (2.5 billion in 1950, 6 billion right now) World War II, as horrific as it was, only ranks 9th on the list of deaths due to deliberate human acts (genocides, wars, etc.) in terms of “percentage of the existing population killed in the event”. (here is Matthew White’s site)

I have a ways to go in the book, but it is one I am eager to keep reading.

The purpose of this post is to talk about a bit of probability theory that occurs in the early part of the book. I’ll introduce it this way:

Suppose I select a 28 day period. On each day, say starting with Monday of the first week, I roll a fair die one time. I note when a “1” is rolled. Suppose my first “1” occurs Wednesday of the first week. Then answer this: “what is the most likely day that I obtain my NEXT “1”, or are all days equally likely?”

Yes, it is true that on any given day, the probability of rolling a “1” is 1/6. But remember my question: “what day is most likely for the NEXT one?” If you have had some probability, the distribution you want to use is the geometric distribution, starting on Thursday of the first week.

So you can see, the most likely day for the next “1” is Thursday! Well, why not, say, Friday? Well, if Friday is the next 1, then this means that you got “any number but 1” on Thursday followed by a “1” on Friday, and the probability of that is $\frac{5}{6} \frac{1}{6} = \frac{5}{36}$. The probability of the next one being Saturday is $\frac{5}{6} \frac{5}{6} \frac{1}{6} = \frac{25}{216}$ and so on.
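The geometric probabilities are easy to tabulate exactly. A minimal sketch (the function name is mine), where day 1 is the Thursday right after the first “1”:

```python
from fractions import Fraction

def next_one_pmf(k, p=Fraction(1, 6)):
    """Probability that the next "1" arrives on day k: k-1 failures, then a success."""
    return (1 - p) ** (k - 1) * p

days = ["Thursday", "Friday", "Saturday", "Sunday", "Monday"]
for k, day in enumerate(days, start=1):
    print(day, next_one_pmf(k), float(next_one_pmf(k)))
```

The probabilities shrink by a factor of 5/6 each day, so the very next day is always the single most likely one.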

The point: if one is studying the distribution of events that have a Poisson distribution (probability $p$) on a given time period, the overall distribution of such events is likely to show up “clumped” rather than evenly spaced. For an example of this happening in sports, check this out.

Anyway, Pinker applies this principle to the outbreak of wars, mass killings and the like.
