College Math Teaching

February 11, 2013

Gee, Math is Hard! But ignore it at your peril…

Via Slate Magazine: (Edward Frenkel)

Imagine a world in which it is possible for an elite group of hackers to install a “backdoor” not on a personal computer but on the entire U.S. economy. Imagine that they can use it to cryptically raise taxes and slash social benefits at will. Such a scenario may sound far-fetched, but replace “backdoor” with the Consumer Price Index (CPI), and you get a pretty accurate picture of how this arcane economics statistic has been used.
Tax brackets, Social Security, Medicare, and various indexed payments, together affecting tens of millions of Americans, are pegged to the CPI as a measure of inflation. The fiscal cliff deal that the White House and Congress reached a month ago was almost derailed by a proposal to change the formula for the CPI, which Matthew Yglesias characterized as “a sneaky plan to cut Social Security and raise taxes by changing how inflation is calculated.” That plan was scrapped at the last minute. But what most people don’t realize is that something similar had already happened in the past. A new book, The Physics of Wall Street by James Weatherall, tells that story: In 1996, five economists, known as the Boskin Commission, were tasked with saving the government $1 trillion. They observed that if the CPI were lowered by 1.1 percent, then a $1 trillion could indeed be saved over the coming decade. So what did they do? They proposed a way to alter the formula that would lower the CPI by exactly that amount!
This raises a question: Is economics being used as science or as after-the-fact justification, much like economic statistics were manipulated in the Soviet Union? More importantly, is anyone paying attention? Are we willing to give government agents a free hand to keep changing this all-important formula whenever it suits their political needs, simply because they think we won’t get the math?

Well, most people probably won’t get the math, and even more won’t be able to if some have their way:

Ironically, in a recent op-ed in the New York Times, social scientist Andrew Hacker suggested eliminating algebra from the school curriculum as an “onerous stumbling block,” and instead teaching students “how the Consumer Price Index is computed.” What seems to be completely lost on Hacker and authors of similar proposals is that the calculation of the CPI, as well as other evidence-based statistics, is in fact a difficult mathematical problem, which requires deep knowledge of all major branches of mathematics including … advanced algebra.
Whether we like it or not, calculating CPI necessarily involves some abstract, arcane body of math. If there were only one item being consumed, then we could easily measure inflation by dividing the unit price of this item today by the unit price a year ago. But if there are two or more items, then knowing their prices is not sufficient.

The article continues; it is well worth reading.
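To make Frenkel’s point about “two or more items” concrete, here is a toy calculation in Python (prices and weights made up; this is not the actual CPI methodology): with two goods, the measured rate of inflation depends on how the goods are weighted, not just on their prices.

# Toy example (made-up numbers; NOT the actual CPI formula): with two goods,
# the measured "inflation rate" depends on the expenditure weights chosen.
p_old = {"bread": 2.00, "gasoline": 3.00}
p_new = {"bread": 2.10, "gasoline": 3.60}   # bread up 5%, gasoline up 20%

def weighted_index(weights):
    # Weighted average of the individual price relatives (new price / old price).
    return sum(w * p_new[item] / p_old[item] for item, w in weights.items())

print(weighted_index({"bread": 0.8, "gasoline": 0.2}))  # 1.08 -> about 8% "inflation"
print(weighted_index({"bread": 0.2, "gasoline": 0.8}))  # 1.17 -> about 17% "inflation"

Choosing and updating those weights is exactly the kind of non-trivial mathematics Frenkel is talking about.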

So why does Andrew Hacker suggest that we eliminate an algebra requirement from the school curriculum?

This debate matters. Making mathematics mandatory prevents us from discovering and developing young talent. In the interest of maintaining rigor, we’re actually depleting our pool of brainpower. I say this as a writer and social scientist whose work relies heavily on the use of numbers. My aim is not to spare students from a difficult subject, but to call attention to the real problems we are causing by misdirecting precious resources.

The toll mathematics takes begins early. To our nation’s shame, one in four ninth graders fail to finish high school. In South Carolina, 34 percent fell away in 2008-9, according to national data released last year; for Nevada, it was 45 percent. Most of the educators I’ve talked with cite algebra as the major academic reason.

Shirley Bagwell, a longtime Tennessee teacher, warns that “to expect all students to master algebra will cause more students to drop out.” For those who stay in school, there are often “exit exams,” almost all of which contain an algebra component. In Oklahoma, 33 percent failed to pass last year, as did 35 percent in West Virginia.

Algebra is an onerous stumbling block for all kinds of students: disadvantaged and affluent, black and white. In New Mexico, 43 percent of white students fell below “proficient,” along with 39 percent in Tennessee. Even well-endowed schools have otherwise talented students who are impeded by algebra, to say nothing of calculus and trigonometry.

California’s two university systems, for instance, consider applications only from students who have taken three years of mathematics and in that way exclude many applicants who might excel in fields like art or history. Community college students face an equally prohibitive mathematics wall. A study of two-year schools found that fewer than a quarter of their entrants passed the algebra classes they were required to take.

“There are students taking these courses three, four, five times,” says Barbara Bonham of Appalachian State University. While some ultimately pass, she adds, “many drop out.”

Another dropout statistic should cause equal chagrin. Of all who embark on higher education, only 58 percent end up with bachelor’s degrees. The main impediment to graduation: freshman math. […]

In other words: math is too hard! 🙂

Well, “gee, I won’t need it!” Actually, math literacy is a prerequisite to understanding many seemingly unrelated things. For example, I am reading The Better Angels of our Nature by Steven Pinker. Though the book’s purpose is to demonstrate that human violence is trending downward and has been trending downward for some time, much of the argument is statistical; being mathematically illiterate would make this book inaccessible.

We need some basic mathematics when we take part in discussions about our economy. For example: how does one determine whether, say, government spending is up or not? It isn’t as simple as counting dollars spent; after all, our population is growing, and we’d expect a country with a larger population to spend more than a country with a smaller one. Then there is gross domestic product; spending is usually correlated with that, hence “government spending graphs” are usually presented in terms of “percent of GDP”. But then what if absolute spending hits a flat stretch and GDP falls, as it does during a recession? That’s right: a smaller denominator makes for a bigger number! You see this concept presented here.
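Here is a toy illustration of that last point (figures made up, in Python): hold spending flat while GDP falls, and spending measured as a percent of GDP rises anyway.

# Toy illustration (made-up figures): flat spending, falling GDP.
spending = 3.5                        # trillions of dollars, held flat
gdp_before, gdp_after = 15.0, 14.0    # GDP falls during a recession

print(f"before the recession: {100 * spending / gdp_before:.1f}% of GDP")  # 23.3%
print(f"after GDP falls:      {100 * spending / gdp_after:.1f}% of GDP")   # 25.0%
# Spending did not increase, yet "spending as a percent of GDP" went up,
# because the denominator shrank.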

But if you are mathematically illiterate, all of this is invisible to you.

Ever see the “jobs graph” that the current Presidential Administration touts?

[Figure: the “bikini graph” of the overall economy, January 2013]

What does it mean? It actually demonstrates a calculus concept: the bars show the monthly change in employment (a rate of change, i.e., a derivative), so a run of positive bars means jobs are being added each month even while total employment (the accumulated change, i.e., an integral of sorts) can still sit below its pre-recession level.
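A toy calculation (made-up numbers, in Python, not the actual jobs data) shows how the two can diverge: every recent monthly change can be positive while the running total is still negative.

# Toy illustration (made-up numbers): monthly job changes are the "derivative";
# the running total of those changes is the "integral."
monthly_changes = [-700, -500, -300, -100, 50, 150, 200, 200, 200]  # thousands of jobs

total = 0
for change in monthly_changes:
    total += change
    print(f"change this month: {change:5d}k   cumulative since start: {total:6d}k")
# The last five changes are positive, yet the cumulative total is still -800k.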

What about risk measurements? You need statistics to determine those; else you run the risk of pushing for an expensive “feel good” policy which, well, really doesn’t help.

Politics? If you can’t read a poll or understand what the polls are saying, you are basically sunk (as were many of our pundits in 2012). Of course, if you can’t understand a collection of polls, you can be a journalist or a pundit, but there is limited opportunity for that.

Science? Example: is evolution too improbable to have occurred? Uh, no. But you need some mathematical literacy to see why.


December 4, 2012

Teaching Linear Regression and ANOVA: using “cooked” data with Excel

During the linear regression section of our statistics course, we do examples with spreadsheets. Many spreadsheets have data processing packages that will do linear regression and provide output which includes things such as confidence intervals for the regression coefficients, the r and r^2 values, and an ANOVA table. I sometimes use this output as motivation to plunge into the study of ANOVA (analysis of variance) and have found “cooked” linear regression examples to be effective teaching tools.

The purpose of this note is NOT to provide an introduction to the type of ANOVA that is used in linear regression (one can find a brief introduction here or, of course, in most statistics textbooks) but to show a simple example using the “random number generation” feature in Excel (with the data analysis pack loaded).

I’ll provide some screen shots to show what I did.

If you are familiar with Excel (or spreadsheets in general), this note will be too slow-paced for you.

Brief Background (informal)

I’ll start the “ANOVA for regression” example with a brief discussion of what we are looking for: suppose we have some data which can be thought of as a set of n points in the plane (x_i, y_i) . Of course the set of y values has a variance, which is calculated as \frac{1}{n-1} \sum^n_{i=1}(y_i - \bar{y})^2 = \frac{1}{n-1}SS , where SS is the total sum of squares.

It turns out that the “sum of squares” SS = \sum^n_{i=1} (y_i - \hat{y_i})^2 + \sum^n_{i=1}(\hat{y_i} - \bar{y})^2 , where the first term is called the “sum of squares error” and the second term is called the “sum of squares regression”; or: SS = SSE + SSR. Here is an informal way of thinking about this: SS is what you use to calculate the “sample variation” of the y values (one divides this term by n-1). This “grand total” can be broken into two parts: the first part is the difference between the actual y values and the y values predicted by the regression line. The second is the difference between the predicted y values (from the regression) and the average y value. Now imagine that the regression slope term \beta_1 were equal to zero; then the SSE term would be, in effect, the SS term, and the SSR term would be, in effect, zero (each \hat{y_i} - \bar{y} would be \bar{y} - \bar{y} = 0 ). If we denote the standard deviation of the errors (the residual standard deviation) by \sigma , then, under the null hypothesis \beta_1 = 0 , the ratio \frac{SSR/\sigma^2}{SSE/((n-2)\sigma^2)} is a ratio of independent chi-square random variables, each divided by its degrees of freedom, and therefore has an F distribution with 1 numerator and n-2 denominator degrees of freedom. If \beta_1 = 0 , or if its estimate were not statistically significant, we’d expect this ratio to be small rather than large.

For example: if the regression line fit the data perfectly, the SSE term would be zero and the SSR term would equal the SS term, since the predicted y values would equal the actual y values. Hence the ratio of (SSR/constant) over (SSE/constant) would be infinite.

That is, the ratio that we use roughly measures the percentage of variation in the y values that comes from the regression line versus the percentage that comes from the error about the regression line. Note that it is customary to denote SSE/(n-2) by MSE and SSR/1 by MSR (Mean Square Error, Mean Square Regression).

The smaller the numerator is relative to the denominator, the less the regression explains.
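For readers who want to check this decomposition numerically, here is a minimal sketch in Python (numpy/scipy; not part of the original Excel workflow, and the data below are made up) that verifies SS = SSE + SSR and computes the F statistic.

# Minimal sketch (made-up data): verify SS = SSE + SSR and compute F = MSR/MSE.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([9.3, 13.8, 19.1, 24.2, 28.7, 34.4, 38.9, 44.1])  # roughly 4 + 5x plus noise
n = len(y)

slope, intercept, r, p, se = stats.linregress(x, y)
y_hat = intercept + slope * x

SS  = np.sum((y - y.mean()) ** 2)       # total sum of squares
SSE = np.sum((y - y_hat) ** 2)          # sum of squares error (residual)
SSR = np.sum((y_hat - y.mean()) ** 2)   # sum of squares regression

MSE = SSE / (n - 2)
MSR = SSR / 1.0
F = MSR / MSE                           # F with 1 and n-2 degrees of freedom
p_value = stats.f.sf(F, 1, n - 2)       # upper-tail probability

print(f"SS = {SS:.3f}, SSE + SSR = {SSE + SSR:.3f}, F = {F:.1f}, p = {p_value:.3g}")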

The following examples using Excel spreadsheets are designed to demonstrate these concepts.

The examples are as follows:

Example one: a perfect regression line with “perfect” normally distributed residuals (remember that the usual hypothesis tests on the regression coefficients depend on the residuals being normally distributed).

Example two: a regression line in which the y-values have a uniform distribution (and are not really related to the x-values at all).

Examples three and four: what happens when the regression line is “perfect” and the residuals are normally distributed, but have greater standard deviations than they do in Example One.

First, I created some x values and then came up with the line y = 4 + 5x . I then used the formula bar, as shown below, to create that “perfect line” of data in the column called “fake”. Excel allows one to copy and paste formulas such as these.

[Screenshot: formula bar used to build the “fake” column]

This is the result after copying:

[Screenshot: the “fake” column after copying the formula]

Now we need to add some residuals to give us a non-zero SSE. This is where the “random number generation” feature comes in handy. One goes to the Data tab and then to “data analysis”

[Screenshot: the Data tab and the “data analysis” option]

and clicks on “random number generation”:

[Screenshot: selecting “random number generation”]

This gives you a dialogue box. I selected “normal distribution”; then I selected “0” for the mean and “1” for the standard deviation. Note: the assumption underlying the confidence interval calculations for the regression parameters is that the residuals are normally distributed and have an expected value of zero.

[Screenshot: random number generation dialogue box]

I selected a column for output (as many rows as x-values) which yields a column:

[Screenshot: the generated column of random numbers]

Now we add the random numbers to the column “fake” to get a simulated set of y values:

[Screenshot: adding the random numbers to the “fake” column]

That yields the column Y as shown in this next screenshot. Also, I used the random number generator to generate random numbers in another column; this time I used the uniform distribution on [0,54]; I wanted the “random set of potential y values” to have roughly the same range as the “fake data” y-values.

[Screenshot: random number generation set to the uniform distribution on [0,54]]

Y holds the “non-random” fake data and YR holds the data for the “Y’s really are randomly distributed” example.

[Screenshot: the Y and YR columns]

I then decided to generate two more “linear” sets of data; in these cases I used the random number generator to generate normal residuals with larger standard deviations and then created Y data to use as data sets; the columns of residuals are labeled “mres” and “lres” and the columns of new data are labeled YN and YVN.

Note: in the “linear trend data” I added the random numbers to the exact linear model y’s labeled “fake” to get the y’s to represent data; in the “random-no-linear-trend” data column I used the random number generator to generate the y values themselves.
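For anyone who would rather build the same “cooked” data outside of Excel, here is a rough Python equivalent of the recipe just described (the column names mirror the spreadsheet; the sample size and the two larger standard deviations are my own choices, since the post does not list the exact values used).

# Sketch of the data-generation recipe in Python (sample size and the larger
# standard deviations are assumptions, not values from the post).
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 21, dtype=float)        # some x values
fake = 4 + 5 * x                         # the exact line y = 4 + 5x (the "fake" column)

Y   = fake + rng.normal(0, 1,  size=x.size)   # linear trend plus N(0,1) residuals
YR  = rng.uniform(0, 54, size=x.size)         # y values drawn at random, no linear trend
YN  = fake + rng.normal(0, 5,  size=x.size)   # same line, noisier residuals ("mres")
YVN = fake + rng.normal(0, 15, size=x.size)   # same line, much noisier residuals ("lres")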

Now it is time to run the regression package itself. In Excel, simple linear regression is easy. Just go to the Data tab, click “data analysis”, then click “regression”:

[Screenshot: selecting “regression” under data analysis]

This gives a dialogue box. Be sure to tell the routine that you have “headers” to your columns of numbers (non-numeric descriptions of the columns) and note that you can select confidence intervals for your regression parameters. There are other things you can do as well.

[Screenshot: the regression dialogue box]

You can select where the output goes. I selected a new data sheet.

[Screenshot: regression output for the “fake data” Y column]

Note the output: the r value is very close to 1, the p-values for the regression coefficients are small, and the calculated regression line (used to generate the \hat{y_i}'s ) is:
y = 3.70 + 5.01x . Also note the ANOVA table: the SSR (sum of squares regression) is very, very large compared to the SSE (sum of squares residual), as expected; the variation in the y values is almost completely explained by the regression line. Hence we obtain an obscenely large F value, and we easily reject the null hypothesis (that \beta_1 = 0 ).
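As an aside, the same regression and ANOVA output can be reproduced outside of Excel; here is a sketch using Python’s statsmodels (again, not part of the original post, with the “cooked” data regenerated in place).

# Sketch (not part of the original post): reproduce the Excel-style regression
# output in Python with statsmodels, on the same kind of "cooked" data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.arange(1, 21, dtype=float)
Y = 4 + 5 * x + rng.normal(0, 1, size=x.size)   # perfect line plus N(0,1) residuals

X = sm.add_constant(x)            # add the intercept column
fit = sm.OLS(Y, X).fit()

print(fit.params)                 # estimated intercept and slope (near 4 and 5)
print(fit.pvalues)                # p-values for the coefficients
print(fit.rsquared, fit.fvalue, fit.f_pvalue)   # r^2 and the ANOVA F test
print(fit.summary())              # full table, comparable to Excel's regression output

Re-running it with a column of uniform random y values in place of Y reproduces the “no relationship” example further below.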

This is what a plot of the calculated regression line with the “fake data” looks like:

[Plot: the calculated regression line with the “fake data”]

Yes, this is unrealistic, but this is designed to demonstrate a concept. Now let’s look at the regression output for the “uniform y values” (y values generated at random from a uniform distribution of roughly the same range as the “regression” y-values):

[Screenshot: regression output for the uniform-random YR data]

Note: r^2 is nearly zero, we fail to reject the null hypothesis that \beta_1 = 0 , and the SSE is roughly equal to the SS; the reason, of course, is that the regression line is close to y = \bar{y} . The calculated F value is well inside the “fail to reject” range, as expected.

A plot looks like:

[Plot: the uniform-random YR data with the fitted line]

The next two examples show what happens when one “cooks” up a regression line with residuals that are normally distributed and have mean equal to zero, but have larger standard deviations. Watch how the r values change, as well as how the SSR and SSE values change. Note how the routine fails to come up with a statistically significant estimate for the “constant” part of the regression line, while the slope coefficient is still estimated easily; this is the effect of residuals with larger standard deviations.

[Screenshot: regression output for the YN data (moderately noisy residuals)]

[Plot: YN data with the calculated regression line]

[Screenshot: regression output for the YVN data (much noisier residuals)]

[Plot: YVN data with the calculated regression line]
