College Math Teaching

September 9, 2014

Chebyshev polynomials: a topological viewpoint

Chebyshev (or Tchebycheff) polynomials are a class of mutually orthogonal polynomials (with respect to the inner product: $f \cdot g = \int^1_{-1} \frac{1}{\sqrt{1 - x^2}} f(x)g(x) dx$) defined on the interval $[-1, 1]$. Yes, I realize that this is an improper integral, but it does converge in our setting.

These are used in approximation theory; here are a couple of uses:

1. The roots of the Chebyshev polynomial can be used to find the values of $x_0, x_1, x_2, ...x_k \in [-1,1]$ that minimize the maximum of $|(x-x_0)(x-x_1)(x-x_2)...(x-x_k)|$ over the interval $[-1,1]$. This is important in minimizing the error of the Lagrange interpolation polynomial.

2. The Chebyshev polynomial can be used to adjust an approximating Taylor polynomial $P_n$ to increase its accuracy (away from the center of expansion) without increasing its degree.

The purpose of this note isn’t to discuss the utility but rather to discuss an interesting property that these polynomials have. The Wiki article on these polynomials is reasonably good for that purpose.

Let’s discuss the polynomials themselves. They are defined for all integers $n \geq 0$ as follows:

$T_n(x) = \cos(n \arccos(x))$. Now, it is an interesting exercise in trig identities to discover that these ARE polynomials to begin with; one shows this to be true for, say, $n \in \{0, 1, 2\}$ by using angle addition formulas and the standard calculus resolution of expressions like $\sin(\arccos(x))$. Then one discovers the relation $T_{n+1} = 2xT_n - T_{n-1}$ to calculate the rest.
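The agreement of the recurrence with the trig definition is easy to check numerically. Here is a short Python sketch (my own, not from the original post) that evaluates $T_n$ both ways:

```python
import math

def cheb_recurrence(n, x):
    """Evaluate T_n(x) via T_{n+1} = 2x T_n - T_{n-1}, with T_0 = 1 and T_1 = x."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def cheb_trig(n, x):
    """Evaluate T_n(x) via the definition cos(n arccos(x)), valid on [-1, 1]."""
    return math.cos(n * math.acos(x))

# The two definitions agree on [-1, 1]:
for n in range(6):
    for x in [-1.0, -0.7, 0.0, 0.3, 1.0]:
        assert abs(cheb_recurrence(n, x) - cheb_trig(n, x)) < 1e-12
```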

The $\cos(n \arccos(x))$ definition allows for some properties to be calculated with ease: the zeros occur when $\arccos(x) = \frac{\pi}{2n} + \frac{k \pi}{n}$ and the first derivative has zeros where $\arccos(x) = \frac{k \pi}{n}$; these ALL correspond either to endpoint max/mins at $x = 1, x = -1$ or to local maxes and mins whose $y$ values are also $\pm 1$. Here are the graphs of $T_4(x)$ and $T_5(x)$.

Now here is a key observation: the graph of a $T_n$ forms $n$ spanning arcs in the square $[-1, 1] \times [-1,1]$ and separates the square into $n+1$ regions. So, if there is some other function $f$ whose graph is a connected, piecewise smooth arc that is transverse to the graph of $T_n$ that both spans the square from $x = -1$ to $x = 1$ and that stays within the square, that graph must have $n$ points of intersection with the graph of $T_n$.

Now suppose that $f$ is the graph of a polynomial of degree $n$ whose leading coefficient is $2^{n-1}$ (the leading coefficient of $T_n$), whose graph stays completely in the square $[-1, 1] \times [-1,1]$, and with $f \neq T_n$. Then the polynomial $Q(x) = T_n(x) - f(x)$ is not identically zero and has degree at most $n-1$ (because the leading terms cancel in the subtraction), yet it has $n$ roots (the places where the graphs cross). That is clearly impossible; hence the only such polynomial is $f(x) = T_n(x)$.

This result is usually stated in the following way: $T_n(x)$ is normalized to be monic (have leading coefficient 1) by dividing the polynomial by $2^{n-1}$, and then it is pointed out that the normalized $T_n(x)$ is the unique monic polynomial of degree $n$ whose graph stays within $[-\frac{1}{2^{n-1}}, \frac{1}{2^{n-1}}]$ for all $x \in [-1,1]$. All other monic polynomials of degree $n$ have a graph that leaves that box at some point over $[-1,1]$.
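A quick numerical sanity check of that minimality statement (again my own sketch, not part of the original argument): the monic polynomial $T_n(x)/2^{n-1}$ has sup norm exactly $2^{1-n}$ on $[-1,1]$, while another natural monic candidate such as $x^n$ does worse:

```python
def cheb(n, x):
    """T_n(x) via the recurrence T_{n+1} = 2x T_n - T_{n-1}."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def sup_norm(f, num_points=2001):
    """Approximate the sup norm of f over [-1, 1] on a uniform grid."""
    return max(abs(f(-1.0 + 2.0 * k / (num_points - 1))) for k in range(num_points))

n = 5
monic_cheb_norm = sup_norm(lambda x: cheb(n, x) / 2 ** (n - 1))
power_norm = sup_norm(lambda x: x ** n)

assert abs(monic_cheb_norm - 2 ** (1 - n)) < 1e-9  # the norm is 1/2^{n-1} = 1/16
assert power_norm > monic_cheb_norm                # x^5 leaves the smaller box
```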

Of course, one can easily cook up analytic functions which don’t leave the box but these are not monic polynomials of degree $n$.

March 14, 2014

Approximating the derivative and round off error: class demonstration

In numerical analysis we are covering “approximate differentiation”. One of the formulas we are using: $f'(x_0) = \frac{f(x_0 + h) -f(x_0 -h)}{2h} - \frac{h^2}{6} f^{(3)}(\zeta)$ where $\zeta$ is some number in $[x_0 -h, x_0 + h]$; of course we assume that the third derivative is continuous in this interval.

The derivation can be done in a couple of ways: one can either use the degree 2 Lagrange polynomial through $x_0-h, x_0, x_0 + h$ and differentiate or one can use the degree 2 Taylor polynomial expanded about $x = x_0$ and use $x = x_0 \pm h$ and solve for $f'(x_0)$; of course one runs into some issues with the remainder term if one uses the Taylor method.
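For the record, here is the Taylor-method sketch (the Lagrange route gives the same formula). Expand about $x_0$ with the degree 2 Taylor polynomial and its remainder:

$f(x_0 + h) = f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(x_0) + \frac{h^3}{6} f^{(3)}(\zeta_+)$

$f(x_0 - h) = f(x_0) - h f'(x_0) + \frac{h^2}{2} f''(x_0) - \frac{h^3}{6} f^{(3)}(\zeta_-)$

Subtracting kills the even-order terms: $f(x_0 + h) - f(x_0 - h) = 2h f'(x_0) + \frac{h^3}{6}(f^{(3)}(\zeta_+) + f^{(3)}(\zeta_-))$. Divide by $2h$, then use the intermediate value theorem on the continuous third derivative to replace $\frac{1}{2}(f^{(3)}(\zeta_+) + f^{(3)}(\zeta_-))$ with $f^{(3)}(\zeta)$ for some $\zeta \in [x_0 - h, x_0 + h]$; solving for $f'(x_0)$ yields the $-\frac{h^2}{6} f^{(3)}(\zeta)$ error term.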

But that isn’t the issue that I want to talk about here.

The issue: “what should we use for $h$?” In theory, we should get a better approximation if we make $h$ as small as possible. But if we are using a computer to make a numerical evaluation, we have to concern ourselves with round off error. So what we actually calculate will NOT be $\frac{f(x_0 + h) - f(x_0 - h)}{2h}$ but rather $\frac{\hat{f}(x_0 + h) - \hat{f}(x_0 - h)}{2h}$, where $\hat{f}(x_0 \pm h) = f(x_0 \pm h) - e(x_0 \pm h)$ and $e(x_0 \pm h)$ is the round off error incurred in calculating the function at $x = x_0 \pm h$ (respectively).

So, it is an easy algebraic exercise to show that:

$f'(x_0) - \frac{\hat{f}(x_0 + h) - \hat{f}(x_0 - h)}{2h} = -\frac{h^2}{6} f^{(3)}(\zeta) + \frac{e(x_0 + h) - e(x_0 - h)}{2h}$ and the magnitude of the actual error is bounded by $\frac{h^2 M}{6} + \frac{\epsilon}{h}$ where $M = \max\{|f^{(3)}(\eta)|\}$ on some small neighborhood of $x_0$ and $\epsilon$ is a bound on the round-off error of representing $f(x_0 \pm h)$.

It is an easy calculus exercise (“take the derivative and set equal to zero and check concavity” easy) to see that this error bound is a minimum when $h = (\frac{3\epsilon}{M})^{\frac{1}{3}}$.
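As a quick sanity check (a Python sketch of my own, not from the post): with double precision, $\epsilon \approx 10^{-16}$, and $M = 1$ works for $f(x) = e^x$ near $0$, so the optimal $h$ should be on the order of $10^{-5}$:

```python
# Minimize the error bound h^2*M/6 + eps/h; calculus gives h = (3*eps/M)^(1/3).
eps = 1.0e-16   # rough bound on double-precision round-off (assumed)
M = 1.0         # bound on |f'''| for f(x) = e^x near x = 0

h_opt = (3.0 * eps / M) ** (1.0 / 3.0)
print(h_opt)  # roughly 6.7e-06, i.e. on the order of 10^-5
assert 1e-6 < h_opt < 1e-4
```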

Now, of course, it is helpful to get a “ball park” estimate for what $\epsilon$ is. Here is one way to demonstrate this to the students: solve for $\epsilon$ to obtain $\epsilon = \frac{M h^3}{3}$ and then do some experimentation to estimate $\epsilon$.

That is: experimentally find the best $h$ by using this “3 point midpoint” estimate on a known derivative near a value of $x_0$ for which $M$ (a bound for the 3rd derivative) is easy to obtain, and then use that $h$ to obtain an educated guess for $\epsilon$.

Here are a couple of examples: one uses Excel and one uses MATLAB. I used $f(x) = e^x$ at $x = 0$; of course $f'(0) = 1$ and $M = 1$ is reasonable here (just a tiny bit off). I did the 3-point estimation calculation for various values of $h$ and saw where the error started to increase again.
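The same experiment is easy to re-create in Python (the post used Excel and MATLAB; this equivalent sketch is my own): sweep $h = 10^{-1}, 10^{-2}, \dots$ and watch the error of the 3-point estimate of $f'(0)$ for $f(x) = e^x$ fall and then rise again:

```python
import math

def central_diff(f, x0, h):
    """3-point midpoint estimate of f'(x0)."""
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

errors = {}
for k in range(1, 13):          # h = 10^-1 down to 10^-12
    h = 10.0 ** (-k)
    errors[k] = abs(central_diff(math.exp, 0.0, h) - 1.0)  # true f'(0) = 1

best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])

# Truncation error dominates for large h, round-off for tiny h;
# the minimum sits near h = 10^-5 or 10^-6, as in the Excel experiment.
assert 4 <= best_k <= 7
assert errors[best_k] < errors[1] and errors[best_k] < errors[12]
```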

Here is the Excel output for $f(x) = e^x$ at $x = 0$ and at $x = 1$, respectively. In the first case, use $M = 1$ and in the second, $M = e$.

In the $x = 0$ case, we see that the error starts to increase again at about $h = 10^{-5}$; the same sort of thing appears to happen for $x = 1$.

So, in the first case, $\epsilon$ is about $\frac{1}{3} \times (10^{-5})^3 = 3.333 \times 10^{-16}$; it is roughly $10^{-15}$ at $x =1$.

Note: one can also approach $h$ by using powers of $\frac{1}{2}$ instead; something interesting happens in the $x = 0$ case; the $x = 1$ case gives results similar to what we’ve shown. Reason (I think): 1 is easy to represent in base 2 and the powers of $\frac{1}{2}$ can be represented exactly.

Now we turn to MATLAB and here we do something slightly different: we graph the error for different values of $h$. Since the values of $h$ are very small, we use a $-\log_{10}$ scale (approximating $f'(0)$ for $f(x) = e^x$): the MATLAB commands build a vector $H$ of values of $h$ and a vector $E$ of the corresponding errors; by design, $N = -\log_{10}(H)$. The graph looks like:

Now, the small error scale makes things hard to read, so we turn to using the log scale, this time on the $y$ axis: let $LE = -\log_{10}(E)$ and run plot(N, LE):

and sure enough, you can see where the peak is: at about $h = 10^{-5}$, which matches the Excel result.

July 12, 2013

An example to apply Bayes’ Theorem and multivariable calculus

I’ve thought a bit about the breast cancer research results and found a nice “application” exercise that might help teach students about Bayes Theorem, two-variable maximizing, critical points, differentials and the like.

I’ve been interested in the mathematics and statistics of the breast cancer screening issue mostly because it provided a real-life application of statistics and Bayes’ Theorem.

So right now, for women between 40-49, traditional mammograms are about 80 percent accurate in the sense that, if a woman who really has breast cancer gets a mammogram, the test will catch it about 80 percent of the time. The false positive rate is about 8 percent: if 100 women who do NOT have breast cancer get a mammogram, about 8 of the mammograms will register a “positive”.
Since the breast cancer rate for women in this age group is about 1.4 percent, there will be many more false positives than true positives; in fact a woman in this age group who gets a “positive” first mammogram has about a 16 percent chance of actually having breast cancer. I talk about these issues here.

So, suppose you desire a “more accurate test” for breast cancer. The question is this: what do you mean by “more accurate”?

1. If “more accurate” means “giving the right answer more often”, then that is pretty easy to do.
Current testing is going to be wrong some of the time: if C means cancer, N means “doesn’t have cancer”, P means “positive test” and M means “negative test”, then the probability of being wrong is:
$P(M|C)P(C) + P(P|N)P(N) = .2(.014) + .08(.986) = .08168$. On the other hand, if you just declared EVERYONE to be “cancer free”, you’d be wrong only 1.4 percent of the time! So clearly minimizing the probability of being wrong does not work: that rule has a “false negative” rate of 100 percent, though the “false positive” rate is 0.
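The arithmetic is easy to check (a quick Python sketch of my own):

```python
p_cancer = 0.014          # P(C), prevalence in this age group
p_miss = 0.20             # P(M | C), the false negative rate
p_false_pos = 0.08        # P(P | N), the false positive rate

# Total probability of the test giving the wrong answer:
p_wrong = p_miss * p_cancer + p_false_pos * (1 - p_cancer)
print(p_wrong)  # 0.08168
assert abs(p_wrong - 0.08168) < 1e-12

# Declaring everyone "cancer free" is wrong only when cancer is present,
# which happens less often than the test errs:
assert p_cancer < p_wrong
```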

On the other hand if you just told everyone “you have it”, then you’d be wrong 98.6 percent of the time, but you’d have zero “false negatives”.

So being right more often isn’t what you want to maximize, and trying to minimize the false positives or the false negatives doesn’t work either.

2. So what about “detecting more of the cancer that is there”? Well, that is where this article comes in. Switching to digital mammograms does increase detection rate but also increases the number of false positives:

The authors note that for every 10,000 women 40 to 49 who are given digital mammograms, two more cases of cancer will be identified for every 170 additional false-positive examinations.

So, what one sees is that if a woman gets a positive reading, she now has an 11 percent chance of actually having breast cancer, though a few more cancers would be detected.

Is this progress?

My whole point: saying one test is “more accurate” than another test isn’t well defined, especially in a situation where one is trying to detect something that is relatively rare.
Here is one way to look at it: let the probability of breast cancer be $a$, the probability of detection of a cancer be given by $x$ and the probability of a false positive be given by $y$. Then the probability of a person actually having breast cancer, given a positive test is given by:
$B(x,y) =\frac{ax}{ax + (1-a)y}$; this gives us something to optimize. The partial derivatives are:
$\frac{\partial B}{\partial x}= \frac{(a)(1-a)y}{(ax+ (1-a)y)^2},\frac{\partial B}{\partial y}=\frac{(-a)(1-a)x}{(ax+ (1-a)y)^2}$. Note that $1-a$ is positive since $a$ is less than 1 (in fact, it is small). We also know that the critical point $x = y =0$ is a bit of a “duh”: find a single test that gives no false positives and no false negatives. This also shows us that our predictions will be better if $y$ goes down (fewer false positives) and if $x$ goes up (fewer false negatives). None of that is a surprise.

But of interest is the amount of change. The denominators of each partial derivative are identical. The coefficients of the numerators have the same magnitude but opposite signs. So the rate of improvement of the predictive value depends on the relative magnitudes of $x$, which is $.8$ for us, and $y$, which is $.08$. Note that $x$ is much larger than $y$ and that $x$ occurs in the numerator of $\frac{\partial B}{\partial y}$. Hence an increase in the accuracy of the $y$ factor (a decrease in the false positive rate) will have a greater effect on the accuracy of the test than a similar increase in the “false negative” accuracy.
Using the concept of differentials, we expect a change $\Delta x = .01$ to lead to an improvement of about .00136 (substitute $x = .8, y = .08$ into the expression for $\frac{\partial B}{\partial x}$ and multiply by $.01$). Similarly, an improvement (decrease) of $\Delta y = -.01$ leads to an improvement of .013609.

You can “verify” this by playing with some numbers:

Current ($x = .8, y = .08$) we get $B = .1243$. Now let’s change: $x = .81, y = .08$ leads to $B = .125693$
Now change: $x = .8, y = .07$ we get $B = .139616$
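Here is a Python sketch (my own, not from the original post) that verifies both the differential estimates and the direct substitutions:

```python
a = 0.014  # probability of breast cancer in this age group

def positive_predictive_value(x, y, a=a):
    """B(x, y) = P(cancer | positive test), via Bayes' Theorem."""
    return a * x / (a * x + (1 - a) * y)

def dB_dx(x, y, a=a):
    return a * (1 - a) * y / (a * x + (1 - a) * y) ** 2

def dB_dy(x, y, a=a):
    return -a * (1 - a) * x / (a * x + (1 - a) * y) ** 2

B0 = positive_predictive_value(0.80, 0.08)
assert abs(B0 - 0.1243) < 5e-4

# Differential estimates vs. the actual changes quoted above:
assert abs(dB_dx(0.80, 0.08) * 0.01 - 0.00136) < 1e-4
assert abs(dB_dy(0.80, 0.08) * -0.01 - 0.013609) < 1e-4
assert abs(positive_predictive_value(0.81, 0.08) - 0.125693) < 1e-5
assert abs(positive_predictive_value(0.80, 0.07) - 0.139616) < 1e-5
```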

Bottom line: the best way to increase the predictive value of the test is to reduce the number of false positives while keeping the “false negative” rate the same (or improving it). As things sit, the false positive rate is the bigger factor affecting predictive value.

August 4, 2012

The day started with a talk by Karen King from the National Council of Teachers of Mathematics.
I usually find math education talks to be dreadful, but this one was pretty good.

The talk was about the importance of future math teachers (K-12) actually having some math background. However, she pointed out that merely having passed math courses didn’t imply that future teachers understood the mathematical issues that they would be teaching…and it didn’t imply that their students would do better.

She gave an example: about half of those seeking to teach high school math couldn’t explain why “division by zero” was undefined! They knew that it was undefined but couldn’t explain why. I found that astonishing since I knew that in high school.

Later, she pointed out that potential teachers with a math degree didn’t understand what the issues were in defining a number like $2^{\pi}$. Of course, a proper definition of this concept requires limits, or at least a rigorous definition of the log function, and she was well aware that the vast majority of high school students aren’t ready for such things. Still, the instructor should be; as she said, “we all wave our hands from time to time, but WE should know when we are waving our hands.”

She stressed that we need to get future math teachers to get into the habit (she stressed the word: “habit”) of always asking themselves “why is this true” or “why is it defined in this manner”; too many of our math major courses are rule bound, and at times we write our exams in ways that reward memorization only.

Next, Bernd Sturmfels gave the second talk in his series; this was called Convex Algebraic Geometry.

You can see some of the material here. He also led this into the concept of “semidefinite programming”.

As best I can tell: one looks at the objects studied by algebraic geometers (root sets of polynomials of several variables) and then takes an “affine slice” of these objects.

One example: the “n-ellipse” is the set of points on the plane that satisfy $\sum^m_{k=1} \sqrt{(x-u_k)^2 + (y-v_k)^2} = d$ where $(u_k, v_k)$ are points in the plane.

Questions: what is the degree of the polynomial that describes the ellipse? What happens if we let $d$ tend to zero? What is the smallest $d$ for which the ellipse is nonempty (the Fermat–Weber point)? Note: the 1-ellipse is a circle, the 2-ellipse is what we usually think of as an ellipse, and the 3-ellipse is a curve of degree 8.

Note: these types of surfaces can be realized via determinants of symmetric matrices; such matrices have real eigenvalues. We can plot curves over which an eigenvalue goes to zero and then changes sign. This process leads to what is known as a spectrahedron; this is a type of shape in space. A polyhedron can be thought of as the spectrahedron of a diagonal matrix.

Then one can seek to optimize a linear function over a spectrahedron; this leads to semidefinite programming, which, in general, is roughly as difficult as linear programming.

One use: some global optimization problems can be reduced to a semidefinite programming problem (not all).

Shorter Talks
There was a talk by Bob Palais which discussed the role of Rodrigues in the discovery of the quaternions. The idea is that Rodrigues discovered the quaternions before Hamilton did, but talked about them in terms of rotations in space.

There were a few talks about geometry and how to introduce concepts to students; of particular interest was the concept of a geodesic. Ruth Berger talked about the “fish swimming in jello” model: basically, suppose you had a sea of jello where the jello’s density was determined by its depth, with the most dense jello (tending to infinite density) at the bottom, and it took less energy for the fish to swim in the less dense regions. Then if a fish wanted to swim between two points, what path would it take? The geometry induced by these geodesics results in the upper half plane model for hyperbolic space.

Nick Scoville gave a talk about discrete Morse theory. Here is a user’s guide. The idea: take a simplicial complex and assign numbers (integers) to the points, segments, triangles, etc. The assignment has to follow rules; basically the boundary of a complex has to have a lower number than what it bounds (with one exception….) and such an assignment leads to a Morse function. Critical sets can be defined and the various Betti numbers can be calculated.

Christopher Frayer then talked about the geometry of cubic polynomials. This is more interesting than it sounds.
Think about this: remember Rolle’s Theorem from calculus? There is an analogue of this in complex variables called the Gauss–Lucas Theorem. Basically, the roots of the derivative lie in the convex hull of the roots of the polynomial. Then there is Marden’s Theorem for polynomials of degree 3. One can talk about polynomials that have a root at $z = 1$ and two other roots inside the unit circle; then one can study where the roots of the derivative lie. For a certain class of these polynomials, there is a “dead circle” tangent to the unit circle at 1 which encloses no roots of the derivative.

February 26, 2011

Ants and the Calculus of Variations

Filed under: applied mathematics, calculus of variations, optimization, popular mathematics, science — collegemathteaching @ 10:38 pm

Yes, ants appear to solve a calculus of variations problem despite having little brains. Here is the “why and how”:

embedded in the ants’ tiny brains is not an evolutionary algorithm for solving the Steiner problem, but a simple rule combined with a fact of chemistry: ants follow their own pheromone trails, and those pheromones are volatile. As Wild explains, ants start out making circuitous paths, but more pheromone evaporates from the longer ones because ants take longer to traverse them while laying down their own scent. The result is that the shortest paths wind up marked with the most pheromone, and ants follow the strongest scents.

Wild shows a nice simulation video on his site, demonstrating how, given these simple assumptions, ants wind up taking the shortest trails.

Before we say that evolution can’t explain a behavior, it behooves us to learn as much as we can about that behavior.

Ants find the shortest route because of three simple facts:

1. Ants lay down pheromone trails as they travel and follow the strongest scent
2. Pheromone trails degrade over time
3. Short paths take less time to traverse

When two points (say, two nests, or a nest and a food source) need to be connected, ants may start out tracing several winding pheromone paths among them. As ants zing back and forth down trails, pheromone levels build up. Long trails take more time to travel, so long-trail ants make fewer overall circuits, more pheromone dissipates between passes, and the trails end up poorly marked. Short trails enable ants to make more trips, less time elapses between passes, so these trails end up marked more strongly. The shortest trail emerges.

Here’s a simulation showing digital ants selecting the shortest of 4 possible routes. Note how and where pheromone concentration builds:

They solved the geodesic problem and haven’t even had my numerical analysis class! 🙂

September 5, 2010

The Black Swan by Nicholas Taleb

The short: I enjoyed the book and found it hard to put down. It challenged some of my thinking and changed the way that I look at things.

What I didn’t like: the book was very inefficient; he could have conveyed the same message in about 1/3 of the pages.
But: the fluff/padding was still interesting; the author has a sense of humor and writes in an entertaining style.

What is the gist of the book? Well, the lessons are basically these:

1. Some processes lend themselves to being mathematically modeled, others don’t. Unfortunately, some people use mathematical models in situations where it is inappropriate to do so (e. g., making long term forecasts about the economy). People who rely too much on mathematical modeling are caught unprepared (or just plain surprised) when some situation arises that wasn’t considered possible in the mathematical model (e. g., think of a boxer getting in a fight with someone who grabs, kicks and bites).

2. Some processes can be effectively modeled by the normal distribution, others can’t. Example: suppose you are machining bolts and are concerned about quality, as, say, measured by the width of the bolt. That sort of process lends itself to a normal distribution; after all, if the specification is, say, 1 cm, there is no way that an errant bolt will be, say, 10 cm wide. On the other hand, if you are talking about stock markets, it is possible that some catastrophic event (called a “black swan”) can occur that causes the market to, say, lose half or even 2/3 of its value. If one tried to model recent market price changes by some sort of normal-like distribution, such a large variation would be deemed all but impossible.

3. Sometimes these extremely rare events have catastrophic outcomes. But these events are often impossible to predict beforehand, even if people do “after the fact studies” that say “see, you should have predicted this.”

4. The future catastrophic event is, more often than not, one that hasn’t happened before. The ones that happened in the past, in many cases, won’t happen again (e. g., terrorists successfully coordinating an attack that slams airplanes into buildings). But the past catastrophic events are the ones that people prepare for! Bottom line: sometimes, preparing to react better is possible where being proactive is, in fact, counterproductive.

5. Sometimes humans look for and find patterns that are really just coincidence, and then use faulty logic to make an inference. Example: suppose you interview 100 successful CEO’s and find that all of them pray to Jesus each day. So, obviously, praying to Jesus is a factor in becoming a CEO, right? Well, you need to look at everyone in business who prayed to Jesus and see how many of them became CEOs; often that part of the study is not done. Very rarely do we examine what the failures did.

I admit that I had to laugh at his repeated slamming of academics (I am an academic). In one place, he imagines a meeting between someone named “Fat Tony” and an academic. Taleb poses the problem: “suppose you are told that a coin is fair. Now you flip it 99 times and it comes up heads. On the 100th flip, what are the odds of another head?” Fat Tony says something like “about 99 percent” where the academic says “50 percent”.

Frankly, that hypothetical story is pure nonsense. In this case, the academic is really saying “if I am 100 percent sure that the coin is fair, then 99 heads in a row is merely a very rare event”, though, in reality, the academic would reject the null hypothesis that the coin is fair, as the probability of a fair coin coming up heads 99 times in a row is $2^{-99}$, which is way in the rejection region of any statistical test.

Taleb also discusses an interesting aspect of human nature that I didn’t believe at first..until I tried it out with friends. This is a demonstration: ask your friend “which is more likely:
1. A random person drives drunk and gets into an auto accident or
2. A random person gets into an auto accident.

Or you could ask: “which is more likely: a random person:
1. Is a smoker and gets lung cancer or
2. Gets lung cancer.

Of course, the correct answer in each case is “2”: the set of all auto accidents caused by drunk driving is a subset of all auto accidents and the set of all lung cancer cases due to smoking is a subset of all lung cancer cases.

But when I did this, my friend chose “1”!!!!!!

I had to shake my head, but that is a human tendency.

One other oddity of the book: toward the end, Taleb discusses fitness. He mentions that he hit on the perfect fitness program by asking himself: “what did early humans do? Ans.: walk long distances to hunt, and engage in short bursts of high intensity activity”. He then decided to walk long, slow distances and do sprints every so often.

Well, nature also had humans die early of various diseases; any vaccine or cure works against “mother nature”. So I hardly view nature as always being optimal. But I did note with amusement that Taleb walks 10-15 hours a week, which translates to 30-45 miles per week! (20 minutes per mile pace).

I’d say THAT is why he is fit. 🙂

(note: since I love to hike and walk long distances, this comment was interesting to me)

April 3, 2010

Optimization Problems in Calculus: symmetries in solutions

Filed under: calculus, optimization, student learning — collegemathteaching @ 9:23 pm

For reasons I’ve yet to comprehend, “business calculus” students often have trouble with the following kind of problem:

Suppose you wish to fence in a rectangular plot of land on a fixed budget. You have one long wall to use for one side. The sides perpendicular to the wall cost 10 dollars per foot of fencing and the side parallel to the long wall costs 30 dollars per foot of fencing. What is the largest area that you can enclose?

Yes, there are those who can’t even figure out that the area is denoted by $A = xy$, and some have trouble with the cost function $C = 20x + 30y$.

Some struggle with the algebra.

But let’s look at the mathematics itself: there is a certain symmetry to the solution; if one views the cost function as $C = Ax + By$ and the area function as $V = xy$, then one can show that the solution $x = x_0, y = y_0$ will always have the following property: $Ax_0 = By_0 = (1/2)C$

In fact it is a pleasant exercise to show that the solution to the following problem:
Maximize $x^{r}y^{s}=V (r\geq 1, s\geq 1)$
Subject to $Ax+By = C, A > 0, B > 0, C > 0$
is given by $x = (C/A)(r/(r+s)), y = (C/B)(s/(r+s))$

That is, $Ax = C(r/(r+s)), By = C(s/(r+s))$; the “share” of the cost associated with the $Ax$ term depends on the fraction $r/(r+s)$, which is that term’s fraction of the total exponent of the function determining $V$.
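This symmetry is easy to verify numerically; here is a sketch of my own (the values of $A, B, C, r, s$ are chosen arbitrarily for illustration, with the budget $C = 600$ made up) using a crude grid search along the constraint:

```python
# Maximize x^r * y^s subject to A*x + B*y = C, using the closed form
# x = (C/A) * r/(r+s), y = (C/B) * s/(r+s), then confirm by brute force.
A, B, C = 20.0, 30.0, 600.0   # fencing costs plus a hypothetical budget
r, s = 1, 1                   # plain area V = x*y

x_star = (C / A) * r / (r + s)
y_star = (C / B) * s / (r + s)

# The "shares" of the budget split as r : s, so A*x = B*y = C/2 here:
assert abs(A * x_star - C * r / (r + s)) < 1e-9
assert abs(B * y_star - C * s / (r + s)) < 1e-9

def objective(x):
    y = (C - A * x) / B          # stay on the budget constraint
    return x ** r * y ** s

# A grid search along the constraint never beats the closed-form solution:
best_grid = max(objective(C / A * k / 10000) for k in range(1, 10000))
assert best_grid <= objective(x_star) + 1e-9
```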

If we look at the associated dual problem (minimize the cost given that we need the area $V$) we find that there is a more complicated symmetry at hand:
$x = ((rB)/(sA))^{s/(r+s)}V^{1/(r+s)}, y = ((sA)/(rB))^{r/(r+s)}V^{1/(r+s)}$.
Note the symmetry: if we take the ratio of the solution components we obtain:
$x/y = (r/s)(B/A)$.
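And a similar numerical check for the dual problem (again my own sketch, with arbitrary positive values chosen for illustration):

```python
# Minimize A*x + B*y subject to x^r * y^s = V, via the closed form
# x = ((r*B)/(s*A))^(s/(r+s)) * V^(1/(r+s)),
# y = ((s*A)/(r*B))^(r/(r+s)) * V^(1/(r+s)).
A, B = 5.0, 7.0
r, s = 2, 3
V = 100.0

x_star = ((r * B) / (s * A)) ** (s / (r + s)) * V ** (1 / (r + s))
y_star = ((s * A) / (r * B)) ** (r / (r + s)) * V ** (1 / (r + s))

# The solution satisfies the constraint and the ratio symmetry x/y = (r/s)(B/A):
assert abs(x_star ** r * y_star ** s - V) < 1e-8
assert abs(x_star / y_star - (r / s) * (B / A)) < 1e-12

# Perturbing along the constraint only raises the cost (so this is a minimum):
cost = A * x_star + B * y_star
for t in (0.9, 0.95, 1.05, 1.1):
    x = x_star * t
    y = (V / x ** r) ** (1 / s)   # stay on the constraint x^r * y^s = V
    assert A * x + B * y >= cost - 1e-9
```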
How cool is that? 🙂
Note: I know that this follows immediately from the Lagrange Multipliers technique; still it is fun to derive via single variable calculus techniques.