# College Math Teaching

## March 14, 2014

### Approximating the derivative and round off error: class demonstration

In numerical analysis we are covering “approximate differentiation”. One of the formulas we are using: $f'(x_0) = \frac{f(x_0 + h) -f(x_0 -h)}{2h} - \frac{h^2}{6} f^{(3)}(\zeta)$ where $\zeta$ is some number in $[x_0 -h, x_0 + h]$; of course we assume that the third derivative is continuous in this interval.

The derivation can be done in a couple of ways: one can either use the degree 2 Lagrange polynomial through $x_0-h, x_0, x_0 + h$ and differentiate or one can use the degree 2 Taylor polynomial expanded about $x = x_0$ and use $x = x_0 \pm h$ and solve for $f'(x_0)$; of course one runs into some issues with the remainder term if one uses the Taylor method.

But that isn’t the issue that I want to talk about here.

The issue: “what should we use for $h$?” In theory, we should get a better approximation if we make $h$ as small as possible. But if we are using a computer to make a numerical evaluation, we have to concern ourselves with round off error. So what we actually calculate will NOT be $f'(x_0) = \frac{f(x_0 + h) -f(x_0 -h)}{2h}$ but rather $f'(x_0) = \frac{\hat{f}(x_0 + h) -\hat{f}(x_0 -h)}{2h}$ where $\hat{f}(x_0 \pm h) = f(x_0 \pm h) - e(x_0 \pm h)$ where $e(x_0 \pm h)$ is the round off error used in calculating the function at $x = x_0 \pm h$ (respectively).

So, it is an easy algebraic exercise to show that:

$f'(x_0) - \frac{f(x_0 + h) -f(x_0 -h)}{2h} = - \frac{h^2}{6} f^{(3)}(\zeta)-\frac{e(x_0 +h) -e(x_0 -h)}{2h}$ and the magnitude of the actual error is bounded by $\frac{h^2 M}{6} + \frac{\epsilon}{2}$ where $M = max\{f^{(3)}(\eta)\}$ on some small neighborhood of $x_0$ and $\epsilon$ is a bound on the round-off error of representing $f(x_0 \pm h)$.

It is an easy calculus exercise (“take the derivative and set equal to zero and check concavity” easy) to see that this error bound is a minimum when $h = (\frac{3\epsilon}{M})^{\frac{1}{3}}$.

Now, of course, it is helpful to get a “ball park” estimate for what $\epsilon$ is. Here is one way to demonstrate this to the students: solve for $\epsilon$ and obtain $\frac{M h^3}{3} = \epsilon$ and then do some experimentation to determine $\epsilon$.

That is: obtain an estimate of $h$ by using this “3 point midpoint” estimate for a known derivative near a value of $x_0$ for which $M$ (a bound for the 3’rd derivative) is easy to obtain, and then obtain an educated guess for $h$.

Here are a couple of examples: one uses Excel and one uses MATLAB. I used $f(x) = e^x$ at $x = 0$; of course $f'(0) = 1$ and $M = 1$ is reasonable here (just a tiny bit off). I did the 3-point estimation calculation for various values of $h$ and saw where the error started to increase again.

Here is the Excel output for $f(x) = e^x$ at $x =0$ and at $x = 1$ respectively. In the first case, use $M = 1$ and in the second $M = e$

In the $x = 0$ case, we see that the error starts to increase again at about $h = 10^{-5}$; the same sort of thing appears to happen for $x = 1$.

So, in the first case, $\epsilon$ is about $\frac{1}{3} \times (10^{-5})^3 = 3.333 \times 10^{-16}$; it is roughly $10^{-15}$ at $x =1$.

Note: one can also approach $h$ by using powers of $\frac{1}{2}$ instead; something interesting happens in the $x = 0$ case; the $x = 1$ case gives results similar to what we’ve shown. Reason (I think): 1 is easy to represent in base 2 and the powers of $\frac{1}{2}$ can be represented exactly.

Now we turn to MATLAB and here we do something slightly different: we graph the error for different values of $h$. Since the values of $h$ are very small, we use a $-log_{10}$ scale by doing the following (approximating $f'(0)$ for $f(x) = e^x$)

. By design, $N = -log_{10}(H)$. The graph looks like:

Now, the small error scale makes things hard to read, so we turn to using the log scale, this time on the $y$ axis: let $LE = -log_{10}(E)$ and run plot(N, LE):

and sure enough, you can see where the peak is: about $10^{-5}$, which is the same as EXCEL.

## March 9, 2014

### Bézier Curves

I am currently teaching Numerical Analysis and using Burden-Faires. The book covers the topics we like, but I feel that the section on splines and parametrized curves is a bit weak; in particular the discussion on Bézier curves is a bit lacking. The pity: the discussion need not be all that deep, and the standard equation for Bézier curves is actually easy to remember.

Also: where the text talks about how the Bézier curve equations differs from the “bare handed parametric cubic spline” that they derive, they don’t explain the reason for the difference.

So, I decided to write these notes. I will have to explain some basic concepts.

The setting: $R^n$ with the usual geometry induced by the usual “dot product”.

Convex Sets in $R^n$

A set $X \subset R^n$ is said to be convex if for any two points $x, y \in X$, the straight line segment connecting $x$ to $y$ is also in $X$; that is, the set $tx + (1-t)y \in X$ for all $t \in [0,1]$.

Convex Hull for a set of points

Now suppose one is given a collection of points $C= x_0, x_1, x_2, x_3,.... \in R^n$. The convex hull $H$ for $C$ is the smallest convex set which contains all of $C$. That is, if $Y$ is any convex set that contains $C$, then $H \subseteq Y$. In the case where the set of points is finite (say, $C = \{x_0, x_1, x_2, ....x_n \} )$ then $H$ consists the set of all $\sum^{n}_{i = 0} \alpha_i x_i$ where $\alpha_i \ge 0$ and $\sum^{n}_{i=0} \alpha_i = 1$.

Note: the convex hull for a set of points is, in general, an example of a vector subset that is NOT a vector subspace.

Binomial Theorem and the Bernstein coefficient polynomials

Recall from algebra: if $n$ is a positive integer and $a, b$ numbers (real, complex, or even arbitrary field elements), $(a+b)^n = \sum^{n}_{j =0} { n \choose j} a^{n-j} b^{j}$, where ${n \choose j} = \frac{n!}{(n-j)! j !}$. For example, $(a+b)^3 = a^3 + 3a^2b + 3ab^2 + b^3$.

Now consider the rather silly looking: $1^n = ((1-t) + t)^n = \sum^n_{j=0}{ n \choose j} (1-t)^{n-j} t^{j}$ Note that this expression is equal to 1 for ALL values of $t$ and that for $t \in [0,1]$, each summand ${ n \choose j} (1-t)^{n-j} t^{j}$ is positive or zero.

These “coefficient polynomials” ${ n \choose j} (1-t)^{n-j} t^{j}$ are called the Bernstein polynomials (or Bernstein basis polynomials) and we denote them as follows: $b_{j,n}(t) = { n \choose j} (1-t)^{n-j} t^{j}$. We now see that for all $t \in [0,1], 0 \le b_{j,n}(t) \le 1$ and $\sum^n_{j=0}b_{j,n}(t) = ((1-t)+t)^n =1^n =1$

Definition of a Bézier curve and some of its properties

Now let $P_0, P_1, P_2, ...P_n$ be a collection of distinct points in $R^k$. One can think of these points as vectors.
The Bézier curve with control points $P_0, P_1, P_2, ...P_n$ is defined to be $B(t)= \sum^n_{j=0}b_{j,n}(t)P_j, t \in [0,1]$.

Properties

$B(0) = P_0, B(1) =P_n$. This is clear because $b_{0,n}(0) = 1, b_{n,n}(1) =1$ and for $i \notin \{0,1\}, b_{i,n}(0)=b_{i,n}(1) = 0$.

The polygon formed by $P_0, P_1, ....P_n$ is called the control polygon for the Bézier curve.

For all $t \in [0,1], B(t)$ is in the convex hull of $P_0, P_1, ...P_n$. This is clear because $\sum^n_{j=0}b_{j,n}(t) = ((1-t)+t)^n =1^n =1$ and each $b_{i,n}(t)$ is positive.

“Guideposts”: the text talks about the “guideposts”: the text looks at a cubic Bézier curve in the plane and uses $(x_0, y_0) =P_0, (x_0+ \alpha_0, y_0 + \beta_0) = P_1, (x_1 - \alpha_1, y_1 - \beta_1)= P_2, (x_1, y_1) =P_3$

Now $P_1$ and $P_{n-1}$ directly affect the (one sided) tangent to the Bézier curve at $t=0, t=1$. In fact we will show that if we use the one-sided parametric curve derivative, we see that $B'(0) = n(P_1 - P_0), B'(1) = n(P_n - P_{n-1})$. The text calls $n$ the scaling factor and notes that the scaling factor is 3 when $n = 3$.

We’ll do the calculations for $B'(0), B'(1)$ for the general degree $n$ Bézier curve using elementary calculus (product rule):

First write $B(t) = (1-t)^nP_0 + n(1-t)^{n-1}tP_1 + \sum^{n-2}_{j=2} b_{j,n}(t) P_j + n(1-t)t^{n-1}P_{n-1} + t^n P_n$. Now take the derivative and we see:
$B'(t) = -n(1-t)^{n-1}P_0 + (n(1-t)^{n-1} - n(n-1)(1-t)^{n-2}t)P_1 + \frac{d}{dt} (\sum^{n-2}_{j=2} b_{j,n}(t) P_j) +(n(n-1)(1-t)t^{n-2}-nt^{n-1})P_{n-1} + nt^{n-1}P_n$

Key observation: every term of $\frac{d}{dt} (\sum^{n-2}_{j=2} b_{j,n}(t) P_j)$ has both a factor of $t$ and $(1-t)$ in it; hence this middle term evaluates to zero when $t \in {0,1}$ and is therefor irrelevant to the calculation of $B'(0)$ and $B'(1)$.

So $B'(0) = -nP_0 + nP_1 = n(P_1 - P_0)$ (the last two terms are zero at $t =0$ and $B'(1) = -nP_{n-1} + nP_n = n(P_n - P_{n-1})$ (the first two terms are zero at $t = 1$ ).

It follows that the DIRECTION of the (one sided) tangents at the ends of the Bézier curve depends only on the unit tangent vectors in the direction of $P_1 - P_0, P_n - P_{n-1}$ respectively. Of course, the tangent vector has a magnitude (norm) as well, and that certainly affects the curve.

Here are some examples of Bézier cubic curves: the points with the open circles are $P_0, P_3$ and the points that are filled in with gray are the control points $P_1, P_2$. The last curve is two Bézier cubics joined together.

Software
The software that I provided writes the cubic Bézier curve as a “conventional” cubic in $x, y$ coordinates: $B_{x}(t) = a_3t^3 + a_2t^2 + a_1t + a_0$ and $B_{y} = b_3t^3 + b_2t^2 + b_1t + b_0$.

## February 17, 2014

### Aitken acceleration: another “easy” example for the students

In a previous post I showed some spreadsheet data to demonstrate the Aitken acceleration process. Here, I’ll go through an example where a sequence converges linearly: let $p_n = L +a^n + b^n$ where $0 < a < b < 1$. We use the form $q_n = \frac{p_{n+2}p_{n} -(p_{n+1})^2}{p_{n+2} - 2p_{n+1} + p_n}$ (I am too lazy to use the traditional “p-hat” notation). First, note that the denominator works out to $a^n(1-a)^2 +b^n(1-b)^2$

The numerator is a tiny bit more work: the $L^2$ terms cancel and as far as the rest:
$L(a^{n+2} + b^{n+2}) + L(a^n + b^n) +(a^{n+2}+b^{n+2})(a^n + b^2)-2L(a^{n+1}+b^{n+1})-(a^{n+1}+b^{n+1})^2$
which simplifies to a term involving $L$ and one that doesn’t. Here is the term involving $L$:

$L(a^{n+2}-2a^{n+1} + a^n + b^{n+2} -2b^{n+1} +b^n) = L(a^n(1-a)^2 +b^n(1-b)^2)$

which, of course, is just $L$ times the denominator.

Now the terms not involving $L$: $(a^{n+2}+b^{n+2})(a^n+b^n) - (a^{n+1} + b^{n+1})^2 = b^na^n(b^2+a^2-2ab) = b^na^n(b-a)^2$

So our fraction is merely

$\frac{L((a^n(1-a)^2 +b^n(1-b)^2)) + b^na^n(b-a)^2}{a^n(1-a)^2 +b^n(1-b)^2}$ $= L + \frac{b^na^n(b-a)^2}{a^n(1-a)^2 +b^n(1-b)^2}$

This can be rearranged to $L + \frac{(b-a)^2}{\frac{(1-a)^2}{b^n} + \frac{(1-b)^2}{a^n}}$

Clearly as $n$ goes to infinity, the error goes to zero very quickly. It might be instructive to look at the ratio of the errors for $p_n$ and $q_n$:

This ratio is
$(a^n + b^n)\frac{a^n(1-a)^2 + b^n(1-b)^2}{a^nb^n(b-a)^2} =$$(a^n +b^n)(\frac{1}{b^n}(\frac{1-a}{b-a})^2 + \frac{1}{a^n}(\frac{1-b}{b-a})^2)$

Note that in the right hand factor: both squared factors are fixed and the coefficients go to infinity as $n$ goes to infinity. If one multiplies out, one obtains:

$((\frac{a}{b})^n +1)(\frac{1-a}{b-a})^2 + ((\frac{b}{a})^n +1)(\frac{1-b}{b-a})^2$. In the limit, the first term decreases to $(\frac{1-a}{b-a})^2$ and the second goes to infinity.

Hence the errors in the accelerated sequence are smaller.

## February 8, 2014

### Demonstrating Aitken’s sequence acceleration

Right now, we are studying the various root finding algorithms in Numerical Analysis (bisection, Newton’s, etc.)

Such algorithms yield a sequence of numbers, which (hopefully) converge to a solution: $p_0, p_1, p_2, p_3, ....$.

Of course, each point in the sequence is obtained by calculations and, if there were a way to combine these points so as to obtain a sequence that converges faster (while not adding much to the computation complexity), there is some benefit. And yes, it is possible to use the sequence points, do the series manipulation and use the manipulated points in the root finding algorithm itself (e. g. Steffensen’s method).

In this post, I’ll talk about Aitken’s method and how one can cook up examples that not only show that the method can work but give the students some intuition as to why it might work.

I’ll provide just a bit of background in the event that the general reader comes across this.

Let $p_i \rightarrow p$. If we have $\frac{|p_{n+1} - p|}{(|p_n -p|)^{\alpha}} \rightarrow \lambda$ ($\lambda$ is positive, of course) we say that the convergence is LINEAR if $\alpha = 1$ and $\lambda < 1$ the inequality must be strict. If $\alpha = 2$ then we say that convergence is quadratic (regardless of $\lambda$.)

To see the reason for the terminology, just multiply both sides of the “defining equation” by the denominator. In the linear case : $|p_{n+1} - p| = \lambda |p_{n} - p|$ so one can think: “get new approximation by multiplying the old approximation by some constant less than one”. For example: $p_n = \frac{1}{2^n}$ exhibits linear convergence to 0; that is a decent way to think about it.

Now (in the linear convergence case anyway), suppose you think of your approximation having an error that shrinks with iteration but shrinks in the following way: the n’th iteration looks like $p_n = p + a^n + b^n$ where $a, b$ are constants strictly between 0 and 1. Of course, as $n$ goes to infinity, $p_n$ approaches the limit $p$ as the error terms die away.

Aitken’s method is this: let’s denote a new sequence by $q_n$. (I am not using the traditional P-hat out of laziness) Then $q_n = p_n -\frac{(p_{n+1} - p_n)^2}{p_{n+2} -2p_{n+1} + p_{n}}$. To see how this formula is obtained, check out these excellent course notes. Or the Wiki article is pretty good.

Look at the numerator of what is being subtracted off: if we have the terms written (very roughly) as $p_{n+1} = p + a^{n+1} + b^{n+1}, p_n = a^n + b^n$ and if, say, $a$ is much larger than $b$, then $a^{n+1}$ will be closer to $a^n$ than $b^{n+1}$ is to $b^n$, hence more of this error will be subtracted away.

Yes, I know that his is simple-minded, but it “gets the spirit” of the process. I’ve set up some spreadsheets with “cooked” sequences linearly converging to zero $p_n = a^n + b^n$ and showed how the Aitken process works there. Note: the spreadsheet round off errors start occurring at the $10^{-16}$ range; you can see that here.

(click to see the larger image)

To see an abstract example where $p_n = L + a^n + b^n$ where $a, b \in (0,1)$, go to the next post in this series.

## February 1, 2014

### Numerical methods class: round off error using geometric series and spreadsheets…

Of course computers store numbers in binary; that is numbers are represented by $\sum^n_{k=-m} a_k 2^{k} = 2^{n}+a_{n-1}2^{n-1} +....+a_1 2 + a_0 1 + a_{-1}\frac{1}{2} + ...+a_{-m} \frac{1}{2^m}$ where each $a_j \in {0,1}$ (of course the first coefficient is 1).

We should probably “warm up” by showing some binary expansions. First: someone might ask “how do I know that a number even HAS a binary expansion? The reason: the dyatic rationals are dense in the number line. So just consider a set of nested partitions of the number line, where each partition in the family has width $\frac{1}{2^k}$ and it isn’t hard to see a sequence that converges to a given number. That sequence leads to the binary expansion.

Example: What is the binary expansion for 10.25?

Answer: Break 10.25 down into 10 + 0.25.

$10 = 8 + 2 = 2^3 + 2^1$ so the “integer part” of the binary expansion is $1010 = 2^3 + 0*2^2 + 2^1 +0*2^0$.

Now for the “faction part”: $0.25 = \frac{1}{4} = \frac{1}{2^2} = 0*\frac{1}{2} + \frac{1}{2^2} = .01$ in binary.

Hence $10.25$ in base 10 $= 1010.01$ in binary.

So what about something harder? Integers are easy, so lets look at $\frac{3}{7} = \sum_{k=1}a_k\frac{1}{2^k}$ where each $a_k \in \{0,1\}$.

Clearly $a_1 =0$ since $\frac{1}{2}$ is greater than $\frac{3}{7}$. Now $a_2 = 1$ so multiply both sides of $\frac{3}{7} = \sum_{k=1}a_k\frac{1}{2^k}$ by 4 to obtain $\frac{12}{7} = 1 + a_3 \frac{1}{2} + a_4 \frac{1}{4} + ...$

Now subtract 1 from both sides and get $\frac{5}{7} = a_3 \frac{1}{2} + a_4 \frac{1}{4} + ...$ Here $a_3 = 1$ so multiply both sides by 2 and subtract 1 and get $\frac{3}{7} = a_4 \frac{1}{2} + a_5 \frac{1}{4} ...$

Note that we are back where we started. The implication is that $\frac{3}{7} = (\frac{1}{4} + \frac{1}{8})(1 + \frac{1}{2^3} + \frac{1}{2^6} +....)$ and so the base 2 decimal expansion for $\frac{3}{7}$ is $.\overline{011}$

Of course there is nothing special about $\frac{3}{7}$; a moment’s thought reveals that if one starts with $\frac{p}{q}$ where $p$ is less than $q$, the process stops (we arrive at zero on the left hand side) or we return to some other $\frac{m}{q}$ where $m$ is less than $q$; since there are only a finite number of $m$ we have to arrive at a previous fraction eventually (pigeonhole principle). So every rational number has a terminating or repeating binary decimal expansion.

Now it might be fun to check that we did the expansion correctly; the geometric series formula will help:

$(\frac{1}{4} + \frac{1}{8})(1 + \frac{1}{2^3} + \frac{1}{2^6} +....) = (\frac{3}{8})(\frac{1}{1-\frac{1}{2^3}})=\frac{3}{8-1} =\frac{3}{7}$

Now about how the computer stores a number: a typical storage scheme is what I call the 1-11-52 scheme. There are 64 bits used for the number. The first is for positive or negative. The next 11 bits are for the exponent. Now $2^{12} - 1 = 2047$ so technically the exponent could be as large as $2047$. But we want to be able to store small numbers as well, so $1023$ is subtracted from this number so we could get exponents that range from $1024$ to $-1023$. Also remember these are exponents for base 2.

This leaves 52 digits for the mantissa; that is all the computer can store. This is the place where one can look at round off error.

Let’s see two examples:

10.25 is equal to $2^3 (1 + \frac{1}{4} + \frac{1}{32})$. The first bit is a 0 because the number is positive. The exponent is 3 and so is represented by $3 + 1023 = 1026$ which is $10000000011$. Now the mantissa is assumed to start with 1 every time; hence we get 0100100000000000000000000000000000000000000000000000.

So, let’s look at the $\frac{3}{7}$ example. The first digit is 0 since the number is positive. We write the binary expansion as: $2^{-2}(1+\frac{1}{2} + \frac{1}{8} + \frac{1}{16} + ....)$. The exponent is $-2$ which is stored as $-2 + 1023 = 1021 = 01111111101$. Now the fun starts: we need an infinite number of bits for an exact representation but we have 52. $.011$ is what is repeated so we can repeat this 51 times plus the next 0. So the number that serves as a proxy for $\frac{3}{7}$ is really $\frac{3}{7} - (\frac{1}{4} + \frac{1}{8})^{18}\sum^{\infty}_{k=0} (\frac{1}{8})^k = \frac{3}{7}- (\frac{3}{8})^{18}(\frac{1}{1 - \frac{1}{8}}) = \frac{3}{7} - \frac{3}{7}(\frac{3}{8})^{18} = \frac{3}{7}(1-(\frac{3}{8})^{18}))$

Of course, it is difficult to use base 2 directly to demonstrate round off error, so many texts use regular decimals and instruct the students to perform calculations using “n-digit rounding” and “n-digit chopping” to show how errors can build up with iterated operations.

One common example: use the quadratic formula to find the roots of a quadratic equation. Of course the standard formula for the roots of $ax^2 + bx + c =0$ is $\frac{-b \pm \sqrt{b^2 -4ac}}{2a}$ and there is an alternate formula $\frac{-2c}{b \pm\sqrt{b^2 -4ac}}$ that leads to less round off error in the case when the “alternative formula” denominator has large magnitude.

Now the computations, when done by hand, can be tedious and more of an exercise in repeated calculator button punching than anything else.

But a spreadsheet can help, provided one can find a way to use the “ROUND(N, M)” command and the “TRUNC(N, M)” commands to good use. In each case, N is the number to be rounded or truncated and M is the decimal place.

A brief review of these commands: the ROUND command takes a decimal and rounds to the nearest $\frac{1}{10^N}$ place; the TRUNC truncates the decimal to the nearest $\frac{1}{10^N}$ place.
The key: M can be any integer; positive, negative or zero .

Examples: TRUNC(1234.5678, 2) = 1234.56, TRUNC(1234.5678, -1) = 1230, ROUND(1234.5678, 3) = 1234.569, ROUND(1234.5678, -2) = 1200.

Formally: if one lets $a_i \in {0, 1, 2, ...9}$ for all $i$ and if $x = \sum_{i=-m}^{k} a_i (10)^i = a_k 10^k + a_{k-1} 10^{k-1} +....a_1 10 + a_0 + a_{-1}10^{-1} +...a_{-m}10^{-m}$ then TRUNC(x, M) = $\sum_{i=-M}^k a_i (10)^i =a_i (10)^i = a_k 10^k + a_{-k-1} 10^{-k-1}+...+ a_{-M}10^{-M}$

So if M is negative, this process stops at a positive power of 10.

That observation is the key to getting the spread sheet to round or truncate to a desired number of significant digits.

Recall that the base 10 log picks of the highest power of 10 that appears in a number; for example $log10(1234.5678) = 3.xxxxx$ So let’s exploit that: we can modify our commands as follows:

for TRUNC we can use TRUNC(N, M -1 – INT(log10(abs(N)))) and for ROUND we can use ROUND(N, M -1 – INT(log10(abs(N)))). Subtracting the 1 and the integer part of the base 10 log moves the “start point” of the truncation or the rounding to the left of the decimal and M moves the truncation point back to the right to the proper place.

Of course, this can be cumbersome to type over and over again, so I recommend putting the properly typed formula in some “out of the way” cell and using “cut and paste” to paste the formula in the proper location for the formula.

Here is an example:
This spread sheet shows the “4 digit rounding” calculation for the roots of the quadratics $x^2 - (60 + \frac{1}{60})x + 10$ and $1.002x^2 - 11.01x +.01265$ respectively.

(click for a larger view).

Note that one has to make a cell for EVERY calculation because we have to use 4 digit arithmetic at EACH step. Note also the formula pasted as test in the upper right hand cell.

One can cut an paste as a cell formula as shown below:

Here one uses cell references as input values.

Here is another example:

Note the reference to the previous cells.

## December 20, 2013

### Teaching the basics of numerical methods for solving differential equations

This semester we had about a week to spend on numerical methods. My goal was to give them the basics of how a numerical method works: given $y' = f(t,y), y(t_0) = y_0$ one selects a step size $\delta t$ and then one rides the slope of the tangent line: $y(t_0 + \delta t) = y_0 + (\delta t) f(t_0, y_0)$ and repeat the process. This is the basic Euler method; one can do an averaging process to get a better slope (Runge-Kutta) and one, if one desires, can use previous points in a multi-step process (e. g. Adams-Bashforth, etc.). Ultimately, it is starting at a point and using the slope at that point to get a piece wise linear approximation to a solution curve.

But the results of such a process confused students. Example: if one used a spreadsheet to do the approximation process (e. g. Euler or Runge-Kutta order 4), one has an output something like this:

So, there is confusion. They know how to get from one row to the other and what commands to type. But….”where is the solution?” they ask.

One has to emphasize what is obvious to us: the $x, y$ columns, is the approximate solution…a piece wise approximation of one anyway. What we have is a set $x, y(x)$ where these ordered pairs are points in the approximate solution to the differential equation that runs through those points. One cannot assume that the students understand this, even when they can do the algorithm.

An Exam Question

As a bonus question, I gave the following graph:

I then said: “I was testing out an Euler method routine on the differential equation $y' = y(2-y), y(0) = 1$ and I got the following output.

A) Is this solution mathematically possible?

B) If there is an error, is this an error induced by the Euler method or by an improper step size, or is a coding error more likely?

Many students got part A correct: some noted that $y = 2$ is an equilibrium solution and that this differential equation meets the “existence and uniqueness” criteria everywhere; hence the graph of the proposed solution intersecting the equilibrium solution is impossible.

Others noted that the slope of a solution at $y = 2$ would be zero; others noted that the slopes above $y = 2$ were negative and this proposed solution leveled out. Others noted that $y = 2$ is an attractor hence any solution near $y = 2$ would have to stay near there.

But no one got part B; one even went as far to say that someone with a Ph. D. in math would never make such an elementary coding error (LOL!!!!!)

But the key here: the slopes ARE negative above $y = 2$ and a correct Euler method (regardless of step size…ok, within reason) would drive the curve down.

So this WAS the result of a coding error.

What went wrong: I was running both RK-order 4 and Euler (for teaching purposes) and I stored the RK slope with one variable and calculated the “new Y” using the RK slope (obtained from the RK approximation) for the Euler method. Hence when the curve “jumped over” the $y = 2$, the new slope it picked up was a near zero slope from the RK approximation for the same value of $t$ (which was near, but below the $y = 2$ equilibrium.

My problem is that the two variables in the code differed by a single number (my bad). I was able to fix the problem very quickly though.

An aside
On another part of the test, I told them to solve $y' = y (y-1) (y+3), y(0) = 1$ and gave them the phrase: “hint: THINK first; this is NOT a hard calculation”. A few students got it, but mostly the A students. There was ONE D student who got it right!

## March 26, 2012

### Humiliated by Convex Splines

Update The writing in this article leaves something to be desired. I am going to clean this up a bit.

————————-
I was working on an analysis problem. One of the steps I needed to take was: given a set of points $p_1, p_2, p_3,...., p_i = (x_i,y_i)$ in the first quadrant of the standard Cartesian plane and given the fact that these points converged to the origin AND given the fact that these points met a convexity property in that for all $i, 0 < \frac{y_{i+2}-y{i+1}}{x_{i+2}-x_{i+1}} < \frac{y_{i+1}-y{i}}{x_{i+1}-x_{i}}$ could one then fit a convex $C^1$ curve though all of these points (and hence to the origin)? The answer turns out to be “no” but happily one can put an decreasing $C^1$ convex curve though an infinite subset of these. One standard way is to use the Beizer quadratic spline.

But I figured: “hey, we are talking about, what, a 4’th degree polynomial through two points that contains position and derivative information at the endpoints…how hard can that be?”

Then I tried it…..and found some more humility.

First I found out: it would be impossible to put a convex curve $f(x)$ through points $(x_0,y_0), (x_1, y_1)$ that had $p'(x_0) = m_0, p'(x_1) = m_1, m_0 < m_1$ unless certain conditions were met.

Here is why: suppose $\frac{f(x_1) - f(x_0)}{x_1 - x_0} = m \geq m_1$ The Mean Value theorem assures the existence of $x, x \in (x_0, x_1)$ where $f'(x) = m$ which is impossible due to convexity, unless $m = m_1$, in which case an application of the Mean Value Theorem to $f$ on $[x_0, x_f]$ (where $x_f$ is the first place that $f' = m_1$) shows that convexity is impossible.

So what about a convex polynomial that runs through the first and last point of a three point convex set and has the required derivatives at the first and last point (“knot”)? It turns out that such a problem has been the focus of much research and, in general, is impossible to do (reference).

But we can get a non-polynomial function (in particular a Frobenius type polynomial with fractional powers) that yields one sided differentiability
at the first and last point, which permits appropriate “piecing together” of splines to yield a $C^1$ convex curve.

The key will be to show that given three knots of “convex data” and a derivative condition at the first and last knot, one can always fit in a convex “Frobenius” polynomial through the first and last knot that meet the specified derivative condition at those knots.
The set up: the last knot will be taken to be the origin; the first knot will be given in terms of the sum of two line segments; the segment ending at the origin will have the specified slope at the origin and the second segment will have the slop of the “third knot”. The missing knot (the one that the function will not run through) can be thought of as being the sum of a segment whose slope is greater than the slope at the origin along with a segment ending at the third knot whose slope is the specified slope at the third knot. Denote the first knot by $(0, g(0))$ and $(1, g(1))$.

Theorem: Given real numbers $0 < m_{1} < m_{2}$ and $\rho$, $0 <\rho$ there is a convex Frobenius polynomial $g$ such that:

(i) $g(0)=0$

(ii)$g(1)=(1-\rho )m_{2}+\rho m_{1}$

(iii) $g^{\prime }(0)=m_{1}$

(iv) $g^{\prime }(1)=m_{2}$

Where the derivatives in question are the appropriate one-sided derivatives.

Proof. Define $g(x) = Dx + Ax^{1 + \alpha} + Bx^{1+ \alpha + \beta} + Cx^{1 + \alpha + \beta + \delta}$.
If we can find such a $g$ that meets conditions i, ii, iii and iv with $A,B,C, D$ positive, convexity follows.

Set $D=m_{1}$ to satisfy properties i and iii.

$m_{1}+A+B+C=(1-\rho )m_{2}+\rho m_{1}$

$A+B+C=(1-\rho )m_{2}+\rho m_{1}-m_{1}=(1-\rho )m_{2}-(1-\rho )m_{1}$

$A+B+C=(1-\rho )(m_{2}-m_{1})$

$m_{1}+(1+\alpha )A+(1+\alpha +\beta )B+(1+\alpha +\beta +\delta )C=m_{2}$

Or

$(1+\alpha )A+(1+\alpha +\beta )B+(1+\alpha +\beta +\delta )C=m_{2}-m_{1}$

So set $\Delta =m_{2}-m_{1}$

We have

$A+B+C=(1-\rho )\Delta$

$(1+\alpha )A+(1+\alpha +\beta )B+(1+\alpha +\beta +\delta )C=\Delta$

This leads to the augmented matrix:

$\left[ \begin{array}{cccc} 1 & 1 & 1 & (1-\rho )\Delta \\ (1+\alpha ) & (1+\alpha +\beta ) & (1+\alpha +\beta +\delta ) & \Delta\end{array}\right]$

Perform the following row operation: $R_{2}\rightarrow R_{2}-(1+\alpha )R_{1}$

$\left[ \begin{array}{cccc} 1 & 1 & 1 & (1-\rho )\Delta \\ 0 & \beta & \beta +\delta & \Delta -(1+\alpha )(1-\rho )\Delta \end{array} \right] =\left[ \begin{array}{cccc} 1 & 1 & 1 & (1-\rho )\Delta \\ 0 & \beta & \beta +\delta & \Delta (-\alpha +\rho +\alpha \rho ) \end{array} \right] =$

$\left[ \begin{array}{cccc} 1 & 1 & 1 & (1-\rho )\Delta \\ 0 & \beta & \beta +\delta & \Delta (\rho -\alpha (1-\rho )) \end{array} \right]$

Now perform $R_{2}\rightarrow \frac{1}{\beta }R_{2}$

$\left[ \begin{array}{cccc} 1 & 1 & 1 & (1-\rho )\Delta \\ 0 & 1 & 1+\frac{\delta }{\beta } & \Delta \frac{1}{\beta }(\rho -\alpha (1-\rho )) \end{array} \right]$

Now perform $R_{1}\rightarrow R_{1}-R_{2}$

$\left[ \begin{array}{cccc} 1 & 0 & -\frac{\delta }{\beta } & (1-\rho )\Delta -\Delta \frac{1}{\beta } (\rho -\alpha (1-\rho )) \\ 0 & 1 & 1+\frac{\delta }{\beta } & \Delta \frac{1}{\beta }(\rho -\alpha (1-\rho )) \end{array} \right]$

So our solution is :

$\left[ \begin{array}{c} A \\ B \\ C \end{array} \right] =\left[ \begin{array}{c} \Delta \frac{1}{\beta }(\beta (1-\rho )-\rho +\alpha (1-\rho ))+\frac{\delta }{\beta }u \\ \Delta \frac{1}{\beta }(\rho -\alpha (1-\rho ))-(1+\frac{\delta }{\beta })u \\ u \end{array} \right]$

Where $u$ is some real number $u\geq 0$

Note that $\Delta > 0$ and $0< \rho$ (as $u$ can be made as small as
desired)

Hence $\rho > \alpha (1-\rho )\rightarrow \frac{\rho }{1-\rho }> \alpha$

Now for the $A$ term:

$\beta (1-\rho )-\rho +\alpha (1-\rho )>0$
$\beta (1-\rho ) > \rho -\alpha (1-\rho )$

$\beta >\frac{\rho -\alpha (1-\rho )}{(1-\rho )}=\frac{\rho }{1-\rho } -\alpha >0$

Now choose $\delta >0$ and

$\Delta (\frac{(\rho -\alpha (1-\rho ))}{\beta +\delta })>u>0$

And convexity is guaranteed because the coefficients are all positive.

So now we can “piece together” as many of these splines as needed by matching the derivatives at the end points.

Next: now that we have something that works we should say something about how closely this “Frobenius polynomial” approximates the knot that it misses. That will be the subject for a subsequent post.

## December 9, 2011

### Striking a balance between precision and being intelligible

Ok, what do we mean by: $x + 2 = 1$? Now, what do we mean by $(A+B)x + (B-A) = 1$? Of course, the answer is “it depends”. The most common use of the first “equation” is “find the real number $x$ such that that number added to 2 equals 1.” In the second case, the most common use is “find real numbers $A, B$ such that this equation is true for all real $x$.

In short, we are using the equal sign very differently: in the first case we are using it as the equivalence relation in the field of real numbers. In the second case, we are really talking about vector space equivalence.

We see this multiple use in calculus all the time; for example $\int \int_{A} f dx dy = \int \int_{A} f dy dx$ but $\int \int_{A} f dx\wedge dy = -\int \int_{A} f dy\wedge dx$ Of course, the first is the usual non-oriented integral that we talk about in calculus courses (absolute values of the Jacobians!) and the latter is the oriented integral that we use for 2-forms, which, when you think about it, is the logical extension of the usual calculus I definite integral.

There are certainly more examples.

What got me to thinking about this was an office hour encounter I had with a numerical methods student (a good student who is doing solid work in the course). We were talking about various methods of solving the matrix problem $AX = B$ where $X$ is a column vector of variables and $B$ is the “answer” vector of numbers. We were discussing the number of operations (multiplications/divisions and additions/subtractions) required to obtain a solution if we had that $A = LDU$ where $D$ was a diagonal matrix with non-zero entries, $L, U$ are lower and upper triangular matrices (respectively) with 1’s on the diagonal.

She kept on being off by a peculiar factor on the multiplication count.

Eventually we figured out the problem. When we converted the matrix equations to equations, she was counting the matrix entry multiplied by the unsolved for variables as a multiplication. Why? Well, once we solved for the variable we then counted operations with it AFTER it had been “solved for”. Example: given $a_{1,1}x_1 +a_{1,2}x_2 = 3, a_{2,2}x_2=5$ we don’t count the “coefficient times the variable” as a multiplication. But once we solve and obtain $x_2 = \frac{5}{a_{2,2}}$ we then count operations involving $x_2$. (of course, the diagonal elements are non-zero).

It is clear why we do this: prior to being solved for, the variables are really storage locations, and we are interested in counting the numerical operations that can contribute to round off error. But when we think about it, we are actually distinguishing between several types of multiplications: matrix multiplication, scalar multiplication in a vector space between a vector and a scalar, and the scalar (numerical) multiplication.

However, explaining that in class might lead to confusion among the students; it is probably best to bring this up only if someone is confused about it.

The language of mathematics can be so subtle that sometimes it probably good pedagogy to speak a bit informally, at least to beginning students.

## October 10, 2011

### The Picard Iterates: how they can yield an interval of existence.

One of the many good things about my teaching career is that as I teach across the curriculum, I fill in the gaps of my own education.
I got my Ph. D. in topology (low dimensional manifolds; in particular, knot theory) and hadn’t seen much of differential equations beyond my “engineering oriented” undergraduate course.

Therefore, I learned more about existence and uniqueness theorems when I taught differential equations; though I never taught the existence and uniqueness theorems in a class, I learned the proofs just for my own background. In doing so I learned about the Picard iterated integral technique for the first time; how this is used to establish “uniqueness of solution” can be found here.

However I recently discovered (for myself) what thousands of mathematicians already know: the Picard process can be used to yield an interval of existence for a solution for a differential equation, even if we cannot obtain the solution in closed form.

The situation
I assigned my numerical methods class to solve $y'= t + y^2$ with $y(0)=1$ and to produce the graph of $y(t)$ from $t = 0$ to $t = 3$.

There is a unique solution to this and the solution is valid so long as the $t$ and $y$ value of the solution curve stays finite; note that $\frac{\partial }{\partial y}f(t,y)=2y.$

So, is it possible that the $y$ values for this solution become unbounded?

What follows are the notes I gave to my class.

Numeric output seems to indicate this, but numeric output is NOT proof.

To find a proof of this, let’s turn to the Picard iteration technique. We
know that the Picard iterates will converge to the unique solution.

$y_{0}=1$

$y_{1}=1+\int_{0}^{t}x+1dx=\frac{1}{2}t^{2}+t+1$

$y_{2}=1+\int_{0}^{t}x+(\frac{1}{2}x^{2}+x+1)^{2}dx=$

$y_{2}=\frac{1}{20}t^{5}+\frac{1}{4}t^{4}+\frac{2}{3}t^{3}+\frac{3}{2}t^{2}+t+1$

The integrals get pretty ugly around here; I used MATLAB to calculate the
higher order iterates. I’ll show you $y_{3}$

$y_{3}=\frac{49}{60}t^{5}+\frac{13}{12}t^{4}+\frac{4}{3}t^{3}+\frac{3}{2}t^{2}+t+1+O(t^{11})$

where $O(t^{11})$ means assorted polynomial terms from order 6 to 11.

Here is one more:

$y_{4}=\frac{17}{12}t^{5}+\frac{17}{12}t^{4}+\frac{4}{3}t^{3}+\frac{3}{2}t^{2}+t+1+O(t^{23})$

We notice some patterns developing here. First of all, the coefficient of
the $t^{n}$ term is staying the same for all $y_{m}$ where $m\geq n.$

That is tedious to prove. But what is easier to show (and sufficient) is
that the coefficients for the $t^{n}$ terms for $y_{n}$ all appear to be
bigger than 1. This is important!

Why? If we can show that this is the case, then our ”limit” solution $\sum_{k=0}^{\infty }a_{k}t^{k}$ will have an interval of convergence less than 1. Why? Substitute $t=1$ and see that the sum
diverges because the $a_{k}$ not only fail to converge to zero, but they
stay greater than 1.

So, can we prove this general pattern?

YES!

Here is the idea: $y_{m}=q(t)+p(t)$ where $p(t)$ is a polynomial of order $m$
and $q(t)$ is a polynomial whose terms all have order $m+1$ or greater.

Now put into the Picard process:

$y_{m+1}=1+\int_{0}^{t}((q(x)+p(x))^{2}+xdx=$

$1+\int_{0}^{t}((q(x)^{2}+2p(x)q(x))dx+\int_{0}^{t}(p(x))^{2}+xdx$

Note: all of the terms of $y_{m+1}$ of degree $m+1$ or higher must come from
the second integral.

Now by induction we can assume that all of the coefficients of the
polynomial $p(x)$ are greater than or equal to one.

When we ”square out” the polynomial, the coefficients of the new
polynomial will consist of the sum of positive numbers, each of which is
greater than 1. For the coefficients of the polynomial $(p(x))^{2}$ of
degree $m$ or higher: if one is interested in the $k^{\prime }th$
coefficient, one has to add at least $k+1$ numbers together, each of which
is bigger than one.

Now when one does the integration on these particular terms, one, of course,
divides by $k+1$ (power rule for integration). But that means that the
coefficient (after integration) is then greater than 1.

Here is a specific example:

Say $p(x)=a+bx+cx^{2}+dx^{3}$

Now $p(x)^{2}=a^{2}+(ab+ab)x+(ac+ca+b^{2})x^{2}+(ad+da+bc+cb)x^{3}+\{O(x^{6})\}$

Remember that $a,b,c,d$ are all greater than or equal to one.

Now $p(x)^{2}+x=a^{2}+(ab+ab+1)x+(ac+ca+b^{2})x^{2}+(ad+da+bc+cb)x^{3}+\{O(x^{6})\}$

Now when we integrate term by term, we get:

$\int_{0}^{t}(p(x))^{2}+xdx=a^{2}x+\frac{1}{2}(ab+ab+1)x^{2}+\frac{1}{3}(ac+ca+b^{2})x^{3}+\frac{1}{4}(ad+da+bc+cb)x^{4}+\{O(x^{7})\}$

But note that $ab+ab+1>2,ac+ca+b^{2}\geq 3,$ and $ad+da+bc+cb\geq 4$

Since all of the factors are greater than or equal to 1.

Hence in our new polynomial approximation, the order 4 terms or less all
have coefficients which are greater than or equal to one.

We can make this into a Proposition:

Proposition
Suppose $p(x)=\sum_{j=0}^{k}a_{j}x^{j}$ where each $a_{j}\geq 1.$

If $q(t)=\sum_{j=0}^{2k+1}b_{j}x^{j}=1+\int_{0}^{x}((p(t))^{2}+t)dt$

Then for all $j\leq k+1,b_{j}\geq 1.$

Proof. Of course, $b_{0}=1,b_{1}=a_{0}^{2},$ and $b_{2}=\frac{2a_{0}a_{1}+1}{2}$

Let $n\leq k+1.$

Then we can calculate: (since all of the $a_{n-1},a_{n-2},....a_{1}$ are
defined):

If $n$ is odd, then $b_{n}=\frac{1}{n}(2a_{0}a_{n-1}+2a_{1}a_{n-2}+...2a_{\frac{n-3}{2}}a_{\frac{n+1}{2}}+(a_{\frac{n-1}{2}})^{2})\geq \frac{1}{n}(2\ast \frac{n-1}{2}+1)=1$

If $n$ is even then $b_{n}=\frac{1}{n}(2a_{0}a_{n-1}+2a_{1}a_{n-2}+....2a_{\frac{n-1}{2}}a_{\frac{n+1}{2}})\geq \frac{1}{n}(2\ast \frac{n}{2})=1$

The Proposition is proved.

Of course, this possibly fails for $b_{n}$ where $n>k+1$ as we would fail to
have a sufficient number of terms in our sum.

Now if one wants a challenge, one can modify the above arguments to show that the coefficients of the approximating polynomial never get ”too big”; that is, the coefficient of the $k^{\prime }th$ order term is less than, say, $k$.

It isn’t hard to show that $b_{n}\leq \max a_{i}^{2}$ where $i\in\{0,1,2,...n-1\}$

Then one can compare to the derivative of the geometric series to show that
one gets convergence on an interval up to but not including 1.