Banded matrices and ordinary differential equations

$\def\qqand{\qquad\hbox{and}\qquad} \def\D{{\rm d}\,} \def\I{{\rm i}\,}$

Prelude

This is the first in a series of blog posts detailing the results of a NumFOCUS Small Development Grant, which funded a two-week coding spree by Mayeul d'Avezac of Imperial's Research Software Engineering team to add support for general backend storage types to BlockBandedMatrices.jl. This small change has a big impact on what is possible with the package: in particular, we can leverage it to solve partial differential equations (PDEs) on the GPU and across multiple processes using shared-memory arrays.

1. Ordinary differential equations and banded matrices

This first post outlines some background by describing how banded matrices can be used for solving ordinary differential equations (ODEs). In the next post we will see how this approach can be generalised for partial differential equations. We will cover the following topics:

  1. Ordinary differential equations and banded matrices
  2. Finite differences and banded matrices
  3. Fourier methods and banded matrices
  4. Why not sparse?

First, some definitions. A matrix A is banded if it has lower bandwidth l and upper bandwidth u, in the sense that A[k,j] = 0 if j-k > u or k-j > l. BandedMatrices.jl is a Julia package that implements the custom matrix type BandedMatrix for representing banded matrices. For example, we can construct a random banded matrix with bandwidths (l,u) = (1,2) as follows:

using LinearAlgebra, BandedMatrices

m, n = 5,6                              # create a 5 × 6 matrix
l, u = 1,2                              # with 1 subdiagonal and 2 superdiagonals
A = BandedMatrix(rand(1:9,m,n), (l,u))  # use random integers between 1 and 9
A
5×6 BandedMatrix{Int64,Array{Int64,2},Base.OneTo{Int64}}:
 2  3  1  ⋅  ⋅  ⋅
 3  9  2  4  ⋅  ⋅
 ⋅  6  5  3  1  ⋅
 ⋅  ⋅  2  6  4  8
 ⋅  ⋅  ⋅  4  7  2

This is a thin wrapper around an (l + u + 1) × n matrix storing the non-zero entries of the matrix in a BLAS-like format:

A.data
4×6 Array{Int64,2}:
 0  0  1  4  1  8
 0  3  2  3  4  2
 2  9  5  6  7  0
 3  6  2  4  0  0

Here the columns of A.data correspond to the columns of A. The top-left and bottom-right corners of A.data fall outside the edges of A, so these entries are ignored: they are often initialised to 0, but may be arbitrary integers depending on what happened to be in memory when the matrix was created.
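To make the layout concrete, here is a small sketch (assuming the LAPACK-style band storage shown above) checking that each in-band entry A[k,j] is stored at A.data[u + k - j + 1, j]:

for j in 1:size(A, 2), k in max(1, j - u):min(size(A, 1), j + l)
    @assert A[k, j] == A.data[u + k - j + 1, j]   # in-band entries only
end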

Banded matrices have a number of applications, but we focus on linear ordinary differential equations. Consider a simple boundary value problem of the form

$\epsilon u^{(4)}(x) + v(x) u(x) = f(x) \qqand u(0) = u(1) = u'(0) = u'(1) = 0$

for $0 \leq x \leq 1$.

Standard solution techniques for ODEs include finite differences, finite elements, and spectral methods. Each of these reduces the ODE to a banded linear system. We now explain this for finite differences and spectral methods applied to our simple fourth-order example.

2. Finite differences and banded matrices

Our finite difference scheme begins by approximating the solution $u$ of the ODE by its values at $n$ points on an evenly spaced grid

$x_k = (k-1) h$

for $h = 1/(n-1)$ and $k = 1,2,\ldots,n$. The central difference approximation of the second derivative then has the form

$(u(x_{k-1})-2u(x_k) +u(x_{k+1}))/h^2 \approx u''(x_k)$

for $k = 2, \ldots, n-1$. We can view this as an (n-2)×n rectangular banded matrix with bandwidths (0,2), that is, every entry outside the diagonal and the first two super-diagonals is zero.

It is straightforward to construct central differences as a banded matrix¹:

function centraldifference(n)
    h = 1/(n-1)   # spacing between grid points
    l, u =  (0,2) # lower and upper bandwidths
    Δ = BandedMatrix{Float64}(undef, (n-2,n), (l,u))
    Δ[band(0)] .= 1/h^2     # set the diagonal band
    Δ[band(1)] .= -2/h^2    # set the super-diagonal band
    Δ[band(2)] .= 1/h^2     # set the second super-diagonal band
    Δ
end

centraldifference(10)
8×10 BandedMatrix{Float64,Array{Float64,2},Base.OneTo{Int64}}:
 81.0  -162.0    81.0      ⋅       ⋅       ⋅       ⋅       ⋅       ⋅     ⋅ 
   ⋅     81.0  -162.0    81.0      ⋅       ⋅       ⋅       ⋅       ⋅     ⋅ 
   ⋅       ⋅     81.0  -162.0    81.0      ⋅       ⋅       ⋅       ⋅     ⋅ 
   ⋅       ⋅       ⋅     81.0  -162.0    81.0      ⋅       ⋅       ⋅     ⋅ 
   ⋅       ⋅       ⋅       ⋅     81.0  -162.0    81.0      ⋅       ⋅     ⋅ 
   ⋅       ⋅       ⋅       ⋅       ⋅     81.0  -162.0    81.0      ⋅     ⋅ 
   ⋅       ⋅       ⋅       ⋅       ⋅       ⋅     81.0  -162.0    81.0    ⋅ 
   ⋅       ⋅       ⋅       ⋅       ⋅       ⋅       ⋅     81.0  -162.0  81.0

To approximate the fourth derivative, we apply two central difference approximations in a row. Since we have already lost two grid points, the second application is an (n-4)×(n-2) matrix. The product of banded matrices is again a banded matrix (albeit one with larger bandwidths), and BandedMatrices.jl takes care of the details for us:

fourthdifference(n) = centraldifference(n)[2:end-1,2:end-1]*centraldifference(n)
fourthdifference(10)
6×10 BandedMatrix{Float64,Array{Float64,2},Base.OneTo{Int64}}:
 6561.0  -26244.0   39366.0  -26244.0  …        ⋅         ⋅       ⋅ 
     ⋅     6561.0  -26244.0   39366.0           ⋅         ⋅       ⋅ 
     ⋅         ⋅     6561.0  -26244.0           ⋅         ⋅       ⋅ 
     ⋅         ⋅         ⋅     6561.0       6561.0        ⋅       ⋅ 
     ⋅         ⋅         ⋅         ⋅      -26244.0    6561.0      ⋅ 
     ⋅         ⋅         ⋅         ⋅   …   39366.0  -26244.0  6561.0
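As a quick check, the bandwidths of the product are the sums of the bandwidths of its factors; a one-line sketch using the bandwidths function from BandedMatrices.jl:

bandwidths(centraldifference(10))   # (0, 2), as constructed
bandwidths(fourthdifference(10))    # should give (0, 4): bandwidths add under multiplication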

We haven't discussed boundary conditions, but for brevity we note that, to first order, the conditions $u(0) = u'(0) = 0$ force the first two grid values to vanish (and similarly at the right endpoint), so imposing them is equivalent to dropping the first and last two columns of the discretisation and assuming those entries are zero. Putting everything together we get the banded discretisation L of the differential operator, using the quadratic potential $v(x) = 2(2x-1)^2 - 1$:

function fourthorderode(v, ε, n)
    D⁴ = fourthdifference(n)
    x = range(0; stop=1, length=n)      # create the grid
    V = BandedMatrix{Float64}(undef, (n-4,n), (-2,2))
    V[band(2)] .= v.(x[3:end-2])        # potential evaluated at restricted grid
    L = ε*D⁴ + V
    L[:, 3:end-2]                       # drop first and last two columns to impose boundary conditions
end

v = x -> 2*(2x-1)^2 -1                  # quadratic potential
fourthorderode(v, 0.001, 10)
6×6 BandedMatrix{Float64,Array{Float64,2},Base.OneTo{Int64}}:
  38.9833  -26.244     6.561      ⋅         ⋅         ⋅    
 -26.244    38.5882  -26.244     6.561      ⋅         ⋅    
   6.561   -26.244    38.3907  -26.244     6.561      ⋅    
    ⋅        6.561   -26.244    38.3907  -26.244     6.561 
    ⋅         ⋅        6.561   -26.244    38.5882  -26.244 
    ⋅         ⋅         ⋅        6.561   -26.244    38.9833

Thus we arrive at a banded linear system, which can be rapidly solved (and plotted):

f = x -> exp(x)                     # arbitrary right-hand side
n = 100_000

ε = 1E-10
A = fourthorderode(v, ε, n)         # construct the differential operator
x = range(0; stop=1, length=n)
ũ = A \ f.(x[3:end-2])              # ũ is u[3:end-2]
u = [0; 0; ũ; 0; 0]                 # add in the entries we dropped

using Plots  # Load Plots for plotting
plot(x, u; label="Solution", legend=:bottomleft)
plot!(x, 100v.(x); label="Potential (rescaled)")

This has calculated the solution to roughly 4 digits of accuracy (when compared to ApproxFun's high-accuracy spectral method solution). The solution changes from exponential to oscillatory behaviour where the potential crosses the $x$ axis, which is typical behaviour for singularly perturbed problems like this, that is, problems where the highest-order derivative is multiplied by a small factor ε.
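We don't reproduce the ApproxFun comparison here, but a rough self-consistency check (a sketch reusing the variables above) is to halve the grid spacing and compare the two solutions at the points they share:

n₂ = 2n - 1                             # the finer grid contains every coarse grid point
A₂ = fourthorderode(v, ε, n₂)
x₂ = range(0; stop=1, length=n₂)
u₂ = [0; 0; A₂ \ f.(x₂[3:end-2]); 0; 0]
maximum(abs.(u - u₂[1:2:end]))          # difference at the shared grid points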

Banded solvers are extremely efficient. Here, \ is overloaded for banded matrices to use specialised solvers that have optimal $O(n)$ complexity; in this case it uses the default LU factorisation from LAPACK, but QR factorisations are also available via qr(A) \ f.(x[3:end-2]). Even with a million unknowns we can solve the resulting system in under a second:

using BenchmarkTools # Provides reliable timings

n = 1_000_000
A = fourthorderode(v, ε, n)     # construct the differential operator
x = range(0; stop=1, length=n)
@btime A \ f.(x[3:end-2]);
95.223 ms (19 allocations: 76.29 MiB)

Using dense linear algebra, this same calculation would have taken about half a year: dense LU factorisation requires roughly $n^3/3 \approx 3 \times 10^{17}$ floating point operations for $n = 10^6$.

Unfortunately, finite differences are unstable as $n \rightarrow \infty$: while the calculation of A \ f.(x[3:end-2]) is fast, the answer is wrong. This is because the matrix A is ill-conditioned: the condition number grows like $O(n^4)$, and a million to the power 4 is rather large. Spectral methods avoid this ill-conditioning, which motivates the next section.
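To see this growth concretely, here is a small experiment (using ε = 1 so that the fourth-derivative term dominates, and converting to a dense matrix so that we can call cond on modest values of n); the condition number should increase by roughly a factor of 16 each time n doubles:

for m in (100, 200, 400)
    κ = cond(Matrix(fourthorderode(v, 1.0, m)))   # dense condition number, small m only
    println("n = ", m, ": cond ≈ ", round(κ, sigdigits=3))
end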

3. Fourier methods and banded matrices

Banded matrices arise very naturally for finite differences: derivatives depend only on local information, so it stands to reason that a discretisation representing values on a grid only couples nearby grid points. Higher-order discretisations (whether of higher differential order or higher convergence order) use more and more grid points, and the bandwidth of the discretisation increases accordingly.

Spectral methods have, in some sense, infinite convergence order, and so when viewed as acting on a grid they are not banded. On the other hand, derivatives are also local in frequency space: $\D / \D x$ becomes multiplication by $\I k$. It stands to reason that we continue to have banded matrices as long as we work in frequency space².

As an example, consider the fourth order ODE

$\epsilon u^{(4)}(\theta) + v(\theta) u(\theta) = f(\theta) \qqand u(0) = u(\pi) = u''(0) = u''(\pi) = 0$

for $0 \leq \theta \leq \pi$. We will solve this ODE using sine series by approximating

$u(\theta) \approx \sum_{k=1}^n u_k \sin k \theta$

and expanding the variable coefficient $v(\theta)$ in cosine series

$v(\theta) \approx \sum_{k=0}^{M-1} v_k \cos k \theta.$

Note that the approximation degree $M$ of $v$ need not be the same as the approximation degree $n$ of $u$.

Second derivatives and multiplication become banded operators due to the following trigonometric identities:

${{\rm d}^2 \over \D \theta^2} \sin k \theta = -k^2 \sin k \theta, \qquad 2\cos p \theta \sin k \theta = \sin (k+p) \theta + \sin (k-p) \theta$
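Writing out the product term by term with the second identity shows where the bands come from:

$v_p \cos p \theta \sum_{k=1}^n u_k \sin k \theta = \sum_{k=1}^n {v_p u_k \over 2} \left( \sin (k+p) \theta + \sin (k-p) \theta \right),$

so each cosine coefficient $v_p$ contributes $v_p/2$ to the bands $\pm p$ of the multiplication matrix, giving bandwidths $(M-1, M-1)$.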

We thus arrive at a banded system of equations, where the bandwidth is dictated by the number of terms in the cosine expansion of $v$:

# Constructs the Laplacian (second derivative in the sine basis) as an `n × n` Diagonal matrix:
sinlaplacian(n) = Diagonal([-k^2 for k = 1:n])

# Constructs the multiplication of `v` given by its cosine coefficients
# where the bandwidth is dictated by `length(v)`
function cos_times_sin(v::AbstractVector{T}, n) where T
    m = length(v) # the upper/lower bandwidth is given by the no. of coefficients
    M = BandedMatrix{float(T)}(undef, (n,n), (m-1,m-1))
    isempty(v) && return M

    M[band(0)] .= v[1]
    for j = 2:m
        M[band(j-1)] .= v[j]/2
        M[band(1-j)] .= v[j]/2
    end

    M
end

# Puts these together to form the full operator
function sin_fourthorderode(v, ε, n)
    Δ = sinlaplacian(n)     # second derivative, diagonal in the sine basis
    D⁴ = Δ^2                # fourth derivative is its square
    V = cos_times_sin(v, n) # multiplication by the potential
    L = ε*D⁴ + V
end

# The operator is banded. Here the first argument is the vector of cosine
# coefficients of `v`, which we have chosen arbitrarily
sin_fourthorderode([2,1,3], 0.001, 10)
10×10 BandedMatrix{Float64,Array{Float64,2},Base.OneTo{Int64}}:
 2.001  0.5    1.5     ⋅      ⋅      ⋅      ⋅      ⋅      ⋅       ⋅ 
 0.5    2.016  0.5    1.5     ⋅      ⋅      ⋅      ⋅      ⋅       ⋅ 
 1.5    0.5    2.081  0.5    1.5     ⋅      ⋅      ⋅      ⋅       ⋅ 
  ⋅     1.5    0.5    2.256  0.5    1.5     ⋅      ⋅      ⋅       ⋅ 
  ⋅      ⋅     1.5    0.5    2.625  0.5    1.5     ⋅      ⋅       ⋅ 
  ⋅      ⋅      ⋅     1.5    0.5    3.296  0.5    1.5     ⋅       ⋅ 
  ⋅      ⋅      ⋅      ⋅     1.5    0.5    4.401  0.5    1.5      ⋅ 
  ⋅      ⋅      ⋅      ⋅      ⋅     1.5    0.5    6.096  0.5     1.5
  ⋅      ⋅      ⋅      ⋅      ⋅      ⋅     1.5    0.5    8.561   0.5
  ⋅      ⋅      ⋅      ⋅      ⋅      ⋅      ⋅     1.5    0.5    12.0
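As a quick sanity check, the bandwidths match the length of the cosine expansion:

bandwidths(sin_fourthorderode([2,1,3], 0.001, 10))   # should give (2, 2), i.e. length(v) - 1 in each direction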

This can be solved as rapidly as in the finite differences case:

using FillArrays                    # provides the lazy Zeros (and Fill) arrays used below

n = 1_000_000
L = sin_fourthorderode([2,1,3], 1E-12, n)
f = [3; 4; 5; Zeros(n-3)] # arbitrary right-hand side

u = qr(L) \ f    # Use QR to ensure stability

using ApproxFun  # ApproxFun makes it easy to evaluate sin series and thus plot

plot(Fun(SinSpace(), u); legend=:bottomleft, label="Solution")
plot!(500Fun(CosSpace(), [2.,1,3]); label="Potential (rescaled)")

Again, we see the switch from exponential to oscillatory behaviour where the potential changes sign.

The attraction of spectral methods is their fast convergence: once the oscillations are resolved, the approximation converges rapidly to within machine precision. This can be seen by plotting the decay of the coefficients:

# The .+ 1E-100 avoids a bug in Plots
plot(abs.(u) .+ 1E-100; yscale=:log10, xscale=:log10,
                     title="Absolute value of coefficients", legend=false)

Unlike finite differences, spectral methods are stable: while the condition number of the matrix still grows like $O(n^4)$, the matrix is diagonally dominant and the right-hand side is padded with exact zeros, which allows us to accurately solve the resulting systems as $n$ becomes large and $\epsilon$ becomes small. (Explaining this in detail would be a blog post in itself.)

4. Why not sparse?

We have shown that banded matrices provide a convenient and very efficient way to set up and solve ODEs. But there is an alternative, more established approach: use Julia's excellent support for general sparse matrices via the sparse command from SparseArrays.jl. What makes sparse so powerful is its generality: one can specify any sparsity pattern (including bandedness as a special case), which is then optimised for on the fly using graph-theoretical algorithms. We can easily construct either of the previous problems using sparse; for example, the fourth-order finite difference scheme is constructed as:

using SparseArrays

function sp_centraldifference(n)
    h = 1/(n-1)
    Is = [1:n-2; 1:n-2; 1:n-2]
    Js = [1:n-2; 2:n-1; 3:n]
    Vs = [Fill(1/h^2,n-2);  Fill(-2/h^2,n-2); Fill(1/h^2,n-2)]
    sparse(Is, Js, Vs)
end

sp_fourthdifference(n) = sp_centraldifference(n-2)*sp_centraldifference(n)

function sp_fourthorderode(v, ε, n)
    x = range(0; stop=1, length=n)
    V = sparse(1:n-4, 3:n-2, v.(x[3:end-2]), n-4, n)  # potential evaluated at the restricted grid
    D⁴ = sp_fourthdifference(n)
    L = ε*D⁴ + V
    L[:, 3:end-2]
end;

For solving linear systems, sparse uses SuiteSparse's LU factorisation via UMFPACK. SuiteSparse is an established and highly optimised software suite. One might argue that it is impossible to compete with SuiteSparse, and that any benefits of a specialised data structure are not worth the extra complexity in code design or package dependencies.

Let's test this viewpoint. In particular, we want to investigate the cost of the following operations, which are used in numerical methods for ODEs:

  1. Matrix-vector products.
  2. Triangular system solves.
  3. Construction of matrices.
  4. Calculating matrix factorizations.
  5. Solving linear systems.

Let's start with the bad news: matrix-vector products. Here the results are not very impressive for banded matrices³, with the performance only barely surpassing sparse:

n = 1_000_000
L = fourthorderode(v, 1.0, n)       # banded matrix
Lₛ = sp_fourthorderode(v, 1.0, n)   # sparse matrix
x = randn(size(L,2))

BLAS.set_num_threads(1) # set 1 thread for consistent tests

@btime L*x;  # banded matrix*vector, using BLAS.gbmv!
4.635 ms (3 allocations: 7.63 MiB)
@btime Lₛ*x; # sparse matrix*vector
6.753 ms (2 allocations: 7.63 MiB)

For large bandwidths the difference is more noticeable, with banded matrices achieving roughly twice the performance. Here's an example with 200 upper and lower bands:

M = sin_fourthorderode(randn(200), 1.0, 200_000)
Mₛ = sparse(M)
x = randn(200_000)
@btime M*x;
29.865 ms (3 allocations: 1.53 MiB)
@btime Mₛ*x;
62.018 ms (2 allocations: 1.53 MiB)

Triangular solves show a more substantial difference, with the large bandwidth case showing a roughly 4x speed up:

x = randn(size(M,1))
@btime UpperTriangular(M) \ x;
24.921 ms (7 allocations: 1.53 MiB)
@btime UpperTriangular(Mₛ) \ x;
95.734 ms (3 allocations: 1.53 MiB)

These previous experiments demonstrate some benefit to banded matrices, but perhaps not enough to warrant the trade-offs in using a non-standard library package. The next three examples tell a more convincing story. Consider the basic task of constructing matrices. Even the simplest example of constructing a central difference formula shows that banded matrices are over 10x faster than their sparse analogue:

n = 1_000_000
@btime centraldifference(n);            # banded matrix
12.665 ms (9 allocations: 22.89 MiB)
@btime sp_centraldifference(n);         # sparse matrix
155.872 ms (26 allocations: 183.11 MiB)

The extra cost for sparse arrays is due to a sorting of the indices, as explained in the documentation for SparseArrays.sparse!.

Now consider the matrix factorisation step. For the LU factorisation, there is a roughly 20x speedup for banded matrices, but more importantly, there is also a 20x reduction in memory usage:

@btime lu(L);   # LU factorisation of the banded matrix
56.994 ms (5 allocations: 61.04 MiB)
@btime lu(Lₛ);  # LU factorisation of the sparse matrix (UMFPACK)
1.311 s (83 allocations: 1.15 GiB)

QR factorisations and large bandwidth problems show a similar difference in speeds and memory usage.

The last example is really the nail in the coffin for sparse as an all-purpose, high-performance solver. For many problems we need to use the QR factorisation; this could be because we need more reliability than LU can provide, because we are solving a rectangular system, or because there is ill-conditioning in the problem. Unfortunately, QR for sparse matrices has catastrophic performance in this setting, even discounting the cost of the factorisation itself. Indeed, it appears not even to achieve optimal complexity:

banded_qr_time = Vector{Float64}()
sparse_qr_time = Vector{Float64}()
sparse_lu_time = Vector{Float64}()
Ns = 2 .^ (8:14)

for n in Ns
    L = fourthorderode(v, 1E-18, n)
    Lₛ = sparse(L)
    x = randn(size(L,2))
    F, Fₛ, Gₛ = qr(L), qr(Lₛ), lu(Lₛ)
    push!(banded_qr_time, @belapsed($F \ $x))
    push!(sparse_qr_time, @belapsed($Fₛ \ $x))
    push!(sparse_lu_time, @belapsed($Gₛ \ $x))
end


plot(Ns, banded_qr_time; xscale=:log10, yscale=:log10, label="banded QR",
                         xlabel="n", ylabel="time (s)", legend=:topleft)
plot!(Ns, sparse_qr_time; label="sparse QR")
plot!(Ns, sparse_lu_time; label="sparse LU")

We include the sparse LU timings for comparison, and to show that even though it achieves (roughly) the right complexity, it is still not as efficient as banded QR.

A one-off experiment like this may seem like a fairly niche issue. But these are exactly the type of equations that arise from, say, a spectral method for PDEs using a tensor product of Fourier bases (or Fourier and Chebyshev bases). In Dedalus (an easy-to-use Python package for spectral methods for PDEs) these discretisations need to be performed for each Fourier mode, and extreme memory usage can be devastating in terms of both computational efficiency and software management⁴.

5. What's next?

Banded matrices have the advantage that it is easy to use non-standard storage arrays. We can swap out the underlying Matrix storage of A.data with, for example, a CuMatrix from CuArrays.jl to construct a CUDA matrix living on the GPU. Or we can use a SharedMatrix to store the matrix in shared memory and allow multiple processes to access it in parallel, thereby speeding up calculations. For partial differential equations, we can generalise banded matrices to banded-block-banded matrices that retain many of the benefits of banded matrices and also support general backends. These will be topics of future posts.

¹ There are many equivalent ways of constructing banded matrices, and we have chosen one of the more verbose. For example, we could also have constructed this matrix using diagm-inspired syntax:

n = 10; h = 1/(n-1);
BandedMatrix{Float64}((0 => fill(1/h^2, n-2), 1 => fill(-2/h^2, n-2), 2 => fill(1/h^2, n-2)), (n-2,n))
8×10 BandedMatrix{Float64,Array{Float64,2},Base.OneTo{Int64}}:
 81.0  -162.0    81.0      ⋅       ⋅       ⋅       ⋅       ⋅       ⋅     ⋅ 
   ⋅     81.0  -162.0    81.0      ⋅       ⋅       ⋅       ⋅       ⋅     ⋅ 
   ⋅       ⋅     81.0  -162.0    81.0      ⋅       ⋅       ⋅       ⋅     ⋅ 
   ⋅       ⋅       ⋅     81.0  -162.0    81.0      ⋅       ⋅       ⋅     ⋅ 
   ⋅       ⋅       ⋅       ⋅     81.0  -162.0    81.0      ⋅       ⋅     ⋅ 
   ⋅       ⋅       ⋅       ⋅       ⋅     81.0  -162.0    81.0      ⋅     ⋅ 
   ⋅       ⋅       ⋅       ⋅       ⋅       ⋅     81.0  -162.0    81.0    ⋅ 
   ⋅       ⋅       ⋅       ⋅       ⋅       ⋅       ⋅     81.0  -162.0  81.0

² For non-periodic functions one can still achieve banded systems and spectral convergence using ultraspherical polynomials. In this case the notion of "frequency space" changes to a more general "coefficient space", that is, representing the unknowns as coefficients in an expansion.

³ These tests use Julia built from source with MKL for the BLAS routines. The timings are substantially worse using OpenBLAS, which is the default for Julia. In fact there are numerous bugs in OpenBLAS's banded operations, including segmentation faults for large matrices and debug statements in release code. These examples are not meant to disparage OpenBLAS; rather, they demonstrate just how much general sparse solvers have taken over: I suspect BandedMatrices.jl is the only user of some of the more specialised banded BLAS calls like tbmv!.

⁴ After discovering this example with Geoff Vasil and Keaton Burns of the Dedalus team, they decided to develop a Python banded matrix package.

