
(1)

Lecture 1:

Introduction and NLA Review

NMNV468: Numerical Linear Algebra for data science and informatics

February 21, 2020

(2)

Data Science and Informatics

• Huge amounts of data collected in modern society

• Currently generating and storing more data than we know what to do with!

• Data mining and analysis: extracting meaningful information from large data sets

• Huge number of application areas: business, bioinformatics, medicine, information retrieval, search engines, scientific applications

• Truly interdisciplinary: computer science, statistics, linear algebra, optimization

• Numerical linear algebra is a basic ingredient in many techniques

• Course goal: give an introduction to the mathematical and numerical approaches behind data analysis and informatics applications

• Not primarily a numerical linear algebra course, but rather an application of numerical linear algebra to these areas

(3)

Numerical Linear Algebra vs. Data Analysis

Slide from Yousef Saad

(4)

Data Analytics in Scientific Discovery

Data analytics and machine learning increasingly important in scientific discovery

Event identification, correlation in high-energy physics

Climate simulation validation using sensor data

Determine patterns and trends from astronomical data

Genetic sequencing

(5)

Data Analytics used in Computational Science

(6)

Logistics of the Course

• Class format:

• mix of lectures, exercises/demonstrations

• no lectures/exercises after April 3.

• Class website: https://dl1.cuni.cz/enrol/instances.php?id=9187

• Change class time?

• Tuesday 9:00-10:30 and 10:40-12:10?

• Thursday 15:40-17:10 and 17:20-18:50?

• Monday 15:40-17:10 and 17:20-18:50?

• Topic outline:

• See syllabus and schedule

(7)

Today's Outline and Goals

• Goals: review what you should know from a basic numerical linear algebra course

Topics for Today*

• Vectors and Matrices

• Examples arising in DS&I applications

• Basic Linear Algebra Review

• Linear Systems and LU Decomposition

• Least Squares Problems

• Orthogonality

• QR Decomposition

• Singular Value Decomposition

*May go into next week depending on familiarity of students with the topics

(8)

Vectors and Matrices

Think of a matrix as a rectangular array of data, where elements are scalar real numbers

$$A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix} \in \mathbb{R}^{m \times n}$$

We can also think of a matrix as a set of n vectors in ℝ𝑚...

(9)

Example - Term-Document Matrix

Term-document matrices are used in information retrieval.

Consider the five documents below, with keywords (or terms) in bold:

Doc 1: The Google matrix P is a model of the internet

Doc 2: $P_{ij}$ is nonzero if there is a link from web page j to i

Doc 3: The Google matrix is used to rank all Web pages

Doc 4: The ranking is done by solving a matrix eigenvalue problem

Doc 5: England dropped out of the top 10 in the FIFA ranking

If we count the frequency of terms in each document, we get the term-document matrix shown on the next slide.

(10)

Example - Term-Document Matrix

Each document is represented by a vector (a point in ℝ10) and we have the term-document matrix

$$A = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 0 & 0
\end{pmatrix} \in \mathbb{R}^{10 \times 5}$$

(11)

Example - Term-Document Matrix

• Say we want to find all documents relevant to the query "ranking of web pages". This query has the query vector:

$$q = (0\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0\ \ 1\ \ 1\ \ 1)^T \in \mathbb{R}^{10}$$

• The task of information retrieval is the following mathematical problem:

Find the columns of A that are close to the vector q.

• Common that m is large ($\approx 10^6$); common that A is sparse

• Many linear algebra techniques used in solving this problem
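As a small illustration of this query-matching problem, here is a sketch in MATLAB/Octave that builds the 10x5 term-document matrix above (rows ordered alphabetically by term, matching the matrix on the previous slide) and ranks the documents by the cosine of the angle between each column and q. The cosine measure is one common choice in information retrieval, used here as an assumption.

    % Term-document matrix from the example (10 terms x 5 documents),
    % terms in alphabetical order: eigenvalue, England, FIFA, Google,
    % internet, link, matrix, page, rank, Web.
    A = [0 0 0 1 0;
         0 0 0 0 1;
         0 0 0 0 1;
         1 0 1 0 0;
         1 0 0 0 0;
         0 1 0 0 0;
         1 0 1 1 0;
         0 1 1 0 0;
         0 0 1 1 1;
         0 1 1 0 0];
    q = [0 0 0 0 0 0 0 1 1 1]';        % query "ranking of web pages"

    % Cosine of the angle between q and each document (column of A)
    colnorms = sqrt(sum(A.^2))';
    cosines  = (A'*q) ./ (colnorms * norm(q));
    [~, order] = sort(cosines, 'descend');   % most relevant documents first
    disp([order cosines(order)])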

(12)

Example - Pattern Recognition

Also often useful to consider a matrix as a linear operator. Denote the columns of A by

$$a_{\cdot j} = \begin{pmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{pmatrix}, \quad j = 1, 2, \ldots, n, \qquad \text{so that } A = (a_{\cdot 1} \ \cdots \ a_{\cdot n}).$$

Then the linear transformation is defined by

$$y = Ax = (a_{\cdot 1} \ \cdots \ a_{\cdot n}) \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \sum_{j=1}^{n} x_j a_{\cdot j}.$$

(13)

Example - Pattern Recognition

• Model problem in pattern recognition is classification of handwritten digits

• Image of one digit is a 16x16 matrix of numbers, representing grayscale

• Can also be represented as a vector in ℝ256

• A set of n digits can then be represented by a matrix 𝐴 ∈ ℝ256×𝑛 and the columns of 𝐴 span a subspace of ℝ256

• We can compute an approximate basis of this subspace using the SVD (singular value decomposition) of A, 𝐴 = 𝑈Σ𝑉𝑇

(14)

Example - Pattern Recognition

• Let's say we have a vector b representing some unknown digit. We now want to automatically classify the digit as 0-9.

• Given a set of approximate basis vectors representing, e.g., 3s, $u_1, \ldots, u_k$, we can determine whether b is a 3 by checking if there is a linear combination of the basis vectors such that

$$\left\| b - \sum_{j=1}^{k} x_j u_j \right\| \ \text{is small.}$$
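A minimal sketch of this classification test, assuming a hypothetical training matrix A3 whose columns are images of one digit class; random data stands in for real images here, and the threshold/decision rule is left out.

    % Residual of an unknown digit b against the SVD basis of one class (a sketch).
    rng(6);
    A3 = randn(256, 100);        % hypothetical training images of, e.g., the digit 3
    b  = randn(256, 1);          % hypothetical unknown digit to classify

    k  = 10;                     % number of basis vectors kept
    [U, ~, ~] = svd(A3, 'econ');
    Uk = U(:, 1:k);

    % Since Uk has orthonormal columns, the best coefficients are x = Uk'*b,
    % and the least squares residual is b - Uk*(Uk'*b).
    rel_residual = norm(b - Uk*(Uk'*b)) / norm(b);
    fprintf('relative residual = %.3f\n', rel_residual);
    % Classify b as the class whose basis gives the smallest relative residual.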

(15)

Example - PageRank

• Search engines extract information from web pages available on the internet

• The core of the Google search engine is a matrix computation

• The "Google matrix" P is of the order of billions (close to the total number of web pages on the internet)

• Matrix structure based on link structure of the web: 𝑃𝑖𝑗 is nonzero if there is a link from webpage j to i

$$P = \begin{pmatrix}
0   & 1/3 & 0 & 0 & 0   & 0   \\
1/3 & 0   & 0 & 0 & 0   & 0   \\
0   & 1/3 & 0 & 0 & 1/3 & 1/2 \\
1/3 & 0   & 0 & 0 & 1/3 & 0   \\
1/3 & 1/3 & 0 & 0 & 0   & 1/2 \\
0   & 0   & 1 & 0 & 1/3 & 0
\end{pmatrix}$$

• The link graph matrix constructed so that cols and rows represent webpages and nonzero elements in col j denote outlinks from webpage j.

• For search engine to be useful, it must measure quality of webpages by some ranking; this ranking is done by solving an eigenvalue problem

(Figure: the corresponding link graph on six web pages, nodes 1-6.)
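The sketch below feeds the link matrix P from this slide to a plain power iteration to approximate the dominant eigenvector, which gives the ranking. Note that column 4 is zero (page 4 has no outlinks); the actual Google matrix repairs such dangling pages and adds a damping term, both of which this sketch omits.

    % Link matrix from the slide (column j holds the outlinks of page j)
    P = [ 0    1/3  0  0  0    0;
          1/3  0    0  0  0    0;
          0    1/3  0  0  1/3  1/2;
          1/3  0    0  0  1/3  0;
          1/3  1/3  0  0  0    1/2;
          0    0    1  0  1/3  0 ];

    % Power iteration for the dominant eigenvector (the ranking vector)
    r = ones(6, 1) / 6;
    for iter = 1:200
        r = P * r;
        r = r / norm(r, 1);
    end
    [~, ranking] = sort(r, 'descend');
    disp([ranking r(ranking)])       % pages ordered from highest to lowest rank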

(16)

Floating Point Computations

• Modeling cost of a computation

• Often by counting flops (even just highest order terms)

• This can often be misleading - amount of data movement and data access patterns often have more influence than number of flops

• especially true in large-scale settings

(17)

The Cost of a Computation

• Algorithms have two costs: communication and computation

• Communication : moving data between levels of memory hierarchy (sequential), between processors (parallel)

• On today’s computers, communication is expensive, computation is cheap

– Flop time << 1/bandwidth << latency

– True at all scales – from smartphones to supercomputers

– Communication bottleneck a barrier to achieving time and energy scalability

(Figures: sequential and parallel communication.)

(18)

Floating Point Rounding Errors

• We will assume that computations are done under the IEEE floating point standard

• A real number x cannot in general be represented exactly in a floating point system. Let fl[x] be the floating point number representing x. Then

$$fl[x] = x(1 + \epsilon)$$

for some $\epsilon$ satisfying $|\epsilon| \le u$, where $u$ is the unit roundoff of the system.

• E.g., in IEEE double precision, $u \approx 10^{-16}$.

Thus the relative error in the floating point representation of any real number x satisfies

$$\frac{|fl[x] - x|}{|x|} \le u.$$

(19)

Floating Point Rounding Errors

• Let $fl[x \odot y]$ be the result of a floating point arithmetic operation, where $\odot$ is $+$, $-$, $\times$, or $/$.

• Then, provided that $x \odot y \ne 0$,

$$\frac{|x \odot y - fl[x \odot y]|}{|x \odot y|} \le u,$$

or equivalently,

$$fl[x \odot y] = (x \odot y)(1 + \epsilon) \quad \text{for some } \epsilon \text{ satisfying } |\epsilon| \le u.$$

• When we estimate the error in the result of a computation in floating point arithmetic, we can think of it as a forward error.

• Alternatively, we could rewrite the above as

$$fl[x \odot y] = (x + e) \odot (y + f) \quad \text{for some numbers } e \text{ and } f \text{ that satisfy } |e| \le u|x|, \ |f| \le u|y|.$$

• I.e., $fl[x \odot y]$ is the exact result of the operation on slightly perturbed data.

• This is called backward error analysis.
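A quick MATLAB check of these statements; in MATLAB, eps is the gap between 1 and the next larger double, so the unit roundoff is u = eps/2.

    u = eps / 2;                      % unit roundoff of IEEE double, about 1.1e-16

    disp(1 + u   == 1)                % 1 (true): 1 + u rounds back to 1
    disp(1 + 2*u == 1)                % 0 (false): 2u is visible next to 1

    % Each operation is correctly rounded, but representation and rounding
    % errors show up in decimal arithmetic:
    disp(0.1 + 0.2 == 0.3)            % 0 (false)
    fprintf('%.20f\n', 0.1 + 0.2)     % 0.30000000000000004441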

(20)

Forward Error vs. Backward Error

(21)

Linear Algebra Review

(22)

Basic Linear Algebra

Matrix-Vector multiplication

$$y = Ax, \qquad y_i = \sum_{j=1}^{n} a_{ij} x_j, \quad i = 1, \ldots, m$$

$$y = Ax = (a_{\cdot 1} \ a_{\cdot 2} \ \cdots \ a_{\cdot n}) \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \sum_{j=1}^{n} x_j a_{\cdot j}$$

(23)

Basic Linear Algebra

Matrix-Matrix multiplication: $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$,

$$C = AB, \qquad c_{ij} = \sum_{s=1}^{k} a_{is} b_{sj}, \quad i = 1, \ldots, m, \ j = 1, \ldots, n$$

(24)

Inner Products and Vector Norms

• Definition and properties of a vector norm

• 1-norm, 2-norm, infinity norm

• Generalization: p-norms

• With norms we can introduce concepts of continuity and error in approximations of vectors

• Absolute error, relative error

• Equivalence of norms

• Vector norm inequalities table

• Distance measures commonly-used in data science

(25)

Matrix Norms

• Vector norms and corresponding operator norms

• Matrix norm properties

• Fundamental inequalities

• Examples: 2-norm, 1-norm, infinity norm, Frobenius norm

• Matrix norm equivalence inequalities table

(26)

Linear Independence

Given a set of vectors $v_j$, $j = 1, \ldots, n$ in $\mathbb{R}^m$, $m \ge n$, consider the set of linear combinations

$$\mathrm{span}\{v_1, \ldots, v_n\} = \left\{ y \mid y = \sum_{j=1}^{n} \alpha_j v_j \right\}$$

The vectors $v_j$, $j = 1, \ldots, n$ are called linearly independent when

$$\sum_{j=1}^{n} \alpha_j v_j = 0 \quad \text{if and only if} \quad \alpha_j = 0 \text{ for } j = 1, \ldots, n.$$

A set of 𝑚 linearly independent vectors in ℝ𝑚 is called a basis. Any vector in ℝ𝑚 can be expressed as a linear combination of the basis vectors.

If the vectors 𝑣𝑗, 𝑗 = 1, … , 𝑛 are linearly dependent, then some 𝑣𝑘 can be written as a linear combination of the rest.

In practice, rarely have exactly linearly dependent vectors. Often have almost linearly dependent vectors.

(27)

Rank of a Matrix

• Rank = maximum number of linearly independent column vectors in 𝐴

• Also: rank = number of nonzero singular values

• Rank-1 matrix can be written as the outer product matrix 𝑥𝑦𝑇

• A square matrix 𝐴 ∈ ℝ𝑛×𝑛 with rank 𝑛 is called nonsingular and has an inverse 𝐴−1 satisfying

𝐴𝐴−1 = 𝐴−1𝐴 = 𝐼

• If we multiply linearly independent vectors by a nonsingular matrix, then the vectors remain linearly independent

(28)

Numerical Rank of a Matrix

Even though a matrix may be full rank, it may be so close to being rank deficient that it behaves as such when we are computing in finite precision.

Let $A \in \mathbb{R}^{m \times n}$ and let

$$\delta(A) = \epsilon \cdot \max(m, n) \cdot 2^{\lfloor \log_2(\sigma_1) \rfloor},$$

where $\sigma_1 = \|A\|_2$ is the largest singular value of A.

The numerical rank $r(A)$ is then defined as

$$r(A) = \#\{\, \sigma_i > \delta(A) : 1 \le i \le \min(m, n) \,\}$$

(i.e., the number of singular values of A that are greater than $\delta(A)$).

This is what is used in Matlab. We could also use slightly different definitions.
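A sketch of the numerical rank computation via the SVD, using the MATLAB-style tolerance max(m,n)*eps(norm(A)); the test matrix (exact rank 5 plus tiny noise) is of course just an illustration.

    rng(0);
    m = 50; n = 30; k = 5;
    A = randn(m, k) * randn(k, n);      % exact rank k
    A = A + 1e-15 * randn(m, n);        % tiny noise: generically full rank,
                                        % but numerically still rank k
    s   = svd(A);
    tol = max(m, n) * eps(norm(A));     % delta(A) as used by MATLAB's rank()
    num_rank = sum(s > tol);
    fprintf('numerical rank = %d, rank(A) = %d\n', num_rank, rank(A));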

(29)

Linear Systems and

Least Squares

(30)

Linear Systems

• Want to solve 𝐴𝑥 = 𝑏 where 𝐴 ∈ ℝ𝑛×𝑛 is square and nonsingular

• Existence and uniqueness: Assume that 𝐴 ∈ ℝ𝑛×𝑛 is nonsingular. Then for any right-hand-side b, the linear system Ax=b has a unique solution

• Commonly solved using Gaussian elimination with partial pivoting

• partial pivoting: In the ith step, reorder the rows of the matrix so that the element of largest magnitude in the ith column is moved to the (i,i) position.

• equivalent to multiplying on the left by a permutation matrix P

• In each step, elimination zeros the elements below the diagonal in the current column (the first column of the active submatrix)

• Ex: In the first step, we compute $A^{(1)} = L_1^{-1} P_1 A$, where $L_1$ is a Gauss transformation

$$L_1 = \begin{pmatrix} 1 & 0 \\ m_1 & I \end{pmatrix}, \qquad m_1 = \begin{pmatrix} m_{21} \\ \vdots \\ m_{n1} \end{pmatrix}$$

We continue until we have an upper triangular matrix, and then we can solve the resulting triangular system by back substitution.

(31)

LU Decomposition

• The process of Gaussian elimination produces the LU decomposition of a matrix:

$$L_1 A^{(1)} = P_1 A, \quad \ldots, \quad LU = PA$$

• Theorem: Any nonsingular square matrix 𝐴 can be decomposed into 𝑃𝐴 = 𝐿𝑈

where 𝑃 is a permutation matrix, 𝐿 is lower triangular with ones on the diagonal, and 𝑈 is an upper triangular matrix.

• Flops: computing the LU decomposition takes approximately 2𝑛3/3 flops

• In the kth step of Gaussian elimination, one operates on an $(n-k+1) \times (n-k+1)$ submatrix, and for each element in that submatrix we do one multiplication and one addition. Thus the total number of flops is approximately

$$2 \sum_{k=1}^{n-1} (n - k + 1)^2 \approx \frac{2n^3}{3}.$$
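A small sketch of the factorization and the two triangular solves, using MATLAB's built-in lu (which performs GE with partial pivoting) on a random matrix.

    rng(1);
    n = 5;
    A = randn(n);
    b = randn(n, 1);

    [L, U, P] = lu(A);                  % P*A = L*U, L unit lower triangular
    fprintf('||P*A - L*U|| = %.2e\n', norm(P*A - L*U));

    y = L \ (P*b);                      % forward substitution
    x = U \ y;                          % backward substitution
    fprintf('residual ||b - A*x|| = %.2e\n', norm(b - A*x));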

(32)

The Symmetric Positive Definite Case

• In the case the matrix is symmetric (𝐴 = 𝐴𝑇) and positive definite (all eigenvalues are positive), then:

• The LU decomposition can be computed without pivoting

• The decomposition becomes symmetric, and so can be done with half the work

Theorem: Any symmetric positive definite matrix A has a decomposition 𝐴 = 𝐿𝐷𝐿𝑇

where L is lower triangular with ones on the diagonal and D is a diagonal matrix with positive diagonal elements.

(33)

Cholesky Decomposition

Since D has positive diagonal elements, we can write

$$D^{1/2} = \mathrm{diag}(\sqrt{d_1}, \ldots, \sqrt{d_n}),$$

and then we get

$$A = LDL^T = (LD^{1/2})(D^{1/2}L^T) = U^T U,$$

where U is an upper triangular matrix.

This variant is called the Cholesky decomposition.

Storage requirements: we only need to store one triangular matrix: $n(n+1)/2$ elements.

Flops: half the work of the LU decomposition, approximately $n^3/3$ flops.
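A sketch of the Cholesky decomposition A = U^T U and its use for solving a linear system by two triangular solves; the SPD test matrix is constructed artificially.

    rng(2);
    n = 6;
    B = randn(n);
    A = B'*B + n*eye(n);                % symmetric positive definite by construction
    b = randn(n, 1);

    U = chol(A);                        % upper triangular, A = U'*U
    fprintf('||A - U''*U|| = %.2e\n', norm(A - U'*U));

    x = U \ (U' \ b);                   % two triangular solves
    fprintf('residual ||b - A*x|| = %.2e\n', norm(b - A*x));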

(34)

Perturbation Theory and Condition Number

The condition number of a nonsingular matrix A is defined as

$$\kappa(A) = \|A\|\,\|A^{-1}\|,$$

where $\|\cdot\|$ is any operator norm.

We can also specify a norm, e.g.,

$$\kappa_2(A) = \|A\|_2\,\|A^{-1}\|_2.$$

The condition number is important. It tells us how much the solution 𝒙 to a linear system can change when the matrix 𝑨 and/or right-hand-side 𝒃 are perturbed.

(35)

Perturbed Linear Systems

Let 𝛿𝐴 be some perturbation to the matrix A and let 𝛿𝑏 be a perturbation to the right-hand-side b.

Assume that A is nonsingular and that $\|\delta A\|\,\|A^{-1}\| = r < 1$.

Then the matrix $A + \delta A$ is nonsingular and

$$\|(A + \delta A)^{-1}\| \le \frac{\|A^{-1}\|}{1 - r}.$$

The solution y of the perturbed system

$$(A + \delta A)y = b + \delta b$$

satisfies

$$\frac{\|y - x\|}{\|x\|} \le \frac{\kappa(A)}{1 - r}\left(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|}\right).$$

A matrix for which 𝜅 𝐴 is large is called ill-conditioned.

A linear system with an ill-conditioned matrix is sensitive to perturbations in the data (i.e., the matrix and the right-hand-side).

(36)

Rounding Errors in GE

Recall:

$$fl[x] = x(1 + \epsilon), \quad |\epsilon| \le u$$

When representing the elements of a matrix A and a vector b in a floating point system, errors also arise, e.g.:

$$fl[a_{ij}] = a_{ij}(1 + \epsilon_{ij}), \quad |\epsilon_{ij}| \le u$$

Therefore we can write

$$fl[A] = A + \delta A, \qquad fl[b] = b + \delta b, \qquad \text{where} \qquad \|\delta A\| \le u\|A\|, \quad \|\delta b\| \le u\|b\|.$$

If no other floating point errors arise during solving Ax = b, we have that the computed solution $\bar{x}$ satisfies

$$(A + \delta A)\bar{x} = b + \delta b.$$

(37)

Rounding Errors in GE

Using perturbation theory, we can estimate the error in $\bar{x}$:

$$\frac{\|\bar{x} - x\|}{\|x\|} \le \frac{\kappa(A)}{1 - r}\left(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|}\right) = \frac{\kappa(A)}{1 - r}\,(2u),$$

provided that $r = u\,\kappa(A) < 1$.

Note that this is just error due to storing A and b in finite precision.

We of course also make errors in computing GE in finite precision.

(38)

Rounding Errors in GE

Theorem: Assume we use a floating point system with unit roundoff u. Let $\bar{L}$ and $\bar{R}$ be the triangular factors obtained from GE with partial pivoting applied to A. Further, assume that $\bar{x}$ is computed using the forward and backward substitutions

$$\bar{L}\bar{y} = Pb, \qquad \bar{R}\bar{x} = \bar{y}.$$

Then $\bar{x}$ is the exact solution of a system

$$(A + \delta A)\bar{x} = b, \qquad \|\delta A\| \le k(n)\,g_n u\,\|A\|, \qquad g_n = \frac{\max_{i,j,k}|a_{ij}^{(k)}|}{\max_{i,j}|a_{ij}|},$$

where $k(n)$ is a degree-3 polynomial and the $a_{ij}^{(k)}$ are the elements computed in step $k-1$ of the elimination.

$g_n$ is called the pivot growth factor.

(39)

Rounding Error in GE

𝑔𝑛 depends on the growth of the matrix elements during GE and not explicitly on the magnitude of the multipliers.

𝑔𝑛 can be computed a posteriori (we can not compute it a priori; we must do the elimination to find it).

We can bound it a priori, however. In general, we can only say that $g_n \le 2^{n-1}$.

This is attainable in specifically constructed examples, but it is almost never this high in practice.

• In the average case, it is closer to $n^{2/3}$ when partial pivoting is used [Trefethen and Schreiber, 1990]

Bound is lower for specific cases, e.g., for SPD matrices, 𝑔𝑛 = 1.

(40)

What is the point?

• The bound is often very pessimistic in practice, so what is the point?

• In general, these types of a priori rounding error analyses are NOT meant to give tight bounds on the error, but rather to expose potential instabilities of the algorithms and to allow for the comparison of different algorithms.

(41)

Banded Matrices

• Band matrix: the nonzeros are concentrated around the main diagonal, and elements far from the diagonal are 0

• More precisely, A is banded with lower bandwidth q and upper bandwidth p if

$a_{ij} = 0$ if $j - i > p$ or $i - j > q$

• w=q+p+1 is called the bandwidth of the matrix

• If A is a band matrix, L and U are also band matrices

• L will have lower bandwidth q and U will have upper bandwidth p+q+1

(42)

The Least Squares Problem

Suppose we want to solve 𝐴𝑥 = 𝑏 where 𝐴 ∈ ℝ𝑚×𝑛, 𝑚 > 𝑛.

This system is called overdetermined: it has more equations than unknowns.

In general, no solution exists. An alternative is to find an x such that the residual, r = b - Ax, is as small as possible.

The solution will depend on how we define "small" and how we measure the size of the residual.

In the least squares method, we use the standard Euclidean distance. I.e., we want to find a vector 𝑥 ∈ ℝ𝑛 that solves the minimization problem

$$\min_x \|b - Ax\|_2.$$

As x occurs linearly in this minimization problem, this is often referred to as the linear least squares problem.

(43)

Solving LS Problems

We therefore find the solution by making r orthogonal to the columns of A:

$$r^T (a_{\cdot 1}\ a_{\cdot 2}\ \cdots\ a_{\cdot n}) = r^T A = 0 \quad (\text{equivalently } A^T r = 0).$$

Then

$$r = b - Ax, \qquad A^T r = A^T b - A^T A x = 0 \;\Rightarrow\; A^T A x = A^T b.$$

These are called the normal equations. We can now solve the linear system 𝐴𝑇𝐴𝑥 = 𝐴𝑇𝑏 to find the least squares solution x.

• Theorem: If the columns of A are linearly independent, 𝐴𝑇𝐴 is nonsingular

(44)

Drawbacks

• Solving the normal equations is only one possible way to solve the least squares problem

• It has the following drawbacks:

• Forming 𝐴𝑇𝐴 can lead to loss of information due to finite precision computation

• The condition number of $A^T A$ is the square of that of $A$

• Example:

$$A = \begin{pmatrix} 1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon \end{pmatrix}, \qquad A^T A = \begin{pmatrix} 1 + \epsilon^2 & 1 \\ 1 & 1 + \epsilon^2 \end{pmatrix}$$

If $\epsilon$ is so small that the floating point representation of $1 + \epsilon^2$ satisfies $fl[1 + \epsilon^2] = 1$, then in floating point arithmetic the normal equations become singular.
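The example above is easy to reproduce; with ε = 10^-8 we have ε² = 10^-16 below the unit roundoff, so the computed Gram matrix is exactly singular in double precision.

    ep = 1e-8;                          % ep^2 = 1e-16 is lost next to 1
    A  = [1 1; ep 0; 0 ep];

    G = A' * A;                         % computed as [1 1; 1 1] in floating point
    fprintf('rank of computed A''*A: %d\n', rank(G));
    fprintf('cond(A) = %.2e, cond(A)^2 = %.2e\n', cond(A), cond(A)^2);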

(45)

Sequences of LS Problems

Often in practice, one has to solve a sequence of independent LS problems with the same matrix A,

$$\min_{x_i} \|A x_i - b_i\|_2, \quad i = 1, 2, \ldots, p,$$

with solutions

$$x_i = (A^T A)^{-1} A^T b_i, \quad i = 1, 2, \ldots, p.$$

Defining $X = [x_1\ x_2\ \cdots\ x_p]$ and $B = [b_1\ b_2\ \cdots\ b_p]$, we can write the problem in matrix form:

$$\min_X \|AX - B\|_F,$$

with the solution $X = (A^T A)^{-1} A^T B$.

Why the Frobenius norm? Because of the identity

$$\|AX - B\|_F^2 = \sum_{i=1}^{p} \|A x_i - b_i\|_2^2.$$

(46)

Orthogonality

(47)

Orthogonal Vectors and Matrices

• Often advantageous to use orthogonal vectors as basis vectors in a vector space

• Two vectors x and y are orthogonal if $x^T y = 0$ (i.e., $\cos\theta(x, y) = 0$)

• Let $q_j$, $j = 1, 2, \ldots, n$ be orthogonal, i.e., $q_i^T q_j = 0$ if $i \ne j$. Then they are linearly independent.

• If a set of orthogonal vectors $q_j$, $j = 1, 2, \ldots, m$ in $\mathbb{R}^m$ are normalized, i.e., $\|q_j\|_2 = 1$, then they are called orthonormal, and they constitute an orthonormal basis for $\mathbb{R}^m$.

• A square matrix $Q = (q_1\ q_2\ \cdots\ q_m) \in \mathbb{R}^{m \times m}$ whose columns are orthonormal is called an orthogonal matrix.

• Orthogonal matrices have a number of important properties

(48)

Properties of Orthogonal Matrices

• An orthogonal matrix 𝑄 satisfies 𝑄𝑇𝑄 = 𝐼

• Implies that an orthogonal matrix has full rank and 𝑄−1 = 𝑄𝑇

• The rows of an orthogonal matrix are orthogonal: 𝑄𝑄𝑇 = 𝐼

• The product of two orthogonal matrices is also orthogonal

• Given a matrix $Q_1 \in \mathbb{R}^{m \times k}$ with orthonormal columns, there exists a matrix $Q_2 \in \mathbb{R}^{m \times (m-k)}$ such that $Q = [Q_1\ Q_2]$ is an orthogonal matrix

• The Euclidean length of a vector is invariant under an orthogonal transformation Q, i.e., $\|Qx\|_2 = \|x\|_2$

• The matrix 2-norm and Frobenius norm are invariant under orthogonal transformations; let $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ be orthogonal. Then for $A \in \mathbb{R}^{m \times n}$,

$$\|UAV\|_2 = \|A\|_2, \qquad \|UAV\|_F = \|A\|_F.$$

(49)

Plane Rotations

A 2x2 plane rotation matrix

$$G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \qquad c^2 + s^2 = 1,$$

is orthogonal.

Multiplication of a vector by G rotates the vector in a clockwise direction by an angle $\theta$, where $c = \cos\theta$.

Plane rotations are often called Givens rotations in the literature

(50)

Example

Given a vector $x \in \mathbb{R}^4$, we transform it to $k e_1$. First, by a rotation $G_3$ in the plane (3,4) we zero the last element:

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & c_1 & s_1 \\ 0 & 0 & -s_1 & c_1 \end{pmatrix} \begin{pmatrix} \times \\ \times \\ \times \\ \times \end{pmatrix} = \begin{pmatrix} \times \\ \times \\ \times \\ 0 \end{pmatrix}$$

Then, by a rotation $G_2$ in the plane (2,3) we zero the element in position 3:

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & c_2 & s_2 & 0 \\ 0 & -s_2 & c_2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \times \\ \times \\ \times \\ 0 \end{pmatrix} = \begin{pmatrix} \times \\ \times \\ 0 \\ 0 \end{pmatrix}$$

Finally, we zero out the element in position 2 by a rotation $G_1$:

$$\begin{pmatrix} c_3 & s_3 & 0 & 0 \\ -s_3 & c_3 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \times \\ \times \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} k \\ 0 \\ 0 \\ 0 \end{pmatrix}$$

Since Euclidean length is preserved, we know $k = \|x\|_2$.
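A sketch of the same sequence of rotations in MATLAB; the small helper giv builds a 2x2 rotation that maps (a, b) to (r, 0), and blkdiag embeds it into the plane being rotated.

    x = [1; 2; 3; 4];
    normx = norm(x);                          % the value |k| we expect at the end

    giv = @(a, b) [a b; -b a] / hypot(a, b);  % 2x2 rotation sending [a; b] to [r; 0]

    G3 = blkdiag(eye(2), giv(x(3), x(4)));    % rotation in the (3,4) plane
    x  = G3 * x;                              % zeros x(4)
    G2 = blkdiag(1, giv(x(2), x(3)), 1);      % rotation in the (2,3) plane
    x  = G2 * x;                              % zeros x(3)
    G1 = blkdiag(giv(x(1), x(2)), eye(2));    % rotation in the (1,2) plane
    x  = G1 * x;                              % zeros x(2)

    disp(x')                                  % [k 0 0 0]
    fprintf('k = %.4f, ||x||_2 = %.4f\n', x(1), normx);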

(51)

Householder Transformations

Let $v \ne 0$ be an arbitrary vector and put

$$P = I - \frac{2}{v^T v} v v^T.$$

P is symmetric and orthogonal; P is called a reflection matrix or a Householder transformation.

Let x and y be given vectors of the same length and ask: can we determine a Householder transformation P such that Px = y?

The equation Px = y can be written

$$x - \frac{2 v^T x}{v^T v} v = y,$$

which is of the form $\beta v = x - y$.

Since v enters P in such a way that the factor $\beta$ cancels, we can choose $\beta = 1$.

With $v = x - y$ we get

$$v^T v = x^T x + y^T y - 2 x^T y = 2(x^T x - x^T y), \quad \text{since } x^T x = y^T y.$$

Further,

$$v^T x = x^T x - y^T x = \tfrac{1}{2} v^T v.$$

(52)

Householder Transformations

Therefore we have

$$Px = x - \frac{2 v^T x}{v^T v} v = x - v = y,$$

as we wanted.

In matrix computations we often want to zero elements in a vector, and we now choose $y = k e_1$, where $k = \pm\|x\|_2$. The vector v should be taken equal to $v = x - k e_1$.

In order to avoid cancellation (subtraction of two close floating point numbers), we choose $\mathrm{sign}(k) = -\mathrm{sign}(x_1)$.

Now that we have computed v, we can simplify and write

$$P = I - \frac{2}{v^T v} v v^T = I - 2uu^T, \qquad u = \frac{1}{\|v\|_2} v.$$

Thus the Householder vector u has length 1.

(53)

Householder Transformations

One should avoid forming the Householder matrix P explicitly, since it can be represented more compactly by the vector u.

Multiplication by P should be done via $Px = x - 2(u^T x)u$, which requires about 4n flops (instead of $O(n^2)$ if P were formed explicitly).

The matrix multiplication PX is done by

$$PX = X - 2u(u^T X).$$
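A sketch of building the Householder vector for y = k e1 and applying P without ever forming it, following the formulas above.

    x = [3; 1; 4; 1; 5];

    k = -sign(x(1)) * norm(x);        % sign chosen to avoid cancellation
    v = x;
    v(1) = v(1) - k;                  % v = x - k*e1
    u = v / norm(v);                  % Householder vector of length 1

    Px = x - 2*(u'*x)*u;              % P*x = k*e1 without forming P
    disp(Px')

    X  = randn(5, 3);                 % applying P to a matrix costs about 4*m*n flops
    PX = X - 2*u*(u'*X);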

(54)

Flop Counts

• We will compare the number of flops to transform the first column of an mxn matrix A to a multiple of a unit vector 𝑘𝑒1 using plane rotations and Householder transformations.

• Plane rotations: application of rotation matrix to vector requires 4 multiplications and 2 additions = 6 flops.

• Application to mxn matrix requires 6n flops.

• Need to apply m-1 rotations to zero all entries of the column except the first

• Total: ≈ 6𝑚𝑛 flops

• Householder transformations: applying P to the whole m×n matrix via $PX = X - 2u(u^T X)$ zeros the targeted entries in one sweep and requires about 4mn flops

• Total: ≈ 4𝑚𝑛 flops

(55)

Orthogonal Transformations in Finite Precision

• Orthogonal transformations are very stable in floating point arithmetic

• Ex: a Householder transformation $\bar{P}$ computed in floating point arithmetic that approximates an exact P satisfies

$$\|\bar{P} - P\|_2 = O(u),$$

where u is the unit roundoff of the floating point system.

We also have the backward error result

$$fl[\bar{P}A] = P(A + E), \qquad \|E\|_2 = O(u)\,\|A\|_2.$$

Thus the floating point result is equal to the product of the exact orthogonal matrix and a data matrix that has been perturbed by a very small amount.

Analogous results hold for plane rotations.

(56)

QR Decomposition

(57)

QR Decomposition

Any matrix $A \in \mathbb{R}^{m \times n}$, $m \ge n$, can be transformed to upper triangular form by an orthogonal matrix. The transformation is equivalent to the decomposition

$$A = Q \begin{pmatrix} R \\ 0 \end{pmatrix},$$

where $R \in \mathbb{R}^{n \times n}$ is upper triangular and $Q \in \mathbb{R}^{m \times m}$ is orthogonal. If the columns of A are linearly independent, then R is nonsingular.

It is often convenient to write the "thin QR decomposition", where we keep only the columns of Q corresponding to an orthogonalization of the columns of A. Partition $Q = [Q_1\ Q_2]$, where $Q_1 \in \mathbb{R}^{m \times n}$, and we get

$$A = [Q_1\ Q_2] \begin{pmatrix} R \\ 0 \end{pmatrix} = Q_1 R.$$

$Q_1$ gives an orthogonal basis for the column space of A.

(58)

Solving LS Problems

The QR decomposition gives a way to solve the least squares problem

$$\min_x \|b - Ax\|_2,$$

where $A \in \mathbb{R}^{m \times n}$, $m \ge n$, without using the normal equations.

Theorem: Let the matrix $A \in \mathbb{R}^{m \times n}$ have full column rank and thin QR decomposition $A = Q_1 R$. Then the least squares problem $\min_x \|b - Ax\|_2$ has the unique solution

$$x = R^{-1} Q_1^T b.$$

We have

$$\min_x \|Ax - b\|_2 = \min_x \left\| \begin{pmatrix} Rx - Q_1^T b \\ Q_2^T b \end{pmatrix} \right\|_2 = \|Q_2^T b\|_2.$$
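A sketch of the QR-based least squares solution on random data, compared against MATLAB's backslash (which also avoids the normal equations for rectangular systems).

    rng(3);
    m = 100; n = 5;
    A = randn(m, n);
    b = randn(m, 1);

    [Q1, R] = qr(A, 0);               % thin ("economy size") QR: Q1 is m-by-n
    x = R \ (Q1' * b);

    fprintf('difference from backslash: %.2e\n', norm(x - A\b));
    fprintf('residual norm ||b - A*x||: %.4f\n', norm(b - A*x));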

(59)

Flop Counts

• Applying a Householder transformation to an m×n matrix to zero the elements of the first column below the diagonal requires approximately 4mn flops.

• In the following transformation, only rows 2 to m and columns 2 to n are changed, and the dimension of the submatrix that is affected by the transformation is reduced by one in each step.

• Therefore the number of flops for computing R is approximately

$$4 \sum_{k=0}^{n-1} (m - k)(n - k) \approx 2mn^2 - \frac{2n^3}{3}.$$

• Then Q is available in factored form as a product of Householder transformations.

• If we explicitly compute the full matrix Q, then in step k we need 4(m - k)m flops, which gives a total of

$$4 \sum_{k=0}^{n-1} (m - k)m \approx 4mn\left(m - \frac{n}{2}\right).$$

(60)

Error in the Solution of the LS Problem

Theorem: Assume that $A \in \mathbb{R}^{m \times n}$, $m \ge n$, has full column rank and that the least squares problem $\min_x \|Ax - b\|_2$ is solved using QR factorization via Householder transformations. Then the computed solution $\bar{x}$ is the exact least squares solution of the perturbed problem

$$\min_x \|(A + \Delta A)x - (b + \delta b)\|_2,$$

where

$$\|\Delta A\|_F \le c_1 mnu\,\|A\|_F + O(u^2), \qquad \|\delta b\|_2 \le c_2 mnu\,\|b\|_2 + O(u^2),$$

and $c_1$ and $c_2$ are small integers.

(61)

Error in the Solution of the LS Problem

• The theorem says that the method is backward stable.

• In contrast, the method of solving the LS problem via the normal equations is not backward stable, unless A is well conditioned.

• To see this, in Matlab try both methods of solving the LS problem with

$$A = \begin{pmatrix} 1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon \end{pmatrix}, \qquad b = \begin{pmatrix} 2 \\ \epsilon \\ \epsilon \end{pmatrix},$$

with $\epsilon = 10^{-7}$. Note that the true LS solution should be $x = (1, 1)^T$.
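The suggested experiment, written out as a sketch; the normal equations solution is accurate to only a couple of digits because cond(A^T A) = cond(A)^2 is about 10^14, while QR stays close to the exact solution (1, 1)^T.

    ep = 1e-7;
    A  = [1 1; ep 0; 0 ep];
    b  = [2; ep; ep];                 % exact LS solution is x = [1; 1]

    x_ne = (A'*A) \ (A'*b);           % normal equations
    [Q1, R] = qr(A, 0);
    x_qr = R \ (Q1'*b);               % Householder QR (backward stable)

    fprintf('normal equations: %.12f  %.12f\n', x_ne);
    fprintf('QR:               %.12f  %.12f\n', x_qr);
    fprintf('cond(A) = %.2e, cond(A''*A) = %.2e\n', cond(A), cond(A'*A));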

(62)

SVD

(63)

The SVD

Theorem: Any m×n matrix A, with $m \ge n$, can be factored

$$A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T,$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal and $\Sigma \in \mathbb{R}^{n \times n}$ is diagonal,

$$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n), \qquad \sigma_1 \ge \cdots \ge \sigma_n \ge 0.$$

The columns of U and V are called singular vectors and the $\sigma_i$'s are called singular values.

Other names: principal components (statistics and data analysis); Karhunen-Loève expansion (image processing).

(64)

The SVD

• "Full SVD" vs. "Thin SVD"

• Matrix equations:

$$AV = U_1 \Sigma, \qquad A^T U_1 = V \Sigma$$

• Outer product form:

$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T$$

• In MATLAB:

[U,S,V]=svd(A)

• Relation to norms: $\|A\|_2 = \sigma_1$, $\|A\|_F = \left(\sum_{i=1}^{n} \sigma_i^2\right)^{1/2}$

(65)

Fundamental Subspaces

• The SVD gives orthogonal bases of the four fundamental subspaces of a matrix: column-space (or range), null-space, row-space, left null-space

• Column space (range): $\mathcal{R}(A) = \{ y \mid y = Ax \text{ for arbitrary } x \}$

• Nullspace: $\mathcal{N}(A) = \{ x \mid Ax = 0 \}$

• Row space: the column space of $A^T$, $\mathcal{R}(A^T)$

• Left nullspace: the nullspace of $A^T$, $\mathcal{N}(A^T)$

(66)

Fundamental Subspaces

Theorem:

• The singular vectors $u_1, u_2, \ldots, u_r$ are an orthonormal basis for $\mathcal{R}(A)$, and $\mathrm{rank}(A) = \dim \mathcal{R}(A) = r$

• The singular vectors $v_{r+1}, v_{r+2}, \ldots, v_n$ are an orthonormal basis for $\mathcal{N}(A)$, and $\dim \mathcal{N}(A) = n - r$

• The singular vectors $v_1, v_2, \ldots, v_r$ are an orthonormal basis for $\mathcal{R}(A^T)$

• The singular vectors $u_{r+1}, u_{r+2}, \ldots, u_m$ are an orthonormal basis for $\mathcal{N}(A^T)$

(67)

Matrix Approximation

• Assume that A is a low-rank matrix plus noise:

𝐴 = 𝐴0 + 𝑁, where the noise N is small compared with 𝐴0.

• Typically the singular values of A look like the figure

• if the noise is sufficiently small in magnitude, the number of large singular values is often referred to as the numerical rank of the matrix

• then we can “remove the noise” by approximating A by a matrix of the correct rank

• If the numerical rank is k, we can approximate

$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T \approx \sum_{i=1}^{k} \sigma_i u_i v_i^T \equiv A_k.$$

This is called the truncated SVD.

(68)

Truncated SVD

• The truncated SVD is important for removing noise and for compressing data.

Theorem: Assume that the matrix $A \in \mathbb{R}^{m \times n}$ has rank $r > k$. The matrix approximation problem

$$\min_{\mathrm{rank}(Z) = k} \|A - Z\|_2$$

has the solution

$$Z = A_k \equiv U_k \Sigma_k V_k^T,$$

where $U_k = (u_1, \ldots, u_k)$, $V_k = (v_1, \ldots, v_k)$, and $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$.

The minimum is

$$\|A - A_k\|_2 = \sigma_{k+1}.$$

The result also holds for the Frobenius norm; in this case the minimum is

$$\|A - A_k\|_F = \left( \sum_{i=k+1}^{\min(m,n)} \sigma_i^2 \right)^{1/2}.$$
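A sketch verifying the theorem on synthetic "low rank plus noise" data: the 2-norm error of the rank-k truncation equals the (k+1)st singular value.

    rng(4);
    m = 40; n = 30; k = 3;
    A0 = randn(m, k) * randn(k, n);           % rank-k "signal"
    A  = A0 + 1e-3 * randn(m, n);             % plus small noise

    [U, S, V] = svd(A);
    Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % truncated SVD

    s = diag(S);
    fprintf('||A - Ak||_2 = %.3e, sigma_{k+1} = %.3e\n', norm(A - Ak), s(k+1));
    fprintf('||A - A0||_2 = %.3e (noise level)\n', norm(A - A0));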

(69)

Principal Component Analysis

• Assume that 𝑋 ∈ ℝ𝑚×𝑛 is a data matrix, where each column is an observation of a real-valued random vector with mean zero

• The matrix is assumed to be centered, i.e., the mean of each column is 0

• Let the SVD of X be 𝑋 = 𝑈Σ𝑉𝑇.

• The right singular vectors 𝑣𝑖 are called principal components directions of X.

• The vector $z_1 = Xv_1 = \sigma_1 u_1$ has the largest sample variance among all normalized linear combinations of the columns of X:

$$\mathrm{Var}(z_1) = \mathrm{Var}(Xv_1) = \frac{\sigma_1^2}{m}.$$

• Finding the vector of maximal variance is equivalent to maximizing the Rayleigh quotient:

$$\sigma_1^2 = \max_{v \ne 0} \frac{v^T X^T X v}{v^T v}, \qquad v_1 = \arg\max_{v \ne 0} \frac{v^T X^T X v}{v^T v}.$$

• The normalized vector $u_1 = \frac{1}{\sigma_1} X v_1$ is called the normalized first principal component of X.
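A sketch of the first principal component via the SVD of a centered matrix; in this synthetic example the rows of X play the role of the m samples and the columns the variables, so that z1 = Xv1 = σ1 u1 and v1 maximizes the Rayleigh quotient from the slide.

    rng(5);
    m = 200; n = 4;
    X = randn(m, n) * diag([5 2 1 0.5]);      % synthetic data
    X = X - mean(X);                          % center: each column has zero mean

    [U, S, V] = svd(X, 'econ');
    v1 = V(:, 1);                             % first principal component direction
    z1 = X * v1;                              % equals sigma_1 * u_1
    fprintf('||z1 - sigma1*u1|| = %.2e\n', norm(z1 - S(1,1)*U(:,1)));

    % v1 maximizes the Rayleigh quotient v'*(X'*X)*v / (v'*v), with maximum sigma_1^2:
    fprintf('sigma1^2 = %.4f, Rayleigh quotient at v1 = %.4f\n', ...
            S(1,1)^2, (v1'*(X'*X)*v1) / (v1'*v1));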

(70)

Principal Component Analysis

• Having determined the vector of largest sample variance, we usually want to go on and find the vector of second largest sample variance that is orthogonal to the first

• This is done by computing the vector of largest sample variance of the deflated data matrix 𝑋 − 𝜎1𝑢1𝑣1𝑇.

• Continuing this process we can determine all the principal components in order

• i.e., we compute the singular vectors.

• In the general step of the procedure, the subsequent principal component is defined as the vector of maximal variance subject to the constraint that it is orthogonal to the previous ones

(71)

Principal Component Analysis

(72)

Solving Least Squares Problems

Theorem: Let the matrix $A \in \mathbb{R}^{m \times n}$ have full column rank and thin SVD $A = U_1 \Sigma V^T$. Then the least squares problem $\min_x \|Ax - b\|_2$ has the unique solution

$$x = V \Sigma^{-1} U_1^T b = \sum_{i=1}^{n} \frac{u_i^T b}{\sigma_i} v_i.$$

(73)

Condition Number and Perturbation Theory

The condition number of a rectangular matrix is defined in terms of the SVD. Let A have rank r, i.e., its singular values satisfy

$$\sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_p = 0, \qquad p = \min(m, n).$$

Then the condition number is defined as

$$\kappa(A) = \frac{\sigma_1}{\sigma_r}.$$

(74)

Perturbation Theorem

Assume that the matrix $A \in \mathbb{R}^{m \times n}$, $m \ge n$, has full column rank, and let x be the solution of the least squares problem $\min_x \|Ax - b\|_2$. Let $\delta A$ and $\delta b$ be perturbations such that

$$\eta = \frac{\|\delta A\|_2}{\sigma_n} = \kappa\,\epsilon_A < 1, \qquad \epsilon_A = \frac{\|\delta A\|_2}{\|A\|_2}.$$

Then the perturbed matrix $A + \delta A$ has full rank, and the perturbation of the solution $\delta x$ satisfies

$$\|\delta x\|_2 \le \frac{\kappa}{1 - \eta}\left(\epsilon_A \|x\|_2 + \frac{\|\delta b\|_2}{\|A\|_2} + \epsilon_A\,\kappa\,\frac{\|r\|_2}{\|A\|_2}\right),$$

where r is the residual $r = b - Ax$.

(75)

Important Observations

• The number 𝜅 determines the condition of the least squares problem, and if 𝑚 = 𝑛, then the residual 𝑟 is equal to zero and the inequality becomes a perturbation result for linear systems.

• In the overdetermined case, the residual is usually not equal to zero. Then the conditioning depends on $\kappa^2$.

• This dependence may be significant if the norm of the residual is large.

(76)

Rank-Deficient Systems

Assume that A is rank-deficient, with SVD

$$A = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix},$$

where $U_1 \in \mathbb{R}^{m \times r}$, $\Sigma_1 \in \mathbb{R}^{r \times r}$, $V_1 \in \mathbb{R}^{n \times r}$, and the diagonal elements of $\Sigma_1$ are all nonzero.

Then the least squares problem $\min_x \|Ax - b\|_2$ does not have a unique solution.

However, the problem

$$\min_{x \in \mathcal{L}} \|x\|_2, \qquad \mathcal{L} = \{ x \mid \|Ax - b\|_2 = \min \}$$

has the unique solution

$$x = V \begin{pmatrix} \Sigma_1^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^T b = V_1 \Sigma_1^{-1} U_1^T b.$$

(77)

Underdetermined Systems

The SVD can also be used to solve underdetermined linear systems, i.e., systems with more unknowns than equations.

Let $A \in \mathbb{R}^{m \times n}$, $m < n$, have full row rank, with SVD

$$A = U\,(\Sigma\ \ 0) \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}, \qquad V_1 \in \mathbb{R}^{n \times m}.$$

Then the linear system $Ax = b$ always has a solution, which, however, is nonunique.

The problem

$$\min_{x \in \mathcal{K}} \|x\|_2, \qquad \mathcal{K} = \{ x \mid Ax = b \}$$

has the unique solution

$$x_1 = V_1 \Sigma^{-1} U^T b.$$

(78)

Computing the SVD

• The svd() function in Matlab is an implementation of algorithms from LAPACK

• The double precision function is implemented in DGESVD

• In this algorithm, the matrix is first reduced to bidiagonal form by a series of Householder transformations from the left and the right.

• Then the bidiagonal matrix is iteratively reduced to diagonal form using a variant of what is called the QR algorithm.

• Flops: for a dense matrix, the SVD can be computed in $O(mn^2)$ flops

• For sparse matrices, use the svds() function in Matlab. This function is based on Lanczos methods from ARPACK.

(79)

Complete Orthogonal Decomposition

• When the matrix is rank deficient, computing the SVD is the most reliable method for determining the rank

• However, this is relatively expensive to compute and expensive to update (when new rows/columns are added).

• This motivated methods that approximate the SVD, called complete orthogonal decompositions, which (in the noise-free case and in exact arithmetic) can be written

$$A = Q \begin{pmatrix} T & 0 \\ 0 & 0 \end{pmatrix} Z^T,$$

where Q and Z are orthogonal and $T \in \mathbb{R}^{r \times r}$ is triangular, where r is the rank of A.

• The SVD is a special case of a complete orthogonal decomposition.

• Can be computed by an algorithm based on QR with column pivoting
