
(1)

Lecture 1:

Introduction and NLA Review

NMNV468: Numerical Linear Algebra for data science and informatics

February 21, 2020

(2)

Data Science and Informatics

• Huge amounts of data collected in modern society

• Currently generating and storing more data than we know what to do with!

• Data mining and analysis: extracting meaningful information from large data sets

• Huge number of application areas: business, bioinformatics, medicine, information retrieval, search engines, scientific applications

• Truly interdisciplinary: computer science, statistics, linear algebra, optimization

• Numerical linear algebra is a basic ingredient in many techniques

• Course goal: give an introduction to the mathematical and numerical approaches behind data analysis and informatics applications

• Not primarily a numerical linear algebra course, but rather an application of numerical linear algebra to these areas

(3)

Numerical Linear Algebra vs. Data Analysis

Slide from Yousef Saad

(4)

Data Analytics in Scientific Discovery

Data analytics and machine learning increasingly important in scientific discovery

Event identification, correlation in high-energy physics

Climate simulation validation using sensor data

Determine patterns and trends from astronomical data

Genetic sequencing

(5)

Data Analytics used in Computational Science

(6)

Logistics of the Course

• Class format:

• mix of lectures, exercises/demonstrations

• no lectures/exercises after April 3.

• Class website: https://dl1.cuni.cz/enrol/instances.php?id=9187

• Change class time?

• Tuesday 9:00-10:30 and 10:40-12:10?

• Thursday 15:40-17:10 and 17:20-18:50?

• Monday 15:40-17:10 and 17:20-18:50?

• Topic outline:

• See syllabus and schedule

(7)

Today's Outline and Goals

• Goals: review what you should know from a basic numerical linear algebra course

Topics for Today*

• Vectors and Matrices

• Examples arising in DS&I applications

• Basic Linear Algebra Review

• Linear Systems and LU Decomposition

• Least Squares Problems

• Orthogonality

• QR Decomposition

• Singular Value Decomposition

*May go into next week depending on familiarity of students with the topics

(8)

Vectors and Matrices

Think of a matrix as a rectangular array of data, where elements are scalar real numbers

$$A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix} \in \mathbb{R}^{m \times n}$$

We can also think of a matrix as a set of n vectors in ℝ𝑚...

(9)

Example - Term-Document Matrix

Term-document matrices are used in information retrieval.

Consider the five documents below, with keywords (or terms) in bold:

Doc 1: The Google matrix P is a model of the internet

Doc 2: $P_{ij}$ is nonzero if there is a link from web page j to i

Doc 3: The Google matrix is used to rank all Web pages

Doc 4: The ranking is done by solving a matrix eigenvalue problem

Doc 5: England dropped out of the top 10 in the FIFA ranking

If we count the frequency of terms in each document, we get the term-document matrix shown on the next slide.

(10)

Example - Term-Document Matrix

Each document is represented by a vector (a point in ℝ10) and we have the term-document matrix

$$A = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 0 & 0
\end{pmatrix} \in \mathbb{R}^{10 \times 5}$$

(11)

Example - Term-Document Matrix

• Say we want to find all documents relevant to the query "ranking of web pages". This query has the query vector:

$$q = (0\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0\ \ 1\ \ 1\ \ 1)^T \in \mathbb{R}^{10}$$

• The task of information retrieval is the following mathematical problem:

Find the columns of A that are close to the vector q.

• Common that m is large ($\approx 10^6$); common that A is sparse

• Many linear algebra techniques used in solving this problem
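As a small illustration of this query-matching problem, here is a sketch in MATLAB/Octave that builds the 10x5 term-document matrix above (rows ordered alphabetically by term, matching the matrix on the previous slide) and ranks the documents by the cosine of the angle between each column and q. The cosine measure is one common choice in information retrieval, used here as an assumption.

    % Term-document matrix from the example (10 terms x 5 documents),
    % terms in alphabetical order: eigenvalue, England, FIFA, Google,
    % internet, link, matrix, page, rank, Web.
    A = [0 0 0 1 0;
         0 0 0 0 1;
         0 0 0 0 1;
         1 0 1 0 0;
         1 0 0 0 0;
         0 1 0 0 0;
         1 0 1 1 0;
         0 1 1 0 0;
         0 0 1 1 1;
         0 1 1 0 0];
    q = [0 0 0 0 0 0 0 1 1 1]';        % query "ranking of web pages"

    % Cosine of the angle between q and each document (column of A)
    colnorms = sqrt(sum(A.^2))';
    cosines  = (A'*q) ./ (colnorms * norm(q));
    [~, order] = sort(cosines, 'descend');   % most relevant documents first
    disp([order cosines(order)])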

(12)

Example - Pattern Recognition

Also often useful to consider a matrix as a linear operator. Denote the columns of A by

$$a_{\cdot j} = \begin{pmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{pmatrix}, \quad j = 1, 2, \ldots, n, \qquad \text{so that } A = (a_{\cdot 1} \ \cdots \ a_{\cdot n}).$$

Then the linear transformation is defined by

$$y = Ax = (a_{\cdot 1} \ \cdots \ a_{\cdot n}) \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \sum_{j=1}^{n} x_j a_{\cdot j}.$$

(13)

Example - Pattern Recognition

• Model problem in pattern recognition is classification of handwritten digits

• Image of one digit is a 16x16 matrix of numbers, representing grayscale

• Can also be represented as a vector in ℝ256

• A set of n digits can then be represented by a matrix 𝐴 ∈ ℝ256×𝑛 and the columns of 𝐴 span a subspace of ℝ256

• We can compute an approximate basis of this subspace using the SVD (singular value decomposition) of A, 𝐴 = 𝑈Σ𝑉𝑇

(14)

Example - Pattern Recognition

• Let's say we have a vector b representing some unknown digit. We now want to automatically classify the digit as 0-9.

• Given a set of approximate basis vectors representing, e.g., 3s, $u_1, \ldots, u_k$, we can determine whether b is a 3 by checking if there is a linear combination of the basis vectors such that

$$\left\| b - \sum_{j=1}^{k} x_j u_j \right\| \ \text{is small.}$$
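A minimal sketch of this classification test, assuming a hypothetical training matrix A3 whose columns are images of one digit class; random data stands in for real images here, and the threshold/decision rule is left out.

    % Residual of an unknown digit b against the SVD basis of one class (a sketch).
    rng(6);
    A3 = randn(256, 100);        % hypothetical training images of, e.g., the digit 3
    b  = randn(256, 1);          % hypothetical unknown digit to classify

    k  = 10;                     % number of basis vectors kept
    [U, ~, ~] = svd(A3, 'econ');
    Uk = U(:, 1:k);

    % Since Uk has orthonormal columns, the best coefficients are x = Uk'*b,
    % and the least squares residual is b - Uk*(Uk'*b).
    rel_residual = norm(b - Uk*(Uk'*b)) / norm(b);
    fprintf('relative residual = %.3f\n', rel_residual);
    % Classify b as the class whose basis gives the smallest relative residual.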

(15)

Example - PageRank

• Search engines extract information from web pages available on the internet

• The core of the Google search engine is a matrix computation

• The "Google matrix" P is of the order of billions (close to the total number of web pages on the internet)

• Matrix structure based on link structure of the web: 𝑃𝑖𝑗 is nonzero if there is a link from webpage j to i

$$P = \begin{pmatrix}
0   & 1/3 & 0 & 0 & 0   & 0   \\
1/3 & 0   & 0 & 0 & 0   & 0   \\
0   & 1/3 & 0 & 0 & 1/3 & 1/2 \\
1/3 & 0   & 0 & 0 & 1/3 & 0   \\
1/3 & 1/3 & 0 & 0 & 0   & 1/2 \\
0   & 0   & 1 & 0 & 1/3 & 0
\end{pmatrix}$$

• The link graph matrix constructed so that cols and rows represent webpages and nonzero elements in col j denote outlinks from webpage j.

• For search engine to be useful, it must measure quality of webpages by some ranking; this ranking is done by solving an eigenvalue problem

(Figure: the corresponding link graph on six web pages, nodes 1-6.)
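The sketch below feeds the link matrix P from this slide to a plain power iteration to approximate the dominant eigenvector, which gives the ranking. Note that column 4 is zero (page 4 has no outlinks); the actual Google matrix repairs such dangling pages and adds a damping term, both of which this sketch omits.

    % Link matrix from the slide (column j holds the outlinks of page j)
    P = [ 0    1/3  0  0  0    0;
          1/3  0    0  0  0    0;
          0    1/3  0  0  1/3  1/2;
          1/3  0    0  0  1/3  0;
          1/3  1/3  0  0  0    1/2;
          0    0    1  0  1/3  0 ];

    % Power iteration for the dominant eigenvector (the ranking vector)
    r = ones(6, 1) / 6;
    for iter = 1:200
        r = P * r;
        r = r / norm(r, 1);
    end
    [~, ranking] = sort(r, 'descend');
    disp([ranking r(ranking)])       % pages ordered from highest to lowest rank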

(16)

Floating Point Computations

• Modeling cost of a computation

• Often by counting flops (even just highest order terms)

• This can often be misleading - amount of data movement and data access patterns often have more influence than number of flops

• especially true in large-scale settings

(17)

The Cost of a Computation

• Algorithms have two costs: communication and computation

• Communication : moving data between levels of memory hierarchy (sequential), between processors (parallel)

• On today’s computers, communication is expensive, computation is cheap

– Flop time << 1/bandwidth << latency

– True at all scales – from smartphones to supercomputers

– Communication bottleneck a barrier to achieving time and energy scalability

(Figures: sequential and parallel communication.)

(18)

Floating Point Rounding Errors

• We will assume that computations are done under the IEEE floating point standard

• A real number x cannot in general be represented exactly in a floating point system. Let fl[x] be the floating point number representing x. Then

$$fl[x] = x(1 + \epsilon)$$

for some $\epsilon$ satisfying $|\epsilon| \le u$, where $u$ is the unit roundoff of the system.

• E.g., in IEEE double precision, $u \approx 10^{-16}$.

Thus the relative error in the floating point representation of any real number x satisfies

$$\frac{|fl[x] - x|}{|x|} \le u.$$

(19)

Floating Point Rounding Errors

• Let $fl[x \odot y]$ be the result of a floating point arithmetic operation, where $\odot$ is $+$, $-$, $\times$, or $/$.

• Then, provided that $x \odot y \ne 0$,

$$\frac{|x \odot y - fl[x \odot y]|}{|x \odot y|} \le u,$$

or equivalently,

$$fl[x \odot y] = (x \odot y)(1 + \epsilon) \quad \text{for some } \epsilon \text{ satisfying } |\epsilon| \le u.$$

• When we estimate the error in the result of a computation in floating point arithmetic, we can think of it as a forward error.

• Alternatively, we could rewrite the above as

$$fl[x \odot y] = (x + e) \odot (y + f) \quad \text{for some numbers } e \text{ and } f \text{ that satisfy } |e| \le u|x|, \ |f| \le u|y|.$$

• I.e., $fl[x \odot y]$ is the exact result of the operation on slightly perturbed data.

• This is called backward error analysis.
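A quick MATLAB check of these statements; in MATLAB, eps is the gap between 1 and the next larger double, so the unit roundoff is u = eps/2.

    u = eps / 2;                      % unit roundoff of IEEE double, about 1.1e-16

    disp(1 + u   == 1)                % 1 (true): 1 + u rounds back to 1
    disp(1 + 2*u == 1)                % 0 (false): 2u is visible next to 1

    % Each operation is correctly rounded, but representation and rounding
    % errors show up in decimal arithmetic:
    disp(0.1 + 0.2 == 0.3)            % 0 (false)
    fprintf('%.20f\n', 0.1 + 0.2)     % 0.30000000000000004441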

(20)

Forward Error vs. Backward Error

(21)

Linear Algebra Review

(22)

Basic Linear Algebra

Matrix-Vector multiplication

$$y = Ax, \qquad y_i = \sum_{j=1}^{n} a_{ij} x_j, \quad i = 1, \ldots, m$$

$$y = Ax = (a_{\cdot 1} \ a_{\cdot 2} \ \cdots \ a_{\cdot n}) \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \sum_{j=1}^{n} x_j a_{\cdot j}$$

(23)

Basic Linear Algebra

Matrix-Matrix multiplication: $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$,

$$C = AB, \qquad c_{ij} = \sum_{s=1}^{k} a_{is} b_{sj}, \quad i = 1, \ldots, m, \ j = 1, \ldots, n$$

(24)

Inner Products and Vector Norms

• Definition and properties of a vector norm

• 1-norm, 2-norm, infinity norm

• Generalization: p-norms

• With norms we can introduce concepts of continuity and error in approximations of vectors

• Absolute error, relative error

• Equivalence of norms

• Vector norm inequalities table

• Distance measures commonly-used in data science

(25)

Matrix Norms

• Vector norms and corresponding operator norms

• Matrix norm properties

• Fundamental inequalities

• Examples: 2-norm, 1-norm, infinity norm, Frobenius norm

• Matrix norm equivalence inequalities table

(26)

Linear Independence

Given a set of vectors $v_j$, $j = 1, \ldots, n$ in $\mathbb{R}^m$, $m \ge n$, consider the set of linear combinations

$$\mathrm{span}\{v_1, \ldots, v_n\} = \left\{ y \mid y = \sum_{j=1}^{n} \alpha_j v_j \right\}$$

The vectors $v_j$, $j = 1, \ldots, n$ are called linearly independent when

$$\sum_{j=1}^{n} \alpha_j v_j = 0 \quad \text{if and only if} \quad \alpha_j = 0 \text{ for } j = 1, \ldots, n.$$

A set of 𝑚 linearly independent vectors in ℝ𝑚 is called a basis. Any vector in ℝ𝑚 can be expressed as a linear combination of the basis vectors.

If the vectors 𝑣𝑗, 𝑗 = 1, … , 𝑛 are linearly dependent, then some 𝑣𝑘 can be written as a linear combination of the rest.

In practice, rarely have exactly linearly dependent vectors. Often have almost linearly dependent vectors.

(27)

Rank of a Matrix

• Rank = maximum number of linearly independent column vectors in 𝐴

• Also: rank = number of nonzero singular values

• Rank-1 matrix can be written as the outer product matrix 𝑥𝑦𝑇

• A square matrix 𝐴 ∈ ℝ𝑛×𝑛 with rank 𝑛 is called nonsingular and has an inverse 𝐴−1 satisfying

𝐴𝐴−1 = 𝐴−1𝐴 = 𝐼

• If we multiply linearly independent vectors by a nonsingular matrix, then the vectors remain linearly independent

(28)

Numerical Rank of a Matrix

Even though a matrix may be full rank, it may be so close to being rank deficient that it behaves as such when we are computing in finite precision.

Let $A \in \mathbb{R}^{m \times n}$ and let

$$\delta(A) = \epsilon \cdot \max(m, n) \cdot 2^{\lfloor \log_2(\sigma_1) \rfloor},$$

where $\sigma_1 = \|A\|_2$ is the largest singular value of A.

The numerical rank $r(A)$ is then defined as

$$r(A) = \#\{\, \sigma_i > \delta(A) : 1 \le i \le \min(m, n) \,\}$$

(i.e., the number of singular values of A that are greater than $\delta(A)$).

This is what is used in Matlab. We could also use slightly different definitions.
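A sketch of the numerical rank computation via the SVD, using the MATLAB-style tolerance max(m,n)*eps(norm(A)); the test matrix (exact rank 5 plus tiny noise) is of course just an illustration.

    rng(0);
    m = 50; n = 30; k = 5;
    A = randn(m, k) * randn(k, n);      % exact rank k
    A = A + 1e-15 * randn(m, n);        % tiny noise: generically full rank,
                                        % but numerically still rank k
    s   = svd(A);
    tol = max(m, n) * eps(norm(A));     % delta(A) as used by MATLAB's rank()
    num_rank = sum(s > tol);
    fprintf('numerical rank = %d, rank(A) = %d\n', num_rank, rank(A));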

(29)

Linear Systems and

Least Squares

(30)

Linear Systems

• Want to solve 𝐴𝑥 = 𝑏 where 𝐴 ∈ ℝ𝑛×𝑛 is square and nonsingular

• Existence and uniqueness: Assume that 𝐴 ∈ ℝ𝑛×𝑛 is nonsingular. Then for any right-hand-side b, the linear system Ax=b has a unique solution

• Commonly solved using Gaussian elimination with partial pivoting

• partial pivoting: In the ith step, reorder the rows of the matrix so that the element of largest magnitude in the ith column is moved to the (i,i) position.

• equivalent to multiplying on the left by a permutation matrix P

• In each step, elimination zeros the elements below the diagonal in the current column (the first column of the active submatrix)

• Ex: In the first step, we compute $A^{(1)} = L_1^{-1} P_1 A$, where $L_1$ is a Gauss transformation

$$L_1 = \begin{pmatrix} 1 & 0 \\ m_1 & I \end{pmatrix}, \qquad m_1 = \begin{pmatrix} m_{21} \\ \vdots \\ m_{n1} \end{pmatrix}$$

We continue until we have an upper triangular matrix, and then we can solve the resulting triangular system by back substitution.

(31)

LU Decomposition

• The process of Gaussian elimination produces the LU decomposition of a matrix:

$$L_1 A^{(1)} = P_1 A, \quad \ldots, \quad LU = PA$$

• Theorem: Any nonsingular square matrix 𝐴 can be decomposed into 𝑃𝐴 = 𝐿𝑈

where 𝑃 is a permutation matrix, 𝐿 is lower triangular with ones on the diagonal, and 𝑈 is an upper triangular matrix.

• Flops: computing the LU decomposition takes approximately 2𝑛3/3 flops

• In the kth step of Gaussian elimination, one operates on an $(n-k+1) \times (n-k+1)$ submatrix, and for each element in that submatrix we do one multiplication and one addition. Thus the total number of flops is approximately

$$2 \sum_{k=1}^{n-1} (n - k + 1)^2 \approx \frac{2n^3}{3}.$$
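A small sketch of the factorization and the two triangular solves, using MATLAB's built-in lu (which performs GE with partial pivoting) on a random matrix.

    rng(1);
    n = 5;
    A = randn(n);
    b = randn(n, 1);

    [L, U, P] = lu(A);                  % P*A = L*U, L unit lower triangular
    fprintf('||P*A - L*U|| = %.2e\n', norm(P*A - L*U));

    y = L \ (P*b);                      % forward substitution
    x = U \ y;                          % backward substitution
    fprintf('residual ||b - A*x|| = %.2e\n', norm(b - A*x));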

(32)

The Symmetric Positive Definite Case

• In the case the matrix is symmetric (𝐴 = 𝐴𝑇) and positive definite (all eigenvalues are positive), then:

• The LU decomposition can be computed without pivoting

• The decomposition becomes symmetric, and so can be done with half the work

Theorem: Any symmetric positive definite matrix A has a decomposition 𝐴 = 𝐿𝐷𝐿𝑇

where L is lower triangular with ones on the diagonal and D is a diagonal matrix with positive diagonal elements.

(33)

Cholesky Decomposition

Since D has positive diagonal elements, we can write

$$D^{1/2} = \mathrm{diag}(\sqrt{d_1}, \ldots, \sqrt{d_n}),$$

and then we get

$$A = LDL^T = (LD^{1/2})(D^{1/2}L^T) = U^T U,$$

where U is an upper triangular matrix.

This variant is called the Cholesky decomposition.

Storage requirements: we only need to store one triangular matrix: $n(n+1)/2$ elements.

Flops: half the work of the LU decomposition, approximately $n^3/3$ flops.
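A sketch of the Cholesky decomposition A = U^T U and its use for solving a linear system by two triangular solves; the SPD test matrix is constructed artificially.

    rng(2);
    n = 6;
    B = randn(n);
    A = B'*B + n*eye(n);                % symmetric positive definite by construction
    b = randn(n, 1);

    U = chol(A);                        % upper triangular, A = U'*U
    fprintf('||A - U''*U|| = %.2e\n', norm(A - U'*U));

    x = U \ (U' \ b);                   % two triangular solves
    fprintf('residual ||b - A*x|| = %.2e\n', norm(b - A*x));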

(34)

Perturbation Theory and Condition Number

The condition number of a nonsingular matrix A is defined as

$$\kappa(A) = \|A\|\,\|A^{-1}\|,$$

where $\|\cdot\|$ is any operator norm.

We can also specify a norm, e.g.,

$$\kappa_2(A) = \|A\|_2\,\|A^{-1}\|_2.$$

The condition number is important. It tells us how much the solution 𝒙 to a linear system can change when the matrix 𝑨 and/or right-hand-side 𝒃 are perturbed.

(35)

Perturbed Linear Systems

Let 𝛿𝐴 be some perturbation to the matrix A and let 𝛿𝑏 be a perturbation to the right-hand-side b.

Assume that A is nonsingular and that $\|\delta A\|\,\|A^{-1}\| = r < 1$.

Then the matrix $A + \delta A$ is nonsingular and

$$\|(A + \delta A)^{-1}\| \le \frac{\|A^{-1}\|}{1 - r}.$$

The solution y of the perturbed system

$$(A + \delta A)y = b + \delta b$$

satisfies

$$\frac{\|y - x\|}{\|x\|} \le \frac{\kappa(A)}{1 - r}\left(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|}\right).$$

A matrix for which 𝜅 𝐴 is large is called ill-conditioned.

A linear system with an ill-conditioned matrix is sensitive to perturbations in the data (i.e., the matrix and the right-hand-side).

(36)

Rounding Errors in GE

Recall:

$$fl[x] = x(1 + \epsilon), \quad |\epsilon| \le u$$

When representing the elements of a matrix A and a vector b in a floating point system, errors also arise, e.g.:

$$fl[a_{ij}] = a_{ij}(1 + \epsilon_{ij}), \quad |\epsilon_{ij}| \le u$$

Therefore we can write

$$fl[A] = A + \delta A, \qquad fl[b] = b + \delta b, \qquad \text{where} \qquad \|\delta A\| \le u\|A\|, \quad \|\delta b\| \le u\|b\|.$$

If no other floating point errors arise during solving Ax = b, we have that the computed solution $\bar{x}$ satisfies

$$(A + \delta A)\bar{x} = b + \delta b.$$

(37)

Rounding Errors in GE

Using perturbation theory, we can estimate the error in $\bar{x}$:

$$\frac{\|\bar{x} - x\|}{\|x\|} \le \frac{\kappa(A)}{1 - r}\left(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|}\right) = \frac{\kappa(A)}{1 - r}\,(2u),$$

provided that $r = u\,\kappa(A) < 1$.

Note that this is just error due to storing A and b in finite precision.

We of course also make errors in computing GE in finite precision.

(38)

Rounding Errors in GE

Theorem: Assume we use a floating point system with unit roundoff u. Let $\bar{L}$ and $\bar{R}$ be the triangular factors obtained from GE with partial pivoting applied to A. Further, assume that $\bar{x}$ is computed using the forward and backward substitutions

$$\bar{L}\bar{y} = Pb, \qquad \bar{R}\bar{x} = \bar{y}.$$

Then $\bar{x}$ is the exact solution of a system

$$(A + \delta A)\bar{x} = b, \qquad \|\delta A\| \le k(n)\,g_n u\,\|A\|, \qquad g_n = \frac{\max_{i,j,k}|a_{ij}^{(k)}|}{\max_{i,j}|a_{ij}|},$$

where $k(n)$ is a degree-3 polynomial and the $a_{ij}^{(k)}$ are the elements computed in step $k-1$ of the elimination.

$g_n$ is called the pivot growth factor.

(39)

Rounding Error in GE

𝑔𝑛 depends on the growth of the matrix elements during GE and not explicitly on the magnitude of the multipliers.

𝑔𝑛 can be computed a posteriori (we can not compute it a priori; we must do the elimination to find it).

We can bound it a priori, however. In general, we can only say that $g_n \le 2^{n-1}$.

This is attainable in specifically constructed examples, but it is almost never this high in practice.

• In the average case, it is closer to $n^{2/3}$ when partial pivoting is used [Trefethen and Schreiber, 1990]

Bound is lower for specific cases, e.g., for SPD matrices, 𝑔𝑛 = 1.

(40)

What is the point?

• The bound is often very pessimistic in practice, so what is the point?

• In general, these types of a priori rounding error analyses are NOT meant to give tight bounds on the error, but rather to expose potential instabilities of the algorithms and to allow for the comparison of different algorithms.

(41)

Banded Matrices

• Band matrix: the nonzeros are concentrated around the main diagonal, and elements far from the diagonal are 0

• More precisely, A is banded with lower bandwidth q and upper bandwidth p if

$a_{ij} = 0$ if $j - i > p$ or $i - j > q$

• w=q+p+1 is called the bandwidth of the matrix

• If A is a band matrix, L and U are also band matrices

• L will have lower bandwidth q and U will have upper bandwidth p+q+1

(42)

The Least Squares Problem

Suppose we want to solve 𝐴𝑥 = 𝑏 where 𝐴 ∈ ℝ𝑚×𝑛, 𝑚 > 𝑛.

This system is called overdetermined: it has more equations than unknowns.

In general, no solution exists. An alternative is to find an x such that the residual, r = b - Ax, is as small as possible.

The solution will depend on how we define "small" and how we measure the size of the residual.

In the least squares method, we use the standard Euclidean distance. I.e., we want to find a vector 𝑥 ∈ ℝ𝑛 that solves the minimization problem

$$\min_x \|b - Ax\|_2.$$

As x occurs linearly in this minimization problem, this is often referred to as the linear least squares problem.

(43)

Solving LS Problems

We therefore find the solution by making r orthogonal to the columns of A:

$$r^T (a_{\cdot 1}\ a_{\cdot 2}\ \cdots\ a_{\cdot n}) = r^T A = 0 \quad (\text{equivalently } A^T r = 0).$$

Then

$$r = b - Ax, \qquad A^T r = A^T b - A^T A x = 0 \;\Rightarrow\; A^T A x = A^T b.$$

These are called the normal equations. We can now solve the linear system 𝐴𝑇𝐴𝑥 = 𝐴𝑇𝑏 to find the least squares solution x.

• Theorem: If the columns of A are linearly independent, 𝐴𝑇𝐴 is nonsingular

(44)

Drawbacks

• Solving the normal equations is only one possible way to solve the least squares problem

• It has the following drawbacks:

• Forming 𝐴𝑇𝐴 can lead to loss of information due to finite precision computation

• The condition number of $A^T A$ is the square of that of $A$

• Example:

$$A = \begin{pmatrix} 1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon \end{pmatrix}, \qquad A^T A = \begin{pmatrix} 1 + \epsilon^2 & 1 \\ 1 & 1 + \epsilon^2 \end{pmatrix}$$

If $\epsilon$ is so small that the floating point representation of $1 + \epsilon^2$ satisfies $fl[1 + \epsilon^2] = 1$, then in floating point arithmetic the normal equations become singular.
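The example above is easy to reproduce; with ε = 10^-8 we have ε² = 10^-16 below the unit roundoff, so the computed Gram matrix is exactly singular in double precision.

    ep = 1e-8;                          % ep^2 = 1e-16 is lost next to 1
    A  = [1 1; ep 0; 0 ep];

    G = A' * A;                         % computed as [1 1; 1 1] in floating point
    fprintf('rank of computed A''*A: %d\n', rank(G));
    fprintf('cond(A) = %.2e, cond(A)^2 = %.2e\n', cond(A), cond(A)^2);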

(45)

Sequences of LS Problems

Often in practice, one has to solve a sequence of independent LS problems with the same matrix A,

$$\min_{x_i} \|A x_i - b_i\|_2, \quad i = 1, 2, \ldots, p,$$

with solutions

$$x_i = (A^T A)^{-1} A^T b_i, \quad i = 1, 2, \ldots, p.$$

Defining $X = [x_1\ x_2\ \cdots\ x_p]$ and $B = [b_1\ b_2\ \cdots\ b_p]$, we can write the problem in matrix form:

$$\min_X \|AX - B\|_F,$$

with the solution $X = (A^T A)^{-1} A^T B$.

Why the Frobenius norm? Because of the identity

$$\|AX - B\|_F^2 = \sum_{i=1}^{p} \|A x_i - b_i\|_2^2.$$

(46)

Orthogonality

(47)

Orthogonal Vectors and Matrices

• Often advantageous to use orthogonal vectors as basis vectors in a vector space

• Two vectors x and y are orthogonal if $x^T y = 0$ (i.e., $\cos\theta(x, y) = 0$)

• Let $q_j$, $j = 1, 2, \ldots, n$ be orthogonal, i.e., $q_i^T q_j = 0$ if $i \ne j$. Then they are linearly independent.

• If a set of orthogonal vectors $q_j$, $j = 1, 2, \ldots, m$ in $\mathbb{R}^m$ are normalized, i.e., $\|q_j\|_2 = 1$, then they are called orthonormal, and they constitute an orthonormal basis for $\mathbb{R}^m$.

• A square matrix $Q = (q_1\ q_2\ \cdots\ q_m) \in \mathbb{R}^{m \times m}$ whose columns are orthonormal is called an orthogonal matrix.

• Orthogonal matrices have a number of important properties

(48)

Properties of Orthogonal Matrices

• An orthogonal matrix 𝑄 satisfies 𝑄𝑇𝑄 = 𝐼

• Implies that an orthogonal matrix has full rank and 𝑄−1 = 𝑄𝑇

• The rows of an orthogonal matrix are orthogonal: 𝑄𝑄𝑇 = 𝐼

• The product of two orthogonal matrices is also orthogonal

• Given a matrix $Q_1 \in \mathbb{R}^{m \times k}$ with orthonormal columns, there exists a matrix $Q_2 \in \mathbb{R}^{m \times (m-k)}$ such that $Q = [Q_1\ Q_2]$ is an orthogonal matrix

• The Euclidean length of a vector is invariant under an orthogonal transformation Q, i.e., $\|Qx\|_2 = \|x\|_2$

• The matrix 2-norm and Frobenius norm are invariant under orthogonal transformations; let $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ be orthogonal. Then for $A \in \mathbb{R}^{m \times n}$,

$$\|UAV\|_2 = \|A\|_2, \qquad \|UAV\|_F = \|A\|_F.$$

(49)

Plane Rotations

A 2x2 plane rotation matrix

$$G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \qquad c^2 + s^2 = 1,$$

is orthogonal.

Multiplication of a vector by G rotates the vector in a clockwise direction by an angle $\theta$, where $c = \cos\theta$.

Plane rotations are often called Givens rotations in the literature

(50)

Example

Given a vector $x \in \mathbb{R}^4$, we transform it to $k e_1$. First, by a rotation $G_3$ in the plane (3,4) we zero the last element:

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & c_1 & s_1 \\ 0 & 0 & -s_1 & c_1 \end{pmatrix} \begin{pmatrix} \times \\ \times \\ \times \\ \times \end{pmatrix} = \begin{pmatrix} \times \\ \times \\ \times \\ 0 \end{pmatrix}$$

Then, by a rotation $G_2$ in the plane (2,3) we zero the element in position 3:

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & c_2 & s_2 & 0 \\ 0 & -s_2 & c_2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \times \\ \times \\ \times \\ 0 \end{pmatrix} = \begin{pmatrix} \times \\ \times \\ 0 \\ 0 \end{pmatrix}$$

Finally, we zero out the element in position 2 by a rotation $G_1$:

$$\begin{pmatrix} c_3 & s_3 & 0 & 0 \\ -s_3 & c_3 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \times \\ \times \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} k \\ 0 \\ 0 \\ 0 \end{pmatrix}$$

Since Euclidean length is preserved, we know $k = \|x\|_2$.
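A sketch of the same sequence of rotations in MATLAB; the small helper giv builds a 2x2 rotation that maps (a, b) to (r, 0), and blkdiag embeds it into the plane being rotated.

    x = [1; 2; 3; 4];
    normx = norm(x);                          % the value |k| we expect at the end

    giv = @(a, b) [a b; -b a] / hypot(a, b);  % 2x2 rotation sending [a; b] to [r; 0]

    G3 = blkdiag(eye(2), giv(x(3), x(4)));    % rotation in the (3,4) plane
    x  = G3 * x;                              % zeros x(4)
    G2 = blkdiag(1, giv(x(2), x(3)), 1);      % rotation in the (2,3) plane
    x  = G2 * x;                              % zeros x(3)
    G1 = blkdiag(giv(x(1), x(2)), eye(2));    % rotation in the (1,2) plane
    x  = G1 * x;                              % zeros x(2)

    disp(x')                                  % [k 0 0 0]
    fprintf('k = %.4f, ||x||_2 = %.4f\n', x(1), normx);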

(51)

Householder Transformations

Let $v \ne 0$ be an arbitrary vector and put

$$P = I - \frac{2}{v^T v} v v^T.$$

P is symmetric and orthogonal; P is called a reflection matrix or a Householder transformation.

Let x and y be given vectors of the same length and ask: can we determine a Householder transformation P such that Px = y?

The equation Px = y can be written

$$x - \frac{2 v^T x}{v^T v} v = y,$$

which is of the form $\beta v = x - y$.

Since v enters P in such a way that the factor $\beta$ cancels, we can choose $\beta = 1$.

With $v = x - y$ we get

$$v^T v = x^T x + y^T y - 2 x^T y = 2(x^T x - x^T y), \quad \text{since } x^T x = y^T y.$$

Further,

$$v^T x = x^T x - y^T x = \tfrac{1}{2} v^T v.$$

(52)

Householder Transformations

Therefore we have

$$Px = x - \frac{2 v^T x}{v^T v} v = x - v = y,$$

as we wanted.

In matrix computations we often want to zero elements in a vector, and we now choose $y = k e_1$, where $k = \pm\|x\|_2$. The vector v should be taken equal to $v = x - k e_1$.

In order to avoid cancellation (subtraction of two close floating point numbers), we choose $\mathrm{sign}(k) = -\mathrm{sign}(x_1)$.

Now that we have computed v, we can simplify and write

$$P = I - \frac{2}{v^T v} v v^T = I - 2uu^T, \qquad u = \frac{1}{\|v\|_2} v.$$

Thus the Householder vector u has length 1.

(53)

Householder Transformations

One should avoid forming the Householder matrix P explicitly, since it can be represented more compactly by the vector u.

Multiplication by P should be done via $Px = x - 2(u^T x)u$, which requires about 4n flops (instead of $O(n^2)$ if P were formed explicitly).

The matrix multiplication PX is done by

$$PX = X - 2u(u^T X).$$
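A sketch of building the Householder vector for y = k e1 and applying P without ever forming it, following the formulas above.

    x = [3; 1; 4; 1; 5];

    k = -sign(x(1)) * norm(x);        % sign chosen to avoid cancellation
    v = x;
    v(1) = v(1) - k;                  % v = x - k*e1
    u = v / norm(v);                  % Householder vector of length 1

    Px = x - 2*(u'*x)*u;              % P*x = k*e1 without forming P
    disp(Px')

    X  = randn(5, 3);                 % applying P to a matrix costs about 4*m*n flops
    PX = X - 2*u*(u'*X);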

(54)

Flop Counts

• We will compare the number of flops to transform the first column of an mxn matrix A to a multiple of a unit vector 𝑘𝑒1 using plane rotations and Householder transformations.

• Plane rotations: application of rotation matrix to vector requires 4 multiplications and 2 additions = 6 flops.

• Application to mxn matrix requires 6n flops.

• Need to apply m-1 rotations to zero all entries of the column except the first

• Total: ≈ 6𝑚𝑛 flops

• Householder transformations: applying P to the whole m×n matrix via $PX = X - 2u(u^T X)$ zeros the targeted entries in one sweep and requires about 4mn flops

• Total: ≈ 4𝑚𝑛 flops

(55)

Orthogonal Transformations in Finite Precision

• Orthogonal transformations are very stable in floating point arithmetic

• Ex: a Householder transformation $\bar{P}$ computed in floating point arithmetic that approximates an exact P satisfies

$$\|\bar{P} - P\|_2 = O(u),$$

where u is the unit roundoff of the floating point system.

We also have the backward error result

$$fl[\bar{P}A] = P(A + E), \qquad \|E\|_2 = O(u)\,\|A\|_2.$$

Thus the floating point result is equal to the product of the exact orthogonal matrix and a data matrix that has been perturbed by a very small amount.

Analogous results hold for plane rotations.

(56)

QR Decomposition

(57)

QR Decomposition

Any matrix $A \in \mathbb{R}^{m \times n}$, $m \ge n$, can be transformed to upper triangular form by an orthogonal matrix. The transformation is equivalent to the decomposition

$$A = Q \begin{pmatrix} R \\ 0 \end{pmatrix},$$

where $R \in \mathbb{R}^{n \times n}$ is upper triangular and $Q \in \mathbb{R}^{m \times m}$ is orthogonal. If the columns of A are linearly independent, then R is nonsingular.

It is often convenient to write the "thin QR decomposition", where we keep only the columns of Q corresponding to an orthogonalization of the columns of A. Partition $Q = [Q_1\ Q_2]$, where $Q_1 \in \mathbb{R}^{m \times n}$, and we get

$$A = [Q_1\ Q_2] \begin{pmatrix} R \\ 0 \end{pmatrix} = Q_1 R.$$

$Q_1$ gives an orthogonal basis for the column space of A.

(58)

Solving LS Problems

The QR decomposition gives a way to solve the least squares problem

$$\min_x \|b - Ax\|_2,$$

where $A \in \mathbb{R}^{m \times n}$, $m \ge n$, without using the normal equations.

Theorem: Let the matrix $A \in \mathbb{R}^{m \times n}$ have full column rank and thin QR decomposition $A = Q_1 R$. Then the least squares problem $\min_x \|b - Ax\|_2$ has the unique solution

$$x = R^{-1} Q_1^T b.$$

We have

$$\min_x \|Ax - b\|_2 = \min_x \left\| \begin{pmatrix} Rx - Q_1^T b \\ Q_2^T b \end{pmatrix} \right\|_2 = \|Q_2^T b\|_2.$$
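A sketch of the QR-based least squares solution on random data, compared against MATLAB's backslash (which also avoids the normal equations for rectangular systems).

    rng(3);
    m = 100; n = 5;
    A = randn(m, n);
    b = randn(m, 1);

    [Q1, R] = qr(A, 0);               % thin ("economy size") QR: Q1 is m-by-n
    x = R \ (Q1' * b);

    fprintf('difference from backslash: %.2e\n', norm(x - A\b));
    fprintf('residual norm ||b - A*x||: %.4f\n', norm(b - A*x));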

(59)

Flop Counts

• Applying a Householder transformation to an m×n matrix to zero the elements of the first column below the diagonal requires approximately 4mn flops.

• In the following transformation, only rows 2 to m and columns 2 to n are changed, and the dimension of the submatrix that is affected by the transformation is reduced by one in each step.

• Therefore the number of flops for computing R is approximately

$$4 \sum_{k=0}^{n-1} (m - k)(n - k) \approx 2mn^2 - \frac{2n^3}{3}.$$

• Then Q is available in factored form as a product of Householder transformations.

• If we explicitly compute the full matrix Q, then in step k we need 4(m - k)m flops, which gives a total of

$$4 \sum_{k=0}^{n-1} (m - k)m \approx 4mn\left(m - \frac{n}{2}\right).$$

(60)

Error in the Solution of the LS Problem

Theorem: Assume that $A \in \mathbb{R}^{m \times n}$, $m \ge n$, has full column rank and that the least squares problem $\min_x \|Ax - b\|_2$ is solved using QR factorization via Householder transformations. Then the computed solution $\bar{x}$ is the exact least squares solution of the perturbed problem

$$\min_x \|(A + \Delta A)x - (b + \delta b)\|_2,$$

where

$$\|\Delta A\|_F \le c_1 mnu\,\|A\|_F + O(u^2), \qquad \|\delta b\|_2 \le c_2 mnu\,\|b\|_2 + O(u^2),$$

and $c_1$ and $c_2$ are small integers.

(61)

Error in the Solution of the LS Problem

• The theorem says that the method is backward stable.

• In contrast, the method of solving the LS problem via the normal equations is not backward stable, unless A is well conditioned.

• To see this, in Matlab try both methods of solving the LS problem with

$$A = \begin{pmatrix} 1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon \end{pmatrix}, \qquad b = \begin{pmatrix} 2 \\ \epsilon \\ \epsilon \end{pmatrix},$$

with $\epsilon = 10^{-7}$. Note that the true LS solution should be $x = (1, 1)^T$.
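The suggested experiment, written out as a sketch; the normal equations solution is accurate to only a couple of digits because cond(A^T A) = cond(A)^2 is about 10^14, while QR stays close to the exact solution (1, 1)^T.

    ep = 1e-7;
    A  = [1 1; ep 0; 0 ep];
    b  = [2; ep; ep];                 % exact LS solution is x = [1; 1]

    x_ne = (A'*A) \ (A'*b);           % normal equations
    [Q1, R] = qr(A, 0);
    x_qr = R \ (Q1'*b);               % Householder QR (backward stable)

    fprintf('normal equations: %.12f  %.12f\n', x_ne);
    fprintf('QR:               %.12f  %.12f\n', x_qr);
    fprintf('cond(A) = %.2e, cond(A''*A) = %.2e\n', cond(A), cond(A'*A));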

(62)

SVD

(63)

The SVD

Theorem: Any m×n matrix A, with $m \ge n$, can be factored

$$A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T,$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal and $\Sigma \in \mathbb{R}^{n \times n}$ is diagonal,

$$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n), \qquad \sigma_1 \ge \cdots \ge \sigma_n \ge 0.$$

The columns of U and V are called singular vectors and the $\sigma_i$'s are called singular values.

Other names: principal components (statistics and data analysis); Karhunen-Loève expansion (image processing).

(64)

The SVD

• "Full SVD" vs. "Thin SVD"

• Matrix equations:

$$AV = U_1 \Sigma, \qquad A^T U_1 = V \Sigma$$

• Outer product form:

$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T$$

• In MATLAB:

[U,S,V]=svd(A)

• Relation to norms: $\|A\|_2 = \sigma_1$, $\|A\|_F = \left(\sum_{i=1}^{n} \sigma_i^2\right)^{1/2}$

(65)

Fundamental Subspaces

• The SVD gives orthogonal bases of the four fundamental subspaces of a matrix: column-space (or range), null-space, row-space, left null-space

• Column space (range): $\mathcal{R}(A) = \{ y \mid y = Ax \text{ for arbitrary } x \}$

• Nullspace: $\mathcal{N}(A) = \{ x \mid Ax = 0 \}$

• Row space: the column space of $A^T$, $\mathcal{R}(A^T)$

• Left nullspace: the nullspace of $A^T$, $\mathcal{N}(A^T)$

(66)

Fundamental Subspaces

Theorem:

• The singular vectors $u_1, u_2, \ldots, u_r$ are an orthonormal basis for $\mathcal{R}(A)$, and $\mathrm{rank}(A) = \dim \mathcal{R}(A) = r$

• The singular vectors $v_{r+1}, v_{r+2}, \ldots, v_n$ are an orthonormal basis for $\mathcal{N}(A)$, and $\dim \mathcal{N}(A) = n - r$

• The singular vectors $v_1, v_2, \ldots, v_r$ are an orthonormal basis for $\mathcal{R}(A^T)$

• The singular vectors $u_{r+1}, u_{r+2}, \ldots, u_m$ are an orthonormal basis for $\mathcal{N}(A^T)$

(67)

Matrix Approximation

• Assume that A is a low-rank matrix plus noise:

𝐴 = 𝐴0 + 𝑁, where the noise N is small compared with 𝐴0.

• Typically the singular values of A look like the figure

• if the noise is sufficiently small in magnitude, the number of large singular values is often referred to as the numerical rank of the matrix

• then we can “remove the noise” by approximating A by a matrix of the correct rank

• If the numerical rank is k, we can approximate

$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T \approx \sum_{i=1}^{k} \sigma_i u_i v_i^T \equiv A_k.$$

This is called the truncated SVD.

(68)

Truncated SVD

• The truncated SVD is important for removing noise and for compressing data.

Theorem: Assume that the matrix $A \in \mathbb{R}^{m \times n}$ has rank $r > k$. The matrix approximation problem

$$\min_{\mathrm{rank}(Z) = k} \|A - Z\|_2$$

has the solution

$$Z = A_k \equiv U_k \Sigma_k V_k^T,$$

where $U_k = (u_1, \ldots, u_k)$, $V_k = (v_1, \ldots, v_k)$, and $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$.

The minimum is

$$\|A - A_k\|_2 = \sigma_{k+1}.$$

The result also holds for the Frobenius norm; in this case the minimum is

$$\|A - A_k\|_F = \left( \sum_{i=k+1}^{\min(m,n)} \sigma_i^2 \right)^{1/2}.$$
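A sketch verifying the theorem on synthetic "low rank plus noise" data: the 2-norm error of the rank-k truncation equals the (k+1)st singular value.

    rng(4);
    m = 40; n = 30; k = 3;
    A0 = randn(m, k) * randn(k, n);           % rank-k "signal"
    A  = A0 + 1e-3 * randn(m, n);             % plus small noise

    [U, S, V] = svd(A);
    Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % truncated SVD

    s = diag(S);
    fprintf('||A - Ak||_2 = %.3e, sigma_{k+1} = %.3e\n', norm(A - Ak), s(k+1));
    fprintf('||A - A0||_2 = %.3e (noise level)\n', norm(A - A0));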

(69)

Principal Component Analysis

• Assume that 𝑋 ∈ ℝ𝑚×𝑛 is a data matrix, where each column is an observation of a real-valued random vector with mean zero

• The matrix is assumed to be centered, i.e., the mean of each column is 0

• Let the SVD of X be 𝑋 = 𝑈Σ𝑉𝑇.

• The right singular vectors 𝑣𝑖 are called principal components directions of X.

• The vector $z_1 = Xv_1 = \sigma_1 u_1$ has the largest sample variance among all normalized linear combinations of the columns of X:

$$\mathrm{Var}(z_1) = \mathrm{Var}(Xv_1) = \frac{\sigma_1^2}{m}.$$

• Finding the vector of maximal variance is equivalent to maximizing the Rayleigh quotient:

$$\sigma_1^2 = \max_{v \ne 0} \frac{v^T X^T X v}{v^T v}, \qquad v_1 = \arg\max_{v \ne 0} \frac{v^T X^T X v}{v^T v}.$$

• The normalized vector $u_1 = \frac{1}{\sigma_1} X v_1$ is called the normalized first principal component of X.
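A sketch of the first principal component via the SVD of a centered matrix; in this synthetic example the rows of X play the role of the m samples and the columns the variables, so that z1 = Xv1 = σ1 u1 and v1 maximizes the Rayleigh quotient from the slide.

    rng(5);
    m = 200; n = 4;
    X = randn(m, n) * diag([5 2 1 0.5]);      % synthetic data
    X = X - mean(X);                          % center: each column has zero mean

    [U, S, V] = svd(X, 'econ');
    v1 = V(:, 1);                             % first principal component direction
    z1 = X * v1;                              % equals sigma_1 * u_1
    fprintf('||z1 - sigma1*u1|| = %.2e\n', norm(z1 - S(1,1)*U(:,1)));

    % v1 maximizes the Rayleigh quotient v'*(X'*X)*v / (v'*v), with maximum sigma_1^2:
    fprintf('sigma1^2 = %.4f, Rayleigh quotient at v1 = %.4f\n', ...
            S(1,1)^2, (v1'*(X'*X)*v1) / (v1'*v1));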

(70)

Principal Component Analysis

• Having determined the vector of largest sample variance, we usually want to go on and find the vector of second largest sample variance that is orthogonal to the first

• This is done by computing the vector of largest sample variance of the deflated data matrix 𝑋 − 𝜎1𝑢1𝑣1𝑇.

• Continuing this process we can determine all the principal components in order

• i.e., we compute the singular vectors.

• In the general step of the procedure, the subsequent principal component is defined as the vector of maximal variance subject to the constraint that it is orthogonal to the previous ones

(71)

Principal Component Analysis

(72)

Solving Least Squares Problems

Theorem: Let the matrix $A \in \mathbb{R}^{m \times n}$ have full column rank and thin SVD $A = U_1 \Sigma V^T$. Then the least squares problem $\min_x \|Ax - b\|_2$ has the unique solution

$$x = V \Sigma^{-1} U_1^T b = \sum_{i=1}^{n} \frac{u_i^T b}{\sigma_i} v_i.$$

(73)

Condition Number and Perturbation Theory

The condition number of a rectangular matrix is defined in terms of the SVD. Let A have rank r, i.e., its singular values satisfy

$$\sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_p = 0, \qquad p = \min(m, n).$$

Then the condition number is defined as

$$\kappa(A) = \frac{\sigma_1}{\sigma_r}.$$

(74)

Perturbation Theorem

Assume that the matrix $A \in \mathbb{R}^{m \times n}$, $m \ge n$, has full column rank, and let x be the solution of the least squares problem $\min_x \|Ax - b\|_2$. Let $\delta A$ and $\delta b$ be perturbations such that

$$\eta = \frac{\|\delta A\|_2}{\sigma_n} = \kappa\,\epsilon_A < 1, \qquad \epsilon_A = \frac{\|\delta A\|_2}{\|A\|_2}.$$

Then the perturbed matrix $A + \delta A$ has full rank, and the perturbation of the solution $\delta x$ satisfies

$$\|\delta x\|_2 \le \frac{\kappa}{1 - \eta}\left(\epsilon_A \|x\|_2 + \frac{\|\delta b\|_2}{\|A\|_2} + \epsilon_A\,\kappa\,\frac{\|r\|_2}{\|A\|_2}\right),$$

where r is the residual $r = b - Ax$.

(75)

Important Observations

• The number 𝜅 determines the condition of the least squares problem, and if 𝑚 = 𝑛, then the residual 𝑟 is equal to zero and the inequality becomes a perturbation result for linear systems.

• In the overdetermined case, the residual is usually not equal to zero. Then the conditioning depends on $\kappa^2$.

• This dependence may be significant if the norm of the residual is large.

(76)

Rank-Deficient Systems

Assume that A is rank-deficient, with SVD

$$A = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix},$$

where $U_1 \in \mathbb{R}^{m \times r}$, $\Sigma_1 \in \mathbb{R}^{r \times r}$, $V_1 \in \mathbb{R}^{n \times r}$, and the diagonal elements of $\Sigma_1$ are all nonzero.

Then the least squares problem $\min_x \|Ax - b\|_2$ does not have a unique solution.

However, the problem

$$\min_{x \in \mathcal{L}} \|x\|_2, \qquad \mathcal{L} = \{ x \mid \|Ax - b\|_2 = \min \}$$

has the unique solution

$$x = V \begin{pmatrix} \Sigma_1^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^T b = V_1 \Sigma_1^{-1} U_1^T b.$$

(77)

Underdetermined Systems

The SVD can also be used to solve underdetermined linear systems, i.e., systems with more unknowns than equations.

Let $A \in \mathbb{R}^{m \times n}$, $m < n$, have full row rank, with SVD

$$A = U\,(\Sigma\ \ 0) \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}, \qquad V_1 \in \mathbb{R}^{n \times m}.$$

Then the linear system $Ax = b$ always has a solution, which, however, is nonunique.

The problem

$$\min_{x \in \mathcal{K}} \|x\|_2, \qquad \mathcal{K} = \{ x \mid Ax = b \}$$

has the unique solution

$$x_1 = V_1 \Sigma^{-1} U^T b.$$

(78)

Computing the SVD

• The svd() function in Matlab is an implementation of algorithms from LAPACK

• The double precision function is implemented in DGESVD

• In this algorithm, the matrix is first reduced to bidiagonal form by a series of Householder transformations from the left and the right.

• Then the bidiagonal matrix is iteratively reduced to diagonal form using a variant of what is called the QR algorithm.

• Flops: for a dense matrix, the SVD can be computed in $O(mn^2)$ flops

• For sparse matrices, use the svds() function in Matlab. This function is based on Lanczos methods from ARPACK.

(79)

Complete Orthogonal Decomposition

• When the matrix is rank deficient, computing the SVD is the most reliable method for determining the rank

• However, this is relatively expensive to compute and expensive to update (when new rows/columns are added).

• This motivated methods that approximate the SVD, called complete orthogonal decompositions, which (in the noise-free case and in exact arithmetic) can be written

$$A = Q \begin{pmatrix} T & 0 \\ 0 & 0 \end{pmatrix} Z^T,$$

where Q and Z are orthogonal and $T \in \mathbb{R}^{r \times r}$ is triangular, where r is the rank of A.

• The SVD is a special case of a complete orthogonal decomposition.

• Can be computed by an algorithm based on QR with column pivoting
