Random Feature Moments for Compressive Statistical Learning


(1) Random Feature Moments for Compressive Statistical Learning. Rémi Gribonval, Inria Rennes - Bretagne Atlantique, remi.gribonval@inria.fr. Joint work with: G. Blanchard (U. Potsdam), N. Keriven, Y. Traonmilin (Inria Rennes). SMAI-SIGMA, Nov 2017.

(2) Main contributors & collaborators: Anthony Bourrier, Nicolas Keriven, Yann Traonmilin, Nicolas Tremblay, Gilles Blanchard, Mike Davies, Gilles Puy, Patrick Perez.

(3) Some unforeseen connections?
• Signal processing & machine learning: inverse problems & the generalized method of moments; embeddings with random projections & random features / kernels; image super-resolution, source localization & k-means.
• Continuous vs discrete? Wavelets (1990s): from continuous to discrete. Compressive sensing (2000s): in the discrete world. Compressive learning: continuous strikes back to circumvent the curse of dimensionality! Links with off-the-grid compressive sensing, FRI, high-resolution methods.

(4) Statistical Learning.
Goal: infer parameters $\theta$ from training data to achieve a certain task; ensure generalization to unseen data of similar type (avoid overfitting); focus on a "model free" assumption on the data distribution.
Training collection = large point cloud $X$ of $N$ items drawn i.i.d.: signals, images, …, feature vectors, labels, …
Examples of tasks & parameters: PCA, $\theta$ = principal subspace; Clustering, $\theta$ = centroids; Dictionary learning, $\theta$ = dictionary atoms; Classification, $\theta$ = classifier parameters (e.g. support vectors).
[Illustrations: digit recognition (MNIST), image classification (cars, planes), sound classification (jazz, blues, rock, electro).]
[Interleaved text from an experimental panel on compressive Gaussian mixture estimation: data setup with weights $(\alpha_1, \dots, \alpha_k)$ drawn uniformly on the simplex and entries of $\mu_1, \dots, \mu_k \sim \mathcal{N}(0,1)$; algorithm heuristics: frequencies drawn i.i.d. from $\mathcal{N}(0, I_d)$, the search for a new support function initialized at $r\,u$ with $r$ uniform in $[0, \max_{x \in X} \|x\|_2]$ and $u$ uniform on $B_2(0,1)$, weights fitted with positivity constraints via $\arg\min_{\alpha \in \mathbb{R}^K_+} \|\hat z - U\alpha\|_2^2$; comparison: the sketch is computed on the fly and the data discarded, whereas EM stores the data to perform its standard optimization steps.
Table 1: comparison between the sketched method and an EM algorithm ($n = 20$, $k = 10$, $m = 1000$); quality measured by KL divergence and Hellinger distance:
N = 10^3: Compressed KL 0.68 ± 0.28, Hell. 0.06 ± 0.01, Mem. 0.6; EM KL 0.68 ± 0.44, Hell. 0.07 ± 0.03, Mem. 0.24
N = 10^4: Compressed KL 0.24 ± 0.31, Hell. 0.02 ± 0.02, Mem. 0.6; EM KL 0.19 ± 0.21, Hell. 0.01 ± 0.02, Mem. 2.4
N = 10^5: Compressed KL 0.13 ± 0.15, Hell. 0.01 ± 0.02, Mem. 0.6; EM KL 0.13 ± 0.21, Hell. 0.01 ± 0.02, Mem. 24
Figure 3: Left: example of data and sketch for n = 2. Right: reconstruction quality for n = 10.]

(6) Large-scale learning. Data collection $X = (x_1, x_2, \dots, x_n)$.

(7) Large-scale learning. Data collection $X = (x_1, x_2, \dots, x_n)$: high feature dimension $d$, large collection size $n$ = "volume".

(8) Large-scale learning. Data collection $X = (x_1, x_2, \dots, x_n)$: high feature dimension $d$, large collection size $n$ = "volume". Challenge: compress $X$ before learning?

(9) Compressive learning: three routes (dimension reduction, subsampling, sketching). Route 1, dimension reduction: $Y = MX$, random projections (Johnson-Lindenstrauss lemma); see e.g. [Calderbank & al 2009, Reboredo & al 2013].

(10) Compressive learning: three routes. Route 2, subsampling: keep only a subset $x_1, x_2, \dots$ of the collection; Nyström method & coresets; see e.g. [Williams & Seeger 2000, Agarwal & al 2003, Feldman 2010].

(11) Compressive learning: three routes. Route 3, sketching: a vector of random moments $z \in \mathbb{R}^m$, $z \approx \big(\mathrm{E}\,\Phi_1(X), \dots, \mathrm{E}\,\Phi_m(X)\big)$. Inspiration: compressive sensing [Foucart & Rauhut 2013]; sketching/hashing [Thaper & al 2002, Cormode & al 2005]. Connections with: generalized method of moments [Hall 2005]; kernel mean embeddings [Smola & al 2007, Sriperumbudur & al 2010].

(12) Example: Compressive K-means on MNIST. Training set $X$: $n = 70000$; $d = 784$; $k = 10$.

(13) Example: Compressive K-means on MNIST. Training set: $n = 70000$; $d = k = 10$ (spectral features). [Scatter plot of the training set in spectral feature dimensions 5 and 6.]

(14) Example: Compressive K-means on MNIST ($n = 70000$; $d = k = 10$, spectral features). Sketch vector: $z = \frac{1}{n}\sum_{i=1}^n \Phi(x_i) \in \mathbb{R}^m$. Memory size independent of $n$; $m \gtrsim kd = 100$; streaming / distributed computation. [Scatter plot in spectral feature dimensions 5 and 6.]
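Because the sketch is a plain average of per-sample features, it can be computed in one pass and merged across machines. Below is a minimal sketch of that idea (my own illustration, not the SketchMLbox implementation); it assumes a random Fourier feature map, which the deck only introduces later, and the helper names `make_feature_map`, `stream_sketch` and `merge_sketches` are mine.

```python
import numpy as np

def make_feature_map(omega):
    """Random Fourier feature map Phi(x) = exp(j * Omega^T x) / sqrt(m)."""
    m = omega.shape[1]
    return lambda X: np.exp(1j * X @ omega) / np.sqrt(m)   # (n_batch, m) complex

def stream_sketch(batches, phi, m):
    """One-pass sketch: running average of Phi(x_i); memory O(m), independent of n."""
    z, n_seen = np.zeros(m, dtype=complex), 0
    for X in batches:                                      # X: (n_batch, d) chunk
        z = (n_seen * z + phi(X).sum(axis=0)) / (n_seen + len(X))
        n_seen += len(X)
    return z, n_seen

def merge_sketches(parts):
    """Distributed computation: combine per-machine (sketch, count) pairs."""
    total = sum(n for _, n in parts)
    return sum(n * z for z, n in parts) / total

# toy usage: streaming over chunks gives the same sketch as merging two machines
rng = np.random.default_rng(0)
d, m = 10, 100
omega = rng.normal(size=(d, m))                            # frequencies ~ N(0, I_d)
phi = make_feature_map(omega)
data = rng.normal(size=(1000, d))
z_full, _ = stream_sketch(np.array_split(data, 10), phi, m)
halves = [stream_sketch([data[:600]], phi, m), stream_sketch([data[600:]], phi, m)]
assert np.allclose(z_full, merge_sketches(halves))
```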

(15) Example: Compressive K-means on MNIST ($n = 70000$; $d = k = 10$). Sketch vector $z = \frac{1}{n}\sum_{i=1}^n \Phi(x_i) \in \mathbb{R}^m$: memory size independent of $n$; $m \gtrsim kd = 100$; streaming / distributed computation; privacy-aware.

(16) Example: Compressive K-means on MNIST ($n = 70000$; $d = k = 10$). Sketch vector $z = \frac{1}{n}\sum_{i=1}^n \Phi(x_i) \in \mathbb{R}^m$: memory size independent of $n$; $m \gtrsim kd = 100$; streaming / distributed computation; privacy-aware. Learn centroids from the sketch = moment fitting. [Scatter plot in spectral feature dimensions 5 and 6.]

(18) Agenda. From Statistical Learning to Compressive Learning: principle, guarantees. Worked examples: Compressive PCA, Compressive K-means, Compressive Gaussian Mixture Modeling. Conclusion and perspectives.

(19) Moments & kernel mean embeddings. Data distribution: $X \sim p(x)$. Sketch = vector of generalized moments: $z = \frac{1}{n}\sum_{i=1}^n \Phi(x_i) \approx \mathrm{E}\,\Phi(X) = \int \Phi(x)\,p(x)\,dx$; nonlinear in the feature vectors, linear in the distribution $p(x)$. Finite-dimensional Mean Map Embedding [cf. Smola & al 2007, Sriperumbudur & al 2010]: $\mathcal{M}(p) := \mathrm{E}_{X \sim p}\,\Phi(X)$.
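Spelling out the linearity claim (a one-line check, not on the slide): the empirical sketch is the mean map applied to the empirical distribution $\hat p_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$, and $\mathcal{M}$ acts linearly on distributions, which is what makes the inverse-problem viewpoint of the next slides possible.

```latex
z = \frac{1}{n}\sum_{i=1}^n \Phi(x_i) = \mathcal{M}(\hat p_n),
\qquad
\mathcal{M}\big(\lambda p + (1-\lambda)\,q\big)
  = \lambda\,\mathcal{M}(p) + (1-\lambda)\,\mathcal{M}(q),
\quad 0 \le \lambda \le 1.
```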

(20) Question 1: Information preservation? Signal processing (inverse problems, compressive sensing): a signal $x$ observed through a linear operator, $y = Mx$. Machine learning (method of moments [Hall 2005], compressive learning): an (empirical) probability $p$ observed through a linear "projection", the sketch $z = \mathcal{M}(p)$.

(21) Question 1: Information preservation? Signal processing: signal $x$, observation $y = Mx$. Machine learning: (empirical) probability $p$, sketch $z = \mathcal{M}(p)$. Is recovery possible in principle?

(22) Question 2: Dimension reduction? Signal processing: inverse problems, compressive sensing. Machine learning: method of moments, compressive learning. The operator $M$, resp. $\mathcal{M}$, is to be designed (e.g., at random). How small can the dimension $m$ of the sketch be?

(23) Compressive Statistical Learning. Existence of a good decoder $\Delta$? (Empirical) probability distribution $p^\star$; sketching operator $\mathcal{M}$; empirical sketch vector $\hat z$.

(24) Compressive Statistical Learning. Existence of a good decoder $\Delta$? (Empirical) probability distribution $p^\star$; sketching operator $\mathcal{M}$; empirical sketch vector $\hat z$; reconstruction $\Delta(\hat z)$.

(25) Compressive Statistical Learning. Existence of a good decoder $\Delta$? Desired guarantee, for some quality metric: $\|p^\star - \Delta(\hat z)\|_{\mathcal{L}} \lesssim\ ???$

(26) Compressive Statistical Learning. Existence of a good decoder $\Delta$? Desired guarantee, for some quality metric: $\|p^\star - \Delta(\hat z)\|_{\mathcal{L}} \lesssim \|\mathcal{M}(\hat p_n - p^\star)\|_2$ for all $p^\star \in \Sigma$, on a model set $\Sigma$ of distributions "of interest" (ex: Gaussian Mixture Model).

(27) Compressive Statistical Learning. Lower Restricted Isometry Property (LRIP): $\|q - q'\|_{\mathcal{L}} \lesssim \|\mathcal{M}(q - q')\|_2$ for all $q, q' \in \Sigma$ (cf. [Cohen & al 2009] for sparse recovery). Desired guarantee: $\|p^\star - \Delta(\hat z)\|_{\mathcal{L}} \lesssim \|\mathcal{M}(\hat p_n - p^\star)\|_2$ for all $p^\star \in \Sigma$, on the model set $\Sigma$ of distributions "of interest" (ex: Gaussian Mixture Model).

(28) Compressive Statistical Learning. LRIP: $\|q - q'\|_{\mathcal{L}} \lesssim \|\mathcal{M}(q - q')\|_2$ for all $q, q' \in \Sigma$. Decoder: $\Delta(z) := \arg\min_{p \in \Sigma} \|z - \mathcal{M}p\|_2$, i.e. a generalized method of moments. Guarantee: $\|p^\star - \Delta(\hat z)\|_{\mathcal{L}} \lesssim \|\mathcal{M}(\hat p_n - p^\star)\|_2$ for all $p^\star \in \Sigma$.

(29) Compressive Statistical Learning. LRIP: $\|q - q'\|_{\mathcal{L}} \lesssim \|\mathcal{M}(q - q')\|_2$ for all $q, q' \in \Sigma$ (cf. [Cohen & al 2009] for sparse recovery). Decoder: $\Delta(z) := \arg\min_{p \in \Sigma} \|z - \mathcal{M}p\|_2$, i.e. a generalized method of moments. Guarantee, now for all $p^\star$: $\|p^\star - \Delta(\hat z)\|_{\mathcal{L}} \lesssim \|\mathcal{M}(\hat p_n - p^\star)\|_2 + d(p^\star, \Sigma)$; bonus: stability to modeling error.
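The step from the LRIP to this guarantee is short; here is a sketch of the standard argument (in the spirit of [Cohen & al 2009]), written for the case $p^\star \in \Sigma$; the extra $d(p^\star, \Sigma)$ term in the general case comes from comparing $p^\star$ with its closest element in the model set.

```latex
% Minimal sketch of the argument, assuming p* lies in the model set \Sigma,
% with \hat z = \mathcal{M}(\hat p_n) and \Delta(\hat z) = \arg\min_{p\in\Sigma} \|\hat z - \mathcal{M}p\|_2.
\begin{align*}
\|p^\star - \Delta(\hat z)\|_{\mathcal{L}}
  &\lesssim \|\mathcal{M}(p^\star) - \mathcal{M}(\Delta(\hat z))\|_2
     && \text{(LRIP: both arguments lie in $\Sigma$)}\\
  &\le \|\mathcal{M}(p^\star) - \hat z\|_2 + \|\hat z - \mathcal{M}(\Delta(\hat z))\|_2
     && \text{(triangle inequality)}\\
  &\le 2\,\|\hat z - \mathcal{M}(p^\star)\|_2
     && \text{($\Delta(\hat z)$ minimizes $\|\hat z - \mathcal{M}p\|_2$ over $\Sigma$)}\\
  &= 2\,\|\mathcal{M}(\hat p_n - p^\star)\|_2
     && \text{(linearity of $\mathcal{M}$).}
\end{align*}
```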

(30) Which metric? Statistical learning: the task is given by a loss function $\ell(x, \theta)$; learning = minimizing the risk $\mathcal{R}(p, \theta) = \mathrm{E}_{x \sim p}\,\ell(x, \theta)$ for a given "arbitrary" $p^\star$ (e.g. bounded loss); $\theta^\star \in \arg\min_\theta \mathcal{R}(p^\star, \theta)$. Empirical Risk Minimization (ERM): given a training collection $x_i \sim p^\star$ i.i.d., minimize the empirical estimate $\mathcal{R}(\hat p_n, \theta) = \frac{1}{n}\sum_{i=1}^n \ell(x_i, \theta)$; $\hat\theta_n \in \arg\min_\theta \mathcal{R}(\hat p_n, \theta)$. Examples: k-means, $\ell(x, \theta) = \min_i \|x - \theta_i\|^2$; k-medians, $\ell(x, \theta) = \min_i \|x - \theta_i\|$; PCA, $\ell(x, \theta) = \|x - P_\theta x\|^2$; max likelihood, $\ell(x, \theta) = -\log p_\theta(x)$; … Guarantees for ERM. Goal: control the excess risk, $\mathcal{R}(p^\star, \hat\theta_n) \le \mathcal{R}(p^\star, \theta^\star) + \eta_n$. Sufficient: show that w.h.p. $\sup_\theta |\mathcal{R}(\hat p_n, \theta) - \mathcal{R}(p^\star, \theta)| \le \eta_n/2$.
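To ground the abstract notation, a small numpy/scikit-learn illustration of the k-means instance (my own toy example, not from the slides): the loss $\ell(x, \theta) = \min_i \|x - \theta_i\|^2$, the empirical risk $\mathcal{R}(\hat p_n, \theta)$, and Lloyd's algorithm as the usual approximate empirical risk minimizer.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d, k, n = 2, 3, 5000
true_centroids = rng.uniform(-5, 5, size=(k, d))
X = true_centroids[rng.choice(k, size=n)] + 0.3 * rng.normal(size=(n, d))

def empirical_risk(X, theta):
    """R(p_hat_n, theta) = (1/n) sum_i min_j ||x_i - theta_j||^2  (k-means loss)."""
    sq_dists = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    return sq_dists.min(axis=1).mean()

# ERM heuristic: Lloyd's algorithm, which makes several passes over the full training set
theta_hat = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
print(empirical_risk(X, theta_hat), empirical_risk(X, true_centroids))
```

The compressive route discussed in the rest of the deck replaces the passes over $X$ in the last step with a fit to the $m$-dimensional sketch.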

(33) Which metric? Same setting: loss $\ell(x, \theta)$, risk $\mathcal{R}(p, \theta) = \mathrm{E}_{x \sim p}\,\ell(x, \theta)$, ERM on the training collection, and the goal of controlling the excess risk $\mathcal{R}(p^\star, \hat\theta_n) \le \mathcal{R}(p^\star, \theta^\star) + \eta_n$. Definition (task-based metric): $\|p' - p\|_{\mathcal{L}} := \sup_\theta |\mathcal{R}(p', \theta) - \mathcal{R}(p, \theta)|$.
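With this metric, one short chain of inequalities (not spelled out on the slide) explains why controlling $\|\tilde p - p^\star\|_{\mathcal{L}}$ is enough: any $\tilde\theta$ that minimizes the risk of a surrogate distribution $\tilde p$ has controlled excess risk on $p^\star$.

```latex
% If \tilde\theta \in \arg\min_\theta \mathcal{R}(\tilde p, \theta), then for any p*:
\begin{align*}
\mathcal{R}(p^\star, \tilde\theta) - \mathcal{R}(p^\star, \theta^\star)
  &= \big[\mathcal{R}(p^\star, \tilde\theta) - \mathcal{R}(\tilde p, \tilde\theta)\big]
   + \big[\mathcal{R}(\tilde p, \tilde\theta) - \mathcal{R}(\tilde p, \theta^\star)\big]
   + \big[\mathcal{R}(\tilde p, \theta^\star) - \mathcal{R}(p^\star, \theta^\star)\big]\\
  &\le \|\tilde p - p^\star\|_{\mathcal{L}} + 0 + \|\tilde p - p^\star\|_{\mathcal{L}}
   = 2\,\|\tilde p - p^\star\|_{\mathcal{L}}.
\end{align*}
```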

(34) Question 1: Information preservation? Signal processing: signal $x$, observation $y = Mx$. Machine learning: (empirical) probability $p$, sketch $z = \mathcal{M}(p)$. Is recovery possible in principle?

(35) Question 1: Information preservation? Answer: yes, via a Lower RIP with respect to the task-based metric.

(36) Compressive Statistical Learning. Given a learning task $\ell(x, \theta)$: design a sketching function $\Phi(x) \in \mathbb{R}^m$ and operator $\mathcal{M}(p) := \mathrm{E}_{X \sim p}\,\Phi(X)$; choose a model set $\Sigma$. Given a training collection: compute a sketch vector $z = \frac{1}{n}\sum_{i=1}^n \Phi(x_i) \approx \mathrm{E}\,\Phi(X)$. Given a sketch vector: Step 1, $\tilde p = \Delta(\hat z) = \arg\min_{p \in \Sigma} \|\hat z - \mathcal{M}p\|_2$; Step 2, $\hat\theta \in \arg\min_\theta \mathcal{R}(\tilde p, \theta)$. Guarantees: assume the LRIP $\|p - p'\|_{\mathcal{L}} \lesssim \|\mathcal{M}(p) - \mathcal{M}(p')\|_2$ for all $p, p' \in \Sigma$; then the excess risk is controlled: $\mathcal{R}(p^\star, \hat\theta) - \mathcal{R}(p^\star, \theta^\star) \lesssim \|\mathcal{M}(\hat p_n - p^\star)\|_2 + d(p^\star, \Sigma)$.

(41) Agenda. From Statistical Learning to Compressive Learning. Worked examples: Compressive PCA, Compressive K-means, Compressive Gaussian Mixture Modeling. Conclusion and perspectives.

(42) Compressive PCA. Loss function: $\ell(x, \theta) = \|x - P_\theta x\|^2$, $x \in \mathbb{R}^d$; a hypothesis = a $k$-dimensional subspace $\theta$. Model set $\Sigma$: distributions supported on a $k$-dimensional subspace. Sketch: naive = full covariance matrix $\mathrm{E}_{X \sim p}\,XX^T$, i.e. $m = O(d^2)$; using compressive matrix sensing, $m = O(kd)$ (LRIP proved). Learn from the sketch: low-rank matrix recovery.
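A minimal numpy illustration of the two sketching options (my own toy code, not the construction used in the LRIP proof): PCA only needs the second-moment matrix, which can be accumulated in one pass; the compressive variant keeps only $m = O(kd)$ random linear measurements of that matrix, from which a low-rank matrix recovery solver (not implemented here) would estimate the subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 2, 10000
U_true = np.linalg.qr(rng.normal(size=(d, k)))[0]            # true k-dim subspace
X = rng.normal(size=(n, k)) @ U_true.T + 0.01 * rng.normal(size=(n, d))

# Naive sketch: full second-moment matrix, m = O(d^2), computable in one pass over X
C = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(C)
U_hat = eigvecs[:, -k:]                                      # top-k eigenvectors = PCA subspace

# Compressive sketch: m = O(kd) random linear measurements <A_j, C> of the same matrix
m = 5 * k * d
A = rng.normal(size=(m, d, d))
y = np.einsum('mij,ij->m', A, C)                             # what would be stored instead of C
# (recovering the subspace from y alone is a low-rank matrix recovery problem)

# sanity check: subspace recovered from the naive sketch matches the true one
print(np.linalg.norm(U_true.T @ U_hat @ U_hat.T @ U_true - np.eye(k)))  # ~ 0
```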

(43) Compressive K-Means (clustering). Loss function: $\ell(x, \theta) = \min_i \|x - \theta_i\|^2$; a hypothesis = $k$ centroids $\theta = \{\theta_1, \dots, \theta_k\}$, $\theta_i \in \mathbb{R}^d$. Standard approach: the "K-means algorithm", aka Lloyd-Max [Steinhaus 1956, Lloyd 1957 (publ. 1982)], which makes several passes over the training set. Model set $\Sigma$: mixtures of $k$ Diracs. Naive sketching = histograms: $N = O((R/\epsilon)^d)$ bins of size $\epsilon$ within a domain of size $R$, i.e. sketch size $m = N$; compressive sensing suggests $m = O(k \log N) = O(kd \log(R/\epsilon))$ (a quick numeric comparison follows).
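To make the gap concrete, a back-of-the-envelope comparison with illustrative values $d = 10$, $k = 10$, $R/\epsilon = 100$ (these numbers are mine, not from the slide):

```latex
m_{\mathrm{hist}} = O\big((R/\epsilon)^d\big) = O\big(100^{10}\big) = O\big(10^{20}\big) \text{ bins},
\qquad
m_{\mathrm{CS}} = O\big(k\,d\,\log(R/\epsilon)\big) = O\big(10 \cdot 10 \cdot \log 100\big) \approx 5 \cdot 10^{2}.
```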

(46) Compressive K-means: how to sketch? Which moments to choose? Observation: the distribution $p(x)$ is spatially localized. Intuition (from compressive sensing): use "incoherent" sampling, i.e. choose Fourier measurements = the characteristic function of $\hat p_n$: $z_\ell \approx \mathrm{E}_{X \sim p}\,e^{j\omega_\ell^T X}$, with $\omega_\ell \in \mathbb{R}^d$, $1 \le \ell \le m$. Sketching function: $\Phi(x) = \frac{1}{\sqrt{m}}\big(e^{j\omega_\ell^T x}\big)_{\ell=1}^m$, i.e. Random Fourier Features [Rahimi & Recht 2007]. Sketch vector $z$ = Random Fourier Moments.
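A toy numpy check (my own illustration, with frequencies drawn i.i.d. from $\mathcal{N}(0, I_d)$ as in the experiments): for tightly clustered data, the empirical random Fourier sketch is close to the sketch of the corresponding mixture of $k$ Diracs, $\mathcal{M}(p) = \sum_j \alpha_j \Phi(\theta_j)$, which is exactly what the decoder of the next slides fits.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, m, n = 2, 3, 50, 20000
theta = rng.uniform(-2, 2, size=(k, d))               # true centroids (toy setup)
alpha = np.array([0.5, 0.3, 0.2])                     # mixture weights
labels = rng.choice(k, size=n, p=alpha)
X = theta[labels] + 0.05 * rng.normal(size=(n, d))    # tightly clustered data

omega = rng.normal(size=(d, m))                       # random frequencies ~ N(0, I_d)
Phi = lambda Y: np.exp(1j * Y @ omega) / np.sqrt(m)   # random Fourier features, (p, m)

z_emp = Phi(X).mean(axis=0)                           # empirical sketch (one pass over X)
z_model = alpha @ Phi(theta)                          # sketch of the mixture of k Diracs
print(np.linalg.norm(z_emp - z_model) / np.linalg.norm(z_model))  # small relative error
```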

(47) Learning centroids from a sketch? From theory (optimization problem): Step 1, $\tilde p = \Delta(\hat z) = \arg\min_{p \in \Sigma} \|\hat z - \mathcal{M}p\|_2$; Step 2, $\hat\theta \in \arg\min_\theta \mathcal{R}(\tilde p, \theta)$. Instantiation on the model set of mixtures of $k$ Diracs, $p = \sum_{j=1}^k \alpha_j\,\delta_{\theta_j}$, $\mathcal{M}(p) = \sum_{j=1}^k \alpha_j\,\Phi(\theta_j)$. Learning from the sketch vector: Step 1, $(\tilde\alpha_j, \tilde\theta_j) = \arg\min_{\alpha_j, \theta_j} \|\hat z - \sum_{j=1}^k \alpha_j\,\Phi(\theta_j)\|_2$; Step 2, $\hat\theta_j = \tilde\theta_j$.

(48) Learning centroids from a sketch? Same instantiation as above. This ideal decoding scheme is "highly" non-convex; two approaches: discretize + convex relaxation [Bunea & al 2010], or greedy & gridless.

(49) Learning centroids from a sketch? The ideal decoding scheme is "highly" non-convex; two approaches: discretize + convex relaxation [Bunea & al 2010], or greedy & gridless: MP (Mallat & Zhang 93) > OMP (Pati & al 93) > OMPR (Jain 2011) > CL-OMPR, analogous to Frank-Wolfe, see e.g. [Bredies & al 2013].
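For concreteness, here is a deliberately simplified greedy decoder in the spirit of CL-OMP (not the CL-OMPR implementation of the SketchMLbox): it alternates a randomly initialized local search for a new centroid with a non-negative least-squares update of the weights. The function name, initialization scheme and hyperparameters are mine.

```python
import numpy as np
from scipy.optimize import minimize, nnls

def greedy_decode(z_hat, Phi, d, k, data_radius, n_restarts=5, rng=None):
    """Simplified CL-OMP-style decoder for a mixture of k Diracs.

    z_hat : complex sketch (m,); Phi : feature map, Phi(points (p, d)) -> (p, m) complex.
    Returns estimated centroids (k, d) and non-negative weights (k,) summing to 1.
    """
    rng = rng or np.random.default_rng(0)
    centroids = []

    def weights_for(thetas):
        A = Phi(np.array(thetas)).T                          # (m, t) complex dictionary
        A_ri = np.vstack([A.real, A.imag])                   # stack real/imag for real NNLS
        b_ri = np.concatenate([z_hat.real, z_hat.imag])
        w, _ = nnls(A_ri, b_ri)
        return w

    for _ in range(k):
        # residual of the current non-negative least-squares fit
        if centroids:
            w = weights_for(centroids)
            r = z_hat - w @ Phi(np.array(centroids))
        else:
            r = z_hat.copy()

        # step 1: new centroid most correlated with the residual (multi-start local search)
        def neg_corr(theta):
            return -np.abs(np.vdot(Phi(theta[None, :])[0], r))
        best = min(
            (minimize(neg_corr, data_radius * rng.uniform(-1, 1, size=d), method="Nelder-Mead")
             for _ in range(n_restarts)),
            key=lambda res: res.fun,
        )
        centroids.append(best.x)

    w = weights_for(centroids)
    return np.array(centroids), w / max(w.sum(), 1e-12)
```

On the toy data of the previous snippet, `greedy_decode(z_emp, Phi, d=2, k=3, data_radius=2.0)` typically returns centroids close to `theta` up to permutation; more restarts or a final joint refinement of all atoms and weights (as in CL-OMPR) improve robustness.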

(50) The SketchMLbox (sketchml.gforge.inria.fr): mixtures of Diracs ("K-means"); GMMs with known covariance; GMMs with unknown diagonal covariance. Soon: mixtures of alpha-stable distributions; Gaussian Locally Linear Mapping [Deleforge 2014]. Handles generic mixtures $p = \sum_{j=1}^k \alpha_j\,p_{\theta_j}$ with user-defined $\mathcal{M}p_\theta$ and $\nabla_\theta\,\mathcal{M}p_\theta$. [Slide credit: Nicolas Keriven.]

(51) Back to Question 2: Dimension reduction? Signal processing: inverse problems, compressive sensing. Machine learning: method of moments, compressive learning. The operator $M$, resp. $\mathcal{M}$, is to be designed (e.g., at random). How small can the dimension $m$ of the sketch be?

(53) Theorem: K-medians. Goal: minimize over $\Theta \in \mathbf{\Theta}$ the expected risk $\mathcal{R}(p^\star, \Theta) = \mathrm{E}_{X \sim p^\star} \min_{j=1,\dots,k} \|X - \theta_j\|_2$.

(54) Theorem: K-medians. Goal: minimize over $\Theta$ the expected risk $\mathcal{R}(p^\star, \Theta) = \mathrm{E}_{X \sim p^\star} \min_j \|X - \theta_j\|_2$. Hypotheses ($\Theta \in \mathbf{\Theta}_{\epsilon,M}$): bounded domain in $\mathbb{R}^d$ and separation of the centroids (assumptions on the $k$ centroids, not on the samples).

(55) Theorem: K-medians. Goal and hypotheses as above ($\mathbf{\Theta}_{\epsilon,M}$: bounded domain in $\mathbb{R}^d$, separated centroids). Sketching operator $\mathcal{M}$: reweighted random Fourier moments (the reweighting is needed for the theory and has no effect in practice).

(56) Theorem: K-medians. With the above hypotheses and sketching: if $m \gtrsim O\big(k^2 d \log k\,(\log(kd) + \log(M/\epsilon))\big)$, then w.h.p., for any $p^\star$, the $\hat\Theta$ obtained by minimizing $\|z - \mathcal{M}p_{\Theta,\alpha}\|_2$ (with hyp.) satisfies $\mathcal{R}(p^\star, \hat\Theta) - \mathcal{R}(p^\star, \Theta^\star) \lesssim d(p^\star, \Sigma) + O(\sqrt{1/n})$, where $\Theta^\star$ minimizes the expected risk (with hyp.).

(58) Theorem: K-medians. Same hypotheses, sketching, and conclusion as above; the result is obtained by proving the LRIP.

(59) Theorem: K-medians. Same hypotheses, sketching, and conclusion as above; the result is obtained by proving the LRIP, using optimal transport.

(60) #measurements: phase transitions? In theory, it is sufficient to have $m \gtrsim O(k^2 d)$. Empirically?

(61) #measurements: phase transitions? In theory, $m \gtrsim O(k^2 d)$ is sufficient. Empirically? [Phase-transition plots: K-means, relative loss $\mathrm{E}\,\ell(X, \Theta_{\mathrm{Sketch}})$ vs. $\mathrm{E}\,\ell(X, \Theta_{\mathrm{Lloyd}})$; GMMs with known covariance and with diagonal covariance, relative log-likelihood.]

(62) Conclusion.

(63) Compressive learning with random moments. Promises: controlled resources (memory, flops); distributed, streamed computations; information preservation; … privacy preservation? … provably good & efficient learning algorithms? Applications: compressive PCA; compressive clustering (K-means, K-medians), with surprising links to line spectral estimation / "super-resolution" / off-the-grid methods; compressive mixture modeling (Gaussians, alpha-stable, …); … dictionary learning, supervised classification? … universal sketching functions? Details: toolbox sketchml.gforge.inria.fr; preprint arxiv.org/abs/1706.07180.

(65) Thanks!

(66)
