An Adaptive Test of Independence with Analytic Kernel Embeddings


HAL Id: hal-01385111

https://hal.archives-ouvertes.fr/hal-01385111v2

Submitted on 28 Jul 2017



Wittawat Jitkrittum, Zoltán Szabó, Arthur Gretton

To cite this version:

Wittawat Jitkrittum, Zoltán Szabó, Arthur Gretton. An Adaptive Test of Independence with Analytic Kernel Embeddings. International Conference on Machine Learning (ICML), Aug 2017, Sydney, Australia. pp. 1742-1751. ⟨hal-01385111v2⟩



Wittawat Jitkrittum¹ Zoltán Szabó² Arthur Gretton¹

Abstract

A new computationally efficient dependence measure, and an adaptive statistical test of independence, are proposed. The dependence measure is the difference between analytic embeddings of the joint distribution and the product of the marginals, evaluated at a finite set of locations (features). These features are chosen so as to maximize a lower bound on the test power, resulting in a test that is data-efficient, and that runs in linear time (with respect to the sample size $n$). The optimized features can be interpreted as evidence to reject the null hypothesis, indicating regions in the joint domain where the joint distribution and the product of the marginals differ most.

Consistency of the independence test is established, for an appropriate choice of features. In real-world benchmarks, independence tests using the optimized features perform comparably to the state-of-the-art quadratic-time HSIC test, and outperform competing $O(n)$ and $O(n \log n)$ tests.

1. Introduction

We consider the design of adaptive, nonparametric statistical tests of dependence: that is, tests of whether a joint distribution $P_{xy}$ factorizes into the product of marginals $P_x P_y$, with the null hypothesis $H_0$: $X$ and $Y$ are independent. While classical tests of dependence, such as Pearson's correlation and Kendall's $\tau$, are able to detect monotonic relations between univariate variables, more modern tests can address complex interactions, for instance changes in variance of $X$ with the value of $Y$. Key to many recent tests is to examine covariance or correlation between data features. These interactions become significantly harder to detect, and the features are more difficult to design, when the data reside in high dimensions.

Zoltán Szabó's ORCID ID: 0000-0001-6183-7603. Arthur Gretton's ORCID ID: 0000-0003-3169-7624. ¹Gatsby Unit, University College London, UK. ²CMAP, École Polytechnique, France.

Correspondence to: Wittawat Jitkrittum <[email protected]>.

A basic nonlinear dependence measure is the Hilbert-Schmidt Independence Criterion (HSIC), which is the Hilbert-Schmidt norm of the covariance operator between feature mappings of the random variables (Gretton et al., 2005; 2008). Each random variable $X$ and $Y$ is mapped to a respective reproducing kernel Hilbert space $\mathcal{H}_k$ and $\mathcal{H}_l$. For sufficiently rich mappings, the covariance operator norm is zero if and only if the variables are independent. A second basic nonlinear dependence measure is the smoothed difference between the characteristic function of the joint distribution, and that of the product of marginals. When a particular smoothing function is used, the statistic corresponds to the covariance between distances of $X$ and $Y$ variable pairs (Feuerverger, 1993; Székely et al., 2007; Székely & Rizzo, 2009), yielding a simple test statistic based on pairwise distances. It has been shown by Sejdinovic et al. (2013) that the distance covariance (and its generalization to semi-metrics) is an instance of HSIC for an appropriate choice of kernels. A disadvantage of these feature covariance statistics, however, is that they require quadratic time to compute (except in the special case of the distance covariance with univariate real-valued variables, where Huo & Székely (2016) achieve an $O(n \log n)$ cost). Moreover, the feature covariance statistics have intractable null distributions, and either a permutation approach or the solution of an expensive eigenvalue problem (e.g. Zhang et al., 2011) is required for consistent estimation of the quantiles.

Several approaches were proposed by Zhang et al. (2017) to obtain faster tests along the lines of HSIC. These include computing HSIC on finite-dimensional feature mappings chosen as random Fourier features (RFFs) (Rahimi & Recht, 2008), a block-averaged statistic, and a Nyström approximation to the statistic. Key to each of these approaches is a more efficient computation of the statistic and its threshold under the null distribution: for RFFs, the null distribution is a finite weighted sum of $\chi^2$ variables; for the block-averaged statistic, the null distribution is asymptotically normal; for Nyström, either a permutation approach is employed, or the spectrum of the Nyström approximation to the kernel matrix is used in approximating the null distribution. Each of these methods costs significantly less than the $O(n^2)$ cost of the full HSIC (the cost is linear in $n$, but also depends quadratically on the number of features retained). A potential disadvantage of the Nyström and Fourier approaches is that the features are not optimized to maximize test power, but are chosen randomly. The block statistic performs worse than both, due to the large variance of the statistic under the null (which can be mitigated by observing more data).

In addition to feature covariances, correlation measures have also been developed in infinite dimensional feature spaces: in particular, Bach & Jordan (2002) and Fukumizu et al. (2008) proposed statistics on the correlation operator in a reproducing kernel Hilbert space. While convergence has been established for certain of these statistics, their computational cost is high at $O(n^3)$, and test thresholds have relied on permutation. A number of much faster approaches to testing based on feature correlations have been proposed, however. For instance, Dauxois & Nkiet (1998) compute statistics of the correlation between finite sets of basis functions, chosen for instance to be step functions or low order B-splines. The cost of this approach is $O(n)$. This idea was extended by Lopez-Paz et al. (2013), who computed the canonical correlation between finite sets of basis functions chosen as random Fourier features; in addition, they performed a copula transform on the inputs, with a total cost of $O(n \log n)$. Finally, space partitioning approaches have also been proposed, based on statistics such as the KL divergence; however, these apply only to univariate variables (Heller et al., 2016), or to multivariate variables of low dimension (Gretton & Györfi, 2010) (that said, these tests have other advantages of theoretical interest, notably distribution-independent test thresholds).

The approach we take is most closely related to HSIC on a finite set of features. Our simplest test statistic, the Finite Set Independence Criterion (FSIC), is an average of covariances of analytic functions (i.e., features) defined on each of $X$ and $Y$. A normalized version of the statistic (NFSIC) yields a distribution-independent asymptotic test threshold.

We show that our test is consistent, despite a finite number of analytic features being used, via a generalization of arguments in Chwialkowski et al. (2015). As in recent work on two-sample testing by Jitkrittum et al. (2016), our test is adaptive in the sense that we choose our features on a held-out validation set to optimize a lower bound on the test power. The design of features for independence testing turns out to be quite different to the case of two-sample testing, however: the task is to find correlated feature pairs on the respective marginal domains, rather than attempting to find a single, high-dimensional feature representation on the tensor product of the marginals, as we would need to do if we were comparing distributions $P_{xy}$ and $Q_{xy}$. While the use of coupled feature pairs on the marginals entails a smaller feature space dimension, it introduces significant complications in the proof of the lower bound, compared with the two-sample case. We demonstrate the performance of our tests on several challenging artificial and real-world datasets, including detection of dependence between music and its year of appearance, and between videos and captions. In these experiments, we outperform competing linear and $O(n \log n)$ time tests.

2. Independence Criteria and Statistical Tests

We introduce two test statistics: first, the Finite Set Independence Criterion (FSIC), which builds on the principle that dependence can be measured in terms of the covariance between data features. Next, we propose a normalized version of this statistic (NFSIC), with a simpler asymptotic distribution when $P_{xy} = P_x P_y$. We show how to select features for the latter statistic to maximize a lower bound on the power of its corresponding statistical test.

2.1. The Finite Set Independence Criterion

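We begin by recalling the Hilbert-Schmidt Independence Criterion (HSIC) as proposed in Gretton et al. (2005), since our unnormalized statistic is built along similar lines. Consider two random variables $X \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $Y \in \mathcal{Y} \subseteq \mathbb{R}^{d_y}$. Denote by $P_{xy}$ the joint distribution between $X$ and $Y$; $P_x$ and $P_y$ are the marginal distributions of $X$ and $Y$. Let $\otimes$ denote the tensor product, such that $(a \otimes b)c = a \langle b, c \rangle$. Assume that $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $l: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ are positive definite kernels associated with reproducing kernel Hilbert spaces (RKHS) $\mathcal{H}_k$ and $\mathcal{H}_l$, respectively. Let $\|\cdot\|_{HS}$ be the norm on the space of $\mathcal{H}_l \to \mathcal{H}_k$ Hilbert-Schmidt operators. Then, HSIC between $X$ and $Y$ is defined as

$$
\begin{aligned}
\mathrm{HSIC}(X, Y) &= \|\mu_{xy} - \mu_x \otimes \mu_y\|_{HS}^2 \\
&= E_{(x,y),(x',y')}[k(x,x')\, l(y,y')] + E_x E_{x'}[k(x,x')]\, E_y E_{y'}[l(y,y')] \\
&\quad - 2 E_{(x,y)}\big[E_{x'}[k(x,x')]\, E_{y'}[l(y,y')]\big],
\end{aligned}
\tag{1}
$$

where $E_x := E_{x \sim P_x}$, $E_y := E_{y \sim P_y}$, $E_{xy} := E_{(x,y) \sim P_{xy}}$, and $x'$ is an independent copy of $x$. The mean embedding of $P_{xy}$ belongs to the space of Hilbert-Schmidt operators from $\mathcal{H}_l$ to $\mathcal{H}_k$, $\mu_{xy} := \int_{\mathcal{X} \times \mathcal{Y}} k(x,\cdot) \otimes l(y,\cdot)\, \mathrm{d}P_{xy}(x,y) \in HS(\mathcal{H}_l, \mathcal{H}_k)$, and the marginal mean embeddings are $\mu_x := \int_{\mathcal{X}} k(x,\cdot)\, \mathrm{d}P_x(x) \in \mathcal{H}_k$ and $\mu_y := \int_{\mathcal{Y}} l(y,\cdot)\, \mathrm{d}P_y(y) \in \mathcal{H}_l$ (Smola et al., 2007). Gretton et al. (2005, Theorem 4) show that if the kernels $k$ and $l$ are universal (Steinwart & Christmann, 2008) on compact domains $\mathcal{X}$ and $\mathcal{Y}$, then $\mathrm{HSIC}(X, Y) = 0$ if and only if $X$ and $Y$ are independent.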
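As an illustration of why HSIC costs quadratic time, the three expectations in (1) can be replaced by sample averages over $n \times n$ Gram matrices. The sketch below is not the authors' code; the Gaussian kernel, the bandwidth of 1 and the function names `gauss_gram` and `hsic_plugin` are assumptions made for the example.

```python
import numpy as np

def gauss_gram(A, sigma):
    """n x n Gram matrix of the Gaussian kernel exp(-||a - a'||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic_plugin(K, L):
    """Plug-in (V-statistic) estimate of (1) from symmetric Gram matrices K, L."""
    n = K.shape[0]
    term1 = np.sum(K * L) / n**2                               # E[k(x,x') l(y,y')]
    term2 = K.sum() * L.sum() / n**4                           # E E[k] times E E[l]
    term3 = 2.0 * np.sum(K.sum(axis=1) * L.sum(axis=1)) / n**3 # cross term
    return term1 + term2 - term3

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = x[:, :1] ** 2 + 0.1 * rng.normal(size=(200, 1))  # y depends on x
hsic_dep = hsic_plugin(gauss_gram(x, 1.0), gauss_gram(y, 1.0))
```

Both the memory and the time of this estimate grow as $n^2$ because the full Gram matrices are formed, which is the cost the linear-time statistic introduced next avoids.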

Given a joint sample $Z_n = \{(x_i, y_i)\}_{i=1}^n \sim P_{xy}$, an empirical estimator of HSIC can be computed in $O(n^2)$ time by replacing the population expectations in (1) with their corresponding empirical expectations based on $Z_n$.

We now propose our new linear-time dependence measure, the Finite Set Independence Criterion (FSIC). Let $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ be open sets. Let $\mu_x\mu_y(x, y) := \mu_x(x)\mu_y(y)$. The idea is to see $\mu_{xy}(v, w) = E_{xy}[k(x,v)\, l(y,w)]$, $\mu_x(v) = E_x[k(x,v)]$ and $\mu_y(w) = E_y[l(y,w)]$ as smooth functions, and consider a new distance between $\mu_{xy}$ and $\mu_x\mu_y$ instead of a Hilbert-Schmidt distance as in HSIC (Gretton et al., 2005). The new measure is given by the average of squared differences between $\mu_{xy}$ and $\mu_x\mu_y$, evaluated at $J$ random test locations $V_J := \{(v_i, w_i)\}_{i=1}^J \subset \mathcal{X} \times \mathcal{Y}$:

$$
\mathrm{FSIC}^2(X, Y) := \frac{1}{J} \sum_{i=1}^{J} \left[\mu_{xy}(v_i, w_i) - \mu_x(v_i)\mu_y(w_i)\right]^2 = \frac{1}{J} \sum_{i=1}^{J} u^2(v_i, w_i) = \frac{1}{J} \|\mathbf{u}\|_2^2,
$$

where

$$
\begin{aligned}
u(v, w) &:= \mu_{xy}(v, w) - \mu_x(v)\mu_y(w) \\
&= E_{xy}[k(x,v)\, l(y,w)] - E_x[k(x,v)]\, E_y[l(y,w)] \\
&= \mathrm{cov}_{xy}[k(x,v),\, l(y,w)],
\end{aligned}
\tag{2}
$$

$\mathbf{u} := (u(v_1, w_1), \ldots, u(v_J, w_J))^\top$, and $\{(v_i, w_i)\}_{i=1}^J$ are realizations from an absolutely continuous distribution (wrt the Lebesgue measure).

Our first result in Proposition 2 states that $\mathrm{FSIC}(X, Y)$ almost surely defines a dependence measure for the random variables $X$ and $Y$, provided that the product kernel on the joint space $\mathcal{X} \times \mathcal{Y}$ is characteristic and analytic (see Definition 1).

Definition 1 (Analytic kernels (Chwialkowski et al., 2015)). Let $\mathcal{X}$ be an open set in $\mathbb{R}^d$. A positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be analytic on its domain $\mathcal{X} \times \mathcal{X}$ if for all $v \in \mathcal{X}$, $f(x) := k(x, v)$ is an analytic function on $\mathcal{X}$.

Assumption A. The kernels $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $l: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ are bounded by $B_k$ and $B_l$ respectively [$\sup_{x,x' \in \mathcal{X}} k(x,x') \le B_k$, $\sup_{y,y' \in \mathcal{Y}} l(y,y') \le B_l$], and the product kernel $g((x,y),(x',y')) := k(x,x')\, l(y,y')$ is characteristic (Sriperumbudur et al., 2010, Definition 6), and analytic (Definition 1) on $(\mathcal{X} \times \mathcal{Y}) \times (\mathcal{X} \times \mathcal{Y})$.

Proposition 2 (FSIC is a dependence measure). Assume that Assumption A holds, and that the test locations $V_J = \{(v_i, w_i)\}_{i=1}^J$ are drawn from an absolutely continuous distribution $\eta$. Then, $\eta$-almost surely, it holds that $\mathrm{FSIC}(X, Y) = \frac{1}{\sqrt{J}} \|\mathbf{u}\|_2 = 0$ if and only if $X$ and $Y$ are independent.

Proof. Since $g$ is characteristic, the mean embedding map $\Pi_g: P \mapsto E_{(x,y) \sim P}[g((x,y), \cdot)]$ is injective (Sriperumbudur et al., 2010, Section 3), where $P$ is a probability distribution on $\mathcal{X} \times \mathcal{Y}$. Since $g$ is analytic, by Lemma 10 (Appendix), $\mu_{xy}$ and $\mu_x\mu_y$ are analytic functions. Thus, Lemma 11 (Appendix, setting $\Lambda = \Pi_g$) guarantees that $\mathrm{FSIC}(X, Y) = 0 \iff P_{xy} = P_x P_y \iff X$ and $Y$ are independent almost surely.

FSIC uses $\mu_{xy}$ as a proxy for $P_{xy}$, and $\mu_x\mu_y$ as a proxy for $P_x P_y$. Proposition 2 states that, to detect the dependence between $X$ and $Y$, it is sufficient to evaluate the difference of the population joint embedding $\mu_{xy}$ and the embedding of the product of the marginal distributions $\mu_x\mu_y$ at a finite number of locations (defined by $V_J$). The intuitive explanation of this property is as follows. If $P_{xy} = P_x P_y$, then $u(v, w) = 0$ everywhere, and $\mathrm{FSIC}(X, Y) = 0$ for any $V_J$. If $P_{xy} \neq P_x P_y$, then $u$ will not be a zero function, since the mean embedding map is injective (this requires the product kernel to be characteristic). Using the same argument as in Chwialkowski et al. (2015), since $k$ and $l$ are analytic, $u$ is also analytic, and the set of roots $R_u := \{(v, w) \mid u(v, w) = 0\}$ has Lebesgue measure zero. Thus, it is sufficient to draw $(v, w)$ from an absolutely continuous distribution to have $(v, w) \notin R_u$ $\eta$-almost surely, and hence $\mathrm{FSIC}(X, Y) > 0$. We note that a characteristic kernel which is not analytic may produce $u$ such that $R_u$ has a positive Lebesgue measure. In this case, there is a positive probability that $(v, w) \in R_u$, resulting in a potential failure to detect the dependence.

The next proposition shows that Gaussian kernels $k$ and $l$ yield a product kernel which is characteristic and analytic; in other words, this is an example in which Assumption A holds.

Proposition 3 (A product of Gaussian kernels is characteristic and analytic). Let $k(x, x') = \exp\left(-(x - x')^\top A (x - x')\right)$ and $l(y, y') = \exp\left(-(y - y')^\top B (y - y')\right)$ be Gaussian kernels on $\mathbb{R}^{d_x} \times \mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y} \times \mathbb{R}^{d_y}$ respectively, for positive definite matrices $A$ and $B$. Then, $g((x,y),(x',y')) = k(x,x')\, l(y,y')$ is characteristic and analytic on $(\mathbb{R}^{d_x} \times \mathbb{R}^{d_y}) \times (\mathbb{R}^{d_x} \times \mathbb{R}^{d_y})$.

Proof (sketch). The main idea is to use the fact that a Gaussian kernel is analytic, and a product of Gaussian kernels is a Gaussian kernel on the pair of variables. See the full proof in Appendix D.
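The key step of the proof sketch, that a product of Gaussian kernels is a Gaussian kernel on the stacked variable $(x, y)$, can be checked numerically. The following is a small sanity check rather than a proof; the block-diagonal matrix `C`, the random choice of $A$ and $B$, and the helper name `gauss_kernel` are assumptions made for the example.

```python
import numpy as np

def gauss_kernel(a, b, M):
    """k(a, b) = exp(-(a - b)^T M (a - b)) for a positive definite matrix M."""
    d = a - b
    return float(np.exp(-d @ M @ d))

rng = np.random.default_rng(1)
dx, dy = 3, 2
# Random positive definite matrices (an illustrative choice).
Ra, Rb = rng.normal(size=(dx, dx)), rng.normal(size=(dy, dy))
A = Ra @ Ra.T + np.eye(dx)
B = Rb @ Rb.T + np.eye(dy)
# Block-diagonal matrix acting on the stacked variable (x, y).
C = np.block([[A, np.zeros((dx, dy))], [np.zeros((dy, dx)), B]])

x, x2 = rng.normal(size=dx), rng.normal(size=dx)
y, y2 = rng.normal(size=dy), rng.normal(size=dy)
product = gauss_kernel(x, x2, A) * gauss_kernel(y, y2, B)
joint = gauss_kernel(np.concatenate([x, y]), np.concatenate([x2, y2]), C)
```

Here `product` and `joint` agree up to floating-point error, and `C` is positive definite because both of its diagonal blocks are.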

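Plug-in Estimator. Assume that we observe a joint sample $Z_n := \{(x_i, y_i)\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} P_{xy}$. Unbiased estimators of $\mu_{xy}(v, w)$ and $\mu_x\mu_y(v, w)$ are $\hat{\mu}_{xy}(v, w) := \frac{1}{n} \sum_{i=1}^n k(x_i, v)\, l(y_i, w)$ and $\widehat{\mu_x\mu_y}(v, w) := \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i} k(x_i, v)\, l(y_j, w)$, respectively. A straightforward empirical estimator of $\mathrm{FSIC}^2$ is then given by

$$
\widehat{\mathrm{FSIC}^2}(Z_n) = \frac{1}{J} \sum_{i=1}^{J} \hat{u}(v_i, w_i)^2,
$$

$$
\hat{u}(v, w) := \hat{\mu}_{xy}(v, w) - \widehat{\mu_x\mu_y}(v, w) \tag{3}
$$

$$
= \frac{2}{n(n-1)} \sum_{i<j} h_{(v,w)}((x_i, y_i), (x_j, y_j)), \tag{4}
$$

where $h_{(v,w)}((x, y), (x', y')) := \frac{1}{2}\,(k(x, v) - k(x', v))(l(y, w) - l(y', w))$. For conciseness, we define $\hat{\mathbf{u}} := (\hat{u}_1, \ldots, \hat{u}_J)^\top \in \mathbb{R}^J$ where $\hat{u}_i := \hat{u}(v_i, w_i)$, so that $\widehat{\mathrm{FSIC}^2}(Z_n) = \frac{1}{J} \hat{\mathbf{u}}^\top \hat{\mathbf{u}}$.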

$\widehat{\mathrm{FSIC}^2}$ can be efficiently computed in $O((d_x + d_y) J n)$ time, which is linear in $n$ [see (3), which does not have nested double sums], assuming that the runtime complexity of evaluating $k(x, v)$ is $O(d_x)$ and that of $l(y, w)$ is $O(d_y)$.
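A minimal sketch of the plug-in estimator follows, assuming Gaussian kernels of the form $\exp(-\|x - v\|^2 / (2\sigma^2))$; the function names are ours and not taken from the released code. The double sum over $i \neq j$ is rewritten as $(\sum_i k_i)(\sum_j l_j) - \sum_i k_i l_i$, which gives the linear cost; `u_hat_ustat` evaluates the quadratic-time form (4) and is included only to check the algebra.

```python
import numpy as np

def gauss_feat(locs, data, sigma):
    """J x n matrix with entries exp(-||loc_i - data_j||^2 / (2 sigma^2))."""
    d2 = np.sum((locs[:, None, :] - data[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def u_hat(K, L):
    """Linear-time form of (3); K[i, j] = k(v_i, x_j), L[i, j] = l(w_i, y_j)."""
    n = K.shape[1]
    mu_xy = np.mean(K * L, axis=1)
    # sum_{i != j} k_i l_j = (sum_i k_i)(sum_j l_j) - sum_i k_i l_i
    cross = K.sum(axis=1) * L.sum(axis=1) - np.sum(K * L, axis=1)
    return mu_xy - cross / (n * (n - 1))

def fsic2_hat(K, L):
    """Average of the squared entries of u_hat over the J locations."""
    u = u_hat(K, L)
    return float(u @ u) / len(u)

def u_hat_ustat(K, L):
    """Same quantity via the O(n^2) U-statistic (4); for checking only."""
    n = K.shape[1]
    dk = K[:, :, None] - K[:, None, :]
    dl = L[:, :, None] - L[:, None, :]
    h = 0.5 * dk * dl  # h for every ordered pair (i, j); the diagonal is zero
    return h.sum(axis=(1, 2)) / (n * (n - 1))

rng = np.random.default_rng(0)
n, J = 300, 4
x = rng.normal(size=(n, 2))
y = np.sin(x[:, :1]) + 0.1 * rng.normal(size=(n, 1))
V, W = rng.normal(size=(J, 2)), rng.normal(size=(J, 1))  # random test locations
K, L = gauss_feat(V, x, 1.0), gauss_feat(W, y, 1.0)
stat = fsic2_hat(K, L)
```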

Since FSIC satisfies $\mathrm{FSIC}(X, Y) = 0 \iff X \perp Y$, in principle its empirical estimator can be used as a test statistic for an independence test proposing a null hypothesis $H_0$: "$X$ and $Y$ are independent" against an alternative $H_1$: "$X$ and $Y$ are dependent." The null distribution (i.e., the distribution of the test statistic assuming that $H_0$ is true) is challenging to obtain, however, and depends on the unknown $P_{xy}$. This prompts us to consider a normalized version of FSIC whose asymptotic null distribution takes a more convenient form. We first derive the asymptotic distribution of $\hat{\mathbf{u}}$ in Proposition 4, which we use to derive the normalized test statistic in Theorem 5. As a shorthand, we write $z := (x, y)$ and $t := (v, w)$; $\mathrm{cov}_z$ denotes covariance and $\mathbb{V}_z$ stands for variance.

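Proposition 4 (Asymptotic distribution of $\hat{\mathbf{u}}$). Define $\mathbf{u} := (u(t_1), \ldots, u(t_J))^\top$, $\tilde{k}(x, v) := k(x, v) - E_{x'} k(x', v)$, and $\tilde{l}(y, w) := l(y, w) - E_{y'} l(y', w)$. Let $\Sigma = [\Sigma_{ij}] \in \mathbb{R}^{J \times J}$ be the positive semi-definite matrix with entries $\Sigma_{ij} = \mathrm{cov}_z(\hat{u}(t_i), \hat{u}(t_j)) = E_{xy}[\tilde{k}(x, v_i)\, \tilde{l}(y, w_i)\, \tilde{k}(x, v_j)\, \tilde{l}(y, w_j)] - u(t_i) u(t_j)$. Then, under both $H_0$ and $H_1$, for any fixed test locations $\{t_1, \ldots, t_J\}$ for which $\Sigma$ is full rank, and $0 < \mathbb{V}_z[h_{t_j}(z)] < \infty$ for $j = 1, \ldots, J$, it holds that $\sqrt{n}(\hat{\mathbf{u}} - \mathbf{u}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma)$.

Proof. For a fixed $\{t_1, \ldots, t_J\}$, $\hat{\mathbf{u}}$ is a one-sample second-order multivariate U-statistic with a U-statistic kernel $h_t$. Thus, by Lehmann (1999, Theorem 6.1.6) and Kowalski & Tu (2008, Section 5.1, Theorem 1), it follows directly that $\sqrt{n}(\hat{\mathbf{u}} - \mathbf{u}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma)$, where we note that $E_{xy}[\tilde{k}(x, v)\, \tilde{l}(y, w)] = u(v, w)$.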

Recall from Proposition 2 that $\mathbf{u} = \mathbf{0}$ holds almost surely under $H_0$. The asymptotic normality described in Proposition 4 implies that $n\widehat{\mathrm{FSIC}^2} = \frac{n}{J} \hat{\mathbf{u}}^\top \hat{\mathbf{u}}$ converges in distribution to a sum of $J$ dependent weighted $\chi^2$ random variables. The dependence comes from the fact that the coordinates $\hat{u}_1, \ldots, \hat{u}_J$ of $\hat{\mathbf{u}}$ all depend on the sample $Z_n$. This null distribution is not analytically tractable, and requires a large number of simulations to compute the rejection threshold $T_\alpha$ for a given significance value $\alpha$.

2.2. Normalized FSIC and Adaptive Test

For the purpose of an independence test, we will consider a normalized variant of $\widehat{\mathrm{FSIC}^2}$, which we call $\widehat{\mathrm{NFSIC}^2}$, whose tractable asymptotic null distribution is $\chi^2(J)$, the chi-squared distribution with $J$ degrees of freedom. We then show that the independence test defined by $\widehat{\mathrm{NFSIC}^2}$ is consistent. These results are given in Theorem 5.

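Theorem 5 (Independence test based on $\widehat{\mathrm{NFSIC}^2}$ is consistent). Let $\hat{\Sigma}$ be a consistent estimate of $\Sigma$ based on the joint sample $Z_n$, where $\Sigma$ is defined in Proposition 4. Assume that $V_J = \{(v_i, w_i)\}_{i=1}^J \sim \eta$ where $\eta$ is absolutely continuous wrt the Lebesgue measure. The $\widehat{\mathrm{NFSIC}^2}$ statistic is defined as $\hat{\lambda}_n := n \hat{\mathbf{u}}^\top \left(\hat{\Sigma} + \gamma_n I\right)^{-1} \hat{\mathbf{u}}$ where $\gamma_n \ge 0$ is a regularization parameter. Assume that

1. Assumption A holds.

2. $\Sigma$ is invertible $\eta$-almost surely.

3. $\lim_{n \to \infty} \gamma_n = 0$.

Then, for any $k$, $l$ and $V_J$ satisfying the assumptions,

1. Under $H_0$, $\hat{\lambda}_n \xrightarrow{d} \chi^2(J)$ as $n \to \infty$.

2. Under $H_1$, for any $r \in \mathbb{R}$, $\lim_{n \to \infty} P\left(\hat{\lambda}_n \ge r\right) = 1$ $\eta$-almost surely. That is, the independence test based on $\widehat{\mathrm{NFSIC}^2}$ is consistent.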

Proof (sketch). Under $H_0$, $n \hat{\mathbf{u}}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{\mathbf{u}}$ asymptotically follows $\chi^2(J)$ because $\sqrt{n}\hat{\mathbf{u}}$ is asymptotically normally distributed (see Proposition 4). Claim 2 builds on the result in Proposition 2 stating that $\mathbf{u} \neq \mathbf{0}$ under $H_1$; it follows using the convergence of $\hat{\mathbf{u}}$ to $\mathbf{u}$. The full proof can be found in Appendix E.

Theorem 5 states that if $H_1$ holds, the statistic can be arbitrarily large as $n$ increases, allowing $H_0$ to be rejected for any fixed threshold. Asymptotically the test threshold $T_\alpha$ is given by the $(1 - \alpha)$-quantile of $\chi^2(J)$ and is independent of $n$. The assumption on the consistency of $\hat{\Sigma}$ is required to obtain the asymptotic chi-squared distribution. The regularization parameter $\gamma_n$ is to ensure that $(\hat{\Sigma} + \gamma_n I)^{-1}$ can be stably computed. In practice, $\gamma_n$ requires no tuning, and can be set to be a very small constant. We emphasize that $J$ need not increase with $n$ for test consistency.
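Because the threshold depends only on $J$ and $\alpha$, the decision rule is short to write down. The sketch below uses only the Python standard library and approximates the $\chi^2(J)$ quantile with the Wilson-Hilferty formula; this approximation is our choice for a dependency-free example and is not part of the paper. An exact quantile could instead be obtained from a statistics library, for example `scipy.stats.chi2.ppf`.

```python
from statistics import NormalDist

def chi2_quantile_wh(p, dof):
    """Wilson-Hilferty approximation to the p-quantile of chi^2(dof)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c**0.5) ** 3

def reject_h0(stat, J, alpha=0.05):
    """Reject H0 when the statistic exceeds the (1 - alpha)-quantile of chi^2(J)."""
    return stat > chi2_quantile_wh(1.0 - alpha, J)

threshold = chi2_quantile_wh(0.95, 10)  # approximate T_alpha for J = 10
```

The function `reject_h0` takes a value of $\hat{\lambda}_n$ and the number of locations $J$; no permutations or simulations of the null distribution are needed.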

The next proposition states that the computational complexity of the $\widehat{\mathrm{NFSIC}^2}$ estimator is linear in both the input dimension and sample size, and that it can be expressed in terms of the matrices $K = [K_{ij}] = [k(v_i, x_j)] \in \mathbb{R}^{J \times n}$ and $L = [L_{ij}] = [l(w_i, y_j)] \in \mathbb{R}^{J \times n}$. In contrast to typical kernel methods, a large Gram matrix of size $n \times n$ is not needed to compute $\widehat{\mathrm{NFSIC}^2}$.

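Proposition 6 (An empirical estimator of $\widehat{\mathrm{NFSIC}^2}$). Let $\mathbf{1}_n := (1, \ldots, 1)^\top \in \mathbb{R}^n$. Denote by $\circ$ the element-wise matrix product. Then,

1. $\hat{\mathbf{u}} = \dfrac{(K \circ L)\mathbf{1}_n}{n - 1} - \dfrac{(K\mathbf{1}_n) \circ (L\mathbf{1}_n)}{n(n - 1)}$.

2. A consistent estimator for $\Sigma$ is $\hat{\Sigma} = \dfrac{\Gamma\Gamma^\top}{n}$ where $\Gamma := (K - n^{-1} K \mathbf{1}_n \mathbf{1}_n^\top) \circ (L - n^{-1} L \mathbf{1}_n \mathbf{1}_n^\top) - \hat{\mathbf{u}}^b \mathbf{1}_n^\top$ and $\hat{\mathbf{u}}^b = n^{-1}(K \circ L)\mathbf{1}_n - n^{-2}(K\mathbf{1}_n) \circ (L\mathbf{1}_n)$.

Assume that the complexity of the kernel evaluation is linear in the input dimension. Then the test statistic $\hat{\lambda}_n = n \hat{\mathbf{u}}^\top \left(\hat{\Sigma} + \gamma_n I\right)^{-1} \hat{\mathbf{u}}$ can be computed in $O(J^3 + J^2 n + (d_x + d_y) J n)$ time.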

Proof (sketch). Claim 1 for $\hat{\mathbf{u}}$ is straightforward. The expression for $\hat{\Sigma}$ in claim 2 follows directly from the asymptotic covariance expression in Proposition 4. The consistency of $\hat{\Sigma}$ can be obtained by noting that the finite sample bound for $P(\|\hat{\Sigma} - \Sigma\|_F > t)$ decreases as $n$ increases. This is implicitly shown in Appendix F.2.2 and its following sections.
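The matrix expressions of Proposition 6 translate almost line by line into array code. The following is a sketch under our own assumptions (Gaussian kernels with unit bandwidth, three fixed test locations, $\gamma_n = 10^{-5}$, and the helper names `gauss_feat` and `nfsic_stat`), not the released implementation.

```python
import numpy as np

def gauss_feat(locs, data, sigma):
    """J x n matrix with entries exp(-||loc_i - data_j||^2 / (2 sigma^2))."""
    d2 = np.sum((locs[:, None, :] - data[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def nfsic_stat(K, L, gamma=1e-5):
    """n * u^T (Sigma_hat + gamma I)^{-1} u from the J x n matrices K and L."""
    J, n = K.shape
    KL = K * L
    # Claim 1: unbiased estimate of u.
    u = KL.sum(axis=1) / (n - 1) - K.sum(axis=1) * L.sum(axis=1) / (n * (n - 1))
    # Claim 2: Gamma uses row-centred K and L and the biased estimate u^b.
    u_b = KL.mean(axis=1) - K.mean(axis=1) * L.mean(axis=1)
    Kc = K - K.mean(axis=1, keepdims=True)
    Lc = L - L.mean(axis=1, keepdims=True)
    Gamma = Kc * Lc - u_b[:, None]
    Sigma = Gamma @ Gamma.T / n
    return float(n * u @ np.linalg.solve(Sigma + gamma * np.eye(J), u))

rng = np.random.default_rng(0)
n = 500
locs = np.array([[-1.0], [0.0], [1.0]])       # J = 3 fixed test locations
x = rng.normal(size=(n, 1))
y_dep = x + 0.1 * rng.normal(size=(n, 1))     # strongly dependent on x
y_ind = rng.normal(size=(n, 1))               # independent of x
Kx = gauss_feat(locs, x, 1.0)
stat_dep = nfsic_stat(Kx, gauss_feat(locs, y_dep, 1.0))
stat_ind = nfsic_stat(Kx, gauss_feat(locs, y_ind, 1.0))
```

Only $J \times n$ and $J \times J$ arrays are formed, and the linear system is solved rather than inverting $\hat{\Sigma} + \gamma_n I$ explicitly, which is a common numerical choice and not something the paper prescribes.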

Although the dependency of the estimator on $J$ is cubic, we empirically observe that only a small value of $J$ is required (see Section 3). The number of test locations $J$ relates to the number of regions in $\mathcal{X} \times \mathcal{Y}$ of $p_{xy}$ and $p_x p_y$ that differ (see Figure 1).

Theorem 5 asserts the consistency of the test for any test locations $V_J$ drawn from an absolutely continuous distribution. In practice, $V_J$ can be further optimized to increase the test power for a fixed sample size. Our final theoretical result gives a lower bound on the test power of $\widehat{\mathrm{NFSIC}^2}$, i.e., the probability of correctly rejecting $H_0$. We will use this lower bound as the objective function to determine $V_J$ and the kernel parameters. Let $\|\cdot\|_F$ be the Frobenius norm.

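Theorem 7 (A lower bound on the test power). Let $\mathrm{NFSIC}^2(X, Y) := \lambda_n := n \mathbf{u}^\top \Sigma^{-1} \mathbf{u}$. Let $\mathcal{K}$ be a kernel class for $k$, $\mathcal{L}$ be a kernel class for $l$, and $\mathcal{V}$ be a collection with each element being a set of $J$ locations. Assume that

1. There exist finite $B_k$ and $B_l$ such that $\sup_{k \in \mathcal{K}} \sup_{x,x' \in \mathcal{X}} |k(x, x')| \le B_k$ and $\sup_{l \in \mathcal{L}} \sup_{y,y' \in \mathcal{Y}} |l(y, y')| \le B_l$.

2. $\tilde{c} := \sup_{k \in \mathcal{K}} \sup_{l \in \mathcal{L}} \sup_{V_J \in \mathcal{V}} \|\Sigma^{-1}\|_F < \infty$.

Then, for any $k \in \mathcal{K}$, $l \in \mathcal{L}$, $V_J \in \mathcal{V}$, and $\lambda_n \ge r$, the test power satisfies $P\left(\hat{\lambda}_n \ge r\right) \ge L(\lambda_n)$ where

$$
\begin{aligned}
L(\lambda_n) = 1 &- 62 \exp\left(-\xi_1 \gamma_n^2 (\lambda_n - r)^2 / n\right) - 2 \exp\left(-\lfloor 0.5 n \rfloor (\lambda_n - r)^2 / [\xi_2 n^2]\right) \\
&- 2 \exp\left(-\left[(\lambda_n - r)\gamma_n (n - 1)/3 - \xi_3 n - c_3 \gamma_n^2 n (n - 1)\right]^2 / [\xi_4 n^2 (n - 1)]\right),
\end{aligned}
$$

$\lfloor \cdot \rfloor$ is the floor function, $\xi_1 := \frac{1}{3^2 c_1^2 J^2 B^*}$, $B^*$ is a constant depending on only $B_k$ and $B_l$, $\xi_2 := 72 c_2^2 J B^2$, $B := B_k B_l$, $\xi_3 := 8 c_1 B^2 J$, $c_3 := 4 B^2 J \tilde{c}^2$, $\xi_4 := 2^8 B^4 J^2 c_1^2$, $c_1 := 4 B^2 J \sqrt{J} \tilde{c}$, and $c_2 := 4 B \sqrt{J} \tilde{c}$. Moreover, for sufficiently large fixed $n$, $L(\lambda_n)$ is increasing in $\lambda_n$.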

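We provide the proof in Appendix F. To put Theorem 7 into perspective, assume that

$$
\mathcal{K} = \left\{ (x, v) \mapsto \exp\left(-\frac{\|x - v\|^2}{2\sigma_x^2}\right) \,\middle|\, \sigma_x^2 \in [\sigma_{x,l}^2, \sigma_{x,u}^2] \right\} =: \mathcal{K}_g
$$

for some $0 < \sigma_{x,l}^2 < \sigma_{x,u}^2 < \infty$ and

$$
\mathcal{L} = \left\{ (y, w) \mapsto \exp\left(-\frac{\|y - w\|^2}{2\sigma_y^2}\right) \,\middle|\, \sigma_y^2 \in [\sigma_{y,l}^2, \sigma_{y,u}^2] \right\} =: \mathcal{L}_g
$$

for some $0 < \sigma_{y,l}^2 < \sigma_{y,u}^2 < \infty$ are Gaussian kernel classes. Then, in Theorem 7, $B = B_k = B_l = 1$, and $B^* = 2$. The assumption $\tilde{c} < \infty$ is a technical condition to guarantee that the test power lower bound is finite for all $\theta$ defined by the feasible sets $\mathcal{K}$, $\mathcal{L}$, and $\mathcal{V}$. Let $\mathcal{V}_{\epsilon,r} := \left\{ V_J \mid \|v_i\|^2, \|w_i\|^2 \le r \text{ and } \|v_i - v_j\|_2^2 + \|w_i - w_j\|_2^2 \ge \epsilon, \text{ for all } i \neq j \right\}$. If we set $\mathcal{K} = \mathcal{K}_g$, $\mathcal{L} = \mathcal{L}_g$, and $\mathcal{V} = \mathcal{V}_{\epsilon,r}$ for some $\epsilon, r > 0$, then $\tilde{c} < \infty$ as $\mathcal{K}_g$, $\mathcal{L}_g$, and $\mathcal{V}_{\epsilon,r}$ are compact. In practice, these conditions do not necessarily create restrictions as they almost always hold implicitly. We show in Appendix C that the objective function used to choose $V_J$ will discourage any two locations from being in the same neighborhood.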

Parameter Tuning. Let $\theta$ be the collection of all tuning parameters of the test. If $k \in \mathcal{K}_g$ and $l \in \mathcal{L}_g$ (i.e., Gaussian kernels), then $\theta = \{\sigma_x^2, \sigma_y^2, V_J\}$. The test power lower bound $L(\lambda_n)$ in Theorem 7 is a function of $\lambda_n = n \mathbf{u}^\top \Sigma^{-1} \mathbf{u}$, which is the population counterpart of the test statistic $\hat{\lambda}_n$. As in FSIC, it can be shown that $\lambda_n = 0$ if and only if $X$ and $Y$ are independent (from Proposition 2).

According to Theorem 7, for a sufficiently large $n$, the test power lower bound is increasing in $\lambda_n$. One can therefore think of $\lambda_n$ (a function of $\theta$) as representing how easily the test rejects $H_0$ given a problem $P_{xy}$. The higher the $\lambda_n$, the greater the lower bound on the test power, and thus the more likely it is that the test will reject $H_0$ when it is false.

In light of this reasoning, we propose to set $\theta$ by maximizing the lower bound on the test power, i.e., set $\theta$ to $\theta^* = \arg\max_\theta L(\lambda_n)$. Assume that $n$ is sufficiently large so that $\lambda_n \mapsto L(\lambda_n)$ is an increasing function. Then, $\arg\max_\theta L(\lambda_n) = \arg\max_\theta \lambda_n$. That this procedure is also valid under $H_0$ can be seen as follows. Under $H_0$, $\theta^* = \arg\max_\theta 0$ will be arbitrary. Since Theorem 5 guarantees that $\hat{\lambda}_n \xrightarrow{d} \chi^2(J)$ as $n \to \infty$ for any $\theta$, the asymptotic null distribution does not change by using $\theta^*$. In practice, $\lambda_n$ is a population quantity which is unknown. We propose dividing the sample $Z_n$ into two disjoint sets: training and test sets. The training set is used to compute $\hat{\lambda}_n$ (an estimate of $\lambda_n$) to optimize for $\theta^*$, and the test set is used for the actual independence test with the optimized $\theta^*$. The splitting is to guarantee the independence of $\theta^*$ and the test sample to avoid overfitting.
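The split-then-test procedure can be sketched as follows. To keep the example short, the gradient ascent used in the paper is replaced by a crude random search over candidate locations (drawn from the training points) and a small grid of Gaussian widths; the search strategy, the grid, and all function names are assumptions made for illustration. The toy data follow $Y = -X + Z$ with $Z \sim \mathcal{N}(0, 0.3^2)$.

```python
import numpy as np

def gauss_feat(locs, data, sigma):
    """J x n matrix with entries exp(-||loc_i - data_j||^2 / (2 sigma^2))."""
    d2 = np.sum((locs[:, None, :] - data[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def nfsic_stat(K, L, gamma=1e-5):
    """Normalized statistic computed from the J x n feature matrices K and L."""
    J, n = K.shape
    KL = K * L
    u = KL.sum(axis=1) / (n - 1) - K.sum(axis=1) * L.sum(axis=1) / (n * (n - 1))
    u_b = KL.mean(axis=1) - K.mean(axis=1) * L.mean(axis=1)
    Gamma = ((K - K.mean(axis=1, keepdims=True))
             * (L - L.mean(axis=1, keepdims=True)) - u_b[:, None])
    Sigma = Gamma @ Gamma.T / n
    return float(n * u @ np.linalg.solve(Sigma + gamma * np.eye(J), u))

def tune_then_test(x, y, J=2, n_candidates=20, widths=(0.5, 1.0, 2.0), seed=0):
    """Choose theta on one half of the sample, evaluate the statistic on the other."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(x.shape[0])
    tr, te = perm[: len(perm) // 2], perm[len(perm) // 2:]
    best_val, best_theta = -np.inf, None
    for _ in range(n_candidates):
        idx = rng.choice(tr, size=J, replace=False)  # locations from training points
        V, W = x[idx], y[idx]
        for sx in widths:
            for sy in widths:
                val = nfsic_stat(gauss_feat(V, x[tr], sx), gauss_feat(W, y[tr], sy))
                if val > best_val:
                    best_val, best_theta = val, (V, W, sx, sy)
    V, W, sx, sy = best_theta
    # The held-out half is used only once, with theta fixed.
    return nfsic_stat(gauss_feat(V, x[te], sx), gauss_feat(W, y[te], sy)), best_theta

rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 1))
y = -x + 0.3 * rng.normal(size=(1000, 1))
lam_test, theta_star = tune_then_test(x, y)
```

The returned `lam_test` is compared with the $(1 - \alpha)$-quantile of $\chi^2(J)$; since the tuned parameters never see the test half, the asymptotic null distribution is unaffected by the search.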


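[Figure 1. Panels: (a) $\hat{\mu}_{xy}(v, w)$; (b) $\widehat{\mu_x\mu_y}(v, w)$; (c) $\hat{\Sigma}(v, w)$; (d) Statistic $\hat{\lambda}_n(v, w)$.]

Figure 1: Illustration of $\widehat{\mathrm{NFSIC}^2}$.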

To better understand the behaviour of $\widehat{\mathrm{NFSIC}^2}$, we visualize $\hat{\mu}_{xy}(v, w)$, $\widehat{\mu_x\mu_y}(v, w)$ and $\hat{\Sigma}(v, w)$ as a function of one test location $(v, w)$ on a simple toy problem. In this problem, $Y = -X + Z$ where $Z \sim \mathcal{N}(0, 0.3^2)$ is an independent noise variable. As we consider only one location ($J = 1$), $\hat{\Sigma}(v, w)$ is a scalar. The statistic can be written as

$$
\hat{\lambda}_n = n\, \frac{\left(\hat{\mu}_{xy}(v, w) - \widehat{\mu_x\mu_y}(v, w)\right)^2}{\hat{\Sigma}(v, w)}.
$$

These components are shown in Figure 1, where we use Gaussian kernels for both $X$ and $Y$, and the horizontal and vertical axes correspond to $v \in \mathbb{R}$ and $w \in \mathbb{R}$, respectively.

Intuitively, $\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \widehat{\mu_x\mu_y}(v, w)$ captures the difference of the joint distribution and the product of the marginals as a function of $(v, w)$. Squaring $\hat{u}(v, w)$ and dividing it by the variance shown in Figure 1c gives the statistic (also the parameter tuning objective) shown in Figure 1d. The latter figure illustrates that the parameter tuning objective function can be non-convex: non-convexity arises since there are multiple ways to detect the difference between the joint distribution and the product of the marginals. In this case, the lower left and upper right regions equally indicate the largest difference. A convex objective would not be able to capture this phenomenon.

3. Experiments

In this section, we empirically study the performance of the proposed method on both toy (Section 3.1) and real problems (Section 3.2). We are interested in challenging problems requiring a large number of samples, where a quadratic-time test might be computationally infeasible. Our goal is not to outperform a quadratic-time test with a linear-time test uniformly over all testing problems. We will find, however, that our test does outperform the quadratic-time test in some cases. Code is available at https://github.com/wittawatj/fsic-test.

We compare the proposed NFSIC with optimization (NFSIC-opt) to five multivariate nonparametric tests. The $\widehat{\mathrm{NFSIC}^2}$ test without optimization (NFSIC-med) acts as a baseline, allowing the effect of parameter optimization to be clearly seen. For pedagogical reasons, we consider the original HSIC test of Gretton et al. (2005), denoted by QHSIC, which is a quadratic-time test. Nyström HSIC (NyHSIC) uses a Nyström approximation to the kernel matrices of $X$ and $Y$ when computing the HSIC statistic. FHSIC is another variant of HSIC in which a random Fourier feature approximation (Rahimi & Recht, 2008) to the kernel is used. NyHSIC and FHSIC are studied in Zhang et al. (2017) and can be computed in $O(n)$, with quadratic dependency on the number of inducing points in NyHSIC, and quadratic dependency on the number of random features in FHSIC. Finally, the Randomized Dependence Coefficient (RDC) proposed in Lopez-Paz et al. (2013) is also considered. The RDC can be seen as the primal form (with random Fourier features) of the kernel canonical correlation analysis of Bach & Jordan (2002) on copula-transformed data. We consider RDC as a linear-time test even though preprocessing by an empirical copula transform costs $O((d_x + d_y)\, n \log n)$.

We use Gaussian kernel classes $\mathcal{K}_g$ and $\mathcal{L}_g$ for both $X$ and $Y$ in all the methods. Except for NFSIC-opt, all tests use the full sample to conduct the independence test, where the Gaussian widths $\sigma_x$ and $\sigma_y$ are set according to the widely used median heuristic, i.e., $\sigma_x = \mathrm{median}(\{\|x_i - x_j\|^2 \mid 1 \le i < j \le n\})$, and $\sigma_y$ is set in the same way using $\{y_i\}_{i=1}^n$. The $J$ locations for NFSIC-med are randomly drawn from the standard multivariate normal distribution in each trial. For a sample of size $n$, NFSIC-opt uses half the sample for parameter tuning, and the other disjoint half for the test. We permute the sample 300 times in RDC¹ and HSIC to simulate from the null distribution and compute the test threshold. The null distributions for FHSIC and NyHSIC are given by a finite sum of weighted $\chi^2(1)$ random variables given in Eq. 8 of Zhang et al. (2017). Unless stated otherwise, we set the test threshold of the two NFSIC tests to be the $(1 - \alpha)$-quantile of $\chi^2(J)$. To provide a fair comparison, we set $J = 10$, use 10 inducing points in NyHSIC, and 10 random Fourier features in FHSIC and RDC.

Optimization of NFSIC-opt. The parameters of NFSIC-opt are $\sigma_x$, $\sigma_y$, and $J$ locations of size $(d_x + d_y)J$. We treat all the parameters as a long vector in $\mathbb{R}^{2 + (d_x + d_y)J}$ and use gradient ascent to optimize $\hat{\lambda}_{n/2}$. We observe that initializing $V_J$ by randomly picking $J$ points from the training sample yields good performance. The regularization parameter $\gamma_n$ in NFSIC is fixed to a small value, and is not optimized. It is worth emphasizing that the complexity of the optimization procedure is still linear-time.²

¹ We use a permutation test for RDC, following the authors' implementation (https://github.com/lopezpaz/randomized_dependence_coefficient, referred commit: b0ac6c0).

² Our claim on linear runtime (with respect to $n$) is for the gradient ascent procedure to find a local optimum for $\theta$. We do not claim a linear runtime to find a global optimum.


[Figure 2. Panels: (a) SG ($\alpha = 0.05$): time (s) versus $d_x$ and $d_y$; (b) SG ($\alpha = 0.05$): type-I error versus $d_x$ and $d_y$; (c) Sin: test power versus $\omega$ in $1 + \sin(\omega x)\sin(\omega y)$; (d) GSign: test power versus $d_x$. Legend: NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]

Figure 2: (a): Runtime. (b): Probability of rejecting $H_0$ as problem parameters vary. Fix $n = 4000$.

Since FHSIC, NyHSIC and RDC rely on a finite-dimensional kernel approximation, these tests are consistent only if the number of features increases with $n$. By contrast, the proposed NFSIC requires only $n$ to go to infinity to achieve consistency, i.e., $J$ can be fixed. We refer the reader to Appendix C for a brief investigation of the test power vs. increasing $J$. The test power does not necessarily monotonically increase with $J$.

3.1. Toy Problems

We consider three toy problems.

1. Same Gaussian (SG).The two variables are indepen- dently drawn from the standard multivariate normal distribution i.e.,X ∼ N(0,Idx)andY ∼ N(0,Idy)whereId

is thed×didentity matrix. This problem represents a case in whichH0holds.

2. Sinusoid (Sin).Letpxybe the probability density ofPxy. In the Sinusoid problem, the dependency ofXandY is char- acterized by(X, Y)∼ pxy(x, y)∝ 1 + sin(ωx) sin(ωy), where the domains of X,Y = (−π, π)andω is the frequency of the sinusoid. As the frequencyωincreases, the drawn sample becomes more similar to a sample drawn fromUniform((−π, π)²). That is, the higherω, the harder to detect the dependency betweenXandY. This problem was studied inSejdinovic et al.(2013). Plots of the density for a few values ofωare shown in Figures6and7in the appendix. The main characteristic of interest in this problem is the local change in the density function.

3. Gaussian Sign (GSign). In this problem, $Y = |Z| \prod_{i=1}^{d_x} \mathrm{sgn}(X_i)$, where $X \sim \mathcal{N}(0, I_{d_x})$, $\mathrm{sgn}(\cdot)$ is the sign function, and $Z \sim \mathcal{N}(0, 1)$ serves as a source of noise. The full interaction of $X = (X_1, \ldots, X_{d_x})$ is what makes the problem challenging. That is, $Y$ is dependent on $X$, yet it is independent of any proper subset of $\{X_1, \ldots, X_{d_x}\}$. Thus, simultaneous consideration of all the coordinates of $X$ is required to successfully detect the dependency.
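The three toy problems can be simulated in a few lines. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the use of rejection sampling for the Sinusoid density is our assumption (it relies only on the unnormalized density $1 + \sin(\omega x)\sin(\omega y)$ being bounded by 2).

```python
import numpy as np

def sample_sg(n, dx, dy, rng):
    """Same Gaussian: X and Y are independent standard normals (H0 holds)."""
    return rng.standard_normal((n, dx)), rng.standard_normal((n, dy))

def sample_sin(n, omega, rng):
    """Sinusoid: density proportional to 1 + sin(omega x) sin(omega y).

    Rejection sampling (an assumption of this sketch) with a uniform
    proposal on (-pi, pi)^2; a proposal is kept with probability density / 2.
    """
    out = np.empty((0, 2))
    while out.shape[0] < n:
        prop = rng.uniform(-np.pi, np.pi, size=(2 * n, 2))
        dens = 1.0 + np.sin(omega * prop[:, 0]) * np.sin(omega * prop[:, 1])
        keep = rng.uniform(0.0, 2.0, size=2 * n) < dens
        out = np.vstack([out, prop[keep]])
    return out[:n, :1], out[:n, 1:]

def sample_gsign(n, dx, rng):
    """Gaussian Sign: Y = |Z| * prod_i sgn(X_i), with Z ~ N(0, 1)."""
    x = rng.standard_normal((n, dx))
    z = rng.standard_normal(n)
    y = np.abs(z) * np.prod(np.sign(x), axis=1)
    return x, y[:, None]

rng = np.random.default_rng(0)
x_sg, y_sg = sample_sg(4000, 50, 50, rng)
x_sin, y_sin = sample_sin(4000, 4.0, rng)
x_gs, y_gs = sample_gsign(4000, 4, rng)
```

In GSign, flipping the sign of any single coordinate of $X$ flips the sign of $Y$, which is why no proper subset of the coordinates carries information about $Y$.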

We fix $n = 4000$ and vary the problem parameters. Each problem is repeated for 300 trials, and the sample is redrawn each time. The significance level $\alpha$ is set to 0.05. The results are shown in Figure 2. It can be seen that in the SG problem (Figure 2b) where $H_0$ holds, all the tests achieve roughly correct type-I errors at $\alpha = 0.05$. In particular, we point out that NFSIC-opt’s rejection rate is well controlled as the sample used for testing and the sample used for parameter tuning are independent. The rejection rate would have been much higher had we done the optimization and testing on the same sample (i.e., overfitting). In the Sin problem, NFSIC-opt achieves high test power for all considered $\omega = 1, \ldots, 6$, highlighting its strength in detecting local changes in the joint density. The performance of NFSIC-med is significantly lower than that of NFSIC-opt. This phenomenon clearly emphasizes the importance of the optimization to place the locations at the relevant regions in $\mathcal{X} \times \mathcal{Y}$. RDC has a remarkably high performance in both Sin and GSign (Figures 2c and 2d) despite no parameter tuning. The ability of NFSIC-opt to simultaneously consider interacting features is indicated by its superior test power in GSign, especially at the challenging settings of $d_x = 5, 6$.

NFSIC vs. QHSIC. We observe that NFSIC-opt outperforms the quadratic-time QHSIC in these two problems. QHSIC is defined as the RKHS norm of the witness function $u$ (see (2)). Intuitively, one can think of the RKHS norm as taking into account all the locations $(v, w)$. By contrast, the proposed NFSIC evaluates the witness function at $J$ locations. If the differences in $p_{xy}$ and $p_x p_y$ are local (e.g., Sin problem), or there are interacting features (e.g., GSign problem), then only small regions in the space of $(X, Y)$ are relevant in detecting the difference of $p_{xy}$ and $p_x p_y$. In these cases, pinpointing exact test locations by the optimization of NFSIC performs well. On the other hand, taking into account all possible test locations as done implicitly in QHSIC also integrates over regions where the difference between $p_{xy}$ and $p_x p_y$ is small, resulting in a weaker indication of dependence. Whether QHSIC is better than NFSIC depends heavily on the problem, and there is no one best answer. If the difference between $p_{xy}$ and $p_x p_y$ is large only in localized regions, then the proposed linear-time statistic has an advantage. If the difference is spatially diffuse, then QHSIC has an advantage. No existing work has proposed a procedure to optimally tune kernel parameters for QHSIC; by contrast, NFSIC has a clearly defined objective for parameter tuning.
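To make "evaluating the witness function at $J$ locations" concrete, the following is a minimal sketch of a statistic of the form $n \hat{u}^\top (\hat{\Sigma} + \gamma_n I)^{-1} \hat{u}$. Several choices here are assumptions of the sketch rather than the paper's definitions: the Gaussian kernel parameterization $\exp(-\|a - b\|^2 / (2\sigma^2))$, the simple plug-in estimators used for $\hat{u}$ and $\hat{\Sigma}$ (normalizations may differ from those defined earlier in the paper), the hand-picked test locations, and all function names.

```python
import numpy as np

def gauss_kernel(a, b, width):
    """Gaussian kernel matrix between rows of a (n x d) and rows of b (J x d)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * width ** 2))

def nfsic_statistic(x, y, v, w, wx=1.0, wy=1.0, reg=1e-5):
    """n * u^T (Sigma + reg * I)^{-1} u at J test locations (v_j, w_j)."""
    n = x.shape[0]
    k = gauss_kernel(x, v, wx)             # n x J: k(x_i, v_j)
    l = gauss_kernel(y, w, wy)             # n x J: l(y_i, w_j)
    kc = k - k.mean(axis=0)                # center each feature over the sample
    lc = l - l.mean(axis=0)
    prod = kc * lc                         # per-sample contributions
    u = prod.sum(axis=0) / (n - 1)         # empirical covariance per location
    centered = prod - prod.mean(axis=0)
    sigma = centered.T @ centered / n      # J x J covariance of contributions
    j = v.shape[0]
    return n * u @ np.linalg.solve(sigma + reg * np.eye(j), u)

rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal((n, 1))
y_dep = x + 0.1 * rng.standard_normal((n, 1))   # strongly dependent on x
y_ind = rng.standard_normal((n, 1))             # independent of x
v = w = np.array([[-1.0], [0.0], [1.0]])        # J = 3 hand-picked locations
stat_dep = nfsic_statistic(x, y_dep, v, w)
stat_ind = nfsic_statistic(x, y_ind, v, w)
print(stat_dep, stat_ind)
```

Every operation is a sum over the $n$ sample points of $J$-dimensional quantities, so the cost is linear in $n$; only the final $J \times J$ solve depends on $J$ alone.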

[Figure 3 panels: (a) SG, $d_x = d_y = 250$: time (s) vs. sample size $n$. (b) SG, $d_x = d_y = 250$: type-I error vs. sample size $n$. (c) Sin, $\omega = 4$: test power vs. sample size $n$. (d) GSign, $d_x = 4$: test power vs. sample size $n$. Legend: NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]

Figure 3: (a) Runtime. (b): Probability of rejecting $H_0$ as $n$ increases in the toy problems.

To investigate the sample efficiency of all the tests, we fix $d_x = d_y = 250$ in SG, $\omega = 4$ in Sin, $d_x = 4$ in GSign, and increase $n$. Figure 3 shows the results. The quadratic dependency on $n$ in QHSIC makes it infeasible both in terms of memory and runtime to consider $n$ larger than 6000 (Figure 3a). By contrast, although not the most time-efficient, NFSIC-opt has the highest sample efficiency for GSign, and for Sin in the low-sample regime, significantly outperforming QHSIC. Despite the small additional overhead from the optimization, we are still able to conduct an accurate test with $n = 10^5$, $d_x = d_y = 250$ in less than 100 seconds. We observe in Figure 3b that the two NFSIC variants have correct type-I errors across all sample sizes. We recall from Theorem 5 that the NFSIC test with random test locations will asymptotically reject $H_0$ if it is false. A demonstration of this property is given in Figure 3c, where the test power of NFSIC-med eventually reaches 1 with $n$ higher than $10^5$.

3.2. Real Problems

We now examine the performance of our proposed test on real problems.

Million Song Data (MSD) We consider a subset of the Million Song Data³ (Bertin-Mahieux et al., 2011), in which each song ($X$) out of 515,345 is represented by 90 features, of which 12 features are timbre average (over all segments) of the song, and 78 features are timbre covariance. Most of the songs are western commercial tracks from 1922 to 2011. The goal is to detect the dependency between each song and its year of release ($Y$). We set $\alpha = 0.01$, and repeat for 300 trials where the full sample is randomly subsampled to $n$ points in each trial. Other settings are the same as in the toy problems. To make sure that the type-I error is correct, we use the permutation approach in the NFSIC tests to compute the threshold. Figure 4a shows the test powers as $n$ increases from 500 to 2000. To simulate the case where $H_0$ holds in the problem, we permute the sample to break the dependency of $X$ and $Y$. The results are shown in Figure 5 in the appendix.

³ Million Song Data subset: https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD.

[Figure 4 panels: (a) MSD problem: test power vs. sample size $n$ (500 to 2000). (b) Videos & Captions problem: test power vs. sample size $n$ (2000 to 8000). Legend: NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]

Figure 4: Probability of rejecting $H_0$ as $n$ increases in the two real problems. $\alpha = 0.01$.

Evidently, NFSIC-opt has the highest test power among all the linear-time tests for all the sample sizes. Its test power is second only to QHSIC. We recall that NFSIC-opt uses half of the sample for parameter tuning. Thus, at $n = 500$, the actual sample for testing is 250, which is relatively small. The fact that there is a vast power gain from 0.4 (NFSIC-med) to 0.8 (NFSIC-opt) at $n = 500$ suggests that the optimization procedure can perform well even at small sample sizes.
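The permutation approach used for the MSD problem can be sketched generically as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `permutation_threshold` is a hypothetical helper, the number of permutations is arbitrary, and `corr_statistic` is a simple stand-in for the NFSIC statistic chosen only to keep the example self-contained.

```python
import numpy as np

def permutation_threshold(x, y, statistic, alpha=0.01, n_perm=200, rng=None):
    """(1 - alpha) quantile of the statistic under random re-pairings of y.

    Shuffling the rows of y breaks any dependency between x and y, so the
    recomputed statistics behave like draws from the null distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(y.shape[0])
        null_stats[b] = statistic(x, y[perm])
    return np.quantile(null_stats, 1.0 - alpha)

def corr_statistic(x, y):
    """Hypothetical stand-in statistic: n times the squared sample correlation."""
    return x.shape[0] * np.corrcoef(x[:, 0], y[:, 0])[0, 1] ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 1))
y = x + 0.5 * rng.standard_normal((200, 1))     # dependent pair
threshold = permutation_threshold(x, y, corr_statistic, alpha=0.01, rng=rng)
reject = corr_statistic(x, y) > threshold
print(threshold, reject)
```

The same shuffling step, applied once to the data rather than inside the threshold computation, is how a sample on which $H_0$ holds can be simulated from real data.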

Videos and Captions Our last problem is based on the VideoStory46K⁴ dataset (Habibian et al., 2014). The dataset contains 45,826 YouTube videos ($X$) of an average length of roughly one minute, and their corresponding text captions ($Y$) uploaded by the users. Each video is represented as a $d_x = 2000$ dimensional Fisher vector encoding of motion boundary histograms (MBH) descriptors of Wang & Schmid (2013). Each caption is represented as a bag of words with each feature being the frequency of one word. After keeping only words which occur in at least six video captions, we obtain $d_y = 1878$ words. We examine the test powers as $n$ increases from 2000 to 8000. The results are given in Figure 4b. The problem is sufficiently challenging that all linear-time tests achieve a low power at $n = 2000$. QHSIC performs exceptionally well on this problem, achieving maximum power throughout. NFSIC-opt has the highest sample efficiency among the linear-time tests, showing that the optimization procedure is also practical in a high-dimensional setting.

⁴ VideoStory46K dataset: https://ivi.fnwi.uva.nl/isis/mediamill/datasets/videostory.php.

Acknowledgement

We thank the Gatsby Charitable Foundation for the financial support. The major part of this work was carried out while Zoltán Szabó was a research associate at the Gatsby Computational Neuroscience Unit, University College London.

References

Anderson, Theodore W. An Introduction to Multivariate Statistical Analysis. Wiley, 2003.

Bach, Francis R. and Jordan, Michael I. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

Bertin-Mahieux, Thierry, Ellis, Daniel P.W., Whitman, Brian, and Lamere, Paul. The million song dataset. In International Conference on Music Information Retrieval (ISMIR), 2011.

Chwialkowski, Kacper P., Ramdas, Aaditya, Sejdinovic, Dino, and Gretton, Arthur. Fast Two-Sample Testing with Analytic Representations of Probability Measures. In Advances in Neural Information Processing Systems (NIPS), pp. 1981–1989, 2015.

Dauxois, Jacques and Nkiet, Guy Martial. Nonlinear canonical analysis and independence tests. The Annals of Statistics, 26(4):1254–1278, 1998.

Feuerverger, Andrey. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.

Fukumizu, Kenji, Gretton, Arthur, Sun, Xiaohai, and Schölkopf, Bernhard. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems (NIPS), pp. 489–496, 2008.

Gretton, Arthur and Györfi, László. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.

Gretton, Arthur, Bousquet, Olivier, Smola, Alex, and Schölkopf, Bernhard. Measuring Statistical Dependence with Hilbert-Schmidt Norms. In Algorithmic Learning Theory (ALT), pp. 63–77, 2005.

Gretton, Arthur, Fukumizu, Kenji, Teo, Choon H., Song, Le, Schölkopf, Bernhard, and Smola, Alex J. A Kernel Statistical Test of Independence. In Advances in Neural Information Processing Systems (NIPS), pp. 585–592, 2008.

Habibian, Amirhossein, Mensink, Thomas, and Snoek, Cees GM. Videostory: A new multimedia embedding for few-example recognition and translation of events. In ACM International Conference on Multimedia, pp. 17–26, 2014.

Heller, Ruth, Heller, Yair, Kaufman, Shachar, Brill, Barak, and Gorfine, Malka. Consistent distribution-free k-sample and independence tests for univariate random variables. Journal of Machine Learning Research, 17(29):1–54, 2016.

Huo, Xiaoming and Székely, Gábor J. Fast computing for distance covariance. Technometrics, 58(4):435–447, 2016.

Jitkrittum, Wittawat, Szabó, Zoltán, Chwialkowski, Kacper, and Gretton, Arthur. Interpretable Distribution Features with Maximum Testing Power. In Advances in Neural Information Processing Systems (NIPS), pp. 181–189, 2016.

Kowalski, Jeanne and Tu, Xin M. Modern Applied U-Statistics. John Wiley & Sons, 2008.

Lehmann, Eric L. Elements of Large-Sample Theory. Springer Science & Business Media, 1999.

Lopez-Paz, David, Hennig, Philipp, and Schölkopf, Bernhard. The Randomized Dependence Coefficient. In Advances in Neural Information Processing Systems (NIPS), pp. 1–9, 2013.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pp. 1177–1184, 2008.

Sejdinovic, Dino, Sriperumbudur, Bharath, Gretton, Arthur, and Fukumizu, Kenji. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.

Serfling, Robert J. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 2009.

Smola, Alex, Gretton, Arthur, Song, Le, and Schölkopf, Bernhard. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory (ALT), pp. 13–31, 2007.

Sriperumbudur, Bharath K., Gretton, Arthur, Fukumizu, Kenji, Schölkopf, Bernhard, and Lanckriet, Gert R. G. Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

Steinwart, Ingo and Christmann, Andreas. Support vector machines. Springer Science & Business Media, 2008.

Székely, Gábor J. and Rizzo, Maria L. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265, 2009.

Székely, Gábor J., Rizzo, Maria L., and Bakirov, Nail K. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.

van der Vaart, Aad. Asymptotic Statistics. Cambridge University Press, 2000.

Wang, Heng and Schmid, Cordelia. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision (ICCV), pp. 3551–3558, 2013.

Zhang, Kun, Peters, Jonas, Janzing, Dominik, and Schölkopf, Bernhard. Kernel-based conditional independence test and application in causal discovery. In Conference on Uncertainty in Artificial Intelligence (UAI), pp. 804–813, 2011.

Zhang, Qinyi, Filippi, Sarah, Gretton, Arthur, and Sejdinovic, Dino. Large-Scale Kernel Methods for Independence Testing. Statistics and Computing, pp. 1–18, 2017.


Supplementary Material

A. Type-I Errors

In this section, we show that all the tests have correct type-I errors (i.e., the probability of rejecting $H_0$ when it is true) in real problems. We permute the joint sample so that the dependency is broken to simulate cases in which $H_0$ holds. The results are shown in Figure 5.

[Figure 5 panels: (a) MSD problem (permuted sample): type-I error vs. sample size $n$ (500 to 2000). (b) Videos & Captions problem (permuted sample): type-I error vs. sample size $n$ (2000 to 8000). Legend: NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, RDC.]

Figure 5: Probability of rejecting $H_0$ as $n$ increases. $\alpha = 0.01$.

B. Redundant Test Locations

Here, we provide a simple illustration to show that two locations $t_1 = (v_1, w_1)$ and $t_2 = (v_2, w_2)$ which are too close to each other will reduce the optimization objective. We consider the Sinusoid problem described in Section 3.1 with $\omega = 1$, and use $J = 2$ test locations. In Figure 6, $t_1$ is fixed at the red star, while $t_2$ is varied along the horizontal line. The objective value $\hat{\lambda}_n$ as a function of $t_2$ is shown in the bottom figure. It can be seen that $\hat{\lambda}_n$ decreases sharply when $t_2$ is in the neighborhood of $t_1$. This property implies that two locations which are too close will not maximize the objective function (i.e., the second feature contains no additional information when it matches the first). For $J > 2$, the objective sharply decreases if any two locations are in the same neighborhood.

[Figure 6 panels: top, the Sinusoid density ($\omega = 1$) over $(x, y)$; bottom, the objective $\hat{\lambda}_n(t_1, t_2)$ as a function of $t_2$.]

Figure 6: Plot of optimization objective values as location $t_2$ moves along the green line. The objective sharply drops when the two locations are in the same neighborhood.

C. Test Power vs. J

It might seem intuitive that as the number of locations $J$ increases, the test power should also increase. Here, we empirically show that this statement is not always true. Consider the Sinusoid toy example described in Section 3.1 with $\omega = 2$ (also see the left figure of Figure 7). By construction, $X$ and $Y$ are dependent in this problem. We run the NFSIC test with a sample size of $n = 800$, varying $J$ from 1 to 600. For each value of $J$, the test is repeated 500 times. In each trial, the sample is redrawn and the $J$ test locations are drawn from $\mathrm{Uniform}((-\pi, \pi)^2)$. There is no optimization of the test locations. We use Gaussian kernels for both $X$ and $Y$, and use the median heuristic to set the Gaussian widths to 1.8. Figure 7 shows the test power as $J$ increases.
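For reference, a common form of the median heuristic sets the Gaussian width to the median pairwise Euclidean distance in the sample. The sketch below assumes this form (the paper may use a variant, e.g., one based on squared distances), and `median_heuristic` with its subsampling cap is a hypothetical helper.

```python
import numpy as np

def median_heuristic(x, max_points=1000, rng=None):
    """Median of pairwise Euclidean distances between rows of x."""
    rng = np.random.default_rng() if rng is None else rng
    if x.shape[0] > max_points:            # subsample to bound the O(n^2) cost
        x = x[rng.choice(x.shape[0], max_points, replace=False)]
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    iu = np.triu_indices(x.shape[0], k=1)  # each distinct pair once
    return np.median(dist[iu])

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=(800, 1))   # uniform marginal on (-pi, pi)
width = median_heuristic(x, rng=rng)
print(width)
```

Under this form, a marginal that is uniform on $(-\pi, \pi)$ has a population median pairwise distance of $2\pi(1 - 1/\sqrt{2}) \approx 1.84$, which is in line with the width of 1.8 quoted above; the marginals of the Sinusoid density are uniform because $\sin$ integrates to zero over $(-\pi, \pi)$ for integer $\omega$.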

[Figure 7 panels: left, the Sinusoid density ($\omega = 2$) over $(x, y)$; right, test power vs. the number of test locations $J$.]

Figure 7: The Sinusoid problem and the plot of test power vs. the number of test locations.

We observe that the test power does not monotonically increase as $J$ increases. When $J = 1$, the difference of $p_{xy}$ and $p_x p_y$ cannot be adequately captured, resulting in a low power. The power increases rapidly to roughly 0.6 at $J = 10$, and stays at 1 until about $J = 100$. Then, the power starts to drop sharply when $J$ is higher than 400 in this problem.

Unlike random Fourier features, the number of test locations in NFSIC is not the number of Monte Carlo particles used to approximate an expectation. There is a tradeoff: if the test locations are in key regions (i.e., regions in which there is a big difference between $p_{xy}$ and $p_x p_y$), then they increase power; yet the statistic gains in variance (thus reducing test power) as $J$ increases. As can be seen in Figure 7, there are eight key regions (in blue) that can reveal the difference of $p_{xy}$ and $p_x p_y$. Using an unnecessarily high $J$ not only makes the covariance matrix $\hat{\Sigma}$ harder to estimate accurately, it also increases the computation, as the complexity in $J$ is $O(J^3)$.

We note that NFSIC is not intended to be used with a large $J$. In practice, it should be set to be large enough so as to capture the key regions as stated. As a practical guide, with optimization of the test locations, a good starting point is $J = 5$ or $10$.

D. Proof of Proposition 3

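Recall Proposition 3,

Proposition (A product of Gaussian kernels is characteristic and analytic). Let $k(\mathbf{x}, \mathbf{x}') = \exp\left(-(\mathbf{x} - \mathbf{x}')^\top \mathbf{A} (\mathbf{x} - \mathbf{x}')\right)$ and $l(\mathbf{y}, \mathbf{y}') = \exp\left(-(\mathbf{y} - \mathbf{y}')^\top \mathbf{B} (\mathbf{y} - \mathbf{y}')\right)$ be Gaussian kernels on $\mathbb{R}^{d_x} \times \mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y} \times \mathbb{R}^{d_y}$ respectively, for positive definite matrices $\mathbf{A}$ and $\mathbf{B}$. Then, $g((\mathbf{x}, \mathbf{y}), (\mathbf{x}', \mathbf{y}')) = k(\mathbf{x}, \mathbf{x}') l(\mathbf{y}, \mathbf{y}')$ is characteristic and analytic on $(\mathbb{R}^{d_x} \times \mathbb{R}^{d_y}) \times (\mathbb{R}^{d_x} \times \mathbb{R}^{d_y})$.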

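Proof. Let $\mathbf{z} := (\mathbf{x}^\top, \mathbf{y}^\top)^\top$ and $\mathbf{z}' := (\mathbf{x}'^\top, \mathbf{y}'^\top)^\top$ be vectors in $\mathbb{R}^{d_x + d_y}$. We prove by reducing the product kernel to one Gaussian kernel with $g(\mathbf{z}, \mathbf{z}') = \exp\left(-(\mathbf{z} - \mathbf{z}')^\top \mathbf{C} (\mathbf{z} - \mathbf{z}')\right)$ where $\mathbf{C} := \begin{pmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0} & \mathbf{B} \end{pmatrix}$. Write $g(\mathbf{z}, \mathbf{z}') = \Psi(\mathbf{z} - \mathbf{z}')$ where $\Psi(\mathbf{t}) := \exp\left(-\mathbf{t}^\top \mathbf{C} \mathbf{t}\right)$. Since $\mathbf{C}$ is positive definite, we see that the finite measure $\zeta$ corresponding to $\Psi$ as defined in Lemma 12 has support everywhere in $\mathbb{R}^{d_x + d_y}$. Thus, Sriperumbudur et al. (2010, Theorem 9) implies that $g$ is characteristic.

To see that $g$ is analytic, we observe that for each $\mathbf{z}' \in \mathbb{R}^{d_x + d_y}$, $\mathbf{z} \mapsto -(\mathbf{z} - \mathbf{z}')^\top \mathbf{C} (\mathbf{z} - \mathbf{z}')$ is a multivariate polynomial in $\mathbf{z}$, which is known to be analytic. Using the fact that $t \mapsto \exp(t)$ is analytic on $\mathbb{R}$, and that a composition of analytic functions is analytic, we see that $\mathbf{z} \mapsto \exp\left(-(\mathbf{z} - \mathbf{z}')^\top \mathbf{C} (\mathbf{z} - \mathbf{z}')\right)$ is analytic on $\mathbb{R}^{d_x + d_y}$ for each $\mathbf{z}'$.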
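For completeness, the reduction step used in the proof can be written out explicitly. The display below only expands the definitions of $k$, $l$, $\mathbf{z}$, $\mathbf{z}'$ and $\mathbf{C}$ given above; no new assumption is introduced.

```latex
\begin{align*}
k(\mathbf{x}, \mathbf{x}')\, l(\mathbf{y}, \mathbf{y}')
  &= \exp\!\left(-(\mathbf{x}-\mathbf{x}')^\top \mathbf{A} (\mathbf{x}-\mathbf{x}')\right)
     \exp\!\left(-(\mathbf{y}-\mathbf{y}')^\top \mathbf{B} (\mathbf{y}-\mathbf{y}')\right) \\
  &= \exp\!\left(-\left[(\mathbf{x}-\mathbf{x}')^\top \mathbf{A} (\mathbf{x}-\mathbf{x}')
     + (\mathbf{y}-\mathbf{y}')^\top \mathbf{B} (\mathbf{y}-\mathbf{y}')\right]\right) \\
  &= \exp\!\left(-
     \begin{pmatrix} \mathbf{x}-\mathbf{x}' \\ \mathbf{y}-\mathbf{y}' \end{pmatrix}^{\!\top}
     \begin{pmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0} & \mathbf{B} \end{pmatrix}
     \begin{pmatrix} \mathbf{x}-\mathbf{x}' \\ \mathbf{y}-\mathbf{y}' \end{pmatrix}\right)
   = \exp\!\left(-(\mathbf{z}-\mathbf{z}')^\top \mathbf{C} (\mathbf{z}-\mathbf{z}')\right).
\end{align*}
```

The block-diagonal matrix $\mathbf{C}$ is positive definite because, for any nonzero $\mathbf{t} = (\mathbf{t}_x^\top, \mathbf{t}_y^\top)^\top$, we have $\mathbf{t}^\top \mathbf{C} \mathbf{t} = \mathbf{t}_x^\top \mathbf{A} \mathbf{t}_x + \mathbf{t}_y^\top \mathbf{B} \mathbf{t}_y > 0$, since at least one of $\mathbf{t}_x$, $\mathbf{t}_y$ is nonzero and both $\mathbf{A}$ and $\mathbf{B}$ are positive definite.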

E. Proof of Theorem 5

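Recall Theorem 5,

Theorem 5 (Independence test based on $\widehat{\mathrm{NFSIC}^2}$ is consistent). Let $\hat{\Sigma}$ be a consistent estimate of $\Sigma$ based on the joint sample $\mathsf{Z}_n$, where $\Sigma$ is defined in Proposition 4. Assume that $V_J = \{(\mathbf{v}_i, \mathbf{w}_i)\}_{i=1}^{J} \sim \eta$ where $\eta$ is absolutely continuous wrt the Lebesgue measure. The $\widehat{\mathrm{NFSIC}^2}$ statistic is defined as $\hat{\lambda}_n := n \hat{\mathbf{u}}^\top \left(\hat{\Sigma} + \gamma_n \mathbf{I}\right)^{-1} \hat{\mathbf{u}}$ where $\gamma_n \geq 0$ is a regularization parameter. Assume that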

1. Assumption A holds.

2. $\Sigma$ is invertible $\eta$-almost surely.