Distribution testing : classical and new paradigms

(1)

Distribution Testing: Classical and New Paradigms

by

Maryam Aliakbarpour

B.S., Sharif University of Technology (2013)

S.M., Massachusetts Institute of Technology (2015)

Submitted to the Department of Electrical Engineering and Computer

Science

in partial fulﬁllment of the requirements for the degree of

Doctor of Philosophy in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2020

c

Massachusetts Institute of Technology 2020. All rights reserved.

Author . . . .

Department of Electrical Engineering and Computer Science

August 24, 2020

Certiﬁed by . . . .

Ronitt Rubinfeld

Professor of Electrical Engineering and Computer Science

Thesis Supervisor

Accepted by . . . .

Leslie A. Kolodziejski

Professor of Electrical Engineering and Computer Science

Chair, Department Committee on Graduate Students

(2)

(3)

Distribution Testing: Classical and New Paradigms

by

Maryam Aliakbarpour

Submitted to the Department of Electrical Engineering and Computer Science on August 24, 2020, in partial fulﬁllment of the

requirements for the degree of

Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract

Hypothesis testing is a fundamental topic in statistics. To put it simply, hypothesis testing is a framework to examine whether a hypothesized model is in line with the observed data. Hypothesis testing has been widely used in experimental research in a variety of ﬁelds, such as biology, medical science, and social sciences. Despite a century of constant use, there is still a lot left to be done for the evolving needs in the practical world. Some of the high-priority challenges we face are preserving privacy, working with high-dimensional distributions, handling noisy data, and dealing with data that is gathered from multiple sources. In this thesis, we focus on basic statistical problems in a more recently considered setting of hypothesis testing, referred to as property testing, in which we aim to address the challenges mentioned above. In particular, we have the following contributions:

1. We investigate the problem of testing whether a distribution has the shape property that it is monotone according to some (partial) order of the domain elements, or it is far from being such a distribution. Among other results, our main contribution is that testing monotonicity over a high dimensional domain, a boolean hypercube needs almost linearly many samples in terms of the domain size.

2. We consider well-studied identity and closeness testing problems in a new mixture based noise model. We provide testers with optimal sample complexity for these problems under various scenarios that diﬀer in terms of how the tester can access the distribution, or what knowledge about the noise is available to the tester.

3. We developed diﬀerentially private testers for several fundamental problems in test-ing, such as testing uniformity, identity, closeness, and independence. The conceptual message of our work is there exist private hypothesis testers that are nearly as sample-eﬃcient as their non-private counterparts.

4. We consider a new model in distribution testing for multiple data sources when only a few samples are available from each source. This assumption is in contrast to the common distribution testing model, which views the data as i.i.d. samples from a single distribution. We generalized uniformity, identity, and closeness testing problems to this setting, and developed sample-optimal tester for these problems.

Thesis Supervisor: Ronitt Rubinfeld

(4)

(5)

Acknowledgments

My Ph.D. journey at MIT was a fantastic experience. I was fortunate to meet a lot of supportive people who have helped me throughout my journey.

Firs and foremost, I would like to my advisor, Prof. Ronitt Rubinfeld. I have learned from her the basics of research: approaching problems, picking interesting problems, writing a paper, etc. However, her role in my career is deﬁnitely beyond that. She has been very supportive of me and has encouraged me through every step of my Ph.D. She has motivated me to move forward and not to give up on myself. No matter how I felt when I entered her oﬃce (or the zoom link), our meeting would be the highlight of my day. I will be very thrilled if I do half of what she did for my future students. Ronitt, I was very honored to be your student. Thank you!

I want to thank the members of my thesis committee: Prof. Costis Daskalakis, Prof. Ilias Diakonikolas, and Prof. Piotr Indyk. I have learned from every one of them so much. All of them have given me valuable career advice, which helped me pursue my next steps in academia. I want to thank them individually, as well. I am very thrilled to have the opportunity to work with Costis. He has a unique and persistent approach to tackle problems (which I hope to learn). He is always a friendly, encouraging, energetic collaborator, and an excellent lecturer. Thank you, Costis! Ilias mentored me when I was a junior Ph.D. student. I have learned a lot from him, and also from his seminal work in distribution testing. His honest feedback has helped me to grow into the research who I am right now. At least three sections of this thesis would not be there if it were not for Ilias or his work. Thank you, Ilias! Piotr oﬀers help even before one asks him. He has wholeheartedly supported the students. He always gives thoughtful advice. He is a fantastic lecturer, and I had learned a lot from him when I was taking his class and TAing for him. Thanks, Piotr!

I would like to thank all of my past and current collaborators: Amartya Shankha Biswas, Eric Blais, Clement Canonne, Ilias Diakonikolas, Themistoklis Gouleakis, Stefanie Jegelka, Daniel Kane, Ravi Kumar, Stephen Macke, Aditya Parameswaran, John Peebles, Sajjadur Rahman, Kavya Ravichandran, Ronitt Rubinfeld, Sandeep Silwal, Anak Yodpinyanee, Mano-lis Zampetakis.

I want to thank Dr. Ravi Kumar especially. Working with Ravi was a turning point in my Ph.D., and I am so glad that I have had the chance. He was a true mentor for me, and I learned from him a lot. Thanks, Ravi!

I want to thank my friends in the theory group and at MIT: Madalina Persu, Adam Seal-fon, Themistoklis Gouleakis, Anak Yodpinyanee, Amartya Shankha Biswas, Talya Eden, Slobodan Mitrovic, Arsen Vasilyan, Saeed Mehraban, Quanquan Liu, William Leiserson, Govind Ramnarayan, Siddhartha Jayanti, Prashant Vasudevan, Pritish Kamath, Daniel Grier, Luke Schaeﬀer, Aikaterini Sotiraki, Nicole Wein, Alireza Fallah, Mehrdad Khani, Amir Tohidi, Sajjad Mohammadi, Farnaz Jahanbakhsh, Sepideh Mahabadi, Ali Vakilian, Mohammad Bavarian, Tiziana Smith, Jay Sircar, Luisa Reis de Castro, Anjuli Jain Figueroa, and Alessio Spantini.

Last but not least, I would like to thank my parents, Minoo and Hassan, who have raised me with love, support, and encouragement. I cannot thank them enough for what they have done for me. Also, I would like to thank my little sister, Mina, for being always there for me. Thank you!

(6)

Grants and fundings: The research in this thesis has been support by the following

sources: NSF Award Numbers: CCF-1065125, CCF-1217423, CCF-1420692, CCF-1733808 and IIS-1741137, MIT-IBM Watson AI Lab and Research Collaboration Agreement No. W1771646, Akamai appointment, and FinTech@CSAIL.

(7)

List of Figures

4-1 Standard reduction procedure to testing closeness of two distributions. . . . 105

(12)

(13)

Chapter 1 Introduction

Hypothesis testing is one of the foundational topics in statistics. To put it simply, hypothesis testing is a framework to examine whether a hypothesized model is in line with the observed data. To mathematically model this problem, we view the data as random samples from an unknown distribution. The goal is to see what properties does the distribution have. Some fundamental questions one could ask about the distribution from which the data is drawn include: (1) Is the distribution uniform? Or, is it far from being uniform? (2) Is a pair of random variables independent or correlated? (3) Is the distribution monotone? Or, does it have other shape-related properties?

About a century back, two frameworks have been proposed for hypothesis testing: (1) Neyman and Pearson introduced a framework in which one is given two simple hypotheses (known as the null hypothesis and the alternative hypothesis), and the goal is to decide which hypothesis is more likely to match the data [NP33]. (2) Almost concurrently, Fisher argued for the model of significance tests where there is only one hypothesis (known as the null

hypothesis), and the goal is to evaluate whether the hypothesis seems valid according to

the data or not [Fis25, Fis35]. These two frameworks have constituted the foundation of modern statistical hypothesis testing and have been widely used in experimental research in a variety of ﬁelds, such as biology, medical science, and social sciences. Over time, the use of statistical tests have gained popularity in such a way that they have become the very deﬁnition of the scientific method to draw conclusions from data.

Two decades ago, a new setting, called property testing of distributions or simply

dis-tribution testing, has been proposed to study hypothesis testing from a computational

per-spective [GGR98, GR11a, Bat01]. Distribution testing addresses some of the controversies around the old approaches. For example, the analysis in this framework involves determin-ing how much data is needed to achieve speciﬁc error guarantees, in contrast to classical results that consider the performance of the test in the limit when the number of samples goes to inﬁnity. Second, in their most general form, other than the assumptions about the domain, i.e., that it is discrete and its size is known, the tests in this framework do not make any assumption about the underlying distribution. In this thesis, we mainly focus on the distribution testing framework. We discuss distribution testing in more details in Section1.2. Despite a century of remarkable studies in hypothesis testing, there is still a lot left to be done for the practical world’s fast-paced and evolving needs. Due to technological advancement, new formats of data are emerging. The larger-capacity data centers allow us

(14)

to store more massive datasets. However, in spite of conventional wisdom, larger datasets do not necessarily make the hypothesis testing problem easier. In fact, recent datasets usually consist of multiple attributes that bring the challenges of working with high-dimensional domains into view. Furthermore, the way that the data is collected may cause diﬃculties in testing as well. For example, data may contain some noise, or it may have been collected from nonidentical sources. In the latter case, viewing data as random samples of a single underlying distribution may not reﬂect the truth. In recent years, new concerns have been raised over the social impact of data manipulation, such as privacy and fairness.

In this thesis, we strive to adapt the hypothesis testing framework to address such chal-lenges. We direct our attention to ﬁve fundamental problems, as described below. The objective in these problems is to design an algorithm that uses the optimal number of sam-ples (up to a constant factor) in terms of the domain size and the error guarantees.

• Uniformity testing: Testing whether a distribution is uniform or far from uniform. • Identity testing (goodness of fit): Testing whether a distribution is equal to a

known distribution.

• Closeness testing (equivalence testing): Testing whether two distributions, which we have sample access to, are equal or far from each other.

• Independence testing: Testing whether two random variables are independent from each other or far from being independent.

• Monotonicity testing: Testing whether a distribution is monotone according to some (partial) order of the domain elements, or it is far from being such a distribution. The above problems are primary problems in statistics which naturally arise in many practical settings. Also, the tests for these problems have been used as important building blocks for other testing problems in the standard setting, and thus we expect that our techniques and results will be useful to resolve these challenges for other testing problems as well. In short, we describe our contribution in the next section (Section 1.1).

1.1 Contributions

Testing monotonicity of distributions over posets Monotonicity: is a key

shape-related property of distributions. Many distributions which appear in real-world phenomena are monotone or piece-wise monotone. One can generalize the notion of monotonicity to the cases where the domain is a partially ordered set (poset). That is, we say p is a monotone distribution if for any pair of domain elements x and y such that x  y, then p(x) is at most p(y). In Chapter 2, we study the problem of testing monotonicity of a distribution over a poset. Among other results our main contribution is that testing monotonicity over a high dimensional domain, a boolean hypercube (i.e., _{{0, 1}}d_{), needs almost linearly many}

(15)

between the known lower and upper bounds for this problem [BFRV10,ADK15]. The lower bound is established by proving a lower bound of Ω(n/ log n) for testing the monotonicity of distributions over a matching poset with n edges, and embedding the hard instances of the matching problem into the hypercube. Moreover, we show that monotonicity testing over any poset can be reduced to monotonicity testing over bipartite posets with asymptotically the same number of vertices.

Testing distributions in the presence of noise: It is well-known that distribution

testing in the presence of arbitrary noise usually requires far more samples in comparison to the setting in which there is no noise [VV17b]. In Chapter 3, we present a noise model that, on the one hand, is more tractable for the testing problem and, on the other hand, represents a rich class of noise families. In our model, the noisy distribution is a mixture of

the original distribution and noise, where the latter is known to the tester either explicitly or

via sample access; the form of the noise is also known a priori. Focusing on the well studied identity and closeness testing problems, we consider various scenarios that diﬀer in terms of how the tester can access the distributions, and we demonstrate that indeed testing these properties under our proposed noise model is more tractable compared to the general noise model. Our results show that the asymptotic sample complexity of our testers is exactly the

same as the classical non-mixture case, which means that the sample complexity is optimal

and sublinear in the domain size. We also consider the case where the noise is from the class of k-ﬂat distributions.

Privacy and distribution testing: Preserving digital privacy is of crucial importance

in the era of the abundance of personal data. In order to achieve privacy protection, a common requirement on algorithms is that they be differentially private, that is, that the algorithm’s output has limited influence from any single individual. The main challenge is to enable algorithms to make global inferences about a collective dataset while protecting the privacy of individuals. In Chapter 4, we develop differentially private algorithms for uniformity, identity, closeness, and independence testing problems with (nearly-)optimal sample complexities. The conceptual message of our work is there exist private hypothesis testers that are nearly as sample-efficient as their non-private counterparts.

Our main technical contribution is a methodology to privatize the closeness tester of [DK16] that relies on the idea of ﬂattening the underlying distributions. The ﬂatten-ing technique maps the input distributions to two other distributions with lower l2-norms,

which can reduce the required number of samples for the closeness testing problem. Several other distribution properties can be tested via a reduction or the direct use of this flattening-based closeness tester. For many of such properties, e.g., independence, this approach is the only known methodology to obtain minimax sample-optimal testers in the non-private set-ting, e.g., testing independence. The efficiency of the flattening-based closeness tester is due to the fact that it allows us to exploit existing structure in the underlying problem distri-butions, and obtain a closeness tester which is more efficient than even the lower bounds of the general closeness testing problem. It is worth noting that prior to our work, in spite of the importance of the flattening technique in the non-private setting, there were no differ-entially private testers that could make the flattening step private. The privatization of the

(16)

ﬂattening based tester gives a uniﬁed approach via the reductions mentioned above.

Distribution testing with multiple sources: As explained earlier, a widespread rou-tine for modeling data is to view it as samples drawn from a single distribution. However, in many applications the data is usually gathered from multiple sources. Furthermore, it can be the case that the dataset contains only a few data points from each source. For example, an online shop may have the purchase history of thousands of customers while each cus-tomer may shop at the store only a limited number of times. On the other hand, data that comes from multiple sources may result in a dataset consisting of a collection of unconnected and unrelated data points. For example, it might not be possible to derive any meaningful conclusions from a dataset that contains the blood pressure of patients with heart disease, Alzheimer patients, and healthy individuals. However, if there is some consensus among the

sources, we may be able to make reasonable inferences based on the data. In [AS20], we propose a novel framework for testing the properties of distributions. While we allow the input data to be drawn from multiple distributions (sources), we receive “few” samples from each distribution (one sample in expectation per distribution). We suggest a structural

con-dition in order to model the agreement among the sources to enable us to draw meaningful

conclusions. Under this structural condition, we have developed sample optimal testers for the problems of uniformity, identity, and closeness testing.

1.2 The context

Distribution testing is a branch of a larger area in theoretical computer science, called

prop-erty testing [RS96,GGR98]. For a given property of distributions, we use_{P to denote a set of} distributions that satisfy the property. We say we can test_{P if there exists an algorithm that} receives samples from an unknown distribution and distinguishes the following cases with high probability: (1) The underlying distribution is a member of _{P. (2) Or, it is ǫ-far from} any distribution in _{P for some proximity parameter ǫ and some notion of distance between} distributions. (For the exact definition, see Section 1.3.) If the algorithm chooses the first case, we say the algorithm accepts the distribution. And, if it chooses the second case, we say the algorithm rejects the distribution. The focus of this framework is mostly on distri-butions on discrete domains. The ultimate goal is to find the optimal sample complexity, i.e., the number of samples that the algorithm needs in terms of the domain size, proximity parameter, and other parameters of the problem. Besides the sample complexity, we favor time-efficient algorithms, ideally with a linear time complexity in terms of the sample size.

Since the introduction of distribution testing, many properties have been extensively studied. The examples of such properties include: uniformity [GR11a, Pan08, DGPP19], identity to a known [BFF+₀₁_, _DKN15_, _ADK15_, _DGPP18_, _VV17a_{] or an unknown}

distri-bution [CDVV14, DK16], independence [BFF+₀₁_, _CDKS18_{], monotonicity [}_BKR04_, _RS09_,

CDGR18], k-modality [DDS+₁₃_{], being a k-histogram [}_ILR12_, _Can16_, _DK16_{], entropy}

esti-mation [BDKR02,WY16], and support size estimation [RRSS09a,VV17b,WY19]. For more results, see the surveys on this topic [Can15b, Rub12]. We defer the detailed discussion of the related work corresponding to each results to their chapters.

(17)

with the classical approaches in the hypothesis testing. These features render distribution testing more general for practical circumstances.

Bounding type I and type II errors: As we have mentioned, there are two primary

schools for hypothesis testing. The hypothesis testing model of Neyman and Pearson and

the significance test model of Fisher. The conceptual notions of error in these models are

signiﬁcantly diﬀerent:

1. In Neyman and Pearson’s model, there are two (simple) hypotheses about the under-lying distribution known as the null hypothesis and the alternative hypothesis and the goal is to determine which one is more aligned with the observed data. In this model, there are two types of error: (1) Type I error: the probability of rejecting the null hypothesis while it is true. (2) Type II error: the probability of accepting the null hypothesis when the alternative hypothesis is true. If we are assured that one of the two hypotheses are true, and both of the errors are small, the testers designed in this model should be able to recover the true hypothesis with high probability.

2. Fisher promoted the model of significance tests, which evaluates only the null hypoth-esis. The analysis of a significance test mainly involves measuring the probability of “rejecting” a null hypothesis when it is true. If the anticipated probability of failing is smaller than a known threshold value p, referred to as the p-value, the test concludes to reject the null hypothesis. In other words, when the p-value is small, it means that it is unlikely that the tester finds evidence to reject the null hypothesis when it is true. However, the reverse side of the argument is not a point of concern in the significance test1_{. In fact, “not rejecting” does not imply that the null hypothesis is “accepted”.}

The conceptual diﬀerences in these models aroused much controversy among mathemati-cians [Ney55,Fis58,HB03] and confused applied researchers [Gig87,TAW15, DSO94]. Both schools have their pros and cons: One one hand, Fisher’s model cannot confirm a hypothesis and only allows disproving one. On the other hand, Neyman and Pearson’s model, while it is more rigorous, only works if one of the two hypotheses is true, which may not be achievable in practical settings.

Distribution testing follows the Neyman and Pearson model. More precisely, there are two hypotheses: (1) The underlying distribution has the property _{P. (2) The underlying} distribution is ǫ-far from any distribution which has the propertyP. Note that for a carefully chosen small parameter ǫ, it is reasonable to assume one of the two hypotheses is true. And,

in the case where neither is true, i.e., when the distribution is ǫ-close to having the property, accepting or rejecting the null hypothesis both can be considered as an appropriate answer.

Finite sample regime: In classical hypothesis testing literature, the goal is to analyze

the error rate of the tester. That is, how fast the error drops as the number of samples goes to inﬁnity. While these results have improved our understanding of problems, they are less

1_{In some cases, the analysis of type II error is implicit in some of Fisher’s analysis. However, it is not}

(18)

useful from an algorithmic point of view. Moreover, one cannot directly adapt these results when the problem’s parameters, such as the domain dimension, change.

In distribution testing, as opposed to the classical results, the algorithm’s performance is analyzed when we have a ﬁnite number of samples. We compute the number of samples up to a constant factor with respect to the domain size, potentially the dimension of the domain, and the proximity parameter ǫ. This type of analysis of the sample size is referred to as non-asymptotic in statistics community.2

Assumptions: discrete domain, with known size: In its most general form, the

frame-work of distribution testing does not make any assumption about the underlying distribution except that the domain is discrete, and its size is known to the algorithm. It is important to note that the discrete domain assumption is crucial for the other features we have dis-cussed. In particular, having a finite number of samples to distinguish whether a distribution is uniform or ǫ-far from uniform in total variation distance (see Section 1.3 for the defini-tions), cannot be done with a finite number of samples. In fact, one can always construct a fluctuating distribution over tiny intervals, which could mimic the behavior of the uniform distribution when one zooms out.

Besides the two mentioned assumptions, we do not make any other assumption about the underlying distributions, whereas, in other frameworks, the underlying distributions are assumed to be within a speciﬁc class of distributions, e.g., the Gaussian distributions. One could refer to this feature as being non-parametric, while it can be argued that any discrete distributions on a domain of size n can be described by n parameters.

1.3 Notation and definitions

We use [n] to indicate the set _{{1, 2, . . . , n}. We say p : Ω → [0, 1] is a distribution over Ω} if P_i∈Ωp(i) = 1. In this thesis, Ω is a ﬁnite discrete set. We denote the probability of the

domain element x by p(x). For a subset S ⊆ Ω, let p(S) = Pi∈Sp(i). For the case where

p is a distribution over [n], we may also view p as a vector where the its i-th coordinate, pi, is equal to p(i). Let Un denote the uniform distribution on [n]; we drop the subscript

when the domain is clear from the context. Also, we refer to a Poisson random variable with parameter λ as Poi(λ).

Distances: Throughout this thesis, we mainly use the following distances between dis-tributions. For a real number r ≥ 1, we use k.kr to indicate the ℓr-norm of a vector. More

precisely, for a distribution p, _kpkr indicates the ℓr-norm of p:

kpkr := r sX

i

p(i)r_.

2_{The terminology can be confusing for the theoretical computer science community since they call this}

(19)

For two distributions q and p, the ℓr-distance between them is deﬁned as:

kp − qkr := r sX

i

|p(i) − q(i)|r_.

We deﬁne the total variation distance between two distribution as follows: kp − qkT V := max

A⊆[n]|p(A) − q(A)| .

It is not hard to see that _{kp − qk}T V is equal to kp − qk1/2. We typically use the ℓ1-distance

and say p and q are ǫ-close if kp − qk1 < ǫ and ǫ-far otherwise.

Distribution testing: We mathematically deﬁne a property_{P to be a set of distributions.}

A distribution p has the property _{P, if and only if p is in P. We say two distributions p and}

q are ǫ-far from (ǫ-close to) each other in ℓz-distance, if and only if the ℓz-distance between

them is at least (at most) ǫ. Also, p is ǫ-far from _{P, if and only if it is ǫ-far from any} distribution in _P.

Definition 1.3.1. For two given parameters ǫ and δ, we say an algorithm is an tester for

property_{P if upon receiving samples from a distribution p the following is true with probability} at least 1− δ:

• Completeness case: If p has the property P, then the algorithm outputs accept. • Soundness case: If p is ǫ-far from P, then the algorithm outputs reject.

The deﬁnition of tester can be extended to the case of properties of more than one distribution. We refer to ǫ and δ as the proximity parameter and confidence parameter respectively.

Remark 1.3.2. Note that if we have an (ǫ, δ)-tester for a property with a conﬁdence

param-eter δ < 0.5, then we can achieve an (ǫ, δ′_{)-tester for an arbitrary small δ}′ _{where the sample}

complexity of the new tester has an extra Θ(log(δ/δ′_{)) factor. This amplification technique}

runs the initial tester Θ(log(δ/δ′_{)) times and takes the majority output as the answer, and}

then is analyzed by applying the Chernoﬀ bound to obtain the new conﬁdence parameter. Thus, in this thesis, we mainly focus on the case where δ is a constant.

The term identity testing is used to refer to the setting in which we test if a distribution, which we have sample access to, is equal to a known one. Note that this is equivalent to testing property P = {q} where q is the known distribution. The term uniformity testing is used to refer to a special case of identity testing, where q is equal to _Un. The term

closeness testing refers to the setting in which we test if two distributions, both available

via samples, are equal or not; in this case, P is the set of pairs of equal distributions. The term independence testing is used to refer to the setting in which we test whether a two dimensional distribution is equal to its marginals. In this case, _{P is deﬁned to be the set of} such distributions: P = {p|p = p1× p2}.

(20)

1.4 Organization

In Chapter 2, we focus on the problem of testing monotonicity over general posets. This chapter is adapted from the paper titled “Towards Testing Monotonicity of Distributions Over General Posets”. It is joint work with Themistoklis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. It appeared in 32nd Annual Conference on Learning Theory (COLT 2019) [AGP+₁₉_].

In Chapter 3, we focus on distribution testing in the presence of noise. This chapter is adapted from the paper titled “Testing Mixtures of Discrete Distributions”. It is joint work with Ravi Kumar and Ronitt Rubinfeld. It appeared in 32nd Annual Conference on Learning Theory (COLT 2019) [AKR19].

In Chapter 4, we focus on distribution testing with respect to diﬀerential privacy. This chapter is adapted from the paper titled “Private Testing of Distributions via Sample Permutations”. It is joint work with Ilias Diakonikolas, Daniel Kane, and Ronitt Rubin-feld. It appeared in 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) [ADKR19]. This paper is an improvement of an earlier paper titled “Diﬀerentially Private Identity and Equivalence Testing of Discrete Distributions”. It is joint work with Ilias Diakonikolas and Ronitt Rubinfeld, and appeared in 35th International Conference on Machine Learning (ICML 2018) [ADR18].

In Chapter5, we focus on a novel model for the distribution testing when the input data is gathered from multiple sources. This chapter is adapted from the paper titled “Testing Properties of Multiple Distributions with Few Samples”. It is joint work with Sandeep Silwal. It appeared in 11th Innovations in Theoretical Computer Science (ITCS 2020) [AS20].

(21)

Chapter 2 Testing Monotonicity of Distributions

Over General Posets

2.1 Introduction

Monotonicity is an essential property of distributions that captures many observed phenom-ena of real-world probability distributions. For instance, monotone distributions over totally

ordered sets might be used to describe distributions on diseases for which the probability

of being aﬀected by the disease increases with age. More generally, an important class of distributions are characterized by being monotone over a partially ordered set (poset). For these distributions, if a domain element u lower bounds v in the partial ordering (denoted

u v), then p(u) ≤ p(v) (whereas if u and v are unrelated in the poset, then p needs not sat-isfy any particular requirement on the relative probabilities of u and v). Such distributions might include distributions on diseases for which the probability of being aﬀected increases by some combination of several risk factors. Many commonly studied distributions, e.g. exponential distributions or multivariate exponential distributions, are or can be approxi-mated by piecewise monotone functions. As monotone distributions are a fundamental class of distributions, the problem of testing whether a distribution is monotone is a key building block for distribution testing algorithms.

Given an unknown distribution, over a poset domain, the goal is to distinguish whether the distribution is monotone or far from any monotone distribution, using as few samples as possible. This problem has been considered in the literature: the problem of testing whether a distribution is monotone was ﬁrst considered in the work of [BKR04], where testing the monotonicity of distributions over totally ordered domains and partially ordered domains that corresponded to two-dimensional grids were considered. The work of [BFRV10] intro-duced the study of testing the monotonicity of distributions over general partially ordered domains, and in particular, considered the Boolean hypercube (_{{0, 1}}d_{). Several other works}

considered these questions [DDS+₁₃_,_ADK15_,_CDGR18_{] under various diﬀerent domains and}

achieved improved sample complexity bounds.

The sample complexity of the testing problem varies greatly with the structure of the poset: On the one hand, for domains of size n that are total orders, Θ(√n) samples

(22)

distance from any monotone distribution [BKR04, ADK15, CDGR18]. On the other hand, testing distributions deﬁned over the matching poset requires nearly linear in n, speciﬁcally Ω(n1−o(1)_{), samples [}_BFRV10_{]. Furthermore, for a large class of familiar posets, such as the}

Boolean hypercubes, little is understood about the sample complexity of the testing problem.

2.1.1 Our results and approaches

We ﬁrst deﬁne a new property called the bigness property, which we use as our main building block for establishing sample complexity lower bounds for monotonicity testing. A distribu-tion is T -big if every domain element is assigned probability mass at least T .

Though the bigness property is a symmetric property (i.e., permuting the labels of the elements does not change whether the distribution has the property or not), we use lower bounds for testing the bigness property in order to prove lower bounds on testing mono-tonicity, which is not a symmetric property. In addition, the bigness property is a natural property, and thus of interest in its own right.

We show that the sample complexity of the bigness testing problem is Θ(n/ log n) when

T = Θ(1/n). The upper bound follows from applying the algorithm of [VV17b] that learns the underlying distribution up to a permutation of the domain elements. Our lower bound approach is inspired by the framework of [WY19], used to lower bound the number of sam-ples needed to estimate support sizes. Our lower bound is established by showing that the distribution of samples, one generated from T -big distributions (p’s) and the other generated from distributions that are ǫ-far from T -big (p′_{’s), are statistically close. In contrast with}

the standard lower bound framework, p and p′ _{are not picked from two sets of distributions.}

Instead, the distribution p (resp. p′_{) is constructed by having each domain element i choose}

its probability p(i), in an i.i.d. fashion, from the distribution PV (resp. PV′) over possible probabilities in [0, 1]. To design PV and PV′, we introduce a new optimization problem that maximizes ǫ while keeping the distribution of samples statistically close. This constraint is established via the moments matching technique, which allows us to show that the distri-butions are indistinguishable with o(n/ log n) samples, but also plays a crucial role in many other settings [RRSS09b, Val11,BFRV10, VV16,VV17b, WY19, WY16].

By reducing from the bigness testing problem, we next give a lower bound of Ω(n/ log n) on the sample complexity of the monotonicity testing problem over the matching poset, improving on the Ωn/2Θ(√log n) _{lower bound in [}_BFRV10_{]. In addition to improving the}

sample complexity lower bound, one particularly useful byproduct of our approach is that the maximum probability of an element in the constructed lower bound distribution families can be made small, which assists us in proving lower bounds for other posets in the following. Finally, we leverage the lower bound for the monotonicity testing problem over the match-ing poset to prove a lower bound of N1−δ _{for δ = Θ(}√_{ǫ) + o(1) for monotonicity testing over}

the Boolean hypercube of size N = 2d_{, greatly improving upon the standard “Birthday}

Paradox” lower bound of Ω(√N ). Our reduction follows from ﬁnding a large embedding

of the matching poset in the hypercube, and its eﬃciency follows from the previously men-tioned upper bound on the maximum element probability from the bigness lower bound construction above.

We then give a number of new tools for analyzing upper bounds on the sample complexity of the monotonicity testing problem:

(23)

1. We prove that the distance of a distribution to monotonicity can be characterized approximately as the weight of a maximum weighted matching in the transitive closure of the poset, where the weight of the edge (u, v) is the amount of violation from being monotone: max(0, p(u)− p(v)). This characterization gives a structural result about distributions that are ǫ-far from monotone. Moreover, this results extends the work of [FLN+₀₂_{] to non-boolean valued functions. The work of [}_FLN+₀₂_{] shows that the}

distance of a boolean function f to monotonicity is related to the number of “violating edges” in the transitive closure of the underlying poset.

2. Via the characterization above, we show that the monotonicity testing problem over

bipartite posets (where all edges are directed in the same direction) captures the

tonicity testing problem in its full generality. That is, we give a reduction from mono-tonicity testing over any poset to monomono-tonicity testing over a bipartite poset. Our reduction preserves the number of vertices and the distance parameter up to a con-stant multiplicative factor. As in the previous, this result extends the work of [FLN+₀₂_]

to non-boolean valued functions.

3. Leveraging the learning algorithms for symmetric distributions in [VV17b], we propose algorithms with sample complexity O(n/(ǫ2_{log n)) for testing bigness of a}

distribu-tion, and for testing monotonicity on matching posets. The proof of our latter result requires certain subtle details: (1) an additional reduction that allows us to scale our distribution for “each side” of the matching, in order to generate suﬃcient samples from each side, as required by the algorithm of [VV17b], and (2) technical lemmas establishing bounds between the total variation distance and the distance notion in

[VV17b], under the scaling mentioned earlier.

4. We give a reduction from monotonicity testing on a bipartite poset, to monotonicity testing on the matching (for which the testing algorithm is constructed above). This reduction gives an algorithm for monotonicity testing on any bipartite poset (which is the most general problem, as argued earlier), in which the overhead in the sample complexity depends only on the maximum degree of the bipartite graph.

5. We give another upper bound for testing monotonicity on bipartite posets:

O((log M)/ǫ2_{) where M is the number of “endpoint sets” of all possible matchings}

contained in the given bipartite graph (or equivalently, the number of induced sub-graphs that admit a perfect matching over their respective vertex sets). Note that for the matching poset, M = 2n _{yields an O(n/ǫ}2_{) upper bound, and therefore for}

matching posets our previous algorithm is preferable. However, this bound yields an upper bound of O(n/ǫ2_{) for all posets, and could potentially be even smaller for certain}

classes of graphs, such as collections of large stars.

6. Finally, we give an upper bound of O(n2/3_ǫ + _ǫ12) samples for monotonicity testing on bipartite posets, under the guarantee that the distribution being tested is a uniform distribution on some subset of known size of the domain. This special case is of interest in that it relates to the well studied problem of testing monotonicity of Boolean

(24)

function, we are given uniform “positive” samples of domain elements x for which

f (x) = 1.

2.1.2 Related work

Batu, Kumar, and Rubinfeld [BKR04] initiated the study of testing monotonicity of distri-butions. For the case where the domain is totally ordered, the sample complexity is known to be Θ(√n) [BKR04, ADK15, CDGR18]. Several works have considered distributions over higher dimensional domains. In [BKR04,BFRV10], it is shown that testing monotonicity of a distribution on the two dimensional grid [m]_{× [m] (here N = m}2_{) can be performed using}

e

O(N3/4_{) samples. For higher dimensional grids [m]}d _{(where N = m}d_{), Bhattacharyya et al.}

provided an algorithm that uses O(me d−1/2_{) =} _O(N/e 2d√

N ) samples [BFRV10]. Acharya et al. gave an upper bound of O(√_ǫN2 + (d log m_ǫ2 )d·_ǫ12) and a lower bound of Ω(

√

N/ǫ2_{) [}_ADK15_].

While their result gives a tight bound of Θ(√N /ǫ2_{) when d is relatively small compared to} m, it does not yield a tester for Boolean hypercubes using a sublinear number of samples.

Bhattacharyya et al. considered the problem of monotonicity testing over general posets

[BFRV10]. In particular, they proposed an algorithm for testing the monotonicity of

distri-butions over hypercubes (where N = 2d_{) using ˜}_{O(N/(log N/ log log N)}1/4_{) samples. They}

provide a lower bound of Ω(n1−o(1)_{) for testing monotonicity of distributions over a matching}

of size n, and a lower bound of Ω(√n) when the poset contains a linear-sized matching in

the transitive closure of its Hasse digraph.

In addition to the above, testing monotonicity of distributions has been considered in var-ious settings [ACS10,DDS12,Can15a]. There are several works on testing various properties, e.g. uniformity, closeness, and independence when the underlying distribution is monotone

[BDKR02, BKR04, RS09, DDS+₁₃_,_AJOS13_].

Testing monotonicity of boolean functions is also well studied (e.g., [GGLR98, DGL+₉₉_,

LR01, FLN+₀₂_, _CS13_, _CS14_, _BB16_, _BCS18_{]). In the general regime, the algorithm can}

query the value of the function at any element in the poset. This ability is in sharp contrast with our model, in which the algorithm only receive samples according to the distribution, which do not directly reveal the probability of the elements. It is known that one can test monotonicity of functions over hypergrids, and hypercubes using as few as polylogarithmic queries in the size of the domain. This query complexity is exponentially smaller than the sample complexity of testing monotonicity of distributions, demonstrating that there are inherent diﬀerences between the two problems.

2.1.3 Preliminaries

Given a multiset of samples from a distribution on [n], the histogram of the samples is an

n-dimensional vector, h = (h1, h2, . . . , hn), where hi is the frequency of the i-th element in

the sample set. A poset G = ([n], E) is called a line if and only if E contains all the edges (i, i + 1) for 1 ≤ i ≤ n. We say a poset is a matching if all of the edges in the poset are vertex-disjoint. We say a poset is bipartite if the set of vertices can be decomposed in two sets, the top set and the bottom set, where no two vertices in the same set are connected. Moreover, the direction of all the edges is from the top set to the bottom set. We use similar terminology for the matching poset as well. In addition, we say a poset G = (V, E) is an

(25)

n-dimensional hypercube when V is{0, 1}n _{and E contains all edges (u, v) where there exists}

a coordinate i such that ui= 0 and vi = 1 and uj = vj for all i6= j.

Monotonicity. A partially-ordered set (poset) is described as a directed graph G = (V, E),

where each edge (u, v) indicates the relationship u _{v on the poset. A matching poset is a} poset where the underlying graph G is a matching. A distribution p over a poset domain

V =_{v1, . . . , vn} is a distribution over the vertex set V . A distribution p is monotone (with

respect to a poset G) if for every edge (u, v)_{∈ E (i.e., every ordered pair u v), p(u) ≤ p(v).} Let Mon(G) be the set of all monotone distributions over the poset G. We say that p is ǫ-far

from monotone if its distance to monotonicity, dT V(p, Mon(G)) := minq∈Mon(G)dT V(p, q), is

at least ǫ.

Definition 2.1.1. Let p be a distribution on poset G and ǫ be the proximity parameter.

Suppose an algorithm, A, has sample access to p and the full description of poset G. A is

called a monotonicty tester for distributions if the following is true with probability at least

2/3 when the tester has sample access to the distribution. • If p is monotone, then A outputs accept.

• If p is ǫ-far from monotone, then A outputs reject.

Bigness. A probability distribution p over a domain [n] =_{{1, . . . , n} is T -big if, for every}

domain element i ∈ [n], p(i) ≥ T . Related notions for distance to T -bigness are deﬁned analogously. The parameter T is called the bigness threshold, and may be omitted if it is clear from the context. Let Big(n, T ) indicate the set of all distributions over [n] that are

T -big. We deﬁne the distance to T -bigness as dT V (p, Big(n, T )) = minq∈Big(n,T )dT V(p, q). If

this distance is at least ǫ, we say the distribution is ǫ-far from being T -big.

Definition 2.1.2. Let p be a distribution on [n]. Suppose Algorithm _{A receives threshold T}

and bigness parameter ǫ, and has sample access to p. _{A is a T -bigness tester if the following} is true with probability at least 2/3.

• If p is T -big, then A outputs accept.

• If p is ǫ-far from T -big, then A outputs reject.

Also, T -bigness testing problem refers to the task of distinguishing the above cases with high probability.

2.2 Overview of Our Techniques

(26)

2.2.1 A lower bound for the bigness testing problem

In Section 2.3, we provide two random processes for generating histograms of samples from two families of distributions, such that one family consists of “big” distributions, and the other family largely of “ǫ-far from big” distributions. Then, we show that unless a large number of samples have been drawn, the distributions over the histograms generated via these two random processes are statistically very close to each other, and hence appear indistinguishable to any algorithm, as speciﬁed precisely in Theorem2.3.1. The construction yields a lower bound for the general problem of testing the bigness property in Corollary2.3.4. Furthermore, the construction provides a useful building block for establishing further lower bounds for monotonicity testing in various scenarios in Section 2.4.

To generate histograms from the two families of distributions, imagine the following process: We have two prior distributions PV and PV′, and we generate probability vectors (measures), p and p′_{, according to the priors: Each domain element i randomly picks its}

probability in an i.i.d fashion from the prior distribution. More precisely, let V1, V2, . . . , Vn

be n i.i.d. random variables from prior PV, then p is deﬁned to be the following:

p = 1

n(V1, V2, . . . , Vn) .

We generate p′ _{similarly according to prior P}

V′. While the total probability is unlikely to sum to 1, we will design the priors, PV and PV′, so that we can later modify p or p′ into a probability distribution with only small changes. We then generate histograms of samples from (the normalization of) p by drawing n independent random variables hi ∼ Poi(s · p(i))

(namely hi ∼ Poi(sVi/n)) for i = 1, . . . , n, and output h = (h1, . . . , hn) as the histogram

of the samples. Note that by Poissonization method, one may view the histogram as being generated from a set of Poi(s· PiVi/n) samples from the normalization of p. Hence, if

P

iVi/n is close to one, the histogram serves as a set of roughly s samples. We set s more

speciﬁcally in terms of the rest of the parameters later.

The goal in Section 2.3 is to ﬁnd two prior distributions PV and PV′, then generate two probability vectors p and p′_{, and two histograms h and h}′ _{according to them respectively,}

such that the following events hold with high probability.

1. The probability vectors p and p′ _{are approximate probability distributions; that is,}

their total probability masses are each close to 1.

2. After scaling the probability vectors p and p′ _{above into respective probability}

distri-butions, the normalization of p is T -big, and the normalization of p′ _{is ǫ-far from any} T -big distribution.

3. The total numbers of (Poissonized) samples in h and h′ _{drawn from the normalization}

of p and p′ _{are each Ω(s), where s is the sample complexity lower bound we are aiming}

to prove.

4. Given h or h′_{, distinguishing whether it is generated from P}

V or PV′ with success probability 2/3 requires h or h′ _{to contain at least s samples.}

(27)

5. Additionally, we will bound the largest probability mass pmax that the normalized

distributions place on any domain element – this part is not necessary for this section, but will be useful for the reduction between monotonicity testing and bigness testing later on.

Now, if we choose PV and PV′ carefully such that h and h′ are generated according to the above process based on PV and PV′ are hard to distinguish, then we can establish a lower bound for the bigness testing problem. We state this result more formally as the following theorem in Section 2.3.

Theorem 2.3.1. For integer L = O(log n) and sufficiently small ǫ = Ω(1/n), there exist a

parameter β = β(L, ǫ) and two distributions H+ _and _H− _{over the set of possible histograms} of size at least s = Ωn1−1/L_log2

(1/ǫ)/L with the following properties:

• The histogram generated from H+ _{is drawn from a 1/(βn)-big distribution.}

• The histogram generated from H− _{is drawn from a distribution which is ǫ-far from any}

1/(βn)-big distribution. • dT V (H+,H−)≤ 0.01.

• The largest probability mass among any elements in any probability distributions above

(from which the histograms are drawn) is pmax= O(L

2_{/(n log}2

(1/ǫ))).

An important case of this theorem is when L = Θ(log n), where we establish a nearly linear sample complexity lower bound of Ω(n/ log n) for the general problem of bigness testing as follows.

Corollary 2.3.4. For sufficiently small parameter ǫ = Ω(1/n), there exists a parameter β =

β(ǫ) such that any algorithm that can distinguish whether a distribution over [n] is 1/(βn)-big or ǫ-far from any 1/(βn)-big distribution with probability 2/3 requires Ω(n log2(1/ǫ)/ log n)

samples. In particular when ǫ is a constant, β is constant, then any such algorithm requires

Ω(n/ log n) samples.

We propose the following optimization problem, OP1, such that its optimal solution speciﬁes PV and PV′, satisfying the requirements of the theorem. Intuitively speaking, as

PV aims to generate T -big distributions, we must ensure that Vi’s are bounded away from

1/β, so that p(i) = Vi/n has expected value higher than T = 1/(βn). At the same time, we

hope to maximize the probability that V′

i = 0 so that p′ has lots of domain elements with

probability zero to make its normalization far from any T -big distribution. In addition, we ﬁnd PV and PV′ under the constraint that the ﬁrst L moments of them are exactly matched, as to ensure that the resulting distributions over the histograms, H and H′_{, are statistically}

close. The objective value of this optimization problem corresponds to the expected distance of p′ _{to the closest T -big distribution in the ℓ}

1-distance.

Deﬁnition of OP1 : sup _β1Pr[[[V′ _{= 0]]]} s.t. E[[[ V ]]] = E[[[V′_{]]] = 1}

E[[[ Vj_{]]] = E}hhh_V′jiii _{for j = 1, 2, . . . , L} V ∈h1+ν β , λ β i , V′ _{∈ {0} ∪}h1+ν β , λ β i and β > 0.

(28)

In the above optimization problem, the unknowns are PV, PV′ , and β. ν and λ are two

pa-rameters speciﬁed latter in the proof. That is we are looking for two distributions PV and PV′

such that two random variables V and V′ _{drawn from them respectively have expected value}

one, and their ﬁrst L moments are matched. Also, β controls the range of the probabilities,

p(i)’s and p′_{(i)’s, and the distance to the bigness property.}

We relate the optimal solution for OP1 to an LP deﬁned by [WY19], who in turn relate their LP to the error from the best polynomial approximation of the function 1/x over the interval [1+ν, λ]. By doing this, we show the existence of a solution (PV, PV′) where the value

Pr[[[ V′ _{= 0]]], which is proportional to the distance to 1/(βn)-bigness in the second family, is}

suﬃciently large.

Our proof relies on and extends the lower bound techniques for estimating support size provided in [WY19], incorporating speciﬁc conditions for the bigness problem. Firstly, unlike the support size estimation problem, we need our distributions to be fully-supported on the domain [n] for the big distributions, whereas in their case, both families of distributions are allowed to be partially supported. Secondly, our optimization problem treats the threshold 1/(βn) as a variable, whereas the support size problem simply imposes the strict threshold of 1/n. Thirdly, based on this construction, we must also give a direct upper bound for the maximum probability, which facilitates our later proofs for providing lower bounds for the matching and hypercube posets.

2.2.2 From bigness lower bounds to monotonicity lower bounds

In Section 2.4, we show how to turn our lower bound results for bigness testing problem in Section 2.3, into lower bounds for monotonicity testing in some fundamental posets, namely the matching poset and the Boolean hypercube poset.

Matching poset. To establish our lower bound for testing monotonicity of the matching

poset, we construct our distribution p by assigning probability masses to the endpoints of edges (ui, vi) in our matching as follows: the vertices ui’s are assigned probability masses

according to the T -bigness construction, whereas the vertices vi’s are uniformly assigned the

threshold T as their probability masses; the assigned probabilities are then normalized into a proper probability distribution. We show that before normalization, p(ui) ≤ T = p(vi)

if the original distribution is big; and otherwise, the distance to the monotonicity of the constructed distribution measures exactly the distance to the T -bigness property. We then show that the normalization step scales the entire distribution p down by only a constant factor, hence the lower bounds for the monotonicity testing over the matching poset with 2n vertices asymptotically preserves the parameters ǫ, s and pmax of the lower bound on bigness

construction for n domain elements.

Hypercube poset. To achieve our results for the Boolean hypercube, we embed our

dis-tributions over the matching poset into two consecutive levels ℓ and ℓ− 1 of the hypercube (where ℓ denotes the number of ones in the vertices’ binary representation). We pair up elements in these levels in such a way that distinct edges of the matching have incomparable endpoints: the algorithm must obtain samples of these matched vertices in order to decide

(29)

whether the given distribution is monotone or not. We also place probability mass pmax on

all other vertices on level ℓ and above, and probability mass 0 on all remaining vertices, in order to ensure that the distribution is monotone everywhere else. Lastly, we rescale the entire construction down into a proper probability distribution. Unlike the matching poset, sometimes this scaling factor is super-constant, shrinking the overall distance to monotonic-ity, ǫ, to sub-constant. Here, we make use of our upper bound on pmax of the bigness lower

bound construction to determine the scaling factor.

2.2.3 Reduction from general posets to bipartite graphs

In Section 2.5, we show that the problem of monotonicity testing of distributions over the

bipartite posets is essentially the “hardest” case of monotonicity testing in general poset

domains. That is, we show that for any distribution p over some poset domain of size n, represented as a directed graph G, there exists a distribution p′ _{over a bipartite poset G}′ _of

size 2n such that (1) p preserves the total variation distance of p to monotonicity up to a small multiplicative constant factor, and (2) each sample for p′ _{can be generated using one}

sample drawn from p. These properties together imply the following main theorem of the section.

Theorem 2.5.1. Suppose that there exists an algorithm that tests monotonicity of a

dis-tribution over a bipartite poset domain of n elements using s(n, ǫ) samples for any total variation distance parameter ǫ > 0. Then, there exists an algorithm that tests monotonicity of a distribution over any poset domain of n elements using O(s(2n, ǫ/4)) samples.

Our approach may be summarized as follows. We ﬁrst show, in Theorem 2.5.2, that we may characterize (up to a constant factor) the distance of p′ _{to monotonicity, as the size of}

the maximum matching on the transitive closure of G, denoted by T C(G), where the weight

w(u, v) := max_{{p(u)−p(v), 0} represents the amount that (u, v) is violating the monotonicity}

condition. In particular, we have the following theorem:

Theorem 2.5.2. Consider a poset G = (E, V ) and a distribution p over its vertices. Suppose

every edge (u, v) in the T C(G) has a weight of max(0, p(u)_{−p(v)). Then, the total variation} distance of p to any monotone distribution is within a factor of two of the weight of the maximum weighted matching in T C(G).

This crucial theorem provides a combinatorial way to approximate the distance to mono-tonicity for general posets, leading to our upcoming construction of p′ _{for Theorem} _2.5.1 _as

well as some algorithms in Section 2.6. Theorem 2.5.2 is shown via LP duality: the dual LP for the problem of optimally “ﬁxing” p to make it monotone, turns out to align with the maximum (fractional) matching problem on G’s transitive closure. In particular, the dual constraints are of the form _{{Ay ≤ b, y ≥ 0} where A is a totally unimodular matrix,} implying that an integral optimal solution exists, namely the maximum matching.

To prove Theorem 2.5.1, given the original poset G = (V, E), we create a bipartite poset with two copies u− _{and u}+ _{of each original vertex u} _{∈ V : the vertices u}−_{’s and u}+_{’s form}

the bipartition of the new bipartite poset G′ _{of size 2n. We add (u}−_{, v}+_{) to the bipartite}

(30)

to v in G. The new probability distribution p′ on G′, is created from p on G, by dividing the probability mass p(u) equally among p′_(u−_{) and p}′_(u+_{). Note that a sample from p}′ _is

obtained by drawing from p and adding the sign−/+ equiprobably. It follows via transitivity that p′ _{is monotone over G}′ _{when p is monotone over G, and via Theorem} _2.5.2 _{that if p}

is ǫ-far from monotone on G, then p′ _{is also at least ǫ/4-far from monotone over G}′_{. These}

conditions allow us to test monotonicity of p on any general poset G by instead testing monotonicity of p′ _{on a bipartite poset G}′ _{with parameter ǫ}′ _{= ǫ/4, as desired.}

2.2.4 Upper bounds results

In Section2.6, we provide sublinear algorithms for testing bigness, and testing monotonicity of distributions over diﬀerent poset domains.

Bigness testing. In Section 2.6.1, we provide an algorithm for bigness testing. Observe that the T -bigness property is a symmetric property: closed under permutation of the labels of the domain elements [n]. Hence, we leverage the result of [VV17b] that learns the counts of elements for each probability mass: hp(x) = |{a : p(a) = x}|. Observe that the distance

to T -bigness is proportional to the total “deﬁcits” of elements with probability mass below

T . Hence, this learned information suﬃces for constructing an algorithm for testing bigness,

using a sub-linear, O( n

ǫ2_{log n}), number of samples.

Monotonicity testing for matchings. Next, in Section 2.6.2, we provide an algorithm for testing monotonicity of matching posets. We again resort to the work of [VV17b] for learning the counts of elements for each pair of probability masses, with respect to a pair of distributions p1, p2 over the domain [n], namely hp1,p2(x, y) = |{a : p1(a) = x, p2(a) = y}|,

given O( n

ǫ2_{log n}) samples each from p1 and p2. We hope to consider our distribution p over a matching G = (S∪ T, E) with E = {(ui, vi)}i∈[n]⊂ S × T as a pair of distributions, namely

pS and pT, representing probability masses p places over ui ∈ S and vi ∈ T , respectively.

Learning hpS,pT would intuitively allows us to approximate p’s distance to monotonicity by

summing up the “violation” for pairs x < y. However, there are subtle challenges to this approach that do not present in the earlier case of bigness testing.

First, we must somehow rescale pS and pT up into distributions according to their total

masses wS, wT placed by p. However, it is possible that, say, pS = o(1), making samples from

S costly to generate by drawing i.i.d. samples from p. We resolve this issue via a reduction

to a diﬀerent distribution p′ _{that approximately preserves the distance to bigness, while}

placing comparable total probability masses to S and T . Second, the algorithm of [VV17b] learns hp1,p2(x, y) according to a certain distance function, that we must lower-bound by the total variation distance. In particular, this bound must be established under the presence of errors in the scaling factor, as wS and wT are not known to the algorithm. We overcome

these technical issues, which yields an algorithm for testing monotonicity over matchings. We maintain the same asymptotic complexity as that of [VV17b].

Monotonicity testing for bounded-degree bipartite graphs. Moving on, in

Distribution testing : classical and new paradigms

Distribution Testing: Classical and New Paradigms

by

Maryam Aliakbarpour

B.S., Sharif University of Technology (2013)

S.M., Massachusetts Institute of Technology (2015)

Submitted to the Department of Electrical Engineering and Computer

Science

in partial fulﬁllment of the requirements for the degree of

Doctor of Philosophy in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2020

c

Massachusetts Institute of Technology 2020. All rights reserved.

Author . . . .

Department of Electrical Engineering and Computer Science

August 24, 2020

Certiﬁed by . . . .

Ronitt Rubinfeld

Professor of Electrical Engineering and Computer Science

Thesis Supervisor

Accepted by . . . .

Leslie A. Kolodziejski

Professor of Electrical Engineering and Computer Science

Chair, Department Committee on Graduate Students

Distribution Testing: Classical and New Paradigms

by

Maryam Aliakbarpour

Abstract

Acknowledgments

Contents

List of Figures

Chapter 1

Introduction

1.1

Contributions

1.2

The context

1.3

Notation and definitions

1.4

Organization

Chapter 2

Testing Monotonicity of Distributions

Over General Posets

2.1

Introduction

2.1.1

Our results and approaches

2.1.2

Related work

2.1.3

Preliminaries

2.2

Overview of Our Techniques

2.2.1

A lower bound for the bigness testing problem

2.2.2

From bigness lower bounds to monotonicity lower bounds

2.2.3

Reduction from general posets to bipartite graphs

2.2.4

Upper bounds results