• Aucun résultat trouvé

Block Heavy Hitters

N/A
N/A
Protected

Academic year: 2021

Partager "Block Heavy Hitters"

Copied!
5
0
0

Texte intégral

(1)

Computer Science and Artificial Intelligence Laboratory

Technical Report

MIT-CSAIL-TR-2008-024

May 2, 2008

Block Heavy Hitters

(2)

Block Heavy Hitters

Alexandr Andoni MIT Khanh Do Ba MIT Piotr Indyk MIT April 11, 2008 Abstract

We study a natural generalization of the heavy hitters problem in the streaming context. We term this generalization block heavy hitters and define it as follows. We are to stream over a matrix A, and report all rows that are heavy, where a row is heavy if its `1-norm is at

least φ fraction of the `1 norm of the entire matrix A. In comparison, in the standard heavy

hitters problem, we are required to report the matrix entries that are heavy. As is common in streaming, we solve the problem approximately: we return all rows with weight at least φ, but also possibly some other rows that have weight no less than (1 − )φ. To solve the block heavy hitters problem, we show how to construct a linear sketch of A from which we can recover the heavy rows of A.

The block heavy hitters problem has already found applications for other streaming prob-lems. In particular, it is a crucial building block in a streaming algorithm of [AIK08] that constructs a small-size sketch for the Ulam metric, a metric on non-repetitive strings under the edit (Levenshtein) distance.

We prove the following theorem. Let Mn,m be the set of real matrices A of size n by m, with

entries from E = nm1 · {0, 1, . . . nm}. For a matrix A, let Ai denote its ith row.

Theorem 0.1. Fix some  > 0, and n, m ≥ 1, and φ ∈ [0, 1]. There exists a randomized linear map (sketch) µ : Mn,m → {0, 1}s, where s = O(51φ2 log n), such that the following holds. For a matrix

A ∈ Mn,m, it is possible, given µ(A), to find a set W ⊂ [n] of rows such that, with probability at

least 1 − 1/n, we have: • for any i ∈ W , kAik1

kAk1 ≥ (1 − )φ and

• if kAik1

kAk1 ≥ φ, then i ∈ W .

Moreover, µ can be of the form µ(A) = µ0(ρ(A1), ρ(A2), . . . ρ(An)), where ρ : Em → Rk and

µ0 : Rkn → {0, 1}s are randomized linear mappings. That is, the sketch µ is obtained by first

sketching the rows of A (using the same function ρ) and then sketching those sketches.

Our construction is inspired by the CountMin sketch of [CM05], and may be seen as a CountMin sketch on the projections of the rows of A.

Proof. Construction of the sketch. We define the function ρ as an `1 projection into a space

with k = O(12 log n) dimensions, achieved through a standard Cauchy distribution projection.

(3)

Namely, the function ρ is determined by k vectors ~c1, . . . ~ck∈ Rm, with coordinates chosen iid from

the Cauchy distribution with pdf f (x) = π11+x1 2. Then ρ(~x), for some ~x ∈ Em, is given by

ρ(~x) = (~c1~x, ~c2~x, . . . ~ck~x).

The function µ0 takes as input ρ(A1), . . . ρ(An), and produces k hash tables, each having l =

O(21φ) cells. The jth cell of the ith hash table H(i), for j ∈ [l], is given by

Hj(i)= X

q:hi(q)=j

[ρ(Aq)]i.

See Figure 1 for an illustration.

  ρ ρ ρ ρ ρ ρ n m k µ’ A1 An h1 hk ρ(A1) ρ(An) l

Figure 1: Illustration of µ as a double sketch.

Reconstruction. Given a sketch µ(A) = µ0(ρ(A1), . . . ρ(An)), we construct the desired set W

as follows. For each w ∈ [n], consider the vector ~rw =

 |Hh(i)

i(w)|



i∈[k]. Then w is included in W

iff median(~rw) > (1 − /2)φ. In words, for any block w we consider the cell of a hash table H(i)

into which w falls (one for each i). If the majority of these cells contain a value greater or equal to (1 − /2)φ (in magnitude), then w is included in W .

Sketch size. As described, the sketch µ(A) = µ0(ρ(A1), . . . ρ(An)) consists of k·l = O(41φlog n)

real numbers. We note that, by usual arguments, it is enough to store all the real numbers up to precision O(φ) and cut off when the absolute value is beyond a constant such as 2. The resulting size of the sketch (in bits) is s = O(51φ2 log n).

Analysis of correctness. We proceed to proving that the set W satisfies the desired properties.

(4)

First, consider any w such that kAwk1 ≥ φ. We would like to prove that w ∈ W w.h.p. For

this purpose, it is sufficient to prove that, for fixed i ∈ [k], we have that |Hh(i)

i(w)| > (1 − /2)φ

with probability ≥ 1/2 + Ω(). Then, a standard application of the Chernoff bound will imply that median(~rw) > (1 − /2)φ w.h.p.

So fix some i ∈ [k], and consider the cell hi(w) of the hash table H(i). Let χ[E] denote the

indicator variable of an event E. The mass that falls into the cell hi(w) is equal to the following

quantity: Hh(i) i(w) = [ρ(Aw)]i+ X j∈[n],j6=w [ρ(Aj)]i· χ[hi(j) = hi(w)] = ~ci· Aw+ ~ci·   X j∈[n],j6=w Aj· χ[hi(j) = hw(j)]   = ~ci·  Aw+   X j∈[n],j6=w Aj· χ[hi(j) = hw(j)]    .

Now, consider the vector ~z = 

P

j∈[n],j6=wAj· χ[hi(j) = hw(j)]



. The expected norm of ~z is at most Ehi[k~zk1] ≤ 1 l X j∈[n],j6=w kAjk1≤ 1/l = O(2φ).

By Markov’s inequality, with probability at least 1−O(), we have k~zk1≤ φ/4 and thus kAw+~zk1≥

(1 − /4)φ. It follows that the random variable |(Aw+ ~z) ·~ci| has a Cauchy distribution with median

kAw+ ~zk1 ≥ (1 − /4)φ. By standard properties of Cauchy distributions we have

H (i) hi(w) ≥ (1 − /4) · (1 − /4)φ > (1 − /2)φ with probability at least (1/2 + Ω())(1 − O()) = 1/2 + Ω().

Next we prove that if kAwk1 ≤ (1 − )φ, then w 6∈ W w.h.p. As above, we just need to

prove that |Hh(i)

i(w)| < (1 − /2)φ with probability ≥ 1/2 + Ω(). We again consider the vector

~

z =P

j∈[n],j6=wAj· χ[hi(j) = hj(w)]



, and similarly deduce that, with probability at least 1−O(),

we have k~zk1 ≤ φ/4 and thus kAw+ zk1 ≤ (1 − 34)φ. Again by standard properties of Cauchy

distributions, we conclude that H (i) hi(w) ≤ (1 + /4) · (1 − 3 4)φ < (1 − /2)φ

with probability at least (1/2 + Ω())(1 − O()) = 1/2 + Ω().

References

[AIK08] Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Overcoming the `1

non-embeddability barrier: Algorithms for product metrics. Manuscript, 2008.

[CM05] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58–75, 2005.

(5)

Figure

Figure 1: Illustration of µ as a double sketch.

Références

Documents relatifs

Figure 4 Evolution of publications (number of articles using the term divided by the total number of all articles published that year) using the term “heavy metal” in the topic

Two loop renormalization group equations in general gauge field theories... Two loop renormalization group equations for soft supersymmetry

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

To enlarge the statistics of long fission events, one should have a very large fission barrier at the beginning of the chain to compensate the damping, and have a fission barrier

Summing up, it is not yet possible ta produce beams of proper intensity and in sufficiently high charge states of high-Z atoms in ion sources that are suitable for

This abrupt acceleration at low altitudes plays a decisive role in the overall transport of heavy plane- tary material, since, regardless of the initial parallel speed (be it well

Les paronymes sont des termes qui ont une forme semblable, mais qui n’ont pas le même sens.. Prenons, par exemple, les termes accident et

III, we will show how the physics of exotic atoms and of highly charged heavy ions have come together to provide at the same times accurate spectrometer calibrations that were