# Category Archives: Core Sets

## Big Data talk

Talk by a faculty candidate on Friday.

Big data computation model

• $n$ = number of vectors in $R^d$ seen so far
• $d$ = number of sensors (dimensionality)
• only $O(\log n)$ memory available on the system
• $M$ = number of cores

$k$-median queries

• input: a point set $P \subseteq R^d$
• key idea: replace many points with a single weighted representative point $c \in C$
• $\sum_{p \in P} \mathrm{dist}(p,Q) \approx \sum_{c \in C} w(c) \cdot \mathrm{dist}(c,Q)$, where $C$ is the approximating core set and $Q$ is any set of $k$ candidate centers.
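To make that approximation concrete, here is a minimal sketch (not the speaker's code): it compares the exact $k$-median cost over $P$ with the weighted cost over a small set $C$. The uniform subsample used for $C$ is only a stand-in for a real core set construction, and all sizes here are illustrative assumptions.

```python
import numpy as np

def kmedian_cost(points, Q, weights=None):
    """Weighted sum over points of the distance to the nearest center in Q."""
    if weights is None:
        weights = np.ones(len(points))
    # distance from every point to every center, then min over centers
    dists = np.linalg.norm(points[:, None, :] - Q[None, :, :], axis=2)
    return float(np.sum(weights * dists.min(axis=1)))

rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 3))   # n points in R^d
Q = rng.normal(size=(5, 3))        # k candidate centers

# stand-in "core set": a uniform subsample, each point weighted n/|C|
idx = rng.choice(len(P), size=200, replace=False)
C, w = P[idx], np.full(200, len(P) / 200)

print(kmedian_cost(P, Q))      # exact cost over all of P
print(kmedian_cost(C, Q, w))   # weighted estimate from the small set
```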

Online algorithm to build the core set: read in $1/\varepsilon$ points, compute $k$-means and best-fit lines, keep some of the furthest outliers, and compress that data. Read the next $1/\varepsilon$ points and repeat, then compute the core set of those two sets to get a new core set. Continue through the stream until all $n$ points have been read and a final core set remains. Finally, run an off-line heuristic on this small core set to obtain the $k$ centers.
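A rough sketch of that streaming loop, under stated assumptions: `compress()` below is a placeholder (importance subsampling that preserves total weight) standing in for the real core-set step of $k$-means, best-fit lines, and kept outliers, and the block and core-set sizes are illustrative.

```python
import numpy as np

def compress(points, weights, m):
    """Placeholder core-set step: importance-sample m weighted
    representatives, preserving total weight. The real step would run
    k-means / best-fit lines and keep the furthest outliers."""
    if len(points) <= m:
        return points, weights
    p = weights / weights.sum()
    idx = np.random.choice(len(points), size=m, replace=False, p=p)
    return points[idx], np.full(m, weights.sum() / m)

def stream_coreset(stream, m):
    """Read the stream block by block: compress each block, merge it with
    the running core set, and re-compress, so only O(m) points are ever
    held in memory at once."""
    core_pts, core_w = None, None
    for block in stream:
        pts, w = compress(block, np.ones(len(block)), m)
        if core_pts is None:
            core_pts, core_w = pts, w
        else:
            core_pts, core_w = compress(
                np.vstack([core_pts, pts]),
                np.concatenate([core_w, w]),
                m,
            )
    return core_pts, core_w

# usage: 10 blocks of 1000 points each, kept as a 100-point core set
rng = np.random.default_rng(1)
blocks = (rng.normal(size=(1000, 3)) for _ in range(10))
C, w = stream_coreset(blocks, 100)
print(C.shape, w.sum())   # (100, 3), total weight 10000.0
```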

This yields a $(1+\varepsilon)$-approximation: for every query $Q$, the weighted cost over the core set is within a $(1 \pm \varepsilon)$ factor of the true cost over $P$.

References to consider:

1. Feldman, Langberg. A unified framework for approximating and clustering data. STOC '11.
• Coreset size: $O(kd/\varepsilon^2)$