Talk by a faculty candidate on Friday.

Big data computation model

- \(n\) = number of vectors in \(R^d\) seen so far
- \(d\) = number of sensors (dimensionality)
- only \(O(\log n)\) memory is available on the system (streaming model)
- \(M\) = number of cores

\(k\)-median queries

- input: a point set \(P \subseteq R^d\)
- key idea: replace many points by a few weighted representative points \(c \in C\)
- \(\sum_{p \in P} \mathrm{dist}(p,Q) \approx \sum_{c \in C} w(c) \cdot \mathrm{dist}(c,Q)\), where \(C\) is the approximating coreset.
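The identity above can be checked numerically. This is a minimal sketch: the candidate centers `Q`, the point counts, and the uniform-sample "coreset" are all illustrative assumptions (a real coreset is constructed far more carefully, as described below), but the weighted-cost comparison is exactly the one in the formula.

```python
import numpy as np

def kmedian_cost(points, centers, weights=None):
    """Sum of (weighted) distances from each point to its nearest center."""
    # pairwise distances: shape (#points, #centers)
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.min(axis=1)
    return nearest.sum() if weights is None else (weights * nearest).sum()

rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 2))   # n = 1000 points in R^2
Q = rng.normal(size=(3, 2))      # a fixed candidate center set (k = 3)

# Toy "coreset": a uniform sample of 100 points, each carrying weight n/|C|
# so that total weight equals n.
idx = rng.choice(len(P), size=100, replace=False)
C, w = P[idx], np.full(100, len(P) / 100)

full_cost = kmedian_cost(P, Q)
coreset_cost = kmedian_cost(C, Q, w)
# coreset_cost should be close to full_cost for this Q
```

Even this naive sample tracks the full cost reasonably well for a fixed \(Q\); the point of a true coreset is that the guarantee holds simultaneously for *every* candidate center set \(Q\).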

Online algorithm to build the coreset: read in \(1/\varepsilon\) points, compute \(k\)-means and best-fit lines, keep some of the furthest outliers, and compress the rest. Read the next \(1/\varepsilon\) points and repeat; then compute the coreset of those two sets to get a new, merged coreset. Continue with the next points until all \(n\) have been read and a single final coreset remains. Finally, run an off-line \(k\)-median heuristic on this final coreset to obtain the \(k\) centers.
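The merge-and-reduce structure above can be sketched as follows. This is a hedged illustration: `reduce_to_coreset` is a stand-in for the real compression step (\(k\)-means, best-fit lines, outlier handling), replaced here by a simple weight-preserving subsample, and the block sizes are arbitrary. What the sketch does show faithfully is the binary merge pattern, which keeps only \(O(\log n)\) partial coresets in memory at any time.

```python
import numpy as np

def reduce_to_coreset(points, weights, size, rng):
    """Stand-in compression step: weighted subsample preserving total weight.
    (The real step uses k-means, best-fit lines, and outlier selection.)"""
    if len(points) <= size:
        return points, weights
    probs = weights / weights.sum()
    idx = rng.choice(len(points), size=size, replace=False, p=probs)
    # rescale so the total weight of the coreset still equals the total
    # weight of the points it represents
    new_w = weights[idx] * (weights.sum() / weights[idx].sum())
    return points[idx], new_w

def stream_coreset(stream, coreset_size, rng):
    """Merge-and-reduce over a stream of point blocks.
    levels[i] holds at most one coreset covering 2**i blocks, so memory
    usage is O(log n) coresets, like a binary counter."""
    levels = []
    for block in stream:
        cur = reduce_to_coreset(block, np.ones(len(block)), coreset_size, rng)
        i = 0
        while i < len(levels) and levels[i] is not None:
            # two coresets at the same level: merge, then re-compress
            merged_pts = np.vstack([levels[i][0], cur[0]])
            merged_w = np.concatenate([levels[i][1], cur[1]])
            cur = reduce_to_coreset(merged_pts, merged_w, coreset_size, rng)
            levels[i] = None
            i += 1
        if i == len(levels):
            levels.append(cur)
        else:
            levels[i] = cur
    # collapse the remaining levels into the final coreset
    pts = np.vstack([l[0] for l in levels if l is not None])
    w = np.concatenate([l[1] for l in levels if l is not None])
    return reduce_to_coreset(pts, w, coreset_size, rng)

rng = np.random.default_rng(1)
blocks = [rng.normal(size=(50, 2)) for _ in range(16)]  # n = 800, streamed
pts, w = stream_coreset(blocks, coreset_size=40, rng=rng)
# final coreset has at most 40 points whose weights sum to n
```

The off-line \(k\)-median heuristic would then be run on `(pts, w)` instead of all \(n\) points.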

This gives a \((1+\varepsilon)\)-approximation.

References to consider:

- Feldman, Langberg, STOC'11
- Feldman, Sohler, Monemizadeh, SoCG'07
  - coreset size: \(kd/\varepsilon^2\)
- Har-Peled, Mazumdar, STOC'04
- Feldman, Wu, Julian, Sung, Rus. GPS compression