Big Data talk

Talk by a faculty candidate on Friday.

Big data computation model

  • \(n\) = number of vectors in \(R^d\) seen so far
  • \(d\) = number of sensors (dimensionality)
  • only have \(\log n\) memory available on the system
  • \(M\) = number of cores

\(k\)-median queries

  • input: a set of points \(P \subset R^d\)
  • key idea: replace many points by a weighted representative point \(c \in C\)
  • \(\sum_{p \in P} dist(p,Q) \approx \sum_{c \in C} w(c) \cdot dist(c,Q)\), where \(C\) is the coreset and \(Q\) is any candidate set of \(k\) centers
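
As an illustration of this approximation, here is a minimal sketch in Python. The coreset construction itself is not shown: `kmedian_cost`, the uniform stand-in sample `C`, and its weights `w` are assumptions made for the example, not anything from the talk.

```python
# Sketch of the coreset cost approximation for k-median (Euclidean distance assumed).
import numpy as np

def kmedian_cost(points, centers, weights=None):
    """Sum of (weighted) distances from each point to its nearest center."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    return nearest.sum() if weights is None else (weights * nearest).sum()

rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 3))   # n points in R^d
Q = rng.normal(size=(5, 3))        # a candidate set of k centers (the "query")

# Stand-in "coreset": a uniform sample of m points, each weighted n/m.
# (A real coreset construction would choose C and w far more carefully.)
m = 500
idx = rng.choice(len(P), size=m, replace=False)
C, w = P[idx], np.full(m, len(P) / m)

print(kmedian_cost(P, Q))      # exact cost:    sum_{p in P} dist(p, Q)
print(kmedian_cost(C, Q, w))   # approximation: sum_{c in C} w(c) * dist(c, Q)
```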

Online algorithm to build a coreset: read in \(1/\varepsilon\) points, compute \(k\)-means centers and best-fit lines, keep some of the furthest outliers, and compress that chunk into a small coreset. Read the next \(1/\varepsilon\) points, repeat, then compute a coreset of the union of those two coresets to get a new coreset (merge-and-reduce). Move on to the next points until all \(n\) have been read and a single final coreset remains. Then run an off-line heuristic on this coreset to extract the \(k\) centers.
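A hedged sketch of this streaming loop, assuming a placeholder `reduce_to_coreset` that just keeps a weighted subsample; the real compression step from the talk (k-means centers, best-fit lines, outliers) is not reproduced here.

```python
# Sketch of the streaming merge-and-reduce loop described above.
# Assumption: `reduce_to_coreset` is a placeholder for the real compression step;
# here it keeps a weighted subsample so the control flow is runnable.
import numpy as np

def reduce_to_coreset(points, weights, size, rng):
    """Placeholder compression: weighted subsample of at most `size` points."""
    if len(points) <= size:
        return points, weights
    probs = weights / weights.sum()
    idx = rng.choice(len(points), size=size, replace=False, p=probs)
    # reweight so the total weight (= number of original points) is preserved
    new_w = weights[idx] * (weights.sum() / weights[idx].sum())
    return points[idx], new_w

def stream_coreset(stream, chunk_size, coreset_size, rng):
    """Read chunks, compress each, and repeatedly merge + reduce with the running coreset."""
    core_p = np.empty((0, stream.shape[1]))
    core_w = np.empty(0)
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        chunk_p, chunk_w = reduce_to_coreset(chunk, np.ones(len(chunk)), coreset_size, rng)
        # merge with the current coreset, then reduce the union again
        merged_p = np.vstack([core_p, chunk_p])
        merged_w = np.concatenate([core_w, chunk_w])
        core_p, core_w = reduce_to_coreset(merged_p, merged_w, coreset_size, rng)
    return core_p, core_w

rng = np.random.default_rng(1)
data = rng.normal(size=(100_000, 2))   # the stream of n points
C, w = stream_coreset(data, chunk_size=1_000, coreset_size=200, rng=rng)
print(C.shape, w.sum())                # small coreset; total weight equals n
```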

This gives a \((1+\varepsilon)\)-approximation.
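Why the coreset guarantee carries over to the final answer (a standard argument, sketched here; not spelled out in the talk). Write \(cost_P(Q) = \sum_{p \in P} dist(p,Q)\) and \(cost_C(Q) = \sum_{c \in C} w(c) \cdot dist(c,Q)\). If \(C\) is an \(\varepsilon\)-coreset, i.e. \(|cost_C(Q) - cost_P(Q)| \le \varepsilon \cdot cost_P(Q)\) for every set \(Q\) of \(k\) centers, and \(Q_C\) is optimal for \(C\) while \(Q^*\) is optimal for \(P\), then

\(cost_P(Q_C) \le \tfrac{1}{1-\varepsilon} cost_C(Q_C) \le \tfrac{1}{1-\varepsilon} cost_C(Q^*) \le \tfrac{1+\varepsilon}{1-\varepsilon} cost_P(Q^*) = (1 + O(\varepsilon)) \cdot cost_P(Q^*),\)

so solving the problem on the small coreset yields a \((1+O(\varepsilon))\)-approximation on the full data.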

References to consider:

  1. Feldman, Langberg, STOC '11
  2. Feldman, Sohler, Monemizadeh, SoCG '07
    • Coreset size: \(kd/\varepsilon^2\)
  3. Har-Peled, Mazumdar, '04
  4. Feldman, Wu, Julian, Sung, and Rus, GPS Compression
