Category Archives: Talks

Boots: New Machine Learning Approaches to Modeling Dynamical Systems

Large streams of data, mostly unlabeled.

Machine learning approach to fit models to data. How does it work? Take the raw data, hypothesize a model, use a learning algorithm to get the model parameters to match the data.

What makes a good machine learning algorithm?

  • Performance guarantees: \(\theta \approx \theta^*\) (statistical consistency and finite sample bounds)
  • Real-world sensors, data, resources (high-dimensional, large-scale, ...)
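
The fit-a-model workflow and the consistency guarantee \(\theta \approx \theta^*\) can be illustrated with a toy example. This is only a sketch under stated assumptions: the scalar system \(x_{t+1} = \theta^* x_t + \text{noise}\), the least-squares learner, and the names `simulate` and `fit` are all made up for illustration and are not from the talk.

```python
import random

def simulate(theta_star, n, noise=0.1, seed=0):
    """Generate n steps of x[t+1] = theta_star * x[t] + Gaussian noise."""
    rng = random.Random(seed)
    xs = [1.0]
    for _ in range(n):
        xs.append(theta_star * xs[-1] + rng.gauss(0.0, noise))
    return xs

def fit(xs):
    """Least-squares estimate of theta from consecutive pairs (x[t], x[t+1])."""
    num = sum(a * b for a, b in zip(xs, xs[1:]))
    den = sum(a * a for a in xs[:-1])
    return num / den

theta_star = 0.8  # hypothetical "true" parameter
for n in (10, 100, 10000):
    # The estimate should approach theta_star as the sample grows.
    print(n, round(fit(simulate(theta_star, n)), 3))
```

Here the model class matches the process that generated the data, which is what makes the estimate converge; with the wrong model class there is no such guarantee.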

For many types of dynamical systems, learning is provably intractable. You must choose the right class of model, or else all bets are off!

Look into:

  • Spectral Learning approaches to machine learning

Filed under Big Data, Machine Learning, Talks

Basener: Topological and Bayesian Methods in Data Science

  • Topology: Encompasses the global shape of the data, and the relations between data points or groups within the global structure
    • Google PageRank algorithm
    • Example: Cosmic Crystallography
      • Torus universe (zero curvature)
      • Spherical universe (positive curvature)
      • Other universe (negative curvature)
  • Data: Hyperspectral Imagery
  • Gradient Flow Algorithm
    • identify neighbor with highest density for each data point (arrow points from that point to that particular neighbor)
      • gives a data field
    • follow the arrows to identify clusters
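
The gradient flow steps above can be sketched as follows. This is a minimal sketch, not the speaker's implementation: it assumes density is the number of points within a fixed `radius`, that neighbors are the points within that same radius, and that ties are broken by index. The function name is hypothetical.

```python
import math

def gradient_flow_clusters(points, radius):
    """Label each point with the index of the density peak its arrows lead to."""
    n = len(points)
    # Neighbors of i: every point within `radius` (including i itself).
    near = [[j for j in range(n) if math.dist(points[i], points[j]) <= radius]
            for i in range(n)]
    # Density of a point: how many points fall inside its neighborhood.
    dens = [len(nb) for nb in near]
    # Arrow: each point points at its densest neighbor (itself if it is a peak).
    # Ties go to the lower index, so following arrows can never loop.
    arrow = [max(nb, key=lambda j: (dens[j], -j)) for nb in near]
    labels = []
    for start in range(n):
        i = start
        while arrow[i] != i:  # follow the arrows until a peak is reached
            i = arrow[i]
        labels.append(i)
    return labels

pts = [(0, 0), (0.1, 0), (0.2, 0), (5, 5), (5.1, 5), (5.2, 5)]
print(gradient_flow_clusters(pts, radius=0.15))
```

Points that end at the same peak form one cluster, so the number of clusters falls out of the data rather than being chosen up front.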


Filed under Big Data, Talks

Christin: Seeking a fix: Measuring, analyzing and disrupting unlicensed online drug sales

Interesting points from the talk

  • Drugs in different countries have different names, so they had to do matching
  • Use the Jaccard distance to find related pharmacies
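
For reference, the Jaccard distance between two sets is one minus the size of their intersection over the size of their union. A minimal sketch, assuming each pharmacy is represented by the set of drugs it lists (the notes do not say what the sets actually were, and the inventories below are invented):

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical inventories, for illustration only.
pharmacy_a = {"drug1", "drug2", "drug3"}
pharmacy_b = {"drug2", "drug3", "drug4"}
pharmacy_c = {"drug9"}

print(jaccard_distance(pharmacy_a, pharmacy_b))  # shares 2 of 4 distinct drugs
print(jaccard_distance(pharmacy_a, pharmacy_c))  # shares nothing
```

A small distance means two pharmacies list nearly the same items, which is the sense of "related" assumed here.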

Interesting points to look into for research:

  • spinglass clustering algorithm
  • visualizations for spinglass

Filed under Big Data, Talks

Natural Language Processing for Predictive Technology (Matthew Gerber)

SIE Colloquium by Matthew Gerber, Research Assistant Professor in the Systems and Information Engineering Department.

The PTL group has 2 faculty, 10 grad students, and collaborators at the health system.

Predicting crime using Twitter:

  • Conventional warfare had easily identified forces and open conflict with direct attacks (friends/enemies). The US has no conventional military peers. The US is dealing with asymmetric warfare (asymmetry in size, power, funding, influence). Our enemies have tactical advantages.
  • Monitoring via hot-spot maps
    1. Problems: very specific to the area you're studying, and retrospective. Can't take yesterday's model and predict on a different place today.
  • Overview of the approach
    1. Gather information on potential crime correlates (Incident Layer, Grid Layer, Demographic Layer, Spatial Layer). Ex: near a military outpost? A religious site? Income levels, ethnic tension, and prior history (each on a different layer). Want to take this information and create a statistical model.
    2. Text poses a problem: unstructured text abounds. These short tweets should be helpful: "The second blast was caused by a motorcycle bomb targeting a minibus in the Domeez area in the south of the city." That needs to be read by a human or by an automated approach (this talk).
    3. Automatically integrate unstructured text: add some new layers from the previous model (Twitter Layer, Newswire Layer, ...).
  • He's looking at tweets from the Chicago area (collected in the basement of Olsson: time, text, etc.). A few topics: 1) flight(0.54), plane(0.2), terminal(0.11),... ; 2) shopping (0.39), buy(--),...
    1. Mapping these \(n\) topics to heat map of Chicago. Can see where certain things are being talked about.
    2. Unsupervised topic modeling
      • Latent Dirichlet allocation (Blei et al 2003)
      • A generative story (2 topics). Topics live outside the documents; each topic is a distribution over words, itself drawn from a Dirichlet distribution. Do a similar thing for each document: draw from a Dirichlet to get the distribution of topics that the document consists of. To generate a word, pick a topic from that distribution, then pick a word from that topic. Repeat this process to generate the document.
      • Gather tweets from a neighborhood, tokenize and filter words, identify topic probabilities by LDA, compute probability of crime \(P(Crime) = F(0.15,0.74,...,f_n)\). The question: what is \(F\)?
        1. \(\frac{1}{1+e^{-\left(\beta_0 + \sum_{b=1}^n \beta_b f_b(p)\right)}}\) (the logistic function applied to a weighted sum of the features \(f_b\)).
        2. Find the beta coefficients that give the best function
      • Training
        • Establish training window (1/1/13-1/31/13)
        • Lay down non-crime points
        • lay down crime points from training window
        • Compute topic neighborhoods
        • compile training data (use Kernel Density Estimate (?) that adds historical data to the model)
      • Evaluation
        • Want to find the smallest place boundaries with the highest crime levels.
        • Do people actually talk about crime on Twitter? That's the big question, but gangs do trash-talk about their crimes, etc.
        • Baseline for comparison was the kernel density estimation (based on past, where is crime likely to occur?)
        • The Twitter model + KDE does better than KDE alone for certain crime types: prostitution, battery.
        • It does worse for other crime types: homicide, liquor law violations.
        • AUC improvement for 22 of 25 crime types, with average peak improvement of 11 points
  • Clinical Practice Guidelines
    • Want to formalize using natural language processing
    • Sentences have a specific order: they're using NLP and parsing English sentences. (concern: context sensitivity of English)
    • Want to annotate the text with semantic labels (not XML, though).
    • Precisions: 28% for temporal identifiers; others average around 50%, with the top around 75-80%
    • Warning: need to make sure that fully automated analysis isn't used alone, as it could miss things that are life-threatening.
  • The big picture
    • Want to get structured information from unstructured text data through Natural Language Processing
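
Returning to the crime-prediction model in the Twitter part of the talk: a minimal sketch of the scoring step, assuming ordinary logistic regression with a weighted sum of topic probabilities in the exponent. The function name `crime_probability` and the coefficient values are invented for illustration; only the feature values 0.15 and 0.74 appear in the notes.

```python
import math

def crime_probability(beta0, betas, features):
    """Logistic model: 1 / (1 + exp(-(beta0 + sum_b beta_b * f_b)))."""
    z = beta0 + sum(b * f for b, f in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))

topic_probs = [0.15, 0.74]   # topic probabilities for one location (from LDA)
beta0, betas = -2.0, [1.5, 3.0]  # hypothetical coefficients, not fitted values

p = crime_probability(beta0, betas, topic_probs)
print(round(p, 3))
```

Training then amounts to choosing the beta coefficients so that these probabilities match the crime and non-crime points laid down in the training window.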

Filed under Natural Language Processing, Talks

Google Tech Talk Fall 2013

Used to handle searching:

  • MapReduce
  • BigTable
  • Percolator: incremental update system

Filed under Big Data, Data Centers, Talks

Big Data talk

Talk by Faculty candidate on Friday.

Big data computation model

  • \(n\) = number of vectors in \(R^d\) seen so far
  • \(d\) = number of sensors (dimensionality)
  • only have \(\log n\) memory available on the system
  • \(M\) = number of cores

\(k\)-median queries

  • input: a point set \(P \subseteq R^d\)
  • key idea: replace many points by a weighted average point \(c \in C\)
  • \(\sum_{p \in P} dist(p,Q) \approx \sum_{c \in C} w(c) \cdot dist(c,Q)\), where \(C\) is the core set approximating \(P\) and \(Q\) is a query set of \(k\) centers.
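
To make the weighted-sum idea concrete, here is a toy sketch in one dimension. It is not the construction from the talk: it assumes a naive grid compression (each cell is replaced by its mean, weighted by its point count), which carries no \(1+\varepsilon\) guarantee. The names `cost` and `grid_compress` and the data are invented for illustration.

```python
import random
from collections import defaultdict

def cost(weighted_points, centers):
    """Sum over points of weight * distance to the nearest center (1-D)."""
    return sum(w * min(abs(x - q) for q in centers) for x, w in weighted_points)

def grid_compress(points, cell):
    """Replace all points in a grid cell by their mean, weighted by the count."""
    buckets = defaultdict(list)
    for x in points:
        buckets[int(x // cell)].append(x)
    return [(sum(b) / len(b), len(b)) for b in buckets.values()]

rng = random.Random(1)
P = [rng.uniform(0.0, 10.0) for _ in range(1000)]  # original points
C = grid_compress(P, cell=0.5)                     # weighted representatives
Q = [2.0, 7.5]                                     # a query: k = 2 centers

full = cost([(x, 1) for x in P], Q)
approx = cost(C, Q)
print(len(P), len(C), round(full, 2), round(approx, 2))
```

By the triangle inequality, moving a point to its cell mean changes its distance to the nearest center by at most the cell width, so the two costs differ by at most (number of points) × (cell width) for any query \(Q\).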

Online algorithm to build the core set: read in \(1/\varepsilon\) points, compute \(k\)-means and best-fit lines, choose some of the furthest outliers, and compress that data. Read the next \(1/\varepsilon\) points and repeat, then compute the core set of those two sets to get a new core set. Continue with the next points until all \(n\) have been read and a final core set remains. Then run an off-line heuristic on this core set of size \(k\).

The result is a \((1+\varepsilon)\)-approximation.
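
The read-compress-merge loop of the online algorithm can be sketched as below. This is a rough sketch under loud assumptions: the compression step here (weighted means of sorted runs in one dimension) is a stand-in for the \(k\)-means / best-fit-lines / outlier step from the talk, and the names `reduce_weighted` and `stream_coreset` are hypothetical. It shows only the streaming structure, not the approximation guarantee.

```python
def reduce_weighted(wpoints, m):
    """Compress weighted 1-D points to at most m weighted means of sorted runs."""
    wpoints = sorted(wpoints)
    if len(wpoints) <= m:
        return wpoints
    size = -(-len(wpoints) // m)  # ceiling division: points per run
    out = []
    for i in range(0, len(wpoints), size):
        run = wpoints[i:i + size]
        w = sum(wt for _, wt in run)
        out.append((sum(x * wt for x, wt in run) / w, w))
    return out

def stream_coreset(stream, chunk_size, m):
    """Keep a summary of at most m weighted points while reading chunk by chunk."""
    coreset, chunk = [], []
    for x in stream:
        chunk.append((x, 1.0))
        if len(chunk) == chunk_size:
            # Compress the new chunk, merge with the running summary, compress again.
            coreset = reduce_weighted(coreset + reduce_weighted(chunk, m), m)
            chunk = []
    if chunk:  # leftover points at the end of the stream
        coreset = reduce_weighted(coreset + reduce_weighted(chunk, m), m)
    return coreset

summary = stream_coreset((float(i) for i in range(1000)), chunk_size=50, m=10)
print(len(summary), sum(w for _, w in summary))
```

Only one chunk and one summary are held in memory at a time, so memory use depends on `chunk_size` and `m` rather than on \(n\). Repeatedly re-compressing the same summary compounds error in this simple sequential version.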

References to consider:

  1. Feldman, Langberg, STOC'11
  2. Feldman, Sohler, Monemizadeh, SoCG'07
    • Coreset size: \(kd/\varepsilon^2\)
  3. Har-Peled, Mazumdar, 04
  4. Feldman, Wu, Julian, Sung and Rus. GPS Compression

Filed under Big Data, Core Sets, Talks