Large streams of data, mostly unlabeled.
Machine learning is an approach for fitting models to data. How does it work? Take the raw data, hypothesize a model, and use a learning algorithm to set the model parameters to match the data.
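A minimal sketch of that loop, assuming a toy 1-D dataset and a hypothesized linear model (the data and model class here are illustrative, not from the talk):

```python
import numpy as np

# Raw data: noisy samples from an unknown process (toy example).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)

# Hypothesize a model class: y = a*x + b (linear).
# Learning algorithm: least squares picks the parameters (a, b)
# that best match the data.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"fitted model: y = {a:.2f}*x + {b:.2f}")
```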
What makes a good machine learning algorithm?
- Performance guarantees (statistical consistency and finite-sample bounds)
- Real-world sensors, data, resources (high-dimensional, large-scale, ...)
For many types of dynamical systems, learning is provably intractable. You must choose the right class of model, or else all bets are off!
- Spectral Learning approaches to machine learning
- Topology: Encompasses the global shape of the data, and the relations between data points or groups within the global structure
- Google Pagerank Algorithm
- Example: Cosmic Crystallography
- Torus universe (zero curvature)
- Spherical universe (positive curvature)
- Other universe (negative curvature)
- Data: Hyperspectral Imagery
- Gradient Flow Algorithm
- for each data point, identify the neighbor with the highest density (an arrow points from that point to that neighbor)
- follow the arrows up to a density peak to identify clusters (a sketch follows below)
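A minimal sketch of that procedure, assuming a crude k-nearest-neighbor density estimate (the density estimator and neighborhood size are my assumptions, not from the talk):

```python
import numpy as np
from scipy.spatial import cKDTree

def gradient_flow_clusters(points, k=10):
    """Cluster by pointing each point at its densest neighbor
    and following the arrows to a local density peak."""
    tree = cKDTree(points)
    dists, nbrs = tree.query(points, k=k + 1)  # column 0 is the point itself
    # Crude density estimate: inverse of distance to the k-th neighbor.
    density = 1.0 / (dists[:, -1] + 1e-12)
    # Arrow: each point -> the neighbor (incl. itself) with highest density.
    parent = np.array([nbrs[i][np.argmax(density[nbrs[i]])]
                       for i in range(len(points))])

    def root(i):
        # Follow arrows until reaching a fixed point (a density peak).
        seen = set()
        while parent[i] != i and i not in seen:
            seen.add(i)
            i = parent[i]
        return i

    # Points sharing a root (peak) belong to the same cluster.
    return np.array([root(i) for i in range(len(points))])
```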
Interesting points from the talk
- Drugs in different countries have different names, so they had to do matching
- Use the Jaccard distance to find related pharmacies (a small sketch follows this list)
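A small sketch of Jaccard distance over drug-name sets; the pharmacy data here is made up for illustration:

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Toy example: drugs stocked by two pharmacies (names are made up).
pharmacy_a = {"paracetamol", "ibuprofen", "amoxicillin"}
pharmacy_b = {"acetaminophen", "ibuprofen", "amoxicillin"}
print(jaccard_distance(pharmacy_a, pharmacy_b))  # 0.5 -> fairly related
```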
Interesting points to look into for research:
- spinglass clustering algorithm
- visualizations for spinglass
SIE Colloquium by Matthew Gerber, Research Assistant Professor in the Systems and Information Engineering Department.
The PTL group has 2 faculty, 10 grad students, and collaborators at the health system.
Predicting crime using twitter:
- Conventional warfare had easily identified forces and open conflict with direct attacks (friends/enemies). The US has no conventional military peers. The US is dealing with asymmetric warfare (asymmetry in size, power, funding, influence). Our enemies have tactical advantages.
- Monitoring via hot-spot maps
- Problems: it's very specific to the area you're studying, and it's retrospective. You can't take yesterday's model and predict on a different place today.
- Overview of the approach
- Gather information on potential crime correlates (Incident Layer, Grid Layer, Demographic Layer, Spatial Layer). Ex: near a military outpost? a religious site? Income levels, ethnic tension, and prior history (each on a different layer). Want to take this information and create a statistical model.
- Text poses a problem: unstructured text abounds. These short tweets should be helpful: "The second blast was caused by a motorcycle bomb targeting a minibus in the Domeez area in the south of the city." That needs to be read by a human or an automated approach (this talk).
- Automatically integrate unstructured text: add some new layers from the previous model (Twitter Layer, Newswire Layer, ...).
- He's looking at tweets from the Chicago area (collecting in the basement of Olsson--time, text, etc.). A few topics: 1) flight (0.54), plane (0.2), terminal (0.11), ...; 2) shopping (0.39), buy (--), ...
- Mapping these topics to a heat map of Chicago. Can see where certain things are being talked about.
- Unsupervised topic modeling
- Latent Dirichlet allocation (Blei et al 2003)
- A generative story (2 topics). Outside of these documents live topics, and we can generate them. Do a similar thing with the documents (draw from a Dirichlet distribution to produce another distribution--the distribution of topics that the document consists of). To generate a word, pick a topic from that distribution, then pick a word from that topic (generate the document by repeating this process). (See the first sketch after this list.)
- Gather tweets from a neighborhood, tokenize and filter words, identify topic probabilities by LDA, compute the probability of crime \(p(\text{crime} \mid x)\) from the topic-probability vector \(x\). The question: what is \(p\)?
- \(p(\text{crime} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}\), i.e., a logistic function of the features.
- Find the beta coefficients \(\beta\) that give the best function (see the second sketch after this list)
- Establish training window (1/1/13-1/31/13)
- Lay down non-crime points
- Lay down crime points from the training window
- Compute topic neighborhoods
- compile training data (use Kernel Density Estimate (?) that adds historical data to the model)
- Want to find the smallest place boundaries with the highest crime levels.
- Do people actually talk about crime on twitter? (that's the big question-- but gangs do trash-talk about their crimes, etc)
- Baseline for comparison was kernel density estimation (based on the past: where is crime likely to occur?)
- The Twitter topic model + KDE does better than KDE alone for certain crime types: prostitution, battery.
- It does worse for other crime types: homicide, liquor law violations.
- AUC improvement for 22 of 25 crime types, with average peak improvement of 11 points
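First, a minimal sketch of the LDA generative story with two topics, using numpy (the vocabulary, hyperparameters, and word counts are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["flight", "plane", "terminal", "shopping", "buy", "mall"]

# Topics live outside the documents: each topic is a distribution over words.
topics = rng.dirichlet(alpha=np.ones(len(vocab)) * 0.5, size=2)

def generate_document(n_words=8):
    # Each document draws its own distribution over topics from a Dirichlet.
    theta = rng.dirichlet(alpha=np.ones(2) * 0.5)
    words = []
    for _ in range(n_words):
        z = rng.choice(2, p=theta)               # pick a topic
        w = rng.choice(len(vocab), p=topics[z])  # pick a word from that topic
        words.append(vocab[w])
    return words

print(generate_document())
```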
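Second, a hedged end-to-end sketch of the prediction pipeline: topic probabilities plus a KDE feature feed a logistic regression, matching the \(\beta\)-fitting step above. The toy data, feature construction, and library choices are my assumptions, not the speaker's code:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy stand-ins: grid points with LDA topic probabilities and a 0/1
# crime label (real features come from tweets in each neighborhood).
n, n_topics = 500, 5
topic_probs = rng.dirichlet(np.ones(n_topics), size=n)
coords = rng.uniform(0, 10, size=(n, 2))
crime = (rng.random(n) < topic_probs[:, 0]).astype(int)

# KDE over historical crime locations adds the "where has crime
# happened before?" signal to the model.
kde = gaussian_kde(coords[crime == 1].T)
kde_feature = kde(coords.T)

X = np.column_stack([topic_probs, kde_feature])
model = LogisticRegression(max_iter=1000).fit(X, crime)

# p(crime | x) = 1 / (1 + exp(-(b0 + beta . x)))
p = model.predict_proba(X)[:, 1]
print("AUC:", roc_auc_score(crime, p))
```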
- Clinical Practice Guidelines
- Want to formalize using natural language processing
- Sentences have a specific structure: they're using NLP and parsing the English sentences. (concern: the context sensitivity of English)
- Want to annotate the text with semantic labels (not XML, though); a sketch follows this section.
- Precision: temporal identifiers are around 28%; other label types average around 50%, with the top around 75-80%
- Warning: need to make sure that fully automated isn't used alone, as there could be things that automated analysis would miss that could be life-threatening.
- The big picture
- Want to get structured information from unstructured text data through Natural Language Processing
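A minimal sketch of pulling semantic labels out of guideline text, using spaCy's pretrained English pipeline as a stand-in for the group's annotator (the talk didn't specify tooling, and the sentence is made up):

```python
# requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

# Stand-in annotator: spaCy's pretrained pipeline tags spans with
# semantic labels (a clinical model would be needed for drug names).
nlp = spacy.load("en_core_web_sm")

text = ("Administer 325 mg aspirin within 30 minutes of arrival, "
        "then reassess the patient after 2 hours.")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g. temporal spans as TIME/DATE
```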
Used to handle searching:
- Percolator: Google's incremental update system for the search index
Talk by a faculty candidate on Friday.
Big data computation model
- \(n\) = number of vectors in \(\mathbb{R}^d\) seen so far
- \(d\) = number of sensors (dimensionality)
- only have \(M\) memory available on the system
- \(c\) = number of cores
\(k\)-median queries
- input: a set \(P \subseteq \mathbb{R}^d\) of \(n\) points and an integer \(k \ge 1\)
- key idea: replace many points by a weighted average point \(p\) with weight \(w(p)\)
- \((1-\varepsilon)\,\mathrm{cost}(P, Q) \le \sum_{p \in S} w(p)\,\mathrm{dist}(p, Q) \le (1+\varepsilon)\,\mathrm{cost}(P, Q)\) for every set \(Q\) of \(k\) centers, where \(S\) is the approximated core set and \(\mathrm{cost}(P, Q) = \sum_{p \in P} \mathrm{dist}(p, Q)\).
Online algorithm to build a core set: read in \(M\) points, compute \(k\)-means and best-fit lines, then choose some of the furthest outliers, then compress that data. Read the next \(M\) points, repeat, then compute the core set of those two sets to get a new core set. Then move on to the next points until you have read all \(n\) and have a final core set. Then run an off-line heuristic on this small core set.
Does a \((1+\varepsilon)\) approximation.
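A simplified sketch of that streaming scheme, using weighted \(k\)-means centers as the per-block compression step. This is only an illustration: the real core sets in the references below also keep the furthest outliers and carry the \((1 \pm \varepsilon)\) guarantee, which this crude stand-in does not:

```python
import numpy as np
from sklearn.cluster import KMeans

def compress(points, weights, size):
    """Crude stand-in for a core set: weighted k-means centers,
    each weighted by the total weight of its cluster."""
    km = KMeans(n_clusters=size, n_init=10).fit(points, sample_weight=weights)
    w = np.array([weights[km.labels_ == j].sum() for j in range(size)])
    return km.cluster_centers_, w

def streaming_coreset(stream, block=200, size=20):
    """Read `block` points at a time, compress them, then compute the
    core set of the new block and the running core set together."""
    running = None
    for start in range(0, len(stream), block):
        pts = stream[start:start + block]
        cur = compress(pts, np.ones(len(pts)), size)
        if running is not None:
            merged_pts = np.vstack([running[0], cur[0]])
            merged_w = np.concatenate([running[1], cur[1]])
            cur = compress(merged_pts, merged_w, size)
        running = cur
    return running

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 3))
centers, weights = streaming_coreset(data)
# Run the off-line heuristic (e.g. weighted k-means) on this small set.
```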
References to consider:
- Feldman, Langberg. STOC '11
- Feldman, Monemizadeh, Sohler. SoCG '07
- Har-Peled, Mazumdar. STOC '04
- Feldman, Wu, Julian, Sung, and Rus. GPS compression