Large streams of data, mostly unlabeled.
Machine learning approach to fit models to data. How does it work? Take the raw data, hypothesize a model, use a learning algorithm to get the model parameters to match the data.
What makes a good machine learning algorithm?
- Performance guarantees (statistical consistency and finite-sample bounds)
- Real-world sensors, data, resources (high-dimensional, large-scale, ...)
For many types of dynamical systems, learning is provably intractable. You must choose the right class of model, or else all bets are off!
- Spectral Learning approaches to machine learning
- Topology: Encompasses the global shape of the data, and the relations between data points or groups within the global structure
- Google PageRank algorithm
- Example: Cosmic Crystallography
- Torus universe (zero curvature)
- Spherical universe (positive curvature)
- Other universe (negative curvature)
- Data: Hyperspectral Imagery
- Gradient Flow Algorithm
- identify neighbor with highest density for each data point (arrow points from that point to that particular neighbor)
- follow the arrows to identify clusters
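The two steps above can be sketched in a few lines of numpy; the density estimate (neighbor count within a radius) and the tie-breaking rule are my own simplifications, not the speaker's exact method:

```python
import numpy as np

def gradient_flow_clusters(points, radius):
    """Toy gradient-flow clustering: each point gets an arrow to its
    highest-density neighbor within `radius`; following arrows to a
    local density peak yields the cluster labels."""
    n = len(points)
    # Crude density estimate: number of neighbors within `radius`.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    density = (dists < radius).sum(axis=1)

    # Arrow: point i -> its neighbor of highest density (ties broken
    # toward the lowest index so the chains cannot cycle).
    parent = np.arange(n)
    for i in range(n):
        neighbors = np.where(dists[i] < radius)[0]
        best = neighbors[np.argmax(density[neighbors])]
        if density[best] > density[i] or (density[best] == density[i] and best < i):
            parent[i] = best

    # Follow the arrows up to a density peak; peaks label the clusters.
    labels = np.empty(n, dtype=int)
    for i in range(n):
        j = i
        while parent[j] != j:
            j = parent[j]
        labels[i] = j
    return labels

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
print(gradient_flow_clusters(pts, 0.5))  # -> [0 0 0 3 3 3]
```

Two well-separated blobs end up labeled by their own density peaks.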
Filed under Big Data, Talks
Interesting points from the talk
- Drugs in different countries have different names, so they had to do matching
- Use the Jaccard distance to find related pharmacies
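Jaccard distance is simple to compute; a quick sketch with made-up pharmacy inventories:

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical example: compare two pharmacies by the drugs they stock.
pharmacy_a = {"aspirin", "ibuprofen", "paracetamol"}
pharmacy_b = {"aspirin", "paracetamol", "amoxicillin"}
print(jaccard_distance(pharmacy_a, pharmacy_b))  # 1 - 2/4 = 0.5
```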
Interesting points to look into for research:
- spinglass clustering algorithm
- visualizations for spinglass
Filed under Big Data, Talks
SIE Colloquium by Matthew Gerber, Research Assistant Professor in the Systems and Information Engineering Department.
The PTL group has 2 faculty, 10 grad students, and collaborators at the health system.
Predicting crime using twitter:
- Conventional warfare had easily identified forces and open conflict with direct attacks (friends/enemies). The US has no conventional military peers. The US is dealing with asymmetric warfare (asymmetry in size, power, funding, influence). Our enemies have tactical advantages.
- Monitoring via hot-spot maps
- Problems: very specific to the area you're studying, and it's retrospective. Can't take yesterday's model and predict on a different place today.
- Overview of the approach
- Gather information on potential crime correlates (Incident Layer, Grid Layer, Demographic Layer, Spatial Layer). Ex: near a military outpost? religious site? Income levels and ethnic tension, and prior history (each on a different layer). Want to take this information and create a statistical model.
- Text presents a problem: unstructured text abounds. These short tweets should be helpful: "The second blast was caused by a motorcycle bomb targeting a minibus in the Domeez area in the south of the city." That needs to be read by a human or an automated approach (this talk).
- Automatically integrate unstructured text: add some new layers from the previous model (Twitter Layer, Newswire Layer, ...).
- He's looking at tweets from the Chicago area (collecting in the basement of Olsson--time, text, etc). A few topics: 1) flight (0.54), plane (0.2), terminal (0.11), ... ; 2) shopping (0.39), buy (--), ...
- Mapping these topics to a heat map of Chicago. Can see where certain things are being talked about.
- Unsupervised topic modeling
- Latent Dirichlet allocation (Blei et al 2003)
- A generative story (2 topics). Topics live outside the documents. For each document, draw a distribution over topics from a Dirichlet; then, for each word, pick a topic from that distribution and generate a word from the chosen topic (repeat this process to produce the document).
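The generative story can be sketched with numpy; the two topics and the tiny vocabulary below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical topics, each a distribution over a tiny vocabulary.
vocab = ["flight", "plane", "terminal", "shopping", "buy", "mall"]
topics = np.array([
    [0.5, 0.3, 0.2, 0.0, 0.0, 0.0],   # a "travel" topic
    [0.0, 0.0, 0.0, 0.4, 0.4, 0.2],   # a "shopping" topic
])

def generate_document(n_words, alpha=(1.0, 1.0)):
    """LDA's generative story: draw the document's topic mixture from a
    Dirichlet, then for each word pick a topic, then a word from it."""
    theta = rng.dirichlet(alpha)              # document's topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)  # topic assignment for the word
        w = rng.choice(len(vocab), p=topics[z])
        words.append(vocab[w])
    return words

print(generate_document(8))
```

Inference (what LDA actually does) runs this story in reverse: given only the words, recover the topics and each document's mixture.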
- Gather tweets from a neighborhood, tokenize and filter words, identify topic probabilities by LDA, compute probability of crime \(p = f(x)\). The question: what is \(f\)?
- \(f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}\), a logistic function of the topic features.
- Find the beta coefficients that give the best function
- Establish training window (1/1/13-1/31/13)
- Lay down non-crime points
- lay down crime points from training window
- Compute topic neighborhoods
- compile training data (use a Kernel Density Estimate (?) that adds historical data to the model)
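The crime-probability step reads like a logistic regression on the LDA topic proportions; a minimal numpy sketch, with invented feature values and labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Find the beta coefficients by plain gradient descent on the
    logistic loss (an intercept column is prepended to X)."""
    X = np.hstack([np.ones((len(X), 1)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ beta)
        beta -= lr * X.T @ (p - y) / len(y)
    return beta

# Hypothetical training data: rows are neighborhoods, columns are LDA
# topic proportions; y = 1 if a crime occurred in the training window.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
y = np.array([1, 1, 0, 0])
beta = fit_logistic(X, y)

x_new = np.array([1.0, 0.85, 0.15])  # intercept + topic proportions
print(sigmoid(x_new @ beta))         # estimated crime probability
```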
- Want to find the smallest place boundaries with the highest crime levels.
- Do people actually talk about crime on twitter? (that's the big question-- but gangs do trash-talk about their crimes, etc)
- Baseline for comparison was the kernel density estimation (based on past, where is crime likely to occur?)
- They do well with twitter data model + KDE over just KDE for certain results: prostitution, battery.
- They are worse with other topics/crime: homicide, liquor law violations.
- AUC improvement for 22 of 25 crime types, with average peak improvement of 11 points
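The KDE baseline (density over past crime locations) can be sketched with scipy's `gaussian_kde`; the coordinates below are made up:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical past crime locations (x, y pairs).
past_crimes = np.array([
    [0.0, 0.0], [0.1, 0.1], [0.05, 0.0], [0.0, 0.1],  # a hot spot
    [3.0, 0.5], [0.5, 3.0],                           # scattered incidents
]).T  # gaussian_kde expects shape (n_dims, n_points)

kde = gaussian_kde(past_crimes)

# Score candidate locations: density is highest near the historical cluster.
hot = kde([[0.05], [0.05]])[0]
cold = kde([[10.0], [10.0]])[0]
print(hot > cold)  # True
```

Ranking grid cells by this density answers "where is crime likely to occur, based on the past?", which is what the Twitter-topic model is compared against.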
- Clinical Practice Guidelines
- Want to formalize using natural language processing
- Sentences have a specific order: they're using NLP and parsing English sentences. (concern: context sensitivity of English)
- Want to annotate the text with semantic labels (not XML, though).
- Precision: temporal identifiers are identified at 28%; others average around 50%, with the top around 75-80%.
- Warning: need to make sure that fully automated isn't used alone, as there could be things that automated analysis would miss that could be life-threatening.
- The big picture
- Want to get structured information from unstructured text data through Natural Language Processing
Used to handle searching:
- Percolator: incremental update system
Facebook is generating over 500 TB per day. Google, 20PB.
She wants to tackle the issues of traffic dynamics in large scale networks by injecting flexibility into design.
- design flexible spectrum auction systems to match dynamic traffic, since the US auctions off large chunks of the spectrum covering the entire country.
- add flexible wireless links to address traffic hotspots in data centers. (Directional 60GHz links between the different racks that are configurable--but any racks in the way must repeat the signal, since 60GHz is line of sight).
- they use reflections off the ceiling to avoid these problems (3D Beamforming)
Dynamic spectrum auction
- most allocated spectrum sits idle.
- using an auction
- truthful auction as the dominant method to remove gain from cheating. Ex: Vickrey auction (second-price auction)
- she integrates allocation with pricing: greedy allocation based on bids, winner X pays the bid of its critical neighbor.
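The second-price mechanism is easy to sketch; in the spectrum variant described above, the runner-up bid is replaced by the bid of the winner's critical neighbor in the interference graph:

```python
def vickrey_winner(bids):
    """Second-price (Vickrey) auction: the highest bidder wins but pays
    the second-highest bid, which makes truthful bidding a dominant
    strategy (overbidding risks overpaying; underbidding risks losing)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else 0
    return winner, price

print(vickrey_winner({"A": 10, "B": 7, "C": 3}))  # ('A', 7)
```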
How is this different than how radio stations split their frequencies based on location?
Note the use of a Kautz graph overlay of nodes in the data centers, which allows them to handle arbitrary network size and incremental growth. They show that 92% of paths can run concurrently.
Talk by Faculty candidate on Friday.
Big data computation model
- \(n\) = number of vectors in \(\mathbb{R}^d\) seen so far
- \(d\) = number of sensors (dimensionality)
- only have \(M\) memory available on the system
- \(P\) = number of cores
\(k\)-median queries
- input: \(n\) points in \(\mathbb{R}^d\) and an integer \(k\)
- key idea: replace many points by a weighted average point \(\bar{p}\)
- \(\sum_{p \in S} \mathrm{dist}(p, Q) \approx \sum_{c \in C} w(c)\,\mathrm{dist}(c, Q)\), where \(S\) is the input set and \(C\) is the approximated core set.
Online algorithm to build the core set: read in \(M\) points, compute \(k\)-means and best-fit lines, then choose some of the furthest outliers, then compress that data. Read the next \(M\) points and repeat, then compute the core set of those two sets to get a new core set. Move on to the next points until you have read all \(n\) and have a final core set. Then run an off-line heuristic on this small final core set.
Gives a \((1+\varepsilon)\) approximation.
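The merge-reduce streaming pattern can be sketched as follows, using a plain weighted k-means compression as a crude stand-in for the paper's core-set construction (so no approximation guarantee is implied here, only the read-compress-merge structure):

```python
import numpy as np

def reduce_to_coreset(points, weights, m, iters=10):
    """Compress a weighted point set to m weighted representatives:
    cluster with (weighted) k-means and give each center the total
    weight of the points assigned to it."""
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), m, replace=False)]
    for _ in range(iters):
        assign = np.argmin(
            np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
        for j in range(m):
            mask = assign == j
            if mask.any():
                centers[j] = np.average(points[mask], axis=0,
                                        weights=weights[mask])
    new_w = np.array([weights[assign == j].sum() for j in range(m)])
    return centers, new_w

def stream_coreset(stream, block=100, m=20):
    """Merge-reduce over a stream: compress each block of points,
    merge it with the running core set, compress again."""
    core_pts, core_w = None, None
    for start in range(0, len(stream), block):
        chunk = np.asarray(stream[start:start + block], dtype=float)
        w = np.ones(len(chunk))
        if core_pts is not None:
            chunk = np.vstack([core_pts, chunk])
            w = np.concatenate([core_w, w])
        core_pts, core_w = reduce_to_coreset(chunk, w, min(m, len(chunk)))
    return core_pts, core_w
```

Note the invariant: compression aggregates weights, so the core set's total weight always equals the number of points seen so far.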
References to consider:
- Feldman, Landberg, STOC'11
- Feldman, Sohler, Monemizadeh, SoCG'07
- Har-Peled, Mazumdar, 04
- Feldman, Wu, Julian, Sung and Rus. GPS Compression