## Boots: New Machine Learning Approaches to Modeling Dynamical Systems

Large streams of data, mostly unlabeled.

A machine learning approach fits models to data. How does it work? Take the raw data, hypothesize a model, and use a learning algorithm to tune the model parameters to match the data.
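
A minimal sketch of that loop, using a toy scalar linear system (my own example, not from the talk): hypothesize a model $x_{t+1} = a\,x_t$, then use least squares to recover the parameter $a$ from an observed trajectory.

```python
def fit_scalar_dynamics(xs):
    """Least-squares estimate of a in the hypothesized model x_{t+1} = a * x_t."""
    num = sum(x * y for x, y in zip(xs, xs[1:]))
    den = sum(x * x for x in xs[:-1])
    return num / den

# Simulate a noiseless trajectory with true parameter a = 0.9.
a_true, x = 0.9, 1.0
traj = [x]
for _ in range(50):
    x = a_true * x
    traj.append(x)

a_hat = fit_scalar_dynamics(traj)  # recovers the true parameter (theta ≈ theta*)
```

With noiseless data the estimate is exact; with noise, consistency and finite-sample bounds are exactly the guarantees the talk asks for.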

What makes a good machine learning algorithm?

• Performance guarantees: $\theta \approx \theta^*$ (statistical consistency and finite sample bounds)
• Real-world sensors, data, resources (high-dimensional, large-scale, ...)

For many types of dynamical systems, learning is provably intractable. You must choose the right class of model, or else all bets are off!

Look into:

• Spectral Learning approaches to machine learning

Filed under Big Data, Machine Learning, Talks

## Basener: Topological and Bayesian Methods in Data Science

• Topology: encompasses the global shape of the data, and the relations between data points or groups within the global structure
• Example: Cosmic Crystallography
  • Torus universe (zero curvature)
  • Spherical universe (positive curvature)
  • Other universe (negative curvature)
• Data: hyperspectral imagery
  • identify the neighbor with the highest density for each data point (an arrow points from each point to that neighbor)
  • this gives a field over the data
  • follow the arrows to identify clusters
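
A rough sketch of that clustering idea: estimate a density at each point, draw an "arrow" from each point to its nearest higher-density neighbor, and follow arrows uphill to a local density peak, which labels the cluster. The radius, the crude count-based density estimate, and the toy data are my assumptions.

```python
import math

def cluster_by_density_arrows(pts, radius=0.5):
    # crude density estimate: number of points within `radius` (self included)
    dens = [sum(1 for q in pts if math.dist(p, q) <= radius) for p in pts]
    # arrow: nearest point with strictly higher density (None at a density peak)
    arrow = []
    for i, p in enumerate(pts):
        denser = [j for j in range(len(pts)) if dens[j] > dens[i]]
        arrow.append(min(denser, key=lambda j: math.dist(p, pts[j])) if denser else None)
    # follow the arrows uphill; the peak reached labels the cluster
    def peak(i):
        while arrow[i] is not None:
            i = arrow[i]
        return i
    return [peak(i) for i in range(len(pts))]

pts = [(0.0, 0.0), (0.4, 0.0), (0.8, 0.0),     # one dense run near the origin
       (10.0, 0.0), (10.4, 0.0), (10.8, 0.0)]  # another near x = 10
labels = cluster_by_density_arrows(pts)
# the first three points drain into one peak, the last three into another
```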

people.rit.edu/wfbsma/data/NINJA_MAIN_self_test_refl_RX.img.html

Filed under Big Data, Talks

## Christin: Seeking a fix: Measuring, analyzing and disrupting unlicensed online drug sales

Interesting points from the talk

• Drugs in different countries have different names, so they had to do matching
• Use the Jaccard distance to find related pharmacies
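
Sketch of the Jaccard distance applied to the kind of matching described: comparing two pharmacies by the sets of drug names they list (the inventories are made-up examples).

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for sets A and B."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

pharmacy1 = {"sildenafil", "tadalafil", "tramadol"}
pharmacy2 = {"sildenafil", "tadalafil", "modafinil"}
d = jaccard_distance(pharmacy1, pharmacy2)  # 1 - 2/4 = 0.5
```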

Interesting points to look into for research:

• spinglass clustering algorithm
• visualizations for spinglass

https://www.andrew.cmu.edu/user/nicolasc/

Filed under Big Data, Talks

## Natural Language Processing for Predictive Technology (Matthew Gerber)

SIE Colloquium by Matthew Gerber, Research Assistant Professor in the Systems and Information Engineering Department.

The PTL group has 2 faculty, 10 grad students, and collaborators at the health system.

• Conventional warfare had easily identified forces and open conflict with direct attacks (friends/enemies). The US has no conventional military peers. The US is dealing with asymmetric warfare (asymmetry in size, power, funding, influence). Our enemies have tactical advantages.
• Monitoring via hot-spot maps
1. Problems: very specific to the area you're studying, and it's retrospective. You can't take yesterday's model and predict on a different place today.
• Overview of the approach
1. Gather information on potential crime correlates (Incident Layer, Grid Layer, Demographic Layer, Spatial Layer). Ex: near a military outpost? a religious site? Income levels, ethnic tension, and prior history (each on a different layer). Want to take this information and create a statistical model.
2. Text poses a problem: unstructured text abounds. Short tweets like this should be helpful: "The second blast was caused by a motorcycle bomb targeting a minibus in the Domeez area in the south of the city." That needs to be read by a human or an automated approach (this talk).
3. Automatically integrate unstructured text: add some new layers from the previous model (Twitter Layer, Newswire Layer, ...).
• He's looking at tweets from the Chicago area (collected in the basement of Olsson--time, text, etc.). A few topics: 1) flight (0.54), plane (0.2), terminal (0.11), ... ; 2) shopping (0.39), buy (--), ...
1. Mapping these $n$ topics to a heat map of Chicago. Can see where certain things are being talked about.
2. Unsupervised topic modeling
• Latent Dirichlet allocation (Blei et al 2003)
• A generative story (2 topics). Topics live outside of the documents, and we can generate them. Do a similar thing with the documents: draw from a Dirichlet distribution to produce a distribution over topics that the document consists of, then pick a topic from that distribution to generate a word (generate the whole document by repeating this process).
• Gather tweets from a neighborhood, tokenize and filter words, identify topic probabilities by LDA, and compute the probability of crime $P(Crime) = F(0.15, 0.74, \ldots, f_n)$ . The question is: what is $F$ ?
1. $\frac{1}{1+e^{-\left(\beta_0 + \sum_{b=1}^n \beta_b f_b(p)\right)}}$ (a logistic model: the features enter as a weighted sum, not a product)
2. Find the beta coefficients that give the best function
• Training
• Establish training window (1/1/13-1/31/13)
• Lay down non-crime points
• lay down crime points from training window
• Compute topic neighborhoods
• compile training data (use Kernel Density Estimate (?) that adds historical data to the model)
• Evaluation
• Want to find the smallest place boundaries with the highest crime levels.
• Do people actually talk about crime on twitter? (that's the big question-- but gangs do trash-talk about their crimes, etc)
• Baseline for comparison was the kernel density estimation (based on past, where is crime likely to occur?)
• They do well with twitter data model + KDE over just KDE for certain results: prostitution, battery.
• They are worse with other topics/crime: homicide, liquor law violations.
• AUC improvement for 22 of 25 crime types, with average peak improvement of 11 points
• Clinical Practice Guidelines
• Want to formalize using natural language processing
• Sentences have a specific order: they're using NLP and parsing English sentences. (concern: context sensitivity of English)
• Want to annotate the text with semantic labels (not XML, though).
• Precision: temporal identifiers are at 28%; others average around 50%, with the top around 75-80%
• Warning: need to make sure that fully automated isn't used alone, as there could be things that automated analysis would miss that could be life-threatening.
• The big picture
• Want to get structured information from unstructured text data through Natural Language Processing
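
The prediction step above can be sketched compactly: topic probabilities from LDA become features $f_1 \ldots f_n$, and the logistic model turns them into $P(Crime)$. The coefficients and feature values below are invented for illustration; in the talk they are fit to training data.

```python
import math

def p_crime(features, betas):
    """Logistic model: 1 / (1 + exp(-(b0 + sum of b_i * f_i)))."""
    z = betas[0] + sum(b * f for b, f in zip(betas[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

topic_probs = [0.15, 0.74, 0.11]  # e.g. an LDA topic mixture for one grid cell
betas = [-2.0, 0.5, 3.0, -1.0]    # hypothetical fitted coefficients
p = p_crime(topic_probs, betas)   # probability of crime for this cell
```

Training then amounts to finding the beta coefficients that best fit the labeled crime/non-crime points.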

Filed under Natural Language Processing, Talks

## Command Line Master

Wanted to post the craziest command line script I've used in a long time.  Used to convert names listed in XML tags in an EAC-CPF record to filenames to copy.

```shell
grep -h -o -P "<relationEntry>(.*?)</relationEntry>" *.xml \
  | sed -e 's/<[a-zA-Z0-9\/\+]*>//g' \
  | awk '{print tolower($0)}' \
  | sed -e 's/[ ,.:]\+/-/g' \
  | sed -e 's/$/cr.xml/g' \
  | while read x ; do cp /data/production/data/"$x" eac_data/. ; done
```

Filed under Uncategorized

## Google Tech Talk Fall 2013

Used to handle searching:

• MapReduce
• BigTable
• Percolator: incremental update system

Filed under Big Data, Data Centers, Talks

## Dynamic Auctions and Wireless Links in Data Centers (Xia)

She wants to tackle the issues of traffic dynamics in large-scale networks by injecting flexibility into the design.

• design flexible spectrum auction systems to match dynamic traffic, since US auctions off large chunks of the spectrum covering the entire country.
• add flexible wireless links to address traffic hotspots in data centers (directional 60 GHz links between racks that are reconfigurable--but any racks in the way must repeat the signal, since 60 GHz is line-of-sight).
• they use reflections off the ceiling to avoid these problems (3D Beamforming)

Dynamic spectrum auction

• most allocated spectrum sits idle.
• using an auction
• truthful auctions as the dominant method to remove gain from cheating. Ex: Vickrey auction (second-price auction)
• she integrates allocation with pricing: greedy allocation based on bids, winner X pays the bid of its critical neighbor.
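
The second-price idea above can be sketched for a single item: the highest bidder wins but pays the second-highest bid, which makes truthful bidding a dominant strategy. The bids are made up; her actual mechanism generalizes this pricing rule to spectrum allocation with critical neighbors.

```python
def vickrey(bids):
    """bids: dict bidder -> bid. Returns (winner, price paid)."""
    ranked = sorted(bids, key=bids.get, reverse=True)
    winner = ranked[0]
    price = bids[ranked[1]] if len(ranked) > 1 else 0.0  # second-highest bid
    return winner, price

winner, price = vickrey({"A": 10.0, "B": 7.0, "C": 4.0})  # A wins, pays 7.0
```

Note that overbidding cannot lower A's price and underbidding only risks losing, which is exactly why truth-telling dominates.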

How is this different than how radio stations split their frequencies based on location?

Note the use of a Kautz $(d,k)$ graph overlay of nodes in the data centers, which allows them to handle arbitrary network size and incremental growth. They show that 92% of paths can run concurrently.
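
For reference, a quick sketch of the standard Kautz $(d,k)$ graph definition (this construction is textbook material, not something specific from the talk): vertices are length-$(k+1)$ strings over $d+1$ symbols with no two adjacent symbols equal, and each vertex $u_1 \ldots u_{k+1}$ has an edge to $u_2 \ldots u_{k+1} x$ for every $x \neq u_{k+1}$, giving $(d+1)d^k$ vertices of out-degree $d$.

```python
from itertools import product

def kautz_graph(d, k):
    symbols = range(d + 1)
    # vertices: strings of length k+1 with no two adjacent symbols equal
    nodes = [s for s in product(symbols, repeat=k + 1)
             if all(a != b for a, b in zip(s, s[1:]))]
    # edges: shift left and append any symbol differing from the last
    edges = [(u, u[1:] + (x,)) for u in nodes
             for x in symbols if x != u[-1]]
    return nodes, edges

nodes, edges = kautz_graph(2, 1)  # K(2,1): (2+1)*2^1 = 6 nodes, out-degree 2
```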

Filed under Big Data, Data Centers

## Minimum sum-of-squares clustering

See the PDF, page 4, for the problem description. Find $C$ cluster centers $z_j$ to minimize the sum of distances from each point to its cluster center.

MSSC is similar to the problem we are trying to solve in the $kn$ to $k$ groups of $n$ matching problem. Given $kn$ points, we would like to find the $n$ cluster centers that minimize the distance of any point $p_i$ to its nearest center $z_j$ (the centroid of that cluster). For fixed $n$ and dimensionality $d$, it has been shown that MSSC is solvable in $O(N^{nd+1})$ time for $N$ input points. For general $d$ and input parameter $n$ (as in our case), the problem is NP-hard.

MSSC, however, differs from our problem statement, since we require the size of any cluster to be size $k$ such that the cluster center $z_i$ provides a good representation for $k$ points.  The points in each cluster may be divided up to form $n$ groups, where each member comes from a unique cluster, thereby removing bias between the groups.  Also, our initial problem statement proposes a greedy method, such that the smallest clusters are found first, whereas MSSC minimizes the sizes of all clusters.
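
The MSSC objective itself can be sketched with Lloyd's heuristic: alternately assign each point to its nearest center and move each center to the centroid of its cluster. The 1-D toy data and iteration count are assumptions, and Lloyd's only finds a local optimum of the NP-hard objective.

```python
def lloyd(points, centers, iters=20):
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            clusters[j].append(p)
        # update step: move each center to its cluster centroid
        centers = [sum(c) / len(c) if c else z
                   for c, z in zip(clusters, centers)]
    return centers

pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
centers = lloyd(pts, centers=[0.0, 10.0])  # converges to the two centroids
```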

Filed under Algorithms

## Big Data talk

Talk by Faculty candidate on Friday.

Big data computation model

• $n$ = number of vectors in $R^d$ seen so far
• $d$ = number of sensors (dimensionality)
• only have $\log n$ memory available on the system
• $M$ = number of cores

$k$ -median queries

• input: a point set $P \subset R^d$
• key idea: replace many points by a weighted average point $c \in C$
• $\sum_{p \in P} dist(p,Q) \approx \sum_{c \in C} w(c) * dist(c,Q)$ , where $C$ is the approximated core set.

Online algorithm to build the core set: read in $1/\varepsilon$ points, compute $k$-means and best-fit lines, choose some of the furthest outliers, and compress that data. Read the next $1/\varepsilon$ points, repeat, and then compute the core set of those two sets to get a new core set. Continue with the next points until you have read all $n$ and have a final core set, then run an off-line heuristic on this core set of size $k$.

Does a $1+\varepsilon$ approximation.
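
A rough 1-D sketch of that block-merge-compress pattern: read a block, compress it to $k$ weighted representatives, merge with the running core set, and compress again, so memory stays small while the weighted sum over representatives approximates the sum over all points. The compression step here is a plain weighted Lloyd's pass standing in for the talk's k-means/best-fit-line-plus-outliers step; block size and data are made up.

```python
def reduce_to_k(weighted, k, iters=10):
    """Compress weighted 1-D points [(x, w), ...] to at most k weighted centers."""
    centers = sorted(x for x, _ in weighted)[::max(1, len(weighted) // k)][:k]
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for x, w in weighted:
            j = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            buckets[j].append((x, w))
        centers = [sum(x * w for x, w in b) / sum(w for _, w in b) if b else z
                   for b, z in zip(buckets, centers)]
    # each center carries the total weight of the points it represents
    return [(z, sum(w for _, w in b))
            for z, b in zip(centers, buckets) if b]

def stream_coreset(points, k=2, block=4):
    core = []
    for i in range(0, len(points), block):
        blk = [(x, 1.0) for x in points[i:i + block]]
        # compress the new block, merge with the running core set, compress again
        core = reduce_to_k(core + reduce_to_k(blk, k), k)
    return core

core = stream_coreset([0.0, 0.1, 9.9, 10.0, 0.2, 10.1, 0.1, 9.8], k=2)
# the total weight of the core set equals the number of points streamed
```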

References to consider:

1. Feldman, Langberg, STOC'11
   • Coreset size: $kd/\varepsilon^2$
2. Har-Peled, Mazumdar, '04
3. Feldman, Wu, Julian, Sung and Rus. GPS Compression

Filed under Big Data, Core Sets, Talks

## Dynamic Voronoi Diagrams

Computational Geometry - An Introduction Through Randomized Algorithms; Mulmuley; 1994; pp 135-

Proposition 4.2.2: The total cost of maintaining a Voronoi diagram in a given update sequence is proportional to the total structural change during that sequence. This ignores a logarithmic factor in the cost of a deletion. It also ignores the cost of locating a conflicting vertex during addition. The expected (total) structural change during a random $(N,\delta)$ -sequence is $O(n)$ .