Monthly Archives: September 2013

Natural Language Processing for Predictive Technology (Matthew Gerber)

SIE Colloquium by Matthew Gerber, Research Assistant Professor in the Systems and Information Engineering Department.

The PTL group has 2 faculty, 10 grad students, and collaborators at the health system.

Predicting crime using Twitter:

  • Conventional warfare featured easily identified forces and open conflict with direct attacks (clear friends and enemies). The US has no conventional military peers; instead it is dealing with asymmetric warfare (asymmetry in size, power, funding, and influence), where our adversaries hold tactical advantages.
  • Monitoring via hot-spot maps
    1. Problems: hot-spot maps are very specific to the area you're studying, and they're retrospective. You can't take yesterday's model and use it to predict in a different place today.
  • Overview of the approach
    1. Gather information on potential crime correlates (Incident Layer, Grid Layer, Demographic Layer, Spatial Layer). Ex: near a military outpost? A religious site? Income levels, ethnic tension, and prior history (each on a different layer). The goal is to take this information and build a statistical model.
    2. Text presents a problem: unstructured text abounds. Short tweets like this should be helpful: "The second blast was caused by a motorcycle bomb targeting a minibus in the Domeez area in the south of the city." That needs to be read by a human or by an automated approach (the subject of this talk).
    3. Automatically integrate the unstructured text: add new layers to the previous model (Twitter Layer, Newswire Layer, ...).
  • He's looking at tweets from the Chicago area (collected in the basement of Olsson Hall: time, text, etc.). A few example topics: 1) flight (0.54), plane (0.2), terminal (0.11), ...; 2) shopping (0.39), buy (--), ...
    1. Mapping these \(n\) topics onto a heat map of Chicago shows where certain things are being talked about.
    2. Unsupervised topic modeling
      • Latent Dirichlet allocation (Blei et al 2003)
      • A generative story (say, 2 topics): topics live outside the documents, each topic being a distribution over words. Each document draws its own distribution over topics from a Dirichlet prior; to generate each word, pick a topic from the document's distribution, then pick a word from that topic. Repeat this process to generate the whole document.
      • Gather tweets from a neighborhood, tokenize and filter the words, identify topic probabilities via LDA, and compute the probability of crime: \(P(crime) = f(0.15, 0.74, \ldots, f_n)\). The question is: what is \(f\)? (See the first sketch after this list.)
        1. \(f(p) = \frac{1}{1+e^{-\left(\beta_0 + \sum_{b=1}^n \beta_b f_b(p)\right)}}\)
        2. Find the \(\beta\) coefficients that give the best-fitting function
      • Training
        • Establish a training window (1/1/13-1/31/13)
        • Lay down non-crime points
        • Lay down crime points from the training window
        • Compute topic neighborhoods
        • Compile the training data (a kernel density estimate adds historical crime data to the model)
      • Evaluation
        • Want to find the smallest area boundaries that capture the highest crime levels.
        • Do people actually talk about crime on Twitter? (That's the big question, but gangs do trash-talk about their crimes, etc.)
        • The baseline for comparison was kernel density estimation (KDE): based on the past, where is crime likely to occur? (See the KDE sketch after this list.)
        • The Twitter topic model combined with KDE beats KDE alone for certain crime types: prostitution, battery.
        • It does worse for others: homicide, liquor law violations.
        • AUC improved for 22 of 25 crime types, with an average peak improvement of 11 points.
  • Clinical Practice Guidelines
    • Want to formalize the guidelines using natural language processing
    • Guideline sentences follow specific patterns; they're parsing the English sentences with NLP (a concern: the context sensitivity of English)
    • Want to annotate the text with semantic labels (not XML, though); see the annotation sketch after this list
    • Precision: temporal identifiers are recognized at 28%; other label types average around 50%, with the best around 75-80%
    • Warning: fully automated analysis shouldn't be used alone, since it could miss things that would be life-threatening.
  • The big picture
    • Want to get structured information from unstructured text data through Natural Language Processing
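
A minimal sketch of the tweets-to-topics-to-crime pipeline above. This is not the speaker's actual system: the tweets and crime labels are made up, and scikit-learn's LDA and logistic regression stand in for whatever the PTL group actually runs.

# Sketch: tweets -> LDA topic probabilities -> P(crime) via logistic regression.
# All data here is hypothetical; sklearn stands in for the group's tooling.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# One placeholder tweet stream per neighborhood (assumption).
tweets = [
    "flight delayed at the terminal plane boarding gate",
    "shopping downtown buy shoes sale mall",
    "party tonight bar drinks late downtown fight",
    "quiet morning coffee park reading sunshine",
]
crime = [0, 1, 1, 0]  # placeholder labels: did a crime occur nearby?

# Tokenize and filter words, then infer per-neighborhood topic probabilities.
counts = CountVectorizer(stop_words="english").fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
features = lda.fit_transform(counts)  # the f_1..f_n topic features

# Logistic regression finds the beta coefficients in
# f(p) = 1 / (1 + exp(-(beta_0 + sum_b beta_b * f_b(p)))).
model = LogisticRegression().fit(features, crime)
print("P(crime) per neighborhood:", model.predict_proba(features)[:, 1])

In the talk's setup the feature vector would also include the historical-crime layer, which is what the KDE sketch below computes.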
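
The KDE baseline from the evaluation, sketched with scipy's gaussian_kde over made-up incident coordinates (an assumption; any kernel density estimator would do):

# Sketch of the kernel density estimation (KDE) baseline: given past crime
# locations, where is crime likely to occur? Coordinates are hypothetical.
import numpy as np
from scipy.stats import gaussian_kde

crime_points = np.array([
    [1.0, 1.2], [1.1, 0.9], [0.9, 1.1],  # one cluster of past incidents
    [3.0, 3.1], [2.9, 2.8],              # a second, smaller cluster
])

kde = gaussian_kde(crime_points.T)  # gaussian_kde wants shape (dims, points)

# Score a grid of cells and rank them: the top cells approximate the smallest
# boundaries that capture the highest predicted crime levels.
xs, ys = np.meshgrid(np.linspace(0, 4, 20), np.linspace(0, 4, 20))
cells = np.vstack([xs.ravel(), ys.ravel()])
density = kde(cells)
top = cells[:, np.argsort(density)[::-1][:5]]
print("Hottest cells (x, y):\n", top.T)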
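
For the guideline annotation, a hypothetical sketch of what labeled spans (rather than inline XML) might look like. Only the temporal label is mentioned in the talk; "condition" and "action" are illustrative guesses, not the project's actual schema.

# Hypothetical semantic annotation of a guideline sentence as character
# spans. The sentence and label set are invented for illustration.
sentence = "If blood pressure exceeds 140/90, re-check within 2 weeks."

def span(text, phrase):
    # Locate phrase and return its (start, end) character offsets.
    start = text.find(phrase)
    return (start, start + len(phrase))

annotations = [
    (*span(sentence, "blood pressure exceeds 140/90"), "condition"),
    (*span(sentence, "re-check"), "action"),
    (*span(sentence, "within 2 weeks"), "temporal"),
]

for start, end, label in annotations:
    print(f"{label:>9}: {sentence[start:end]!r}")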


Filed under Natural Language Processing, Talks

Command Line Master

Wanted to post the craziest command-line script I've used in a long time. It converts names listed in XML tags in an EAC-CPF record into filenames to copy.

# Pull the tagged names, normalize them into filenames, then copy each file.
grep -h -o -P "<relationEntry>(.*?)</relationEntry>" *.xml |  # extract the tagged names
 sed -e 's/<[a-zA-Z0-9\/\+]*>//g' |                           # strip the tags themselves
 awk '{print tolower($0)}' |                                  # lowercase everything
 sed -e 's/[ ,.():]\+/-/g' |                                  # punctuation runs become hyphens
 sed -e 's/$/cr.xml/' |                                       # append the filename suffix
 while read -r x ; do cp "/data/production/data/$x" eac_data/. ; done



Filed under Uncategorized

Google Tech Talk Fall 2013

Systems Google uses to handle search:

  • MapReduce
  • BigTable
  • Percolator: incremental update system
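
A toy word count in Python to illustrate the MapReduce programming model itself (not Google's implementation): map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group.

# Toy MapReduce: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield word, 1

def reduce_phase(word, counts):
    return word, sum(counts)

documents = ["big data at scale", "data at rest data in motion"]

groups = defaultdict(list)  # shuffle: group intermediate values by key
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

print(dict(reduce_phase(w, c) for w, c in groups.items()))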


Filed under Big Data, Data Centers, Talks