Monday, March 20, 2017

Using IBM Watson Knowledge Studio to Train Machine Learning Models

Using the Free Trial version of IBM's Watson Knowledge Studio, I just annotated a text and created a machine learning model in about 3 hours without writing a single line of code. The mantra of WKS is that you don't program Watson, you teach Watson.

For demo purposes I chose to identify personal relationships in Shirley Jackson's 1948 short story The Lottery. This is a haunting story about a small village and its mindless adherence to an old and tragic tradition. I chose it because 1) it's short and 2) it has clear personal relationships like brothers, sisters, mothers, and fathers. I added a few other relations like AGENT_OF (which amounts to subjects of verbs) and X_INSIDE_Y for things like pieces of paper inside a box.
Caveat: this short story is really short, about 3,300 words, so I had no high hopes of getting a good model out of it. I just wanted to go through an entire machine learning workflow, from gathering text data to outputting a complete model, without writing a single line of code. And that's just what I did.


WORKFLOW
I spent about 30 minutes prepping the data: I broke the story into 20 small snippets (to facilitate the train/test split later) and cleaned up some quotation issues, spelling, and so on.
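For the curious, the snippeting step is the kind of thing a few lines of Python could also handle. This is just a sketch, with the file name, the paragraph-based chunking, and n = 20 all my own choices rather than anything WKS requires:

    # Split a plain-text story into roughly 20 snippets, paragraph by paragraph.
    # "lottery.txt" is a hypothetical file name; WKS just wants small documents.
    paras = open("lottery.txt", encoding="utf-8").read().split("\n\n")
    n = 20
    step = max(1, len(paras) // n)
    for i in range(0, len(paras), step):
        with open(f"snippet_{i // step:02d}.txt", "w", encoding="utf-8") as out:
            out.write("\n\n".join(paras[i:i + step]))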
The files uploaded into WKS in seconds (a simple drag and drop into the browser tool). I then created a type system with entity types such as these:

And relation types such as these:
I then annotated the 20 short documents in less than two hours (as is so often the case, I redesigned my type system several times along the way; luckily WKS let me do this fluidly without having to re-annotate).

Here's a shot of my entity annotations:

Here's a shot of my relation annotations:

I then used these manually annotated documents as ground truth to teach a machine learning model to recognize the entities and relationships automatically using a set of linguistic features (character and word n-grams, parts of speech, syntactic parses, etc.). I accepted the WKS-suggested document split of 70/23/2:
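As an aside: WKS computes features like these internally, so you never see them, but here's a minimal scikit-learn sketch of what character and word n-grams look like. This is my own illustration, not WKS code, and the sentence (a real one from the story) is just something to vectorize:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Bobby Martin had already stuffed his pockets full of stones."]

    # Word unigrams and bigrams.
    words = CountVectorizer(analyzer="word", ngram_range=(1, 2)).fit(docs)
    # Character trigrams, counted within word boundaries.
    chars = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit(docs)

    print(sorted(words.vocabulary_)[:5])  # 'already', 'already stuffed', ...
    print(sorted(chars.vocabulary_)[:5])  # ' al', ' bo', ' fu', ...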
I clicked "Run" and waited:

The model was trained and evaluated in about ten minutes. Here's how it performed on entity types:

And here's how it performed on relation types:


This is actually not bad given how sparse the data is. I mean, an F1 of 0.33 on X_INSIDE_Y from only 29 training examples on a first pass. I'll take it, especially since that relation is not necessarily obvious from the text. Here's one example of the X_INSIDE_Y relation:

So I was able to train a model with 11 entity types and 81 relation types on a small corpus in less than three hours, start to finish, without writing a single line of code. I did not program Watson. I taught it.
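One footnote for readers who don't juggle evaluation metrics daily: F1 is the harmonic mean of precision and recall. WKS reports these scores for you; the counts below are invented purely to show the arithmetic:

    # F1 is the harmonic mean of precision and recall.
    # These counts are invented for illustration, not from my WKS run.
    tp, fp, fn = 2, 3, 5                # true/false positives, false negatives
    precision = tp / (tp + fp)          # 0.40
    recall = tp / (tp + fn)             # ~0.29
    f1 = 2 * precision * recall / (precision + recall)
    print(round(f1, 2))                 # 0.33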

Thursday, March 9, 2017

Annotate texts and create learning models

IBM Watson has released a free trial version of its online Watson Knowledge Studio tool. This is one of the tools I'm most excited about because it brings linguistic annotation, rule writing, dictionary creation, and machine learning together in a single user-friendly interface designed for non-engineers.


This tool allows users to annotate documents with mentions, relations, and coreference, then train a machine learning model of their domain, all with zero computer science background. I've trained subject matter experts in several fields to use WKS and I'm genuinely impressed. I'll put together a demo and post it sometime this weekend.

Sunday, September 25, 2016

the value of play for preschool children

A nice article came up in The Atlantic about the comeback of recess and play in elementary schools. My sister posted some thoughts about how she encourages play when she teaches her preschool kids: for our little Chico preschool, play and education are two sides of the same coin.

Friday, September 23, 2016

from preschool to scholar athlete

One of the most endearing things about being a teacher is seeing former students go on to achieve great things. My sister's preschool student Jack Emanuel has been named a Subway Scholar Athlete. Very cool. Congratulations, Jack.

Monday, September 5, 2016

some thoughts on data analytics for a micro business

My first thought as a new business owner trying to utilize data analytics is this: how empty this all seems. I don't need to understand my market right now, I need to break into it. How can I use data analytics to open doors? I don't have time for nuance right now, I need sign ups. I'm faced with one of the most basic problems in business: getting noticed. And I wanna do it on the cheap.

Let me stipulate that this post relates only to a small slice of data analytics (DA) proper: the low-cost advertising analytics available from Google and Facebook.

My sister just relocated her preschool, Kids First, to a new town, Chico, CA, and I'm helping her buy a domain, set up a website, and do some promotion using Google AdWords Express, Facebook ads, and some local print advertising.

This is the first time I've been on this side of the data analytics equation: as a consumer of the analysis, not a producer. And my experience is telling.

Business model: We're not a small business. We're a micro business. Employees = 1 (Miss Lori). We don't need to sell a million widgets or gain a million likes. The business model of our preschool remains very old fashioned: get ten kids signed up, average that head count annually, and we're successful.

The market: Chico is a college town with 85,000 people in the city, about 200,000 in the larger area. There is a state university, a junior college, two hospitals, two high schools, two middle schools, and several elementary schools, as well as a robust farming economy (all employing professionals with kids, right?).

Advertising: Bought a small print ad in the local weekly for four weeks ($200/wk). One-time ad in the university student paper for the first week of the semester (hoping to grab the attention of new faculty who might finger through it out of curiosity). Small Facebook ad ($50), small Google AdWords campaign ($120/month). My sister "boosted" the preschool's Facebook page for a few days for $20.

Data Analytics: FB ad = zero calls. FB said 6,322 people saw her boosted virtual tour video and 411 people clicked on it in just a few days (I'm suspicious of these numbers because Lori said she limited the ad to Chico. I would be shocked if that many Chicoans clicked her video in a couple days). Zero calls. I started the AdWords campaign on September 3. I limited the area to the Chico region (those 200k people) because I don't want to pay for clicks from people who are looking for preschools in San Francisco. We've gotten 4 clicks in about 48 hours. Zero calls.



This is early, of course, but still, it all seems so empty because of the nature of our business. We don't need clicks or views or likes: we need parents to sign their kids up. There is, to me, a disconnect between the old-fashioned, brick-and-mortar reality of trying to run a preschool profitably and the virtual reality of these meager data analytics. The NLPer in me likes the fact that AdWords shows me which search words drive clicks, but the business owner in me says, where's the beef? I need a parent to pay me money. Don't care 'bout no clicks.

I'm not in the business of advertising DA, but I have some appreciation for the tools and techniques. And even I'm frustrated trying to connect the dots. Or more to the point, I don't know how to use data analytics to go from clicks to phone calls (to actual sign ups).

I get that using DA may prove useful in the long run, but first we need old-fashioned visibility. Without that, nuance is worthless. I can only imagine how frustrated, and perhaps infuriated, many small business owners are at this same disconnect between their business models and the services offered. This experience has already deepened my appreciation for the customer experience in the DA ecosystem.

But ultimately what I think is this: Data analytics, schmata analytics. I don't need a data scientist. I need Don Draper.

Monday, August 15, 2016

Yet Another Bad Review of Suicide Squad

There are lots of reviews of Suicide Squad detailing how bad it is. This is another one. This is less a movie than a series of loosely related scenes. It had roughly three sections:
A) The set-up: Character introductions. The movie begins with an amateurish method of introducing the characters that is literally one person listing their names and features followed by a flashback scene for each. Dumbest exposition structure ever.    
B) The Mission: Load everybody up in a helicopter, give them weapons, send them along their way. This plays out as an extension of A where each person gets a montage playing with their toys of choice.  
C) Switcheroo: Plot twist that changes the mission objective. Unfortunately, the movie fails to adequately set up this core plot point because the director spent so much time showing off weapons and Margot Robbie's arse that he forgot to have a character explain the mission in any memorable way. When the twist in the mission objective is "revealed" midway into the siege, it lands as more confusion than revelation. At this point, no sane person is looking for coherence in the film anyway, so it hardly matters.
The directorial style can best be summed up as 'just keep everyone shooting guns, no one will notice the incoherence'.

Monday, July 11, 2016

Fun with Stanford's online demo

I've long been a fan of Stanford's online parser demo, but now they've outdone themselves with a demo page for their CoreNLP tools. Not only does it take your text and show the parse and entities, it also lets you develop a regex to match your input text, including semantic regexes (Semgrex) over the dependency parse!

 This is just plain fun: http://corenlp.run/ 
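If you want the same annotations programmatically, CoreNLP also ships with a server you can query over HTTP. Here's a minimal Python sketch, assuming a server running locally on port 9000 (started per the CoreNLP docs; please don't hammer the public demo):

    import json
    import requests

    # Ask a local CoreNLP server for POS tags and named entities.
    props = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
    resp = requests.post(
        "http://localhost:9000",
        params={"properties": json.dumps(props)},
        data="Stanford released a new demo page for CoreNLP.".encode("utf-8"),
    )
    for tok in resp.json()["sentences"][0]["tokens"]:
        print(tok["word"], tok["pos"], tok["ner"])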



Sunday, June 19, 2016

IBM Watson at NAACL 2016

There were several Twitter NLP flare-ups recently triggered by the contrast between academic NLP and industry NLP. I'm not going to re-litigate those arguments, but I will note that one IBM Watson question answering team anticipated this very thing in their paper for the NAACL HLT 2016 Workshop on Human-Computer Question Answering.

The paper is titled Watson Discovery Advisor: Question-answering in an industrial setting.

The Abstract
This work discusses a mix of challenges arising from Watson Discovery Advisor (WDA), an industrial strength descendant of the Watson Jeopardy! Question Answering system currently used in production in industry settings. Typical challenges include generation of appropriate training questions, adaptation to new industry domains, and iterative improvement of the system through manual error analyses.
The paper's topic is not surprising given that four of the authors are PhDs (Charley, Graham, Allen, and Kristen). Hence, it was largely a group of fish out of water: they have an academic bent, but wrestle daily with the real-world challenges of paying customers and very messy data.

Here are five take-aways:

  1. Real-world questions and answers are far more ambiguous and domain-specific than academic training sets.
  2. Domain tuning involves far more than just retraining ML models.
  3. Useful error analysis requires deep dives into specific QA failures (as opposed to broad statistical generalizations).
  4. Defining what counts as an error is itself embedded in the context of the customer's needs and the domain data. What counts as an error to one customer may be acceptable to another.
  5. Quiz-Bowl evaluations are highly constrained special cases of general QA, a point I made in 2014 here (pats self on back). Their lessons learned are of little value to the industry QA world (for now, at least).

I do hope you will read the brief paper in full (as well as the other excellent papers in the workshop).