05 9 / 2011
Back and forth, back and forth…
One of the features I’ve really wanted to build into Inbox Assistant is the ability to automatically modify an appointment as the details are ironed out. How many times have you proposed a meeting time to someone, and a few emails later an agreed-upon time finally materializes? Sure, there are services and products that help groups make appointments - but let’s face it, most of the time people stick to email.
Here’s how it works:
Simply send an email and CC Inbox Assistant. If I were to send, “Mike, want to grab lunch tomorrow?”, Inbox Assistant would create an appointment for me at noon tomorrow. If Mike responds with another time, my appointment (and his, if he’s an Inbox Assistant user) will adjust to the new proposed time, and keep adjusting until we’ve agreed upon a time.
So there you have it. The #1 requested feature is now available. Now start planning some meetings!
31 8 / 2011
Meridian-less time parsing
Last night, I shipped a new feature that will find times without a meridian indicator (that is, AM or PM).
My previous post touches on how I use determiners to separate these from other numbers, but I wanted to quickly describe how this will make scheduling from natural language even easier.
If I were to say to you, “I’ll meet you at your office at 10”, you probably know that I’m talking about 10am. So what I ended up doing was rather simple: If the determined hour is 12, 1, 2, 3, 4, 5, 6, 7, or 8, treat it as PM. Otherwise, it’s AM. This change works for extracting hours and minutes: “at 9” and “at 9:30”.
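The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Inbox Assistant code: the `BARE_TIME` pattern and the function names are assumptions made up for this example, and the pattern only handles times introduced by “at”.

```python
import re

# Hypothetical pattern: "at 9" or "at 9:30" with no AM/PM suffix.
# The negative lookahead rejects "at 9am", "at 9 pm" and "at 9:30pm".
BARE_TIME = re.compile(
    r"\bat (\d{1,2})(?::(\d{2}))?(?![\d:]|\s*[ap]\.?m\b)",
    re.IGNORECASE,
)

# Hours that the heuristic treats as PM; everything else is AM.
PM_HOURS = {12, 1, 2, 3, 4, 5, 6, 7, 8}


def to_24_hour(hour, minute=0):
    """Apply the heuristic and return (hour, minute) on a 24-hour clock."""
    if hour in PM_HOURS and hour != 12:
        hour += 12  # 1..8 become 13..20; 12 stays 12 (noon)
    return hour, minute


def find_bare_time(text):
    """Return the first meridian-less time in text, or None."""
    match = BARE_TIME.search(text)
    if match is None:
        return None
    hour = int(match.group(1))
    minute = int(match.group(2) or 0)
    if not 1 <= hour <= 12 or minute > 59:
        return None
    return to_24_hour(hour, minute)
```

With this sketch, “I’ll meet you at your office at 10” resolves to 10:00, while “let’s plan on meeting at 1” resolves to 13:00.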
30 8 / 2011
Experimenting with Natural Language Processing
At the heart of Inbox Assistant is a lightweight internal web service written in Python that processes the text of incoming mail, and extracts date/timestamps and context (an occasion and possibly a location).
I’m a Ruby developer, and prior to breaking ground on this project I’d never written a line of Python in my life. I also knew nothing about natural language processing or machine learning. But I had to write Inbox Assistant, and for the first time in quite a few years I genuinely felt like I had found a challenge that completely pushed me out of my comfort zone.
I decided on Python because all of my Googling of NLP kept leading me to one library: nltk. There had been a few attempts at NLP in Ruby, and the first prototype of Inbox Assistant was entirely written in Ruby and backed by nickel. But what sold me on nltk was its maturity. Not only does it have really strong sentence chunkers and word taggers (using WordNet and other part-of-speech databases, or corpora), but it also has a ton of other features I haven’t even begun to look at.
Not only did I need to learn how to leverage NLP for my needs, I also had to figure out the syntax and grammar of Python, along with its API. I won’t bore you with the details of how I dabbled at first with the idea of doing the whole thing in Python, quickly became frustrated with Django, and ended up writing the frontend in Rails.
A beginner’s guide to Natural Language Processing (NLP) with nltk
People use NLP for a lot of different things. Obviously, this post covers what I’m familiar with: extracting the metadata I need from raw text. If you’re really interested in NLP with nltk, I recommend this book.
Tokenizing
When given a large block of text, the first thing you’ll want to do is tokenize it into sentences. Usually, this is pretty simple to do - just split your text on periods. I didn’t really care about the order of sentences, though I do give preference to the first dates found in the text.
After tokenizing into sentences, you’ll have an array of strings, with each string being a sentence. Next, I split these sentences into arrays of words. The simplest approach is to split on spaces, but nltk provides tokenizers based on corpora (which I’ll cover later) that go as far as splitting contractions (“can’t”). If your application had to extract sentiment or mood from a block of text, being able to normalize contractions (into “can not”) might be valuable. I ended up just splitting on whitespace.
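The two-step split described above needs nothing beyond plain Python. The sketch below is deliberately naive, and the function name `tokenize` is just a label for this example: splitting on periods will also break on things like decimals and abbreviations.

```python
def tokenize(text):
    """Split text into sentences on periods, then each sentence into words."""
    # Step 1: sentences. Drop empty pieces left by a trailing period.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Step 2: words. str.split() with no argument splits on any whitespace.
    return [sentence.split() for sentence in sentences]


words = tokenize("Mike, want to grab lunch tomorrow. Let's meet at 9am.")
# words is a list of sentences, each of which is a list of words
```

Note that punctuation stays attached to words (“Mike,”), which is one reason a corpus-based tokenizer can be worth the extra effort.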
Tagging Words
Tokenizing (especially with the way I described above) didn’t even require nltk (though it ships with whitespace and period-based tokenizers) and could have been done easily in my native Ruby. The next concept, part-of-speech tagging, is inarguably where nltk proved to be essential.
Having an array of sentences, with each sentence being an array of words, doesn’t really get us anywhere. In order to derive context, our system needs to understand subjects, verbs, direct objects, and so on. From this, we can begin to understand the who’s, what’s, where’s and why’s of a block of text.
It’s impossible to take a given English word and assign a part of speech without knowing context. Think of the word ‘address’. Is it a verb? A noun? nltk provides a number of corpora, which are pre-classified blocks of text taken from various sources (Project Gutenberg, the NYT, etc.). With nltk, you’re able to instantiate a tagger that’s trained against one of these sources and apply the tagger to your text. If your project involved extracting context from academic publications, you’d most likely want to use a different corpus than another project that might be analyzing middle school instant messenger conversations.
So after running an array of strings through a tagger, you’re returned an array of tuples: the first value being the word itself (‘chicken’) and the second being the part of speech (‘NN’, for singular noun). You’ll start having flashbacks of grade school English class.
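To make the idea of “training a tagger against a corpus” concrete, here is a toy tagger in plain Python. It is not nltk and not what Inbox Assistant runs: the three-sentence `TRAINING` corpus is invented for illustration, and the tagger simply remembers the most frequent tag for each word. That means it ignores context entirely, so ‘address’ always gets the same tag, which is exactly the weakness real taggers are built to overcome.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged "corpus", made up for this example.
TRAINING = [
    [("the", "DT"), ("chicken", "NN"), ("crossed", "VBD"), ("the", "DT"), ("road", "NN")],
    [("what", "WP"), ("is", "VBZ"), ("your", "PRP$"), ("address", "NN")],
    [("meet", "VB"), ("me", "PRP"), ("at", "IN"), ("9", "CD")],
]


def train(tagged_sentences):
    """Map each lowercased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NN"):
    """Return (word, tag) tuples; unknown words fall back to the default tag."""
    return [(word, model.get(word.lower(), default)) for word in words]


MODEL = train(TRAINING)
tagged = tag(["meet", "the", "chicken", "at", "9"], MODEL)
```

The output has the same shape as described above: a list of tuples such as `('chicken', 'NN')`.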
Chunking
Now that we have our tagged words, we can start ‘chunking’ these into logical groupings. For instance, if we wanted words like “the chicken” or “an egg” to be chunked into their own groups, we’d develop a chunker that would look for a definite or indefinite article (a determiner) followed by a noun. Or in my case, I’d be looking for a determiner followed by a cardinal number.
I ended up writing my own IOB chunker (IOB stands for inside, outside, beginning), which allowed me to look ahead before making my chunking decision. As I cycle through the list of words, if I prefix a word with B-, it signifies that I’m creating a chunk. If I’m within a chunk, and I prefix a word with I-, it means that I want to keep that chunk open. And finally, when I return an O, I’m closing the chunk at whatever the last I- or B- tagged word was.
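As a rough sketch of the simplest case, the function below walks a list of tagged words and groups a determiner with the noun or number that follows it. The function name and the choice of tags (`DT`, `NN`, `CD`) are assumptions for this example; a real chunker would handle many more patterns.

```python
def chunk_determiner_pairs(tagged):
    """Group each determiner (DT) with an immediately following noun or number."""
    chunks = []
    i = 0
    while i < len(tagged) - 1:
        word, tag = tagged[i]
        next_word, next_tag = tagged[i + 1]
        if tag == "DT" and next_tag in ("NN", "CD"):
            chunks.append((word, next_word))
            i += 2  # skip past the word we just consumed
        else:
            i += 1
    return chunks


tagged = [("the", "DT"), ("chicken", "NN"), ("laid", "VBD"), ("an", "DT"), ("egg", "NN")]
pairs = chunk_determiner_pairs(tagged)
```

Here `pairs` holds the two groups from the example above, “the chicken” and “an egg”, each as a tuple of words.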
Here’s a better example: Part of my chunker is looking for times. 9am, 8:15pm, noon, etc. Usually, these times are prefixed with a determiner, like ‘at’. I wrote a regular expression that covered every way of representing time I could think of. If my system were traversing over, “let’s meet at 9am”, when cycling over the 3rd word, ‘at’, I’d return a B-DATE marker, indicating that I’m creating a DATE chunk. I would do this by looking ahead to the next word, ‘9am’, and because it would match my regular expression, I’d assign I-DATE to 9am. If another word were to follow ‘9am’ that was unrelated, I’d return an O- and my chunk would be ‘at 9am’. Additionally, because Inbox Assistant supports meridian-less times, ‘at 9’ would be chunked as a DATE, whereas just ‘9’ with no determiner before it would be ignored.
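Here is a stripped-down sketch of that look-ahead logic. It is an illustration under assumptions, not the real chunker: the `TIME` pattern covers only a handful of formats, `LEADERS` contains only ‘at’, and the function names are made up. Outside words are tagged with a plain `O`.

```python
import re

# Hypothetical, much smaller than "every way of representing time":
# 9, 9am, 8:15pm, noon, midnight.
TIME = re.compile(r"^(\d{1,2}(:\d{2})?(am|pm)?|noon|midnight)$", re.IGNORECASE)

# Words that may open a DATE chunk when a time follows them.
LEADERS = {"at"}


def iob_tags(words):
    """Tag each word B-DATE, I-DATE or O, looking one word ahead."""
    tags = []
    i = 0
    while i < len(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        if words[i].lower() in LEADERS and nxt is not None and TIME.match(nxt):
            tags.append((words[i], "B-DATE"))  # open the chunk on 'at'
            tags.append((nxt, "I-DATE"))       # keep it open for the time
            i += 2
        else:
            tags.append((words[i], "O"))       # outside any chunk
            i += 1
    return tags


def date_chunks(tagged):
    """Collect consecutive B-DATE / I-DATE words into chunk strings."""
    result, current = [], []
    for word, tag in tagged:
        if tag == "B-DATE":
            if current:
                result.append(" ".join(current))
            current = [word]
        elif tag == "I-DATE" and current:
            current.append(word)
        else:
            if current:
                result.append(" ".join(current))
            current = []
    if current:
        result.append(" ".join(current))
    return result


tagged = iob_tags("let's meet at 9am tomorrow".split())
```

Running `date_chunks(tagged)` on this sentence yields the single chunk ‘at 9am’. A bare ‘9’ with no ‘at’ before it is tagged `O` and never becomes a chunk.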
Phew. That’s a LOT to digest. At least it was for me. But ultimately, I was able to create chunks of dates, occasions, locations, and so on. And after having established this context, the application can pretty accurately create calendar events from plain text - even if the date and time chunks are scattered throughout the text (“Bob, I’m thinking we could grab lunch tomorrow. I have a morning meeting, but let’s plan on meeting at 1” would create an event for tomorrow at 1pm). But how I did that is a topic for another post :-)
NLP and nltk are both immense, and I’ve only scratched the surface. If you’d like to see how well I’ve done creating a natural language parser, why don’t you give Inbox Assistant a spin?
30 8 / 2011
Social Media's Secret Weapon - Email
Email is here to stay, and we’re betting that email as an interface is going to become really popular in the next few years.