Rule-based matching

spaCy features a rule-matching engine that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags (e.g. IS_PUNCT). The rule matcher also lets you pass in a custom callback to act on matches – for example, to merge entities and apply custom labels. You can also associate patterns with entity IDs, to allow some basic entity linking or disambiguation.

Adding patterns

Let's say we want to enable spaCy to find a combination of three tokens:

  1. A token whose lowercase form matches "hello", e.g. "Hello" or "HELLO".
  2. A token whose is_punct flag is set to True, i.e. any punctuation.
  3. A token whose lowercase form matches "world", e.g. "World" or "WORLD".
[{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

First, we initialise the Matcher with a vocab. The matcher must always share the same vocab with the documents it will operate on. We can now call matcher.add() with an ID and our custom pattern. The second argument lets you pass in an optional callback function to invoke on a successful match. For now, we set it to None.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# add match ID "HelloWorld" with no callback and one pattern
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)

doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)

The matcher returns a list of (match_id, start, end) tuples – in this case, [('HelloWorld', 0, 2)], which maps to the span doc[0:2] of our original document. Optionally, we could also choose to add more than one pattern, for example to also match sequences without punctuation between "hello" and "world":

matcher.add('HelloWorld', None,
            [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
            [{'LOWER': 'hello'}, {'LOWER': 'world'}])

By default, the matcher will only return the matches and not do anything else, like merge entities or assign labels. This is all up to you and can be defined individually for each pattern, by passing in a callback function as the on_match argument on add(). This is useful, because it lets you write entirely custom and pattern-specific logic. For example, you might want to merge some patterns into one token, while adding entity labels for other pattern types. You shouldn't have to create different matchers for each of those processes.

Adding on_match rules

To move on to a more realistic example, let's say you're working with a large corpus of blog articles, and you want to match all mentions of "Google I/O" (which spaCy tokenizes as ['Google', 'I', '/', 'O']). To be safe, you only match on the uppercase versions, in case someone has written it as "Google i/o". You also add a second pattern with an added {IS_DIGIT: True} token – this will make sure you also match on "Google I/O 2017". If your pattern matches, spaCy should execute your custom callback function add_event_ent.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)

# Get the ID of the 'EVENT' entity type. This is required to set an entity.
EVENT = nlp.vocab.strings['EVENT']

def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    doc.ents += ((EVENT, start, end),)

matcher.add('GoogleIO', add_event_ent,
            [{'ORTH': 'Google'}, {'UPPER': 'I'}, {'ORTH': '/'}, {'UPPER': 'O'}],
            [{'ORTH': 'Google'}, {'UPPER': 'I'}, {'ORTH': '/'}, {'UPPER': 'O'}, {'IS_DIGIT': True}])

In addition to mentions of "Google I/O", your data also contains some annoying pre-processing artefacts, like leftover HTML line breaks (e.g. <br> or <BR/>). While you're at it, you want to merge those into one token and flag them, to make sure you can easily ignore them later. So you add a second pattern and pass in a function merge_and_flag:

# Add a new custom flag to the vocab, which is always False by default.
# BAD_HTML_FLAG will be the flag ID, which we can use to set it to True on the span.
BAD_HTML_FLAG = nlp.vocab.add_flag(lambda text: False)

def merge_and_flag(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
    span.set_flag(BAD_HTML_FLAG, True) # set BAD_HTML_FLAG

matcher.add('BAD_HTML', merge_and_flag,
            [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
            [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])

We can now call the matcher on our documents. The patterns will be matched in the order they occur in the text. The matcher will then iterate over the matches, look up the callback for the match ID that was matched, and invoke it.

doc = nlp(LOTS_OF_TEXT)

When the callback is invoked, it is passed four arguments: the matcher itself, the document, the position of the current match, and the total list of matches. This allows you to write callbacks that consider the entire set of matched phrases, so that you can resolve overlaps and other conflicts in whatever way you prefer.

matcherMatcherThe matcher instance.
docDocThe document the matcher was used on.
iintIndex of the current match (matches[i]).
matcheslist A list of (match_id, start, end) tuples, describing the matches. A match tuple describes a span doc[start:end].

Using operators and quantifiers

The matcher also lets you use quantifiers, specified as the 'OP' key. Quantifiers let you define sequences of tokens to be mached, e.g. one or more punctuation marks, or specify optional tokens. Note that there are no nested or scoped quantifiers – instead, you can build those behaviours with on_match callbacks.

!match exactly 0 timesnegation
*match 0 or more timesoptional, variable number
+match 1 or more timesmandatory, variable number
?match 0 or 1 timesoptional, max one

Example: Using linguistic annotations

Let's say you're analysing user comments and you want to find out what people are saying about Facebook. You want to start off by finding adjectives following "Facebook is" or "Facebook was". This is obviously a very rudimentary solution, but it'll be fast, and a great way get an idea for what's in your data. Your pattern could look like this:

[{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]

This translates to a token whose lowercase form matches "facebook" (like Facebook, facebook or FACEBOOK), followed by a token with the lemma "be" (for example, is, was, or 's), followed by an optional adverb, followed by an adjective. Using the linguistic annotations here is especially useful, because you can tell spaCy to match "Facebook's annoying", but not "Facebook's annoying ads". The optional adverb makes sure you won't miss adjectives with intensifiers, like "pretty awful" or "very nice".

To get a quick overview of the results, you could collect all sentences containing a match and render them with the displaCy visualizer. In the callback function, you'll have access to the start and end of each match, as well as the parent Doc. This lets you determine the sentence containing the match, doc[start : end].sent, and calculate the start and end of the matched span within the sentence. Using displaCy in "manual" mode lets you pass in a list of dictionaries containing the text and entities to render.

from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matched_sents = [] # collect data of matched sentences to be visualized

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end] # matched span
    sent = span.sent # sentence containing matched span
    # append mock entity for match in displaCy style to matched_sents
    # get the match span by ofsetting the start and end of the span with the
    # start and end of the sentence in the doc
    match_ents = [{'start': span.start-sent.start, 'end': span.end-sent.start,
                   'label': 'MATCH'}]
    matched_sents.append({'text': sent.text, 'ents': match_ents })

pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]
matcher.add('FacebookIs', collect_sents, pattern) # add pattern
matches = matcher(nlp(LOTS_OF_TEXT)) # match on your text

# serve visualization of sentences containing match with displaCy
# set manual=True to make displaCy render straight from a dictionary
displacy.serve(matched_sents, style='ent', manual=True)

Example: Phone numbers

Phone numbers can have many different formats and matching them is often tricky. During tokenization, spaCy will leave sequences of numbers intact and only split on whitespace and punctuation. This means that your match pattern will have to look out for number sequences of a certain length, surrounded by specific punctuation – depending on the national conventions.

The IS_DIGIT flag is not very helpful here, because it doesn't tell us anything about the length. However, you can use the SHAPE flag, with each d representing a digit:

[{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
 {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]

This will match phone numbers of the format (123) 4567 8901 or (123) 4567-8901. To also match formats like (123) 456 789, you can add a second pattern using 'ddd' in place of 'dddd'. By hard-coding some values, you can match only certain, country-specific numbers. For example, here's a pattern to match the most common formats of international German numbers:

[{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
 {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]

Depending on the formats your application needs to match, creating an extensive set of rules like this is often better than training a model. It'll produce more predictable results, is much easier to modify and extend, and doesn't require any training data – only a set of test cases.

Example: Hashtags and emoji on social media

Social media posts, especially tweets, can be difficult to work with. They're very short and often contain various emoji and hashtags. By only looking at the plain text, you'll lose a lot of valuable semantic information.

Let's say you've extracted a large sample of social media posts on a specific topic, for example posts mentioning a brand name or product. As the first step of your data exploration, you want to filter out posts containing certain emoji and use them to assign a general sentiment score, based on whether the expressed emotion is positive or negative, e.g. 😀 or 😞. You also want to find, merge and label hashtags like #MondayMotivation, to be able to ignore or analyse them later.

By default, spaCy's tokenizer will split emoji into separate tokens. This means that you can create a pattern for one or more emoji tokens. Valid hashtags usually consist of a #, plus a sequence of ASCII characters with no whitespace, making them easy to match as well.

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English() # we only want the tokenizer, so no need to load a model
matcher = Matcher(nlp.vocab)

pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍'] # positive emoji
neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒'] # negative emoji

# add patterns to match one or more emoji tokens
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]

matcher.add('HAPPY', label_sentiment, *pos_patterns) # add positive pattern
matcher.add('SAD', label_sentiment, *neg_patterns) # add negative pattern

# add pattern to merge valid hashtag, i.e. '#' plus any ASCII token
matcher.add('HASHTAG', merge_hashtag, [{'ORTH': '#'}, {'IS_ASCII': True}])

Because the on_match callback receives the ID of each match, you can use the same function to handle the sentiment assignment for both the positive and negative pattern. To keep it simple, we'll either add or subtract 0.1 points – this way, the score will also reflect combinations of emoji, even positive and negative ones.

With a library like Emojipedia, we can also retrieve a short description for each emoji – for example, 😍's official title is "Smiling Face With Heart-Eyes". Assigning it to the merged token's norm will make it available as token.norm_.

from emojipedia import Emojipedia # installation: pip install emojipedia

def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == 'HAPPY': # don't forget to get string!
        doc.sentiment += 0.1 # add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == 'SAD':
        doc.sentiment -= 0.1 # subtract 0.1 for negative sentiment
    span = doc[start : end]
    emoji =[0].text) # get data for emoji
    span.merge(norm=emoji.title) # merge span and set NORM to emoji title

To label the hashtags, we first need to add a new custom flag. IS_HASHTAG will be the flag's ID, which you can use to assign it to the hashtag's span, and check its value via a token's check_flag() method. On each match, we merge the hashtag and assign the flag.

# Add a new custom flag to the vocab, which is always False by default
IS_HASHTAG = nlp.vocab.add_flag(lambda text: False)

def merge_hashtag(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    span.merge() # merge hashtag
    span.set_flag(IS_HASHTAG, True) # set IS_HASHTAG to True

To process a stream of social media posts, we can use Language.pipe() , which will return a stream of Doc objects that we can pass to Matcher.pipe() .

docs = nlp.pipe(LOTS_OF_TWEETS)
matches = matcher.pipe(docs)
Read next: Adding languages