Customising the tokenizer

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a Vocab instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string.
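For example, here's a minimal sketch of constructing a Doc directly. The blank Vocab and the example words are only for illustration; in practice you'd usually pass nlp.vocab from a loaded pipeline:

from spacy.vocab import Vocab
from spacy.tokens import Doc

vocab = Vocab()
words = [u'Hello', u',', u'world', u'!']
spaces = [False, True, False, False]  # does each token own a trailing space?
doc = Doc(vocab, words=words, spaces=spaces)
assert doc.text == u'Hello, world!'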

Tokenizer 101

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. Each Doc consists of individual tokens, and we can simply iterate over them:

nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text)

0    Apple
1    is
2    looking
3    at
4    buying
5    U.K.
6    startup
7    for
8    $
9    1
10   billion

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  1. Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
  2. Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.

[Illustration: step-by-step tokenization of “Let’s go to N.Y.!”. The opening quote is split off as a prefix, the closing quote and the exclamation mark as suffixes, "Let's" is split into "Let" and "'s" by an exception rule, and "N.Y." is kept intact by another exception.]
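With the English defaults, the example from the illustration comes out roughly like this (the exact token boundaries depend on the language data in use):

doc = nlp(u"“Let's go to N.Y.!”")
print([t.text for t in doc])
# expected with the English defaults: ['“', 'Let', "'s", 'go', 'to', 'N.Y.', '!', '”']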

While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass, like English or German, which loads in lists of hard-coded data and exception rules.
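For instance, the language subclasses can be imported and instantiated directly. A minimal sketch; this only sets up the tokenizer and language data, not any statistical models:

from spacy.lang.en import English
from spacy.lang.de import German

nlp_en = English()  # includes English data, e.g. the "don't" exception
nlp_de = German()   # includes German-specific exceptions and rules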

Tokenizer data

Global and language-specific tokenizer data is supplied via the language data in spacy/lang. The tokenizer exceptions define special cases like "don't" in English, which needs to be split into two tokens: {ORTH: "do"} and {ORTH: "n't", LEMMA: "not"}. The prefixes, suffixes and infixes mostly define punctuation rules – for example, when to split off periods (at the end of a sentence), and when to leave tokens containing periods intact (abbreviations like "U.S.").
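As a rough sketch, an exception entry in the language data looks something like this (the real files in spacy/lang contain many more entries and attributes):

from spacy.symbols import ORTH, LEMMA

TOKENIZER_EXCEPTIONS = {
    u"don't": [
        {ORTH: u"do"},
        {ORTH: u"n't", LEMMA: u"not"}],
    u"U.S.": [
        {ORTH: u"U.S."}]
}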

[Diagram: base data and language data (stop words, lexical attributes, tokenizer exceptions, prefixes/suffixes/infixes, lemma data, char classes, morph rules, tag map) feeding into the Tokenizer, Lemmatizer, Token and Morphology.]

Adding special case tokenization rules

Most domains have at least some idiosyncrasies that require custom tokenization rules. These could be very specific expressions, or abbreviations only used in this particular field.

Here's how to add a special case rule to an existing Tokenizer instance:

import spacy
from spacy.symbols import ORTH, LEMMA, POS

nlp = spacy.load('en')
doc = nlp(u'gimme that') # phrase to tokenize
assert [w.text for w in doc] == [u'gimme', u'that'] # current tokenization

# add special case rule
special_case = [{ORTH: u'gim', LEMMA: u'give', POS: u'VERB'}, {ORTH: u'me'}]
nlp.tokenizer.add_special_case(u'gimme', special_case)
assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']
assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'me', u'that']

The special case doesn't have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring:

assert 'gimme' not in [w.text for w in nlp(u'gimme!')]
assert 'gimme' not in [w.text for w in nlp(u'("...gimme...?")')]
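For example, with the rule added above, the suffix "!" is split off first and the remaining substring "gimme" then matches the special case (a quick check; the exact behaviour assumes the default English punctuation rules):

assert [w.text for w in nlp(u'gimme!')] == [u'gim', u'me', u'!']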

The special case rules have precedence over the punctuation splitting:

from spacy.symbols import ORTH, LEMMA, TAG

special_case = [{ORTH: u'...gimme...?', LEMMA: u'give', TAG: u'VB'}]
nlp.tokenizer.add_special_case(u'...gimme...?', special_case)
assert len(nlp(u'...gimme...?')) == 1
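The attributes defined in the rule are applied to the resulting token. A quick check, assuming the special case above has been added:

token = nlp(u'...gimme...?')[0]
assert token.text == u'...gimme...?'
assert token.lemma_ == u'give'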

Because the special-case rules allow you to set arbitrary token attributes, such as the part-of-speech tag or lemma, they make a good mechanism for arbitrary fix-up rules. Having this logic live in the tokenizer isn't very satisfying from a design perspective, however, so the API may eventually be exposed on the Language class itself.

How spaCy's tokenizer works

spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition, and ease of alignment into the original string.

After consuming a prefix or infix, we consult the special cases again. We want the special cases to handle things like "don't" in English, and we want the same rule to work for "(don't)!". We do this by splitting off the open bracket, then the exclamation, then the close bracket, and finally matching the special-case. Here's an implementation of the algorithm in Python, optimized for readability rather than performance:

def tokenizer_pseudo_code(text, find_prefix, find_suffix,
                          find_infixes, special_cases):
    tokens = []
    for substring in text.split(' '):
        suffixes = []
        while substring:
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ''
            elif find_prefix(substring) is not None:
                split = find_prefix(substring)
                tokens.append(substring[:split])
                substring = substring[split:]
            elif find_suffix(substring) is not None:
                split = find_suffix(substring)
                suffixes.append(substring[split:])
                substring = substring[:split]
            elif find_infixes(substring):
                infixes = find_infixes(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                substring = substring[offset:]
            else:
                tokens.append(substring)
                substring = ''
        tokens.extend(reversed(suffixes))
    return tokens

The algorithm can be summarized as follows:

  1. Iterate over space-separated substrings
  2. Check whether we have an explicitly defined rule for this substring. If we do, use it.
  3. Otherwise, try to consume a prefix.
  4. If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
  5. If we didn't consume a prefix, try to consume a suffix.
  6. If we can't consume a prefix or suffix, look for "infixes" — stuff like hyphens etc.
  7. Once we can't consume any more of the string, handle it as a single token.
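To see the pseudo-code in action, here's a hypothetical driver with toy prefix, suffix and infix patterns and a plain-string special case. The real tokenizer works with compiled patterns and attribute dicts from the language data; this is only a sketch:

import re

# Toy patterns, just for this sketch.
prefix_pat = re.compile(r'''^[\("']''')
suffix_pat = re.compile(r'''[\)"'!.,]$''')
infix_pat = re.compile(r'''[-~]''')

def find_prefix(substring):
    match = prefix_pat.search(substring)
    return match.end() if match else None

def find_suffix(substring):
    match = suffix_pat.search(substring)
    return match.start() if match else None

def find_infixes(substring):
    return list(infix_pat.finditer(substring))

# Plain strings instead of attribute dicts, to keep the sketch simple.
special_cases = {"don't": ["do", "n't"]}

print(tokenizer_pseudo_code("(don't)!", find_prefix, find_suffix,
                            find_infixes, special_cases))
# ['(', 'do', "n't", ')', '!']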

Customizing spaCy's Tokenizer class

Let's imagine you wanted to create a tokenizer for a new language or specific domain. There are four things you would need to define:

  1. A dictionary of special cases. This handles things like contractions, units of measurement, emoticons, certain abbreviations, etc.
  2. A function prefix_search, to handle preceding punctuation, such as open quotes, open brackets, etc.
  3. A function suffix_search, to handle succeeding punctuation, such as commas, periods, close quotes, etc.
  4. A function infix_finditer, to handle non-whitespace separators, such as hyphens etc.

You shouldn't usually need to create a Tokenizer subclass. Standard usage is to use re.compile() to build a regular expression object, and pass its .search() and .finditer() methods:

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
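If you also want to supply special-case rules and an infix pattern when constructing the tokenizer, a sketch might look like the following. This assumes the Tokenizer constructor also accepts rules and infix_finditer keyword arguments; the patterns themselves are arbitrary examples:

import re
from spacy.symbols import ORTH, LEMMA
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']''')
infix_re = re.compile(r'''[-~]''')   # toy infix pattern
special_cases = {u"don't": [{ORTH: u"do"}, {ORTH: u"n't", LEMMA: u"not"}]}

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, rules=special_cases,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer)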

If you need to subclass the tokenizer instead, the relevant methods to specialize are find_prefix, find_suffix and find_infix.

Hooking an arbitrary tokenizer into the pipeline

The tokenizer is the first component of the processing pipeline and the only one that can't be replaced by writing to nlp.pipeline. This is because it has a different signature from all the other components: it takes a text and returns a Doc, whereas all other components expect to already receive a tokenized Doc.

[Diagram: the processing pipeline. The text is passed to the tokenizer, and the resulting Doc flows through the tensorizer, tagger, parser and ner components.]

To overwrite the existing tokenizer, you need to replace nlp.tokenizer with a custom function that takes a text, and returns a Doc.

nlp = spacy.load('en')
nlp.tokenizer = my_tokenizer
Argument    Type      Description
text        unicode   The raw text to tokenize.
RETURNS     Doc       The tokenized document.

Example: A custom whitespace tokenizer

To construct the tokenizer, we usually want attributes of the nlp pipeline. Specifically, we want the tokenizer to hold a reference to the vocabulary object. Let's say we have the following class as our tokenizer:

from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

As you can see, we need a Vocab instance to construct this — but we won't have it until we get back the loaded nlp object. The simplest solution is to build the tokenizer in two steps. This also means that you can reuse the "tokenizer factory" and initialise it with different instances of Vocab.

nlp = spacy.load('en')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
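As a quick sanity check (the example sentence is arbitrary), the whitespace tokenizer now leaves punctuation attached to the neighbouring words:

doc = nlp(u"What's happened to me? he thought.")
print([t.text for t in doc])
# ["What's", 'happened', 'to', 'me?', 'he', 'thought.']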