Training the Named Entity Recognizer

All spaCy models support online learning, so you can update a pre-trained model with new examples. You can even add new classes to an existing model, to recognise a new entity type, part-of-speech tag, or syntactic relation. Updating an existing model is particularly useful as a "quick and dirty" solution if you have only a few corrections or annotations.

Improving accuracy on existing entity types

To update the model, you first need to create an instance of GoldParse, with the entity labels you want to learn. You'll usually need to provide many examples to meaningfully improve the system; a few hundred is a good start, although more is better.
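A single training example pairs a Doc with a GoldParse holding (start, end, label) character offsets. Here's a minimal sketch using a blank English pipeline; the example text and offsets are purely illustrative:

from spacy.lang.en import English
from spacy.gold import GoldParse

nlp = English()  # blank English pipeline, tokenizer only
doc = nlp.make_doc('Uber blew through $1 million a week')
# Entity annotations as (start_char, end_char, label) offsets into the text
gold = GoldParse(doc, entities=[(0, 4, 'ORG')])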

[Diagram: the training loop. Each training example's text and label become a Doc and a GoldParse, which are passed to nlp.update() together with the optimizer to adjust the model's weights.]

You should avoid iterating over the same few examples multiple times, or the model is likely to "forget" how to annotate other examples. By training repeatedly on a small set, you're effectively changing the loss function: the optimizer will find a way to minimize the loss on your examples, without regard for the consequences on the examples it's no longer paying attention to.

One way to avoid this "catastrophic forgetting" problem is to "remind" the model of other examples, by augmenting your annotations with sentences whose entities were automatically recognised by the original model, as sketched below. Ultimately, this is an empirical process: you'll need to experiment on your own data to find a solution that works best for you.
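A minimal sketch of generating such "revision" examples, assuming the original, unmodified pipeline is available as nlp and revision_texts is a list of raw strings (both names are hypothetical):

def make_revision_data(nlp, revision_texts):
    """Annotate raw texts with the original model's own entity
    predictions, so they can be mixed into the new training data."""
    revision_data = []
    for text in revision_texts:
        doc = nlp(text)
        # Treat the model's current predictions as gold annotations
        entities = [(ent.start_char, ent.end_char, ent.label_)
                    for ent in doc.ents]
        revision_data.append((text, entities))
    return revision_data

# train_data += make_revision_data(nlp, revision_texts)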

Example

import random
from spacy.lang.en import English
from spacy.gold import GoldParse, biluo_tags_from_offsets

def main(model_dir=None):
    train_data = [
        ('Who is Shaka Khan?',
            [(len('Who is '), len('Who is Shaka Khan'), 'PERSON')]),
        ('I like London and Berlin.',
            [(len('I like '), len('I like London'), 'LOC'),
            (len('I like London and '), len('I like London and Berlin'), 'LOC')])
    ]
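    # Create a blank pipeline with a tensorizer and the entity recognizer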
    nlp = English(pipeline=['tensorizer', 'ner'])
    get_data = lambda: reformat_train_data(nlp.tokenizer, train_data)
    optimizer = nlp.begin_training(get_data)
    for itn in range(100):
        random.shuffle(train_data)
        losses = {}
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            # Wrap the raw annotations in a GoldParse and update the weights
            gold = GoldParse(doc, entities=entity_offsets)
            nlp.update([doc], [gold], drop=0.5, sgd=optimizer, losses=losses)
    if model_dir is not None:
        nlp.to_disk(model_dir)

def reformat_train_data(tokenizer, examples):
    """Reformat data to match JSON format"""
    output = []
    for text, entity_offsets in examples:
        doc = tokenizer(text)
        ner_tags = biluo_tags_from_offsets(doc, entity_offsets)
        words = [w.text for w in doc]
        tags = ['-'] * len(doc)
        heads = [0] * len(doc)
        deps = [''] * len(doc)
        sentence = (range(len(doc)), words, tags, heads, deps, ner_tags)
        output.append((text, [(sentence, [])]))
    return output


Saving and loading

After training your model, you'll usually want to save its state and load it back later. You can do this with the Language.to_disk() method:

nlp.to_disk('/home/me/data/en_technology')
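To load the model back later, you can point spacy.load() at the same directory. A quick sketch; the path matches the save call above, and the example sentence is illustrative:

import spacy

nlp = spacy.load('/home/me/data/en_technology')
doc = nlp('There were plenty of startups at the conference.')
print([(ent.text, ent.label_) for ent in doc.ents])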

To make the model more convenient to deploy, we recommend wrapping it as a Python package, so that you can install it via pip and load it as a module. spaCy comes with a handy package CLI command to create all required files and directories.

spacy package /home/me/data/en_technology /home/me/my_models

To build the package and create a .tar.gz archive, run python setup.py sdist from within the newly created package directory.
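You can then install the archive with pip and load it like any other model package. A sketch, assuming the default package metadata; the version in the filename is hypothetical:

pip install dist/en_technology-0.0.0.tar.gz

After installation, the package exposes a load() method:

import en_technology
nlp = en_technology.load()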