
Training spaCy’s Statistical Models

This guide describes how to train new statistical models for spaCy's part-of-speech tagger, named entity recognizer and dependency parser. Once the model is trained, you can then save and load it.

Training basics

spaCy's models are statistical and every "decision" they make – for example, which part-of-speech tag to assign, or whether a word is a named entity – is a prediction. This prediction is based on the examples the model has seen during training. To train a model, you first need training data – examples of text, and the labels you want the model to predict. This could be a part-of-speech tag, a named entity or any other information.

The model is then shown the unlabelled text and will make a prediction. Because we know the correct answer, we can give the model feedback on its prediction in the form of an error gradient of the loss function that calculates the difference between the training example and the expected output. The greater the difference, the more significant the gradient and the updates to our model.

[Figure: training diagram showing the training data (text, label), the model's PREDICT step and predicted label, the GRADIENT, and the updated model with SAVE.]
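The relationship between the size of the error and the size of the update can be sketched with a toy model. This is not spaCy's implementation: it assumes a single-weight logistic model with a log loss, and the names `sigmoid`, `gradient` and `learning_rate` are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(weight, feature, label):
    """Gradient of the log loss with respect to the weight, for one example."""
    prediction = sigmoid(weight * feature)
    return (prediction - label) * feature

feature, label = 1.0, 1.0  # one toy example whose correct label is 1

# A prediction close to the label gives a small gradient,
# a prediction far from the label gives a large one.
small_error = abs(gradient(2.0, feature, label))
large_error = abs(gradient(-2.0, feature, label))

# One gradient descent step moves the prediction towards the label.
weight = 0.0
learning_rate = 0.5  # assumed value, chosen for illustration
before = sigmoid(weight * feature)
weight -= learning_rate * gradient(weight, feature, label)
after = sigmoid(weight * feature)
```

The same idea carries over to larger models: the further the prediction is from the expected output, the larger the correction applied to the weights.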

When training a model, we don't just want it to memorise our examples – we want it to come up with a theory that can be generalised across other examples. After all, we don't just want the model to learn that this one instance of "Amazon" right here is a company – we want it to learn that "Amazon", in contexts like this, is most likely a company. That's why the training data should always be representative of the data we want to process. A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text.

This also means that in order to know how the model is performing, and whether it's learning the right things, you don't only need training data – you'll also need evaluation data. If you only test the model with the data it was trained on, you'll have no idea how well it's generalising. If you want to train a model from scratch, you usually need at least a few hundred examples for both training and evaluation. To update an existing model, you can already achieve decent results with very few examples – as long as they're representative.
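Holding back evaluation data can be as simple as shuffling the annotated examples once and setting a fraction aside. The sketch below is plain Python rather than a spaCy API; `split_examples`, `eval_fraction` and the placeholder examples are assumptions made for illustration.

```python
import random

def split_examples(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, eval)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Placeholder data: ten numbered texts with dummy annotations.
examples = [("text %d" % i, {"label": i % 2}) for i in range(10)]
train_examples, eval_examples = split_examples(examples)
```

The evaluation examples are never shown to the model during training, so accuracy measured on them reflects how well the model generalises.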

How do I get training data?

Collecting training data may sound incredibly painful – and it can be, if you're planning a large-scale annotation project. However, if your main goal is to update an existing model's predictions – for example, spaCy's named entity recognition – the hard part is usually not creating the actual annotations. It's finding representative examples and extracting potential candidates. The good news is, if you've been noticing bad performance on your data, you likely already have some relevant text, and you can use spaCy to bootstrap a first set of training examples. For example, after processing a few sentences, you may end up with the following entities, some correct, some incorrect.

Text                                   Entity    Start  End  Label
Uber blew through $1 million a week    Uber      0      4    ORG
Android Pay expands to Canada          Android   0      7    PERSON
Android Pay expands to Canada          Canada    23     29   GPE
Spotify steps up Asia expansion        Spotify   0      7    ORG
Spotify steps up Asia expansion        Asia      17     21   NORP

Alternatively, the rule-based matcher can be a useful tool to extract tokens or combinations of tokens, as well as their start and end index in a document. In this case, we'll extract mentions of Google and assume they're an ORG.

Text                                     Entity   Start  End  Label
let me google this for you               google   7      13   ORG
Google Maps launches location sharing    Google   0      6    ORG
Google rebrands its business apps        Google   0      6    ORG
look what i found on google! 😂          google   21     27   ORG
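The extraction step can be approximated without spaCy to show where the start and end values come from. The sketch below uses a regular expression as a stand-in for the rule-based matcher, so it works on characters rather than tokens; `find_mentions` is a hypothetical helper, not part of spaCy.

```python
import re

def find_mentions(text, word, label):
    """Return (start, end, label) for each case-insensitive match of `word`."""
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]

texts = [
    "let me google this for you",
    "Google Maps launches location sharing",
]
# Assume every mention of Google is an ORG, as in the table above.
candidates = [(text, find_mentions(text, "google", "ORG")) for text in texts]
```

The end offset is exclusive, so `text[start:end]` gives back the matched word. Candidates found this way still need to be reviewed, since not every mention deserves the assumed label.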

Based on the few examples above, you can already create six training sentences with eight entities in total. Of course, what you consider a "correct annotation" will always depend on what you want the model to learn. While there are some entity annotations that are more or less universally correct – like Canada being a geopolitical entity – your application may have its very own definition of the NER annotation scheme.

train_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 29, "GPE")]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
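Because character offsets are easy to get wrong by one, it can help to sanity-check annotations before training. The following is a minimal sketch in plain Python; `check_offsets` is a hypothetical helper and the two checks it performs are only examples of what you might verify.

```python
def check_offsets(examples):
    """Return a list of problems found in (text, [(start, end, label)]) pairs."""
    problems = []
    for text, entities in examples:
        for start, end, label in entities:
            if not 0 <= start < end <= len(text):
                problems.append((text, start, end, "span outside the text"))
            elif text[start:end] != text[start:end].strip():
                problems.append((text, start, end, "span has leading or trailing whitespace"))
    return problems

examples = [
    ("Google rebrands its business apps", [(0, 6, "ORG")]),        # fine
    ("Google rebrands its business apps", [(0, 7, "ORG")]),        # includes a trailing space
    ("Google rebrands its business apps", [(28, 40, "PRODUCT")]),  # runs past the end
]
problems = check_offsets(examples)
```

An empty result does not prove the annotations are right, only that the spans are well-formed.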

Training with annotations

The GoldParse object collects the annotated training examples, also called the gold standard. It's initialised with the Doc object it refers to, and keyword arguments specifying the annotations, like tags or entities. Its job is to encode the annotations, keep them aligned and create the C-level data structures required for efficient access. Here's an example of a simple GoldParse for part-of-speech tags:

from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.gold import GoldParse
vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
doc = Doc(vocab, words=['I', 'like', 'stuff'])
gold = GoldParse(doc, tags=['N', 'V', 'N'])

Using the Doc and its gold-standard annotations, the model can be updated to learn a sentence of three words with their assigned part-of-speech tags. The tag map is part of the vocabulary and defines the annotation scheme. If you're training a new language model, this will let you map the tags present in the treebank you train on to spaCy's tag scheme.

doc = Doc(Vocab(), words=['Facebook', 'released', 'React', 'in', '2014'])
gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE'])

The same goes for named entities. The letters added before the labels refer to the tags of the BILUO scheme: O is a token outside an entity, U a single entity unit, B the beginning of an entity, I a token inside an entity and L the last token of an entity.
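To make the scheme concrete, here is a small sketch that produces BILUO tags from token-level entity spans. It is written in plain Python for illustration and is not spaCy's own conversion code; `biluo_tags` and its inclusive `(first_token, last_token, label)` span format are assumptions.

```python
def biluo_tags(n_tokens, spans):
    """Build BILUO tags from (first_token, last_token, label) spans (inclusive)."""
    tags = ["O"] * n_tokens
    for first, last, label in spans:
        if first == last:
            tags[first] = "U-" + label          # single-token entity
        else:
            tags[first] = "B-" + label          # beginning
            for i in range(first + 1, last):
                tags[i] = "I-" + label          # inside
            tags[last] = "L-" + label           # last token
    return tags

words = ["Facebook", "released", "React", "in", "2014"]
single = biluo_tags(len(words), [(0, 0, "ORG"), (2, 2, "TECHNOLOGY"), (4, 4, "DATE")])
# A hypothetical three-token entity followed by one token outside any entity.
multi = biluo_tags(4, [(0, 2, "ORG")])
```

For the five-word example, every entity is a single token, so only U and O tags appear; B, I and L are needed as soon as an entity spans several tokens.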

[Figure: training diagram showing the training data (text, label), Doc, GoldParse, update, nlp and optimizer.]

Of course, it's not enough to only show a model a single example once. Especially if you only have few examples, you'll want to train for a number of iterations. At each iteration, the training data is shuffled to ensure the model doesn't make any generalisations based on the order of examples. Another technique to improve the learning results is to set a dropout rate, a rate at which to randomly "drop" individual features and representations. This makes it harder for the model to memorise the training data. For example, a 0.25 dropout means that each feature or internal representation has a 1/4 likelihood of being dropped.
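The effect of a dropout rate can be illustrated on a plain list of feature values. This is a simplified sketch, not how spaCy applies dropout internally: it only zeroes values and leaves out the rescaling that many implementations perform. `apply_dropout` is a hypothetical helper.

```python
import random

def apply_dropout(features, rate, rng):
    """Zero each feature independently with probability `rate`."""
    return [0.0 if rng.random() < rate else value for value in features]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
features = [1.0] * 10000
dropped = apply_dropout(features, 0.25, rng)
fraction_dropped = dropped.count(0.0) / len(dropped)  # close to 0.25
```

Because a different random subset is dropped on every update, the model cannot rely on any single feature being present.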

Example training loop

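import random
from spacy.gold import GoldParse

# nlp, get_data and train_data are assumed to exist; see the table below
optimizer = nlp.begin_training(get_data)
for itn in range(100):
    random.shuffle(train_data)  # reshuffle on every iteration
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
nlp.to_disk('/model')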

Name        Description
train_data  The training data.
get_data    A function converting the training data to spaCy's JSON format.
doc         Doc objects.
gold        GoldParse objects.
drop        Dropout rate. Makes it harder for the model to just memorise the data.
optimizer   Callable to update the model's weights.

Training the named entity recognizer

All spaCy models support online learning, so you can update a pre-trained model with new examples. To update the model, you first need to create an instance of GoldParse with the entity labels you want to learn. You'll usually need to provide many examples to meaningfully improve the system: a few hundred is a good start, although more is better.

You should avoid iterating over the same few examples multiple times, or the model is likely to "forget" how to annotate other examples. If you iterate over the same few examples, you're effectively changing the loss function. The optimizer will find a way to minimize the loss on your examples, without regard for the consequences on the examples it's no longer paying attention to. One way to avoid this "catastrophic forgetting" problem is to "remind" the model of other examples by augmenting your annotations with sentences annotated with entities automatically recognised by the original model. Ultimately, this is an empirical process: you'll need to experiment on your own data to find a solution that works best for you.
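One way to picture the "reminding" strategy is as a data-mixing step that runs before training. The sketch below is plain Python; `mix_in_revision`, its `ratio` parameter and the `fake_predict` stand-in for the original model are all hypothetical, and the right proportion of revision examples is something to find by experiment.

```python
import random

def mix_in_revision(new_examples, unlabelled_texts, predict, ratio=1.0, seed=0):
    """Combine hand-labelled examples with texts labelled by the original model.

    `predict` is any callable mapping a text to a list of (start, end, label).
    """
    n_revision = min(len(unlabelled_texts), int(len(new_examples) * ratio))
    rng = random.Random(seed)
    revision_texts = rng.sample(unlabelled_texts, n_revision)
    revision = [(text, predict(text)) for text in revision_texts]
    mixed = list(new_examples) + revision
    rng.shuffle(mixed)
    return mixed

def fake_predict(text):
    # Stand-in for running the original, unmodified model over the text.
    return [(0, 4, "ORG")] if text.startswith("Uber") else []

new_examples = [("Horses are tall", [(0, 6, "ANIMAL")])]
unlabelled = ["Uber blew through $1 million a week", "nothing to see here"]
mixed = mix_in_revision(new_examples, unlabelled, fake_predict)
```

The revision examples carry whatever the original model predicted, errors included, so they preserve existing behaviour rather than improve it.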

Example: Training an additional entity type

This script shows how to add a new entity type to an existing pre-trained NER model. To keep the example short and simple, only a few sentences are provided as examples. In practice, you'll need many more — a few hundred would be a good start. You will also likely need to mix in examples of other entity types, which might be obtained by running the entity recognizer over unlabelled sentences, and adding their annotations to the training set.

The actual training is performed by looping over the examples, and calling nlp.update(). The update method steps through the words of the input. At each word, it makes a prediction. It then consults the annotations provided on the GoldParse instance, to see whether it was right. If it was wrong, it adjusts its weights so that the correct action will score higher next time.
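The predict, compare and adjust cycle described above can be sketched with a toy perceptron-style tagger. This is only an analogy for the control flow: spaCy's models compute gradients through a neural network, whereas the hypothetical `update` function below simply adds and subtracts fixed amounts from per-word weights.

```python
from collections import defaultdict

def update(weights, words, gold_tags, tag_set):
    """Step through the words once; return the number of wrong predictions."""
    errors = 0
    for word, gold in zip(words, gold_tags):
        # Predict the highest-scoring tag for this word.
        guess = max(tag_set, key=lambda tag: weights[(word, tag)])
        if guess != gold:
            weights[(word, gold)] += 1.0   # correct action scores higher next time
            weights[(word, guess)] -= 1.0  # wrong action scores lower
            errors += 1
    return errors

weights = defaultdict(float)
tag_set = ["O", "U-ORG"]
words = ["Uber", "blew"]
gold_tags = ["U-ORG", "O"]

first_pass_errors = update(weights, words, gold_tags, tag_set)
second_pass_errors = update(weights, words, gold_tags, tag_set)
```

On the first pass the untrained weights tag "Uber" as O, which triggers an adjustment; on the second pass the same sentence is tagged correctly.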

Example: Training an NER system from scratch

This example is written to be self-contained and reasonably transparent. To achieve that, it duplicates some of spaCy's internal functionality. Specifically, in this example, we don't use spaCy's built-in Language class to wire together the Vocab, Tokenizer and EntityRecognizer. Instead, we write our own simple Pipeline class, so that it's easier to see how the pieces interact.

Training the tagger and parser

Training a similarity model

Training a text classification model

Example: Training spaCy's text classifier

This example shows how to use and train spaCy's new TextCategorizer pipeline component on IMDB movie reviews.

Saving and loading models

After training your model, you'll usually want to save its state, and load it back later. You can do this with the Language.to_disk() method:

nlp.to_disk('/home/me/data/en_example_model')

The directory will be created if it doesn't exist, and the whole pipeline will be written out. To make the model more convenient to deploy, we recommend wrapping it as a Python package.

Generating a model package

spaCy comes with a handy CLI command that will create all required files, and walk you through generating the meta data. You can also create the meta.json manually and place it in the model data directory, or supply a path to it using the --meta flag. For more info on this, see the package docs.

spacy package /home/me/data/en_example_model /home/me/my_models

This command will create a model package directory that should look like this:

Directory structure

└── /
    ├── MANIFEST.in                   # to include meta.json
    ├── meta.json                     # model meta data
    ├── setup.py                      # setup file for pip installation
    └── en_example_model              # model directory
        ├── __init__.py               # init for pip installation
        └── en_example_model-1.0.0    # model data

You can also find templates for all files in our spaCy dev resources. If you're creating the package manually, keep in mind that the directories need to be named according to the naming conventions of lang_name and lang_name-version.

Customising the model setup

The meta.json includes the model details, like name, requirements and license, and lets you customise how the model should be initialised and loaded. You can define the language data to be loaded and the processing pipeline to execute.

Setting   Type     Description
lang      unicode  ID of the language class to initialise.
pipeline  list     A list of strings mapping to the IDs of pipeline factories to apply in that order. If not set, spaCy's default pipeline will be used.
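As an illustration of how these settings fit together, the sketch below builds a minimal meta dictionary in Python and derives the directory names used by the package layout. The values are hypothetical (the model name, the version and the choice of pipeline IDs are assumptions), and a real meta.json typically contains more fields, such as the license and author.

```python
import json

meta = {
    "lang": "en",                             # ID of the language class
    "name": "example_model",                  # hypothetical model name
    "version": "1.0.0",                       # hypothetical version
    "pipeline": ["tagger", "parser", "ner"],  # assumed factory IDs, applied in this order
}

# Serialised form, as it might be written to meta.json.
meta_json = json.dumps(meta, indent=2, sort_keys=True)

# Directory names following the lang_name and lang_name-version conventions.
package_name = "{lang}_{name}".format(**meta)
data_dir_name = "{}-{}".format(package_name, meta["version"])
```

With these values, the derived names match the en_example_model and en_example_model-1.0.0 directories shown in the package structure.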

The load() method that comes with our model package templates will take care of putting all this together and returning a Language object with the loaded pipeline and data. If your model requires custom pipeline components, you should ship them with your model and register their factories via set_factory().

spacy.set_factory('custom_component', custom_component_factory)

Building the model package

To build the package, run the following command from within the package directory. For more information on building Python packages, see the docs on Python's Setuptools.

python setup.py sdist

This will create a .tar.gz archive in a directory /dist. The model can be installed by pointing pip to the path of the archive:

pip install /path/to/en_example_model-1.0.0.tar.gz

You can then load the model via its name, en_example_model, or import it directly as a module and then call its load() method.

Loading a custom model package

To load a model from a data directory, you can use spacy.load() with the local path. This will look for a meta.json in the directory and use the lang and pipeline settings to initialise a Language class with a processing pipeline and load in the model data.

nlp = spacy.load('/path/to/model')

If you want to load only the binary data, you'll have to create a Language class and call from_disk instead.

from spacy.lang.en import English
nlp = English().from_disk('/path/to/data')

Example: How we're training and packaging models for spaCy

Publishing a new version of spaCy often means re-training all available models – currently, that's 6 models for 5 languages. To make this run smoothly, we're using an automated build process and a spacy train template that looks like this:

spacy train {lang} {models_dir}/{name} {train_data} {dev_data} -m meta/{name}.json -V {version} -g {gpu_id} -n {n_epoch} -ns {n_sents}
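The placeholders in the template use the same syntax as Python's string formatting, so a build script could fill them in with str.format. This is a sketch of that idea, not the actual build tooling; every value in `settings` is a hypothetical placeholder.

```python
template = (
    "spacy train {lang} {models_dir}/{name} {train_data} {dev_data} "
    "-m meta/{name}.json -V {version} -g {gpu_id} -n {n_epoch} -ns {n_sents}"
)

# Hypothetical settings for a single model build.
settings = {
    "lang": "en",
    "models_dir": "models",
    "name": "en_example_model",
    "train_data": "train.json",
    "dev_data": "dev.json",
    "version": "1.0.0",
    "gpu_id": -1,
    "n_epoch": 30,
    "n_sents": 0,
}

command = template.format(**settings)
```

Keeping the settings in one dictionary per model makes it straightforward to loop over all models and languages and generate one command for each.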

In a directory meta, we keep meta.json templates for the individual models, containing all relevant information that doesn't change across versions, like the name, description, author info and training data sources. When we train the model, we pass in the path to the meta template file as the --meta argument, and specify the current model version as the --version argument.

On each epoch, the model is saved out with a meta.json using our template and added properties, like the pipeline, accuracy scores and the spacy_version used to train the model. After training completion, the best model is selected automatically and packaged using the package command. Since a full meta file is already present on the trained model, no further setup is required to build a valid model package.

spacy package -f {best_model} dist/
cd dist/{model_name}
python setup.py sdist
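Selecting the best model from the per-epoch output amounts to comparing the accuracy scores stored with each saved model. The sketch below shows one way this could look; the `accuracy` and `ents_f` keys, the `path` field and the scores themselves are assumptions for illustration, and the actual selection criteria may differ.

```python
def select_best(metas, metric):
    """Return the meta dict with the highest score for the given metric."""
    return max(metas, key=lambda meta: meta["accuracy"][metric])

# Hypothetical meta data for three saved epochs.
epoch_metas = [
    {"path": "model0", "accuracy": {"ents_f": 78.1}},
    {"path": "model1", "accuracy": {"ents_f": 81.4}},
    {"path": "model2", "accuracy": {"ents_f": 80.9}},
]

best = select_best(epoch_metas, "ents_f")
best_model = best["path"]
```

The selected path would then take the place of {best_model} in the package command.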

This process allows us to quickly trigger the model training and build process for all available models and languages, and generate the correct meta data automatically.