Saving, loading and data serialization

Serialization 101

If you've been modifying the pipeline, vocabulary, vectors and entities, or made updates to the model, you'll eventually want to save your progress – for example, everything that's in your nlp object. This means you'll have to translate its contents and structure into a format that can be saved, like a file or a byte string. This process is called serialization. spaCy comes with built-in serialization methods and supports the Pickle protocol.
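Because the Pickle protocol is supported, you can also use Python's standard pickle module directly. Here's a minimal sketch, assuming the English model is installed:

import pickle
import spacy

nlp = spacy.load('en')
doc = nlp(u'This is a sentence.')

# pickling a Doc also serialises the shared Vocab it refers to,
# so the payload can be fairly large
data = pickle.dumps(doc)
doc_copy = pickle.loads(data)
assert doc_copy.text == doc.text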

All container classes, i.e. Language, Doc, Vocab and StringStore, have the following methods available:

Method     | Returns | Example
to_bytes   | bytes   | nlp.to_bytes()
from_bytes | object  | nlp.from_bytes(bytes)
to_disk    | -       | nlp.to_disk('/path')
from_disk  | object  | nlp.from_disk('/path')
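To illustrate the byte-string methods: you can serialize a Doc to bytes and restore it into a new Doc that shares the same Vocab. A minimal sketch, assuming the English model is installed:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en')
doc = nlp(u'Give it back! He pleaded.')

doc_bytes = doc.to_bytes() # serialize the Doc to a byte string
new_doc = Doc(nlp.vocab).from_bytes(doc_bytes) # restore it into a fresh Doc
assert new_doc.text == doc.text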

For example, if you've processed a very large document, you can use Doc.to_disk to save it to a file on your local machine. This will save the document and its tokens, as well as the vocabulary associated with the Doc.

moby_dick = open('moby_dick.txt', 'r').read() # open and read a large document
doc = nlp(moby_dick) # process it
doc.to_disk('/moby_dick.bin') # save the processed Doc

If you need it again later, you can load it back into an empty Doc with an empty Vocab by calling from_disk():

from spacy.tokens import Doc # to create empty Doc
from spacy.vocab import Vocab # to create empty Vocab

doc = Doc(Vocab()).from_disk('/moby_dick.bin') # load processed Doc

Example: Saving and loading a document

For simplicity, let's assume you've added custom entities to a Doc, either manually, or by using a match pattern. You can save it locally by calling Doc.to_disk(), and load it again via Doc.from_disk(). This will overwrite the existing object and return it.

import spacy
from spacy.tokens import Span

text = u'Netflix is hiring a new VP of global policy'

nlp = spacy.load('en')
doc = nlp(text)
assert len(doc.ents) == 0 # Doc has no entities
doc.ents += (Span(doc, 0, 1, label=doc.vocab.strings[u'ORG']),) # add entity
doc.to_disk('/path/to/doc') # save Doc to disk

new_doc = nlp(text)
assert len(new_doc.ents) == 0 # new Doc has no entities
new_doc = new_doc.from_disk('/path/to/doc') # load from disk and overwrite
assert len(new_doc.ents) == 1 # entity is now recognised!
assert [(ent.text, ent.label_) for ent in new_doc.ents] == [(u'Netflix', u'ORG')]

Saving models

After training your model, you'll usually want to save its state, and load it back later. You can do this with the Language.to_disk() method:

nlp.to_disk('/home/me/data/en_example_model')

The directory will be created if it doesn't exist, and the whole pipeline will be written out. To make the model more convenient to deploy, we recommend wrapping it as a Python package.

Generating a model package

spaCy comes with a handy CLI command that will create all required files, and walk you through generating the metadata. You can also create the meta.json manually and place it in the model data directory, or supply a path to it using the --meta flag. For more info on this, see the package docs.

spacy package /home/me/data/en_example_model /home/me/my_models

This command will create a model package directory that should look like this:

Directory structure

└── /
    ├── MANIFEST.in                # to include meta.json
    ├── meta.json                  # model metadata
    ├── setup.py                   # setup file for pip installation
    └── en_example_model           # model directory
        ├── __init__.py            # init for pip installation
        └── en_example_model-1.0.0 # model data

You can also find templates for all files in our spaCy dev resources. If you're creating the package manually, keep in mind that the directories need to be named according to the naming conventions of lang_name and lang_name-version.

Customising the model setup

The meta.json includes the model details, like name, requirements and license, and lets you customise how the model should be initialised and loaded. You can define the language data to be loaded and the processing pipeline to execute.

Setting  | Type    | Description
lang     | unicode | ID of the language class to initialise.
pipeline | list    | A list of strings mapping to the IDs of pipeline factories to apply in that order. If not set, spaCy's default pipeline will be used.
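Putting these settings together, a meta.json might look like the following sketch – the name, version and description are placeholders, and pipeline lists the factory IDs to apply in order:

{
    "lang": "en",
    "name": "example_model",
    "version": "1.0.0",
    "spacy_version": ">=2.0.0",
    "description": "Example model with a custom pipeline",
    "pipeline": ["tagger", "parser", "ner"]
}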

The load() method that comes with our model package templates will take care of putting all this together and returning a Language object with the loaded pipeline and data. If your model requires custom pipeline components, you should ship them with your model and register their factories via set_factory():

spacy.set_factory('custom_component', custom_component_factory)
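As a sketch of what the factory passed to set_factory() might look like – the exact signature can vary by spaCy version; here it's assumed to receive the nlp object and optional config:

def custom_component_factory(nlp, **cfg):
    # the component itself is a callable that takes a Doc,
    # modifies it in place and returns it
    def custom_component(doc):
        # custom processing goes here, e.g. setting entities or attributes
        return doc
    return custom_component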

Building the model package

To build the package, run the following command from within the directory. For more information on building Python packages, see the docs on Python's Setuptools.

python setup.py sdist

This will create a .tar.gz archive in a /dist directory. You can then install the model by pointing pip to the path of the archive:

pip install /path/to/en_example_model-1.0.0.tar.gz

You can then load the model via its name, en_example_model, or import it directly as a module and then call its load() method.
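For example, once the package is installed, both of the following should give you the Language object:

import spacy
nlp = spacy.load('en_example_model') # load via the package name

import en_example_model
nlp = en_example_model.load() # or import the package and call load()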

Loading a custom model package

To load a model from a data directory, you can use spacy.load() with the local path. This will look for a meta.json in the directory and use the lang and pipeline settings to initialise a Language class with a processing pipeline and load in the model data.

nlp = spacy.load('/path/to/model')

If you want to load only the binary data, you'll have to create an instance of the Language class and call its from_disk() method instead.

from spacy.lang.en import English
nlp = English().from_disk('/path/to/data')