Saving, loading and data serialization
If you've been modifying the pipeline, vocabulary, vectors and entities,
or made updates to the model, you'll eventually want to save your progress – for example, everything that's in your
nlp object. This means you'll have to translate its contents and structure into a format that can be saved, like a file or a byte string.
This process is called serialization. spaCy comes with built-in serialization methods and supports the Pickle protocol.
All container classes, i.e. Language, Doc, Vocab and StringStore, have the following methods available: to_bytes, from_bytes, to_disk and from_disk.
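As a sketch of how these serialization methods fit together (this assumes spaCy and its English language data are installed; the blank English class is used here so no trained model is needed):

```python
from spacy.lang.en import English
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = English()  # blank pipeline, no trained model required
doc = nlp(u'This is a sentence.')

# Round-trip the Doc through a byte string
doc_bytes = doc.to_bytes()
new_doc = Doc(Vocab()).from_bytes(doc_bytes)
assert new_doc.text == doc.text
```

The same pattern works with to_disk and from_disk when you want a file instead of a byte string.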
For example, if you've processed a very large document, you can use
Doc.to_disk to save it to a file on your local machine. This will save the document and its tokens, as well as the vocabulary associated with the Doc.

```python
moby_dick = open('moby_dick.txt', 'r').read()  # read in a large document
doc = nlp(moby_dick)                           # process it
doc.to_disk('/moby_dick.bin')                  # save the processed Doc
```
If you need it again later, you can load it back into an empty Doc with an empty Vocab by calling Doc.from_disk():

```python
from spacy.tokens import Doc   # to create empty Doc
from spacy.vocab import Vocab  # to create empty Vocab

doc = Doc(Vocab()).from_disk('/moby_dick.bin')  # load processed Doc
```
Example: Saving and loading a document
For simplicity, let's assume you've added custom entities to a Doc, either manually or by using a match pattern. You can save it locally by calling Doc.to_disk(), and load it again via Doc.from_disk(). This will overwrite the existing object and return it.

```python
import spacy
from spacy.tokens import Span

text = u'Netflix is hiring a new VP of global policy'
nlp = spacy.load('en')
doc = nlp(text)
assert len(doc.ents) == 0  # Doc has no entities
doc.ents += (Span(doc, 0, 1, label=doc.vocab.strings[u'ORG']),)  # add entity
doc.to_disk('/path/to/doc')  # save Doc to disk

new_doc = nlp(text)
assert len(new_doc.ents) == 0  # new Doc has no entities
new_doc = new_doc.from_disk('/path/to/doc')  # load from disk and overwrite
assert len(new_doc.ents) == 1  # entity is now recognised!
assert [(ent.text, ent.label_) for ent in new_doc.ents] == [(u'Netflix', u'ORG')]
```
After training your model, you'll usually want to save its state, and load it back later. You can do this with the Language.to_disk() method. The directory will be created if it doesn't exist, and the whole pipeline will be written out. To make the model more convenient to deploy, we recommend wrapping it as a Python package.
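A minimal sketch of this round trip, using a blank English pipeline in place of a trained model and a temporary directory in place of a real output path:

```python
import tempfile
from pathlib import Path

import spacy
from spacy.lang.en import English

nlp = English()  # stands in for your trained model
model_dir = Path(tempfile.mkdtemp()) / 'en_example_model'
nlp.to_disk(model_dir)  # directory is created if it doesn't exist

nlp2 = spacy.load(model_dir)  # load the whole pipeline back
assert nlp2(u'Hello world').text == 'Hello world'
```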
Generating a model package
spaCy comes with a handy CLI command that will create all required files, and walk you through generating the meta data. You can also create the meta.json manually and place it in the model data directory, or supply a path to it using the --meta flag. For more info on this, see the package command documentation.

```shell
python -m spacy package /home/me/data/en_example_model /home/me/my_models
```
This command will create a model package directory that should look like this:
```
└── /
    ├── MANIFEST.in                 # to include meta.json
    ├── meta.json                   # model meta data
    ├── setup.py                    # setup file for pip installation
    └── en_example_model            # model directory
        ├── __init__.py             # init for pip installation
        └── en_example_model-1.0.0  # model data
```
You can also find templates for all files in our spaCy dev resources. If you're creating the package manually, keep in mind that the directories need to be named according to the naming conventions of lang_name and lang_name-version.
Customising the model setup
The meta.json includes the model details, like name, requirements and license, and lets you customise how the model should be initialised and loaded. You can define the language data to be loaded and the processing pipeline to execute.
| Setting | Type | Description |
| --- | --- | --- |
| `lang` | unicode | ID of the language class to initialise. |
| `pipeline` | list | A list of strings mapping to the IDs of pipeline factories to apply in that order. If not set, spaCy's default pipeline will be used. |
The load() method that comes with our model package templates will take care of putting all this together and returning a Language object with the loaded pipeline and data. If your model requires custom pipeline components, you should ship them with your model and register their factories via Language.factories.
Building the model package
To build the package, run the following command from within the directory. For more information on building Python packages, see the docs on Python's Setuptools.
```shell
python setup.py sdist
```
This will create a .tar.gz archive in the directory /dist. The model can be installed by pointing pip to the path of the archive:

```shell
pip install /path/to/en_example_model-1.0.0.tar.gz
```
You can then load the model via its name, en_example_model, or import it directly as a module and then call its load() method.
Loading a custom model package
To load a model from a data directory, you can use
spacy.load() with the local path. This will look for a meta.json in the directory and use the
pipeline settings to initialise a
Language class with a processing pipeline and load in the model data.
```python
nlp = spacy.load('/path/to/model')
```
If you want to load only the binary data, you'll have to create a Language class and call its from_disk() method:

```python
from spacy.lang.en import English

nlp = English().from_disk('/path/to/data')
```