If you have a sequence of documents to process, you should use the
Language.pipe() method. The method takes an iterator of texts and accumulates an internal buffer,
which it works on in parallel. It then yields the documents in order,
one by one. After a long and bitter struggle, the global interpreter
lock was freed around spaCy's main parsing loop in v0.100.3. This means that
.pipe() will be significantly faster in most practical situations, because it allows shared-memory parallelism.
for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
    pass
To make full use of the
.pipe() function, you might want to brush up on Python generators. Here are a few quick hints:
- Generator comprehensions can be written as (item for item in sequence).
- The itertools built-in library and the cytoolz package provide a lot of handy generator tools.
- Often you'll have an input stream that pairs text with some important metadata, e.g. a JSON document. To pair up the metadata with the processed Doc object, you should use the itertools.tee function to split the generator in two, and then izip (zip in Python 3) the extra stream to the document stream.
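The tee-and-zip pattern can be sketched without spaCy at all. In this sketch, records is a hypothetical (text, metadata) stream, and a plain uppercasing generator stands in for nlp.pipe():

```python
from itertools import tee

# Hypothetical input stream: (text, metadata) pairs,
# e.g. parsed from JSON lines.
records = [
    ("Hello world.", {"id": 1}),
    ("Another text.", {"id": 2}),
]

# Split the stream in two so we can feed texts to the pipeline
# while keeping the metadata stream in step.
stream1, stream2 = tee(records)
texts = (text for text, meta in stream1)
metas = (meta for text, meta in stream2)

# Stand-in for nlp.pipe(texts); a real pipeline would yield Doc objects.
processed = (text.upper() for text in texts)

# zip (izip in Python 2) pairs each processed result with its metadata,
# in order, without ever materializing the full stream.
paired = list(zip(processed, metas))
```

Because tee buffers only as much as the two iterators diverge, this keeps memory usage low even for long streams.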
Bringing your own annotations
spaCy generally assumes by default that your data is raw text. However,
sometimes your data is partially annotated, e.g. with pre-existing
tokenization, part-of-speech tags, etc. The most common situation is
that you have pre-defined tokenization. If you have a list of strings, you can create a
Doc object directly. Optionally, you can also specify a list of boolean values, indicating whether each word is followed by a space.
doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
If provided, the spaces list must be the same length as the words list. The spaces list affects the doc.text output and character-offset attributes such as span.end_char. If you don't provide a spaces sequence, spaCy will assume that all words are whitespace delimited.
good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
                  spaces=[False, True, False, False])
bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
assert bad_spaces.text == u'Hello , world !'
assert good_spaces.text == u'Hello, world!'
Once you have a
Doc object, you can write to its attributes to set the part-of-speech tags, syntactic dependencies, named
entities and other attributes. For details, see the respective usage guides.
Working with models
If your application depends on one or more models, you'll usually want to integrate them into your continuous integration workflow and build process. While spaCy provides a range of useful helpers for downloading, linking and loading models, the underlying functionality is entirely based on native Python packages. This allows your application to handle a model like any other package dependency.
Downloading and requiring model dependencies
spaCy's download command is mostly intended as a convenient, interactive wrapper. It performs
compatibility checks and prints detailed error messages and warnings.
However, if you're downloading models as part of an automated build
process, this only adds an unnecessary layer of complexity. If you know
which models your application needs, you should specify them directly.
Because all models are valid Python packages, you can add them to your application's
requirements.txt. If you're running your own internal PyPI installation, you can simply upload the models there. pip's requirements file format supports both package names to download via a PyPI server, as well as direct download URLs:
spacy>=2.0.0,<3.0.0
-e https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
All models are versioned and specify their spaCy dependency. This ensures
cross-compatibility and lets you specify exact version requirements for
each model. If you've trained your own model, you can use the package command to generate the required metadata and turn it into a loadable package.
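As a rough sketch, the packaging workflow looks like this; the input and output paths, and the resulting package directory name, are hypothetical placeholders:

```shell
# Generate a model package from a trained model directory
# (/path/to/model and /path/to/output are placeholder paths).
python -m spacy package /path/to/model /path/to/output

# Then build a source distribution so the model can be installed with pip
# (en_example_model-1.0.0 is an illustrative package name).
cd /path/to/output/en_example_model-1.0.0
python setup.py sdist
```

The resulting .tar.gz in dist/ can be uploaded to an internal package index or referenced directly in requirements.txt.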
Loading and testing models
Downloading models directly via pip won't call spaCy's link command, which creates symlinks for model shortcuts. This means that you'll have to run this command separately, or use the native import syntax to load the models:
import en_core_web_sm
nlp = en_core_web_sm.load()
In general, this approach is recommended for larger code bases, as it's
more "native", and doesn't depend on symlinks or rely on spaCy's loader
to resolve string names to model packages. If a model can't be imported, Python will raise an
ImportError immediately. And if a model is imported but not used, any linter will catch that.
Similarly, it'll give you more flexibility when writing tests that
require loading models. For example, instead of writing your own try/except logic around spaCy's loader, you can use pytest's
importorskip() function to only run a test if a specific model or model version is installed. Each model package exposes a
__version__ attribute which you can also use to perform your own version compatibility
checks before loading a model.
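Such a version check can be sketched with only the standard library. Here load_if_available is a hypothetical helper, and the model package name in the comment is illustrative:

```python
import importlib


def load_if_available(name, min_version=None):
    """Import a model package by name, returning None if it's missing
    or its __version__ is older than min_version, a (major, minor) tuple.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    if min_version is not None:
        version = getattr(module, "__version__", "0.0")
        # Compare only the major and minor components of the version string.
        parts = tuple(int(p) for p in version.split(".")[:2])
        if parts < min_version:
            return None
    return module


# A real model package would be checked like:
# nlp_pkg = load_if_available("en_core_web_sm", min_version=(2, 0))
```

Returning None instead of raising keeps the helper easy to use in test setup code, where a missing model should skip the test rather than fail it.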