Language processing pipelines

Pipelines 101

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tensorizer, a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

[Figure: the processing pipeline. Text is passed to nlp, which runs the tokenizer, tensorizer, tagger, parser and ner in order and returns a Doc.]
Name        Component           Creates                                               Description
tokenizer   Tokenizer           Doc                                                   Segment text into tokens.
tensorizer  TokenVectorEncoder  Doc.tensor                                            Create feature representation tensor for Doc.
tagger      Tagger              Doc[i].tag                                            Assign part-of-speech tags.
parser      DependencyParser    Doc[i].head, Doc[i].dep, Doc.sents, Doc.noun_chunks   Assign dependency labels.
ner         EntityRecognizer    Doc.ents, Doc[i].ent_iob, Doc[i].ent_type             Detect and label named entities.

The processing pipeline always depends on the statistical model and its capabilities. For example, a pipeline can only include an entity recognizer component if the model includes data to make predictions of entity labels. This is why each model will specify the pipeline to use in its meta data, as a simple list containing the component names:

"pipeline": ["tensorizer", "tagger", "parser", "ner"]

Although you can mix and match pipeline components, their order and combination is usually important. Some components may require certain modifications on the Doc to process it. For example, the default pipeline first applies the tensorizer, which pre-processes the doc and encodes its internal meaning representations as an array of floats, also called a tensor. This includes the tokens and their context, which is required for the next component, the tagger, to make predictions of the part-of-speech tags. Because spaCy's models are neural network models, they only "speak" tensors and expect the input Doc to have a tensor.
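To illustrate why the order matters, here is a minimal plain-Python sketch. It does not use spaCy: ToyDoc, toy_tensorizer, toy_tagger and run_pipeline are hypothetical stand-ins, and the "tensor" is simply one float per token.

```python
class ToyDoc(object):
    """Hypothetical stand-in for a Doc: words plus optional annotations."""
    def __init__(self, words):
        self.words = words
        self.tensor = None  # filled in by the tensorizer stand-in
        self.tags = None    # filled in by the tagger stand-in


def toy_tensorizer(doc):
    # Assumption: one float per token stands in for the real feature tensor.
    doc.tensor = [float(len(word)) for word in doc.words]
    return doc


def toy_tagger(doc):
    # The tagger stand-in refuses to run without the tensor it depends on.
    if doc.tensor is None:
        raise ValueError("tagger needs doc.tensor; run the tensorizer first")
    doc.tags = ["LONG" if value > 3 else "SHORT" for value in doc.tensor]
    return doc


def run_pipeline(doc, pipeline):
    # Each component receives the Doc and returns it for the next one.
    for component in pipeline:
        doc = component(doc)
    return doc


good = run_pipeline(ToyDoc(["This", "is", "a", "sentence"]),
                    [toy_tensorizer, toy_tagger])

try:
    run_pipeline(ToyDoc(["This", "is"]), [toy_tagger, toy_tensorizer])
    wrong_order_failed = False
except ValueError:
    wrong_order_failed = True
```

With the tensorizer first, the tagger stand-in finds the tensor it needs. With the order swapped, it fails before it can set any tags.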

How pipelines work

spaCy lets you create your own pipelines out of reusable components. This includes spaCy's default tensorizer, tagger, parser and entity recognizer, but also your own custom processing functions. A pipeline component can be added to an already existing nlp object, specified when initialising a Language class, or defined within a model package.

When you load a model, spaCy first consults the model's meta.json. The meta typically includes the model details, the ID of a language class, and an optional list of pipeline components. spaCy then does the following:

  1. Look up pipeline IDs in the available pipeline factories.
  2. Initialise the pipeline components by calling their factories with the Vocab as an argument. This gives each factory and component access to the pipeline's shared data, like strings, morphology and annotation scheme.
  3. Load the language class and data for the given ID via get_lang_class.
  4. Pass the path to the model data to the Language class and return it.

So when you call this...

nlp = spacy.load('en')

... the model tells spaCy to use the pipeline ["tensorizer", "tagger", "parser", "ner"]. spaCy will then look up each string in its internal factories registry and initialise the individual components. It'll then load spacy.lang.en.English, pass it the path to the model's data directory, and return it for you to use as the nlp object.
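The loading steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: FACTORIES, LANG_CLASSES, ToyVocab, ToyEnglish, make_named_factory and toy_load are hypothetical stand-ins, not spaCy internals, and the real order and details inside spaCy may differ.

```python
# All names below are hypothetical stand-ins, not spaCy internals.
FACTORIES = {}     # pipeline ID -> factory function
LANG_CLASSES = {}  # language ID -> language class


class ToyVocab(object):
    """Stand-in for the shared Vocab."""


class ToyEnglish(object):
    """Stand-in for spacy.lang.en.English."""
    def __init__(self, vocab, pipeline, path):
        self.vocab = vocab
        self.pipeline = pipeline
        self.path = path  # where the model data would be read from


def make_named_factory(name):
    # Builds a factory whose component only records its own name.
    def factory(vocab):
        def component(doc):
            return doc + [name]
        return component
    return factory


for pipe_id in ["tensorizer", "tagger", "parser", "ner"]:
    FACTORIES[pipe_id] = make_named_factory(pipe_id)
LANG_CLASSES["en"] = ToyEnglish


def toy_load(meta, path):
    vocab = ToyVocab()
    # Steps 1 and 2: resolve each pipeline ID and call its factory with the vocab.
    pipeline = [FACTORIES[pipe_id](vocab) for pipe_id in meta.get("pipeline", [])]
    # Step 3: look up the language class for the ID given in the meta.
    lang_class = LANG_CLASSES[meta["lang"]]
    # Step 4: hand over the model path and return the language object.
    return lang_class(vocab, pipeline, path)


meta = {"lang": "en", "pipeline": ["tensorizer", "tagger", "parser", "ner"]}
nlp = toy_load(meta, "/path/to/model")

# Run the resolved components in order on a list standing in for a Doc.
doc = []
for component in nlp.pipeline:
    doc = component(doc)
```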

When you call nlp on a text, spaCy will tokenize it and then call each component on the Doc, in order. Components all return the modified document, which is then processed by the component next in the pipeline.

The pipeline under the hood

doc = nlp.make_doc(u'This is a sentence')
for proc in nlp.pipeline:
    doc = proc(doc)

Creating pipeline components and factories

spaCy lets you customise the pipeline with your own components. Components are functions that receive a Doc object, modify and return it. If your component is stateful, you'll want to create a new one for each pipeline. You can do that by defining and registering a factory which receives the shared Vocab object and returns a component.

Creating a component

A component receives a Doc object and performs the actual processing – for example, using the current weights to make a prediction and set some annotation on the document. By adding a component to the pipeline, you'll get access to the Doc at any point during processing – instead of only being able to modify it afterwards.

Name     Type  Description
doc      Doc   The Doc object processed by the previous component.
returns  Doc   The Doc object processed by this pipeline component.

When creating a new Language class, you can pass it a list of pipeline component functions to execute in that order. You can also add it to an existing pipeline by modifying nlp.pipeline – just be careful not to overwrite a pipeline or its components by accident!

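# Create a new Language object with a pipeline
# (my_component is any function that receives a Doc and returns it)
import spacy
from spacy.language import Language
nlp = Language(pipeline=[my_component])

# Modify an existing pipeline
nlp = spacy.load('en')
nlp.pipeline.append(my_component)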

Creating a factory

A factory is a function that returns a pipeline component. It's called with the Vocab object, to give it access to the shared data between components – for example, the strings, morphology, vectors or annotation scheme. Factories are useful for creating stateful components, especially ones which depend on shared data.

Name     Type      Description
vocab    Vocab     Shared data between components, including strings, morphology, vectors etc.
returns  callable  The pipeline component.
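As a rough illustration of why factories help with stateful components, the sketch below builds a fresh component on every factory call while sharing the same vocab. CountingComponent, counting_factory and the dict used as a vocab are hypothetical names chosen for this example, not part of spaCy.

```python
class CountingComponent(object):
    """Hypothetical stateful component: counts the docs it has processed."""
    def __init__(self, vocab):
        self.vocab = vocab  # shared data, handed in by the factory
        self.n_docs = 0     # per-component state

    def __call__(self, doc):
        self.n_docs += 1
        return doc


def counting_factory(vocab):
    # A new component per call, so two pipelines never share a counter.
    return CountingComponent(vocab)


shared_vocab = {"strings": ["hello"]}  # assumption: a dict stands in for Vocab
first = counting_factory(shared_vocab)
second = counting_factory(shared_vocab)

first("doc 1")
first("doc 2")
second("doc 3")
```

Both components point at the same shared data, but each keeps its own count.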

By creating a factory, you're essentially telling spaCy how to get the pipeline component once the vocab is available. Factories need to be registered via set_factory() and by assigning them a unique ID. This ID can be added to the pipeline as a string. When creating a pipeline, you're free to mix strings and callable components:

spacy.set_factory('my_factory', my_factory)
nlp = Language(pipeline=['my_factory', my_other_component])

If spaCy comes across a string in the pipeline, it will try to resolve it by looking it up in the available factories. The factory will then be initialised with the Vocab. Providing factory names instead of callables also makes it easy to specify them in the model's meta.json. If you're training your own model and want to use one of spaCy's default components, you won't have to worry about finding and implementing it either – to use the default tagger, simply add "tagger" to the pipeline, and spaCy will know what to do.
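One way to picture this lookup is the plain-Python sketch below. resolve_pipeline and toy_factories are hypothetical names, and the function only approximates the behaviour described here; it is not spaCy's actual resolution code.

```python
def resolve_pipeline(pipeline, factories, vocab):
    """Turn a mix of factory IDs and callables into a list of callables."""
    components = []
    for entry in pipeline:
        if isinstance(entry, str):
            # Strings are treated as factory IDs and initialised with the vocab.
            if entry not in factories:
                raise KeyError("no factory registered for %r" % entry)
            components.append(factories[entry](vocab))
        elif callable(entry):
            # Callables are assumed to be ready-made components.
            components.append(entry)
        else:
            raise TypeError("expected factory ID or callable, got %r" % (entry,))
    return components


def my_factory(vocab):
    def component(doc):
        return doc + ["from factory"]
    return component


def my_other_component(doc):
    return doc + ["plain callable"]


toy_factories = {"my_factory": my_factory}
components = resolve_pipeline(["my_factory", my_other_component],
                              toy_factories, vocab={})

# Run the resolved components on a list standing in for a Doc.
doc = []
for component in components:
    doc = component(doc)
```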

Example: Custom sentence segmentation logic

Let's say you want to implement custom logic to improve spaCy's sentence boundary detection. Currently, sentence segmentation is based on the dependency parse, which doesn't always produce ideal results. The custom logic should therefore be applied after tokenization, but before the dependency parsing – this way, the parser can also take advantage of the sentence boundaries.

def sbd_component(doc):
    for i, token in enumerate(doc[:-1]):  # every token that has a successor
        # define sentence start if period + titlecase token
        if token.text == '.' and doc[i+1].is_title:
            doc[i+1].sent_start = True
    return doc

In this case, we simply want to add the component to the existing pipeline of the English model. We can do this by inserting it at index 0 of nlp.pipeline:

nlp = spacy.load('en')
nlp.pipeline.insert(0, sbd_component)

When you call nlp on some text, spaCy will tokenize it to create a Doc object, and first call sbd_component on it, followed by the model's default pipeline.
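The rule itself can be tried out without a model. The sketch below applies the same "period followed by a titlecase token" check to a plain list of strings. find_sentence_starts is a hypothetical helper, and treating the first token as a sentence start is an assumption added for this example.

```python
def find_sentence_starts(words):
    """Return the indices of tokens that start a sentence under the rule."""
    # Assumption: the first token always opens a sentence.
    starts = [0] if words else []
    for i, word in enumerate(words[:-1]):
        # Same rule as the component: period + titlecase token.
        if word == "." and words[i + 1].istitle():
            starts.append(i + 1)
    return starts


words = ["Hello", "world", ".", "This", "is", "e.g.", "a", "test", ".", "ok", "."]
starts = find_sentence_starts(words)
```

Here only the period before "This" triggers a new sentence. The period before the lowercase "ok" does not, which shows how narrow this rule is.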

Example: Sentiment model

Let's say you have trained your own document sentiment model on English text. After tokenization, you want spaCy to first execute the default tensorizer, followed by a custom sentiment component that adds a .sentiment property to the Doc, containing your model's sentiment prediction.

Your component class will have a from_disk() method that spaCy calls to load the model data. When called, the component will compute the sentiment score, add it to the Doc and return the modified document. Optionally, the component can include an update() method to allow training the model.

import pickle
from pathlib import Path

class SentimentComponent(object):
    def __init__(self, vocab):
        self.weights = None

    def __call__(self, doc):
        doc.sentiment = sum(self.weights*doc.vector) # set sentiment property
        return doc

    def from_disk(self, path): # path = model path + factory ID ('sentiment')
        # pickle.load() expects an open binary file, not a path
        with (Path(path) / 'weights.bin').open('rb') as file_:
            self.weights = pickle.load(file_) # load weights
        return self

    def update(self, doc, gold): # update weights – allows training!
        prediction = sum(self.weights*doc.vector)
        self.weights -= 0.001*doc.vector*(prediction-gold.sentiment)
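The update() method above is a single gradient step on the squared error of a linear model. The following sketch replays that arithmetic with plain Python lists instead of arrays. predict, update and the toy numbers are assumptions made for illustration only.

```python
def predict(weights, vector):
    # Dot product of weights and document vector, as in __call__ above.
    return sum(w * v for w, v in zip(weights, vector))


def update(weights, vector, gold, learn_rate=0.001):
    # Move each weight against the error, scaled by its input value.
    error = predict(weights, vector) - gold
    return [w - learn_rate * v * error for w, v in zip(weights, vector)]


weights = [0.0, 0.0, 0.0]
vector = [1.0, 2.0, 3.0]  # toy stand-in for doc.vector
gold = 1.0                # toy stand-in for gold.sentiment

error_before = abs(predict(weights, vector) - gold)
for _ in range(100):
    weights = update(weights, vector, gold)
error_after = abs(predict(weights, vector) - gold)
```

Each step shrinks the error on this example by a factor of 1 - 0.001 * (1 + 4 + 9) = 0.986, so repeated updates move the prediction towards the gold value.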

The factory will initialise the component with the Vocab object. To be able to add it to your model's pipeline as 'sentiment', it also needs to be registered via set_factory().

def sentiment_factory(vocab):
    component = SentimentComponent(vocab) # initialise component
    return component

spacy.set_factory('sentiment', sentiment_factory)

The above code should be shipped with your model. You can use the package command to create all required files and directories. The model package will include an __init__.py with a load() method that will initialise the language class with the model's pipeline and call the from_disk() method to load the model data.

In the model package's meta.json, specify the language class and pipeline IDs:

meta.json (excerpt)

{
    "name": "sentiment_model",
    "lang": "en",
    "version": "1.0.0",
    "spacy_version": ">=2.0.0,<3.0.0",
    "pipeline": ["tensorizer", "sentiment"]
}

When you load your new model, spaCy will call the model's load() method. This will return a Language object with a pipeline containing the default tensorizer, and the sentiment component returned by your custom "sentiment" factory.

nlp = spacy.load('en_sentiment_model')
doc = nlp(u'I love pizza')
assert doc.sentiment

Disabling pipeline components

If you don't need a particular component of the pipeline, for example the tagger or the parser, you can disable loading it. This can sometimes make a big difference and improve loading speed. Disabled component names can be provided to spacy.load(), Language.from_disk() or the nlp object itself as a list:

import spacy
from spacy.lang.en import English
nlp = spacy.load('en', disable=['parser', 'tagger'])
nlp = English().from_disk('/model', disable=['tensorizer', 'ner'])
doc = nlp(u"I don't want parsed", disable=['parser'])

Note that you can't disable components by writing their names to nlp.pipeline, as this list holds the actual components, not the IDs. However, if you know the order of the components, you can still slice the list:

nlp = spacy.load('en')
nlp.pipeline = nlp.pipeline[:2] # only use the first two components
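The difference between disabling by name and slicing by position can be sketched in plain Python. make_component and run are hypothetical helpers, and giving each component a name attribute is an assumption made for this example.

```python
def make_component(name):
    # Toy component that appends its own name to the "doc" (a list here).
    def component(doc):
        doc.append(name)
        return doc
    component.name = name  # assumption: components expose their ID
    return component


pipeline = [make_component(name)
            for name in ["tensorizer", "tagger", "parser", "ner"]]


def run(doc, pipeline, disable=()):
    for component in pipeline:
        if component.name in disable:
            continue  # skip components disabled by name
        doc = component(doc)
    return doc


by_name = run([], pipeline, disable=["parser"])  # skips one named component
by_slice = run([], pipeline[:2])                 # keeps the first two only
```

Disabling by name works wherever the component sits in the list, whereas slicing depends on knowing the exact order.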