
Language Processing Pipelines

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tensorizer, a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.
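For example, after the pipeline has run, the returned Doc carries the annotations set by each component (a quick sketch, assuming the default English model is installed):

import spacy

nlp = spacy.load('en')                                # load the model and its pipeline
doc = nlp(u'Apple is looking at buying a U.K. startup')
for token in doc:
    print(token.text, token.tag_, token.dep_)         # set by the tagger and the parser
print([(ent.text, ent.label_) for ent in doc.ents])   # set by the entity recognizer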

[Pipeline diagram: Text → nlp (tokenizer → tensorizer → tagger → parser → ner) → Doc]
Name | Component | Creates | Description
tokenizer | Tokenizer | Doc | Segment text into tokens.
tensorizer | Tensorizer | Doc.tensor | Create feature representation tensor for Doc.
tagger | Tagger | Doc[i].tag | Assign part-of-speech tags.
parser | DependencyParser | Doc[i].head, Doc[i].dep, Doc.sents, Doc.noun_chunks | Assign dependency labels.
ner | EntityRecognizer | Doc.ents, Doc[i].ent_iob, Doc[i].ent_type | Detect and label named entities.
textcat | TextCategorizer | Doc.cats | Assign document labels.

The processing pipeline always depends on the statistical model and its capabilities. For example, a pipeline can only include an entity recognizer component if the model includes data to make predictions of entity labels. This is why each model will specify the pipeline to use in its meta data, as a simple list containing the component names:

"pipeline": ["tensorizer", "tagger", "parser", "ner"]

Although you can mix and match pipeline components, their order and combination is usually important. Some components may require certain modifications on the Doc to process it. For example, the default pipeline first applies the tensorizer, which pre-processes the doc and encodes its internal meaning representations as an array of floats, also called a tensor. This includes the tokens and their context, which is required for the next component, the tagger, to make predictions of the part-of-speech tags. Because spaCy's models are neural network models, they only "speak" tensors and expect the input Doc to have a tensor.
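For instance, once the default pipeline has run, the document's tensor is available as doc.tensor (a quick check, assuming a default English model):

import spacy

nlp = spacy.load('en')
doc = nlp(u'This is a sentence')
print(doc.tensor.shape)   # one row of floats per token, created by the tensorizer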

How pipelines work

spaCy makes it very easy to create your own pipelines consisting of reusable components – this includes spaCy's default tensorizer, tagger, parser and entity recognizer, but also your own custom processing functions. A pipeline component can be added to an already existing nlp object, specified when initialising a Language class, or defined within a model package.

When you load a model, spaCy first consults the model's meta.json. The meta typically includes the model details, the ID of a language class, and an optional list of pipeline components. spaCy then does the following:

  1. Load the language class and data for the given ID via get_lang_class and initialise it. The Language class contains the shared vocabulary, tokenization rules and the language-specific annotation scheme.
  2. Iterate over the pipeline names and create each component using create_pipe, which looks them up in Language.factories.
  3. Add each pipeline component to the pipeline in order, using add_pipe.
  4. Make the model data available to the Language class by calling from_disk with the path to the model data directory.

So when you call this...

nlp = spacy.load('en')

... the model tells spaCy to use the language "en" and the pipeline ["tensorizer", "tagger", "parser", "ner"]. spaCy will then initialise spacy.lang.en.English, and create each pipeline component and add it to the processing pipeline. It'll then load in the model's data from its data directory and return the modified Language class for you to use as the nlp object.

Fundamentally, a spaCy model consists of three components: the weights, i.e. binary data loaded in from a directory, a pipeline of functions called in order, and language data like the tokenization rules and annotation scheme. All of this is specific to each model, and defined in the model's meta.json – for example, a Spanish NER model requires different weights, language data and pipeline components than an English parsing and tagging model. This is also why the pipeline state is always held by the Language class. spacy.load puts this all together and returns an instance of Language with a pipeline set and access to the binary data:

spacy.load under the hood

import spacy

lang = 'en'
pipeline = ['tensorizer', 'tagger', 'parser', 'ner']
model_data_path = 'path/to/en_core_web_sm/en_core_web_sm-2.0.0'

cls = spacy.util.get_lang_class(lang)   # 1. get the Language class, e.g. English
nlp = cls()                             # 2. initialise it
for name in pipeline:
    component = nlp.create_pipe(name)   # 3. create the pipeline components
    nlp.add_pipe(component)             # 4. add the component to the pipeline
nlp.from_disk(model_data_path)          # 5. load in the binary data

When you call nlp on a text, spaCy will tokenize it and then call each component on the Doc, in order. Since the model data is loaded, the components can access it to assign annotations to the Doc object, and subsequently to the Token and Span objects, which are only views of the Doc and don't own any data themselves. All components return the modified document, which is then processed by the next component in the pipeline.

The pipeline under the hood

doc = nlp.make_doc(u'This is a sentence')   # create a Doc from raw text
for name, proc in nlp.pipeline:             # iterate over components in order
    doc = proc(doc)                         # apply each component

The current processing pipeline is available as nlp.pipeline, which returns a list of (name, component) tuples, or nlp.pipe_names, which only returns a list of human-readable component names.

nlp.pipeline
# [('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>)]
nlp.pipe_names
# ['tagger', 'parser', 'ner']

Disabling and modifying pipeline components

If you don't need a particular component of the pipeline – for example, the tagger or the parser – you can disable loading it. This can sometimes make a big difference and improve loading speed. Disabled component names can be provided to spacy.load(), Language.from_disk() or the nlp object itself as a list:

nlp = spacy.load('en', disable=['parser', 'tagger'])
nlp = English().from_disk('/model', disable=['tensorizer', 'ner'])
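The call to the nlp object accepts the same disable keyword, so you can also skip components for a single text while keeping them loaded (a sketch):

import spacy

nlp = spacy.load('en')
# skip the parser for this one text only; the component itself stays loaded
doc = nlp(u"I won't be parsed", disable=['parser'])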

You can also use the remove_pipe method to remove pipeline components from an existing pipeline, the rename_pipe method to rename them, or the replace_pipe method to replace them with a custom component entirely (more details on this in the section on custom components).

nlp.remove_pipe('parser')
nlp.rename_pipe('ner', 'entityrecognizer')
nlp.replace_pipe('tagger', my_custom_tagger)

Creating custom pipeline components

A component receives a Doc object and can modify it – for example, by using the current weights to make a prediction and set some annotation on the document. By adding a component to the pipeline, you'll get access to the Doc at any point during processing – instead of only being able to modify it afterwards.

Argument | Type | Description
doc | Doc | The Doc object processed by the previous component.
returns | Doc | The Doc object processed by this pipeline component.

Custom components can be added to the pipeline using the add_pipe method. Optionally, you can specify a component to add the new one before or after, tell spaCy to add it first or last in the pipeline, or define a custom name. If no name is set and no name attribute is present on your component, the function name is used.

Adding pipeline components

def my_component(doc):
    print("After tokenization, this doc has %s tokens." % len(doc))
    if len(doc) < 10:
        print("This is a pretty short document.")
    return doc

nlp = spacy.load('en')
nlp.add_pipe(my_component, name='print_info', first=True)
print(nlp.pipe_names)  # ['print_info', 'tagger', 'parser', 'ner']
doc = nlp(u"This is a sentence.")
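The other positioning arguments work the same way. A quick sketch, assuming the my_component function above and distinct names so the same function can be added more than once:

nlp.add_pipe(my_component, name='after_tagger', after='tagger')   # insert right after the tagger
nlp.add_pipe(my_component, name='before_ner', before='ner')       # insert right before the entity recognizer
nlp.add_pipe(my_component, name='very_last', last=True)           # add as the very last component
print(nlp.pipe_names)                                             # shows the new ordering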

Of course, you can also wrap your component as a class to allow initialising it with custom settings and hold state within the component. This is useful for stateful components, especially ones which depend on shared data.

class MyComponent(object):
    name = 'print_info'

    def __init__(self, vocab, short_limit=10):
        self.vocab = vocab
        self.short_limit = short_limit

    def __call__(self, doc):
        if len(doc) < self.short_limit:
            print("This is a pretty short document.")
        return doc

my_component = MyComponent(nlp.vocab, short_limit=25)
nlp.add_pipe(my_component, first=True)

Extension attributes on Doc, Span and Token

As of v2.0, spaCy allows you to set any custom attributes and methods on the Doc, Span and Token, which become available as Doc._, Span._ and Token._ – for example, Token._.my_attr. This lets you store additional information relevant to your application, add new features and functionality to spaCy, and implement your own models trained with other machine learning libraries. It also lets you take advantage of spaCy's data structures and the Doc object as the "single source of truth".

There are three main types of extensions, which can be defined using the Doc.set_extension, Span.set_extension and Token.set_extension methods.

  1. Attribute extensions. Set a default value for an attribute, which can be overwritten manually at any time. Attribute extensions work like "normal" variables and are the quickest way to store arbitrary information on a Doc, Span or Token.
    Doc.set_extension('hello', default=True)
    assert doc._.hello
    doc._.hello = False
    
  2. Property extensions. Define a getter and an optional setter function. If no setter is provided, the extension is immutable. Since the getter and setter functions are only called when you retrieve the attribute, you can also access values of previously added attribute extensions. For example, a Doc getter can average over Token attributes. For Span extensions, you'll almost always want to use a property – otherwise, you'd have to write to every possible Span in the Doc to set up the values correctly.
    Doc.set_extension('hello', getter=get_hello_value, setter=set_hello_value)
    assert doc._.hello
    doc._.hello = 'Hi!'
    
  3. Method extensions. Assign a function that becomes available as an object method. Method extensions are always immutable. For more details and implementation ideas, see these examples.
    Doc.set_extension('hello', method=lambda doc, name: 'Hi {}!'.format(name))
    assert doc._.hello('Bob') == 'Hi Bob!'
    

Before you can access a custom extension, you need to register it using the set_extension method on the object you want to add it to, e.g. the Doc. Keep in mind that extensions are always added globally and not just on a particular instance. If an attribute of the same name already exists, or if you're trying to access an attribute that hasn't been registered, spaCy will raise an AttributeError.

Example

from spacy.tokens import Doc, Span, Token

fruits = ['apple', 'pear', 'banana', 'orange', 'strawberry']
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])

Token.set_extension('is_fruit', getter=is_fruit_getter)
Doc.set_extension('has_fruit', getter=has_fruit_getter)
Span.set_extension('has_fruit', getter=has_fruit_getter)

Once you've registered your custom attribute, you can also use the built-in set, get and has methods to modify and retrieve the attributes. This is especially useful if you want to pass in a string instead of calling doc._.my_attr.

Method | Description | Valid for | Example
._.set() | Set a value for an attribute. | Attributes, mutable properties. | token._.set('my_attr', True)
._.get() | Get the value of an attribute. | Attributes, mutable properties, immutable properties, methods. | my_attr = span._.get('my_attr')
._.has() | Check if an attribute exists. | Attributes, mutable properties, immutable properties, methods. | doc._.has('my_attr')

Example: Custom sentence segmentation logic

Let's say you want to implement custom logic to improve spaCy's sentence boundary detection. Currently, sentence segmentation is based on the dependency parse, which doesn't always produce ideal results. The custom logic should therefore be applied after tokenization, but before the dependency parsing – this way, the parser can also take advantage of the sentence boundaries.

def sbd_component(doc):
    for i, token in enumerate(doc[:-2]):
        # define sentence start if period + titlecase token
        if token.text == '.' and doc[i+1].is_title:
            doc[i+1].sent_start = True
    return doc

nlp = spacy.load('en')
nlp.add_pipe(sbd_component, before='parser')  # insert before the parser

Example: Pipeline component for entity matching and tagging with custom attributes

This example shows how to create a spaCy extension that takes a terminology list (in this case, single- and multi-word company names), matches the occurrences in a document, labels them as ORG entities, merges the tokens and sets custom is_tech_org and has_tech_org attributes. For efficient matching, the example uses the PhraseMatcher, which accepts Doc objects as match patterns and works well for large terminology lists. It also ensures your patterns will always match, even when you customise spaCy's tokenization rules. When you call nlp on a text, the custom pipeline component is applied to the Doc.

Wrapping this functionality in a pipeline component allows you to reuse the module with different settings, and have all pre-processing taken care of when you call nlp on your text and receive a Doc object.
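The core of such a component could look roughly like the sketch below (the TechCompanyRecognizer name and the terminology list are illustrative, and the blank English pipeline keeps the example simple):

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token

class TechCompanyRecognizer(object):
    """Label company names as ORG entities and set custom attributes."""
    name = 'tech_companies'  # component name, shows up in the pipeline

    def __init__(self, nlp, companies=tuple(), label='ORG'):
        self.label = nlp.vocab.strings[label]        # get the hash of the entity label
        patterns = [nlp(org) for org in companies]   # Doc patterns for the PhraseMatcher
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('TECH_ORGS', None, *patterns)
        # register the custom attributes globally on the Token, Doc and Span
        Token.set_extension('is_tech_org', default=False)
        Doc.set_extension('has_tech_org', getter=self.has_tech_org)
        Span.set_extension('has_tech_org', getter=self.has_tech_org)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []                                   # keep the spans so we can merge them later
        for _, start, end in matches:
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            for token in entity:
                token._.set('is_tech_org', True)     # mark each matched token
        doc.ents = list(doc.ents) + spans            # add the new entities to doc.ents
        for span in spans:
            span.merge()                             # merge each entity into one token
        return doc                                   # components must return the Doc

    def has_tech_org(self, tokens):
        return any(t._.get('is_tech_org') for t in tokens)

nlp = English()
component = TechCompanyRecognizer(nlp, ['Alphabet Inc.', 'Google'])
nlp.add_pipe(component, last=True)

doc = nlp(u"Alphabet Inc. is the company behind Google")
print([(ent.text, ent.label_) for ent in doc.ents])  # the matched company names as ORG entities
print(doc._.has_tech_org)                            # True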

Example: Pipeline component for GPE entities and country meta data via a REST API

This example shows the implementation of a pipeline component that fetches country meta data via the REST Countries API, sets entity annotations for countries, merges entities into one token and sets custom attributes on the Doc, Span and Token – for example, the capital, latitude/longitude coordinates and even the country flag.

spacy/examples/pipeline/custom_component_countries_api.py

In this case, all data can be fetched on initialisation in one request. However, if you're working with text that contains incomplete country names, spelling mistakes or foreign-language versions, you could also implement a like_country-style getter function that makes a request to the search API endpoint and returns the best-matching result.
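A like_country-style getter along those lines might look like the sketch below (the endpoint URL and the requests dependency are assumptions, and a real implementation should add caching and error handling):

import requests
from spacy.tokens import Token

def like_country(token):
    # query the search endpoint and treat any successful match as "looks like a country"
    response = requests.get('https://restcountries.eu/rest/v2/name/%s' % token.text)
    return response.status_code == 200

Token.set_extension('like_country', getter=like_country)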

Other usage ideas

  • Adding new features and hooking in models. For example, a sentiment analysis model, or your preferred lemmatization solution. spaCy's built-in tagger, parser and entity recognizer respect annotations that were already set on the Doc in a previous step of the pipeline.
  • Integrating other libraries and APIs. For example, your pipeline component can write additional information and data directly to the Doc or Token as custom attributes, while making sure no information is lost in the process. This can be output generated by other libraries and models, or an external service with a REST API.
  • Debugging and logging. For example, a component that stores and/or exports relevant information about the current state of the processed document, and which you can insert at any point of your pipeline (see the sketch after this list).
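For instance, a minimal debugging component in that spirit could look like this (a sketch; the component name and the logged fields are just illustrative):

import spacy

def debug_component(doc):
    # log some basic information about the current state of the Doc
    print('Tokens: %d, parsed: %s, entities: %d' % (len(doc), doc.is_parsed, len(doc.ents)))
    return doc

nlp = spacy.load('en')
nlp.add_pipe(debug_component, name='debug', last=True)   # or first=True, before='ner', ...
doc = nlp(u'This is a sentence about Facebook.')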

User hooks

While it's generally recommended to use the Doc._, Span._ and Token._ proxies to add your own custom attributes, spaCy offers a few exceptions to allow customising built-in methods like Doc.similarity or Doc.vector with your own hooks, which can rely on statistical models you train yourself. For instance, you can provide your own on-the-fly sentence segmentation algorithm or document similarity method.

Hooks let you customize some of the behaviours of the Doc, Span or Token objects by adding a component to the pipeline. For instance, to customize the Doc.similarity method, you can add a component that sets a custom function to doc.user_hooks['similarity']. The built-in Doc.similarity method will check the user_hooks dict, and delegate to your function if you've set one. Similar results can be achieved by setting functions to Doc.user_span_hooks and Doc.user_token_hooks.

Name | Customises
user_hooks | Doc.vector, Doc.has_vector, Doc.vector_norm, Doc.sents
user_token_hooks | Token.similarity, Token.vector, Token.has_vector, Token.vector_norm, Token.conjuncts
user_span_hooks | Span.similarity, Span.vector, Span.has_vector, Span.vector_norm, Span.root

Add custom similarity hooks

class SimilarityModel(object):
    def __init__(self, model):
        self._model = model

    def __call__(self, doc):
        doc.user_hooks['similarity'] = self.similarity
        doc.user_span_hooks['similarity'] = self.similarity
        doc.user_token_hooks['similarity'] = self.similarity
        return doc   # components must return the processed Doc

    def similarity(self, obj1, obj2):
        y = self._model([obj1.vector, obj2.vector])
        return float(y[0])

Developing spaCy extensions

We're very excited about all the new possibilities for community extensions and plugins in spaCy v2.0, and we can't wait to see what you build with it! To get you started, here are a few tips, tricks and best practices. For examples of other spaCy extensions, see the resources.

  • Make sure to choose a descriptive and specific name for your pipeline component class, and set it as its name attribute. Avoid names that are too common or likely to clash with built-in or a user's other custom components. While it's fine to call your package "spacy_my_extension", avoid component names including "spacy", since this can easily lead to confusion.
    name = 'myapp_lemmatizer'   # good: specific and unlikely to clash
    name = 'lemmatizer'         # avoid: too generic and likely to clash
  • When writing to Doc, Token or Span objects, use getter functions wherever possible, and avoid setting values explicitly. Tokens and spans don't own any data themselves, so you should provide a function that allows them to compute the values instead of writing static properties to individual objects.
    # good: provide a getter, so values are computed dynamically
    is_fruit = lambda token: token.text in ('apple', 'orange')
    Token.set_extension('is_fruit', getter=is_fruit)

    # avoid: writing static values to each individual token
    Token.set_extension('is_fruit', default=False)
    if token.text in ('apple', 'orange'):
        token._.set('is_fruit', True)
  • Always add your custom attributes to the global Doc, Token or Span objects, not a particular instance of them. Add the attributes as early as possible, e.g. in your extension's __init__ method or in the global scope of your module. This means that in the case of namespace collisions, the user will see an error immediately, not just when they run their pipeline.
    # good: register the attribute on the global Doc, e.g. in __init__
    from spacy.tokens import Doc

    def __init__(self, attr='my_attr'):
        Doc.set_extension(attr, getter=self.get_doc_attr)

    # avoid: registering the attribute on each individual doc at call time
    def __call__(self, doc):
        doc.set_extension('my_attr', getter=self.get_doc_attr)
  • If your extension is setting properties on the Doc, Token or Span, include an option to let the user change those attribute names. This makes it easier to avoid namespace collisions and accommodate users with different naming preferences. We recommend adding an attrs argument to the __init__ method of your class so you can write the names to class attributes and reuse them across your component.
    Doc.set_extension(self.doc_attr, default='some value')    # good: configurable attribute name
    Doc.set_extension('my_doc_attr', default='some value')    # avoid: hard-coded attribute name
  • Ideally, extensions should be standalone packages with spaCy and optionally, other packages specified as a dependency. They can freely assign to their own ._ namespace, but should stick to that. If your extension's only job is to provide a better .similarity implementation, and your docs state this explicitly, there's no problem with writing to the user_hooks, and overwriting spaCy's built-in method. However, a third-party extension should never silently overwrite built-ins, or attributes set by other extensions.
  • If you're looking to publish a model that depends on a custom pipeline component, you can either require it in the model package's dependencies, or – if the component is specific and lightweight – choose to ship it with your model package and add it to the Language instance returned by the model's load() method. For examples of this, check out the implementations of spaCy's load_model_from_init_py() and load_model_from_path() utility functions.
    nlp.add_pipe(my_custom_component)
    return nlp.from_disk(model_path)
  • Once you're ready to share your extension with others, make sure to add docs and installation instructions (you can always link to this page for more info). Make it easy for others to install and use your extension, for example by uploading it to PyPI. If you're sharing your code on GitHub, don't forget to tag it with spacy and spacy-extensions to help people find it. If you post it on Twitter, feel free to tag @spacy_io so we can check it out.

Multi-threading

If you have a sequence of documents to process, you should use the Language.pipe() method. The method takes an iterator of texts, and accumulates an internal buffer, which it works on in parallel. It then yields the documents in order, one-by-one. After a long and bitter struggle, the global interpreter lock was freed around spaCy's main parsing loop in v0.100.3. This means that .pipe() will be significantly faster in most practical situations, because it allows shared memory parallelism.

for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
    pass

To make full use of the .pipe() function, you might want to brush up on Python generators. Here are a few quick hints:

  • Generator comprehensions can be written as (item for item in sequence).
  • The itertools built-in library and the cytoolz package provide a lot of handy generator tools.
  • Often you'll have an input stream that pairs text with some important meta data, e.g. a JSON document. To pair up the meta data with the processed Doc object, you should use the itertools.tee function to split the generator in two, and then zip the extra stream back together with the document stream (itertools.izip on Python 2). Here's an example:
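A sketch of that pattern (using Python 3's built-in zip; the example texts and meta data are made up):

import itertools
import spacy

nlp = spacy.load('en')
data = [(u'First text', {'id': 1}), (u'Second text', {'id': 2})]

stream1, stream2 = itertools.tee(data)        # split the stream in two
texts = (text for text, meta in stream1)      # one stream of texts for nlp.pipe()
metadata = (meta for text, meta in stream2)   # one stream of the matching meta data
for doc, meta in zip(nlp.pipe(texts), metadata):
    print(len(doc), meta['id'])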

Serialization

If you've been modifying the pipeline, vocabulary, vectors and entities, or made updates to the model, you'll eventually want to save your progress – for example, everything that's in your nlp object. This means you'll have to translate its contents and structure into a format that can be saved, like a file or a byte string. This process is called serialization. spaCy comes with built-in serialization methods and supports the Pickle protocol.

All container classes, i.e. Language, Doc, Vocab and StringStore, have the following methods available:

Method | Returns | Example
to_bytes | bytes | nlp.to_bytes()
from_bytes | object | nlp.from_bytes(bytes)
to_disk | - | nlp.to_disk('/path')
from_disk | object | nlp.from_disk('/path')

For example, if you've processed a very large document, you can use Doc.to_disk to save it to a file on your local machine. This will save the document and its tokens, as well as the vocabulary associated with the Doc.

moby_dick = open('moby_dick.txt', 'r').read() # read in a large document
doc = nlp(moby_dick) # process it
doc.to_disk('/moby_dick.bin') # save the processed Doc

If you need it again later, you can load it back into an empty Doc with an empty Vocab by calling from_disk():

from spacy.tokens import Doc # to create empty Doc
from spacy.vocab import Vocab # to create empty Vocab

doc = Doc(Vocab()).from_disk('/moby_dick.bin') # load processed Doc
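The byte-string equivalents work the same way, which is useful if you want to keep the serialised Doc in memory or store it somewhere other than the file system (a quick sketch, assuming the default English model):

import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load('en')
doc = nlp(u'This is a sentence')
doc_bytes = doc.to_bytes()                     # serialise the processed Doc to a byte string
new_doc = Doc(Vocab()).from_bytes(doc_bytes)   # restore it into an empty Doc
assert new_doc.text == doc.text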

Example: Saving and loading a document

For simplicity, let's assume you've added custom entities to a Doc, either manually, or by using a match pattern. You can save it locally by calling Doc.to_disk(), and load it again via Doc.from_disk(). This will overwrite the existing object and return it.

import spacy
from spacy.tokens import Span

text = u'Netflix is hiring a new VP of global policy'

nlp = spacy.load('en')
doc = nlp(text)
assert len(doc.ents) == 0 # Doc has no entities
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings[u'ORG'])] # add entity
doc.to_disk('/path/to/doc') # save Doc to disk

new_doc = nlp(text)
assert len(new_doc.ents) == 0 # new Doc has no entities
new_doc = new_doc.from_disk('/path/to/doc') # load from disk and overwrite
assert len(new_doc.ents) == 1 # entity is now recognised!
assert [(ent.text, ent.label_) for ent in new_doc.ents] == [(u'Netflix', u'ORG')]