spaCy 101 – Everything you need to know

Whether you're new to spaCy, or just want to brush up on some NLP basics and implementation details – this page should have you covered. Each section will explain one of spaCy's features in simple terms and with examples or illustrations. Some sections will also reappear across the usage guides as a quick introduction.

What's spaCy?

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.


Features

In the documentation, you'll come across mentions of spaCy's features and capabilities. Some of them refer to linguistic concepts, while others are related to more general machine learning functionality.

| Name | Description |
|------|-------------|
| Tokenization | Segmenting text into words, punctuation marks etc. |
| Part-of-speech (POS) Tagging | Assigning word types to tokens, like verb or noun. |
| Dependency Parsing | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| Lemmatization | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
| Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences. |
| Named Entity Recognition (NER) | Labelling named "real-world" objects, like persons, companies or locations. |
| Similarity | Comparing words, text spans and documents and how similar they are to each other. |
| Text classification | Assigning categories or labels to a whole document, or parts of a document. |
| Rule-based Matching | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| Training | Updating and improving a statistical model's predictions. |
| Serialization | Saving objects to files or byte strings. |

Linguistic annotations

spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you're analysing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether "google" is used as a verb, or refers to the website or company in a specific context.

Once you've downloaded and installed a model, you can load it via spacy.load(). This will return a Language object containing all components and data needed to process text. We usually call it nlp. Calling the nlp object on a string of text will return a processed Doc:

import spacy

nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

Even though a Doc is processed – e.g. split into individual words and annotated – it still holds all information of the original text, like whitespace characters. You can always get the offset of a token into the original string, or reconstruct the original by joining the tokens and their trailing whitespace. This way, you'll never lose any information when processing text with spaCy.
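Here's a small sketch of what that means in practice, using token.idx for the character offset and token.text_with_ws to rebuild the original text:

for token in doc:
    print(token.idx, token.text_with_ws)  # character offset and token text incl. trailing whitespace

assert ''.join(token.text_with_ws for token in doc) == doc.text  # nothing was lost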

Tokenization

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. Each Doc consists of individual tokens, and we can simply iterate over them:

for token in doc:
    print(token.text)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion |

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  1. Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
  2. Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
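If the built-in rules don't cover a case you need, you can also add your own exception – a minimal sketch using nlp.tokenizer.add_special_case, assuming the v2 API (the word "gimme" and its split are just made up for illustration):

from spacy.symbols import ORTH, LEMMA

# split the (hypothetical) token "gimme" into "gim" and "me"
nlp.tokenizer.add_special_case(u'gimme', [{ORTH: u'gim', LEMMA: u'give'}, {ORTH: u'me'}])
assert [t.text for t in nlp(u'gimme that')] == [u'gim', u'me', u'that']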

[Illustration: step-by-step tokenization of "Let's go to N.Y.!" – prefix and suffix punctuation is split off, "Let's" is split into "Let" and "'s" by an exception rule, and "N.Y." is kept as a single token.]

While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass like English or German, that loads in lists of hard-coded data and exception rules.

Part-of-speech tags and dependencies

After tokenization, spaCy can also parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name:

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
| Text | Lemma | POS | Tag | Dep | Shape | alpha | stop |
|------|-------|-----|-----|-----|-------|-------|------|
| Apple | apple | PROPN | NNP | nsubj | Xxxxx | True | False |
| is | be | VERB | VBZ | aux | xx | True | True |
| looking | look | VERB | VBG | ROOT | xxxx | True | False |
| at | at | ADP | IN | prep | xx | True | True |
| buying | buy | VERB | VBG | pcomp | xxxx | True | False |
| U.K. | u.k. | PROPN | NNP | compound | X.X. | False | False |
| startup | startup | NOUN | NN | dobj | xxxx | True | False |
| for | for | ADP | IN | prep | xxx | True | True |
| $ | $ | SYM | $ | quantmod | $ | False | False |
| 1 | 1 | NUM | CD | compound | d | False | False |
| billion | billion | NUM | CD | pobj | xxxx | True | False |

Using spaCy's built-in displaCy visualizer, you can see what our example sentence and its dependencies look like.
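Here's a minimal sketch of how to produce that visualization yourself – displacy.serve starts a simple web server (on port 5000 by default) and style='dep' selects the dependency view:

from spacy import displacy

displacy.serve(doc, style='dep')  # open the served page to view the parse tree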

Named Entities

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
| Text | Start | End | Label | Description |
|------|-------|-----|-------|-------------|
| Apple | 0 | 5 | ORG | Companies, agencies, institutions. |
| U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states. |
| $1 billion | 44 | 54 | MONEY | Monetary values, including unit. |

Using spaCy's built-in displaCy visualizer, you can see what our example sentence and its named entities look like.
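Again, a minimal sketch of how to render it yourself, this time with the entity style:

from spacy import displacy

displacy.serve(doc, style='ent')  # highlights the entity spans and their labels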

Word vectors and similarity

spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest content to a user that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an existing one.

Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective – whether "dog" and "cat" are similar really depends on how you're looking at it. spaCy's similarity model usually assumes a pretty general-purpose definition of similarity.

tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.similarity(token2))
|        | dog  | cat  | banana |
|--------|------|------|--------|
| dog    | 1.00 | 0.80 | 0.24   |
| cat    | 0.80 | 1.00 | 0.28   |
| banana | 0.24 | 0.28 | 1.00   |

In this case, the model's predictions are pretty on point. A dog is very similar to a cat, whereas a banana is not very similar to either of them. Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).
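The same method works on larger units – a small sketch comparing two made-up documents:

doc1 = nlp(u'I like fast food')
doc2 = nlp(u'I like pizza')
print(doc1.similarity(doc2))  # higher values mean more similar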

Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec. Most of spaCy's default models come with 300-dimensional vectors that look like this:

banana.vector

array([2.02280000e-01, -7.66180009e-02, 3.70319992e-01, 3.28450017e-02, -4.19569999e-01, 7.20689967e-02, -3.74760002e-01, 5.74599989e-02, -1.24009997e-02, 5.29489994e-01, -5.23800015e-01, -1.97710007e-01, -3.41470003e-01, 5.33169985e-01, -2.53309999e-02, 1.73800007e-01, 1.67720005e-01, 8.39839995e-01, 5.51070012e-02, 1.05470002e-01, 3.78719985e-01, 2.42750004e-01, 1.47449998e-02, 5.59509993e-01, 1.25210002e-01, -6.75960004e-01, 3.58420014e-01, -4.00279984e-02, 9.59490016e-02, -5.06900012e-01, -8.53179991e-02, 1.79800004e-01, 3.38669986e-01, 1.32300004e-01, 3.10209990e-01, 2.18779996e-01, 1.68530002e-01, 1.98740005e-01, -5.73849976e-01, -1.06490001e-01, 2.66689986e-01, 1.28380001e-01, -1.28030002e-01, -1.32839993e-01, 1.26570001e-01, 8.67229998e-01, 9.67210010e-02, 4.83060002e-01, 2.12709993e-01, -5.49900010e-02, -8.24249983e-02, 2.24079996e-01, 2.39749998e-01, -6.22599982e-02, 6.21940017e-01, -5.98999977e-01, 4.32009995e-01, 2.81430006e-01, 3.38420011e-02, -4.88150001e-01, -2.13589996e-01, 2.74010003e-01, 2.40950003e-01, 4.59500015e-01, -1.86049998e-01, -1.04970002e+00, -9.73049998e-02, -1.89080000e-01, -7.09290028e-01, 4.01950002e-01, -1.87680006e-01, 5.16870022e-01, 1.25200003e-01, 8.41499984e-01, 1.20970003e-01, 8.82389992e-02, -2.91959997e-02, 1.21510006e-03, 5.68250008e-02, -2.74210006e-01, 2.55640000e-01, 6.97930008e-02, -2.22580001e-01, -3.60060006e-01, -2.24020004e-01, -5.36990017e-02, 1.20220006e+00, 5.45350015e-01, -5.79980016e-01, 1.09049998e-01, 4.21669990e-01, 2.06619993e-01, 1.29360005e-01, -4.14570011e-02, -6.67770028e-01, 4.04670000e-01, -1.52179999e-02, -2.76400000e-01, -1.56110004e-01, -7.91980028e-02, 4.00369987e-02, -1.29439995e-01, -2.40900001e-04, -2.67850012e-01, -3.81150007e-01, -9.72450018e-01, 3.17259997e-01, -4.39509988e-01, 4.19340014e-01, 1.83530003e-01, -1.52600005e-01, -1.08080000e-01, -1.03579998e+00, 7.62170032e-02, 1.65189996e-01, 2.65259994e-04, 1.66160002e-01, -1.52810007e-01, 1.81229994e-01, 7.02740014e-01, 5.79559989e-03, 5.16639985e-02, -5.97449988e-02, -2.75510013e-01, -3.90489995e-01, 6.11319989e-02, 5.54300010e-01, -8.79969969e-02, -4.16810006e-01, 3.28260005e-01, -5.25489986e-01, -4.42880005e-01, 8.21829960e-03, 2.44859993e-01, -2.29819998e-01, -3.49810004e-01, 2.68940002e-01, 3.91660005e-01, -4.19039994e-01, 1.61909997e-01, -2.62630010e+00, 6.41340017e-01, 3.97430003e-01, -1.28680006e-01, -3.19460005e-01, -2.56330013e-01, -1.22199997e-01, 3.22750002e-01, -7.99330026e-02, -1.53479993e-01, 3.15050006e-01, 3.05909991e-01, 2.60120004e-01, 1.85530007e-01, -2.40429997e-01, 4.28860001e-02, 4.06219989e-01, -2.42559999e-01, 6.38700008e-01, 6.99829996e-01, -1.40430003e-01, 2.52090007e-01, 4.89840001e-01, -6.10670000e-02, -3.67659986e-01, -5.50890028e-01, -3.82649988e-01, -2.08430007e-01, 2.28320003e-01, 5.12179971e-01, 2.78679997e-01, 4.76520002e-01, 4.79510017e-02, -3.40079993e-01, -3.28729987e-01, -4.19669986e-01, -7.54989982e-02, -3.89539987e-01, -2.96219997e-02, -3.40700001e-01, 2.21699998e-01, -6.28560036e-02, -5.19029975e-01, -3.77739996e-01, -4.34770016e-03, -5.83010018e-01, -8.75459984e-02, -2.39289999e-01, -2.47109994e-01, -2.58870006e-01, -2.98940003e-01, 1.37150005e-01, 2.98919994e-02, 3.65439989e-02, -4.96650010e-01, -1.81600004e-01, 5.29389977e-01, 2.19919994e-01, -4.45140004e-01, 3.77979994e-01, -5.70620000e-01, -4.69460003e-02, 8.18059966e-02, 1.92789994e-02, 3.32459986e-01, -1.46200001e-01, 1.71560004e-01, 3.99809986e-01, 3.62170011e-01, 1.28160000e-01, 3.16439986e-01, 3.75690013e-01, -7.46899992e-02, -4.84800003e-02, -3.14009994e-01, 
-1.92860007e-01, -3.12940001e-01, -1.75529998e-02, -1.75139993e-01, -2.75870003e-02, -1.00000000e+00, 1.83870003e-01, 8.14339995e-01, -1.89129993e-01, 5.09989977e-01, -9.19600017e-03, -1.92950002e-03, 2.81890005e-01, 2.72470005e-02, 4.34089988e-01, -5.49669981e-01, -9.74259973e-02, -2.45399997e-01, -1.72030002e-01, -8.86500031e-02, -3.02980006e-01, -1.35910004e-01, -2.77649999e-01, 3.12860007e-03, 2.05559999e-01, -1.57720000e-01, -5.23079991e-01, -6.47010028e-01, -3.70139986e-01, 6.93930015e-02, 1.14009999e-01, 2.75940001e-01, -1.38750002e-01, -2.72680014e-01, 6.68910027e-01, -5.64539991e-02, 2.40170002e-01, -2.67300010e-01, 2.98599988e-01, 1.00830004e-01, 5.55920005e-01, 3.28489989e-01, 7.68579990e-02, 1.55279994e-01, 2.56359994e-01, -1.07720003e-01, -1.23590000e-01, 1.18270002e-01, -9.90289971e-02, -3.43279988e-01, 1.15019999e-01, -3.78080010e-01, -3.90120000e-02, -3.45930010e-01, -1.94040000e-01, -3.35799992e-01, -6.23340011e-02, 2.89189994e-01, 2.80319989e-01, -5.37410021e-01, 6.27939999e-01, 5.69549985e-02, 6.21469975e-01, -2.52819985e-01, 4.16700006e-01, -1.01079997e-02, -2.54339993e-01, 4.00029987e-01, 4.24320012e-01, 2.26720005e-01, 1.75530002e-01, 2.30489999e-01, 2.83230007e-01, 1.38820007e-01, 3.12180002e-03, 1.70570001e-01, 3.66849989e-01, 2.52470002e-03, -6.40089989e-01, -2.97650009e-01, 7.89430022e-01, 3.31680000e-01, -1.19659996e+00, -4.71559986e-02, 5.31750023e-01], dtype=float32)

The .vector attribute will return an object's vector. Doc.vector and Span.vector will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalise vectors.

tokens = nlp(u'dog cat banana sasquatch')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
| Text | Has vector | Vector norm | OOV |
|------|------------|-------------|-----|
| dog | True | 7.033672992262838 | False |
| cat | True | 6.68081871208896 | False |
| banana | True | 6.700014292148571 | False |
| sasquatch | False | 0 | True |

The words "dog", "cat" and "banana" are all pretty common in English, so they're part of the model's vocabulary, and come with a vector. The word "sasquatch" on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of 0, which means it's practically nonexistent.

If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger models instead of the default, smaller ones, which usually come with a clipped vocabulary.

Pipelines

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tensorizer, a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

[Illustration: the processing pipeline – the text is passed to nlp, tokenized into a Doc and then processed by the tensorizer, tagger, parser and ner components in order.]

| Name | Component | Creates | Description |
|------|-----------|---------|-------------|
| tokenizer | Tokenizer | Doc | Segment text into tokens. |
| tensorizer | TokenVectorEncoder | Doc.tensor | Create feature representation tensor for Doc. |
| tagger | Tagger | Doc[i].tag | Assign part-of-speech tags. |
| parser | DependencyParser | Doc[i].head, Doc[i].dep, Doc.sents, Doc.noun_chunks | Assign dependency labels. |
| ner | EntityRecognizer | Doc.ents, Doc[i].ent_iob, Doc[i].ent_type | Detect and label named entities. |

The processing pipeline always depends on the statistical model and its capabilities. For example, a pipeline can only include an entity recognizer component if the model includes data to make predictions of entity labels. This is why each model will specify the pipeline to use in its meta data, as a simple list containing the component names:

"pipeline": ["tensorizer", "tagger", "parser", "ner"]

Although you can mix and match pipeline components, their order and combination is usually important. Some components may require certain modifications on the Doc to process it. For example, the default pipeline first applies the tensorizer, which pre-processes the doc and encodes its internal meaning representations as an array of floats, also called a tensor. This includes the tokens and their context, which is required for the next component, the tagger, to make predictions of the part-of-speech tags. Because spaCy's models are neural network models, they only "speak" tensors and expect the input Doc to have a tensor.
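You can inspect this on a loaded model – a small sketch, assuming the model's meta data includes a pipeline entry and that the tensorizer has run:

import spacy

nlp = spacy.load('en')
print(nlp.meta['pipeline'])  # e.g. ['tensorizer', 'tagger', 'parser', 'ner']

doc = nlp(u'This is a sentence.')
print(doc.tensor.shape)      # one row per token, created by the tensorizer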

Vocab, hashes and lexemes

Whenever possible, spaCy tries to store data in a vocabulary, the Vocab , that will be shared by multiple documents. To save memory, spaCy also encodes all strings to hash values – in this case for example, "coffee" has the hash 3197928453018144401. Entity labels like "ORG" and part-of-speech tags like "VERB" are also encoded. Internally, spaCy only "speaks" in hash values.

[Illustration: the Doc "I love coffee" with its Tokens, the shared Vocab containing a Lexeme for each word type, and the StringStore mapping hash values back to strings like "coffee", "I" and "love".]

If you process lots of documents containing the word "coffee" in all kinds of different contexts, storing the exact string "coffee" every time would take up way too much space. So instead, spaCy hashes the string and stores it in the StringStore . You can think of the StringStore as a lookup table that works in both directions – you can look up a string to get its hash, or a hash to get its string:

doc = nlp(u'I love coffee')
assert doc.vocab.strings[u'coffee'] == 3197928453018144401
assert doc.vocab.strings[3197928453018144401] == u'coffee'

Now that all strings are encoded, the entries in the vocabulary don't need to include the word text themselves. Instead, they can look it up in the StringStore via its hash value. Each entry in the vocabulary, also called Lexeme , contains the context-independent information about a word. For example, no matter if "love" is used as a verb or a noun in some context, its spelling and whether it consists of alphabetic characters won't ever change. Its hash value will also always be the same.

for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
          lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)
| text | orth | shape | prefix | suffix | is_alpha | is_digit |
|------|------|-------|--------|--------|----------|----------|
| I | 4690420944186131903 | X | I | I | True | False |
| love | 3702023516439754181 | xxxx | l | ove | True | False |
| coffee | 3197928453018144401 | xxxx | c | fee | True | False |

The mapping of words to hashes doesn't depend on any state. To make sure each value is unique, spaCy uses a hash function to calculate the hash based on the word string. This also means that the hash for "coffee" will always be the same, no matter which model you're using or how you've configured spaCy.
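If you want to check this yourself, the hash function is also exposed – a small sketch, assuming spacy.strings.hash_string is available as in v2:

from spacy.strings import hash_string

assert hash_string(u'coffee') == 3197928453018144401  # same hash, no model or state involved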

However, hashes cannot be reversed and there's no way to resolve 3197928453018144401 back to "coffee". All spaCy can do is look it up in the vocabulary. That's why you always need to make sure all objects you create have access to the same vocabulary. If they don't, spaCy might not be able to find the strings it needs.

from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = nlp(u'I love coffee') # original Doc
assert doc.vocab.strings[u'coffee'] == 3197928453018144401 # get hash
assert doc.vocab.strings[3197928453018144401] == u'coffee' # 👍

empty_doc = Doc(Vocab()) # new Doc with empty Vocab
# empty_doc.vocab.strings[3197928453018144401] will raise an error :(

empty_doc.vocab.strings.add(u'coffee') # add "coffee" and generate hash
assert empty_doc.vocab.strings[3197928453018144401] == u'coffee' # 👍

new_doc = Doc(doc.vocab) # create new doc with first doc's vocab
assert new_doc.vocab.strings[3197928453018144401] == u'coffee' # 👍

If the vocabulary doesn't contain a string for 3197928453018144401, spaCy will raise an error. You can re-add "coffee" manually, but this only works if you actually know that the document contains that word. To prevent this problem, spaCy will also export the Vocab when you save a Doc or nlp object. This will give you the object and its encoded annotations, plus the "key" to decode it.

Serialization

If you've been modifying the pipeline, vocabulary, vectors and entities, or made updates to the model, you'll eventually want to save your progress – for example, everything that's in your nlp object. This means you'll have to translate its contents and structure into a format that can be saved, like a file or a byte string. This process is called serialization. spaCy comes with built-in serialization methods and supports the Pickle protocol.

All container classes, i.e. Language, Doc, Vocab and StringStore, have the following methods available:

| Method | Returns | Example |
|--------|---------|---------|
| to_bytes | bytes | nlp.to_bytes() |
| from_bytes | object | nlp.from_bytes(bytes) |
| to_disk | - | nlp.to_disk('/path') |
| from_disk | object | nlp.from_disk('/path') |
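The bytes methods work the same way. Here's a minimal sketch that round-trips a Doc through a byte string, reusing the nlp object from above:

from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = nlp(u'Give it back! He pleaded.')
doc_bytes = doc.to_bytes()                    # the Doc, its annotations and strings as bytes
new_doc = Doc(Vocab()).from_bytes(doc_bytes)  # restore into a fresh Doc with an empty Vocab
assert new_doc.text == doc.text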

For example, if you've processed a very large document, you can use Doc.to_disk to save it to a file on your local machine. This will save the document and its tokens, as well as the vocabulary associated with the Doc.

moby_dick = open('moby_dick.txt', 'r').read() # open and read a large document
doc = nlp(moby_dick) # process it
doc.to_disk('/moby_dick.bin') # save the processed Doc

If you need it again later, you can load it back into an empty Doc with an empty Vocab by calling from_disk() :

from spacy.tokens import Doc # to create empty Doc
from spacy.vocab import Vocab # to create empty Vocab

doc = Doc(Vocab()).from_disk('/moby_dick.bin') # load processed Doc

Training

spaCy's models are statistical and every "decision" they make – for example, which part-of-speech tag to assign, or whether a word is a named entity – is a prediction. This prediction is based on the examples the model has seen during training. To train a model, you first need training data – examples of text, and the labels you want the model to predict. This could be a part-of-speech tag, a named entity or any other information.

The model is then shown the unlabelled text and will make a prediction. Because we know the correct answer, we can give the model feedback on its prediction in the form of an error gradient of the loss function, which calculates the difference between the model's prediction and the expected output. The greater the difference, the more significant the gradient and the updates to our model.

[Illustration: the training loop – the model predicts a label for the training text, the prediction is compared to the correct label, the gradient is used to update the model, and the updated model is saved.]
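In code, that loop might look roughly like this – a minimal sketch assuming the v2-style training API (nlp.begin_training and nlp.update); the TRAIN_DATA example, the number of passes and the dropout rate are made up for illustration:

import random
import spacy

# hypothetical training example: text plus the entities we want the model to predict
TRAIN_DATA = [
    (u'Amazon is hiring new engineers', {'entities': [(0, 6, 'ORG')]}),
]

nlp = spacy.load('en')
optimizer = nlp.begin_training()
for i in range(10):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        # predict, compare against the annotations and follow the error gradient
        nlp.update([text], [annotations], sgd=optimizer, drop=0.5)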

When training a model, we don't just want it to memorise our examples – we want it to come up with a theory that can be generalised across other examples. After all, we don't just want the model to learn that this one instance of "Amazon" right here is a company – we want it to learn that "Amazon", in contexts like this, is most likely a company. That's why the training data should always be representative of the data we want to process. A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text.

This also means that in order to know how the model is performing, and whether it's learning the right things, you don't only need training data – you'll also need evaluation data. If you only test the model with the data it was trained on, you'll have no idea how well it's generalising. If you want to train a model from scratch, you usually need at least a few hundred examples for both training and evaluation. To update an existing model, you can already achieve decent results with very few examples – as long as they're representative.

Language data

Every language is different – and usually full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared across languages, while others are entirely specific – usually so specific that they need to be hard-coded. The lang module contains all language-specific data, organised in simple Python files. This makes the data easy to update and extend.

The shared language data in the directory root includes rules that can be generalised across languages – for example, rules for basic punctuation, emoji, emoticons, single-letter abbreviations and norms for equivalent tokens with different spellings, like " and ”. This helps the models make more accurate predictions. The individual language data in a submodule contains rules that are only relevant to a particular language. It also takes care of putting together all components and creating the Language subclass – for example, English or German.

[Illustration: how the shared base data and the language-specific data (stop words, lexical attributes, tokenizer exceptions, prefixes/suffixes/infixes, lemma data, character classes, morph rules and tag map) feed into the Tokenizer, Lemmatizer, Token and Morphology.]
| Name | File | Description |
|------|------|-------------|
| Stop words | stop_words.py | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return True for is_stop. |
| Tokenizer exceptions | tokenizer_exceptions.py | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
| Norm exceptions | norm_exceptions.py | Special-case rules for normalising tokens to improve the model's predictions, for example on American vs. British spelling. |
| Punctuation rules | punctuation.py | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
| Character classes | char_classes.py | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. |
| Lexical attributes | lex_attrs.py | Custom functions for setting lexical attributes on tokens, e.g. like_num, which includes language-specific words like "ten" or "hundred". |
| Syntax iterators | syntax_iterators.py | Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks. |
| Lemmatizer | lemmatizer.py | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
| Tag map | tag_map.py | Dictionary mapping strings in your tag set to Universal Dependencies tags. |
| Morph rules | morph_rules.py | Exception rules for morphological analysis of irregular words like personal pronouns. |
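To see how some of this data surfaces in the API, here's a small sketch, assuming the English data and the v2 module layout (spacy.lang.en) and reusing the nlp object from above:

from spacy.lang.en.stop_words import STOP_WORDS

print(len(STOP_WORDS))                     # the hard-coded English stop word list
doc = nlp(u'give it back to me')
print([t.text for t in doc if t.is_stop])  # tokens flagged by the stop word data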

Architecture

The central data structures in spaCy are the Doc and the Vocab. The Doc object owns the sequence of tokens and all their annotations. The Vocab object owns a set of look-up tables that make common information available across documents. By centralising strings, word vectors and lexical attributes, we avoid storing multiple copies of this data. This saves memory, and ensures there's a single source of truth.

Text annotations are also designed to allow a single source of truth: the Doc object owns the data, and Span and Token are views that point into it. The Doc object is constructed by the Tokenizer, and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
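A small sketch of that ownership model in practice – Span and Token don't copy any text, they just point back into the Doc:

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
span = doc[1:4]                 # a Span is a slice of the Doc
token = doc[0]                  # a Token is a single position in the Doc

assert span.doc is doc          # both views point back to the same Doc object
assert token.doc is doc
assert doc.vocab is nlp.vocab   # and the Doc shares the pipeline's Vocab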

[Illustration: the Language class creates the Vocab (with its StringStore and Morphology) and the Tokenizer, runs the pipeline components (tagger, dependency parser, entity recognizer, matcher, lemmatizer), and produces Doc objects whose Tokens, Spans and Lexemes all point back to the shared data.]
| Name | Description |
|------|-------------|
| Language | A text-processing pipeline. Usually you'll load this once per process as nlp and pass the instance around your application. |
| Doc | A container for accessing linguistic annotations. |
| Span | A slice from a Doc object. |
| Token | An individual token – i.e. a word, punctuation symbol, whitespace, etc. |
| Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
| Vocab | A lookup table for the vocabulary that allows you to access Lexeme objects. |
| Morphology | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag. |
| StringStore | Map strings to and from hash values. |
| Tokenizer | Segment text, and create Doc objects with the discovered segment boundaries. |
| Lemmatizer | Determine the base forms of words. |
| Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. |

Pipeline components

| Name | Description |
|------|-------------|
| Tagger | Annotate part-of-speech tags on Doc objects. |
| DependencyParser | Annotate syntactic dependencies on Doc objects. |
| EntityRecognizer | Annotate named entities, e.g. persons or products, on Doc objects. |

Other classes

| Name | Description |
|------|-------------|
| Vectors | Container class for vector data keyed by string. |
| Binder | Container class for serializing collections of Doc objects. |
| GoldParse | Collection for training annotations. |
| GoldCorpus | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. |

Community & FAQ

We're very happy to see the spaCy community grow and include a mix of people from all kinds of different backgrounds – computational linguistics, data science, deep learning, research and more. If you'd like to get involved, below are some answers to the most important questions and resources for further reading.

Help, my code isn't working!

Bugs suck, and we're doing our best to continuously improve the tests and fix bugs as soon as possible. Before you submit an issue, do a quick search and check if the problem has already been reported. If you're having installation or loading problems, make sure to also check out the troubleshooting guide. Help with spaCy is available via the following platforms:

| Platform | Purpose |
|----------|---------|
| StackOverflow | Usage questions and everything related to problems with your specific code. The StackOverflow community is much larger than ours, so if your problem can be solved by others, you'll receive help much quicker. |
| Gitter chat | General discussion about spaCy, meeting other community members and exchanging tips, tricks and best practices. If we're working on experimental models and features, we usually share them on Gitter first. |
| GitHub issue tracker | Bug reports and improvement suggestions, i.e. everything that's likely spaCy's fault. This also includes problems with the models beyond statistical imprecisions, like patterns that point to a bug. |

How can I contribute to spaCy?

You don't have to be an NLP expert or Python pro to contribute, and we're happy to help you get started. If you're new to spaCy, a good place to start is the help wanted (easy) label on GitHub, which we use to tag bugs and feature requests that are easy and self-contained. We also appreciate contributions to the docs – whether it's fixing a typo, improving an example or adding additional explanations. You'll find a "Suggest edits" link at the bottom of each page that points you to the source.

Another way of getting involved is to help us improve the language data – especially if you happen to speak one of the languages currently in alpha support. Even adding simple tokenizer exceptions, stop words or lemmatizer data can make a big difference. It will also make it easier for us to provide a statistical model for the language in the future. Submitting a test that documents a bug or performance issue, or covers functionality that's especially important for your application is also very helpful. This way, you'll also make sure we never accidentally introduce regressions to the parts of the library that you care about the most.

For more details on the types of contributions we're looking for, the code conventions and other useful tips, make sure to check out the contributing guidelines.

I've built something cool with spaCy – how can I get the word out?

First, congrats – we'd love to check it out! When you share your project on Twitter, don't forget to tag @spacy_io so we don't miss it. If you think your project would be a good fit for the showcase, feel free to submit it! Tutorials are also incredibly valuable to other users and a great way to get exposure. So we strongly encourage writing up your experiences, or sharing your code and some tips and tricks on your blog. Since our website is open-source, you can add your project or tutorial by making a pull request on GitHub.

If you would like to use the spaCy logo on your site, please get in touch and ask us first. However, if you want to show support and tell others that your project is using spaCy, you can grab one of our spaCy badges here:

https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg
<a href="https://alpha.spacy.io"><img src="https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg" height="20"></a>
[![spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg)](https://alpha.spacy.io)
https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg
<a href="https://alpha.spacy.io"><img src="https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg" height="20"></a>
[![spaCy](https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg)](https://alpha.spacy.io)
https://img.shields.io/badge/spaCy-v2-09a3d5.svg
<a href="https://alpha.spacy.io"><img src="https://img.shields.io/badge/spaCy-v2-09a3d5.svg" height="20"></a>
[![spaCy](https://img.shields.io/badge/spaCy-v2-09a3d5.svg)](https://alpha.spacy.io)
Read next: Lightning tour