
Linguistic Features
Using spaCy to extract linguistic features like part-of-speech tags, dependency labels and named entities, customising the tokenizer and working with the rule-based matcher.

Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text, and get back a Doc object, that comes with a variety of annotations.

Part-of-speech tagging

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name:

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
Text | Lemma | POS | Tag | Dep | Shape | alpha | stop
Apple | apple | PROPN | NNP | nsubj | Xxxxx | True | False
is | be | VERB | VBZ | aux | xx | True | True
looking | look | VERB | VBG | ROOT | xxxx | True | False
at | at | ADP | IN | prep | xx | True | True
buying | buy | VERB | VBG | pcomp | xxxx | True | False
U.K. | u.k. | PROPN | NNP | compound | X.X. | False | False
startup | startup | NOUN | NN | dobj | xxxx | True | False
for | for | ADP | IN | prep | xxx | True | True
$ | $ | SYM | $ | quantmod | $ | False | False
1 | 1 | NUM | CD | compound | d | False | False
billion | billion | NUM | CD | pobj | xxxx | True | False

Using spaCy's built-in displaCy visualizer, here's what our example sentence and its dependencies look like:

Rule-based morphology

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:

Context | Surface | Lemma | POS | Morphological Features
I was reading the paper | reading | read | verb | VerbForm=Ger
I don't watch the news, I read the paper. | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Pres
I read the paper yesterday | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Past

English has a relatively simple morphological system, which spaCy handles using rules that can be keyed by the token, the part-of-speech tag, or the combination of the two. The system works as follows:

  1. The tokenizer consults a mapping table TOKENIZER_EXCEPTIONS, which allows sequences of characters to be mapped to multiple tokens. Each token may be assigned a part of speech and one or more morphological features.
  2. The part-of-speech tagger then assigns each token an extended POS tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. VERB) and some amount of morphological information, e.g. that the verb is past tense.
  3. For words whose POS is not set by a prior process, a mapping table TAG_MAP maps the tags to a part-of-speech and a set of morphological features.
  4. Finally, a rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned extended part-of-speech and morphological information, without consulting the context of the token. The lemmatizer also accepts list-based exception files, acquired from WordNet.
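
To make the lookup logic concrete, here is a toy sketch of the kind of tables and rules involved. The names TAG_MAP and LEMMA_RULES mirror the real tables in spacy/lang, but the entries below are illustrative only, not spaCy's actual data:

# Illustrative only – spaCy's real tables live in spacy/lang and are much larger
TAG_MAP = {
    'VBD': {'pos': 'VERB', 'VerbForm': 'fin', 'Tense': 'past'},
    'NNS': {'pos': 'NOUN', 'Number': 'plur'},
}

LEMMA_RULES = {
    'NOUN': [('ies', 'y'), ('s', '')],  # e.g. "startups" -> "startup"
    'VERB': [('ed', ''), ('ing', '')],  # e.g. "watched" -> "watch"
}

def lemmatize(surface, pos):
    # apply the first matching suffix rule; fall back to the surface form
    for old, new in LEMMA_RULES.get(pos, []):
        if surface.endswith(old):
            return surface[:len(surface) - len(old)] + new
    return surface

pos = TAG_MAP['NNS']['pos']
assert lemmatize('startups', pos) == 'startup'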

English part-of-speech tag scheme

The English part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS tag set.
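
If you're ever unsure what a tag means, spacy.explain() will give you a short description for most tags and labels, and you can compare the extended tag_ with the mapped coarse-grained pos_ directly on each token:

import spacy

nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying U.K. startup')
for token in doc:
    # tag_ is the extended Treebank tag, pos_ the simpler Universal POS tag
    print(token.text, token.tag_, token.pos_, spacy.explain(token.tag_))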

Tag | POS | Morphology | Description
-LRB- | PUNCT | PunctType=brck PunctSide=ini | left round bracket
-RRB- | PUNCT | PunctType=brck PunctSide=fin | right round bracket
, | PUNCT | PunctType=comm | punctuation mark, comma
: | PUNCT | | punctuation mark, colon or ellipsis
. | PUNCT | PunctType=peri | punctuation mark, sentence closer
'' | PUNCT | PunctType=quot PunctSide=fin | closing quotation mark
"" | PUNCT | PunctType=quot PunctSide=fin | closing quotation mark
# | SYM | SymType=numbersign | symbol, number sign
`` | PUNCT | PunctType=quot PunctSide=ini | opening quotation mark
$ | SYM | SymType=currency | symbol, currency
ADD | X | | email
AFX | ADJ | Hyph=yes | affix
BES | VERB | | auxiliary "be"
CC | CONJ | ConjType=coor | conjunction, coordinating
CD | NUM | NumType=card | cardinal number
DT | DET | | determiner
EX | ADV | AdvType=ex | existential there
FW | X | Foreign=yes | foreign word
GW | X | | additional word in multi-word expression
HVS | VERB | | forms of "have"
HYPH | PUNCT | PunctType=dash | punctuation mark, hyphen
IN | ADP | | conjunction, subordinating or preposition
JJ | ADJ | Degree=pos | adjective
JJR | ADJ | Degree=comp | adjective, comparative
JJS | ADJ | Degree=sup | adjective, superlative
LS | PUNCT | NumType=ord | list item marker
MD | VERB | VerbType=mod | verb, modal auxiliary
NFP | PUNCT | | superfluous punctuation
NIL | | | missing tag
NN | NOUN | Number=sing | noun, singular or mass
NNP | PROPN | NounType=prop Number=sing | noun, proper singular
NNPS | PROPN | NounType=prop Number=plur | noun, proper plural
NNS | NOUN | Number=plur | noun, plural
PDT | ADJ | AdjType=pdt PronType=prn | predeterminer
POS | PART | Poss=yes | possessive ending
PRP | PRON | PronType=prs | pronoun, personal
PRP$ | ADJ | PronType=prs Poss=yes | pronoun, possessive
RB | ADV | Degree=pos | adverb
RBR | ADV | Degree=comp | adverb, comparative
RBS | ADV | Degree=sup | adverb, superlative
RP | PART | | adverb, particle
SP | SPACE | | space
SYM | SYM | | symbol
TO | PART | PartType=inf VerbForm=inf | infinitival to
UH | INTJ | | interjection
VB | VERB | VerbForm=inf | verb, base form
VBD | VERB | VerbForm=fin Tense=past | verb, past tense
VBG | VERB | VerbForm=part Tense=pres Aspect=prog | verb, gerund or present participle
VBN | VERB | VerbForm=part Tense=past Aspect=perf | verb, past participle
VBP | VERB | VerbForm=fin Tense=pres | verb, non-3rd person singular present
VBZ | VERB | VerbForm=fin Tense=pres Number=sing Person=3 | verb, 3rd person singular present
WDT | ADJ | PronType=int|rel | wh-determiner
WP | NOUN | PronType=int|rel | wh-pronoun, personal
WP$ | ADJ | Poss=yes PronType=int|rel | wh-pronoun, possessive
WRB | ADV | PronType=int|rel | wh-adverb
XX | X | | unknown

German part-of-speech tag scheme

The German part-of-speech tagger uses the TIGER Treebank annotation scheme. We also map the tags to the simpler Google Universal POS tag set.

Tag | POS | Morphology | Description
$( | PUNCT | PunctType=brck | other sentence-internal punctuation mark
$, | PUNCT | PunctType=comm | comma
$. | PUNCT | PunctType=peri | sentence-final punctuation mark
ADJA | ADJ | | adjective, attributive
ADJD | ADJ | Variant=short | adjective, adverbial or predicative
ADV | ADV | | adverb
APPO | ADP | AdpType=post | postposition
APPR | ADP | AdpType=prep | preposition; circumposition left
APPRART | ADP | AdpType=prep PronType=art | preposition with article
APZR | ADP | AdpType=circ | circumposition right
ART | DET | PronType=art | definite or indefinite article
CARD | NUM | NumType=card | cardinal number
FM | X | Foreign=yes | foreign language material
ITJ | INTJ | | interjection
KOKOM | CONJ | ConjType=comp | comparative conjunction
KON | CONJ | | coordinate conjunction
KOUI | SCONJ | | subordinate conjunction with "zu" and infinitive
KOUS | SCONJ | | subordinate conjunction with sentence
NE | PROPN | | proper noun
NNE | PROPN | | proper noun
NN | NOUN | | noun, singular or mass
PAV | ADV | PronType=dem | pronominal adverb
PROAV | ADV | PronType=dem | pronominal adverb
PDAT | DET | PronType=dem | attributive demonstrative pronoun
PDS | PRON | PronType=dem | substituting demonstrative pronoun
PIAT | DET | PronType=ind|neg|tot | attributive indefinite pronoun without determiner
PIDAT | DET | AdjType=pdt PronType=ind|neg|tot | attributive indefinite pronoun with determiner
PIS | PRON | PronType=ind|neg|tot | substituting indefinite pronoun
PPER | PRON | PronType=prs | non-reflexive personal pronoun
PPOSAT | DET | Poss=yes PronType=prs | attributive possessive pronoun
PPOSS | PRON | PronType=rel | substituting possessive pronoun
PRELAT | DET | PronType=rel | attributive relative pronoun
PRELS | PRON | PronType=rel | substituting relative pronoun
PRF | PRON | PronType=prs Reflex=yes | reflexive personal pronoun
PTKA | PART | | particle with adjective or adverb
PTKANT | PART | PartType=res | answer particle
PTKNEG | PART | Negative=yes | negative particle
PTKVZ | PART | PartType=vbp | separable verbal particle
PTKZU | PART | PartType=inf | "zu" before infinitive
PWAT | DET | PronType=int | attributive interrogative pronoun
PWAV | ADV | PronType=int | adverbial interrogative or relative pronoun
PWS | PRON | PronType=int | substituting interrogative pronoun
TRUNC | X | Hyph=yes | word remnant
VAFIN | AUX | Mood=ind VerbForm=fin | finite verb, auxiliary
VAIMP | AUX | Mood=imp VerbForm=fin | imperative, auxiliary
VAINF | AUX | VerbForm=inf | infinitive, auxiliary
VAPP | AUX | Aspect=perf VerbForm=part | perfect participle, auxiliary
VMFIN | VERB | Mood=ind VerbForm=fin VerbType=mod | finite verb, modal
VMINF | VERB | VerbForm=inf VerbType=mod | infinitive, modal
VMPP | VERB | Aspect=perf VerbForm=part VerbType=mod | perfect participle, modal
VVFIN | VERB | Mood=ind VerbForm=fin | finite verb, full
VVIMP | VERB | Mood=imp VerbForm=fin | imperative, full
VVINF | VERB | VerbForm=inf | infinitive, full
VVIZU | VERB | VerbForm=inf | infinitive with "zu", full
VVPP | VERB | Aspect=perf VerbForm=part | perfect participle, full
XY | X | | non-word containing non-letter
SP | SPACE | | space

Dependency parsing

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or "chunks". You can check whether a Doc object has been parsed with the doc.is_parsed attribute, which returns a boolean value. If this attribute is False, the default sentence iterator will raise an exception.
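
For example, the parser-powered sentence iterator is available as doc.sents – a minimal sketch:

import spacy

nlp = spacy.load('en')
doc = nlp(u'This is a sentence. This is another sentence.')
assert doc.is_parsed  # the default pipeline includes the parser
for sent in doc.sents:
    print(sent.text)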

Noun chunks

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world’s largest tech fund". To get the noun chunks in a document, simply iterate over Doc.noun_chunks .

Example

nlp = spacy.load('en')
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
Text | root.text | root.dep_ | root.head.text
Autonomous cars | cars | nsubj | shift
insurance liability | liability | dobj | shift
manufacturers | manufacturers | pobj | toward

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep_.

Example

doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Text | Dep | Head text | Head POS | Children
Autonomous | amod | cars | NOUN |
cars | nsubj | shift | VERB | Autonomous
shift | ROOT | shift | VERB | cars, liability
insurance | compound | liability | NOUN |
liability | dobj | shift | VERB | insurance, toward
toward | prep | liability | NOUN | manufacturers
manufacturers | pobj | toward | ADP |

Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest — from below:

from spacy.symbols import nsubj, VERB

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)

If you try to match from above, you'll have to iterate twice: once for the head, and then again through the children:

# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break

To iterate through the children, use the token.children attribute, which provides a sequence of Token objects.

A few more convenience attributes are provided for iterating around the local tree from the token. The .lefts and .rights attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, .n_lefts and .n_rights, that give the number of left and right children.

doc = nlp(u'bright red apples on the tree')
assert [token.text for token in doc[2].lefts] == [u'bright', u'red']
assert [token.text for token in doc[2].rights] == [u'on']
assert doc[2].n_lefts == 2
assert doc[2].n_rights == 1

You can get a whole phrase by its syntactic head using the .subtree attribute. This returns an ordered sequence of tokens. You can walk up the tree with the .ancestors attribute, and check dominance with the .is_ancestor() method.

doc = nlp(u'Credit and mortgage account holders must submit their requests')
root = [token for token in doc if token.head is token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts, descendant.n_rights,
          [ancestor.text for ancestor in descendant.ancestors])
Text | Dep | n_lefts | n_rights | ancestors
Credit | nmod | 0 | 2 | holders, submit
and | cc | 0 | 0 | Credit, holders, submit
mortgage | compound | 0 | 0 | account, Credit, holders, submit
account | conj | 1 | 0 | Credit, holders, submit
holders | nsubj | 1 | 0 | submit

Finally, the .left_edge and .right_edge attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase. Note that .right_edge gives a token within the subtree — so if you use it as the end-point of a range, don't forget to +1!

doc = nlp(u'Credit and mortgage account holders must submit their requests')
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
span.merge()
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
Text | POS | Dep | Head text
Credit and mortgage account holders | NOUN | nsubj | submit
must | VERB | aux | submit
submit | VERB | ROOT | submit
their | ADJ | poss | requests
requests | NOUN | dobj | submit

Visualizing dependencies

The best way to understand spaCy's dependency parser is interactively. To make this easier, spaCy v2.0+ comes with a visualization module. Simply pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup. If you want to know how to write rules that hook into some type of syntactic construction, just plug the sentence into the visualizer and see how spaCy annotates it.

from spacy import displacy

doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
displacy.serve(doc, style='dep')

Disabling the parser

In the default models, the parser is loaded and enabled as part of the standard processing pipeline. If you don't need any of the syntactic information, you should disable the parser. Disabling the parser will make spaCy load and run much faster. If you want to load the parser, but need to disable it for specific documents, you can also control its use on the nlp object.

nlp = spacy.load('en', disable=['parser'])
nlp = English().from_disk('/model', disable=['parser'])
doc = nlp(u"I don't want parsed", disable=['parser'])

Named Entities

spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

Named Entity Recognition 101

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Text | Start | End | Label | Description
Apple | 0 | 5 | ORG | Companies, agencies, institutions.
U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states.
$1 billion | 44 | 54 | MONEY | Monetary values, including unit.

Using spaCy's built-in displaCy visualizer, here's what our example sentence and its named entities look like:

Accessing entity annotations

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the token. If no entity type is set on a token, it will return an empty string.

Example

doc = nlp(u'San Francisco considers banning sidewalk delivery robots')

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
assert ents == [(u'San Francisco', 0, 13, u'GPE')]

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
assert ent_san == [u'San', u'B', u'GPE']
assert ent_francisco == [u'Francisco', u'I', u'GPE']
Text | ent_iob | ent_iob_ | ent_type_ | Description
San | 3 | B | GPE | beginning of an entity
Francisco | 1 | I | GPE | inside an entity
considers | 2 | O | "" | outside an entity
banning | 2 | O | "" | outside an entity
sidewalk | 2 | O | "" | outside an entity
delivery | 2 | O | "" | outside an entity
robots | 2 | O | "" | outside an entity

Setting entity annotations

To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can't write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to assign to the doc.ents attribute and create the new entity as a Span .

Example

from spacy.tokens import Span

doc = nlp(u'Netflix is hiring a new VP of global policy')
# the model didn't recognise any entities :(
ORG = doc.vocab.strings[u'ORG'] # get hash value of entity label
netflix_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity
doc.ents = [netflix_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
assert ents == [(u'Netflix', 0, 7, u'ORG')]

Keep in mind that you need to create a Span with the start and end index of the token, not the start and end index of the entity in the document. In this case, "Netflix" is token (0, 1) – but at the document level, the entity will have the start and end indices (0, 7).
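
If you're starting from character offsets instead, recent spaCy versions (v2.0+) also provide Doc.char_span, which maps character offsets to a token Span for you. A minimal sketch, continuing with the nlp object from the example above and assuming that method is available:

doc = nlp(u'Netflix is hiring a new VP of global policy')
ORG = doc.vocab.strings[u'ORG']
# character offsets (0, 7) cover "Netflix"; char_span maps them to token indices
netflix_ent = doc.char_span(0, 7, label=ORG)
assert netflix_ent.text == u'Netflix'
doc.ents = [netflix_ent]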

Setting entity annotations from array

You can also assign entity annotations using the doc.from_array() method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you're importing from.

import numpy
from spacy.attrs import ENT_IOB, ENT_TYPE

doc = nlp.make_doc(u'London is a big city in the United Kingdom.')
assert list(doc.ents) == []

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)))
attr_array[0, 0] = 3 # B – per the IOB values above, 3 marks the beginning of an entity
attr_array[0, 1] = doc.vocab.strings[u'GPE']
doc.from_array(header, attr_array)
assert list(doc.ents)[0].text == u'London'

Setting entity annotations in Cython

Finally, you can always write to the underlying struct, if you compile a Cython function. This is easy to do, and allows you to write efficient native code.

# cython: infer_types=True
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, int ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3  # 3 = "B": the token begins an entity
    for i in range(start+1, end):
        doc.c[i].ent_iob = 1  # 1 = "I": the token is inside an entity

Obviously, if you write directly to the array of TokenC* structs, you'll have responsibility for ensuring that the data is left in a consistent state.

Built-in entity types

Type | Description
PERSON | People, including fictional.
NORP | Nationalities or religious or political groups.
FACILITY | Buildings, airports, highways, bridges, etc.
ORG | Companies, agencies, institutions, etc.
GPE | Countries, cities, states.
LOC | Non-GPE locations, mountain ranges, bodies of water.
PRODUCT | Objects, vehicles, foods, etc. (Not services.)
EVENT | Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART | Titles of books, songs, etc.
LANGUAGE | Any named language.

The following values are also annotated in a style similar to names:

Type | Description
DATE | Absolute or relative dates or periods.
TIME | Times smaller than a day.
PERCENT | Percentage, including "%".
MONEY | Monetary values, including unit.
QUANTITY | Measurements, as of weight or distance.
ORDINAL | "first", "second", etc.
CARDINAL | Numerals that do not fall under another type.

Training and updating

To provide training examples to the entity recogniser, you'll first need to create an instance of the GoldParse class. You can specify your annotations in a stand-off format or as token tags. If a character offset in your entity annotations doesn't fall on a token boundary, the GoldParse class will treat that annotation as a missing value. This allows for more realistic training, because the entity recogniser is allowed to learn from examples that may feature tokenizer errors.

from spacy.tokens import Doc
from spacy.gold import GoldParse

train_data = [('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
              ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])]

doc = Doc(nlp.vocab, words=[u'rats', u'make', u'good', u'pets'])
gold = GoldParse(doc, entities=[u'U-ANIMAL', u'O', u'O', u'O'])

The BILUO Scheme

You can also provide token-level entity annotation, using the following tagging scheme to describe the entity boundaries:

Tag | Description
B (BEGIN) | The first token of a multi-token entity.
I (IN) | An inner token of a multi-token entity.
L (LAST) | The final token of a multi-token entity.
U (UNIT) | A single-token entity.
O (OUT) | A non-entity token.

spaCy translates the character offsets into this scheme, in order to decide the cost of each action given the current state of the entity recogniser. The costs are then used to calculate the gradient of the loss, to train the model. The exact algorithm is a pastiche of well-known methods, and is not currently described in any single publication. The model is a greedy transition-based parser guided by a linear model whose weights are learned using the averaged perceptron loss, via the dynamic oracle imitation learning strategy. The transition system is equivalent to the BILUO tagging scheme.
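
You can perform the same conversion yourself to inspect what the entity recogniser will be trained on – a minimal sketch, assuming your spaCy version exposes spacy.gold.biluo_tags_from_offsets:

from spacy.lang.en import English
from spacy.gold import biluo_tags_from_offsets

nlp = English()  # we only need the tokenizer here
doc = nlp(u'I like London and Berlin.')
entities = [(7, 13, u'LOC'), (18, 24, u'LOC')]  # character offsets
tags = biluo_tags_from_offsets(doc, entities)
print(tags)  # e.g. ['O', 'O', 'U-LOC', 'O', 'U-LOC', 'O']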

Visualizing named entities

The displaCy ENT visualizer lets you explore an entity recognition model's behaviour interactively. If you're training a model, it's very useful to run the visualization yourself. To help you do that, spaCy v2.0+ comes with a visualization module. Simply pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.

For more details and examples, see the usage guide on visualizing spaCy.

Named Entity example

import spacy
from spacy import displacy

text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot devices, have clear leads in
consumer adoption."""

nlp = spacy.load('custom_ner_model')
doc = nlp(text)
displacy.serve(doc, style='ent')

Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a Vocab instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string.

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. Each Doc consists of individual tokens, and we can simply iterate over them:

for token in doc:
    print(token.text)
0      1   2        3   4       5     6        7    8  9  10
Apple  is  looking  at  buying  U.K.  startup  for  $  1  billion

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  1. Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
  2. Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.

[Figure: the tokenizer applied to “Let’s go to N.Y.!” step by step – the exception, prefix and suffix rules split the string into the tokens “, Let, ’s, go, to, N.Y., !, ”]

While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass like English or German, that loads in lists of hard-coded data and exception rules.

Tokenizer data

Global and language-specific tokenizer data is supplied via the language data in spacy/lang . The tokenizer exceptions define special cases like "don't" in English, which needs to be split into two tokens: {ORTH: "do"} and {ORTH: "n't", LEMMA: "not"}. The prefixes, suffixes and infixes mostly define punctuation rules – for example, when to split off periods (at the end of a sentence), and when to leave tokens containing periods intact (abbreviations like "U.S.").
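
For illustration, a single exception entry has roughly this shape – the real tables in spacy/lang are much larger and partly generated programmatically:

from spacy.symbols import ORTH, LEMMA

# Illustrative shape of one tokenizer exception, keyed by the exact string
TOKENIZER_EXCEPTIONS = {
    u"don't": [
        {ORTH: u"do"},
        {ORTH: u"n't", LEMMA: u"not"}]
}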

[Figure: tokenizer data – the Tokenizer combines base data and language data (stop words, lexical attributes, tokenizer exceptions, prefixes, suffixes, infixes, char classes), which is shared with the Lemmatizer (lemma data), Token and Morphology (morph rules, tag map)]

Adding special case tokenization rules

Most domains have at least some idiosyncrasies that require custom tokenization rules. These could be very specific expressions, or abbreviations only used in this particular field.

Here's how to add a special case rule to an existing Tokenizer instance:

import spacy
from spacy.symbols import ORTH, LEMMA, POS

nlp = spacy.load('en')
doc = nlp(u'gimme that') # phrase to tokenize
assert [w.text for w in doc] == [u'gimme', u'that'] # current tokenization

# add special case rule
special_case = [{ORTH: u'gim', LEMMA: u'give', POS: u'VERB'}, {ORTH: u'me'}]
nlp.tokenizer.add_special_case(u'gimme', special_case)
assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']
# Pronoun lemma is returned as -PRON-!
assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'-PRON-', u'that']

For details on spaCy's custom pronoun lemma -PRON-, see here. The special case doesn't have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring:

assert 'gimme' not in [w.text for w in nlp(u'gimme!')]
assert 'gimme' not in [w.text for w in nlp(u'("...gimme...?")')]

The special case rules have precedence over the punctuation splitting:

from spacy.symbols import ORTH, LEMMA, TAG

special_case = [{ORTH: u'...gimme...?', LEMMA: u'give', TAG: u'VB'}]
nlp.tokenizer.add_special_case(u'...gimme...?', special_case)
assert len(nlp(u'...gimme...?')) == 1

Because the special-case rules allow you to set arbitrary token attributes, such as the part-of-speech, lemma, etc, they make a good mechanism for arbitrary fix-up rules. Having this logic live in the tokenizer isn't very satisfying from a design perspective, however, so the API may eventually be exposed on the Language class itself.

How spaCy's tokenizer works

spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition, and ease of alignment into the original string.

After consuming a prefix or infix, we consult the special cases again. We want the special cases to handle things like "don't" in English, and we want the same rule to work for "(don't)!". We do this by splitting off the open bracket, then the exclamation, then the close bracket, and finally matching the special-case. Here's an implementation of the algorithm in Python, optimized for readability rather than performance:

def tokenizer_pseudo_code(text, special_cases,
                          find_prefix, find_suffix, find_infixes):
    tokens = []
    for substring in text.split(' '):
        suffixes = []
        while substring:
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ''
            elif find_prefix(substring) is not None:
                split = find_prefix(substring)
                tokens.append(substring[:split])
                substring = substring[split:]
            elif find_suffix(substring) is not None:
                split = find_suffix(substring)
                suffixes.append(substring[split:])
                substring = substring[:split]
            elif find_infixes(substring):
                infixes = find_infixes(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                substring = substring[offset:]
            else:
                tokens.append(substring)
                substring = ''
        tokens.extend(reversed(suffixes))
    return tokens

The algorithm can be summarized as follows:

  1. Iterate over space-separated substrings
  2. Check whether we have an explicitly defined rule for this substring. If we do, use it.
  3. Otherwise, try to consume a prefix.
  4. If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
  5. If we didn't consume a prefix, try to consume a suffix.
  6. If we can't consume a prefix or suffix, look for "infixes" — stuff like hyphens etc.
  7. Once we can't consume any more of the string, handle it as a single token.
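
To see the pseudo-code in action, you can plug in toy rules. The regular expressions and special cases below are made up for this example and are much simpler than spaCy's real ones:

import re

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"'!.,]$''')
infix_re = re.compile(r'[-~]')
special_cases = {u"don't": [u'do', u"n't"]}

def find_prefix(s):
    # return the length of a leading punctuation character, or None
    match = prefix_re.search(s)
    return match.end() if match else None

def find_suffix(s):
    # return the index where a trailing punctuation character starts, or None
    match = suffix_re.search(s)
    return match.start() if match else None

def find_infixes(s):
    return list(infix_re.finditer(s))

tokens = tokenizer_pseudo_code(u"(don't do that!)", special_cases,
                               find_prefix, find_suffix, find_infixes)
assert tokens == [u'(', u'do', u"n't", u'do', u'that', u'!', u')']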

Customizing spaCy's Tokenizer class

Let's imagine you wanted to create a tokenizer for a new language or specific domain. There are five things you would need to define:

  1. A dictionary of special cases. This handles things like contractions, units of measurement, emoticons, certain abbreviations, etc.
  2. A function prefix_search, to handle preceding punctuation, such as open quotes, open brackets, etc
  3. A function suffix_search, to handle succeeding punctuation, such as commas, periods, close quotes, etc.
  4. A function infixes_finditer, to handle non-whitespace separators, such as hyphens etc.
  5. An optional boolean function token_match matching strings that should never be split, overriding the previous rules. Useful for things like URLs or numbers.

You shouldn't usually need to create a Tokenizer subclass. Standard usage is to use re.compile() to build a regular expression object, and pass its .search() and .finditer() methods:

import re
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=simple_url_re.match)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

If you need to subclass the tokenizer instead, the relevant methods to specialize are find_prefix, find_suffix and find_infix.

Hooking an arbitrary tokenizer into the pipeline

The tokenizer is the first component of the processing pipeline and the only one that can't be replaced by writing to nlp.pipeline. This is because it has a different signature from all the other components: it takes a text and returns a Doc, whereas all other components expect to already receive a tokenized Doc.

[Figure: the processing pipeline – Text → nlp (tokenizer → tensorizer → tagger → parser → ner) → Doc]

To overwrite the existing tokenizer, you need to replace nlp.tokenizer with a custom function that takes a text, and returns a Doc.

nlp = spacy.load('en')
nlp.tokenizer = my_tokenizer
Argument | Type | Description
text | unicode | The raw text to tokenize.
RETURNS | Doc | The tokenized document.

Example: A custom whitespace tokenizer

To construct the tokenizer, we usually want attributes of the nlp pipeline. Specifically, we want the tokenizer to hold a reference to the vocabulary object. Let's say we have the following class as our tokenizer:

from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

As you can see, we need a Vocab instance to construct this — but we won't have it until we get back the loaded nlp object. The simplest solution is to build the tokenizer in two steps. This also means that you can reuse the "tokenizer factory" and initialise it with different instances of Vocab.

nlp = spacy.load('en')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

Bringing your own annotations

spaCy generally assumes by default that your data is raw text. However, sometimes your data is partially annotated, e.g. with pre-existing tokenization, part-of-speech tags, etc. The most common situation is that you have pre-defined tokenization. If you have a list of strings, you can create a Doc object directly. Optionally, you can also specify a list of boolean values, indicating whether each word has a subsequent space.

doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])

If provided, the spaces list must be the same length as the words list. The spaces list affects the doc.text, span.text, token.idx, span.start_char and span.end_char attributes. If you don't provide a spaces sequence, spaCy will assume that all words are whitespace delimited.

good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
assert bad_spaces.text == u'Hello , world !'
assert good_spaces.text == u'Hello, world!'

Once you have a Doc object, you can write to its attributes to set the part-of-speech tags, syntactic dependencies, named entities and other attributes. For details, see the respective usage pages.
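
For example, a minimal sketch of attaching your own part-of-speech tags to a pre-tokenized Doc, assuming Token.tag_ is writable in your spaCy version (it is in v2.x):

from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()  # we only need the vocab here
doc = Doc(nlp.vocab, words=[u'I', u'like', u'pizza'])
# assumes the Token.tag_ setter is available
doc[0].tag_ = u'PRP'
doc[1].tag_ = u'VBP'
doc[2].tag_ = u'NN'
print([(token.text, token.tag_) for token in doc])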

Rule-based matching

spaCy features a rule-matching engine, the Matcher , that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_) and flags (e.g. IS_PUNCT). The rule matcher also lets you pass in a custom callback to act on matches – for example, to merge entities and apply custom labels. You can also associate patterns with entity IDs, to allow some basic entity linking or disambiguation. To match large terminology lists, you can use the PhraseMatcher , which accepts Doc objects as match patterns.

Adding patterns

Let's say we want to enable spaCy to find a combination of three tokens:

  1. A token whose lowercase form matches "hello", e.g. "Hello" or "HELLO".
  2. A token whose is_punct flag is set to True, i.e. any punctuation.
  3. A token whose lowercase form matches "world", e.g. "World" or "WORLD".
[{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

First, we initialise the Matcher with a vocab. The matcher must always share the same vocab with the documents it will operate on. We can now call matcher.add() with an ID and our custom pattern. The second argument lets you pass in an optional callback function to invoke on a successful match. For now, we set it to None.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# add match ID "HelloWorld" with no callback and one pattern
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)

doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)

The matcher returns a list of (match_id, start, end) tuples – in this case, [('HelloWorld', 0, 2)], which maps to the span doc[0:2] of our original document. (The match_id is actually the hash value of the string ID 'HelloWorld'.) Optionally, we could also choose to add more than one pattern, for example to also match sequences without punctuation between "hello" and "world":

matcher.add('HelloWorld', None,
            [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
            [{'LOWER': 'hello'}, {'LOWER': 'world'}])
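
To work with the results, you'll usually look up the string ID of each match in the vocab and slice the Doc to get the matched span – continuing the example above:

doc = nlp(u'Hello, world! Hello world!')
for match_id, start, end in matcher(doc):
    string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
    span = doc[start:end]                    # the matched tokens as a Span
    print(string_id, start, end, span.text)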

By default, the matcher will only return the matches and not do anything else, like merge entities or assign labels. This is all up to you and can be defined individually for each pattern, by passing in a callback function as the on_match argument on add(). This is useful, because it lets you write entirely custom and pattern-specific logic. For example, you might want to merge some patterns into one token, while adding entity labels for other pattern types. You shouldn't have to create different matchers for each of those processes.

Available token attributes

The available token pattern keys are uppercase versions of the Token attributes . The most relevant ones for rule-based matching are:

Attribute | Description
ORTH | The exact verbatim text of a token.
LOWER, UPPER | The lowercase, uppercase form of the token text.
IS_ALPHA, IS_ASCII, IS_DIGIT | Token text consists of alphanumeric characters, ASCII characters, digits.
IS_LOWER, IS_UPPER, IS_TITLE | Token text is in lowercase, uppercase, titlecase.
IS_PUNCT, IS_SPACE, IS_STOP | Token is punctuation, whitespace, stop word.
LIKE_NUM, LIKE_URL, LIKE_EMAIL | Token text resembles a number, URL, email.
POS, TAG, DEP, LEMMA, SHAPE | The token's simple and extended part-of-speech tag, dependency label, lemma, shape.

Using wildcard token patterns

While the token attributes offer many options to write highly specific patterns, you can also use an empty dictionary, {}, as a wildcard representing any token. This is useful if you know the context of what you're trying to match, but very little about the specific token and its characters. For example, let's say you're trying to extract people's user names from your data. All you know is that they are listed as "User name: {username}". The name itself may contain any character, but no whitespace – so you'll know it will be handled as one token.

[{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}]
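
A sketch of putting that pattern to work, assuming the nlp object and Matcher from the examples above – the 'USERNAME' match ID and the example text are made up for illustration, and the wildcard token is simply the last token of each match:

matcher.add('USERNAME', None,
            [{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}])

doc = nlp(u'User name: janedoe logged in from a new device.')
for match_id, start, end in matcher(doc):
    username = doc[end - 1]  # the token matched by the {} wildcard
    print(username.text)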

Using operators and quantifiers

The matcher also lets you use quantifiers, specified as the 'OP' key. Quantifiers let you define sequences of tokens to be matched, e.g. one or more punctuation marks, or specify optional tokens. Note that there are no nested or scoped quantifiers – instead, you can build those behaviours with on_match callbacks.

OP | Description | Example
! | match exactly 0 times | negation
* | match 0 or more times | optional, variable number
+ | match 1 or more times | mandatory, variable number
? | match 0 or 1 times | optional, max one
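
For example, adding 'OP': '+' to the punctuation token from the earlier pattern makes it match one or more punctuation tokens between "hello" and "world" – continuing the matcher set up above:

pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '+'}, {'LOWER': 'world'}]
matcher.add('HelloWorldPunct', None, pattern)

doc = nlp(u'Hello,,, world!')
matches = matcher(doc)  # now also matches "Hello,,, world"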

Adding phrase patterns

If you need to match large terminology lists, you can also use the PhraseMatcher and create Doc objects instead of token patterns, which is much more efficient overall. The Doc patterns can contain single or multiple tokens.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)
terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
patterns = [nlp(text) for text in terminology_list]
matcher.add('TerminologyList', None, *patterns)

doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
          u"converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)

Since spaCy is used for processing both the patterns and the text to be matched, you won't have to worry about specific tokenization – for example, you can simply pass in nlp(u"Washington, D.C.") and won't have to write a complex token pattern covering the exact tokenization of the term.

Adding on_match rules

To move on to a more realistic example, let's say you're working with a large corpus of blog articles, and you want to match all mentions of "Google I/O" (which spaCy tokenizes as ['Google', 'I', '/', 'O']). To be safe, you only match on the uppercase versions, in case someone has written it as "Google i/o". You also add a second pattern with an added {IS_DIGIT: True} token – this will make sure you also match on "Google I/O 2017". If your pattern matches, spaCy should execute your custom callback function add_event_ent.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)

# Get the ID of the 'EVENT' entity type. This is required to set an entity.
EVENT = nlp.vocab.strings['EVENT']

def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    doc.ents += ((EVENT, start, end),)

matcher.add('GoogleIO', add_event_ent,
            [{'ORTH': 'Google'}, {'UPPER': 'I'}, {'ORTH': '/'}, {'UPPER': 'O'}],
            [{'ORTH': 'Google'}, {'UPPER': 'I'}, {'ORTH': '/'}, {'UPPER': 'O'}, {'IS_DIGIT': True}])

In addition to mentions of "Google I/O", your data also contains some annoying pre-processing artefacts, like leftover HTML line breaks (e.g. <br> or <BR/>). While you're at it, you want to merge those into one token and flag them, to make sure you can easily ignore them later. So you add a second pattern and pass in a function merge_and_flag:

# Add a new custom flag to the vocab, which is always False by default.
# BAD_HTML_FLAG will be the flag ID, which we can use to set it to True on the span.
BAD_HTML_FLAG = nlp.vocab.add_flag(lambda text: False)

def merge_and_flag(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
    span.set_flag(BAD_HTML_FLAG, True) # set BAD_HTML_FLAG

matcher.add('BAD_HTML', merge_and_flag,
            [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
            [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])

We can now call the matcher on our documents. The patterns will be matched in the order they occur in the text. The matcher will then iterate over the matches, look up the callback for the match ID that was matched, and invoke it.

doc = nlp(LOTS_OF_TEXT)
matcher(doc)

When the callback is invoked, it is passed four arguments: the matcher itself, the document, the position of the current match, and the total list of matches. This allows you to write callbacks that consider the entire set of matched phrases, so that you can resolve overlaps and other conflicts in whatever way you prefer.

ArgumentTypeDescription
matcherMatcherThe matcher instance.
docDocThe document the matcher was used on.
iintIndex of the current match (matches[i]).
matcheslist A list of (match_id, start, end) tuples, describing the matches. A match tuple describes a span doc[start:end].

Example: Using linguistic annotations

Let's say you're analysing user comments and you want to find out what people are saying about Facebook. You want to start off by finding adjectives following "Facebook is" or "Facebook was". This is obviously a very rudimentary solution, but it'll be fast, and a great way to get an idea of what's in your data. Your pattern could look like this:

[{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]

This translates to a token whose lowercase form matches "facebook" (like Facebook, facebook or FACEBOOK), followed by a token with the lemma "be" (for example, is, was, or 's), followed by an optional adverb, followed by an adjective. Using the linguistic annotations here is especially useful, because you can tell spaCy to match "Facebook's annoying", but not "Facebook's annoying ads". The optional adverb makes sure you won't miss adjectives with intensifiers, like "pretty awful" or "very nice".

To get a quick overview of the results, you could collect all sentences containing a match and render them with the displaCy visualizer. In the callback function, you'll have access to the start and end of each match, as well as the parent Doc. This lets you determine the sentence containing the match, doc[start : end].sent, and calculate the start and end of the matched span within the sentence. Using displaCy in "manual" mode lets you pass in a list of dictionaries containing the text and entities to render.

from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matched_sents = [] # collect data of matched sentences to be visualized

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end] # matched span
    sent = span.sent # sentence containing matched span
    # append mock entity for match in displaCy style to matched_sents
    # get the match span by offsetting the start and end of the span with the
    # start and end of the sentence in the doc
    match_ents = [{'start': span.start-sent.start, 'end': span.end-sent.start,
                   'label': 'MATCH'}]
    matched_sents.append({'text': sent.text, 'ents': match_ents })

pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]
matcher.add('FacebookIs', collect_sents, pattern) # add pattern
matches = matcher(nlp(LOTS_OF_TEXT)) # match on your text

# serve visualization of sentences containing match with displaCy
# set manual=True to make displaCy render straight from a dictionary
displacy.serve(matched_sents, style='ent', manual=True)

Example: Phone numbers

Phone numbers can have many different formats and matching them is often tricky. During tokenization, spaCy will leave sequences of numbers intact and only split on whitespace and punctuation. This means that your match pattern will have to look out for number sequences of a certain length, surrounded by specific punctuation – depending on the national conventions.

The IS_DIGIT flag is not very helpful here, because it doesn't tell us anything about the length. However, you can use the SHAPE flag, with each d representing a digit:

[{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
 {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]

This will match phone numbers of the format (123) 4567 8901 or (123) 4567-8901. To also match formats like (123) 456 789, you can add a second pattern using 'ddd' in place of 'dddd'. By hard-coding some values, you can match only certain, country-specific numbers. For example, here's a pattern to match the most common formats of international German numbers:

[{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
 {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]
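
A sketch of wiring the two US-style patterns into a Matcher – the 'PHONE_NUMBER' ID and the example text are made up for illustration:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('PHONE_NUMBER', None,
            [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
             {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}],
            [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'ddd'},
             {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}])

doc = nlp(u'Call me at (123) 4567 8901 or at (123) 456 789.')
print([doc[start:end].text for match_id, start, end in matcher(doc)])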

Depending on the formats your application needs to match, creating an extensive set of rules like this is often better than training a model. It'll produce more predictable results, is much easier to modify and extend, and doesn't require any training data – only a set of test cases.

Example: Hashtags and emoji on social media

Social media posts, especially tweets, can be difficult to work with. They're very short and often contain various emoji and hashtags. By only looking at the plain text, you'll lose a lot of valuable semantic information.

Let's say you've extracted a large sample of social media posts on a specific topic, for example posts mentioning a brand name or product. As the first step of your data exploration, you want to filter out posts containing certain emoji and use them to assign a general sentiment score, based on whether the expressed emotion is positive or negative, e.g. 😀 or 😞. You also want to find, merge and label hashtags like #MondayMotivation, to be able to ignore or analyse them later.

By default, spaCy's tokenizer will split emoji into separate tokens. This means that you can create a pattern for one or more emoji tokens. Valid hashtags usually consist of a #, plus a sequence of ASCII characters with no whitespace, making them easy to match as well.

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English() # we only want the tokenizer, so no need to load a model
matcher = Matcher(nlp.vocab)

pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍'] # positive emoji
neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒'] # negative emoji

# add patterns to match one or more emoji tokens
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]

matcher.add('HAPPY', label_sentiment, *pos_patterns) # add positive pattern
matcher.add('SAD', label_sentiment, *neg_patterns) # add negative pattern

# add pattern to merge valid hashtag, i.e. '#' plus any ASCII token
matcher.add('HASHTAG', merge_hashtag, [{'ORTH': '#'}, {'IS_ASCII': True}])

Because the on_match callback receives the ID of each match, you can use the same function to handle the sentiment assignment for both the positive and negative pattern. To keep it simple, we'll either add or subtract 0.1 points – this way, the score will also reflect combinations of emoji, even positive and negative ones.

With a library like Emojipedia, we can also retrieve a short description for each emoji – for example, 😍's official title is "Smiling Face With Heart-Eyes". Assigning it to the merged token's norm will make it available as token.norm_.

from emojipedia import Emojipedia # installation: pip install emojipedia

def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == 'HAPPY': # don't forget to get string!
        doc.sentiment += 0.1 # add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == 'SAD':
        doc.sentiment -= 0.1 # subtract 0.1 for negative sentiment
    span = doc[start : end]
    emoji = Emojipedia.search(span[0].text) # get data for emoji
    span.merge(norm=emoji.title) # merge span and set NORM to emoji title

To label the hashtags, we first need to add a new custom flag. IS_HASHTAG will be the flag's ID, which you can use to assign it to the hashtag's span, and check its value via a token's check_flag() method. On each match, we merge the hashtag and assign the flag.

# Add a new custom flag to the vocab, which is always False by default
IS_HASHTAG = nlp.vocab.add_flag(lambda text: False)

def merge_hashtag(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    span.merge() # merge hashtag
    span.set_flag(IS_HASHTAG, True) # set IS_HASHTAG to True

To process a stream of social media posts, we can use Language.pipe() , which will return a stream of Doc objects that we can pass to Matcher.pipe() .

docs = nlp.pipe(LOTS_OF_TWEETS)
matches = matcher.pipe(docs)