Named Entity Recognition

spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

Named Entity Recognition 101

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
TextStartEndLabelDescription
Apple05ORGCompanies, agencies, institutions.
U.K.2731GPEGeopolitical entity, i.e. countries, cities, states.
$1 billion4454MONEYMonetary values, including unit.

Using spaCy's built-in displaCy visualizer, here's what our example sentence and its named entities look like:

Accessing entity annotations

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the tag. If no entity type is set on a token, it will return an empty string.

Example

doc = nlp(u'San Francisco considers banning sidewalk delivery robots') # document level ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] assert ents == [(u'San Francisco', 0, 13, u'GPE')] # token level ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_] ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_] assert ent_san == [u'San', u'B', u'GPE'] assert ent_francisco == [u'Francisco', u'I', u'GPE']
Textent_iobent_iob_ent_type_Description
San3BGPEbeginning of an entity
Francisco1IGPEinside an entity
considers2O""outside an entity
banning2O""outside an entity
sidewalk2O""outside an entity
delivery2O""outside an entity
robots2O""outside an entity

Setting entity annotations

To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can't write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to assign to the doc.ents attribute and create the new entity as a Span .

Example

from spacy.tokens import Span doc = nlp(u'Netflix is hiring a new VP of global policy') # the model didn't recognise any entities :( ORG = doc.vocab.strings[u'ORG'] # get hash value of entity label netflix_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity doc.ents = [netflix_ent] ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] assert ents = [(u'Netflix', 0, 7, u'ORG')]

Keep in mind that you need to create a Span with the start and end index of the token, not the start and end index of the entity in the document. In this case, "Netflix" is token (0, 1) – but at the document level, the entity will have the start and end indices (0, 7).

Setting entity annotations from array

You can also assign entity annotations using the doc.from_array() method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you're importing from.

import numpy
from spacy.attrs import ENT_IOB, ENT_TYPE

doc = nlp.make_doc(u'London is a big city in the United Kingdom.')
assert list(doc.ents) == []

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)))
attr_array[0, 0] = 2 # B
attr_array[0, 1] = doc.vocab.strings[u'GPE']
doc.from_array(header, attr_array)
assert list(doc.ents)[0].text == u'London'

Setting entity annotations in Cython

Finally, you can always write to the underlying struct, if you compile a Cython function. This is easy to do, and allows you to write efficient native code.

# cython: infer_types=True
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, int ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3
    for i in range(start+1, end):
        doc.c[i].ent_iob = 2

Obviously, if you write directly to the array of TokenC* structs, you'll have responsibility for ensuring that the data is left in a consistent state.

Built-in entity types

TypeDescription
PERSONPeople, including fictional.
NORPNationalities or religious or political groups.
FACILITYBuildings, airports, highways, bridges, etc.
ORGCompanies, agencies, institutions, etc.
GPECountries, cities, states.
LOCNon-GPE locations, mountain ranges, bodies of water.
PRODUCTObjects, vehicles, foods, etc. (Not services.)
EVENTNamed hurricanes, battles, wars, sports events, etc.
WORK_OF_ARTTitles of books, songs, etc.
LANGUAGEAny named language.

The following values are also annotated in a style similar to names:

TypeDescription
DATEAbsolute or relative dates or periods.
TIMETimes smaller than a day.
PERCENTPercentage, including "%".
MONEYMonetary values, including unit.
QUANTITYMeasurements, as of weight or distance.
ORDINAL"first", "second", etc.
CARDINALNumerals that do not fall under another type.

Training and updating

To provide training examples to the entity recogniser, you'll first need to create an instance of the GoldParse class. You can specify your annotations in a stand-off format or as token tags. If a character offset in your entity annotations don't fall on a token boundary, the GoldParse class will treat that annotation as a missing value. This allows for more realistic training, because the entity recogniser is allowed to learn from examples that may feature tokenizer errors.

train_data = [('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
              ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])]
doc = Doc(nlp.vocab, [u'rats', u'make', u'good', u'pets'])
gold = GoldParse(doc, [u'U-ANIMAL', u'O', u'O', u'O'])

The BILUO Scheme

You can also provide token-level entity annotation, using the following tagging scheme to describe the entity boundaries:

TagDescription
B EGINThe first token of a multi-token entity.
I NAn inner token of a multi-token entity.
L ASTThe final token of a multi-token entity.
U NITA single-token entity.
O UTA non-entity token.

spaCy translates the character offsets into this scheme, in order to decide the cost of each action given the current state of the entity recogniser. The costs are then used to calculate the gradient of the loss, to train the model. The exact algorithm is a pastiche of well-known methods, and is not currently described in any single publication. The model is a greedy transition-based parser guided by a linear model whose weights are learned using the averaged perceptron loss, via the dynamic oracle imitation learning strategy. The transition system is equivalent to the BILOU tagging scheme.

Visualizing named entities

The displaCy ENT visualizer lets you explore an entity recognition model's behaviour interactively. If you're training a model, it's very useful to run the visualization yourself. To help you do that, spaCy v2.0+ comes with a visualization module. Simply pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.

For more details and examples, see the usage guide on visualizing spaCy.

Named Entity example

import spacy from spacy import displacy text = """But Google is starting from behind. The company made a late push into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption.""" nlp = spacy.load('custom_ner_model') doc = nlp(text) displacy.serve(doc, style='ent')