Using the dependency parse

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or "chunks". You can check whether a Doc object has been parsed with the doc.is_parsed attribute, which returns a boolean value. If this attribute is False, the default sentence iterator will raise an exception.
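For instance, a minimal sketch of this check, assuming the default English model is installed, might look like this:

import spacy

nlp = spacy.load('en')
doc = nlp(u'This is a sentence. This is another sentence.')
assert doc.is_parsed  # the parser has run, so sentence boundaries are set
for sent in doc.sents:
    print(sent.text)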

Noun chunks

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world's largest tech fund". To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

Example

import spacy

nlp = spacy.load('en')
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
Text                | root.text     | root.dep_ | root.head.text
Autonomous cars     | cars          | nsubj     | shift
insurance liability | liability     | dobj      | shift
manufacturers       | manufacturers | pobj      | toward

Navigating the parse tree

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep_.

Example

doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Text          | Dep      | Head text | Head POS | Children
Autonomous    | amod     | cars      | NOUN     |
cars          | nsubj    | shift     | VERB     | Autonomous
shift         | ROOT     | shift     | VERB     | cars, liability
insurance     | compound | liability | NOUN     |
liability     | dobj     | shift     | VERB     | insurance, toward
toward        | prep     | liability | NOUN     | manufacturers
manufacturers | pobj     | toward    | ADP      |

Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest — from below:

from spacy.symbols import nsubj, VERB

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
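For the example sentence above, this leaves verbs containing the single token shift, since cars is its nominal subject.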

If you try to match from above, you'll have to iterate twice: once for the head, and then again through the children:

# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break

To iterate through the children, use the token.children attribute, which provides a sequence of Token objects.
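As a quick illustration, reusing the example sentence from above, the children of shift are its two direct dependents:

doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
# doc[2] is 'shift'; its children are yielded in sentence order
print([child.text for child in doc[2].children])  # ['cars', 'liability']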

Iterating around the local tree

A few more convenience attributes are provided for iterating around the local tree from the token. The .lefts and .rights attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, .n_lefts and .n_rights, which give the number of left and right children.

doc = nlp(u'bright red apples on the tree')
assert [token.text for token in doc[2].lefts] == [u'bright', u'red']
assert [token.text for token in doc[2].rights] == [u'on']
assert doc[2].n_lefts == 2
assert doc[2].n_rights == 1

You can get a whole phrase by its syntactic head using the .subtree attribute. This returns an ordered sequence of tokens. You can walk up the tree with the .ancestors attribute, and check dominance with the .is_ancestor() method.

doc = nlp(u'Credit and mortgage account holders must submit their requests')
root = [token for token in doc if token.head is token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts, descendant.n_rights,
          [ancestor.text for ancestor in descendant.ancestors])
Text     | Dep      | n_lefts | n_rights | ancestors
Credit   | nmod     | 0       | 2        | holders, submit
and      | cc       | 0       | 0        | Credit, holders, submit
mortgage | compound | 0       | 0        | account, Credit, holders, submit
account  | conj     | 1       | 0        | Credit, holders, submit
holders  | nsubj    | 1       | 0        | submit

Finally, the .left_edge and .right_edge attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase. Note that .right_edge gives a token within the subtree — so if you use it as the end-point of a range, don't forget to +1!

doc = nlp(u'Credit and mortgage account holders must submit their requests')
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
span.merge()
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
Text                                | POS  | Dep   | Head text
Credit and mortgage account holders | NOUN | nsubj | submit
must                                | VERB | aux   | submit
submit                              | VERB | ROOT  | submit
their                               | ADJ  | poss  | requests
requests                            | NOUN | dobj  | submit

Visualizing dependencies

The best way to understand spaCy's dependency parser is interactively. To make this easier, spaCy v2.0+ comes with a visualization module. Simply pass a Doc or a list of Doc objects to displaCy and call displacy.serve to start the web server, or displacy.render to generate the raw markup. If you want to know how to write rules that hook into some type of syntactic construction, just plug the sentence into the visualizer and see how spaCy annotates it.

from spacy import displacy

doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
displacy.serve(doc, style='dep')
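If you'd rather produce the markup yourself, for example to save it to a file, displacy.render returns it as a string. A minimal sketch (the output filename here is arbitrary):

from spacy import displacy

doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
# page=True wraps the markup in a complete HTML page
html = displacy.render(doc, style='dep', page=True)
with open('parse.html', 'w') as f:
    f.write(html)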

Disabling the parser

In the default models, the parser is loaded and enabled as part of the standard processing pipeline. If you don't need any of the syntactic information, you should disable the parser. Disabling the parser will make spaCy load and run much faster. If you want to load the parser, but need to disable it for specific documents, you can also control its use on the nlp object.

from spacy.lang.en import English

nlp = spacy.load('en', disable=['parser'])               # don't load the parser at all
nlp = English().from_disk('/model', disable=['parser'])  # same, when loading from a path
doc = nlp(u"I don't want parsed", disable=['parser'])    # skip the parser for one document