Part-of-speech tagging

Part-of-speech tags are labels such as noun, verb and adjective that are assigned to each token in a document. They are useful in rule-based processes and can also serve as features in some statistical models.

Part-of-speech tagging 101

After tokenization, spaCy can also parse and tag a given Doc. This is where the statistical model comes in: it enables spaCy to predict which tag or label most likely applies in a given context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language. For example, a word following "the" in English is most likely a noun.
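To make the idea of predicting from examples concrete, here is a toy sketch. It is not spaCy's model: it only counts which tag follows each preceding word in a handful of made-up labelled examples and predicts the most frequent one. The names EXAMPLES, train and predict are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled data: (previous word, tag of the following word).
EXAMPLES = [
    ("the", "NOUN"), ("the", "NOUN"), ("the", "ADJ"),
    ("to", "VERB"), ("to", "VERB"), ("to", "NOUN"),
    ("very", "ADJ"),
]

def train(examples):
    """Count how often each tag follows each preceding word."""
    counts = defaultdict(Counter)
    for prev_word, tag in examples:
        counts[prev_word.lower()][tag] += 1
    return counts

def predict(counts, prev_word, default="NOUN"):
    """Return the tag seen most often after prev_word, or a default."""
    seen = counts.get(prev_word.lower())
    if not seen:
        return default
    return seen.most_common(1)[0][0]

model = train(EXAMPLES)
print(predict(model, "the"))  # NOUN
print(predict(model, "to"))   # VERB
```

A real statistical tagger looks at far richer context than one preceding word, but the principle is the same: the prediction comes from patterns in the training examples rather than from hand-written rules.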

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. To get the readable string representation of an attribute, add an underscore _ to its name:

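import spacy

# Assumes an English model is installed; 'en_core_web_sm' is one common choice.
nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    # Attributes ending in an underscore return readable strings, not hash IDs.
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)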
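The relationship between a hashed attribute and its underscore counterpart can be illustrated with a toy string store. This is only a sketch of the idea: ToyStringStore and ToyToken are hypothetical classes, and spaCy's own hashing and storage work differently in detail.

```python
import hashlib

class ToyStringStore:
    """Maps strings to integer IDs and back (illustrative only)."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Derive a stable 64-bit ID from the string's UTF-8 bytes.
        digest = hashlib.sha256(text.encode("utf8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        return self._by_id[key]

class ToyToken:
    """Stores its tag as an integer; `pos_` resolves it to a string."""

    def __init__(self, store, text, pos):
        self.store = store
        self.text = text
        self.pos = store.add(pos)    # integer ID

    @property
    def pos_(self):
        return self.store[self.pos]  # readable string

store = ToyStringStore()
token = ToyToken(store, "Apple", "PROPN")
print(type(token.pos).__name__, token.pos_)  # int PROPN
```

Each distinct string is stored once, and every token that shares it holds only the integer, which is where the memory saving comes from.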

spaCy's built-in displaCy visualizer can render our example sentence and its dependencies as a diagram.

Rule-based morphology

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:

Context                                    Surface  Lemma  POS   Morphological Features
I was reading the paper                    reading  read   verb  VerbForm=Ger
I don't watch the news, I read the paper.  read     read   verb  VerbForm=Fin, Mood=Ind, Tense=Pres
I read the paper yesterday                 read     read   verb  VerbForm=Fin, Mood=Ind, Tense=Past

English has a relatively simple morphological system, which spaCy handles using rules that can be keyed by the token, the part-of-speech tag, or the combination of the two. The system works as follows:

  1. The tokenizer consults a mapping table TOKENIZER_EXCEPTIONS, which allows sequences of characters to be mapped to multiple tokens. Each token may be assigned a part of speech and one or more morphological features.
  2. The part-of-speech tagger then assigns each token an extended POS tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. VERB) and some amount of morphological information, e.g. that the verb is past tense.
  3. For words whose POS is not set by a prior process, a mapping table TAG_MAP maps the tags to a part-of-speech and a set of morphological features.
  4. Finally, a rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned extended part-of-speech and morphological information, without consulting the context of the token. The lemmatizer also accepts list-based exception files, acquired from WordNet.
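The four steps above can be sketched in plain Python. This is a simplified illustration, not spaCy's implementation: the table contents, LEMMA_RULES, LEMMA_EXCEPTIONS and the helper functions are hypothetical, the tagger output is passed in by hand, and the lemmatizer here looks only at the coarse part-of-speech.

```python
# Hypothetical, heavily simplified tables; real language data is far larger.
TOKENIZER_EXCEPTIONS = {
    "don't": [{"text": "do"}, {"text": "n't", "lemma": "not"}],
}
TAG_MAP = {
    "VBD": {"pos": "VERB", "VerbForm": "fin", "Tense": "past"},
    "VBG": {"pos": "VERB", "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "NNS": {"pos": "NOUN", "Number": "plur"},
    "NN": {"pos": "NOUN", "Number": "sing"},
}
# Suffix rewrite rules keyed by coarse part-of-speech: (old suffix, new suffix).
LEMMA_RULES = {
    "VERB": [("ing", ""), ("ed", "")],
    "NOUN": [("s", "")],
}
# List-based exceptions take priority over the suffix rules.
LEMMA_EXCEPTIONS = {("VERB", "was"): "be", ("NOUN", "mice"): "mouse"}

def tokenize(text):
    """Step 1: split on whitespace, expanding known exceptions."""
    tokens = []
    for chunk in text.split():
        tokens.extend(TOKENIZER_EXCEPTIONS.get(chunk, [{"text": chunk}]))
    # Copy the dicts so later annotation never mutates the exception table.
    return [dict(token) for token in tokens]

def lemmatize(text, pos):
    """Step 4: exceptions first, then the first matching suffix rule."""
    word = text.lower()
    if (pos, word) in LEMMA_EXCEPTIONS:
        return LEMMA_EXCEPTIONS[(pos, word)]
    for old, new in LEMMA_RULES.get(pos, []):
        if word.endswith(old) and len(word) > len(old):
            return word[: len(word) - len(old)] + new
    return word

def annotate(tokens, tags):
    """Steps 2-4: `tags` stands in for the statistical tagger's output."""
    for token, tag in zip(tokens, tags):
        token["tag"] = tag
        features = dict(TAG_MAP.get(tag, {}))
        pos = features.pop("pos", "X")
        # Step 3: only set the POS if a prior process has not done so.
        token.setdefault("pos", pos)
        token["morph"] = features
        # Keep a lemma already supplied by a tokenizer exception.
        token.setdefault("lemma", lemmatize(token["text"], token["pos"]))
    return tokens

for t in annotate(tokenize("dogs barked"), ["NNS", "VBD"]):
    print(t["text"], t["tag"], t["pos"], t["lemma"], t["morph"])
# dogs NNS NOUN dog {'Number': 'plur'}
# barked VBD VERB bark {'VerbForm': 'fin', 'Tense': 'past'}
```

Note that the lemmatizer never sees neighbouring tokens, which mirrors the point made in step 4: once the tag is fixed, the lemma follows deterministically from the tables.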

Part-of-speech tag schemes

English part-of-speech tag scheme

The English part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS tag set.

Tag     POS    Morphology                                    Description
-LRB-   PUNCT  PunctType=brck PunctSide=ini                  left round bracket
-RRB-   PUNCT  PunctType=brck PunctSide=fin                  right round bracket
,       PUNCT  PunctType=comm                                punctuation mark, comma
:       PUNCT                                                punctuation mark, colon or ellipsis
.       PUNCT  PunctType=peri                                punctuation mark, sentence closer
''      PUNCT  PunctType=quot PunctSide=fin                  closing quotation mark
""      PUNCT  PunctType=quot PunctSide=fin                  closing quotation mark
#       SYM    SymType=numbersign                            symbol, number sign
``      PUNCT  PunctType=quot PunctSide=ini                  opening quotation mark
$       SYM    SymType=currency                              symbol, currency
AFX     ADJ    Hyph=yes                                      affix
BES     VERB                                                 auxiliary "be"
CC      CONJ   ConjType=coor                                 conjunction, coordinating
CD      NUM    NumType=card                                  cardinal number
DT      DET                                                  determiner
EX      ADV    AdvType=ex                                    existential there
FW      X      Foreign=yes                                   foreign word
GW      X                                                    additional word in multi-word expression
HVS     VERB                                                 forms of "have"
HYPH    PUNCT  PunctType=dash                                punctuation mark, hyphen
IN      ADP                                                  conjunction, subordinating or preposition
JJ      ADJ    Degree=pos                                    adjective
JJR     ADJ    Degree=comp                                   adjective, comparative
JJS     ADJ    Degree=sup                                    adjective, superlative
LS      PUNCT  NumType=ord                                   list item marker
MD      VERB   VerbType=mod                                  verb, modal auxiliary
NFP     PUNCT                                                superfluous punctuation
NIL                                                          missing tag
NN      NOUN   Number=sing                                   noun, singular or mass
NNP     PROPN  NounType=prop Number=sing                     noun, proper singular
NNPS    PROPN  NounType=prop Number=plur                     noun, proper plural
NNS     NOUN   Number=plur                                   noun, plural
PDT     ADJ    AdjType=pdt PronType=prn                      predeterminer
POS     PART   Poss=yes                                      possessive ending
PRP     PRON   PronType=prs                                  pronoun, personal
PRP$    ADJ    PronType=prs Poss=yes                         pronoun, possessive
RB      ADV    Degree=pos                                    adverb
RBR     ADV    Degree=comp                                   adverb, comparative
RBS     ADV    Degree=sup                                    adverb, superlative
RP      PART                                                 adverb, particle
TO      PART   PartType=inf VerbForm=inf                     infinitival to
VB      VERB   VerbForm=inf                                  verb, base form
VBD     VERB   VerbForm=fin Tense=past                       verb, past tense
VBG     VERB   VerbForm=part Tense=pres Aspect=prog          verb, gerund or present participle
VBN     VERB   VerbForm=part Tense=past Aspect=perf          verb, past participle
VBP     VERB   VerbForm=fin Tense=pres                       verb, non-3rd person singular present
VBZ     VERB   VerbForm=fin Tense=pres Number=sing Person=3  verb, 3rd person singular present
WDT     ADJ    PronType=int|rel                              wh-determiner
WP      NOUN   PronType=int|rel                              wh-pronoun, personal
WP$     ADJ    Poss=yes PronType=int|rel                     wh-pronoun, possessive
WRB     ADV    PronType=int|rel                              wh-adverb
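As a small illustration of mapping fine-grained tags to the simpler universal set, the sketch below copies a few tag and POS pairs from the English table into a dictionary. The names PENN_TO_UNIVERSAL and to_universal are hypothetical, and falling back to "X" for unlisted tags is an assumption of this sketch rather than documented behaviour.

```python
# A few rows from the English table: fine-grained tag -> coarse POS.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "DT": "DET", "IN": "ADP", ".": "PUNCT",
}

def to_universal(tags, unknown="X"):
    """Map fine-grained tags to coarse tags; unlisted tags become `unknown`."""
    return [PENN_TO_UNIVERSAL.get(tag, unknown) for tag in tags]

# Hypothetical tagger output for "The startup is buying companies."
fine = ["DT", "NN", "VBZ", "VBG", "NNS", "."]
print(to_universal(fine))
# ['DET', 'NOUN', 'VERB', 'VERB', 'NOUN', 'PUNCT']
```

The mapping is many-to-one, so the coarse tags are convenient for rules that should not care about tense or number, while the fine-grained tags keep that detail available.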

German part-of-speech tag scheme

The German part-of-speech tagger uses the TIGER Treebank annotation scheme. We also map the tags to the simpler Google Universal POS tag set.

Tag      POS    Morphology                              Description
$(       PUNCT  PunctType=brck                          other sentence-internal punctuation mark
$,       PUNCT  PunctType=comm                          comma
$.       PUNCT  PunctType=peri                          sentence-final punctuation mark
ADJA     ADJ                                            adjective, attributive
ADJD     ADJ    Variant=short                           adjective, adverbial or predicative
APPO     ADP    AdpType=post                            postposition
APPR     ADP    AdpType=prep                            preposition; circumposition left
APPRART  ADP    AdpType=prep PronType=art               preposition with article
APZR     ADP    AdpType=circ                            circumposition right
ART      DET    PronType=art                            definite or indefinite article
CARD     NUM    NumType=card                            cardinal number
FM       X      Foreign=yes                             foreign language material
KOKOM    CONJ   ConjType=comp                           comparative conjunction
KON      CONJ                                           coordinate conjunction
KOUI     SCONJ                                          subordinate conjunction with "zu" and infinitive
KOUS     SCONJ                                          subordinate conjunction with sentence
NE       PROPN                                          proper noun
NNE      PROPN                                          proper noun
NN       NOUN                                           noun, singular or mass
PAV      ADV    PronType=dem                            pronominal adverb
PROAV    ADV    PronType=dem                            pronominal adverb
PDAT     DET    PronType=dem                            attributive demonstrative pronoun
PDS      PRON   PronType=dem                            substituting demonstrative pronoun
PIAT     DET    PronType=ind|neg|tot                    attributive indefinite pronoun without determiner
PIDAT    DET    AdjType=pdt PronType=ind|neg|tot        attributive indefinite pronoun with determiner
PIS      PRON   PronType=ind|neg|tot                    substituting indefinite pronoun
PPER     PRON   PronType=prs                            non-reflexive personal pronoun
PPOSAT   DET    Poss=yes PronType=prs                   attributive possessive pronoun
PPOSS    PRON   PronType=rel                            substituting possessive pronoun
PRELAT   DET    PronType=rel                            attributive relative pronoun
PRELS    PRON   PronType=rel                            substituting relative pronoun
PRF      PRON   PronType=prs Reflex=yes                 reflexive personal pronoun
PTKA     PART                                           particle with adjective or adverb
PTKANT   PART   PartType=res                            answer particle
PTKNEG   PART   Negative=yes                            negative particle
PTKVZ    PART   PartType=vbp                            separable verbal particle
PTKZU    PART   PartType=inf                            "zu" before infinitive
PWAT     DET    PronType=int                            attributive interrogative pronoun
PWAV     ADV    PronType=int                            adverbial interrogative or relative pronoun
PWS      PRON   PronType=int                            substituting interrogative pronoun
TRUNC    X      Hyph=yes                                word remnant
VAFIN    AUX    Mood=ind VerbForm=fin                   finite verb, auxiliary
VAIMP    AUX    Mood=imp VerbForm=fin                   imperative, auxiliary
VAINF    AUX    VerbForm=inf                            infinitive, auxiliary
VAPP     AUX    Aspect=perf VerbForm=fin                perfect participle, auxiliary
VMFIN    VERB   Mood=ind VerbForm=fin VerbType=mod      finite verb, modal
VMINF    VERB   VerbForm=fin VerbType=mod               infinitive, modal
VMPP     VERB   Aspect=perf VerbForm=part VerbType=mod  perfect participle, modal
VVFIN    VERB   Mood=ind VerbForm=fin                   finite verb, full
VVIMP    VERB   Mood=imp VerbForm=fin                   imperative, full
VVINF    VERB   VerbForm=inf                            infinitive, full
VVIZU    VERB   VerbForm=inf                            infinitive with "zu", full
VVPP     VERB   Aspect=perf VerbForm=part               perfect participle, full
XY       X                                              non-word containing non-letter
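The morphology strings in these tables are compact, so it can help to see how they might be read programmatically. The sketch below parses strings such as "AdjType=pdt PronType=ind|neg|tot", assuming that spaces separate features and that "|" separates alternative values of one feature. The helpers parse_features and tags_with and the GERMAN_ROWS excerpt are hypothetical and not part of spaCy.

```python
def parse_features(morphology):
    """Parse space-separated 'Name=value' pairs into a dict of value lists."""
    features = {}
    for pair in morphology.split():
        name, _, value = pair.partition("=")
        # Assumption: '|' lists alternative values for the same feature.
        features[name] = value.split("|")
    return features

# A few rows from the German table: tag -> (POS, morphology string).
GERMAN_ROWS = {
    "PDAT": ("DET", "PronType=dem"),
    "PDS": ("PRON", "PronType=dem"),
    "PIAT": ("DET", "PronType=ind|neg|tot"),
    "PIDAT": ("DET", "AdjType=pdt PronType=ind|neg|tot"),
    "VVFIN": ("VERB", "Mood=ind VerbForm=fin"),
    "NN": ("NOUN", ""),
}

def tags_with(rows, name, value):
    """Return the tags whose feature `name` allows `value`, sorted by tag."""
    return sorted(
        tag
        for tag, (_pos, morphology) in rows.items()
        if value in parse_features(morphology).get(name, [])
    )

print(parse_features("AdjType=pdt PronType=ind|neg|tot"))
# {'AdjType': ['pdt'], 'PronType': ['ind', 'neg', 'tot']}
print(tags_with(GERMAN_ROWS, "PronType", "dem"))
# ['PDAT', 'PDS']
```

Keeping every value as a list means single-valued and multi-valued features can be queried the same way.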