
Adding Languages
Adding full support for a language touches many different parts of the spaCy library. This guide explains how to fit everything together, and points you to the specific workflows for each component.

Obviously, there are lots of ways you can organise your code when you implement your own language data. This guide will focus on how it's done within spaCy. For full language support, you'll need to create a Language subclass, define custom language data, like a stop list and tokenizer exceptions, and test the new tokenizer. Once the language is set up, you can build the vocabulary, including word frequencies, Brown clusters and word vectors. Finally, you can train the tagger and parser, and save the model to a directory.

For some languages, you may also want to develop a solution for lemmatization and morphological analysis.

Language data

Every language is different – and usually full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared across languages, while others are entirely specific – usually so specific that they need to be hard-coded. The lang module contains all language-specific data, organised in simple Python files. This makes the data easy to update and extend.

The shared language data in the directory root includes rules that can be generalised across languages – for example, rules for basic punctuation, emoji, emoticons, single-letter abbreviations and norms for equivalent tokens with different spellings, like " and ”. This helps the models make more accurate predictions. The individual language data in a submodule contains rules that are only relevant to a particular language. It also takes care of putting together all components and creating the Language subclass – for example, English or German.

[Overview diagram: the language data – stop words, lexical attributes, tokenizer exceptions, prefixes, suffixes and infixes, lemma data, character classes, morph rules and tag map – feeds into the Tokenizer, Lemmatizer and Morphology.]
Stop words (stop_words.py): List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return True for is_stop.
Tokenizer exceptions (tokenizer_exceptions.py): Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.".
Norm exceptions (norm_exceptions.py): Special-case rules for normalising tokens to improve the model's predictions, for example on American vs. British spelling.
Punctuation rules (punctuation.py): Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.
Character classes (char_classes.py): Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons.
Lexical attributes (lex_attrs.py): Custom functions for setting lexical attributes on tokens, e.g. like_num, which includes language-specific words like "ten" or "hundred".
Syntax iterators (syntax_iterators.py): Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks.
Lemmatizer (lemmatizer.py): Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was".
Tag map (tag_map.py): Dictionary mapping strings in your tag set to Universal Dependencies tags.
Morph rules (morph_rules.py): Exception rules for morphological analysis of irregular words like personal pronouns.

The individual components expose variables that can be imported within a language module, and added to the language's Defaults. Some components, like the punctuation rules, usually don't need much customisation and can simply be imported from the global rules. Others, like the tokenizer and norm exceptions, are very specific and will make a big difference to spaCy's performance on the particular language and to training a language model.

STOP_WORDS (set): Individual words.
TOKENIZER_EXCEPTIONS (dict): Keyed by strings, mapped to a list of one dict per token with token attributes.
TOKEN_MATCH (regex): Regexes to match complex tokens, e.g. URLs.
NORM_EXCEPTIONS (dict): Keyed by strings, mapped to their norms.
TOKENIZER_PREFIXES (list): Strings or regexes, usually not customised.
TOKENIZER_SUFFIXES (list): Strings or regexes, usually not customised.
TOKENIZER_INFIXES (list): Strings or regexes, usually not customised.
LEX_ATTRS (dict): Attribute ID mapped to function.
SYNTAX_ITERATORS (dict): Iterator ID mapped to function. Currently only supports 'noun_chunks'.
LOOKUP (dict): Keyed by strings, mapped to their lemma.
LEMMA_RULES, LEMMA_INDEX, LEMMA_EXC (dict): Lemmatization rules, keyed by part of speech.
TAG_MAP (dict): Keyed by strings, mapped to Universal Dependencies tags.
MORPH_RULES (dict): Keyed by strings, mapped to a dict of their morphological features.

Creating a Language subclass

Language-specific code and resources should be organised into a subpackage of spaCy, named according to the language's ISO code. For instance, code and resources specific to Spanish are placed into a directory spacy/lang/es, which can be imported as spacy.lang.es.

To get started, you can use our templates for the most important files. Here's what the class template looks like:

__init__.py (excerpt)

# import language-specific data
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .lex_attrs import LEX_ATTRS

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc


# create Defaults class in the module scope (necessary for pickling!)
class XxxxxDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'xx'  # language ISO code

    # optional: replace flags with custom functions, e.g. like_num()
    lex_attr_getters.update(LEX_ATTRS)

    # merge base exceptions and custom tokenizer exceptions
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = set(STOP_WORDS)


# create actual Language class
class Xxxxx(Language):
    lang = 'xx'  # language ISO code
    Defaults = XxxxxDefaults  # override defaults


# set default export – this allows the language class to be lazy-loaded
__all__ = ['Xxxxx']

Stop words

A "stop list" is a classic trick from the early days of information retrieval when search was largely about keyword presence and absence. It is still sometimes useful today to filter out common words from a bag-of-words model. To improve readability, STOP_WORDS are separated by spaces and newlines, and added as a multiline string.

Example

STOP_WORDS = set("""
a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at

back be became because become becomes becoming been before beforehand behind
being below beside besides between beyond both bottom but by
""".split())

Tokenizer exceptions

spaCy's tokenization algorithm lets you deal with whitespace-delimited chunks separately. This makes it easy to define special-case rules, without worrying about how they interact with the rest of the tokenizer. Whenever the key string is matched, the special-case rule is applied, giving the defined sequence of tokens. You can also attach attributes to the subtokens covered by your special case, such as each subtoken's LEMMA or TAG.

Tokenizer exceptions can be added in the following format:

tokenizer_exceptions.py (excerpt)

TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
}

Unambiguous abbreviations, like month names or locations in English, should be added to exceptions with a lemma assigned, for example {ORTH: "Jan.", LEMMA: "January"}. Since the exceptions are added in Python, you can use custom logic to generate them more efficiently and make your data less verbose. How you do this ultimately depends on the language. Here's an example of how exceptions for time formats like "1a.m." and "1am" are generated in the English tokenizer_exceptions.py:

tokenizer_exceptions.py (excerpt)

# use short, internal variable for readability
_exc = {}

for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        # always keep an eye on string interpolation!
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m."}]

# only declare this at the bottom
TOKENIZER_EXCEPTIONS = dict(_exc)

When adding the tokenizer exceptions to the Defaults, you can use the update_exc() helper function to merge them with the global base exceptions (including one-letter abbreviations and emoticons). The function performs a basic check to make sure exceptions are provided in the correct format. It can take any number of exception dicts as its arguments, and will update and overwrite the exceptions in this order. For example, if your language's tokenizer exceptions include a custom tokenization pattern for "a.", it will overwrite the base exceptions with the language's custom one.

Example

from ...util import update_exc

BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}

tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}

Norm exceptions

In addition to ORTH or LEMMA, tokenizer exceptions can also set a NORM attribute. This is useful to specify a normalised version of the token – for example, the norm of "n't" is "not". By default, a token's norm equals its lowercase text. If the lowercase spelling of a word exists, norms should always be in lowercase.

spaCy usually tries to normalise words with different spellings to a single, common spelling. This has no effect on any other token attributes, or tokenization in general, but it ensures that equivalent tokens receive similar representations. This can improve the model's predictions on words that weren't common in the training data, but are equivalent to other words – for example, "realize" and "realise", or "thx" and "thanks".

Similarly, spaCy also includes global base norms for normalising different styles of quotation marks and currency symbols. Even though $ and € are very different, spaCy normalises them both to $. This way, they'll always be seen as similar, no matter how common they were in the training data.

Norm exceptions can be provided as a simple dictionary. For more examples, see the English norm_exceptions.py.

Example

NORM_EXCEPTIONS = {
    "cos": "because",
    "fav": "favorite",
    "accessorise": "accessorize",
    "accessorised": "accessorized"
}

To add the custom norm exceptions lookup table, you can use the add_lookups() helper function. It takes the default attribute getter function as its first argument, plus a variable list of dictionaries. If a string's norm is found in one of the dictionaries, that value is used – otherwise, the default function is called and the token is assigned its default norm.

lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
                                     NORM_EXCEPTIONS, BASE_NORMS)

The order of the dictionaries is also the lookup order – so if your language's norm exceptions overwrite any of the global exceptions, they should be added first. Also note that the tokenizer exceptions will always have priority over the attribute getters.

Lexical attributes

spaCy provides a range of Token attributes that return useful information on that token – for example, whether it's uppercase or lowercase, a left or right punctuation mark, or whether it resembles a number or email address. Most of these functions, like is_lower or like_url, should be language-independent. Others, like like_num (which includes both digits and number words), require some customisation.

Here's an example from the English lex_attrs.py:

lex_attrs.py

_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
              'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
              'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
              'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
              'hundred', 'thousand', 'million', 'billion', 'trillion',
              'quadrillion', 'gajillion', 'bazillion']


def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    return False


LEX_ATTRS = {
    LIKE_NUM: like_num
}

When the language's defaults update the default lexical attributes with a custom LEX_ATTRS dictionary via lex_attr_getters.update(LEX_ATTRS), only the new custom functions are overwritten – all other default attribute getters are kept.

Syntax iterators

Syntax iterators are functions that compute views of a Doc object based on its syntax. At the moment, this data is only used for extracting noun chunks, which are available as the Doc.noun_chunks property. Because base noun phrases work differently across languages, the rules to compute them are part of the individual language's data. If a language does not include a noun chunks iterator, the property won't be available. For examples, see the existing syntax iterators:

English (en): lang/en/syntax_iterators.py
German (de): lang/de/syntax_iterators.py
French (fr): lang/fr/syntax_iterators.py
Spanish (es): lang/es/syntax_iterators.py
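
To illustrate the expected interface, here's a minimal, hypothetical sketch of such a module. It follows the convention used by the existing iterators – the function receives a Doc or Span and yields (start, end, label) triples of token indices, and is registered under the 'noun_chunks' key – but the dependency labels below are a simplified assumption; the real English and German rules are considerably more involved and also guard against overlapping spans.

syntax_iterators.py (simplified sketch)

from ...symbols import NOUN, PROPN, PRON


def noun_chunks(obj):
    # works for both Doc and Span objects
    doc = obj.doc
    np_label = doc.vocab.strings.add('NP')
    # simplified set of dependency labels that can head a noun chunk (assumption)
    np_deps = set(doc.vocab.strings.add(label)
                  for label in ('nsubj', 'dobj', 'pobj', 'ROOT'))
    for word in obj:
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
            # yield the span from the left edge of the head's subtree to the head
            yield word.left_edge.i, word.i + 1, np_label


SYNTAX_ITERATORS = {
    'noun_chunks': noun_chunks
}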

Lemmatizer

As of v2.0, spaCy supports simple lookup-based lemmatization. This is usually the quickest and easiest way to get started. The data is stored in a dictionary mapping a string to its lemma. To determine a token's lemma, spaCy simply looks it up in the table. Here's an example from the Spanish language data:

lang/es/lemmatizer.py (excerpt)

LOOKUP = {
    "aba": "abar",
    "ababa": "abar",
    "ababais": "abar",
    "ababan": "abar",
    "ababanes": "ababán",
    "ababas": "abar",
    "ababoles": "ababol",
    "ababábites": "ababábite"
}

To provide a lookup lemmatizer for your language, import the lookup table and add it to the Language class as lemma_lookup:

lemma_lookup = dict(LOOKUP)
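
Following the class template shown earlier, this assignment would typically live on the Defaults class – a sketch using the hypothetical Xxxxx placeholder names from the template:

from ...language import Language
from .lemmatizer import LOOKUP


class XxxxxDefaults(Language.Defaults):
    # ... lex_attr_getters, tokenizer_exceptions, stop_words as in the template above ...
    lemma_lookup = dict(LOOKUP)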

Tag map

Most treebanks define a custom part-of-speech tag scheme, striking a balance between level of detail and ease of prediction. While it's useful to have custom tagging schemes, it's also useful to have a common scheme, to which the more specific tags can be related. The tagger can learn a tag scheme with any arbitrary symbols. However, you need to define how those symbols map down to the Universal Dependencies tag set. This is done by providing a tag map.

The keys of the tag map should be strings in your tag set. The values should be dictionaries, each with a POS entry whose value is one of the Universal Dependencies tags. Optionally, you can also include morphological features or other token attributes in the tag map. This allows you to do simple rule-based morphological analysis.

Example

from ..symbols import POS, NOUN, VERB, DET

TAG_MAP = {
    "NNS": {POS: NOUN, "Number": "plur"},
    "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT": {POS: DET}
}

Morph rules

The morphology rules let you set token attributes such as lemmas, keyed by the extended part-of-speech tag and token text. The morphological features and their possible values are language-specific and based on the Universal Dependencies scheme.

Example

from ..symbols import LEMMA

MORPH_RULES = {
    "VBZ": {
        "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
        "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
        "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
    }
}

In the example of "am", the attributes look like this:

LEMMA: "be" – Base form, e.g. "to be".
"VerbForm": "Fin" – Finite verb. Finite verbs have a subject and can be the root of an independent clause – "I am." is a valid, complete sentence.
"Person": "One" – First person, i.e. "I am".
"Tense": "Pres" – Present tense, i.e. actions that are happening right now or actions that usually happen.
"Mood": "Ind" – Indicative, i.e. something happens, has happened or will happen (as opposed to imperative or conditional).

Testing the new language

Before using the new language or submitting a pull request to spaCy, you should make sure it works as expected. This is especially important if you've added custom regular expressions for token matching or punctuation – you don't want to be causing regressions.

The easiest way to test your new tokenizer is to run the language-independent "tokenizer sanity" tests located in tests/tokenizer. This will test for basic behaviours like punctuation splitting, URL matching and correct handling of whitespace. In the conftest.py, add the new language ID to the list of _languages:

_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'he', 'hu', 'it', 'nb',
              'nl', 'pl', 'pt', 'sv', 'xx'] # new language here

The language will now be included in the tokenizer test fixture, which is used by the basic tokenizer tests. If you want to add your own tests that should be run over all languages, you can use this fixture as an argument of your test function.
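
For example, a language-independent test could look something like the sketch below, which assumes the shared fixture defined in the conftest.py is named tokenizer:

import pytest


@pytest.mark.parametrize('text', ["Hello, world!"])
def test_tokenizer_keeps_text(tokenizer, text):
    tokens = tokenizer(text)
    assert len(tokens) > 0
    # tokenization is non-destructive: the token texts plus trailing
    # whitespace should restore the original input
    assert ''.join(t.text_with_ws for t in tokens) == text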

Writing language-specific tests

It's recommended to always add at least some tests with examples specific to the language. Language tests should be located in tests/lang in a directory named after the language ID. You'll also need to create a fixture for your tokenizer in the conftest.py. Always use the get_lang_class() helper function within the fixture, instead of importing the class at the top of the file. This will load the language data only when it's needed. (Otherwise, all data would be loaded every time you run a test.)

@pytest.fixture
def en_tokenizer():
    return util.get_lang_class('en').Defaults.create_tokenizer()

When adding test cases, always parametrize them – this will make it easier for others to add more test cases without having to modify the test itself. You can also add parameter tuples, for example, a test sentence and its expected length, or a list of expected tokens. Here's an example of an English tokenizer test for combinations of punctuation and abbreviations:

Example test

@pytest.mark.parametrize('text,length', [
    ("The U.S. Army likes Shock and Awe.", 8),
    ("U.N. regulations are not a part of their concern.", 10),
    ("“Isn't it?”", 6)])
def test_en_tokenizer_handles_punct_abbrev(en_tokenizer, text, length):
    tokens = en_tokenizer(text)
    assert len(tokens) == length

Training a language model

spaCy expects that common words will be cached in a Vocab instance. The vocabulary caches lexical features, and makes it easy to use information from unlabelled text samples in your models. Specifically, you'll usually want to collect word frequencies, and train word vectors. To generate the word frequencies from a large, raw corpus, you can use the word_freqs.py script from the spaCy developer resources.

Note that your corpus should not be preprocessed (for example, punctuation should be left in). The word frequencies should be generated as a tab-separated file with three columns:

  1. The number of times the word occurred in your language sample.
  2. The number of distinct documents the word occurred in.
  3. The word itself.

es_word_freqs.txt

6361109   111   Aunque
23598543  111   aunque
10097056  111   claro
193454    111   aro
7711123   111   viene
12812323  111   mal
23414636  111   momento
2014580   111   felicidad
233865    111   repleto
15527     111   eto
235565    111   deliciosos
17259079  111   buena
71155     111   Anímate
37705     111   anímate
33155     111   cuéntanos
2389171   111   cuál
961576    111   típico

You should make sure you use the spaCy tokenizer for your language to segment the text for your word frequencies. This will ensure that the frequencies refer to the same segmentation standards you'll be using at run-time. For instance, spaCy's English tokenizer segments "can't" into two tokens. If we segmented the text by whitespace to produce the frequency counts, we would end up with incorrect frequency counts for the tokens "ca" and "n't".
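
The word_freqs.py script takes care of this at scale, but the basic idea can be sketched as follows. This is a simplified illustration rather than the actual script – the corpus file name, its one-document-per-line layout and the language code are assumptions:

from collections import Counter
import spacy

nlp = spacy.blank('es')  # loads only the language data and its tokenizer

word_counts = Counter()
doc_counts = Counter()

with open('corpus.txt', encoding='utf8') as f:
    for line in f:  # one document per line (assumption)
        tokens = [t.text for t in nlp.tokenizer(line.strip())]
        word_counts.update(tokens)
        doc_counts.update(set(tokens))

with open('es_word_freqs.txt', 'w', encoding='utf8') as f:
    for word, freq in word_counts.items():
        # total frequency, document frequency and the word itself, tab-separated
        f.write('%d\t%d\t%s\n' % (freq, doc_counts[word], word))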

Training the word vectors

Word2vec and related algorithms let you train useful word similarity models from unlabelled text. This is a key part of using deep learning for NLP with limited labelled data. The vectors are also useful by themselves – they power the .similarity() methods in spaCy. For best results, you should pre-process the text with spaCy before training the Word2vec model. This ensures your tokenization will match. You can use our word vectors training script, which pre-processes the text with your language-specific tokenizer and trains the model using Gensim. The vectors.bin file should consist of one word and vector per line.
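
A rough sketch of that workflow is shown below. It is not the actual training script: the corpus file name and layout are assumptions, the corpus is naively held in memory, and the keyword arguments follow older Gensim releases (in Gensim 4.x, size is called vector_size):

import spacy
from gensim.models import Word2Vec

nlp = spacy.blank('es')  # tokenizer only – must match the run-time tokenization

# pre-process the raw text with the spaCy tokenizer, one sentence per line (assumption)
with open('corpus.txt', encoding='utf8') as f:
    sentences = [[t.text for t in nlp.tokenizer(line.strip())] for line in f]

model = Word2Vec(sentences, size=300, window=5, min_count=10, workers=4)
# write one word and vector per line, as described above
model.wv.save_word2vec_format('vectors.bin', binary=False)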

Training the tagger and parser

You can now train the model using a corpus for your language annotated with Universal Dependencies. If your corpus uses the CoNLL-U format, i.e. files with the extension .conllu, you can use the convert command to convert it to spaCy's JSON format for training. Once you have your UD corpus transformed into JSON, you can train your model using spaCy's train command.