What's new in v2.0

We're very excited to finally introduce spaCy v2.0! On this page, you'll find a summary of the new features and information on the backwards incompatibilities, including a handy overview of what's been renamed or deprecated. To help you make the most of v2.0, we also re-wrote almost all of the usage guides and API docs, and added more real-world examples. If you're new to spaCy, or just want to brush up on some NLP basics and the details of the library, check out the spaCy 101 guide that explains the most important concepts with examples and illustrations.

Summary

This release features entirely new deep learning-powered models for spaCy's tagger, parser and entity recognizer. The new models are 20x smaller than the linear models that have powered spaCy until now: from 300 MB to only 15 MB.

We've also made several usability improvements that are particularly helpful for production deployments. spaCy v2 now fully supports the Pickle protocol, making it easy to use spaCy with Apache Spark. The string-to-integer mapping is no longer stateful, making it easy to reconcile annotations made in different processes. Models are smaller and use less memory, and the APIs for serialization are now much more consistent.

The main usability improvements you'll notice in spaCy v2.0 are around defining, training and loading your own models and components. The new neural network models make it much easier to train a model from scratch, or update an existing model with a few examples. In v1.x, the statistical models depended on the state of the Vocab. If you taught the model a new word, you would have to save and load a lot of data — otherwise the model wouldn't correctly recall the features of your new example. That's no longer the case.

Due to some clever use of hashing, the statistical models never change size, even as they learn new vocabulary items. The whole pipeline is also now fully differentiable. Even if you don't have explicitly annotated data, you can update spaCy using all the latest deep learning tricks like adversarial training, noise contrastive estimation or reinforcement learning.

New features

This section contains an overview of the most important new features and improvements. The API docs include additional deprecation notes. New methods and functions that were introduced in this version are marked with a tag.

Improved processing pipelines

It's now much easier to customise the pipeline with your own components: functions that receive a Doc object, modify it and return it. If your component is stateful, you can define and register a factory which receives the shared Vocab object and returns a component. spaCy's default components can be added to your pipeline by using their string IDs. This way, you won't have to worry about finding and implementing them – simply add "tagger" to the pipeline, and spaCy will know what to do.

[Pipeline diagram: Text → nlp (tokenizer → tensorizer → tagger → parser → ner) → Doc]
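
As an illustration, here's a minimal sketch of a stateless custom component (the function name and message are made up for this example; nlp.add_pipe() with a plain function is the v2.x way to extend the pipeline):

import spacy

def print_info(doc):
    # a custom component: receives the Doc, can modify it, and must return it
    print("This doc has {} tokens.".format(len(doc)))
    return doc

nlp = spacy.load('en')
# in v2.x, components are added by passing the function itself
nlp.add_pipe(print_info, name='print_info', last=True)
doc = nlp(u"Custom pipeline components are now much easier to add.")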

Text classification

spaCy v2.0 lets you add text categorization models to spaCy pipelines. The model supports classification with multiple, non-mutually exclusive labels, meaning several labels can apply to the same document at once. You can change the model architecture rather easily, but by default, the TextCategorizer class uses a convolutional neural network to assign position-sensitive vectors to each word in the document.
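
As a rough sketch, a text categorizer can be added like any other pipeline component – the label name below is a placeholder, and the component still needs to be trained before its predictions are meaningful:

import spacy

nlp = spacy.load('en')
textcat = nlp.create_pipe('textcat')   # built-in TextCategorizer
nlp.add_pipe(textcat, last=True)
textcat.add_label('POSITIVE')          # placeholder label for this sketch

# after training/updating the component, predictions appear in doc.cats
doc = nlp(u"This is really helpful, thanks!")
print(doc.cats)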

Hash values instead of integer IDs

The StringStore now resolves all strings to hash values instead of integer IDs. This means that the string-to-int mapping no longer depends on the vocabulary state, making a lot of workflows much simpler, especially during training. Unlike integer IDs in spaCy v1.x, hash values will always match – even across models. Strings can now be added explicitly using the new StringStore.add method. A token's hash is available via token.orth.
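
For example, using the hash value shown in the migration section below:

doc = nlp(u'I love coffee')
assert doc.vocab.strings[u'coffee'] == 3197928453018144401   # hash is stable across models
assert doc.vocab.strings[3197928453018144401] == u'coffee'   # and can be resolved back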

Saving, loading and serialization

spaCy's serialization API has been made consistent across classes and objects. All container classes, i.e. Language, Doc, Vocab and StringStore, now have to_bytes(), from_bytes(), to_disk() and from_disk() methods and support the Pickle protocol.
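
Here's a minimal sketch of the new API, assuming an existing nlp object and a processed doc (the /tmp path is a placeholder):

nlp.to_disk('/tmp/my_model')          # whole pipeline to a directory
vocab_bytes = nlp.vocab.to_bytes()    # any container to a byte string
doc_bytes = doc.to_bytes()

from spacy.vocab import Vocab
from spacy.tokens import Doc
restored = Doc(Vocab().from_bytes(vocab_bytes)).from_bytes(doc_bytes)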

The improved spacy.load makes loading models easier and more transparent. You can load a model by supplying its shortcut link, the name of an installed model package or a path. The Language class to initialise will be determined based on the model's settings. For a blank language, you can import the class directly, e.g. from spacy.lang.en import English.
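
The different ways of loading now look like this (the package and path names are placeholders):

import spacy
from spacy.lang.en import English

nlp = spacy.load('en')                # shortcut link
nlp = spacy.load('en_core_web_sm')    # installed model package name
nlp = spacy.load('/path/to/model')    # path to model data
nlp = English()                       # blank Language class, no model data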

displaCy visualizer with Jupyter support

Our popular dependency and named entity visualizers are now an official part of the spaCy library! displaCy can run a simple web server, or generate raw HTML markup or SVG files to be exported. You can pass in one or more docs, and customise the style. displaCy also auto-detects whether you're running Jupyter and will render the visualizations in your notebook.
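
For example (the sentence is arbitrary):

from spacy import displacy

doc = nlp(u'Google was founded in September 1998 in California.')
displacy.serve(doc, style='dep')                      # run a simple web server
html = displacy.render(doc, style='ent', page=True)   # or get raw HTML markup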

Improved language data and lazy loading

Language-specific data now lives in its own submodule, spacy.lang. Languages are lazy-loaded, i.e. only loaded when you import a Language class, or load a model that initialises one. This allows languages to contain more custom data, e.g. lemmatizer lookup tables, or complex regular expressions. The language data has also been tidied up and simplified. spaCy now also supports simple lookup-based lemmatization.
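
So importing a Language class directly only pulls in that language's data, for example:

from spacy.lang.en import English   # only the English data is loaded here
from spacy.lang.de import German

nlp_en = English()
nlp_de = German()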

Revised matcher API

Patterns can now be added to the matcher by calling matcher.add() with a match ID, an optional callback function to be invoked on each match, and one or more patterns. This allows you to write powerful, pattern-specific logic using only one matcher. For example, you might only want to merge some entity types, and set custom flags for other matched patterns.
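
A minimal sketch, using a made-up pattern ID (passing None means no callback):

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# one call adds the ID, an optional on_match callback and any number of patterns
matcher.add('HelloWorld', None,
            [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
            [{'LOWER': 'hello'}, {'LOWER': 'world'}])
matches = matcher(nlp(u'Hello, world! Hello world!'))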

Neural network models for English, German, French, Spanish and multi-language NER

spaCy v2.0 comes with new and improved neural network models for English, German, French and Spanish, as well as a multi-language named entity recognition model trained on Wikipedia. GPU usage is now supported via Chainer's CuPy module.

Backwards incompatibilities

Old                                                   New
spacy.en, spacy.xx                                    spacy.lang.en, spacy.lang.xx
orth                                                  lang.xx.lex_attrs
syntax.iterators                                      lang.xx.syntax_iterators
Language.save_to_directory                            Language.to_disk
Language.create_make_doc                              Language.tokenizer
Vocab.load, Vocab.load_lexemes                        Vocab.from_disk, Vocab.from_bytes
Vocab.dump                                            Vocab.to_disk, Vocab.to_bytes
Vocab.load_vectors, Vocab.load_vectors_from_bin_loc   Vectors.from_disk, Vectors.from_bytes
Vocab.dump_vectors                                    Vectors.to_disk, Vectors.to_bytes
StringStore.load                                      StringStore.from_disk, StringStore.from_bytes
StringStore.dump                                      StringStore.to_disk, StringStore.to_bytes
Tokenizer.load                                        Tokenizer.from_disk, Tokenizer.from_bytes
Tagger.load                                           Tagger.from_disk, Tagger.from_bytes
DependencyParser.load                                 DependencyParser.from_disk, DependencyParser.from_bytes
EntityRecognizer.load                                 EntityRecognizer.from_disk, EntityRecognizer.from_bytes
Matcher.load                                          -
Matcher.add_pattern, Matcher.add_entity               Matcher.add
Matcher.get_entity                                    Matcher.get
Matcher.has_entity                                    Matcher.__contains__
Doc.read_bytes                                        Binder
Token.is_ancestor_of                                  Token.is_ancestor
cli.model                                             -

Migrating from spaCy 1.x

Even though we've made so many architectural changes to the library, we've tried to keep breaking changes to a minimum. A lot of projects follow the philosophy that if you're going to break anything, you may as well break everything. We think migration is easier if there's a logic to what has changed.

We've therefore followed a policy of avoiding breaking changes to the Doc, Span and Token objects. This way, you can focus on only migrating the code that does training, loading and serialization — in other words, code that works with the nlp object directly. Code that uses the annotations should continue to work.

Saving, loading and serialization

Double-check all calls to spacy.load() and make sure they don't use the path keyword argument. If you're only loading in binary data and not a model package that can construct its own Language class and pipeline, you should now use the Language.from_disk() method.

# v2.x
nlp = spacy.load('/model')
nlp = English().from_disk('/model/data')
# v1.x
nlp = spacy.load('en', path='/model')

Review all other code that writes state to disk or bytes. All containers now share the same, consistent API for saving and loading. Replace saving with to_disk() or to_bytes(), and loading with from_disk() or from_bytes().

# v2.x
nlp.to_disk('/model')
nlp.vocab.to_disk('/vocab')
# v1.x
nlp.save_to_directory('/model')
nlp.vocab.dump('/vocab')

If you've trained your own models with v1.x, you'll need to retrain them with spaCy v2.0 – none of the previous models are compatible with the new version.

Strings and hash values

The change from integer IDs to hash values may not actually affect your code very much. However, if you're adding strings to the vocab manually, you now need to call StringStore.add() explicitly. You can also now be sure that the string-to-hash mapping will always match across vocabularies.

# v2.x
nlp.vocab.strings.add(u'coffee')
nlp.vocab.strings[u'coffee']        # 3197928453018144401
other_nlp.vocab.strings[u'coffee']  # 3197928453018144401
# v1.x
nlp.vocab.strings[u'coffee']        # 3672
other_nlp.vocab.strings[u'coffee']  # 40259

Processing pipelines and language data

If you're importing language data or Language classes, make sure to change your import statements to import from spacy.lang. If you've added your own custom language, it needs to be moved to spacy/lang/xx and adjusted accordingly.

# v2.x
from spacy.lang.en import English
# v1.x
from spacy.en import English

If you've been using custom pipeline components, check out the new guide on processing pipelines. Appending functions to the pipeline still works – but you might be able to make this more convenient by registering "component factories". Components of the processing pipeline can now be disabled by passing a list of their names to the disable keyword argument on loading or processing.

# v2.x
nlp = spacy.load('en', disable=['tagger', 'ner'])
doc = nlp(u"I don't want parsed", disable=['parser'])
# v1.x
nlp = spacy.load('en', tagger=False, entity=False)
doc = nlp(u"I don't want parsed", parse=False)

Adding patterns and callbacks to the matcher

If you're using the matcher, you can now add patterns in one step. This should be easy to update – simply merge the ID, callback and patterns into one call to Matcher.add().

# v2.x
matcher.add('GoogleNow', merge_phrases,
            [{ORTH: 'Google'}, {ORTH: 'Now'}])
# v1.x
matcher.add_entity('GoogleNow', on_match=merge_phrases)
matcher.add_pattern('GoogleNow', [{ORTH: 'Google'}, {ORTH: 'Now'}])

If you've been using acceptor functions, you'll need to move this logic into the on_match callbacks. The callback function is invoked on every match and gives you access to the doc, the index of the current match and the list of all matches. This lets you accept or reject a match, and define the actions to be triggered.
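
For example, the merge_phrases callback referenced above might look like this – a minimal sketch, assuming the standard on_match signature and the v2.0 Span.merge() API:

def merge_phrases(matcher, doc, i, matches):
    # on_match callback: invoked once per match with the matcher, the doc,
    # the index of the current match and the list of all matches
    match_id, start, end = matches[i]
    span = doc[start:end]
    # acceptor-style logic now lives here, e.g. skip spans you don't want
    if span.root.ent_type_ == 'PERSON':
        return
    span.merge()  # merge the matched tokens into a single token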

Benchmarks

The evaluation was conducted on raw text with no gold standard information.

Model            Version   Type     UAS    LAS    NER F   POS    Words/sec
en_core_web_sm   2.0.0     neural   91.2   89.2   82.6    96.6   10,300
en_core_web_sm   1.2.0     linear   86.6   83.8   78.5    96.6   25,700
en_core_web_md   1.2.1     linear   90.6   88.5   81.4    96.7   18,800