Word Vectors and Semantic Similarity

spaCy can compare two objects and predict how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest content to a user that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an existing one.

Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective – whether "dog" and "cat" are similar really depends on how you're looking at it. spaCy's similarity model usually assumes a pretty general-purpose definition of similarity.

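import spacy

# Assumption: any installed model that ships with word vectors will do here.
nlp = spacy.load('en_core_web_md')
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))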
        dog   cat   banana
dog     1.00  0.80  0.24
cat     0.80  1.00  0.28
banana  0.24  0.28  1.00

In this case, the model's predictions are pretty on point. A dog is very similar to a cat, whereas a banana is not very similar to either of them. Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).

Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec. Most of spaCy's default models come with 300-dimensional vectors that look like this:


array([2.02280000e-01, -7.66180009e-02, 3.70319992e-01, 3.28450017e-02, -4.19569999e-01, 7.20689967e-02, -3.74760002e-01, 5.74599989e-02, -1.24009997e-02, 5.29489994e-01, -5.23800015e-01, -1.97710007e-01, -3.41470003e-01, 5.33169985e-01, -2.53309999e-02, 1.73800007e-01, 1.67720005e-01, 8.39839995e-01, 5.51070012e-02, 1.05470002e-01, 3.78719985e-01, 2.42750004e-01, 1.47449998e-02, 5.59509993e-01, 1.25210002e-01, -6.75960004e-01, 3.58420014e-01, -4.00279984e-02, 9.59490016e-02, -5.06900012e-01, -8.53179991e-02, 1.79800004e-01, 3.38669986e-01, 1.32300004e-01, 3.10209990e-01, 2.18779996e-01, 1.68530002e-01, 1.98740005e-01, -5.73849976e-01, -1.06490001e-01, 2.66689986e-01, 1.28380001e-01, -1.28030002e-01, -1.32839993e-01, 1.26570001e-01, 8.67229998e-01, 9.67210010e-02, 4.83060002e-01, 2.12709993e-01, -5.49900010e-02, -8.24249983e-02, 2.24079996e-01, 2.39749998e-01, -6.22599982e-02, 6.21940017e-01, -5.98999977e-01, 4.32009995e-01, 2.81430006e-01, 3.38420011e-02, -4.88150001e-01, -2.13589996e-01, 2.74010003e-01, 2.40950003e-01, 4.59500015e-01, -1.86049998e-01, -1.04970002e+00, -9.73049998e-02, -1.89080000e-01, -7.09290028e-01, 4.01950002e-01, -1.87680006e-01, 5.16870022e-01, 1.25200003e-01, 8.41499984e-01, 1.20970003e-01, 8.82389992e-02, -2.91959997e-02, 1.21510006e-03, 5.68250008e-02, -2.74210006e-01, 2.55640000e-01, 6.97930008e-02, -2.22580001e-01, -3.60060006e-01, -2.24020004e-01, -5.36990017e-02, 1.20220006e+00, 5.45350015e-01, -5.79980016e-01, 1.09049998e-01, 4.21669990e-01, 2.06619993e-01, 1.29360005e-01, -4.14570011e-02, -6.67770028e-01, 4.04670000e-01, -1.52179999e-02, -2.76400000e-01, -1.56110004e-01, -7.91980028e-02, 4.00369987e-02, -1.29439995e-01, -2.40900001e-04, -2.67850012e-01, -3.81150007e-01, -9.72450018e-01, 3.17259997e-01, -4.39509988e-01, 4.19340014e-01, 1.83530003e-01, -1.52600005e-01, -1.08080000e-01, -1.03579998e+00, 7.62170032e-02, 1.65189996e-01, 2.65259994e-04, 1.66160002e-01, -1.52810007e-01, 1.81229994e-01, 7.02740014e-01, 5.79559989e-03, 
5.16639985e-02, -5.97449988e-02, -2.75510013e-01, -3.90489995e-01, 6.11319989e-02, 5.54300010e-01, -8.79969969e-02, -4.16810006e-01, 3.28260005e-01, -5.25489986e-01, -4.42880005e-01, 8.21829960e-03, 2.44859993e-01, -2.29819998e-01, -3.49810004e-01, 2.68940002e-01, 3.91660005e-01, -4.19039994e-01, 1.61909997e-01, -2.62630010e+00, 6.41340017e-01, 3.97430003e-01, -1.28680006e-01, -3.19460005e-01, -2.56330013e-01, -1.22199997e-01, 3.22750002e-01, -7.99330026e-02, -1.53479993e-01, 3.15050006e-01, 3.05909991e-01, 2.60120004e-01, 1.85530007e-01, -2.40429997e-01, 4.28860001e-02, 4.06219989e-01, -2.42559999e-01, 6.38700008e-01, 6.99829996e-01, -1.40430003e-01, 2.52090007e-01, 4.89840001e-01, -6.10670000e-02, -3.67659986e-01, -5.50890028e-01, -3.82649988e-01, -2.08430007e-01, 2.28320003e-01, 5.12179971e-01, 2.78679997e-01, 4.76520002e-01, 4.79510017e-02, -3.40079993e-01, -3.28729987e-01, -4.19669986e-01, -7.54989982e-02, -3.89539987e-01, -2.96219997e-02, -3.40700001e-01, 2.21699998e-01, -6.28560036e-02, -5.19029975e-01, -3.77739996e-01, -4.34770016e-03, -5.83010018e-01, -8.75459984e-02, -2.39289999e-01, -2.47109994e-01, -2.58870006e-01, -2.98940003e-01, 1.37150005e-01, 2.98919994e-02, 3.65439989e-02, -4.96650010e-01, -1.81600004e-01, 5.29389977e-01, 2.19919994e-01, -4.45140004e-01, 3.77979994e-01, -5.70620000e-01, -4.69460003e-02, 8.18059966e-02, 1.92789994e-02, 3.32459986e-01, -1.46200001e-01, 1.71560004e-01, 3.99809986e-01, 3.62170011e-01, 1.28160000e-01, 3.16439986e-01, 3.75690013e-01, -7.46899992e-02, -4.84800003e-02, -3.14009994e-01, -1.92860007e-01, -3.12940001e-01, -1.75529998e-02, -1.75139993e-01, -2.75870003e-02, -1.00000000e+00, 1.83870003e-01, 8.14339995e-01, -1.89129993e-01, 5.09989977e-01, -9.19600017e-03, -1.92950002e-03, 2.81890005e-01, 2.72470005e-02, 4.34089988e-01, -5.49669981e-01, -9.74259973e-02, -2.45399997e-01, -1.72030002e-01, -8.86500031e-02, -3.02980006e-01, -1.35910004e-01, -2.77649999e-01, 3.12860007e-03, 2.05559999e-01, -1.57720000e-01, 
-5.23079991e-01, -6.47010028e-01, -3.70139986e-01, 6.93930015e-02, 1.14009999e-01, 2.75940001e-01, -1.38750002e-01, -2.72680014e-01, 6.68910027e-01, -5.64539991e-02, 2.40170002e-01, -2.67300010e-01, 2.98599988e-01, 1.00830004e-01, 5.55920005e-01, 3.28489989e-01, 7.68579990e-02, 1.55279994e-01, 2.56359994e-01, -1.07720003e-01, -1.23590000e-01, 1.18270002e-01, -9.90289971e-02, -3.43279988e-01, 1.15019999e-01, -3.78080010e-01, -3.90120000e-02, -3.45930010e-01, -1.94040000e-01, -3.35799992e-01, -6.23340011e-02, 2.89189994e-01, 2.80319989e-01, -5.37410021e-01, 6.27939999e-01, 5.69549985e-02, 6.21469975e-01, -2.52819985e-01, 4.16700006e-01, -1.01079997e-02, -2.54339993e-01, 4.00029987e-01, 4.24320012e-01, 2.26720005e-01, 1.75530002e-01, 2.30489999e-01, 2.83230007e-01, 1.38820007e-01, 3.12180002e-03, 1.70570001e-01, 3.66849989e-01, 2.52470002e-03, -6.40089989e-01, -2.97650009e-01, 7.89430022e-01, 3.31680000e-01, -1.19659996e+00, -4.71559986e-02, 5.31750023e-01], dtype=float32)
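The comparison itself is, by default, a cosine similarity between two such vectors. The sketch below is a dependency-free illustration of that measure, not spaCy's own implementation; the function name `cosine_similarity` and the toy 3-dimensional vectors are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector has no direction, so report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration (real ones have 300 dimensions).
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
banana = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, cat) > cosine_similarity(dog, banana))  # True
```

Because the result is a ratio of floating point sums, comparing a vector with itself can come out marginally different from exactly 1.0.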

The .vector attribute will return an object's vector. Doc.vector and Span.vector will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalise vectors.

tokens = nlp(u'dog cat banana sasquatch')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
Text  Has vector  Vector norm  OOV

The words "dog", "cat" and "banana" are all pretty common in English, so they're part of the model's vocabulary, and come with a vector. The word "sasquatch" on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of 0, which means it's practically nonexistent.
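To make the averaging, the L2 norm and the all-zero vector concrete, here is a small dependency-free sketch. The helper names `average_vector`, `l2_norm` and `normalise` and the toy 2-dimensional values are assumptions for illustration; spaCy itself works on NumPy arrays.

```python
import math

def average_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def normalise(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = l2_norm(vector)
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

token_vectors = [[3.0, 0.0], [1.0, 4.0]]   # toy token vectors
print(average_vector(token_vectors))        # [2.0, 2.0]
print(l2_norm([3.0, 4.0]))                  # 5.0
print(l2_norm([0.0, 0.0]))                  # 0.0, as for an out-of-vocabulary word
```

A norm of zero is why an out-of-vocabulary vector cannot be normalised or meaningfully compared: it has no direction.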

If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger models instead of the default, smaller ones, which usually come with a clipped vocabulary.

Similarities in context

Aside from spaCy's built-in word vectors, which were trained on a lot of text with a wide vocabulary, the parsing, tagging and NER models also rely on vector representations of the meanings of words in context. As the first component of the processing pipeline, the tensorizer encodes a document's internal meaning representations as an array of floats, also called a tensor. This allows spaCy to make a reasonable guess at a word's meaning, based on its surrounding words. Even if a word hasn't been seen before, spaCy will know something about it. Because spaCy uses a 4-layer convolutional network, the tensors are sensitive to up to four words on either side of a word.
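The link between network depth and context width can be sketched with simple arithmetic. The snippet assumes that each convolutional layer looks at one neighbouring word on each side, which is an assumption about the architecture rather than something stated above; `receptive_field` and `context_indices` are hypothetical helper names.

```python
def receptive_field(n_layers, window=1):
    """Number of words on each side that can influence a position."""
    return n_layers * window

def context_indices(position, n_tokens, n_layers, window=1):
    """Indices of the tokens that can influence the token at `position`."""
    reach = receptive_field(n_layers, window)
    start = max(0, position - reach)
    end = min(n_tokens, position + reach + 1)
    return list(range(start, end))

print(receptive_field(4))          # 4
print(context_indices(5, 12, 4))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Under that assumption, every extra layer widens the context by one word per side, so four layers reach four words in each direction.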

For example, here are three sentences containing the out-of-vocabulary word "labrador" in different contexts.

doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"the labrador people live in canada.")

for doc in [doc1, doc2, doc3]:
    labrador = doc[1]
    dog = nlp(u"dog")
    print(labrador.similarity(dog))

Even though the model has never seen the word "labrador", it can make a fairly accurate prediction of its similarity to "dog" in different contexts.

The labrador barked.                 0.56
The labrador swam.                   0.48
the labrador people live in canada.  0.39

The same also works for whole documents. Here, the variance of the similarities is lower, as all words and their order are taken into account. However, the context-specific similarity is often still reflected pretty accurately.

doc1 = nlp(u"Paris is the largest city in France.")
doc2 = nlp(u"Vilnius is the capital of Lithuania.")
doc3 = nlp(u"An emu is a large bird.")

for doc in [doc1, doc2, doc3]:
    for other_doc in [doc1, doc2, doc3]:
        print(doc.similarity(other_doc))

Even though the sentences about Paris and Vilnius consist of different words and entities, they both describe the same concept and are seen as more similar than the sentence about emus. In this case, even a misspelled version of "Vilnius" would still produce very similar results.

                                      Paris is the largest city in France.  Vilnius is the capital of Lithuania.  An emu is a large bird.
Paris is the largest city in France.  1.00                                  0.85                                  0.65
Vilnius is the capital of Lithuania.  0.85                                  1.00                                  0.55
An emu is a large bird.               0.65                                  0.55                                  1.00

Sentences that consist of the same words in different order will likely be seen as very similar – but never identical.

docs = [nlp(u"dog bites man"), nlp(u"man bites dog"),
        nlp(u"man dog bites"), nlp(u"dog man bites")]

for doc in docs:
    for other_doc in docs:
        print(doc.similarity(other_doc))

Interestingly, "man bites dog" and "man dog bites" are seen as slightly more similar than "man bites dog" and "dog bites man". This may be a coincidence – or the result of "man" being interpreted as the subject of both sentences.

               dog bites man  man bites dog  man dog bites  dog man bites
dog bites man  1.00           0.90           0.89           0.92
man bites dog  0.90           1.00           0.93           0.90
man dog bites  0.89           0.93           1.00           0.92
dog man bites  0.92           0.90           0.92           1.00

Customising word vectors

By default, Token.vector returns the vector for its underlying Lexeme, while Doc.vector and Span.vector return an average of the vectors of their tokens. You can customise these behaviours by modifying the doc.user_hooks, doc.user_span_hooks and doc.user_token_hooks dictionaries.
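As a rough illustration of such a hook, the sketch below replaces the default average with an element-wise maximum. It assumes that the 'vector' key of doc.user_hooks is the one consulted for Doc.vector and that the hook is called with the Doc itself; `max_pool_vector` is a hypothetical name, and a real hook should return a NumPy array rather than the plain list used here to keep the example dependency-free.

```python
from types import SimpleNamespace

def max_pool_vector(doc):
    """Element-wise maximum over the vectors of all tokens in `doc`."""
    vectors = [token.vector for token in doc]
    return [max(column) for column in zip(*vectors)]

# With spaCy, the hook would be registered on a Doc, for example:
#     doc.user_hooks['vector'] = max_pool_vector
# after which doc.vector is expected to return max_pool_vector(doc).

# Stand-in tokens so the sketch runs without spaCy.
fake_doc = [SimpleNamespace(vector=[0.1, 0.9]),
            SimpleNamespace(vector=[0.5, 0.2])]
print(max_pool_vector(fake_doc))  # [0.5, 0.9]
```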

Adding vectors

The new Vectors class makes it easy to add your own vectors to spaCy. Just like the Vocab, it is initialised with a StringStore or a list of strings.

Adding vectors one-by-one

import numpy
from spacy.strings import StringStore
from spacy.vectors import Vectors

# Random 300-dimensional vectors, used here only as example data
vector_data = {'dog': numpy.random.uniform(-1, 1, (300,)),
               'cat': numpy.random.uniform(-1, 1, (300,)),
               'orange': numpy.random.uniform(-1, 1, (300,))}

vectors = Vectors(StringStore(), 300)
for word, vector in vector_data.items():
    vectors.add(word, vector)

You can also add the vector values directly on initialisation:

Adding vectors on initialisation

import numpy
from spacy.vectors import Vectors

# One row per word, 300 dimensions each
vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors([u'dog', u'cat', u'orange'], vector_table)

Loading GloVe vectors

spaCy comes with built-in support for loading GloVe vectors from a directory. The Vectors.from_glove method assumes a binary format, the vocab provided in a vocab.txt, and the naming scheme of vectors.{size}.[fd].bin. For example:

File name          Dimensions  Data type
vectors.300.d.bin  300         float64 (double)

from spacy.vectors import Vectors

vectors = Vectors([], 128)
vectors.from_glove('/path/to/vectors')  # placeholder path to the GloVe directory
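The naming scheme can be checked before loading with a few lines of plain Python. This is only a sketch of the convention described above, not part of spaCy: `parse_glove_filename` is a hypothetical helper, and reading the `f` marker as float32 is an assumption (the table only confirms `d` as float64).

```python
import re

_GLOVE_NAME = re.compile(r"^vectors\.(\d+)\.([fd])\.bin$")
_DTYPES = {"f": "float32", "d": "float64"}  # "f" as float32 is assumed

def parse_glove_filename(name):
    """Return (dimensions, dtype name) for a file named vectors.{size}.[fd].bin."""
    match = _GLOVE_NAME.match(name)
    if match is None:
        raise ValueError("unexpected file name: %r" % name)
    return int(match.group(1)), _DTYPES[match.group(2)]

print(parse_glove_filename("vectors.300.d.bin"))  # (300, 'float64')
```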

Loading other vectors

You can also choose to load in vectors from other sources, like the fastText vectors for 294 languages, trained on Wikipedia. After reading in the file, the vectors are added to the Vocab using the set_vector method.
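One way this could look is sketched below for a plain-text vectors file, assuming a header line with the row count and vector width followed by one `word v1 v2 ...` line per entry. The function `add_vectors_from_lines` and the `FakeVocab` stand-in are hypothetical; with spaCy you would pass `nlp.vocab` instead, and you would typically convert each vector to a NumPy array before calling set_vector.

```python
def add_vectors_from_lines(lines, vocab):
    """Read 'word v1 v2 ...' lines and register each vector on `vocab`."""
    lines = iter(lines)
    _n_rows, n_dims = (int(value) for value in next(lines).split())
    added = 0
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        # Split from the right so that words containing spaces stay intact.
        pieces = line.rsplit(" ", n_dims)
        word, values = pieces[0], [float(v) for v in pieces[1:]]
        if len(values) != n_dims:
            raise ValueError("wrong number of dimensions for %r" % word)
        vocab.set_vector(word, values)
        added += 1
    return added

class FakeVocab:
    """Minimal stand-in for spaCy's Vocab so the sketch runs on its own."""
    def __init__(self):
        self.vectors = {}

    def set_vector(self, word, vector):
        self.vectors[word] = vector

vocab = FakeVocab()
sample = ["2 3", "dog 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
print(add_vectors_from_lines(sample, vocab))  # 2
```

For a real file, the same function can be fed an open file handle, since it only iterates over lines.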

Storing vectors on a GPU

If you're using a GPU, it's much more efficient to keep the word vectors on the device. You can do that by setting the data attribute to a cupy.ndarray object if you're using spaCy or Chainer, or a torch.Tensor object if you're using PyTorch. The data object just needs to support __iter__ and __getitem__, so if you're using another library such as TensorFlow, you could also create a wrapper for your vectors data.

spaCy, Thinc or Chainer

import cupy.cuda
import numpy
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors([u'dog', u'cat', u'orange'], vector_table)
# Copy the vector table to the first GPU
with cupy.cuda.Device(0):
    vectors.data = cupy.asarray(vectors.data)


PyTorch

import numpy
import torch
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors([u'dog', u'cat', u'orange'], vector_table)
# Copy the vector table to the first GPU
vectors.data = torch.Tensor(vectors.data).cuda(0)
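For the wrapper idea mentioned above, a minimal sketch might look like this. It relies only on what the text states, namely that the data object must support __iter__ and __getitem__; the class name `VectorsDataWrapper` and the `to_row` callback are hypothetical, and whether a given spaCy version expects further attributes (such as a shape) is not verified here.

```python
class VectorsDataWrapper:
    """Expose rows of a backend-specific table via __iter__ and __getitem__."""

    def __init__(self, table, to_row=list):
        self._table = table      # e.g. a tensor that lives on the GPU
        self._to_row = to_row    # converts one backend row for the caller

    def __getitem__(self, index):
        return self._to_row(self._table[index])

    def __iter__(self):
        for index in range(len(self._table)):
            yield self[index]

    def __len__(self):
        return len(self._table)

# A nested tuple stands in for a real tensor so the sketch runs anywhere.
table = ((0.0, 1.0), (2.0, 3.0))
data = VectorsDataWrapper(table)
print(data[1])      # [2.0, 3.0]
print(list(data))   # [[0.0, 1.0], [2.0, 3.0]]
```

Keeping the conversion in a callback means the table can stay on the device, with rows converted only when they are requested.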