Token

An individual token, i.e. a word, punctuation symbol, whitespace, etc.

Token.__init__

Construct a Token object.

Name | Type | Description
vocab | Vocab | A storage container for lexical types.
doc | Doc | The parent document.
offset | int | The index of the token within the document.
returns | Token | The newly constructed object.

Token.__len__

The number of Unicode characters in the token, i.e. len(token.text).

Name | Type | Description
returns | int | The number of Unicode characters in the token.
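
To illustrate what counting Unicode characters means, here is a minimal sketch that uses a plain Python string as a stand-in for token.text. It does not use the library itself; it only shows that the character count can differ from the byte count.

```python
# Plain-Python illustration; `text` is a stand-in for token.text.
text = "caf\u00e9"  # "café", with a precomposed e-acute

n_chars = len(text)                  # number of Unicode code points
n_bytes = len(text.encode("utf-8"))  # number of bytes in the UTF-8 encoding

print(n_chars, n_bytes)
```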

Token.check_flag

Check the value of a boolean flag.

Name | Type | Description
flag_id | int | The attribute ID of the flag to check.
returns | bool | Whether the flag is set.
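
One common way to store a set of boolean flags is as bits of a single integer, indexed by flag ID. The sketch below assumes that layout for illustration only; the flag IDs and helper names are hypothetical and are not the library's own.

```python
# Hypothetical flag IDs, for illustration only.
IS_ALPHA = 0
IS_DIGIT = 1
IS_TITLE = 2


def set_flag(flags: int, flag_id: int) -> int:
    """Return a copy of `flags` with the bit for `flag_id` switched on."""
    return flags | (1 << flag_id)


def check_flag(flags: int, flag_id: int) -> bool:
    """Return True if the bit for `flag_id` is set in `flags`."""
    return bool(flags & (1 << flag_id))


flags = 0
flags = set_flag(flags, IS_ALPHA)
flags = set_flag(flags, IS_TITLE)

print(check_flag(flags, IS_ALPHA), check_flag(flags, IS_DIGIT))
```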

Token.similarity

Compute a semantic similarity estimate. Defaults to cosine over vectors.

Name | Type | Description
other | - | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects.
returns | float | A scalar similarity score. Higher is more similar.
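
Cosine similarity over two vectors can be sketched in plain Python as follows. This is an illustration of the measure only, not the library's implementation; the function name is hypothetical, and returning 0.0 for a zero vector is an assumption made here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a missing (all-zero) vector as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0, which matches "higher is more similar".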

Token.nbor

Get a neighboring token.

Name | Type | Description
i | int | The relative position of the token to get. Defaults to 1.
returns | Token | The token at position self.doc[self.i+i].
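
The relative lookup self.doc[self.i+i] can be sketched with a plain list standing in for the document. The helper below is hypothetical, and the explicit bounds check is an assumption added so that a negative result does not silently wrap around, as Python list indexing otherwise would.

```python
def nbor(tokens, index, i=1):
    """Return the item `i` positions away from `index` in `tokens`."""
    target = index + i
    if target < 0 or target >= len(tokens):
        # Assumption: out-of-range neighbors are an error, not a wrap-around.
        raise IndexError(f"no neighbor at relative position {i}")
    return tokens[target]


tokens = ["Give", "it", "back", "!"]

print(nbor(tokens, 1))      # next item after "it"
print(nbor(tokens, 1, -1))  # previous item before "it"
```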

Token.is_ancestor

Check whether this token is a parent, grandparent, etc. of another in the dependency tree.

Name | Type | Description
descendant | Token | Another token.
returns | bool | Whether this token is the ancestor of the descendant.

Token.ancestors

A sequence of this token's syntactic ancestors, i.e. its parent, grandparent, etc. in the dependency tree.

Name | Type | Description
yields | Token | A sequence of ancestor tokens such that ancestor.is_ancestor(self).
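
The relationship between ancestors and is_ancestor can be sketched with a dependency tree stored as a list of head indices. This is a plain-Python illustration under assumptions, not the library's code: the function signatures, the example parse, and the convention that the root is its own head are all chosen here for the sketch.

```python
# heads[k] is the index of token k's syntactic parent.
# Assumed convention for this sketch: the root points to itself.
words = ["I", "like", "green", "apples"]
heads = [1, 1, 3, 1]  # "like" is the root; "green" attaches to "apples"


def ancestors(heads, k):
    """Yield the indices of k's parent, grandparent, etc., up to the root."""
    while heads[k] != k:
        k = heads[k]
        yield k


def is_ancestor(heads, candidate, descendant):
    """True if `candidate` lies on the path from `descendant` to the root."""
    return candidate in ancestors(heads, descendant)


print([words[a] for a in ancestors(heads, 2)])
print(is_ancestor(heads, 1, 2), is_ancestor(heads, 2, 1))
```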

Token.conjuncts

A sequence of coordinated tokens, including the token itself.

Name | Type | Description
yields | Token | A coordinated token.

Token.children

A sequence of the token's immediate syntactic children.

Name | Type | Description
yields | Token | A child token such that child.head == self.

Token.subtree

A sequence of all the token's syntactic descendants.

Name | Type | Description
yields | Token | A descendant token such that self.is_ancestor(descendant).
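
The difference between immediate children and all descendants can be sketched over a list of head indices. As before, this is a plain-Python illustration rather than the library's implementation: the helper names and the root-points-to-itself convention are assumptions, and the descendants helper follows the description above by excluding the token itself.

```python
# heads[k] is the index of token k's syntactic parent.
# Assumed convention for this sketch: the root points to itself.
words = ["I", "like", "green", "apples"]
heads = [1, 1, 3, 1]


def children(heads, k):
    """Indices whose head is k (the root's self-loop is excluded)."""
    return [c for c, h in enumerate(heads) if h == k and c != k]


def descendants(heads, k):
    """All indices below k in the tree, returned in document order."""
    found = []
    for child in children(heads, k):
        found.append(child)
        found.extend(descendants(heads, child))
    return sorted(found)


print([words[c] for c in children(heads, 1)])
print([words[d] for d in descendants(heads, 1)])
```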

Token.has_vector

A boolean value indicating whether a word vector is associated with the token.

Name | Type | Description
returns | bool | Whether the token has vector data attached.

Token.vector

A real-valued meaning representation.

Name | Type | Description
returns | numpy.ndarray[ndim=1, dtype='float32'] | A 1D numpy array representing the token's semantics.

Token.vector_norm

The L2 norm of the token's vector representation.

Name | Type | Description
returns | float | The L2 norm of the vector representation.
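
For reference, the L2 norm of a vector is the square root of the sum of its squared components. A minimal plain-Python sketch, with a hypothetical helper name and a short list standing in for the token's vector:

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector: sqrt(x1^2 + x2^2 + ...)."""
    return math.sqrt(sum(x * x for x in vector))


vector = [3.0, 4.0]  # stand-in for a token's vector
norm = l2_norm(vector)

print(norm)
```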

Attributes

Name | Type | Description
text | unicode | Verbatim text content.
text_with_ws | unicode | Text content, with trailing space character if present.
whitespace | int | Trailing space character if present.
whitespace_ | unicode | Trailing space character if present.
vocab | Vocab | The vocab object of the parent Doc.
doc | Doc | The parent document.
head | Token | The syntactic parent, or "governor", of this token.
left_edge | Token | The leftmost token of this token's syntactic descendants.
right_edge | Token | The rightmost token of this token's syntactic descendants.
i | int | The index of the token within the parent document.
ent_type | int | Named entity type.
ent_type_ | unicode | Named entity type.
ent_iob | int | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.
ent_iob_ | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.
ent_id | int | ID of the entity the token is an instance of, if any. Usually assigned by patterns in the Matcher.
ent_id_ | unicode | ID of the entity the token is an instance of, if any. Usually assigned by patterns in the Matcher.
lemma | int | Base form of the token, with no inflectional suffixes.
lemma_ | unicode | Base form of the token, with no inflectional suffixes.
lower | int | Lowercase form of the token.
lower_ | unicode | Lowercase form of the token.
shape | int | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd".
shape_ | unicode | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd".
prefix | int | Hash value of a length-N substring from the start of the token. Defaults to N=1.
prefix_ | unicode | A length-N substring from the start of the token. Defaults to N=1.
suffix | int | Hash value of a length-N substring from the end of the token. Defaults to N=3.
suffix_ | unicode | A length-N substring from the end of the token. Defaults to N=3.
is_alpha | bool | Does the token consist of alphabetic characters? Equivalent to token.text.isalpha().
is_ascii | bool | Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text).
is_digit | bool | Does the token consist of digits? Equivalent to token.text.isdigit().
is_lower | bool | Is the token in lowercase? Equivalent to token.text.islower().
is_title | bool | Is the token in titlecase? Equivalent to token.text.istitle().
is_punct | bool | Is the token punctuation?
is_space | bool | Does the token consist of whitespace characters? Equivalent to token.text.isspace().
like_url | bool | Does the token resemble a URL?
like_num | bool | Does the token represent a number? For example, "10.9", "10" or "ten".
like_email | bool | Does the token resemble an email address?
is_oov | bool | Is the token out-of-vocabulary?
is_stop | bool | Is the token part of a "stop list"?
pos | int | Coarse-grained part-of-speech.
pos_ | unicode | Coarse-grained part-of-speech.
tag | int | Fine-grained part-of-speech.
tag_ | unicode | Fine-grained part-of-speech.
dep | int | Syntactic dependency relation.
dep_ | unicode | Syntactic dependency relation.
lang | int | Language of the parent document's vocabulary.
lang_ | unicode | Language of the parent document's vocabulary.
prob | float | Smoothed log probability estimate of the token's type.
idx | int | The character offset of the token within the parent document.
sentiment | float | A scalar value indicating the positivity or negativity of the token.
lex_id | int | ID of the token's lexical type.
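
Several of the string-derived attributes above can be approximated in plain Python. The sketch below is an illustration, not the library's implementation: the helper names are hypothetical, and the shape rule (uppercase letters become "X", lowercase letters "x", digits "d", other characters kept, runs capped at four) is an assumption chosen to reproduce the "Xxxx" and "dd" examples.

```python
def shape(text, max_run=4):
    """Map characters to X/x/d and cap runs of the same symbol at max_run."""
    out = []
    last, run = "", 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch
        run = run + 1 if sym == last else 1
        last = sym
        if run <= max_run:
            out.append(sym)
    return "".join(out)


def prefix(text, n=1):
    """Length-n substring from the start of the text."""
    return text[:n]


def suffix(text, n=3):
    """Length-n substring from the end of the text."""
    return text[-n:]


def is_ascii(text):
    """True if every character is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in text)


for word in ["Anna", "42", "caf\u00e9"]:
    print(word, shape(word), prefix(word), suffix(word), is_ascii(word))
```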