
Tokenizer

Segment text, and create Doc objects with the discovered segment boundaries.

Tokenizer.__init__

Create a Tokenizer that produces Doc objects from unicode text.

| Name           | Type      | Description |
| -------------- | --------- | ----------- |
| vocab          | Vocab     | A storage container for lexical types. |
| rules          | dict      | Exceptions and special cases for the tokenizer. |
| prefix_search  | callable  | A function matching the signature of re.compile(string).search to match prefixes. |
| suffix_search  | callable  | A function matching the signature of re.compile(string).search to match suffixes. |
| infix_finditer | callable  | A function matching the signature of re.compile(string).finditer to find infixes. |
| token_match    | callable  | A boolean function matching strings to be recognised as tokens. |
| returns        | Tokenizer | The newly constructed object. |
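As a sketch of how these arguments fit together, the snippet below builds a tokenizer from hand-written regular expressions. The patterns and the bare Vocab() are illustrative assumptions; real language data ships much richer rule sets.

```python
import re

from spacy.tokenizer import Tokenizer
from spacy.vocab import Vocab

# Illustrative patterns only, not spaCy's actual defaults.
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')

tokenizer = Tokenizer(Vocab(),
                      rules={},                          # no special cases yet
                      prefix_search=prefix_re.search,    # applied at string start
                      suffix_search=suffix_re.search,    # applied at string end
                      infix_finditer=infix_re.finditer)  # applied inside strings

doc = tokenizer('("hello-world")')
print([token.text for token in doc])
```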

Tokenizer.__call__

Tokenize a string.

| Name    | Type    | Description |
| ------- | ------- | ----------- |
| string  | unicode | The string to tokenize. |
| returns | Doc     | A container for linguistic annotations. |
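A minimal usage sketch, assuming the en_core_web_sm model is installed and its tokenizer is called directly:

```python
import spacy

nlp = spacy.load("en_core_web_sm")     # assumes this model is installed
doc = nlp.tokenizer("This is a sentence.")

print(len(doc))                        # number of tokens
print([token.text for token in doc])   # ['This', 'is', 'a', 'sentence', '.']
```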

Tokenizer.pipe

Tokenize a stream of texts.

| Name       | Type | Description |
| ---------- | ---- | ----------- |
| texts      | -    | A sequence of unicode texts. |
| batch_size | int  | The number of texts to accumulate in an internal buffer. |
| n_threads  | int  | The number of threads to use, if the implementation supports multi-threading. The default tokenizer is single-threaded. |
| yields     | Doc  | A sequence of Doc objects, in order. |
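A sketch of streaming tokenization, under the same assumption that en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
texts = ["First document.", "Second document.", "Third document."]

# Docs are yielded in the same order as the input texts.
for doc in nlp.tokenizer.pipe(texts, batch_size=50):
    print([token.text for token in doc])
```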

Tokenizer.find_infix

Find internal split points of the string.

| Name    | Type    | Description |
| ------- | ------- | ----------- |
| string  | unicode | The string to split. |
| returns | list    | A list of re.MatchObject objects that have .start() and .end() methods, denoting the placement of internal segment separators, e.g. hyphens. |
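A sketch under the assumption of a hypothetical minimal tokenizer whose only infix pattern is the hyphen:

```python
import re

from spacy.tokenizer import Tokenizer
from spacy.vocab import Vocab

# Hypothetical tokenizer: the hyphen is the only infix rule.
infix_re = re.compile(r'''-''')
tokenizer = Tokenizer(Vocab(), infix_finditer=infix_re.finditer)

matches = tokenizer.find_infix("well-known")
print([(m.start(), m.end()) for m in matches])  # [(4, 5)]: the hyphen
```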

Tokenizer.find_prefix

Find the length of a prefix that should be segmented from the string, or None if no prefix rules match.

| Name    | Type       | Description |
| ------- | ---------- | ----------- |
| string  | unicode    | The string to segment. |
| returns | int / None | The length of the prefix if present, otherwise None. |
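A sketch, again with a hypothetical single-rule tokenizer:

```python
import re

from spacy.tokenizer import Tokenizer
from spacy.vocab import Vocab

# Hypothetical tokenizer: an opening quote or bracket is the only prefix rule.
prefix_re = re.compile(r'''^["'(\[]''')
tokenizer = Tokenizer(Vocab(), prefix_search=prefix_re.search)

print(tokenizer.find_prefix('"Hello'))  # 1: the quote is a one-character prefix
```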

Tokenizer.find_suffix

Find the length of a suffix that should be segmented from the string, or None if no suffix rules match.

| Name    | Type       | Description |
| ------- | ---------- | ----------- |
| string  | unicode    | The string to segment. |
| returns | int / None | The length of the suffix if present, otherwise None. |
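The mirror-image sketch for suffixes, using a hypothetical trailing-punctuation rule:

```python
import re

from spacy.tokenizer import Tokenizer
from spacy.vocab import Vocab

# Hypothetical tokenizer: trailing sentence punctuation is the only suffix rule.
suffix_re = re.compile(r'''[!?.]$''')
tokenizer = Tokenizer(Vocab(), suffix_search=suffix_re.search)

print(tokenizer.find_suffix("Hello!"))  # 1: the "!" is a one-character suffix
```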

Tokenizer.add_special_case

Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on adding languages for more details and examples.

| Name        | Type     | Description |
| ----------- | -------- | ----------- |
| string      | unicode  | The string to specially tokenize. |
| token_attrs | iterable | A sequence of dicts, where each dict describes a token and its attributes. The ORTH fields of the attributes must exactly match the string when they are concatenated. |
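A sketch adding the hypothetical special case "gimme" → "gim" + "me", which satisfies the ORTH concatenation requirement; it again assumes en_core_web_sm is installed:

```python
import spacy
from spacy.attrs import ORTH

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

# The ORTH values "gim" + "me" concatenate exactly to the string "gimme".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

doc = nlp.tokenizer("gimme that")
assert [token.text for token in doc] == ["gim", "me", "that"]
```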

Attributes

| Name           | Type  | Description |
| -------------- | ----- | ----------- |
| vocab          | Vocab | The vocab object of the parent Doc. |
| prefix_search  | -     | A function to find segment boundaries from the start of a string. Returns the length of the segment, or None. |
| suffix_search  | -     | A function to find segment boundaries from the end of a string. Returns the length of the segment, or None. |
| infix_finditer | -     | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of re.MatchObject objects. |
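A short sketch inspecting these attributes on a loaded pipeline's tokenizer, assuming en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed
tokenizer = nlp.tokenizer

assert tokenizer.vocab is nlp.vocab  # shared storage for lexical types
print(tokenizer.prefix_search)       # the configured prefix callable
print(tokenizer.suffix_search)       # the configured suffix callable
print(tokenizer.infix_finditer)      # the configured infix callable
```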