
Language
A text-processing pipeline.

Usually you'll load this once per process as nlp and pass the instance around your application. The Language class is created when you call spacy.load() and contains the shared vocabulary and language data, optional model data loaded from a model package or a path, and a processing pipeline containing components like the tagger or parser that are called on a document in order. You can also add your own processing pipeline components that take a Doc object, modify it and return it.

Language.__init__

Initialise a Language object.

Name | Type | Description
vocab | Vocab | A Vocab object. If True, a vocab is created via Language.Defaults.create_vocab.
make_doc | callable | A function that takes text and returns a Doc object. Usually a Tokenizer.
meta | dict | Custom meta data for the Language class. Is written to by models to add model meta data.
returns | Language | The newly constructed object.

Language.__call__

Apply the pipeline to some text. The text can span multiple sentences, and can contain arbitrary whitespace. Alignment into the original string is preserved.

Name | Type | Description
text | unicode | The text to be processed.
disable | list | Names of pipeline components to disable.
returns | Doc | A container for accessing the annotations.
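
The call semantics described above (create a Doc from the text, then run each enabled component on it in order) can be sketched in plain Python. This is a toy illustration only, not spaCy code: make_doc, tagger, parser and call_pipeline are hypothetical stand-ins, and the "doc" is an ordinary dict.

```python
# Toy stand-ins for illustration only; none of this is spaCy code.

def make_doc(text):
    # Hypothetical tokenizer: the "doc" is a dict holding whitespace tokens.
    return {"text": text, "tokens": text.split(), "annotations": []}

def tagger(doc):
    # A component takes a doc, modifies it and returns it.
    doc["annotations"].append("tagger")
    return doc

def parser(doc):
    doc["annotations"].append("parser")
    return doc

# The pipeline is an ordered list of (name, component) tuples.
pipeline = [("tagger", tagger), ("parser", parser)]

def call_pipeline(text, disable=()):
    doc = make_doc(text)
    for name, component in pipeline:
        if name in disable:
            continue  # disabled components are skipped for this call only
        doc = component(doc)
    return doc

full = call_pipeline("This is a sentence.")
partial = call_pipeline("This is a sentence.", disable=["parser"])
```

Here full carries annotations from both components, while partial only carries the tagger's. The original text stays available on the returned object in both cases.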

Language.pipe

Process texts as a stream, and yield Doc objects in order. Supports GIL-free multi-threading.

Name | Type | Description
texts | - | A sequence of unicode objects.
as_tuples | bool | If set to True, inputs should be a sequence of (text, context) tuples. Output will then be a sequence of (doc, context) tuples. Defaults to False.
n_threads | int | The number of worker threads to use. If -1, OpenMP will decide how many to use at run time. Default is 2.
batch_size | int | The number of texts to buffer.
disable | list | Names of pipeline components to disable.
yields | Doc | Documents in the order of the original text.
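
To make the as_tuples and batch_size behaviour concrete, here is a minimal sketch in plain Python. It is an assumption-laden toy, not the library's implementation: toy_process and toy_pipe are hypothetical names, and upper-casing a string stands in for running the pipeline on one text.

```python
import itertools

def toy_process(text):
    # Hypothetical stand-in for applying the pipeline to a single text.
    return text.upper()

def toy_pipe(texts, as_tuples=False, batch_size=2):
    if as_tuples:
        # Split the (text, context) stream in two without consuming it twice.
        stream_a, stream_b = itertools.tee(texts)
        only_texts = (text for text, _ in stream_a)
        contexts = (context for _, context in stream_b)
        processed = toy_pipe(only_texts, batch_size=batch_size)
        for doc, context in zip(processed, contexts):
            yield doc, context
        return
    iterator = iter(texts)
    while True:
        # Buffer up to batch_size texts, then yield results in input order.
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            break
        for text in batch:
            yield toy_process(text)

data = [
    ("first text", {"id": 1}),
    ("second text", {"id": 2}),
    ("third text", {"id": 3}),
]
results = list(toy_pipe(data, as_tuples=True))
```

Each result pairs a processed item with the context it arrived with, so metadata such as database IDs can travel alongside the texts through the stream.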

Language.update

Update the models in the pipeline.

Name | Type | Description
docs | iterable | A batch of Doc objects.
golds | iterable | A batch of GoldParse objects.
drop | float | The dropout rate.
sgd | callable | An optimizer.
returns | dict | Results from the update.

Language.begin_training

Allocate models, pre-process training data and acquire an optimizer.

Name | Type | Description
gold_tuples | iterable | Gold-standard training data.
**cfg | - | Config parameters.
returns | callable | An optimizer.

Language.use_params

Replace weights of models in the pipeline with those provided in the params dictionary. Can be used as a context manager, in which case the models revert to their original weights after the block.

Name | Type | Description
params | dict | A dictionary of parameters keyed by model ID.
**cfg | - | Config parameters.
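
The swap-and-restore pattern described above can be sketched with contextlib. This is a toy model of the idea under stated assumptions, not spaCy's implementation: ToyModel and this standalone use_params function are hypothetical, and the weights are plain lists keyed by a model ID.

```python
from contextlib import contextmanager

class ToyModel:
    # Hypothetical model: just an ID and a list of weights.
    def __init__(self, model_id, weights):
        self.id = model_id
        self.weights = weights

@contextmanager
def use_params(models, params):
    # Remember the current weights so they can be restored afterwards.
    originals = {model.id: model.weights for model in models}
    for model in models:
        if model.id in params:
            model.weights = params[model.id]
    try:
        yield
    finally:
        # Restore the original weights even if the block raised.
        for model in models:
            model.weights = originals[model.id]

tagger_model = ToyModel("tagger", [0.1, 0.2])
averaged = {"tagger": [0.15, 0.25]}

with use_params([tagger_model], averaged):
    inside = list(tagger_model.weights)
after = list(tagger_model.weights)
```

A typical reason to use this pattern is to evaluate or save a model with an alternative set of weights (for example averaged ones) without losing the weights being trained.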

Language.preprocess_gold

Can be called before training to pre-process gold data. By default, it handles nonprojectivity and adds missing tags to the tag map.

Name | Type | Description
docs_golds | iterable | Tuples of Doc and GoldParse objects.
yields | tuple | Tuples of Doc and GoldParse objects.

Language.create_pipe

Create a pipeline component from a factory.

Name | Type | Description
name | unicode | Factory name to look up in Language.factories.
config | dict | Configuration parameters to initialise component.
returns | callable | The pipeline component.

Language.add_pipe

Add a component to the processing pipeline. Valid components are callables that take a Doc object, modify it and return it. Only one of before, after, first or last can be set. Default behaviour is last=True.

Name | Type | Description
component | callable | The pipeline component.
name | unicode | Name of pipeline component. Overwrites existing component.name attribute if available. If no name is set and the component exposes no name attribute, component.__name__ is used. An error is raised if the name already exists in the pipeline.
before | unicode | Component name to insert component directly before.
after | unicode | Component name to insert component directly after.
first | bool | Insert component first / not first in the pipeline.
last | bool | Insert component last / not last in the pipeline.
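
The placement and naming rules above can be illustrated with a small stand-in class. ToyPipeline is hypothetical and only mirrors the behaviour described in this section; it is not the library's code, and the error messages are made up for the sketch.

```python
class ToyPipeline:
    # Hypothetical stand-in that mimics the documented add_pipe rules.
    def __init__(self):
        self.pipeline = []  # ordered list of (name, component) tuples

    @property
    def pipe_names(self):
        return [name for name, _ in self.pipeline]

    def add_pipe(self, component, name=None, before=None, after=None,
                 first=None, last=None):
        if not callable(component):
            raise ValueError("component must be callable")
        if name is None:
            # Fall back to component.name, then component.__name__.
            name = getattr(component, "name", None) or component.__name__
        if name in self.pipe_names:
            raise ValueError("'%s' already exists in the pipeline" % name)
        if sum(bool(arg) for arg in (before, after, first, last)) > 1:
            raise ValueError("only one of before, after, first or last can be set")
        entry = (name, component)
        if before is not None:
            self.pipeline.insert(self.pipe_names.index(before), entry)
        elif after is not None:
            self.pipeline.insert(self.pipe_names.index(after) + 1, entry)
        elif first:
            self.pipeline.insert(0, entry)
        else:
            self.pipeline.append(entry)  # default behaviour: last=True

def tagger(doc):
    return doc

def parser(doc):
    return doc

def cleaner(doc):
    return doc

def entity_counter(doc):
    return doc

nlp = ToyPipeline()
nlp.add_pipe(tagger)
nlp.add_pipe(parser)
nlp.add_pipe(cleaner, first=True)
nlp.add_pipe(entity_counter, name="counter", after="tagger")
order = nlp.pipe_names
```

After these four calls the order is cleaner, tagger, counter, parser. Adding tagger a second time, or passing both first=True and after="tagger", raises a ValueError in this sketch.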

Language.get_pipe

Get a pipeline component for a given component name.

Name | Type | Description
name | unicode | Name of the pipeline component to get.
returns | callable | The pipeline component.

Language.replace_pipe

Replace a component in the pipeline.

Name | Type | Description
name | unicode | Name of the component to replace.
component | callable | The pipeline component to insert.

Language.rename_pipe

Rename a component in the pipeline. Useful to create custom names for pre-defined and pre-loaded components. To change the default name of a component added to the pipeline, you can also use the name argument on add_pipe.

Name | Type | Description
old_name | unicode | Name of the component to rename.
new_name | unicode | New name of the component.

Language.remove_pipe

Remove a component from the pipeline. Returns the removed component name and component function.

Name | Type | Description
name | unicode | Name of the component to remove.
returns | tuple | A (name, component) tuple of the removed component.
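
Since the pipeline is a list of (name, component) tuples, the three editing methods above (replace_pipe, rename_pipe and remove_pipe) amount to simple list operations. The sketch below shows that idea in plain Python; the module-level functions and the pipe_index helper are hypothetical, and the error handling is an assumption rather than the library's behaviour.

```python
# Toy stand-ins for illustration only; none of this is spaCy code.

def tagger(doc):
    return doc

def parser(doc):
    return doc

def custom_parser(doc):
    return doc

pipeline = [("tagger", tagger), ("parser", parser)]

def pipe_index(name):
    # Hypothetical helper: position of a named component in the pipeline.
    names = [pipe_name for pipe_name, _ in pipeline]
    if name not in names:
        raise ValueError("no component named '%s'" % name)
    return names.index(name)

def replace_pipe(name, component):
    # Keep the name and position, swap the callable.
    pipeline[pipe_index(name)] = (name, component)

def rename_pipe(old_name, new_name):
    if new_name in [pipe_name for pipe_name, _ in pipeline]:
        raise ValueError("'%s' is already in use" % new_name)
    i = pipe_index(old_name)
    pipeline[i] = (new_name, pipeline[i][1])

def remove_pipe(name):
    # Return the removed (name, component) tuple.
    return pipeline.pop(pipe_index(name))

replace_pipe("parser", custom_parser)
rename_pipe("parser", "my_parser")
removed = remove_pipe("tagger")
```

At the end, removed holds the ("tagger", tagger) tuple and the pipeline contains only the replaced parser under its new name.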

Language.to_disk

Save the current state to a directory. If a model is loaded, this will include the model.

Name | Type | Description
path | unicode or Path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects.
disable | list | Names of pipeline components to disable and prevent from being saved.

Language.from_disk

Loads state from a directory. Modifies the object in place and returns it. If the saved Language object contains a model, the model will be loaded.

Name | Type | Description
path | unicode or Path | A path to a directory. Paths may be either strings or Path-like objects.
disable | list | Names of pipeline components to disable.
returns | Language | The modified Language object.

Language.to_bytes

Serialize the current state to a binary string.

Name | Type | Description
disable | list | Names of pipeline components to disable and prevent from being serialized.
returns | bytes | The serialized form of the Language object.

Language.from_bytes

Load state from a binary string.

Name | Type | Description
bytes_data | bytes | The data to load from.
disable | list | Names of pipeline components to disable.
returns | Language | The Language object.

Attributes

Name | Type | Description
vocab | Vocab | A container for the lexical types.
tokenizer | Tokenizer | The tokenizer.
make_doc | lambda text: Doc | Create a Doc object from unicode text.
pipeline | list | List of (name, component) tuples describing the current processing pipeline, in order.
pipe_names | list | List of pipeline component names, in order.
meta | dict | Custom meta data for the Language class. If a model is loaded, contains meta data of the model.

Class attributes

Name | Type | Description
Defaults | class | Settings, data and factory methods for creating the nlp object and processing pipeline.
lang | unicode | Two-letter language ID, i.e. ISO code.
factories | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name.