
Deep Learning
Using spaCy to pre-process text for deep learning, and how to plug in your own machine learning models.

Pre-processing text for deep learning

spaCy and Thinc

Thinc is the machine learning library powering spaCy. It's a practical toolkit for implementing models that follow the "Embed, encode, attend, predict" architecture. It's designed to be easy to install, efficient for CPU usage and optimised for NLP and deep learning with text – in particular, hierarchically structured input and variable-length sequences.

spaCy's built-in pipeline components can all be powered by any object that follows Thinc's Model API. If a wrapper is not yet available for the library you're using, you should create a thinc.neural.Model subclass that implements a begin_update method. You'll also want to implement to_bytes, from_bytes, to_disk and from_disk methods to save and load your model. Here's the template you'll need to fill in:

Thinc Model API

class ThincModel(thinc.neural.Model):
    def __init__(self, *args, **kwargs):
        pass

    def begin_update(self, X, drop=0.):
        def backprop(dY, sgd=None):
            return dX
        return Y, backprop

    def to_disk(self, path, **exclude):
        return None

    def from_disk(self, path, **exclude):
        return self

    def to_bytes(self, **exclude):
        return bytes

    def from_bytes(self, msgpacked_bytes, **exclude):
        return self

The begin_update method should return a callback that takes the gradient with respect to the output and returns the gradient with respect to the input. It's usually convenient to implement the callback as a nested function, so you can refer to any intermediate variables from the forward computation in the enclosing scope.
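For example, a minimal sketch of this contract could be a toy rectified-linear layer (the class name and details below are illustrative, not part of Thinc): the forward pass stores an intermediate mask, and the nested callback reuses it from the enclosing scope.

class ToyRelu(thinc.neural.Model):
    """Toy layer computing Y = max(0, X), to illustrate the begin_update contract."""
    def begin_update(self, X, drop=0.):
        # Intermediate result from the forward pass, reused by the callback
        mask = X > 0
        Y = X * mask

        def backprop(dY, sgd=None):
            # Gradient only flows through positions that were positive in the
            # forward pass, which we know from `mask` in the enclosing scope.
            return dY * mask

        return Y, backprop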

How Thinc works

Neural networks are all about composing small functions that we know how to differentiate into larger functions that we know how to differentiate. To differentiate a function efficiently, you usually need to store intermediate results, computed during the "forward pass", to reuse them during the backward pass. Most libraries require the data passed through the network to accumulate these intermediate results. This is the "tape" in tape-based differentiation.

In Thinc, a model that computes y = f(x) is required to also return a callback that computes dx = f'(dy). The same intermediate state needs to be tracked, but this becomes an implementation detail for the model to take care of – usually, the callback is implemented as a closure, so the intermediate results can be read from the enclosing scope.
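To make the composition concrete, here's a rough sketch of how two such models can be chained by hand (Thinc ships its own combinators for this; chain_two below is just an illustrative helper): the combined callback simply runs the individual callbacks in reverse order.

def chain_two(model1, model2):
    """Compose two Thinc-style models so that Y = model2(model1(X))."""
    def begin_update(X, drop=0.):
        hidden, bp_model1 = model1.begin_update(X, drop=drop)
        Y, bp_model2 = model2.begin_update(hidden, drop=drop)

        def backprop(dY, sgd=None):
            # Reverse order: gradient w.r.t. the hidden representation first,
            # then gradient w.r.t. the original input.
            d_hidden = bp_model2(dY, sgd=sgd)
            return bp_model1(d_hidden, sgd=sgd)

        return Y, backprop
    return begin_update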

Using spaCy with TensorFlow / Keras

Using spaCy with scikit-learn

Using spaCy with PyTorch

Here's how a begin_update function that wraps an arbitrary PyTorch model would look:

import thinc.neural
import torch
from torch.autograd import Variable


class PytorchWrapper(thinc.neural.Model):
    def __init__(self, pytorch_model):
        self.pytorch_model = pytorch_model

    def begin_update(self, x_data, drop=0.):
        # Wrap the input so PyTorch records the operations performed on it
        x_var = Variable(torch.Tensor(x_data), requires_grad=True)
        # Make prediction
        y_var = self.pytorch_model(x_var)

        def backward(dy_data, sgd=None):
            dy_var = Variable(torch.Tensor(dy_data))
            # Backpropagate through the recorded operations (the "tape")
            y_var.backward(dy_var)
            return x_var.grad.data.numpy()

        return y_var.data.numpy(), backward
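Assuming the wrapper behaves as sketched above, using it from Thinc's side would look roughly like this (the shapes and the all-ones gradient are purely illustrative):

import numpy
from torch import nn

# Wrap a small PyTorch module so it exposes Thinc's Model API
wrapped = PytorchWrapper(nn.Linear(2, 2))

X = numpy.ones((5, 2), dtype='f')
Y, backprop = wrapped.begin_update(X)
# Pretend the gradient of the loss with respect to the output is all ones
dX = backprop(numpy.ones(Y.shape, dtype='f'))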

PyTorch requires data to be wrapped in a container, Variable, that tracks the operations performed on the data. This "tape" of operations is then used by torch.autograd.backward to compute the gradient with respect to the input. For example, the following code constructs a PyTorch Linear layer that takes an input of shape (length, 2), multiplies it by a (2, 2) matrix of weights, adds a (2,) bias, and returns the resulting (length, 2) output:

PyTorch Linear

from torch import autograd
from torch import nn
import torch
import numpy

pt_model = nn.Linear(2, 2)
length = 5

input_data = numpy.ones((5, 2), dtype='f')
input_var = autograd.Variable(torch.Tensor(input_data))
output_var = pt_model(input_var)
output_data = output_var.data.numpy()

Given target values we would like the output data to approximate, we can then "learn" values of the parameters within pt_model, to give us output that's closer to our target. As a trivial example, let's make the linear layer compute the negative inverse of the input:

def get_target(input_data):
    return -(1 / input_data)

To update the PyTorch model, we create an optimizer and give it references to the model's parameters. We then randomly generate input data and compute the target output we'd like the function to produce. Next, we compute the gradient of the error between the current output and the target. Using the most common definition of error, the squared error, this gradient is simply the difference between the output and the target, averaged over the batch:

from torch import optim

optimizer = optim.SGD(pt_model.parameters(), lr=0.01)
for i in range(10):
    input_data = numpy.random.uniform(-1., 1., (length, 2))
    target = get_target(input_data)

    output_var = pt_model(autograd.Variable(torch.Tensor(input_data)))
    output_data = output_var.data.numpy()

    # Gradient of the squared error with respect to the output
    d_output_data = (output_data - target) / length
    d_output_var = autograd.Variable(torch.Tensor(d_output_data))

    # Clear gradients accumulated on the previous iteration, backpropagate
    # through the recorded operations, then update the parameters
    optimizer.zero_grad()
    output_var.backward(d_output_var)
    optimizer.step()
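As a quick sanity check (not part of the original example), you can compare the layer's output on fresh input against the target function, to get a rough sense of how closely it now approximates the negative inverse:

# Evaluate on fresh random input and print the mean squared error
test_input = numpy.random.uniform(-1., 1., (length, 2))
test_output = pt_model(autograd.Variable(torch.Tensor(test_input))).data.numpy()
print(((test_output - get_target(test_input)) ** 2).mean())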

Using spaCy with DyNet