# Deep Learning

Using spaCy to pre-process text for deep learning, and how to plug in your own machine learning models.

## Pre-processing text for deep learning

## spaCy and Thinc

Thinc is the machine learning library powering spaCy. It's a practical toolkit for implementing models that follow the "Embed, encode, attend, predict" architecture. It's designed to be easy to install, efficient for CPU usage and optimised for NLP and deep learning with text, in particular hierarchically structured input and variable-length sequences.

spaCy's built-in pipeline components can all be powered by any object that follows Thinc's `Model` API. If a wrapper is not yet available for the library you're using, you should create a `thinc.neural.Model` subclass that implements a `begin_update` method. You'll also want to implement `to_bytes`, `from_bytes`, `to_disk` and `from_disk` methods, to save and load your model. Here's the template you'll need to fill in:

## Thinc Model API

```
class ThincModel(thinc.neural.Model):
    def __init__(self, *args, **kwargs):
        pass

    def begin_update(self, X, drop=0.):
        def backprop(dY, sgd=None):
            return dX
        return Y, backprop

    def to_disk(self, path, **exclude):
        return None

    def from_disk(self, path, **exclude):
        return self

    def to_bytes(self, **exclude):
        return bytes

    def from_bytes(self, msgpacked_bytes, **exclude):
        return self
```

The `begin_update` method should return a callback that takes the gradient with respect to the output, and returns the gradient with respect to the input. It's usually convenient to implement the callback as a nested function, so you can refer to any intermediate variables from the forward computation in the enclosing scope.
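For example, here's a minimal sketch of that contract for a layer that multiplies its input by a weights matrix. The `MultiplyLayer` class is hypothetical and written in plain numpy rather than subclassing `thinc.neural.Model`, so it stands alone:

```python
import numpy


class MultiplyLayer:
    """Hypothetical layer illustrating the begin_update contract:
    the forward pass returns the output and a backprop callback,
    implemented as a closure over the forward pass's variables."""

    def __init__(self, W):
        self.W = W  # weights of shape (n_out, n_in)

    def begin_update(self, X, drop=0.):
        Y = X @ self.W.T

        def backprop(dY, sgd=None):
            # X and self.W are read from the enclosing scope,
            # so no explicit "tape" is needed.
            self.d_W = dY.T @ X  # gradient w.r.t. the weights
            return dY @ self.W   # gradient w.r.t. the input

        return Y, backprop
```

Calling `begin_update` gives you both the prediction and the means to backpropagate through it later:

```python
layer = MultiplyLayer(numpy.array([[2., 0.], [0., 3.]]))
Y, backprop = layer.begin_update(numpy.ones((4, 2)))
dX = backprop(numpy.ones((4, 2)))
```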

### How Thinc works

Neural networks are all about composing small functions that we know how to differentiate into larger functions that we know how to differentiate. To differentiate a function efficiently, you usually need to store intermediate results, computed during the "forward pass", to reuse them during the backward pass. Most libraries require the data passed through the network to accumulate these intermediate results. This is the "tape" in tape-based differentiation.

In Thinc, a model that computes `y = f(x)` is required to also return a callback that computes `dx = f'(dy)`. The same intermediate state needs to be tracked, but this becomes an implementation detail for the model to take care of: usually, the callback is implemented as a closure, so the intermediate results can be read from the enclosing scope.
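To see why this makes composition easy, here's a hedged sketch of a `chain` helper (a hypothetical standalone version, loosely analogous to Thinc's `chain` combinator) that composes two functions and applies the chain rule simply by nesting their callbacks:

```python
def chain(layer1, layer2):
    """Compose two (output, backprop) functions. The combined
    backprop applies the chain rule by calling the callbacks
    in reverse order."""
    def begin_update(X, drop=0.):
        Y1, bp1 = layer1(X)
        Y2, bp2 = layer2(Y1)

        def backprop(dY2, sgd=None):
            return bp1(bp2(dY2))

        return Y2, backprop
    return begin_update


def square(X):
    # y = x ** 2, so dy/dx = 2 * x; X is read from the closure.
    def backprop(dY):
        return 2 * X * dY
    return X ** 2, backprop


def double(X):
    # y = 2 * x, so dy/dx = 2
    def backprop(dY):
        return 2 * dY
    return 2 * X, backprop
```

For instance, `chain(square, double)` computes `y = 2 * x ** 2`, and its combined callback returns the gradient `4 * x` without any shared tape between the two layers.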

## Using spaCy with TensorFlow / Keras

## Using spaCy with scikit-learn

## Using spaCy with PyTorch

Here's how a `begin_update` function that wraps an arbitrary PyTorch model would look:

```
import torch
from torch.autograd import Variable
import thinc.neural


class PytorchWrapper(thinc.neural.Model):
    def __init__(self, pytorch_model):
        self.pytorch_model = pytorch_model

    def begin_update(self, x_data, drop=0.):
        # Track gradients on the input, so we can return them later
        x_var = Variable(x_data, requires_grad=True)
        # Make prediction
        y_var = self.pytorch_model(x_var)

        def backward(dy_data, sgd=None):
            dy_var = Variable(dy_data)
            # Backpropagate the output gradient through the tape;
            # the input gradient accumulates on x_var.grad
            torch.autograd.backward(y_var, dy_var)
            return x_var.grad.data

        return y_var.data, backward
```

PyTorch requires data to be wrapped in a container, `Variable`, that tracks the operations performed on the data. This "tape" of operations is then used by `torch.autograd.backward` to compute the gradient with respect to the input. For example, the following code constructs a PyTorch linear layer that takes a vector of shape `(length, 2)`, multiplies it by a `(2, 2)` matrix of weights, adds a `(2,)` bias, and returns the resulting `(length, 2)` vector:

## PyTorch Linear

```
from torch import autograd
from torch import nn
import torch
import numpy

pt_model = nn.Linear(2, 2)
length = 5
input_data = numpy.ones((5, 2), dtype='f')
input_var = autograd.Variable(torch.Tensor(input_data))
output_var = pt_model(input_var)
output_data = output_var.data.numpy()
```
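As a point of reference, the same computation written in plain numpy makes the shapes explicit. This is only a sketch: the weights and bias here are zeros, whereas `nn.Linear` initialises them randomly:

```python
import numpy

length = 5
X = numpy.ones((length, 2), dtype='f')  # input: (length, 2)
W = numpy.zeros((2, 2), dtype='f')      # weights: (2, 2)
b = numpy.zeros((2,), dtype='f')        # bias: (2,)
Y = X @ W.T + b                         # output: (length, 2)
```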

Given target values we would like the output data to approximate, we can then "learn" values of the parameters within `pt_model`, to give us output that's closer to our target. As a trivial example, let's make the linear layer compute the negative inverse of the input:

```
def get_target(input_data):
    return -(1 / input_data)
```

To update the PyTorch model, we create an optimizer and give it
references to the model's parameters. We'll then randomly generate input
data and get the target result we'd like the function to produce. We then compute the **gradient of the error** between the current output and the target. Using the most popular definition of "error", this is
simply the average difference:

```
from torch import optim

optimizer = optim.SGD(pt_model.parameters(), lr=0.01)
for i in range(10):
    input_data = numpy.random.uniform(-1., 1., (length, 2))
    target = get_target(input_data)
    output_var = pt_model(autograd.Variable(torch.Tensor(input_data)))
    output_data = output_var.data.numpy()
    # Gradient of the error: the averaged difference
    d_output_data = (output_data - target) / length
    d_output_var = autograd.Variable(torch.Tensor(d_output_data))
    # Clear stale gradients, then backpropagate through the tape
    optimizer.zero_grad()
    torch.autograd.backward(output_var, d_output_var)
    optimizer.step()
```
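To make the gradient flow fully explicit, the same loop can be sketched in plain numpy, without PyTorch's autograd. This is a hypothetical re-implementation for illustration, updating the weights and bias of the linear layer by hand:

```python
import numpy

numpy.random.seed(0)
length = 5
W = numpy.zeros((2, 2))  # weights of the linear layer
b = numpy.zeros((2,))    # bias
learn_rate = 0.01
for i in range(10):
    input_data = numpy.random.uniform(-1., 1., (length, 2))
    target = -(1 / input_data)
    output_data = input_data @ W.T + b          # forward pass
    d_output = (output_data - target) / length  # gradient of the error
    # Backpropagate through the linear layer by hand
    d_W = d_output.T @ input_data
    d_b = d_output.sum(axis=0)
    W -= learn_rate * d_W
    b -= learn_rate * d_b
```

Note that because `-(1 / x)` blows up near zero, a few unlucky samples can make this toy problem diverge; the point of the sketch is only to show where each gradient comes from.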