Welcome to AllenNLP! This tutorial will walk you through the basics of building and training an AllenNLP model.
{% include more-tutorials.html %}

Before we get started, make sure you have a clean Python 3.6 or 3.7 virtual environment, and then run the following command to install the AllenNLP library:
{% highlight bash %}
pip install allennlp
{% endhighlight %}

In this tutorial we'll implement a slightly enhanced version of the PyTorch LSTM for Part-of-Speech Tagging tutorial, adding some features that make it a more realistic task (and that also showcase some of the benefits of AllenNLP).
(In addition to what's highlighted in this tutorial, AllenNLP provides many other "for free" features.)
Given a sentence (e.g. "The dog ate the apple") we want to predict part-of-speech tags for each word (e.g. ["DET", "NN", "V", "DET", "NN"]).
As in the PyTorch tutorial, we'll embed each word in a low-dimensional space, pass them through an LSTM to get a sequence of encodings, and use a feedforward layer to transform those into a sequence of logits (corresponding to the possible part-of-speech tags).
Below is the annotated code for accomplishing this. You can start reading the annotations from the top, or just look through the code and look to the annotations when you need more explanation.
In AllenNLP we represent each training example as an Instance containing Fields of various types. Here each example will have a TextField containing the sentence, and a SequenceLabelField containing the corresponding part-of-speech tags.

To solve a problem like this with AllenNLP you typically implement two classes. The first is a DatasetReader, which contains the logic for reading a file of data and producing a stream of Instances.
Frequently we'll want to load datasets or models from URLs. The cached_path helper downloads such files, caches them locally, and returns the local path. It also accepts local file paths (which it just returns as-is).

Before a model can consume a token, the token has to be converted into one or more indices, and AllenNLP provides the TokenIndexer abstraction for this representation. Whereas a TokenIndexer represents a rule for how to turn a token into indices, a Vocabulary contains the corresponding mappings from strings to integers. For example, your token indexer might specify to represent a token as a sequence of character ids, in which case the Vocabulary would contain the mapping {character -> id}. In this particular example we use a SingleIdTokenIndexer that assigns each token a unique id, and so the Vocabulary will just contain a mapping {token -> id} (as well as the reverse mapping).
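To make these abstractions concrete, here is a minimal sketch (not part of the original annotated code) of how Fields, Instances, a SingleIdTokenIndexer, and a Vocabulary fit together; the field names "sentence" and "labels" are illustrative choices, and the API shown is the AllenNLP 0.x-era one this tutorial targets:

{% highlight python %}
# A minimal sketch of Fields, Instances, token indexers, and the Vocabulary.
from allennlp.data import Instance, Vocabulary
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

tokens = [Token(word) for word in "The dog ate the apple".split()]
tags = ["DET", "NN", "V", "DET", "NN"]

# Each token will be represented by a single id under the "tokens" namespace.
sentence_field = TextField(tokens, {"tokens": SingleIdTokenIndexer()})
label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
instance = Instance({"sentence": sentence_field, "labels": label_field})

# The Vocabulary collects the {token -> id} and {tag -> id} mappings.
vocab = Vocabulary.from_instances([instance])
print(vocab.get_token_index("dog"))      # id assigned to the word "dog"
print(vocab.get_vocab_size("labels"))    # number of distinct tags seen so far
{% endhighlight %}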
Besides the DatasetReader, the other class you'll typically need to implement is Model, which is a PyTorch Module that takes tensor inputs and produces a dict of tensor outputs (including the training loss you want to optimize).

During training we'll also rely on DataIterators that can intelligently batch our data, and on AllenNLP's full-featured Trainer to run the training loop.
Our first order of business is to implement our DatasetReader subclass. The only parameter our DatasetReader needs is a dict of TokenIndexers that specify how to convert tokens into indices. By default we'll just generate a single index for each token (which we'll call "tokens") that's just a unique id for each distinct token. (This is just the standard "word to index" mapping you'd use in most NLP tasks.)

DatasetReader.text_to_instance takes the inputs corresponding to a training example (in this case the tokens of the sentence and the corresponding part-of-speech tags), instantiates the corresponding Fields (in this case a TextField for the sentence and a SequenceLabelField for its tags), and returns the Instance containing those fields. Notice that the tags are optional, since we'd like to be able to create instances from unlabeled data to make predictions on them.

The other piece we have to implement is _read, which takes a filename and produces a stream of Instances. Most of the work has already been done in text_to_instance.
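Putting those pieces together, here is a sketch of what such a reader might look like. The class name, the field names, and the word###TAG file format are illustrative assumptions rather than anything AllenNLP prescribes:

{% highlight python %}
from typing import Dict, Iterator, List

from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import Field, SequenceLabelField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenIndexer
from allennlp.data.tokenizers import Token


class PosDatasetReader(DatasetReader):
    """Reads lines like "The###DET dog###NN ate###V the###DET apple###NN" (assumed format)."""

    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        super().__init__(lazy=False)
        # By default, a single "tokens" index: one unique id per distinct word.
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields: Dict[str, Field] = {"sentence": sentence_field}
        # Tags are optional so we can also build instances from unlabeled text.
        if tags:
            fields["labels"] = SequenceLabelField(labels=tags, sequence_field=sentence_field)
        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as data_file:
            for line in data_file:
                pairs = line.strip().split()
                if not pairs:
                    continue
                words, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in words], list(tags))
{% endhighlight %}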
Now for the Model, which is a subclass of torch.nn.Module. How it works is largely up to you; it mostly just needs a forward method that takes tensor inputs and produces a dict of tensor outputs that includes the loss you'll use to train the model. As mentioned above, our model will consist of an embedding layer, a sequence encoder, and a feedforward network.

We pass the embedding layer and the encoder in as constructor parameters. The embedding layer is specified as an AllenNLP TextFieldEmbedder, which represents a general way of turning tokens into tensors. (Here we know that we want to represent each unique word with a learned tensor, but using the general class allows us to easily experiment with different types of embeddings, for example ELMo.) Similarly, the encoder is specified as a general Seq2SeqEncoder even though we know we want to use an LSTM. Again, this makes it easy to experiment with other sequence encoders, for example a Transformer.

Every AllenNLP model also expects a Vocabulary, which contains the namespaced mappings of tokens to indices and labels to indices. In the constructor we also instantiate a CategoricalAccuracy metric, which we'll use to track accuracy during each training and validation epoch.
Next comes forward, which is where the actual computation happens. Each Instance in your dataset will get (batched with other instances and) fed into forward. The forward method expects dicts of tensors as input, and it expects their names to be the names of the fields in your Instance. In this case we have a sentence field and (possibly) a labels field, so we'll construct our forward accordingly.

Because AllenNLP pads shorter sentences so that every sequence in a batch has the same length, we need a mask marking which positions are real tokens and which are padding; for this we use get_text_field_mask, which returns a tensor of 0s and 1s corresponding to the padded and unpadded locations. We then pass the sentence tensor (each sentence a sequence of token ids) to the word_embeddings module, which converts each sentence into a sequence of embedded tensors; those go through the encoder and the feedforward layer to produce the tag logits, and when labels are provided we also compute the loss and update the accuracy metric.

Since the model owns that accuracy metric, it also needs a get_metrics method that pulls the data out of it. Behind the scenes, the CategoricalAccuracy metric is storing the number of predictions and the number of correct predictions, updating those counts during each call to forward. Each call to get_metric returns the calculated accuracy and (optionally) resets the counts, which is what allows us to track accuracy anew for each epoch.
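Putting the constructor, forward, and get_metrics together, here is a sketch of the whole model. The class name LstmTagger and the choice of sequence_cross_entropy_with_logits for the loss are illustrative; any sequence loss over the logits would serve the same purpose:

{% highlight python %}
from typing import Dict

import torch
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder
from allennlp.modules.text_field_embedders import TextFieldEmbedder
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.metrics import CategoricalAccuracy


class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        # Feedforward layer mapping each encoded token to one logit per tag.
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size("labels"))
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        # 0/1 mask distinguishing real tokens from padding.
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}
        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)
        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}
{% endhighlight %}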
Now that we've implemented a DatasetReader and Model, we're ready to train. We first need an instance of our dataset reader. Because the training and validation data live at URLs, we use cached_path to cache the files locally (and to hand reader.read the path to the local cached version). Once we've read in the datasets, we use them to create our Vocabulary (that is, the mapping[s] from tokens / labels to ids).
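As a sketch, reusing the PosDatasetReader sketched above (the URLs here are placeholders for wherever your training and validation files actually live):

{% highlight python %}
from allennlp.common.file_utils import cached_path
from allennlp.data import Vocabulary

reader = PosDatasetReader()

# Placeholder locations: substitute the real URLs or local paths of your data.
train_dataset = reader.read(cached_path("https://example.com/pos/training.txt"))
validation_dataset = reader.read(cached_path("https://example.com/pos/validation.txt"))

# Build the {token -> id} and {label -> id} mappings from both datasets.
vocab = Vocabulary.from_instances(train_dataset + validation_dataset)
{% endhighlight %}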
Next comes the text field embedder, a BasicTextFieldEmbedder which takes a mapping from index names to embeddings. If you go back to where we defined our DatasetReader, the default parameters included a single index called "tokens", so our mapping just needs an embedding corresponding to that index. We use the Vocabulary to find how many embeddings we need and our EMBEDDING_DIM parameter to specify the output dimension. It's also possible to start with pre-trained embeddings (for example, GloVe vectors), but there's no need to do that on this tiny toy dataset.

For the encoder, the need for PytorchSeq2SeqWrapper is slightly unfortunate (if you use configuration files you won't need to worry about it), but it's required to add some extra functionality (and a cleaner interface) to the built-in PyTorch module. In AllenNLP we do everything batch first, so we specify that as well.
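A sketch of that wiring, continuing from the reader and vocabulary above (EMBEDDING_DIM and HIDDEN_DIM are arbitrary sizes chosen for this toy task):

{% highlight python %}
import torch
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding

EMBEDDING_DIM = 6
HIDDEN_DIM = 6

# One learned embedding per entry in the "tokens" namespace of the vocabulary.
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size("tokens"),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

# Wrap the built-in PyTorch LSTM; AllenNLP works batch-first.
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

model = LstmTagger(word_embeddings, lstm, vocab)
{% endhighlight %}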
We also need a DataIterator that handles batching for our datasets. The BucketIterator sorts instances by the specified fields in order to create batches with similar sequence lengths. Here we indicate that we want to sort the instances by the number of tokens in the sentence field.

Finally, we instantiate our Trainer and run it. Here we tell it to run for 1000 epochs and to stop training early if it ever spends 10 epochs without the validation metric improving. The default validation metric is loss (which improves by getting smaller), but it's also possible to specify a different metric and direction (e.g. accuracy should get bigger).
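A sketch of the iterator and trainer wiring; the optimizer choice and hyperparameters here are illustrative:

{% highlight python %}
import torch.optim as optim
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer

optimizer = optim.SGD(model.parameters(), lr=0.1)

# Group instances of similar length, sorting by the number of tokens in "sentence".
iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)

trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,       # stop early after 10 epochs with no validation improvement
                  num_epochs=1000)
trainer.train()
{% endhighlight %}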
To make predictions, AllenNLP provides a Predictor abstraction that takes inputs, converts them to instances, feeds them through your model, and returns JSON-serializable results. Often you'd need to implement your own Predictor, but AllenNLP already has a SentenceTaggerPredictor that works perfectly here, so we can use it. It requires our model (for making predictions) and a dataset reader (for creating instances).

The predictor has a predict method that just needs a sentence and returns (a JSON-serializable version of) the output dict from forward. Here tag_logits will be a (5, 3) array of logits, corresponding to the 3 possible tags for each of the 5 words. To get the actual predicted tags we just take the argmax.
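A sketch of what that looks like, using the example sentence from the start of the tutorial (the "tag_logits" key comes from the model sketch above):

{% highlight python %}
import numpy as np
from allennlp.predictors import SentenceTaggerPredictor

predictor = SentenceTaggerPredictor(model, dataset_reader=reader)
tag_logits = predictor.predict("The dog ate the apple")["tag_logits"]

# Argmax over the tag dimension, then map each tag id back to its label string.
tag_ids = np.argmax(tag_logits, axis=-1)
print([model.vocab.get_token_from_index(int(i), "labels") for i in tag_ids])
{% endhighlight %}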
and lstm
with the original model earlier. All of a model's parameters need to be on the same device.Although this tutorial has been an explanation of how AllenNLP works, in practice you probably wouldn't write your code this way. Most AllenNLP objects (Models, DatasetReaders, and so on) can be constructed declaratively from JSON-like objects.
So a more typical use case would involve implementing the Model and
DatasetReader as above, but creating a Jsonnet
file indicating how you want to instantiate and train them,
and then simply using the command line tool allennlp train
(which would automatically save an archive of the trained model).
For more details on how this would work in this example, please read Using Config Files.
{% include more-tutorials.html %}