Welcome

Welcome to AllenNLP! This tutorial will walk you through the basics of building and training an AllenNLP model.

{% include more-tutorials.html %}

Before we get started, make sure you have a clean Python 3.6 or 3.7 virtual environment, and then run the following command to install the AllenNLP library:

{% highlight bash %}
pip install allennlp
{% endhighlight %}
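
If you don't already have such an environment, one standard way to create and activate one (the directory name allennlp-tutorial here is arbitrary) is:

{% highlight bash %}
# "allennlp-tutorial" is just a name for the environment directory.
python3 -m venv allennlp-tutorial
source allennlp-tutorial/bin/activate
{% endhighlight %}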

In this tutorial we'll implement an enhanced version of the model from the PyTorch LSTM for Part-of-Speech Tagging tutorial, adding a few features that make the task slightly more realistic (and that also showcase some of the benefits of AllenNLP):

  1. We'll read our data from files, in the format shown just after this list. (The PyTorch tutorial example uses data defined inline in the Python code.)
  2. We'll use a separate validation dataset to check our performance. (The tutorial example trains and evaluates on the same dataset.)
  3. We'll use tqdm to track the progress of our training.
  4. We'll implement early stopping based on the loss on the validation dataset.
  5. We'll track accuracy on both the training and validation sets as we train the model.
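
For reference, here's the data format we'll be reading: one sentence per line, with each word joined to its part-of-speech tag by ###:

{% highlight text %}
The###DET dog###NN ate###V the###DET apple###NN
{% endhighlight %}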

(In addition to what's highlighted in this tutorial, AllenNLP provides many other features "for free".)


The Problem

Given a sentence (e.g. "The dog ate the apple") we want to predict part-of-speech tags for each word (e.g. ["DET", "NN", "V", "DET", "NN"]).

As in the PyTorch tutorial, we'll embed each word in a low-dimensional space, pass them through an LSTM to get a sequence of encodings, and use a feedforward layer to transform those into a sequence of logits (corresponding to the possible part-of-speech tags).
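
To make that pipeline concrete, here's a minimal sketch in plain PyTorch. The sizes are illustrative assumptions (a 100-word vocabulary and 5 tags); the AllenNLP version below handles vocabularies, batching, padding, and masking for us:

{% highlight python %}
import torch

# Illustrative sizes for the sketch: 100-word vocabulary, 5 possible tags.
embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=6)
lstm = torch.nn.LSTM(input_size=6, hidden_size=6, batch_first=True)
hidden2tag = torch.nn.Linear(in_features=6, out_features=5)

token_ids = torch.randint(0, 100, (1, 5))  # (batch_size, sequence_length)
embedded = embedding(token_ids)            # shape: (1, 5, 6)
encoded, _ = lstm(embedded)                # shape: (1, 5, 6)
tag_logits = hidden2tag(encoded)           # shape: (1, 5, 5): one logit per tag, per word
{% endhighlight %}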

Below is the code for accomplishing this. You can read it straight through from the top, or skim the code and return to any step that needs more explanation.

{% highlight python %}
from typing import Iterator, List, Dict

import torch
import torch.optim as optim
import numpy as np

from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.dataset_readers import DatasetReader
from allennlp.common.file_utils import cached_path
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer
from allennlp.predictors import SentenceTaggerPredictor

torch.manual_seed(1)
{% endhighlight %}
{% highlight python %}
class PosDatasetReader(DatasetReader):
    """
    DatasetReader for PoS tagging data, one sentence per line, like

        The###DET dog###NN ate###V the###DET apple###NN
    """
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        super().__init__(lazy=False)
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"sentence": sentence_field}

        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields["labels"] = label_field

        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)
{% endhighlight %}
{% highlight python %}
class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        # The feedforward layer that maps each encoder output to tag logits.
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        # The mask distinguishes real tokens from padding.
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}

        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)

        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}
{% endhighlight %}
{% highlight python %}
reader = PosDatasetReader()

train_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/training.txt'))
validation_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/validation.txt'))

vocab = Vocabulary.from_instances(train_dataset + validation_dataset)
{% endhighlight %}
{% highlight python %}
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

model = LstmTagger(word_embeddings, lstm, vocab)
{% endhighlight %}
{% highlight python %}
if torch.cuda.is_available():
    cuda_device = 0
    model = model.cuda(cuda_device)
else:
    cuda_device = -1
{% endhighlight %}
{% highlight python %}
optimizer = optim.SGD(model.parameters(), lr=0.1)

iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)

trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=1000,
                  cuda_device=cuda_device)

trainer.train()
{% endhighlight %}
{% highlight python %}
predictor = SentenceTaggerPredictor(model, dataset_reader=reader)
tag_logits = predictor.predict("The dog ate the apple")['tag_logits']
tag_ids = np.argmax(tag_logits, axis=-1)
print([model.vocab.get_token_from_index(i, 'labels') for i in tag_ids])
{% endhighlight %}
{% highlight python %}
# Here's how to save the model.
with open("/tmp/model.th", 'wb') as f:
    torch.save(model.state_dict(), f)

vocab.save_to_files("/tmp/vocabulary")

# And here's how to reload the model.
vocab2 = Vocabulary.from_files("/tmp/vocabulary")
model2 = LstmTagger(word_embeddings, lstm, vocab2)

with open("/tmp/model.th", 'rb') as f:
    model2.load_state_dict(torch.load(f))

if cuda_device > -1:
    model2.cuda(cuda_device)

predictor2 = SentenceTaggerPredictor(model2, dataset_reader=reader)
tag_logits2 = predictor2.predict("The dog ate the apple")['tag_logits']
np.testing.assert_array_almost_equal(tag_logits2, tag_logits)
{% endhighlight %}

Although this tutorial has walked through how AllenNLP works under the hood, in practice you probably wouldn't write your code this way. Most AllenNLP objects (Models, DatasetReaders, and so on) can be constructed declaratively from JSON-like objects.

So a more typical use case would involve implementing the Model and DatasetReader as above, but creating a Jsonnet file indicating how you want to instantiate and train them, and then simply using the command line tool allennlp train (which would automatically save an archive of the trained model).
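
For a rough idea of what that looks like, here's a sketch of such a config file. It assumes our PosDatasetReader and LstmTagger have been registered with AllenNLP under the hypothetical names "pos-tutorial" and "lstm-tagger", and it mirrors the hyperparameters used above:

{% highlight json %}
{
  "dataset_reader": {
    "type": "pos-tutorial"
  },
  "train_data_path": "https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/training.txt",
  "validation_data_path": "https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/validation.txt",
  "model": {
    "type": "lstm-tagger",
    "word_embeddings": {
      "token_embedders": {
        "tokens": {
          "type": "embedding",
          "embedding_dim": 6
        }
      }
    },
    "encoder": {
      "type": "lstm",
      "input_size": 6,
      "hidden_size": 6
    }
  },
  "iterator": {
    "type": "bucket",
    "batch_size": 2,
    "sorting_keys": [["sentence", "num_tokens"]]
  },
  "trainer": {
    "optimizer": {
      "type": "sgd",
      "lr": 0.1
    },
    "patience": 10,
    "num_epochs": 1000
  }
}
{% endhighlight %}

You'd then train with something like allennlp train experiment.jsonnet -s /tmp/output, where -s names the directory for the saved model archive.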

For more details on how this would work in this example, please read Using Config Files.

{% include more-tutorials.html %}