## Fine Tuning Pythia by Hand and Calculating PPL

6/24/2023

This is an introduction on how to get your feet wet writing a training loop to fine tune a GPT-NeoX model, specifically the 70m deduped Pythia model. After fine tuning, I measure and compare the PPL to gpt2 and do some rough plots to contrast the two models. The notebook can run on a free colab GPU instance so this is great for just a proof of concept that we can in fact fine tune the model on new data.

Disclaimer: this model is not intended for production use

Resource type | URI |
---|---|

Model | https://huggingface.co/keppy/pythia-70m-dedupe-yt |

Notebook | https://gist.github.com/keppy/a5be88ea59a67a901571b6e0c3478585 |

### Fine Tuning

Some of the more interesting bits from the fine tuning and metric calculations follow below.

This bit of code called `group_texts()`

can be found at https://huggingface.co/docs/transformers/tasks/language_modeling:

```
block_size = 128
def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
# We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
# customize this part to your needs.
if total_length >= block_size:
total_length = (total_length // block_size) * block_size
# Split by chunks of block_size.
result = {
k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
result["labels"] = result["input_ids"].copy()
return result
```

If we take a look at the actual training loop we see that we take our batches from the `train_dataloader`

which was built using `group_texts()`

:

```
model.train()
for epoch in range(num_epochs):
for batch in train_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
```

### Evaluation

By now we have done our transfer learning (I'm not that old I swear) and we have our model as the result. So how do we evaluate this causal model? We want to get the model's perplexity (PPL). I'll let you research more about what that means but basically there is a library from Hugging Face called evaluate which will let us compute PPL against models on the Hugging Face hub.

What we do in the notebook is upload our newly trained model to the hub. Then, after doing a round of batching and evaluation, we can compute two separate metrics--one for gpt2 and one for our model to compare. We do this by adding batches of decoded tokens (predictions) to our metric evaluators inside the eval loop and then computing the metrics after the loop has finished.

```
# batching, eval, using metric.add_batch()......
# ...
gpt2_PPL = gpt2_metric.compute(model_id="gpt2")
yt_PPL = yt_metric.compute(model_id="keppy/pythia-70m-dedupe-yt")
```

The metric gives us also a mean perplexity so we can show that as some bars as well.

Feel free to copy the notebook from the gist to your own colab. Hopefully this is enough info to get you started. Going through the notebook will take you step by step, and it doesn't take long to run as I've used a very small subset of the dataset.