
Writings & Research

Findings and discoveries documented in written form for your consumption

Research to the People and Stanford Medicine's Rare Disease AI Hackathon

A week ago I flew down to SF to present at GitHub HQ for a rare disease AI hackathon aimed at jump-starting research into applying LLMs, and ML in general, to rare disease diagnosis and treatment. The two diseases being studied were Ehlers-Danlos syndrome and hypophosphatasia (HPP). Over two months, my team focused solely on HPP. We developed a fine-tuned model for answering questions about HPP using the ~1,300-paper data source provided by the event organizers (shout-out to Pete Kane).

Initially I took on the task of implementing techniques from KG RAG and then building on them to integrate a knowledge graph that QIAGEN provided for the hackathon. Unfortunately this was more than I could do in the given time, as the QIAGEN knowledge graph was not a drop-in replacement for the SPOKE knowledge graph. I think there is still a lot of mileage to be gained here, though. If disease-lab can harness SPOKE and provide a portable Python library for querying it that integrates with vector storage, the interface could be extended to other knowledge graphs as well. The still-messy part seems to be writing the knowledge graph building, querying, and pruning code.

We ended up building a fine-tuned sentence embedding model. The papers were in PDF format, so we had to devise our own data cleaning and chunking methods. Ultimately we produced a fine-tuned mxbai-embed-large-v1 model and were able to embed the entire dataset. On top of this we built a RAG system that can cite the specific sources in the corpus of embedded texts and provide a provenance section in each generation. We used Meta-Llama-3-8B-Instruct as our foundation model.
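
For a rough sense of how the retrieval side fits together (not our exact pipeline), here is a minimal sketch using the base mxbai-embed-large-v1 checkpoint. The chunk contents, the helper name, and the top-k value are all illustrative placeholders:

# Sketch of retrieval over embedded paper chunks (illustrative only).
# The base checkpoint is mixedbread-ai/mxbai-embed-large-v1; the actual
# system used a fine-tuned version of it.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Each chunk keeps a reference to its source paper so answers can cite provenance.
chunks = [
    {"text": "HPP is caused by mutations in the ALPL gene...", "source": "paper_0042.pdf"},
    # ... cleaned, chunked text from the rest of the ~1,300 papers
]

corpus_embeddings = embedder.encode([c["text"] for c in chunks], convert_to_tensor=True)

def retrieve(question, k=5):
    # Embed the question and pull back the k most similar chunks.
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_embeddings, top_k=k)[0]
    return [chunks[h["corpus_id"]] for h in hits]

# The retrieved chunks (and their "source" fields) are then packed into the
# prompt for Meta-Llama-3-8B-Instruct, which generates the answer plus a
# provenance section listing the cited papers.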

We also produced a fine-tuned version of the Meta-Llama-3-8B-Instruct model that still needs further evaluation. To determine whether it is truly better than RAG with the baseline foundation model, we need to test the model with and without RAG and with and without fine-tuning.

I Demoed Too Early

Last Friday I demoed my game WorldEnder.ai and it wasn't ready yet. I had too many slides, which were useless in the demo setting, and I was rushing to explain the core invention. If I could do it over again I wouldn’t change a thing.

After the demo, everyone I talked to was engaged with my problem space and had ideas that saved me from design decisions that would have eaten up time. Because I showed a notebook and quickly talked through several lines of code, the feedback I got and the ideas we brainstormed didn’t feel threatening to the project or like a waste of time. It is still early enough to make any change we want without much cost. The algorithm and gameplay-driving mechanism, which got unanimously positive reactions, drove discussion about exciting possibilities and even got a few people wanting to contribute.

Over the weekend I got a contributor and a PR, and we are going to push it further at a hackathon next weekend. If you have a chance to demo, find something to show and do it. You might find your team that way.

Fine Tuning Pythia by Hand and Calculating PPL

6/24/2023

This is an introduction to getting your feet wet writing a training loop to fine-tune a GPT-NeoX model, specifically the 70M deduped Pythia model. After fine-tuning, I measure its perplexity (PPL), compare it to gpt2, and make some rough plots to contrast the two models. The notebook runs on a free Colab GPU instance, so it works well as a proof of concept that we can in fact fine-tune the model on new data.

Disclaimer: this model is not intended for production use

Resource type URI
Model https://huggingface.co/keppy/pythia-70m-dedupe-yt
Notebook https://gist.github.com/keppy/a5be88ea59a67a901571b6e0c3478585

Fine Tuning

Some of the more interesting bits from the fine-tuning and metric calculations are below.

This bit of code called group_texts() can be found at https://huggingface.co/docs/transformers/tasks/language_modeling:

block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

We have to run our tokenized text examples through this function to turn them into blocks of text that we can train on, so you can't just tokenize the data and use it directly in your training loop. This is obvious to me now, but when I was new to the API I found it odd that you cart this function around copy-paste style.
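
For reference, here is roughly how group_texts() gets wired in with the Datasets API. The data file and text column below are placeholders standing in for whatever corpus the notebook actually uses:

# Roughly how group_texts() plugs into the pipeline.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
dataset = load_dataset("text", data_files="my_corpus.txt", split="train")  # placeholder corpus

def tokenize(examples):
    return tokenizer(examples["text"])

# Tokenize, drop the raw text column, then pack everything into 128-token blocks.
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)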

If we take a look at the actual training loop, we see that we take our batches from train_dataloader, which was built from the dataset produced by group_texts():

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
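
The loop above leans on a few objects set up earlier in the notebook: the model, the dataloader, the optimizer, and the learning-rate scheduler. A minimal version of that setup, with illustrative hyperparameters and the lm_dataset from the grouping step, might look like this:

# Minimal setup the training loop relies on (values here are illustrative).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, default_data_collator, get_scheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m-deduped").to(device)

num_epochs = 3
train_dataloader = DataLoader(
    lm_dataset, shuffle=True, batch_size=8, collate_fn=default_data_collator
)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)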

Evaluation

By now we have done our transfer learning (I'm not that old, I swear) and we have our fine-tuned model as the result. So how do we evaluate this causal language model? We want to get the model's perplexity (PPL). I'll let you research the details, but roughly, perplexity is the exponentiated average cross-entropy loss on held-out text, so lower is better. Hugging Face has a library called evaluate which lets us compute PPL for models on the Hugging Face Hub.

What we do in the notebook is upload our newly trained model to the Hub. Then, after a round of batching and evaluation, we can compute two separate metrics, one for gpt2 and one for our model, and compare them. We do this by adding batches of decoded tokens (predictions) to our metric evaluators inside the eval loop and then computing the metrics after the loop has finished.
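
The batching that the snippet below elides looks roughly like this sketch, where eval_dataloader is an assumed held-out dataloader built the same way as train_dataloader:

# Sketch of the elided eval step: both metrics accumulate the same decoded
# texts, then compute() (below) runs each model over them.
import evaluate

gpt2_metric = evaluate.load("perplexity", module_type="metric")
yt_metric = evaluate.load("perplexity", module_type="metric")

for batch in eval_dataloader:  # eval_dataloader is assumed, like train_dataloader above
    texts = tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True)
    gpt2_metric.add_batch(predictions=texts)
    yt_metric.add_batch(predictions=texts)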

# after accumulating predictions with metric.add_batch() as sketched above:
gpt2_PPL = gpt2_metric.compute(model_id="gpt2")
yt_PPL = yt_metric.compute(model_id="keppy/pythia-70m-dedupe-yt")

Notice how we just have to pass model IDs to compute(). This lets us compare the PPL of the two models, and we can graph it.

PPL plot

The metric also gives us a mean perplexity, so we can show that as a bar chart as well.

PPL mean perplexity

Feel free to copy the notebook from the gist into your own Colab. Hopefully this is enough to get you started. The notebook walks through everything step by step, and it doesn't take long to run since I used a very small subset of the dataset.