Research to the People and Stanford Medicine's Rare Disease AI Hackathon

A week ago I flew down to SF to present at GitHub HQ for a rare disease AI hackathon that aims to jump-start research on applying LLMs, and ML in general, to rare disease diagnosis and treatment. The two diseases being researched were Ehlers-Danlos syndrome and hypophosphatasia (HPP). Over two months, my team focused solely on HPP. We developed a fine-tuned model for answering questions about HPP using the ~1,300-paper data source provided by the event organizers (shout-out to Pete Kane).

Initially I set out to implement techniques from KG RAG and then build on them to integrate a knowledge graph that QIAGEN provided for the hackathon. Unfortunately this was more than I could do in the given time, as the QIAGEN knowledge graph was not a drop-in replacement for the SPOKE knowledge graph. I think there is still a lot of mileage to be gained here, though. If disease-lab can harness SPOKE and provide a portable Python library for querying it that integrates with vector storage, the interface could be extended to other knowledge graphs as well. The still-messy part seems to be writing the KG building, querying, and pruning code.

We came up with a fine-tuned sentence embedding model. The papers were in PDF format, so we had to devise our own data cleaning and chunking methods. Ultimately we ended up with a fine-tuned mxbai-embed-large-v1 model and were able to embed the entire dataset. On top of this we built a RAG system that can cite the specific sources in the corpus of embedded texts and provides a provenance section in each generation. We used Meta-Llama-3-8B-Instruct as our foundation model.
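To make the chunk, embed, retrieve, and cite flow concrete, here is a minimal sketch in plain Python. It is not our actual pipeline: the `embed` function is a toy bag-of-words stand-in for the fine-tuned mxbai-embed-large-v1 model, and the names `Chunk`, `chunk_text`, `retrieve`, and `build_prompt`, along with the chunk sizes, are assumptions made up for illustration. The point is only to show how keeping the source alongside every chunk lets the system emit a provenance section.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical paper identifier, e.g. a file name
    index: int   # position of the chunk within its paper
    text: str


def chunk_text(source, text, size=50, overlap=10):
    """Split text into overlapping windows of `size` words (sizes are illustrative)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [Chunk(source, i, " ".join(words[s:s + size])) for i, s in enumerate(starts)]


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c.text)), reverse=True)[:k]


def build_prompt(query, hits):
    """Build a prompt with numbered context plus a matching provenance section."""
    context = "\n".join(f"[{n}] {c.text}" for n, c in enumerate(hits, 1))
    provenance = "\n".join(
        f"[{n}] {c.source}, chunk {c.index}" for n, c in enumerate(hits, 1)
    )
    prompt = (
        "Answer using only the numbered context and cite it.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return prompt, provenance
```

In a real system the prompt would go to the LLM and the provenance string would be appended to the generation, so every numbered citation in the answer maps back to a paper and chunk.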

We also produced a fine-tuned version of the Meta-Llama-3-8B-Instruct transformer model, which still needs further evaluation. To determine whether it is truly better than RAG with the baseline foundation model, we need to test all four combinations: with and without RAG, and with and without fine-tuning.
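The comparison amounts to a 2x2 ablation. A rough sketch of the harness could look like the following, where `answer_fn` and `score_fn` are hypothetical callables you would supply yourself (the first wraps whichever model and retrieval setup is selected, the second compares an answer to a reference), and the `question` key is an assumed layout for the evaluation set. We have not run this evaluation yet, so no results are implied.

```python
from itertools import product


def run_ablation(questions, answer_fn, score_fn):
    """Average score for every (fine_tuned, use_rag) combination on the same questions.

    questions: list of dicts with at least a "question" key (assumed layout).
    answer_fn: hypothetical callable (question, fine_tuned=..., use_rag=...) -> str.
    score_fn:  hypothetical callable (question_dict, answer) -> float.
    """
    if not questions:
        raise ValueError("need at least one evaluation question")
    results = {}
    for fine_tuned, use_rag in product([False, True], repeat=2):
        scores = [
            score_fn(q, answer_fn(q["question"], fine_tuned=fine_tuned, use_rag=use_rag))
            for q in questions
        ]
        results[(fine_tuned, use_rag)] = sum(scores) / len(scores)
    return results
```

Running all four cells on the same question set is what separates the effect of fine-tuning from the effect of retrieval; comparing only the fine-tuned model against baseline RAG would confound the two.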