
Promise of LLMs in Trial-Patient Matching

In a previous post, we indicated that developments in Machine Learning (ML) and Natural Language Processing (NLP), in particular Pre-trained Language Models (PLMs), have led to a paradigm shift in eligibility pre-screening for clinical trials, with state-of-the-art performance. In this post, we will look at different AI-based approaches to improve the efficiency of pre-screening for clinical trials. The patient-trial matching task is pursued from two different directions:

1) Trial-centric trial-to-patient matching is pursued by organizations conducting the clinical trial. Here, the clinical trial criteria are used to query patient databases and rank each patient for eligibility.

2) Patient-centric patient-to-trial matching allows patients and referring organizations to identify a list of clinical trials for which the patient is an eligible candidate. Here, clinical trial databases are queried with the patient's details to rank each trial for which the patient would be a likely candidate. Notwithstanding this directionality, the same set of capabilities is needed to perform the patient-trial matching task in either direction.


LLM-Based Screening

The advent of transformer-based Large Language Models (LLMs) has revolutionized NLP with powerful tools. Prior NLP techniques, which rely on information extraction through named entity recognition, negation detection, and relation extraction, require extensive feature engineering and sometimes fail to interpret semantic relationships accurately; these characteristics potentially limit the scalability and generalizability of such NLP models in patient-trial matching tasks. By contrast, transformer-based LLMs that have been trained to perform a generic task using unlabeled text have demonstrated advanced reasoning and semantic understanding capabilities. Recent demonstrations of screening for clinical trials using LLMs have achieved state-of-the-art performance on benchmarking datasets. The typical process for LLM-based patient-trial screening is represented below:

  1. The patient notes and clinical trial criteria are fed to the LLM in the prompt
  2. The LLM generates an output (typically in JSON format) that contains an explanation for each criterion, reference sentences, and the decision whether the patient met or did not meet the eligibility criterion.
  3. A scoring function then uses this output to generate a score indicating the eligibility of the patient for the trial
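To make this flow concrete, below is a minimal sketch of such a pipeline in Python. The prompt wording, the call_llm() helper, and the simple fraction-of-criteria-met score are illustrative assumptions, not a specific published implementation.

```python
import json

def build_prompt(patient_note: str, criteria: list[str]) -> str:
    """Assemble a prompt containing the patient note and the trial's eligibility criteria."""
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are screening a patient for a clinical trial.\n"
        f"Patient note:\n{patient_note}\n\n"
        f"Eligibility criteria:\n{criteria_text}\n\n"
        "For each criterion, return JSON with fields: criterion, explanation, "
        "reference_sentences, and met (true/false/unknown)."
    )

def score_eligibility(llm_json: str) -> float:
    """Convert the LLM's per-criterion decisions into a single eligibility score."""
    results = json.loads(llm_json)
    met = sum(1 for r in results if r["met"] is True)
    return met / len(results) if results else 0.0

# call_llm() is a placeholder for whichever LLM API is being used.
# prompt = build_prompt(patient_note, trial_criteria)
# llm_output = call_llm(prompt)
# score = score_eligibility(llm_output)
```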

Besides deploying a standalone LLM to facilitate this process, other architectures include a retrieval pipeline for information extraction or structuring before the data is fed to the LLM. These reported examples of initial forays into patient-trial screening show that LLMs are able to provide coherent explanations for over 90% of their correct decisions, further validating their promise in improving the efficiency of tasks related to clinical trials. Below, we summarize some of the capabilities that make LLMs well suited to clinical trial-patient matching tasks.

Scalable

To carry out a classification task, typical ML models are trained and validated on labeled samples before the model is run on unlabeled samples. For a different classification task, the model must be (re)trained on a new set of labeled data. These training processes can be expensive, and the availability of labeled data is often a bottleneck. These challenges limit the scalability and generalizability of such ML models. By contrast, LLMs have demonstrated state-of-the-art performance on NLP tasks with only few-shot or even zero-shot examples in the prompts.

Few-shot learning involves providing the LLM with a few examples to help it understand and perform a specific task. In the developing area of trial-patient matching, one- and three-shot approaches have been used successfully on limited datasets. However, such an approach increases the task’s cost (see below). Zero-shot learning, meanwhile, enables LLMs to perform tasks without prior examples or task-specific training; this capability allows LLMs to be scaled readily to any trial-patient matching task. Encouragingly, several studies report that the zero-shot performance of these LLMs in trial-patient matching is on par with or better than that of current state-of-the-art models. These early studies, notwithstanding their limitations, highlight the potential of LLMs to perform any trial-patient matching task with little or no training data, allowing for ready scalability and generalizability.
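As a rough illustration of the difference, the sketch below shows a zero-shot prompt and a one-shot prompt for a single eligibility criterion. The wording and the worked example are hypothetical, chosen only to show where the extra tokens of few-shot prompting come from.

```python
criterion = "Patient must be 18 years of age or older."
note = "72-year-old female with stage II breast cancer..."

# Zero-shot: the task instruction alone, with no worked examples.
zero_shot_prompt = (
    f"Criterion: {criterion}\n"
    f"Patient note: {note}\n"
    "Does the patient meet this criterion? Answer met / not met / unknown and explain."
)

# One-shot: the same instruction preceded by one worked example,
# which guides the model but adds input tokens (and therefore cost).
one_shot_prompt = (
    "Example:\n"
    "Criterion: Patient must have a confirmed diagnosis of type 2 diabetes.\n"
    "Patient note: 55-year-old male, HbA1c 8.1%, on metformin.\n"
    "Answer: met - the note documents type 2 diabetes managed with metformin.\n\n"
    + zero_shot_prompt
)
```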

Interpretable

Prior use of ‘black-box’ machine learning models to solve clinical tasks has seen limited adoption because the models fail to provide the rationale behind their decisions, thereby alienating clinicians. By contrast, when LLMs are prompted using Chain-of-Thought (CoT) reasoning, they can generate a step-by-step explanation, allowing a clinician in the loop to evaluate the natural-language output.

The CoT prompting technique was developed to overcome a limitation of LLMs. While increasing LLM size improves performance on intuitive tasks (system 1), such as sentiment analysis and topic classification, it does not significantly enhance performance on tasks requiring deliberate thinking (system 2), such as math word problems. Experiments conducted by researchers at Google Brain show that the performance of LLMs on these system-2 tasks can be improved using CoT prompting. An example from the study is included in Figure 1. The CoT prompting strategy induces the LLM to generate a series of short sentences that mimic the reasoning process, providing an interpretable window into the model’s behavior.

Figure 1: An example of a Chain-of-Thought prompt. The example included in the prompt induces the LLM to rationalize its approach and generate the correct output.

In a trial-patient matching task, CoT prompting can elicit LLM responses that include snippets of the patient notes through which the model rationalizes its decision for each eligibility criterion. This natural-language output is readily reviewable by a clinician. Such a clinician-in-the-loop deployment of trial-patient matching algorithms is well positioned for adoption.
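A sketch of what such a CoT-style instruction might look like for a single criterion is shown below; the exact phrasing is an assumption for illustration, not a prompt taken from any of the cited studies.

```python
# Hypothetical CoT-style prompt template for one eligibility criterion;
# {patient_note} would be filled in with the retrieved note text.
cot_prompt = """You are screening a patient for a clinical trial.

Criterion: Patient must not have received chemotherapy in the last 6 months.

Patient note:
{patient_note}

Think step by step:
1. Quote the sentences from the note that mention chemotherapy or recent treatments.
2. Compare the dates in those sentences against the 6-month window.
3. State your conclusion as met / not met / unknown, citing the quoted sentences.
"""
```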

Efficient

The cost of using LLMs is based on the number of input and output tokens used to complete a task. A typical prompt contains many tokens (instructions, definitions of the criteria, and examples) besides the patient note being evaluated. The initial reported trial-patient matching studies were done using synthetic datasets, many of which contained only 5-10 sentences per evaluated patient. In these cases, the entire medical note could be included in the prompt; the cost of patient-trial matching with such examples could be lower than $2 per enrolled patient.
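A back-of-the-envelope cost estimate can be computed directly from the token counts; the per-token prices below are placeholder assumptions, not the rates of any particular provider.

```python
# Placeholder prices (USD per 1,000 tokens); substitute the chosen LLM provider's rates.
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03

def prompt_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one LLM call from its token counts."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# Example: a short synthetic note plus criteria (~2,000 input tokens)
# and a structured JSON answer (~500 output tokens).
print(round(prompt_cost(2000, 500), 3))  # ~0.035 USD for one screening call
```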

In actual clinical settings, data relevant to determining trial eligibility (e.g., for a cancer trial) is dispersed across various notes: Treatment Plans, Telephone Encounters, Surgery, Progress, Procedures, Plans of Care, Patient Instructions, Pathology, Operation Notes, Medical Oncological Orders, Imaging, History and Physical, Encounters, ED Provider Notes, Discharge Summaries, and Consultations. Simply concatenating all patient data into the LLM query can exceed the context size of many LLMs, escalating costs and potentially overwhelming the LLM with irrelevant information.

To reduce token usage while still including the information most relevant to the criteria, researchers have architected solutions that pre-filter the patient data before the LLM is prompted for the match with the trial criteria. The pre-filtering is executed using a small, inexpensive embedding model. After both the patient notes and the eligibility criteria are embedded in the retrieval system, the cosine similarities between the two are calculated, and the top-k ranked notes are retained for each patient. These top-k patient notes from the retrieval system and the trial eligibility criteria are then used in the trial-patient matching prompt with the more expensive LLM. This approach ensures that only the relevant patient chunks are included in the prompt, reducing token count and cost.
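A minimal sketch of this retrieval step is given below, assuming a generic embed() helper that returns one vector per text; the helper and the top-k value are illustrative assumptions rather than a specific published pipeline.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_notes(note_texts, note_vectors, criteria_vector, k=5):
    """Rank patient notes by similarity to the embedded eligibility criteria and keep the top k."""
    scored = [
        (cosine_similarity(vec, criteria_vector), text)
        for text, vec in zip(note_texts, note_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# note_vectors = [embed(n) for n in note_texts]   # embed() is whichever small embedding model is chosen
# criteria_vector = embed(trial_criteria_text)
# relevant_notes = top_k_notes(note_texts, note_vectors, criteria_vector, k=5)
```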

Researchers have also experimented with structuring the patient and clinical trial data into JSON format before they are used in the LLM prompt. However, this approach might be reserved for specialized trial-patient matches, as it is not readily generalizable given the diversity of trial eligibility criteria.
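For illustration, a structured patient record of this kind might look like the hypothetical snippet below; the field names are assumptions and would need to mirror a specific trial's criteria, which is precisely why the approach generalizes poorly.

```python
# Hypothetical structured patient record passed to the LLM instead of free-text notes.
structured_patient = {
    "age": 72,
    "diagnosis": "stage II breast cancer",
    "ecog_performance_status": 1,
    "prior_chemotherapy": False,
    "lab_values": {"creatinine_mg_dl": 0.9, "hemoglobin_g_dl": 12.4},
}
```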

Limitations to Overcome

Further research and real-world evaluations are needed to fully validate the scalability and reliability of LLM-based screening systems. Setting up a HIPAA-compliant environment to carry out these tasks is a non-trivial effort, and concerns about data leakage remain when using these LLMs, which are often SaaS offerings. The use of Real-World Data (RWD), i.e., data spanning all patient encounters, and its abstraction for use in LLM prompts may surface fresh challenges.

Another potential limitation to the generalizability of these screening LLMs is the diversity in the specificity of trials' eligibility criteria. Lack of specificity in the eligibility criteria will require model fine-tuning or elaborate prompt engineering, which can increase the cost of deployment. These challenges notwithstanding, the evaluated LLMs show significant promise for transforming patient screening in clinical trials with improved accuracy, increased efficiency, explainable decisions, scalability, and cost-effectiveness.

July 11, 2024