MALADE: Adverse drug event identification for pharmacovigilance using LLM-powered agents and RAG

Written by

Caleb Hayes

Updated on: June 19, 2025

MALADE: Orchestration of LLM-powered Agents with RAG for Pharmacovigilance

Summary

In the era of large language models (LLMs), there is an unprecedented opportunity to develop new LLM-based methods for the synthesis, extraction, and summarization of trusted medical knowledge, given their superior text understanding and generation capabilities. This paper focuses on pharmacovigilance (PhV), where an important and challenging task is identifying adverse drug events (ADEs) from diverse text resources such as medical literature, clinical notes, and drug labels. This task is hampered by multiple factors, including variation in the terminology used for drugs and outcomes, and the fact that ADE descriptions are often buried in a large volume of narrative text. We present MALADE, the first effective collaborative multi-agent system that leverages LLMs with retrieval-augmented generation (RAG) for ADE extraction from drug label data. RAG retrieves relevant information from text resources and uses it to augment the query sent to an LLM, guiding the model to generate responses consistent with the augmented data. MALADE is a general architecture that does not depend on a specific LLM. Its distinguishing features include: (1) leveraging multiple external resources such as medical literature, drug labels, and FDA tools (e.g., the OpenFDA Drug Information API), (2) extracting the associations between drugs and outcomes, along with the strength of those associations, in a structured format, and (3) providing explanations for the established associations. MALADE is instantiated using GPT-4 Turbo or GPT-4o and FDA drug label data, and its effectiveness is demonstrated by an area under the ROC curve (AUC) of 0.90 against the OMOP ground truth table of ADEs. The implementation leverages the Langroid multi-agent LLM framework.

Introduction

Pharmacovigilance (PhV) is the science of identifying and preventing adverse drug events (ADEs) caused by medicinal products after they have been marketed. Pharmacovigilance is extremely important to the pharmaceutical industry and public health as it aims to protect the well-being of patients by detecting new safety issues and intervening when necessary.

A core problem in PhV is adverse drug event (ADE) extraction: given a drug class C and an adverse event E, determine whether (and to what extent) C is associated with E. This task requires the analysis of large amounts of text data from a variety of sources, such as patient medical records, clinical notes, social media, spontaneous reporting systems, drug labels, medical literature, and clinical trial reports. Beyond the sheer volume of text from these sources, ADE extraction is further complicated by the variability of drug and outcome names, and by the fact that ADE descriptions are often buried in large amounts of narrative text [14].

Traditionally, various classical natural language processing (NLP) and deep learning techniques have been used to address this problem [22, 21, 35, 2]. Compared to classical NLP methods, today's best LLMs (and even weaker open-source or local LLMs [36, 11]) offer significantly better text understanding and generation capabilities. Leveraging these models can not only improve existing ADE extraction methods, but also make it possible to use data sources that were previously out of reach. Recent attempts to apply LLMs to ADE extraction have only used off-the-shelf ChatGPT [38]; their performance is limited and their reasoning for extraction is inconsistent [32]. These limitations stem primarily from two factors: (a) accurate ADE extraction requires access to specific data sources that LLMs may not have "seen", so relying only on the "built-in" knowledge acquired during pre-training can produce inaccurate results; and (b) since LLMs are probabilistic next-word predictors, they can produce erroneous or unreliable results if the task is not carefully decomposed into simpler subtasks or if there is no mechanism to verify and correct their responses.

In this paper, we present MALADE (Multi-Agent LLM-Powered Adverse Event Extraction), the first effective multi-agent retrieval-augmented generation (RAG) system for ADE extraction. Our approach addresses the above two limitations using two key techniques: (a) RAG, which augments the input query with relevant textual data snippets and prompts an LLM to generate responses consistent with the augmented information [15]; and (b) strategic coordination of multiple LLM-based agents, each responsible for a relatively small subtask of the overall ADE extraction task [41]. Specifically, our system has agents dedicated to the following subtasks (see Figure 1): (1) identifying representative drugs of each drug class from a medical database (e.g., MIMIC-IV), (2) collecting side effect information about these drugs from an external textual knowledge base (e.g., the FDA drug label database), and (3) writing a final answer summarizing the effect of the drug class on the adverse event. Each agent is assigned a specific subtask and collaborates with other agents to achieve the ultimate goal of ADE identification. In addition, we further improve the reliability of the multi-agent system by pairing each agent with a critic agent whose role is to verify the behavior and responses of its counterpart.

Although applied here specifically to adverse drug event (ADE) extraction, the system demonstrates how a multi-agent approach can be used to generate credible, evidence-based summaries and confidence scores for challenging medical problems that require synthesis of evidence from multiple sources of clinical knowledge and data. Thus, MALADE can be viewed as a case study illustrating an approach that may subsequently be applied to other problems in pharmacovigilance (PhV), including identification of possible drug interactions, as well as to clinical problems outside of PhV, such as identifying symptoms of pathologies of interest in clinical records.

In summary, our paper makes the following contributions.

Rigorous quantitative evaluation. Unlike simpler systems that generate only a binary label indicating whether drug class C is associated with adverse event E, our approach generates several scores, including a confidence score that represents how confident the LLM is in its label assignment. These scores allow us to perform a rigorous quantitative evaluation against the established Observational Medical Outcomes Partnership (OMOP) ground truth table of ADEs associated with common drug classes [19]. We achieve an area under the ROC curve (AUC) of approximately 0.85 using GPT-4 Turbo and 0.90 using GPT-4o (Section 5). To our knowledge, this is the best performance among the baseline methods, although direct comparisons may be limited.

Well-reasoned responses and justifications. MALADE is designed to provide key features that are critical for high-risk applications such as adverse drug event (ADE) identification: (1) A structured format for drug-outcome associations, including scores representing the strength of association and the rarity of the adverse event; this is important for ensuring robust downstream processing of the extracted association information. (2) Justifications for the extracted drug-outcome associations, so that human experts can understand and validate these associations. This is made possible by the RAG component in the MALADE architecture, which allows for the use of a variety of external sources such as medical literature, drug labels, FDA tools (e.g., the OpenFDA Drug Information API), as well as common clinical data sources such as OMOP or PCORI, and even specific electronic health record (EHR) systems where available. (3) Observability, i.e., complete and detailed logs of inter-agent conversations and intermediate steps; these are critical for debugging and auditing system behavior.

Generalizable insights on machine learning in healthcare. Our proposed multi-agent architecture is not tied to any particular LLM or data source, and is based on design primitives that are intended to be general building blocks for coordinating multiple LLM-based agents (Section 3). Thus, while MALADE is specifically designed for adverse drug event (ADE) identification, our design methodology provides a general blueprint for effectively building multi-agent systems for trusted medical knowledge synthesis and summarization, with a wide range of healthcare applications.

Core Overview

Background

Research Question: The problem addressed in this paper is how to extract adverse drug event (ADE) information from drug label data. Pharmacovigilance (PhV) is the science of identifying and preventing adverse drug events caused by drugs after they are marketed, and its importance lies in protecting the health of patients.
Research Difficulties: The difficulties of this problem include inconsistent terminology for drugs and outcomes, ADE descriptions that are often buried in large amounts of narrative text, and the limitations of existing natural language processing (NLP) and deep learning techniques in handling these complex tasks.
Related Work: Related work on this problem includes causal discovery methods developed within large-scale research programs such as Sentinel, OMOP, and OHDSI, and research on building ADE prediction models using social forums. Recent studies have attempted to apply large language models (LLMs) to ADE extraction, but suffer from the knowledge limitations and inconsistent reasoning of a single LLM.

Research Methods

This paper proposes MALADE, the first effective multi-agent system that leverages LLMs and retrieval-augmented generation (RAG) for ADE extraction from drug label data. Its main components are:

Retrieval-Augmented Generation (RAG): RAG augments a query with relevant textual data before it is sent to the LLM, and guides the LLM to generate answers consistent with the augmented data. The basic idea is that when a query is posed to the LLM agent, the most relevant document fragments are retrieved from the document store, the original query is combined with these fragments to form a new prompt, and the LLM is then asked to answer the original query based on these fragments.
Multi-agent system: The MALADE system consists of multiple LLM-driven agents, each responsible for a relatively small subtask. The specific subtasks include identifying representative drugs from medical databases, collecting drug side effect information from external text knowledge bases, and synthesizing the effect of drug categories on adverse health outcomes. Each agent is paired with a corresponding critic agent, which verifies the behavior and responses of the main agent.
Agent-Critic Interaction: The Agent-Critic interaction is the core design pattern of the MALADE system. The Agent is responsible for handling external inputs and outputs, while the Critic checks the Agent's reasoning steps and its compliance with instructions, and provides feedback. The Agent iteratively revises its response based on the feedback until the Critic is satisfied.
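The Agent-Critic loop described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not MALADE's or Langroid's actual code: the names `Feedback`, `agent_critic_loop`, `stub_agent`, and `stub_critic` are hypothetical, the stubs stand in for LLM calls, and the `max_rounds` cap is an added assumption (the description above only says the loop continues until the Critic is satisfied).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Feedback:
    """Critic verdict: accept the draft, or reject it with a comment."""
    accepted: bool
    comment: str = ""


def agent_critic_loop(
    task: str,
    agent: Callable[[str, Optional[str]], str],
    critic: Callable[[str, str], Feedback],
    max_rounds: int = 5,  # assumption: cap the number of revision rounds
) -> str:
    """Ask the agent for a draft, let the critic review it, retry with feedback."""
    feedback: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):
        draft = agent(task, feedback)
        review = critic(task, draft)
        if review.accepted:
            return draft
        feedback = review.comment
    # The critic never accepted; return the last draft so the caller can flag it.
    return draft


# Hypothetical stubs standing in for LLM-backed agents.
def stub_agent(task: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "Drug X increases the risk of Y."
    return "Drug X increases the risk of Y (source: label section 6)."


def stub_critic(task: str, draft: str) -> Feedback:
    if "source:" in draft:
        return Feedback(accepted=True)
    return Feedback(accepted=False, comment="Cite the passage that supports the claim.")


result = agent_critic_loop("Does drug X increase the risk of Y?", stub_agent, stub_critic)
print(result)
```

In this toy run the first draft is rejected for lacking a citation and the second draft is accepted. Capping the rounds is a design choice that keeps a stubborn Critic from looping forever.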

Experimental design

Data Collection: The experiments used the OMOP evaluation ground truth task (OMOP ADE task), which assigns one of three labels to each (drug class, health outcome) pair: “increase”, “decrease”, or “no effect”.
Experimental setup: Two LLMs were evaluated: GPT-4 Turbo and GPT-4o. For each LLM, AUC and F1 score analyses were performed for effect-based classification and ADE-based classification.
Experimental procedures:

STEP 1: An extensive list of drugs belonging to the drug class is obtained by querying the FDA's National Drug Code (NDC) database, and the three most common drugs are then selected using prescription rates from the MIMIC-IV clinical database.
STEP 2: For each representative drug, an agent (DrugAgent) generates a free-text summary of the drug's effects on health outcomes, grounded in up-to-date external drug reference sources (such as the FDA drug label database).
STEP 3: CategoryAgent combines the drug-level information to generate a structured report on the effect of the drug category on the health outcome, including a label, confidence score, risk level, and strength of evidence.
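The selection in STEP 1 amounts to a frequency ranking. The sketch below assumes the NDC lookup has already produced a list of candidate drug names and that prescriptions are available as a flat list of drug names; the function `top_prescribed` and all data values are made up for illustration and do not come from the NDC database or MIMIC-IV.

```python
from collections import Counter
from typing import Iterable, List


def top_prescribed(
    candidates: Iterable[str],
    prescriptions: Iterable[str],
    k: int = 3,
) -> List[str]:
    """Return the k candidate drugs that appear most often in the prescriptions."""
    wanted = {name.lower() for name in candidates}
    # Count only prescriptions for drugs that belong to the class of interest.
    counts = Counter(p.lower() for p in prescriptions if p.lower() in wanted)
    return [name for name, _ in counts.most_common(k)]


# Toy data (hypothetical): candidate drugs for one class, plus prescription records.
candidates = ["lisinopril", "enalapril", "captopril", "ramipril", "benazepril"]
prescriptions = (
    ["Lisinopril"] * 5
    + ["captopril"] * 3
    + ["enalapril"] * 2
    + ["ramipril"]
    + ["metformin"] * 4  # not in the class, so it is ignored
)

representatives = top_prescribed(candidates, prescriptions)
print(representatives)
```

Matching on lowercase names is a simplification; a real pipeline would likely need to normalize brand names, generic names, and dosage forms before counting.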

Results and Analysis

ADE identification performance: MALADE performs well in distinguishing ADEs from non-ADEs, with effect-based AUC and F1 scores of 0.851 and 0.609 (GPT-4 Turbo), and ADE-based AUC and F1 scores of 0.851 and 0.556 (GPT-4 Turbo).
Effectiveness of Agent-Critic Interaction: Ablation experiments show that the Critic significantly improves the reliability of the system, especially in the absence of strong evidence (i.e., when the ground truth is "no effect").
Insights provided by justifications: The justifications provided by MALADE are consistent with the reasoning of human experts and help reveal the failure modes of the system. For example, CategoryAgent occasionally overestimates the risk of a drug category based on weak evidence.
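To make the AUC metric concrete, the sketch below computes it from confidence scores using the pairwise-ranking definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The data is a toy example and does not reproduce the reported results. Mapping the three labels to a binary target by treating "increase" as the positive class is an assumption about how the ADE-based setting is binarized.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy ground truth labels and made-up model scores for P(label == "increase").
truth = ["increase", "no effect", "increase", "decrease", "no effect"]
scores = [0.9, 0.2, 0.6, 0.7, 0.1]

# Assumed binarization: "increase" is the positive (ADE) class.
ade_labels = [1 if t == "increase" else 0 for t in truth]
auc = roc_auc(ade_labels, scores)
print(round(auc, 3))
```

The quadratic pairwise loop is fine for a small benchmark table; for large datasets a rank-based formulation or a library routine would be the usual choice.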

Overall conclusion

The MALADE system proposed in this paper significantly improves the accuracy and reliability of extracting ADE information from drug label data through multi-agent collaboration and retrieval-augmented generation technology. MALADE not only performs well in ADE identification tasks, but also provides a general multi-agent system architecture for future pharmacovigilance research and broader medical tasks.

Paper Evaluation

Advantages and innovations

Multi-agent architecture: MALADE is the first effective multi-agent retrieval-augmented generation (RAG) system specifically for adverse drug event (ADE) extraction from drug label data.
External knowledge utilization: The system is able to leverage multiple external resources such as medical literature, drug labels, and FDA tools (e.g., the OpenFDA Drug Information API), extending the knowledge available to the LLM.
Structured output: The system generates structured reports containing labels, confidence scores, strength of evidence, and rarity of drug-outcome associations to facilitate downstream processing and analysis.
Explainability: The system provides explanations for the established associations, enabling human experts to understand and verify them.
Reliability enhancement: By introducing a critic agent to verify the output of the main agent, the reliability of the system is significantly improved.
Versatility: The design approach of MALADE is not only applicable to ADE extraction, but can also be extended to other pharmacovigilance (PhV) problems and even to clinical problems outside of PhV.

Shortcomings and reflections

Reliance on text data: The system relies entirely on FDA label data in text form and cannot reliably identify the strength of an association if the information is not explicitly included in the label.
Future work directions: Future work includes extracting ADEs from electronic health record (EHR) data and performing detailed evaluations using local open-source LLMs such as LLaMA, Grok, and Mistral.
Manual input at the initial step: The system requires some minimal human input at the initial steps, such as converting the drug class name into the form expected by the FDA database.
Increased structured input and output: Increasing the use of structured input and output may improve the reliability of DrugAgent, for example by enforcing the presence of certain information instead of relying on free-text output.

Key questions and answers

Question 1: How does the MALADE system use retrieval-augmented generation (RAG) to improve accuracy when extracting adverse drug events (ADEs) from drug label data?

Retrieval phase: Retrieve the most relevant document fragments from the document store. These document fragments can be external text data from the FDA drug label database, MIMIC-IV clinical database, etc.
Augmentation phase: Merge the retrieved document fragments with the original query to form a new prompt. For example, if the original query is "Does drug X increase the risk of condition Y?", the augmented prompt might be "Given the passages below: [document passages], answer this question: Does drug X increase the risk of condition Y based ONLY on these passages, and indicate which passages support your answer."
Generation phase: Guide the large language model (LLM) to generate answers based on the augmented prompt. The answers generated by the LLM should be consistent with the retrieved document fragments and cite these fragments as evidence.

In this way, RAG not only supplies up-to-date knowledge that the LLM may not have acquired during pre-training, but also makes it possible to cite evidence, thereby improving the accuracy and reliability of ADE extraction.
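The retrieval and augmentation phases can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the word-overlap ranking in `retrieve` is a deliberately simple stand-in for whatever retriever a real system uses (typically embedding-based search), the passages are invented, and the generation phase is omitted because it would require an LLM call.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank passages by word overlap with the query and keep the top k."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Combine numbered passages with the query, following the template above."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        f"Given the passages below:\n{numbered}\n"
        "answer this question based ONLY on these passages, "
        f"and indicate which passages support your answer: {query}"
    )


# Invented passages standing in for drug label text.
store = [
    "Drug X may increase the risk of condition Y in elderly patients.",
    "Store drug X at room temperature.",
    "Condition Y was reported in trials of drug X.",
    "Drug Z is contraindicated in pregnancy.",
]
query = "Does drug X increase the risk of condition Y?"

top = retrieve(query, store)
prompt = build_prompt(query, top)
print(prompt)
```

Numbering the passages in the prompt is what lets the model's answer point back to specific evidence, which is the citation ability described above.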

Question 2: How is the multi-agent architecture in the MALADE system designed? What are the specific responsibilities of each agent?

The MALADE system consists of multiple LLM-driven agents, each of which is responsible for a relatively small subtask. The specific responsibilities are as follows:

DrugFinder: Finds an extensive list of drugs belonging to the drug class from the FDA's National Drug Code (NDC) database and selects the three most common drugs using prescription rates from the MIMIC-IV clinical database.
DrugAgent: For each representative drug, generates a free-text summary of its effect on health outcomes. It consults up-to-date external drug reference sources (such as the FDA drug label database) and generates a summary that includes the risk level and strength of evidence.
CategoryAgent: Combines drug-level information to generate structured reports. Reports include labels for the drug class’s effect on health outcomes (such as “increases,” “decreases,” or “no effect”), confidence scores, risk levels, and strength of evidence.
Critic: Each agent is paired with its corresponding critic agent, which verifies the behavior and responses of the primary agent. The critic provides feedback to help the agent improve its generated answers until its answer is accepted.

Through this multi-agent collaboration approach, the MALADE system is able to effectively decompose complex tasks and leverage the collective knowledge and expertise of multiple agents to improve the accuracy and reliability of ADE extraction.
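As an illustration of what a structured CategoryAgent report might look like, the sketch below defines a small validated record and parses it from JSON. The class name `CategoryReport`, the field names, the 0 to 1 confidence range, and the example values are all assumptions made for this sketch; the paper's actual schema may differ.

```python
import json
from dataclasses import dataclass

# Label set taken from the OMOP ADE task described in the text.
ALLOWED_LABELS = ("increase", "decrease", "no effect")


@dataclass(frozen=True)
class CategoryReport:
    """Hypothetical structured report for one (drug category, outcome) pair."""
    drug_category: str
    outcome: str
    label: str
    confidence: float       # assumed to lie in [0, 1]
    risk_level: str
    evidence_strength: str
    justification: str

    def __post_init__(self) -> None:
        # Reject malformed reports early so downstream code can trust the fields.
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


def parse_report(raw: str) -> CategoryReport:
    """Parse a JSON string (e.g. an LLM response) into a validated report."""
    return CategoryReport(**json.loads(raw))


# Made-up example response.
raw = json.dumps({
    "drug_category": "drug class C",
    "outcome": "adverse event E",
    "label": "increase",
    "confidence": 0.8,
    "risk_level": "rare",
    "evidence_strength": "strong",
    "justification": "Two of three representative drug labels list E as an adverse reaction.",
})
report = parse_report(raw)
print(report.label, report.confidence)
```

Validating at parse time gives a Critic-style check a concrete failure to report, for example an out-of-vocabulary label, instead of silently passing bad output downstream.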

Question 3: How does the MALADE system perform in experiments? What are its advantages compared with other methods?

ADE identification performance: MALADE performs well in distinguishing ADEs from non-ADEs. The effect-based AUC and F1 scores are 0.851 and 0.609 (GPT-4 Turbo), and the ADE-based AUC and F1 scores are 0.851 and 0.556 (GPT-4 Turbo). These results show that MALADE can effectively identify the association between drug classes and health outcomes.
Effectiveness of Agent-Critic Interaction: Ablation experiments show that the Critic significantly improves the reliability of the system, especially in the absence of strong evidence (i.e., when the ground truth is "no effect"). This shows that the Agent-Critic interaction pattern plays a key role in improving the accuracy of LLM-generated answers.
Insights provided by justifications: The justifications provided by MALADE are consistent with the reasoning of human experts and help reveal the failure modes of the system. For example, CategoryAgent occasionally overestimates the risk of a drug category based on weak evidence. Exposing such cases provides valuable feedback for improving the system.

Compared with other methods, the advantage of the MALADE system is that it combines multi-agent collaboration with retrieval-augmented generation, which can provide more accurate and reliable results on complex tasks. In addition, the design principles and implementation approach of MALADE can be extended to other medical tasks and pharmacovigilance research, offering a more general solution.