September 17, 2024 12:19

Large Language Models (LLMs) have achieved remarkable success in many fields, but their application in the medical domain remains limited. A significant barrier is the scarcity of high-quality medical data. Unlike general datasets, which can be collected relatively easily through crowdsourcing and common-sense knowledge, medical datasets require expert knowledge and precise natural language understanding. This makes gathering sufficient, complex, and high-quality data costly and challenging.

To address this gap, an international team from RIKEN AIP, the University of Tokyo, the National Cancer Center Research Institute, The University of Texas Arlington, and the National Institutes of Health (NIH) has created the world’s largest Visual Question Answering (VQA) benchmark focused on chest X-ray images.

As shown in the example (Fig. 1), AI researchers and radiologists designed the benchmark to reflect the real-world diagnostic procedures used in clinical practice. The researchers first train an LLM to extract clinically relevant information from paired X-ray images and reports. What makes this benchmark unique is its integration of reasoning paths, which can be incorporated as Chains of Thought (CoT) specifically tailored for medical LLMs.

This collaborative effort represents a significant step forward in improving the capabilities of LLMs in healthcare, paving the way for more accurate, explainable, and reliable medical AI systems. The paper is published in Medical Image Analysis, a leading journal in medical AI.

Fig. 1. The clinical diagnostic procedure and the extraction of key clinical information using an LLM to construct a medical VQA dataset.

Paper information:
Journal: Medical Image Analysis
Title: Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
DOI: https://doi.org/10.1016/j.media.2024.103279

Authors:
Xinyue Hu, The University of Texas Arlington, USA
Lin Gu, RIKEN AIP, University of Tokyo
Kazuma Kobayashi, National Cancer Center Research Institute
Liangchen Liu, The University of Texas Arlington, USA
Mengliang Zhang, The University of Texas Arlington, USA
Tatsuya Harada, University of Tokyo, RIKEN AIP
Ronald M. Summers, National Institutes of Health Clinical Center (NIH)
Yingying Zhu, The University of Texas Arlington, USA

Medical Large Language Models (LLMs) have the potential to reduce global health inequalities, particularly in low- and middle-income countries. For example, in complex cases, a second opinion from a medical VQA system can significantly enhance the confidence of junior clinicians when specialized experts are not available. Deploying such systems also contributes to sustainable development goals (SDGs) by addressing healthcare shortages in resource-poor regions like Africa, which houses only 3% of the world’s healthcare workforce while bearing 24% of the global disease burden.

The biggest bottleneck for current medical LLMs is the limited scale and diversity of available training data. As shown in Fig. 2, existing datasets have several key limitations:

  1. They mostly focus on simple questions such as “What is the primary abnormality in this image?” or “What is seen in the image?” (see Fig. 2(c)).
  2. They spread their coverage across many modalities (e.g., MRI, CT, X-ray) and body sites (e.g., neuroimaging, chest X-rays, abdominal CT/MRI scans). However, the pathology of diseases in different body parts is highly complex and heterogeneous, so the images and their associated questions vary significantly across modalities, specialties, and diseases, and such broad but shallow coverage leaves little depth for any single clinical task.

Fig. 2. Previous medical datasets.

To overcome these limitations, the joint research team proposed a rule-based method to extract critical clinical information and developed a large-scale Difference VQA dataset [1]. This dataset addresses clinically important questions such as disease locations, severities, and treatment progressions. Initially presented at the KDD 2023 conference, the work was positively received by the medical LLM community. In response to feedback, the extended version fine-tunes the 70B-parameter Llama 2 model to extract clinical questions and answers focused on abnormalities, body location, disease severity, and more, mimicking real diagnostic practice.
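As a concrete illustration, below is a minimal Python sketch of how an instruction-tuned LLM might be prompted to extract structured question–answer pairs from a free-text radiology report. The prompt wording, the JSON schema, and the call_llm helper are illustrative assumptions, not the team's actual pipeline.

```python
# Minimal sketch (not the authors' code) of prompting an instruction-tuned LLM
# to extract clinically focused question-answer pairs from a radiology report.
import json

QUESTION_TYPES = ["abnormality", "location", "type", "level", "view"]

def build_extraction_prompt(report_text: str) -> str:
    """Compose an extraction prompt that mirrors the diagnostic workflow:
    identify abnormalities first, then their location, type, and severity."""
    return (
        "You are assisting a radiologist. Read the chest X-ray report below and "
        f"extract question-answer pairs for these categories: {', '.join(QUESTION_TYPES)}.\n"
        "Return a JSON list of objects with keys 'question', 'answer', and 'category'.\n\n"
        f"Report:\n{report_text}\n"
    )

def extract_qa_pairs(report_text: str, call_llm) -> list[dict]:
    """`call_llm` is any callable mapping a prompt string to the model's text
    completion (e.g., a fine-tuned Llama 2 70B served behind an API)."""
    raw = call_llm(build_extraction_prompt(report_text))
    return json.loads(raw)  # assumes the model returns valid JSON
```

In practice, such output would still need validation, for example by cross-checking it against the rule-based extraction, before inclusion in a benchmark.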

Fig. 3. Question type distribution.

Fig. 4. Answer type distribution.

The proposed dataset includes 780,014 question–answer pairs across categories such as abnormality, location, type, level, and view. Figures 3 and 4 illustrate the question and answer distributions. The research team also proposed an open-sourced multi-relationship graph learning method for VQA that highlights the reasoning paths used to answer questions. These reasoning paths can serve as Chains of Thought for medical LLMs and can be used to construct knowledge-driven prompts for training.
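To illustrate how reasoning paths could serve as Chains of Thought, here is a small Python sketch that turns a question–answer record carrying a reasoning path into a CoT-style training prompt. The field names and the example record are assumptions for illustration, not the dataset's actual schema.

```python
# Minimal sketch of converting a question-answer record with a reasoning path
# into a Chain-of-Thought style prompt. Field names are assumed, not the
# dataset's actual schema.
def to_cot_prompt(record: dict) -> str:
    steps = "\n".join(
        f"Step {i + 1}: {step}" for i, step in enumerate(record["reasoning_path"])
    )
    return (
        f"Question: {record['question']}\n"
        f"Let's reason step by step:\n{steps}\n"
        f"Answer: {record['answer']}"
    )

# Hypothetical example record.
example = {
    "question": "Where is the opacity located?",
    "answer": "left lower lobe",
    "reasoning_path": [
        "An opacity is visible on the frontal view.",
        "It projects over the left lower lobe.",
    ],
}
print(to_cot_prompt(example))
```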


[1] Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M. Summers, and Yingying Zhu. 2023. Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23). Association for Computing Machinery, New York, NY, USA, 4156–4165. https://doi.org/10.1145/3580305.3599819

