SurgTEMP: Temporal-Aware Surgical Video Question Answering
with Text-guided Visual Memory

Shi Li1,2*, Vinkle Srivastav1,2, Nicolas Chanel1,2, Saurav Sharma1,2, Nabani Banik1,2, Lorenzo Arboit1,2, Kun Yuan1,2, Pietro Mascagni1,2,3†, Nicolas Padoy1,2†
1University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France   2IHU Strasbourg   3Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
*Corresponding author  ·  †Co-last authors
SurgTEMP model architecture overview

SurgTEMP is a multimodal LLM that builds a hierarchical visual memory (spatial and temporal) guided by the text query. It is trained with a Surgical Competency Progression (SCP) scheme that advances from basic perception to safety assessment and clinical reasoning across 11 laparoscopic cholecystectomy tasks.

Abstract

Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) hold promise for education and intraoperative support.

Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment.

To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks.

To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (~128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy — Perception, Assessment, and Reasoning — spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment.

In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.

Highlights

🧠

SurgTEMP Model

Multimodal LLM for surgical video VQA across 11 clinically relevant tasks, combining spatial and temporal memory banks guided by the text query.

🔍

TEMP Constructor

Text-guided attention selects the most query-relevant frames and patches, building a hierarchical spatial–temporal visual memory pyramid.

📈

SCP Training

Surgical Competency Progression progressively builds perception, assessment, and reasoning capabilities — mirroring clinical training.

🗂️

CholeVidQA-32K

32K open-ended QA pairs from 3,855 laparoscopic cholecystectomy segments (~128 h) spanning a 3-level task hierarchy.

Text-guided Memory Pyramid (TEMP)

TEMP constructor — cross-modal attention, spatial and temporal memory banks

The TExt-guided Memory Pyramid (TEMP) constructor computes patch-text and frame-text similarity via cross-modal attention, then uses differentiable frame selection (Gumbel-Softmax) to build a spatial memory bank of the most query-relevant patches and a temporal memory bank of reweighted frame representations.
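The selection step described above can be illustrated with a minimal forward-pass sketch in plain Python. This is not the authors' implementation: the function names (`gumbel_softmax`, `build_memory`), the use of dot products as similarity, the mean over patch similarities as the frame-text score, the per-frame top-k patch rule, and the mean-pooled frame representation are all assumptions made for illustration. A real model would use a tensor library so that gradients flow through the relaxed sampling.

```python
import math
import random


def softmax(xs, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    scaled = [x / tau for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(logit - math.log(-math.log(u)))
    return softmax(noisy, tau)


def build_memory(patch_feats, text_feat, top_k=2, tau=0.5, rng=random):
    """patch_feats[t][p] is the feature vector of patch p in frame t."""
    # Patch-text similarity (assumed here to be a plain dot product).
    patch_sim = [[dot(p, text_feat) for p in frame] for frame in patch_feats]
    # Frame-text similarity (assumed: mean of the frame's patch similarities).
    frame_sim = [sum(row) / len(row) for row in patch_sim]
    frame_w = gumbel_softmax(frame_sim, tau, rng)

    # Spatial memory: keep the top_k most query-relevant patches per frame.
    spatial = []
    for t, row in enumerate(patch_sim):
        best = sorted(range(len(row)), key=lambda p: row[p], reverse=True)[:top_k]
        spatial.append([patch_feats[t][p] for p in best])

    # Temporal memory: mean-pooled frame feature, reweighted by frame_w.
    temporal = []
    for t, frame in enumerate(patch_feats):
        dim = len(frame[0])
        pooled = [sum(p[d] for p in frame) / len(frame) for d in range(dim)]
        temporal.append([frame_w[t] * v for v in pooled])
    return spatial, temporal, frame_w
```

Lower values of `tau` push the frame weights toward a one-hot choice, while higher values spread weight more evenly across frames.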

CholeVidQA-32K Dataset

CholeVidQA-32K data curation and annotation pipeline

Dataset Curation Pipeline. QA pairs are generated through a multi-stage pipeline combining clinician annotations from CholecT50, Endoscapes, and CholeScore with structured LLM-based QA generation and clinical validation.

Dataset composition: task hierarchy, task types, and temporal duration

Dataset Statistics. Composition breakdown by hierarchy level, task type, and video segment duration — 32K QA pairs from 3,855 segments across CholecT50, Endoscapes, and CholeScore.

CholeVidQA-32K dataset thumbnail with example QA pairs

Sample QA Pairs. Examples from all three hierarchy levels — Perception, Assessment, and Reasoning — illustrating the diversity of question types and clinical reasoning required.

Three-Tier Evaluation Framework

Three-tier evaluation: LLM-based verbalizer, overlap scorer, LLM judge

Evaluation combines (1) an LLM-based verbalizer for categorical F1 & balanced accuracy, (2) text overlap metrics (BLEU, METEOR, ROUGE-L, CIDEr), and (3) an LLM judge scoring correctness, relevance, and linguistic quality. Multi-judge agreement (Kendall's W = 0.852) confirms evaluation robustness.
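For readers unfamiliar with Kendall's W, the sketch below computes the standard coefficient of concordance, W = 12S / (m²(n³ − n)), for m judges rating n items. S is the sum of squared deviations of the per-item rank totals from their mean. The page does not state how W was computed or whether a tie correction was applied, so the helper names and the uncorrected formula here are assumptions.

```python
def average_ranks(scores):
    """1-based ranks of the scores; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """ratings[j][i] is the score judge j gave item i (no tie correction)."""
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(row) for row in ratings]
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

W ranges from 0 (no agreement among judges) to 1 (identical rankings), so the reported 0.852 indicates strong agreement.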

Quantitative Results

Overall performance on the CholeVidQA-32K test set across GPT Scores (Correctness, CR; Relevance, RL; Linguistic Quality, LG), Overlap Metrics (BLEU, METEOR, ROUGE-L, CIDEr), and Classification Metrics (balanced accuracy, bAcc, and F1-score, each with its answer rate). SurgTEMP attains the best score among fine-tuned models on every metric.

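| Models | GPT CR | GPT RL | GPT LG | BLEU | METEOR | ROUGE-L | CIDEr | bAcc Score | bAcc Rate (%) | F1 Score | F1 Rate (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open-source Zero-shot | | | | | | | | | | | |
| mPLUG-Owl3 | 32.06 | 45.00 | 43.09 | 5.57 | 23.81 | 23.12 | 6.40 | 32.29 | 95 | 15.00 | 97 |
| InternVideo2.5 | 16.28 | 19.38 | 16.56 | 2.93 | 8.87 | 9.78 | 4.67 | 25.71 | 91 | 1.24 | 7 |
| LongVA | 25.81 | 34.90 | 34.68 | 1.91 | 21.61 | 14.21 | 0.61 | 5.07 | 9 | 22.46 | 56 |
| LLaVA-Video | 33.29 | 41.07 | 38.72 | 3.76 | 21.39 | 17.20 | 4.01 | 27.34 | 49 | 12.56 | 65 |
| VideoGPT+ | 36.65 | 48.32 | 46.61 | 3.83 | 21.96 | 22.68 | 14.19 | 40.20 | 68 | 25.13 | 78 |
| Fine-tuned | | | | | | | | | | | |
| VideoGPT+-ft | 64.06 | 73.58 | 71.63 | 14.62 | 33.29 | 31.87 | 42.33 | 52.37 | 95 | 49.30 | 82 |
| LLaVA-Video-ft | 60.05 | 67.30 | 65.56 | 14.67 | 32.44 | 31.41 | 40.85 | 51.31 | 88 | 44.20 | 63 |
| SurgTEMP (Ours) | 71.62 | 81.65 | 79.12 | 15.29 | 36.28 | 34.90 | 42.53 | 56.53 | 100 | 52.33 | 91 |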

For classification metrics, Score is the metric value on answered samples and Rate (%) is the answer rate.
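The Score/Rate convention can be made concrete with a small sketch that computes balanced accuracy only over answered samples and reports the answer rate alongside it. It assumes an unanswered sample is represented as `None` (for example, when a free-text answer cannot be mapped to a category). That representation and the function name `balanced_accuracy_with_rate` are hypothetical, not taken from the released code.

```python
def balanced_accuracy_with_rate(y_true, y_pred):
    """Return (balanced accuracy in % on answered samples, answer rate in %)."""
    # Keep only samples for which a categorical prediction exists.
    answered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    rate = 100.0 * len(answered) / len(y_true)

    # Per-class (hits, total) counts over the answered samples.
    per_class = {}
    for t, p in answered:
        hits, total = per_class.get(t, (0, 0))
        per_class[t] = (hits + (p == t), total + 1)

    # Balanced accuracy is the unweighted mean of per-class recall.
    recalls = [hits / total for hits, total in per_class.values()]
    score = 100.0 * sum(recalls) / len(recalls) if recalls else 0.0
    return score, rate
```

Reporting the two numbers together matters because a model can reach a high Score while answering only a small fraction of the questions, so Score should always be read next to Rate.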

Qualitative Results

Acknowledgement

This work was funded by the European Union (ERC, CompSURG, 101088553) and French state funds managed by the ANR under Grants ANR-10-IAHU-02, ANR-23-IACL-0004, ANR-10-IDEX-0002, and ANR-20-SFRI-0012, with HPC resources provided by CAMMA, IHU Strasbourg, and Unistra Mesocentre.

License

The dataset, code, and model weights released with this work are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

You are free to share and adapt the material for non-commercial purposes, provided you give appropriate credit to the original authors and distribute any derivatives under the same license.

BibTeX

@misc{li2026surgtemptemporalawaresurgicalvideo,
      title={SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy},
      author={Shi Li and Vinkle Srivastav and Nicolas Chanel and Saurav Sharma and Nabani Banik and Lorenzo Arboit and Kun Yuan and Pietro Mascagni and Nicolas Padoy},
      year={2026},
      eprint={2603.29962},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.29962},
}