SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Li, Shi; Srivastav, Vinkle; Chanel, Nicolas; Sharma, Saurav; Banik, Nabani; Arboit, Lorenzo; Yuan, Kun; Mascagni, Pietro; Padoy, Nicolas


Shi Li^1,2*, Vinkle Srivastav^1,2, Nicolas Chanel^1,2, Saurav Sharma^1,2, Nabani Banik^1,2, Lorenzo Arboit^1,2, Kun Yuan^1,2, Pietro Mascagni^1,2,3†, Nicolas Padoy^1,2†

^1University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France · ^2IHU Strasbourg · ^3Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
^*Corresponding author · ^†Co-last authors


SurgTEMP is a multimodal LLM that builds a hierarchical visual memory (spatial & temporal) guided by the text query, and is trained with a Surgical Competency Progression (SCP) scheme — progressively from basic perception to safety assessment and clinical reasoning across 11 tasks for laparoscopic cholecystectomy.

Abstract

Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) hold promise for education and intraoperative support.

Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment.

To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks.

To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (~128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy — Perception, Assessment, and Reasoning — spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment.

In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements (for example, an LLM-judged correctness score of 71.62 versus 64.06 for the strongest fine-tuned baseline), advancing the state of video-based surgical VQA.

Highlights

🧠

SurgTEMP Model

Multimodal LLM for surgical video question answering across 11 clinically relevant tasks, combining spatial and temporal memory banks guided by the text query.

🔍

TEMP Constructor

Text-guided attention selects the most query-relevant frames and patches, building a hierarchical spatial–temporal visual memory pyramid.

📈

SCP Training

Surgical Competency Progression progressively builds perception, assessment, and reasoning capabilities — mirroring clinical training.

🗂️

CholeVidQA-32K

32K open-ended QA pairs from 3,855 laparoscopic cholecystectomy segments (~128 h) spanning a 3-level task hierarchy.
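The three-level hierarchy and the progressive SCP idea can be illustrated with a small staging sketch. This is only an illustration under stated assumptions: the sample records, the `level` field, the function name `scp_stages`, and the choice to make stages cumulative (each stage also replays earlier levels) are hypothetical and are not taken from the released training code.

```python
# Hypothetical sketch of a staged curriculum over the three hierarchy levels.
# The ordering of levels follows the page; everything else is an assumption.
LEVELS = ["perception", "assessment", "reasoning"]


def scp_stages(samples, cumulative=True):
    """Split QA samples into one training pool per hierarchy level.

    samples: list of dicts with a "level" key whose value is in LEVELS.
    cumulative: if True (an assumption), stage i also keeps samples from
    earlier levels; if False, each stage holds only its own level.
    """
    stages = []
    for i, level in enumerate(LEVELS):
        allowed = set(LEVELS[: i + 1]) if cumulative else {level}
        stages.append([s for s in samples if s["level"] in allowed])
    return stages


# Toy records; the questions are placeholders, not dataset entries.
toy = [
    {"level": "perception", "question": "Which instrument is visible?"},
    {"level": "assessment", "question": "Is the Critical View of Safety achieved?"},
    {"level": "reasoning", "question": "Why is this action performed now?"},
]

for stage_id, pool in enumerate(scp_stages(toy), start=1):
    print(stage_id, [s["level"] for s in pool])
```

With `cumulative=False` the same function yields strictly separate stages, which is the other obvious way to read "progressively"; the page does not say which variant is used.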

Text-guided Memory Pyramid (TEMP)

TEMP constructor — cross-modal attention, spatial and temporal memory banks

The TExt-guided Memory Pyramid (TEMP) constructor computes patch-text and frame-text similarity via cross-modal attention, then uses differentiable frame selection (Gumbel-Softmax) to build a spatial memory bank of the most query-relevant patches and a temporal memory bank of reweighted frame representations.
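As a rough illustration of the description above, the following NumPy sketch runs a forward pass of text-guided selection. It is a sketch under assumptions, not the authors' implementation: the scaled dot-product scoring, the max over query tokens, mean pooling for frame representations, taking patches from the single highest-weighted frame, and all names (`build_memory`, `k_patches`, `tau`) are hypothetical choices. NumPy has no automatic differentiation, so this shows only the sampling arithmetic; a trainable version would need an autodiff framework.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gumbel_softmax(logits, tau, rng):
    # Add Gumbel(0, 1) noise, then apply a temperature-scaled softmax.
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    return softmax((logits + gumbel) / tau)


def build_memory(patches, text, k_patches=4, tau=0.5, seed=0):
    """patches: (T, P, D) patch features; text: (L, D) query token features."""
    rng = np.random.default_rng(seed)
    T, P, D = patches.shape

    # Patch-text similarity: scaled dot product, max over query tokens -> (T, P).
    sim = patches @ text.T / np.sqrt(D)          # (T, P, L)
    patch_score = sim.max(axis=-1)

    # Frame-text similarity: mean patch score per frame -> (T,).
    frame_score = patch_score.mean(axis=-1)

    # Soft frame selection weights that sum to 1 over frames.
    frame_w = gumbel_softmax(frame_score, tau, rng)

    # Temporal memory: mean-pooled frame features, reweighted by frame_w.
    temporal_memory = frame_w[:, None] * patches.mean(axis=1)   # (T, D)

    # Spatial memory: top-k query-relevant patches of the top-weighted frame.
    best = int(frame_w.argmax())
    top = np.argsort(patch_score[best])[::-1][:k_patches]
    spatial_memory = patches[best, top]                          # (k, D)
    return spatial_memory, temporal_memory, frame_w


rng = np.random.default_rng(1)
spatial, temporal, weights = build_memory(
    rng.normal(size=(6, 10, 8)),   # 6 frames, 10 patches, 8-dim features
    rng.normal(size=(3, 8)),       # 3 query tokens
)
print(spatial.shape, temporal.shape, weights.round(3))
```

Lowering `tau` pushes the frame weights toward a one-hot selection, while a higher value spreads weight across frames.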

CholeVidQA-32K Dataset

CholeVidQA-32K data curation and annotation pipeline

Dataset Curation Pipeline. QA pairs are generated through a multi-stage pipeline combining clinician annotations from CholecT50, Endoscapes, and CholeScore with structured LLM-based QA generation and clinical validation.

Dataset composition: task hierarchy, task types, and temporal duration

Dataset Statistics. Composition breakdown by hierarchy level, task type, and video segment duration — 32K QA pairs from 3,855 segments across CholecT50, Endoscapes, and CholeScore.

CholeVidQA-32K dataset thumbnail with example QA pairs

Sample QA Pairs. Examples from all three hierarchy levels — Perception, Assessment, and Reasoning — illustrating the diversity of question types and clinical reasoning required.

Three-Tier Evaluation Framework

Three-tier evaluation: LLM-based verbalizer, overlap scorer, LLM judge

Evaluation combines (1) an LLM-based verbalizer for categorical F1 & balanced accuracy, (2) text overlap metrics (BLEU, METEOR, ROUGE-L, CIDEr), and (3) an LLM judge scoring correctness, relevance, and linguistic quality. Multi-judge agreement (Kendall's W = 0.852) confirms evaluation robustness.
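For readers unfamiliar with Kendall's W, the agreement statistic quoted above can be computed as in the sketch below. It uses the standard formula W = 12 S / (m^2 (n^3 - n)) for m judges and n items, with average ranks for tied scores and no tie-correction term. The function names and the toy scores are hypothetical, and the page does not state whether the reported value used a tie correction.

```python
def average_ranks(scores):
    """Return 1-based ranks of scores, giving tied values their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """ratings: list of m lists, each holding one judge's n item scores."""
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(r) for r in ratings]
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy scores from three hypothetical judges over four answers.
judges = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [0.1, 0.2, 0.3, 0.4],
]
print(kendalls_w(judges))  # identical rankings give W = 1.0
```

W ranges from 0 (no agreement in rankings) to 1 (identical rankings), so a value of 0.852 indicates that the judges order the answers very similarly.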

Quantitative Results

Overall performance on the CholeVidQA-32K test set across GPT Scores (Correctness, Relevance, Linguistic Quality), Overlap Metrics (BLEU, METEOR, ROUGE-L, CIDEr), and Classification Metrics (balanced accuracy and F1-score, with answer rates). Among fine-tuned models, SurgTEMP attains the best score on every metric and the highest answer rates.

Models	Correctness (CR)	Relevance (RL)	Linguistic Quality (LG)	BLEU	METEOR	ROUGE-L	CIDEr	bAcc Score	bAcc Rate (%)	F1 Score	F1 Rate (%)
Open-source Zero-shot
mPLUG-Owl3	32.06	45.00	43.09	5.57	23.81	23.12	6.40	32.29	95	15.00	97
InternVideo2.5	16.28	19.38	16.56	2.93	8.87	9.78	4.67	25.71	91	1.24	7
LongVA	25.81	34.90	34.68	1.91	21.61	14.21	0.61	5.07	9	22.46	56
LLaVA-Video	33.29	41.07	38.72	3.76	21.39	17.20	4.01	27.34	49	12.56	65
VideoGPT+	36.65	48.32	46.61	3.83	21.96	22.68	14.19	40.20	68	25.13	78
Fine-tuned
VideoGPT+-ft	64.06	73.58	71.63	14.62	33.29	31.87	42.33	52.37	95	49.30	82
LLaVA-Video-ft	60.05	67.30	65.56	14.67	32.44	31.41	40.85	51.31	88	44.20	63
SurgTEMP (Ours)	71.62	81.65	79.12	15.29	36.28	34.90	42.53	56.53	100	52.33	91

For classification metrics, Score is the metric value on answered samples and Rate (%) is the answer rate.
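The Score and Rate pairing can be made concrete with a short sketch. It assumes that an unanswered sample is represented as `None` after verbalization and that balanced accuracy is the mean per-class recall over answered samples only. Both conventions, and the name `balanced_accuracy_with_rate`, are assumptions made for illustration rather than details confirmed by the page.

```python
def balanced_accuracy_with_rate(y_true, y_pred):
    """Return (balanced accuracy in %, answer rate in %).

    y_pred entries equal to None are treated as unanswered (an assumption)
    and are excluded from the score but counted in the rate denominator.
    """
    answered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    rate = 100.0 * len(answered) / len(y_true)

    # Per-class recall over the classes present among answered samples.
    classes = sorted({t for t, _ in answered})
    recalls = []
    for c in classes:
        preds_for_c = [p for t, p in answered if t == c]
        recalls.append(sum(p == c for p in preds_for_c) / len(preds_for_c))

    score = 100.0 * sum(recalls) / len(recalls) if recalls else 0.0
    return score, rate


# Toy labels: the fourth sample has no categorical answer.
truth = ["yes", "yes", "no", "no", "yes"]
preds = ["yes", "no", "no", None, "yes"]
score, rate = balanced_accuracy_with_rate(truth, preds)
print(round(score, 2), rate)  # 83.33 80.0
```

Reporting the two numbers together matters because a model that answers only a small share of samples can post a reasonable Score while being of little practical use, so rows with low Rate values should be read with that in mind.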

Qualitative Results

Qualitative reasoning examples — comprehensive assessment and action rationale

Reasoning — Comprehensive Assessment & Action Rationale. SurgTEMP generates multi-dimensional safety assessments and explains the rationale behind surgical actions.

Qualitative reasoning examples — intraoperative planning

Reasoning — Intraoperative Planning. The model provides scenario-based procedural guidance grounded in observed anatomy and tool use.

Qualitative perception examples — tool, action, anatomy

Perception — Tools, Actions & Anatomy. Fine-grained identification of surgical instruments, actions, and anatomical structures with clinical context.

Qualitative assessment examples — CVS, difficulty, skills

Assessment — CVS, Difficulty & Skills. SurgTEMP accurately evaluates Critical View of Safety, intraoperative difficulty findings, and surgeon skill proficiency.

Acknowledgement

This work was funded by the European Union (ERC, CompSURG, 101088553) and French state funds managed by the ANR under Grants ANR-10-IAHU-02, ANR-23-IACL-0004, ANR-10-IDEX-0002, and ANR-20-SFRI-0012, with HPC resources provided by CAMMA, IHU Strasbourg, and Unistra Mesocentre.

License

The dataset, code, and model weights released with this work are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

You are free to share and adapt the material for non-commercial purposes, provided you give appropriate credit to the original authors and distribute any derivatives under the same license.

BibTeX

@misc{li2026surgtemptemporalawaresurgicalvideo,
      title={SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy},
      author={Shi Li and Vinkle Srivastav and Nicolas Chanel and Saurav Sharma and Nabani Banik and Lorenzo Arboit and Kun Yuan and Pietro Mascagni and Nicolas Padoy},
      year={2026},
      eprint={2603.29962},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.29962},
}
