271-01-030 / 401-02-002 ISMRM Abstract

A Unified Vision-Language Foundation Model for Multi-Task MRI Application

Primary: Analysis Methods - Foundation Models

Secondary: Analysis Methods - Multi-Modal Learning with LLMs/VLMs

271-01-030 · ISMRM AMPC Selected Posters · Sunday, 10 May, 7:00 AM–2:00 PM · Traditional Posters

401-02-002 · Foundation Models · Tuesday, 12 May, 8:20 AM–10:10 AM · Hall 1A

Keywords: Foundation model, Vision-language model, Multi-task learning, Radiology report generation, Multi-modal learning

Accepted

Xingxin He¹,², Aurora Rofena¹,³, Yifan Hu¹, Ruimin Feng¹,², Zhehao Liao¹, Valerio Guarrasi³, Paolo Soda³, Zhaoye Zhou², Albert Jang¹,², Fang Liu¹,²,⁴

¹Athinoula A. Martinos Center for Biomedical Imaging, Harvard Medical School, Boston, United States of America

²Massachusetts General Hospital, Boston, United States of America

³Unit of Artificial Intelligence and Computer Systems, Department of Engineering, Università Campus Bio-Medico di Roma, Rome, Italy

⁴Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, Boston, United States of America

Presenting Author: Fang Liu

Synopsis

Motivation:

Goals:

Approach:

Results:


Full abstracts, figures, and presentations for Cape Town - 2026 ISMRM-ISMRT Annual Meeting and Exhibition are available to registered attendees. This content becomes freely available to the public roughly two years after the meeting.


References

1. McRobbie, D. W., Moore, E. A., Graves, M. J. & Prince, M. R. MRI from Picture to Proton. (Cambridge University Press, 2017).

2. Lustig, M., Donoho, D. & Pauly, J. M. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine 58, 1182–1195 (2007). https://doi.org/10.1002/mrm.21391

3. Heimann, T. & Meinzer, H.-P. Statistical shape models for 3D medical image segmentation: a review. Medical Image Analysis 13, 543–563 (2009). https://doi.org/10.1016/j.media.2009.05.004

4. Langlotz, C. P. RadLex: a new method for indexing online educational materials. Radiographics 26, 1595–1597 (2006). https://doi.org/10.1148/rg.266065168

5. Kahn Jr, C. E. et al. Toward best practices in radiology reporting. Radiology 252, 852–856 (2009). https://doi.org/10.1148/radiol.2523081992

6. Bommasani, R. et al. On the Opportunities and Risks of Foundation Models. Preprint at https://doi.org/10.48550/arXiv.2108.07258 (2022).

7. Qwen et al. Qwen2.5 Technical Report. Preprint at https://doi.org/10.48550/arXiv.2412.15115 (2025).

8. Vaswani, A. et al. Attention is All you Need. in Advances in Neural Information Processing Systems (eds Guyon, I. et al.) vol. 30 (Curran Associates, Inc., 2017).

9. Mu, S. & Lin, S. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications. arXiv preprint arXiv:2503.07137 (2025).

10. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. in International Conference on Medical Image Computing and Computer-Assisted Intervention 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4_28

11. Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 18, 203–211 (2021). https://doi.org/10.1038/s41592-020-01008-z

12. Reis, D., Kupec, J., Hong, J. & Daoudi, A. Real-time flying object detection with YOLOv8. arXiv preprint arXiv:2305.09972 (2023).

13. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).

14. Ravi, N. et al. SAM 2: Segment Anything in Images and Videos. Preprint at https://doi.org/10.48550/arXiv.2408.00714 (2024).

15. Liu, S. et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023).

16. Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. in Proceedings of the 38th International Conference on Machine Learning 8748–8763 (PMLR, 2021).

17. Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Preprint at https://doi.org/10.48550/arXiv.2010.11929 (2021).

18. Ma, J. et al. Segment anything in medical images. Nat Commun 15, 654 (2024). https://doi.org/10.1038/s41467-024-44824-z

19. Moor, M. et al. Med-Flamingo: a multimodal medical few-shot learner. in Machine Learning for Health (ML4H) 353–367 (PMLR, 2023).

Cite this abstract

http://echo.ismrm.org/p/ISMRM2026/271-01-030