How Well Do Vision-Language Models Explain Sarcasm? An Evaluation of Multimodal Explanation Quality for Social Media Posts

Authors

  • Ikhlasul Amal, Universitas Gadjah Mada
  • Annisa Nur Ramadhani, Universitas Muhammadiyah Surakarta

DOI:

https://doi.org/10.65917/aisa.v1i1.22

Keywords:

Multimodal Sarcasm, Social Media, Vision-Language Models, Zero-Shot Learning, Few-Shot Learning, BERTScore

Abstract

Sarcasm is a complex communicative phenomenon frequently encountered in social media, where the literal meaning of language sharply contradicts the speaker’s true intent, often reinforced by multimodal cues such as incongruent images or memes. While prior research has primarily focused on detecting sarcasm, far less attention has been devoted to generating human-interpretable explanations that clarify why content is sarcastic. This study addresses this gap by systematically evaluating the ability of fifteen Vision–Language Models (VLMs) of varying parameter sizes to produce multimodal sarcasm explanations under zero-shot and few-shot learning conditions. Using the publicly available MORE dataset of social media posts annotated with concise human-written explanations, we benchmarked each model’s outputs with three widely used evaluation metrics (ROUGE, BERTScore, and Sentence-BERT) to assess both surface-level overlap and deeper semantic alignment. Our findings reveal that smaller models can rival or even outperform larger architectures on n-gram similarity measures, while embedding-based metrics often yield high scores even when generated explanations contradict the ground truth. These results highlight the limitations of current automatic metrics in reliably capturing the nuanced reasoning underlying sarcasm. Overall, this work demonstrates that model scale does not consistently predict explanation quality and underscores the need for more robust evaluation protocols.
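
As a rough illustration of what a surface-level overlap metric measures, the sketch below computes a simplified unigram F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation (it uses whitespace tokenization and no stemming), the function name `rouge1_f1` is hypothetical, and the example sentences are invented for illustration rather than taken from the MORE dataset.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1: clipped unigram overlap between two strings."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (assumptions, not dataset content).
reference = "the author is annoyed by the long queue"
contradicting = "the author is pleased by the long queue"
unrelated = "nice weather today"

score_contradicting = rouge1_f1(contradicting, reference)
score_unrelated = rouge1_f1(unrelated, reference)
print(score_contradicting, score_unrelated)
```

In this toy case the contradicting explanation differs from the reference by a single word, so its overlap score stays high even though its meaning is reversed. This illustrates why automatic scores of this kind need to be read with care.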

Published

28-07-2025

How to Cite

How Well Do Vision-Language Models Explain Sarcasm? An Evaluation of Multimodal Explanation Quality for Social Media Posts. (2025). Artificial Intelligence Systems and Its Applications, 1(1), 31-55. https://doi.org/10.65917/aisa.v1i1.22