How Well Do Vision-Language Models Explain Sarcasm? An Evaluation of Multimodal Explanation Quality for Social Media Posts

Ikhlasul Amal; Annisa Nur Ramadhani

doi:10.65917/aisa.v1i1.22

Authors

Ikhlasul Amal, Universitas Gadjah Mada
Annisa Nur Ramadhani, Universitas Muhammadiyah Surakarta

Keywords:

Multimodal Sarcasm, Social Media, Vision-Language Models, Zero-Shot Learning, Few-Shot Learning, BERTScore

Abstract

Sarcasm is a complex communicative phenomenon frequently encountered in social media, where the literal meaning of language sharply contradicts the speaker’s true intent, often reinforced by multimodal cues such as incongruent images or memes. While prior research has focused primarily on detecting sarcasm, far less attention has been devoted to generating human-interpretable explanations of why content is sarcastic. This study addresses this gap by systematically evaluating the ability of fifteen Vision–Language Models (VLMs) of varying parameter sizes to produce multimodal sarcasm explanations under zero-shot and few-shot conditions. Using the publicly available MORE dataset of social media posts annotated with concise human-written explanations, we benchmarked each model’s outputs with three widely used automatic metrics (ROUGE, BERTScore, and Sentence-BERT) to assess both surface-level overlap and deeper semantic alignment. Our findings reveal that smaller models can rival or even outperform larger architectures on n-gram similarity measures, while embedding-based metrics often yield high scores even when generated explanations contradict the ground truth. These results highlight the limitations of current automatic metrics in reliably capturing the nuanced reasoning underlying sarcasm. Overall, this work demonstrates that model scale does not consistently predict explanation quality and underscores the need for more robust evaluation protocols.
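
To make the metric limitation concrete, the following is a minimal sketch of a unigram-overlap score in the style of ROUGE-1 F1. It is not the evaluation code used in the study, and it covers only the surface-overlap side (the abstract's observation about contradictory explanations concerns embedding-based metrics, which are not reproduced here). The function name `rouge1_f1`, the simple regex tokenizer, and the three example sentences are assumptions made for illustration; the sentences are invented and not drawn from the MORE dataset.

```python
import re
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 in the style of ROUGE-1 (illustrative, no stemming)."""
    # Hypothetical tokenizer: lowercase, then split on non-word characters.
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Clipped overlap: each token counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (assumption, not taken from the MORE dataset).
reference = "the author is mocking the long wait at the airport"
contradiction = "the author is not mocking the long wait at the airport"
paraphrase = "the writer jokes about airport delays"

print(round(rouge1_f1(contradiction, reference), 3))  # 0.952
print(round(rouge1_f1(paraphrase, reference), 3))     # 0.25
```

In this toy case the explanation that negates the reference scores far higher than a faithful paraphrase, because the score counts shared tokens and is blind to the single word that reverses the meaning.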