TY - GEN
T1 - X-VARS
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
AU - Held, Jan
AU - Itani, Hani
AU - Cioppa, Anthony
AU - Giancola, Silvio
AU - Ghanem, Bernard
AU - Van Droogenbroeck, Marc
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - The rapid advancement of artificial intelligence has led to significant improvements in automated decision-making. However, the increased performance of models often comes at the cost of explainability and transparency of their decision-making processes. In this paper, we investigate the capabilities of large language models to explain decisions, using football refereeing as a testing ground, given its decision complexity and subjectivity. We introduce the EXplainable Video Assistant Referee System, X-VARS, a multi-modal large language model designed for understanding football videos from the point of view of a referee. X-VARS can perform a multitude of tasks, including video description, question answering, action recognition, and conducting meaningful conversations based on video content and in accordance with the Laws of the Game for football referees. We validate X-VARS on our novel dataset, SoccerNet-XFoul, which consists of more than 22k video-question-answer triplets annotated by over 70 experienced football referees. Our experiments and human study illustrate the impressive capabilities of X-VARS in interpreting complex football clips. Furthermore, we highlight the potential of X-VARS to reach human performance and support football referees in the future. We will provide code, model, dataset, and demo upon publication.
KW - Explainability
KW - Football
KW - Multi-modal language model
KW - Referee
UR - http://www.scopus.com/inward/record.url?scp=85202147020&partnerID=8YFLogxK
U2 - 10.1109/CVPRW63382.2024.00332
DO - 10.1109/CVPRW63382.2024.00332
M3 - Conference contribution
AN - SCOPUS:85202147020
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 3267
EP - 3279
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -