TY - GEN
T1 - Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs
AU - Xia, Jingfei
AU - Zhuge, Mingchen
AU - Geng, Tiantian
AU - Fan, Shun
AU - Wei, Yuantai
AU - He, Zhenyu
AU - Zheng, Feng
N1 - KAUST Repository Item: Exported on 2023-08-29
Acknowledgements: This work was supported by the National Key R&D Program of China (Grant NO. 2022YFF1202903) and the National Natural Science Foundation of China (Grant NO. 61972188 and 62122035).
PY - 2023/6/26
Y1 - 2023/6/26
N2 - Figure skating scoring is challenging because it requires judging the technical moves of the players as well as their coordination with the background music. Most learning-based methods cannot solve it well for two reasons: 1) each move in figure skating changes quickly, hence simply applying traditional frame sampling will lose a lot of valuable information, especially in videos 3 to 5 minutes long; 2) prior methods rarely considered the critical audio-visual relationship in their models. For these reasons, we introduce a novel architecture, named Skating-Mixer. It extends the MLP framework in a multimodal fashion and effectively learns long-term representations through our designed memory recurrent unit (MRU). Aside from the model, we collected a high-quality audio-visual FS1000 dataset, which contains over 1000 videos of 8 types of programs with 7 different rating metrics, surpassing other datasets in both quantity and diversity. Experiments show the proposed method achieves state-of-the-art results on all major metrics on the public Fis-V and our FS1000 dataset. In addition, we include an analysis applying our method to recent competitions in the Beijing 2022 Winter Olympic Games, demonstrating that our method has strong applicability.
UR - http://hdl.handle.net/10754/693763
UR - https://ojs.aaai.org/index.php/AAAI/article/view/25392
UR - http://www.scopus.com/inward/record.url?scp=85167997687&partnerID=8YFLogxK
U2 - 10.1609/aaai.v37i3.25392
DO - 10.1609/aaai.v37i3.25392
M3 - Conference contribution
SN - 9781577358800
SP - 2901
EP - 2909
BT - Proceedings of the AAAI Conference on Artificial Intelligence
PB - Association for the Advancement of Artificial Intelligence (AAAI)
ER -