TY - JOUR
T1 - Discriminating physiological from non-physiological interfaces in structures of protein complexes: A community-wide study
AU - Schweke, Hugo
AU - Xu, Qifang
AU - Tauriello, Gerardo
AU - Pantolini, Lorenzo
AU - Schwede, Torsten
AU - Cazals, Frédéric
AU - Lhéritier, Alix
AU - Fernandez-Recio, Juan
AU - Rodríguez-Lumbreras, Luis Angel
AU - Schueler-Furman, Ora
AU - Varga, Julia K.
AU - Jiménez-García, Brian
AU - Réau, Manon F.
AU - Bonvin, Alexandre M. J. J.
AU - Savojardo, Castrense
AU - Martelli, Pier-Luigi
AU - Casadio, Rita
AU - Tubiana, Jérôme
AU - Wolfson, Haim J.
AU - Oliva, Romina
AU - Barradas-Bautista, Didier
AU - Ricciardelli, Tiziana
AU - Cavallo, Luigi
AU - Venclovas, Česlovas
AU - Olechnovič, Kliment
AU - Guerois, Raphael
AU - Andreani, Jessica
AU - Martin, Juliette
AU - Wang, Xiao
AU - Terashi, Genki
AU - Sarkar, Daipayan
AU - Christoffer, Charles
AU - Aderinwale, Tunde
AU - Verburgt, Jacob
AU - Kihara, Daisuke
AU - Marchand, Anthony
AU - Correia, Bruno E.
AU - Duan, Rui
AU - Qiu, Liming
AU - Xu, Xianjin
AU - Zhang, Shuang
AU - Zou, Xiaoqin
AU - Dey, Sucharita
AU - Dunbrack, Roland L.
AU - Levy, Emmanuel D.
AU - Wodak, Shoshana J.
N1 - KAUST Repository Item: Exported on 2023-07-11
Acknowledgements: This study (SJW coordinator) was performed as part of the Implementation Plan (2019-2021) of Activity-2 of the 3DBioInfo Elixir Community ( https://elixir-europe.org/communities/3d-bioinfo ). We thank Alexander Botzki (VIB Technology Training, Flanders, Belgium) for helping with installing and hosting the project GitHub. G.T., L.P., and T.S. acknowledge the contributions of Gabriel Studer in setting up the AlphaFold pipeline for our analysis, sciCORE at the University of Basel for providing computational resources and system administration support, and funding from the SIB Swiss Institute of Bioinformatics and the Biozentrum PhD Fellowships. D.K. acknowledges support from the National Institutes of Health (R01GM133840) and the National Science Foundation (DMS2151678, DBI2003635, CMMI1825941, MCB2146026, and MCB1925643). X.W. is a recipient of the MolSSI graduate fellowship. A.M.J.J.B. and M.R. acknowledge financial support from the Netherlands eScience Center (ASDI.2016.043), from the Netherlands Organization for Scientific Research (NWO) (TOP-PUNT grant 718.015.001) and from the European Union Horizon 2020 project BioExcel (823830). J.F.-R. acknowledges support from the Spanish Ministry of Science (grant PID2019-110167RB-I00 / AEI / 10.13039/501100011033). B.J.-G. is employed by Zymvol Biomodeling on a project which received funding from the European Union's Horizon 2020 research and innovation program under Marie Skłodowska-Curie grant agreement No. 801342 (Tecniospring INDUSTRY) and the Government of Catalonia's Agency for Business Competitiveness (ACCIÓ). X.Z. acknowledges support from NIH/NIGMS (grant R35GM136409). E.D.L. acknowledges support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 819318), a research grant from A.-M. Boucher, and research grants from the Estelle Funk Foundation, the Estate of Fannie Sherr, the Estate of Albert Delighter, the Merle S. Cahn Foundation, Mrs Mildred S. Gosden, the Estate of Elizabeth Wachsman, and the Arnold Bortman Family Foundation. S.D. acknowledges the Ramalingaswami Re-entry Fellowship (March 30th, 2021) from DBT India. O.S.-F. acknowledges support from the Israel Science Foundation, founded by the Israel Academy of Science and Humanities (grant number 301/2021). J.K.V. is supported by Marie Skłodowska-Curie European Training Network Grant #860517.
PY - 2023/6/27
Y1 - 2023/6/27
N2 - Reliably scoring and ranking candidate models of protein complexes and assigning their oligomeric state from the structure of the crystal lattice represent outstanding challenges. A community-wide effort was launched to tackle these challenges. The latest resources on protein complexes and interfaces were exploited to derive a benchmark dataset consisting of 1677 homodimer protein crystal structures, including a balanced mix of physiological and non-physiological complexes. The non-physiological complexes in the benchmark were selected to bury a similar or larger interface area than their physiological counterparts, making it more difficult for scoring functions to differentiate between them. Next, 252 functions for scoring protein-protein interfaces previously developed by 13 groups were collected and evaluated for their ability to discriminate between physiological and non-physiological complexes. A simple consensus score generated using the best performing score of each of the 13 groups, and a cross-validated Random Forest (RF) classifier were created. Both approaches showed excellent performance, with an area under the Receiver Operating Characteristic (ROC) curve of 0.93 and 0.94, respectively, outperforming individual scores developed by different groups. Additionally, AlphaFold2 engines recalled the physiological dimers with significantly higher accuracy than the non-physiological set, lending support to the reliability of our benchmark dataset annotations. Optimizing the combined power of interface scoring functions and evaluating it on challenging benchmark datasets appears to be a promising strategy.
UR - http://hdl.handle.net/10754/692876
UR - https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/pmic.202200323
UR - http://www.scopus.com/inward/record.url?scp=85162927714&partnerID=8YFLogxK
U2 - 10.1002/pmic.202200323
DO - 10.1002/pmic.202200323
M3 - Article
C2 - 37365936
SN - 1615-9853
JO - PROTEOMICS
JF - PROTEOMICS
ER -