TY - JOUR
T1 - NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding
AU - Wang, Kanix
AU - Stevens, Robert
AU - Alachram, Halima
AU - Li, Yu
AU - Soldatova, Larisa
AU - King, Ross
AU - Ananiadou, Sophia
AU - Schoene, Annika M.
AU - Li, Maolin
AU - Christopoulou, Fenia
AU - Ambite, José Luis
AU - Matthew, Joel
AU - Garg, Sahil
AU - Hermjakob, Ulf
AU - Marcu, Daniel
AU - Sheng, Emily
AU - Beißbarth, Tim
AU - Wingender, Edgar
AU - Galstyan, Aram
AU - Gao, Xin
AU - Chambers, Brendan
AU - Pan, Weidi
AU - Khomtchouk, Bohdan B.
AU - Evans, James A.
AU - Rzhetsky, Andrey
N1 - KAUST Repository Item: Exported on 2021-10-22
Acknowledged KAUST grant number(s): FCC/1/1976-26-01, REI/1/0018-01-01, REI/1/4473-01-01, FCS/1/4102-02-01
Acknowledgements: We are grateful to E. Gannon and M. Rzhetsky for comments on earlier versions of this manuscript. W.P. thanks J. Li for technical assistance and discussions. This work was funded by the DARPA Big Mechanism program under ARO contract W911NF1410333, by National Institutes of Health grants R01HL122712, 1P50MH094267, K12HL143959 (BBK), and U01HL108634-01, and by a gift from Liz and Kent Dauten. Additional support came from King Abdullah University of Science and Technology (KAUST), award numbers FCS/1/4102-02-01, FCC/1/1976-26-01, REI/1/0018-01-01, and REI/1/4473-01-01.
PY - 2021/10/20
Y1 - 2021/10/20
N2 - Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.
AB - Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.
UR - http://hdl.handle.net/10754/672916
UR - https://www.nature.com/articles/s41540-021-00200-x
U2 - 10.1038/s41540-021-00200-x
DO - 10.1038/s41540-021-00200-x
M3 - Article
C2 - 34671039
SN - 2056-7189
VL - 7
JO - npj Systems Biology and Applications
JF - npj Systems Biology and Applications
IS - 1
ER -