Generating custom word embeddings for geoscientific corpi

C. E. Birnie*, M. Ravasi

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review

Abstract

In the field of natural language processing, word embeddings are a set of techniques that transform words from an input corpus into a low-dimensional space with the aim of capturing the relationships between words. It is well known that such relationships are highly dependent on the context of the input corpus, which in science varies considerably from field to field. In this work we compare the performance of word embeddings pre-trained on generic text with that of custom-made word embeddings trained on an extensive corpus of geoscientific papers. Numerous examples highlight the difference in meaning and closeness of words between geoscientific and generic contexts. A prime example is the term ‘ghost’, which has a specific definition in geophysics that differs from its common usage in the English language. Moreover, domain-specific analogies, such as ‘Compressional is to P-wave what shear is to... S-wave’, are investigated to understand the extent to which the different word embeddings capture the relationships between terms. Finally, we anticipate some use cases of word embeddings aimed at extracting key information from documents and providing better indexing.
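The analogy test mentioned in the abstract is commonly evaluated with vector arithmetic: the vector for the first term is subtracted from the second, the third is added, and the nearest remaining word by cosine similarity is taken as the answer. The following is a minimal sketch of that idea only. The three-dimensional vectors, the `toy_vectors` table, and the `analogy` helper are hypothetical and hand-picked for illustration; they are not taken from the paper or from any trained model, and the abstract does not state which analogy procedure the authors used.

```python
import math

# Hypothetical 3-dimensional toy vectors, hand-picked for illustration only.
# They are NOT results from the paper or from any trained embedding model.
toy_vectors = {
    "compressional": [1.0, 0.0, 0.2],
    "p-wave":        [1.0, 1.0, 0.2],
    "shear":         [0.0, 0.0, 1.0],
    "s-wave":        [0.0, 1.0, 1.0],
    "ghost":         [0.5, -1.0, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b what c is to d'.

    The target point is vec(b) - vec(a) + vec(c); the three query words
    are excluded from the candidates, as is usual for this test.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


answer = analogy("compressional", "p-wave", "shear", toy_vectors)
print(answer)
```

With real embeddings the same query would be run against the full vocabulary of each model (generic and geoscientific), and the ranked answers compared.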

Original language: English (US)
State: Published - 2020
Event: 1st EAGE Digitalization Conference and Exhibition - Vienna, Austria
Duration: Nov 30 2020 to Dec 3 2020

Conference

Conference: 1st EAGE Digitalization Conference and Exhibition
Country/Territory: Austria
City: Vienna
Period: 11/30/20 to 12/3/20

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
