The Impact of Specialized Corpora for Word Embeddings in Natural Langage Understanding. - Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
Journal article, Studies in Health Technology and Informatics, Year: 2020

Abstract

Recent studies in the biomedical domain suggest that learning statistical word representations (static or contextualized word embeddings) on large corpora of specialized data improves results on downstream natural language processing (NLP) tasks. In this paper, we explore the impact of the data source used to learn word representations on a natural language understanding task. We compared fastText (static) and ELMo (contextualized) embeddings, each learned either on general-domain data (Wikipedia) or on specialized data (electronic health records, EHR). The best results were obtained with ELMo representations learned on EHR data for the two sub-tasks (gains of +7% and +4% in F1-score). Moreover, the ELMo representations were trained with only a fraction of the data used for fastText.
No file deposited

Dates and versions

hal-03476839, version 1 (13-12-2021)

Cite

Antoine Neuraz, Bastien Rance, Nicolas Garcelon, Leonardo Campillos Llanos, Anita Burgun, et al. The Impact of Specialized Corpora for Word Embeddings in Natural Langage Understanding. Studies in Health Technology and Informatics, 2020, 270, pp.432-436. ⟨10.3233/SHTI200197⟩. ⟨hal-03476839⟩