The Impact of Specialized Corpora for Word Embeddings in Natural Langage Understanding.

Recent studies in the biomedical domain suggest that learning statistical word representations (static or contextualized word embeddings) on large corpora of specialized data improves results on downstream natural language processing (NLP) tasks. In this paper, we explore the impact of the data source of word representations on a natural language understanding task. We compared Fasttext (static) and ELMo (contextualized) embeddings, each learned either on general-domain data (Wikipedia) or on specialized data (electronic health records, EHR). The best results were obtained with ELMo representations learned on EHR data, with F1-score gains of +7% and +4% on the two sub-tasks. Moreover, the ELMo representations were trained with only a fraction of the data used for Fasttext.
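The abstract reports its comparison in terms of F1-score. As a reminder of how that metric is computed and how a gain between two systems would be read, here is a minimal sketch. The function name `f1_score` and all counts are hypothetical and are not taken from the paper; this record also does not say whether the reported gains are absolute points or relative improvements, so the sketch computes an absolute difference purely for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 when there are no true positives.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two systems evaluated on the same test set
# (illustrative only, NOT results from the paper).
f1_general = f1_score(tp=80, fp=20, fn=20)      # embeddings from general-domain text
f1_specialized = f1_score(tp=85, fp=15, fn=15)  # embeddings from specialized text

# Absolute gain in F1 between the two systems.
gain = f1_specialized - f1_general
print(f"general={f1_general:.2f} specialized={f1_specialized:.2f} gain={gain:+.2f}")
```

With these made-up counts the second system shows a gain of about five F1 points; the figures in the abstract should be read from the paper itself.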

Keywords

Contextual word embeddings; Natural language processing; Natural language understanding

Domains

Computation and Language [cs.CL]; Artificial Intelligence [cs.AI]

Contributor: Antoine Neuraz

https://inria.hal.science/hal-03476839

Submitted on: Monday, December 13, 2021, 11:09:04

Last modified on: Friday, April 19, 2024, 16:18:59

Dates and versions

hal-03476839, version 1 (13-12-2021)

Identifiers

HAL Id: hal-03476839, version 1
DOI: 10.3233/SHTI200197
PUBMED: 32570421

Cite

Antoine Neuraz, Bastien Rance, Nicolas Garcelon, Leonardo Campillos Llanos, Anita Burgun, et al. The Impact of Specialized Corpora for Word Embeddings in Natural Langage Understanding. Studies in Health Technology and Informatics, 2020, 270, pp. 432-436. ⟨10.3233/SHTI200197⟩. ⟨hal-03476839⟩

Collections

EPHE CNRS LIMSI CORDELIERS PSL UNIV-PARIS-SACLAY SORBONNE-UNIVERSITE SU-SCIENCES UP-SANTE LISN GS-ENGINEERING GS-COMPUTER-SCIENCE LISN-ILES
