EuMediCS - Euskarazko Medikuntzaren Domeinuko Corpus Sintetikoa, Itzultzaile Automatikoen Ekarpena

Ane G. Domingo-Aldama; Irune Palacios; Maitane Urruela; Iker De la Iglesia; Ander Barrena; Josu Goikoetxea

doi:10.26876/ikergazte.vi.03.21

Authors

Ane G. Domingo-Aldama University of the Basque Country (UPV/EHU)
Irune Palacios University of the Basque Country (UPV/EHU)
Maitane Urruela University of the Basque Country (UPV/EHU)
Iker De la Iglesia University of the Basque Country (UPV/EHU)
Ander Barrena University of the Basque Country (UPV/EHU)
Josu Goikoetxea University of the Basque Country (UPV/EHU)

DOI:

https://doi.org/10.26876/ikergazte.vi.03.21

Keywords:

Large Language Models, Neural Machine Translation, Basque Corpus, Medical Domain

Abstract

In recent years, Large Language Models (LLMs) have significantly transformed the field of artificial intelligence, achieving remarkable success in tasks such as translation and text synthesis. In the medical domain, these models have also performed strongly, in some cases reaching human-level performance when trained specifically for the task. However, most advances with LLMs have been made in high-resource languages such as English. This puts low-resource languages such as Basque at a considerable disadvantage: because of the lack of data, the LLMs trained for these languages are comparatively few and of lower quality. Moreover, in specialized domains such as medicine, such models do not exist at all in most cases. To address this gap, this project aims to create the first synthetic medical-domain corpus in Basque, which can then be used to train an LLM for that domain. To this end, we propose three translation models that translate from Spanish and English into Basque. The quality of these models will then be evaluated, and the best one will be selected for future large-scale translation of medical texts into Basque. This work contributes to the development of specialized LLMs for Basque, particularly in the medical domain.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
