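EuMediCS - Euskarazko Medikuntzaren Domeinuko Corpus Sintetikoa, Itzultzaile Automatikoen Ekarpena [EuMediCS - A Synthetic Medical-Domain Corpus in Basque, the Contribution of Machine Translation Systems]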
DOI:
https://doi.org/10.26876/ikergazte.vi.03.21
Keywords:
Large Language Models, Neural Machine Translation, Basque Corpus, Medical-domain
Abstract
In recent years, Large Language Models (LLMs) have significantly transformed the field of artificial intelligence, achieving remarkable success in tasks such as translation and text synthesis. In the medical domain, these models have also demonstrated strong performance, in some cases even reaching human level when trained specifically for the task. However, most advances with LLMs have been made in high-resource languages such as English. This puts low-resource languages such as Basque at a considerable disadvantage: because of the lack of data, the LLMs trained for these languages are comparatively few and of lower quality. Moreover, in specific domains such as medicine, such models do not exist at all in most cases. To address this gap, this project aims to create the first synthetic medical-domain corpus in Basque, with the goal of training an LLM in that context. To this end, we propose three translation models capable of translating from Spanish and English into Basque. The quality of these translation models will then be evaluated, and the best one will be selected for a future large-scale translation of medical texts into Basque. This work represents a notable contribution to the development of specialized LLMs for Basque, particularly in the medical domain.
License
Copyright (c) 2025 IkerGazte. Nazioarteko ikerketa euskaraz

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
