Please use this identifier to cite or link to this item: https://hdl.handle.net/11147/15426
Full metadata record
DC Field | Value | Language
dc.contributor.author | Oğul, İ.Ü. | -
dc.contributor.author | Soygazi, F. | -
dc.contributor.author | Bostanoğlu, B.E. | -
dc.date.accessioned | 2025-03-25T22:55:22Z | -
dc.date.available | 2025-03-25T22:55:22Z | -
dc.date.issued | 2025 | -
dc.identifier.issn | 2376-5992 | -
dc.identifier.uri | https://doi.org/10.7717/PEERJ-CS.2662 | -
dc.identifier.uri | https://hdl.handle.net/11147/15426 | -
dc.description.abstract | Natural language inference (NLI) is a subfield of natural language processing (NLP) that aims to identify the contextual relationship between premise and hypothesis sentences. While high-resource languages like English benefit from robust and rich NLI datasets, creating similar datasets for low-resource languages is challenging due to the cost and complexity of manual annotation. Although translating existing datasets offers a practical solution, direct translation of domain-specific datasets presents unique challenges, particularly in handling abbreviations, metric conversions, and cultural alignment. This study introduces a pipeline for translating a medical NLI dataset into Turkish, a low-resource language. Our approach fine-tunes the Llama-3.1 model on selected samples from the Medical Abbreviation dataset (MeDAL) to extract and resolve medical abbreviations. The NLI pairs are then refined with the extracted abbreviations and subjected to metric correction. Finally, the processed sentences are translated using Facebook's No Language Left Behind (NLLB) translation model. To ensure quality, we conducted comprehensive evaluations using both machine learning models and medical expert review. Our results show that BERTurk achieved 75.17% accuracy on the TurkMedNLI test data and 76.30% on the normalized test set, while BioBERTurk demonstrated comparable performance with 75.59% accuracy on the test data and 72.29% on the normalized dataset. Medical experts further validated the translations through manual assessment of sampled sentences. This work demonstrates the effectiveness of large language models in adapting domain-specific datasets for low-resource languages, establishing a foundation for future research in multilingual biomedical NLP. Copyright 2025 Oğul et al. Distributed under Creative Commons CC-BY 4.0. | en_US
dc.language.iso | en | en_US
dc.publisher | PeerJ Inc. | en_US
dc.relation.ispartof | PeerJ Computer Science | en_US
dc.rights | info:eu-repo/semantics/openAccess | en_US
dc.subject | BERT | en_US
dc.subject | Language Translation | en_US
dc.subject | Llama | en_US
dc.subject | LLM | en_US
dc.subject | MedNLI | en_US
dc.subject | Natural Language Inference | en_US
dc.subject | Natural Language Processing | en_US
dc.subject | NLLB | en_US
dc.title | TurkMedNLI: A Turkish Medical Natural Language Inference Dataset Through Large Language Model Based Translation | en_US
dc.type | Article | en_US
dc.department | İzmir Institute of Technology | en_US
dc.identifier.volume | 11 | en_US
dc.identifier.scopus | 2-s2.0-85219134639 | -
dc.relation.publicationcategory | Article - International Refereed Journal - Institutional Faculty Member | en_US
dc.identifier.doi | 10.7717/PEERJ-CS.2662 | -
dc.authorscopusid | 57195222455 | -
dc.authorscopusid | 57220960947 | -
dc.authorscopusid | 24478565000 | -
dc.identifier.wosquality | Q2 | -
dc.identifier.scopusquality | Q1 | -
item.openairecristype | http://purl.org/coar/resource_type/c_18cf | -
item.languageiso639-1 | en | -
item.openairetype | Article | -
item.grantfulltext | none | -
item.fulltext | No Fulltext | -
item.cerifentitytype | Publications | -
Appears in Collections: Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection
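
The abstract describes a three-stage pipeline: resolve medical abbreviations, apply metric correction, then translate. The sketch below is only an illustration of that ordering, not the authors' implementation. Every name in it is hypothetical: the small lookup table stands in for the fine-tuned Llama-3.1 abbreviation model, the pounds-to-kilograms rule is one assumed example of what "metric correction" might involve, and `translate` is a placeholder callable where a wrapper around an NLLB model would go.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical lookup table. The paper resolves abbreviations with a
# fine-tuned Llama-3.1 model; this dictionary is only a stand-in.
ABBREVIATIONS: Dict[str, str] = {
    "pt": "patient",
    "hx": "history",
    "htn": "hypertension",
}

LB_TO_KG = 0.45359237  # kilograms per avoirdupois pound


def expand_abbreviations(text: str, table: Dict[str, str] = ABBREVIATIONS) -> str:
    """Replace whole-word abbreviations (case-insensitive) with expansions."""
    def repl(match):
        word = match.group(0)
        return table.get(word.lower(), word)
    return re.sub(r"\b[A-Za-z]+\b", repl, text)


def convert_pounds(text: str) -> str:
    """Assumed example of metric correction: rewrite pound weights in kg."""
    def repl(match):
        kilograms = float(match.group(1)) * LB_TO_KG
        return f"{kilograms:.1f} kg"
    return re.sub(r"(\d+(?:\.\d+)?)\s*(?:lbs?|pounds?)\b", repl, text)


def process_pair(
    premise: str,
    hypothesis: str,
    translate: Callable[[str], str],
) -> Tuple[str, str]:
    """Run both sentences through abbreviation expansion, unit conversion,
    and finally the supplied translation function, in that order."""
    def pipeline(sentence: str) -> str:
        return translate(convert_pounds(expand_abbreviations(sentence)))
    return pipeline(premise), pipeline(hypothesis)


if __name__ == "__main__":
    # Identity "translator" so the sketch runs without any model download.
    print(process_pair("Pt has hx of HTN", "The pt weighs 150 lbs", lambda s: s))
```

Passing the translator in as a function keeps the preprocessing steps testable on their own; a real run would substitute a function that calls a machine translation model for the identity lambda.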
Items in GCRIS Repository are protected by copyright, with all rights reserved, unless otherwise indicated.