Empowering Celtic Languages with AI: A Breakthrough for Welsh and Beyond
Celtic languages such as Cornish, Irish, Scottish Gaelic, and Welsh are some of the oldest living languages in the United Kingdom. To support speakers of these languages, the UK-LLM sovereign AI project is developing an AI model based on NVIDIA Nemotron that can reason in both English and Welsh, a language spoken by around 850,000 people in Wales. The model is intended to support public services, including healthcare, education, and legal resources, in the Welsh language.
The importance of enabling AI to function in Welsh cannot be overstated. It fosters inclusivity and ensures that essential services are accessible in the native language of Wales. As UK Prime Minister Keir Starmer stated, "I want every corner of the U.K. to be able to harness the benefits of artificial intelligence. By enabling AI to reason in Welsh, we’re making sure that public services — from healthcare to education — are accessible to everyone, in the language they live by." This initiative showcases how cutting-edge AI technology, developed on the U.K.’s most advanced AI supercomputer located in Bristol, can serve the public good, protect cultural heritage, and unlock opportunities nationwide.
Established in 2023, the UK-LLM project, originally known as BritLLM and led by University College London, has already released two models for U.K. languages. The development of the new Welsh model is a collaborative effort with Wales’ Bangor University and NVIDIA. This aligns with Welsh government initiatives to increase the active use of the Welsh language, with a target of achieving one million speakers by 2050, as outlined in the Cymraeg 2050 strategy.
U.K.-based AI cloud provider Nscale will make this new AI model accessible to developers via an application programming interface (API). Gruffudd Prys, a senior terminologist and head of the Language Technologies Unit at Canolfan Bedwyr, emphasizes the significance of this project: "The aim is to ensure that Welsh remains a living, breathing language that continues to develop with the times. AI shows enormous potential to help with second-language acquisition of Welsh as well as for enabling native speakers to improve their language skills."
The new model is expected to make Welsh-language resources significantly more accessible. Public institutions and businesses operating in Wales will be able to use it to translate content or provide bilingual chatbot services, helping to make written materials equally available in Welsh and English. Sectors that stand to benefit include healthcare providers, educators, broadcasters, retailers, and restaurant owners.
Looking beyond Welsh, the UK-LLM team plans to apply the same methodology to develop AI models for other languages spoken across the U.K., such as Cornish, Irish, Scots, and Scottish Gaelic. Furthermore, they aim to collaborate internationally to build models for languages from Africa and Southeast Asia. Pontus Stenetorp, a professor of natural language processing and deputy director for the Centre of Artificial Intelligence at University College London, stated, "Our aim is to take the insights gained from the Welsh model and apply them to other minority languages, in the U.K. and across the globe."
Harnessing Sovereign AI Infrastructure for Model Development
The Welsh language model is built on NVIDIA Nemotron, a family of open-source models featuring open weights, datasets, and recipes. The UK-LLM development team used the Llama Nemotron Super model with 49 billion parameters and the Nemotron Nano model with 9 billion parameters, training them on Welsh-language data. Because far less Welsh source data is available than for languages like English or Spanish, the team created a large Welsh training dataset using NVIDIA NIM microservices, which translated more than 30 million entries from NVIDIA Nemotron open datasets from English to Welsh.
The translation and training workloads were accelerated using a GPU cluster on the NVIDIA DGX Cloud Lepton platform, along with hundreds of NVIDIA GH200 Grace Hopper Superchips on Isambard-AI, the U.K.’s most powerful supercomputer. Isambard-AI is located at the University of Bristol and is backed by £225 million in government investment.
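The dataset construction described above can be pictured as a simple loop: take English entries, translate each one, and keep the pair. The following is only a minimal sketch under stated assumptions, not the UK-LLM team's pipeline. The function `build_parallel_records`, the record keys, and the stand-in `demo_translate` are all hypothetical; in practice the `translate` argument would wrap a call to a hosted translation model.

```python
from typing import Callable, Iterable, Iterator


def build_parallel_records(
    english_entries: Iterable[str],
    translate: Callable[[str], str],
) -> Iterator[dict]:
    """Yield English/Welsh pairs, skipping entries that fail to translate.

    `translate` is a hypothetical callable supplied by the caller; this
    sketch makes no assumption about which model or service sits behind it.
    """
    for text in english_entries:
        source = text.strip()
        if not source:
            continue  # skip blank entries
        welsh = translate(source).strip()
        # Heuristic assumption: empty or unchanged output means the
        # translation failed, so the entry is dropped.
        if not welsh or welsh == source:
            continue
        # "cy" is the ISO 639-1 code for Welsh.
        yield {"en": source, "cy": welsh}


def demo_translate(text: str) -> str:
    """Stand-in translator for demonstration only (a tiny lookup table)."""
    lookup = {"Good morning": "Bore da", "Thank you": "Diolch"}
    return lookup.get(text, text)
```

Passing the translator in as an argument keeps the filtering logic separate from whichever service does the translating, so the same loop can be tested with a stub like `demo_translate`. Dropping unchanged output is a blunt filter that would also discard entries such as proper names, which is one reason human verification of the translated data still matters.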
Capturing Linguistic Nuances with Careful Evaluation
Bangor University, located in Gwynedd, the county with the highest percentage of Welsh speakers, provides linguistic and cultural expertise to support the new model’s development. Gruffudd Prys, with nearly two decades of experience in language technology for Welsh, plays a crucial role in this collaboration. His team verifies the accuracy of machine-translated training data and manually translated evaluation data. They also assess how the model handles Welsh linguistic nuances such as mutation, in which the consonant at the beginning of a Welsh word changes depending on the neighboring words. Because the correct form depends on context rather than on the word alone, mutation is a challenge for AI.
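To make the mutation problem concrete, the sketch below applies a small, illustrative subset of Welsh initial consonant mutations after possessive pronouns. It is not part of the UK-LLM evaluation and covers only a fraction of the real rules; the names `RULES`, `POSSESSIVES`, `mutate`, and `possessive` are hypothetical, and input words are assumed to be lowercase.

```python
# Welsh digraphs that count as single letters and are left unchanged here.
UNMUTATED_DIGRAPHS = ("ch", "dd", "ff", "ng", "ph", "th")

# Illustrative subset of mutation rules: (initial, replacement).
# Digraphs such as "ll" and "rh" come first so they match before single letters.
RULES = {
    "soft": [("ll", "l"), ("rh", "r"), ("p", "b"), ("t", "d"), ("c", "g"),
             ("b", "f"), ("d", "dd"), ("g", ""), ("m", "f")],
    "nasal": [("p", "mh"), ("t", "nh"), ("c", "ngh"),
              ("b", "m"), ("d", "n"), ("g", "ng")],
    "aspirate": [("p", "ph"), ("t", "th"), ("c", "ch")],
}

# English possessive -> (Welsh pronoun, mutation it triggers).
# Note that "ei" means both "his" and "her" but triggers different mutations.
POSSESSIVES = {
    "my": ("fy", "nasal"),
    "your": ("dy", "soft"),
    "his": ("ei", "soft"),
    "her": ("ei", "aspirate"),
}


def mutate(word: str, kind: str) -> str:
    """Apply one kind of initial mutation to a lowercase Welsh word."""
    if word.startswith(UNMUTATED_DIGRAPHS):
        return word
    for initial, replacement in RULES[kind]:
        if word.startswith(initial):
            return replacement + word[len(initial):]
    return word  # no rule applies, so the word is unchanged


def possessive(owner: str, noun: str) -> str:
    """Build a phrase such as 'fy nghath' (my cat) from 'my' and 'cath'."""
    pronoun, kind = POSSESSIVES[owner]
    return f"{pronoun} {mutate(noun, kind)}"
```

With the noun "cath" (cat), this gives "fy nghath" for "my", "dy gath" for "your", "ei gath" for "his", and "ei chath" for "her". The last two share the same pronoun, so the surface text alone does not say which mutation is correct. A model has to track who the possessor is, which a lookup table like this one cannot do.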
The model and its associated Welsh training and evaluation datasets will be available for enterprise and public sector use, supporting additional research, model training, and application development. "It’s one thing to have this AI capability exist in Welsh, but it’s another to make it open and accessible for everyone," Prys said. "That subtle distinction can be the difference between this technology being used or not being used."
Deploying Sovereign AI Models with NVIDIA Nemotron and NIM Microservices
The framework used to develop the UK-LLM’s Welsh model provides a foundation for multilingual AI development worldwide. Nemotron models, data, and recipes are publicly available for developers to build reasoning models tailored to virtually any language, domain, and workflow. Packaged as NVIDIA NIM microservices, Nemotron models are optimized for cost-effective computing and can run anywhere, from a laptop to the cloud.
European enterprises will be able to deploy open, sovereign models on the Perplexity AI-powered search engine. This development marks a significant step toward making advanced AI technologies accessible and beneficial to a wider audience, preserving linguistic diversity while fostering technological advancement.
To sum up, the UK-LLM initiative, through its collaboration with NVIDIA and Bangor University, represents a landmark effort in leveraging AI to support and preserve the Welsh language. The project not only promises to enhance public services but also aims to ensure the continued vitality of Welsh and other minority languages across the globe. As technology and cultural preservation intersect, this model sets a precedent for future endeavors in AI-driven language preservation and empowerment.
More information about NVIDIA Nemotron is available on NVIDIA’s official website.