The multi-institution project is led by Dawn Knight at Cardiff University and includes partner institutions Swansea, Bangor and Lancaster Universities. CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes: National Corpus of Contemporary Welsh) will break new ground as both a language resource and a model of corpus construction. Attached to the project are two Research Assistants at Swansea University as well as others at the partner institutions.

This project will create a major corpus of the Welsh language. CorCenCC will be the first ever large-scale corpus to represent spoken, written and electronically-mediated Welsh (compiling an initial data set of 10 million Welsh words) together with a functional design informed, from the outset, by representatives of all anticipated academic and community user groups. CorCenCC will provide societal, economic and academic benefits by:

  • Facilitating uses of Welsh in public, commercial, educational and governmental settings.
  • Redefining the scope, relevance and design infrastructure of corpus development methodology.

A corpus allows users to identify and explore language as it is actually used, rather than relying on intuition or prescriptive accounts of how it 'should' be used. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, voice recognition and web search tools. Welsh has had no comprehensive corpus facility able to meet all these requirements.

CorCenCC will capitalise on extensive community interest in sustaining and 'growing' Welsh through the novel integration of crowdsourcing, a powerful data collection method with the potential to revolutionise corpus construction. Recruited through social and broadcast media, roadshows and existing networks, Welsh speakers will record and upload their own data via a mobile app, and even contribute to data coding. This approach promises representative language across genres, language varieties (regional and social) and contexts. Traditional data collection will supplement the crowdsourcing, ensuring a representative balance of data as specified in the project targets.

The project entails collaboration between computer scientists, Welsh language experts, education specialists and applied linguists. The project advisory group includes representatives from a wide range of external stakeholder groups. These include the Welsh Government, National Assembly for Wales, Welsh Joint Education Committee, the National Centre for Learning Welsh, Gwasg y Lolfa, National Library of Wales, and University of Wales Dictionary of the Welsh Language.

Professor Tess Fitzpatrick (Department of Applied Linguistics), Associate Professor Steve Morris and Dr Alex Lovell (Department of Welsh) are Co-Investigators on this £1.8 million ESRC/AHRC-funded research project.