About this project

CORPWS CENEDLAETHOL CYMRAEG CYFOES (THE NATIONAL CORPUS OF CONTEMPORARY WELSH): A COMMUNITY DRIVEN APPROACH TO LINGUISTIC CORPUS CONSTRUCTION

Professor Tess Fitzpatrick (Department of English Language and Applied Linguistics), Steve Morris (Department of Welsh) and Mark Stonelake (Academi Hywel Teifi) are Co-Investigators on this £1.8 million, 42-month research project funded by the ESRC and AHRC. The multi-institution project is led by Cardiff University, with partner institutions including Swansea, Bangor and Lancaster Universities, and it will break new ground as both a language resource and a model of corpus construction. Two Research Assistants (one full-time and one part-time) are attached to the project at Swansea University, along with others at the partner institutions.

This project will create a major corpus of the Welsh language: CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes: National Corpus of Contemporary Welsh). It will be the first large-scale corpus to represent spoken, written and electronically mediated Welsh, compiling an initial data set of 10 million Welsh words, and its functional design will be informed from the outset by representatives of all anticipated academic and community user groups. CorCenCC will provide societal, economic and academic benefits by:

- Facilitating uses of Welsh in public, commercial, educational and governmental settings.

- Redefining the scope, relevance and design infrastructure of corpus development methodology.

A corpus allows users to identify and explore language as it is actually used, rather than relying on intuition or prescriptive accounts of how it 'should' be used. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, voice recognition and web search tools. Welsh has had no comprehensive corpus facility able to meet all these requirements.

CorCenCC will capitalise on extensive community interest in sustaining and 'growing' Welsh through the novel integration of crowdsourcing, a powerful data collection method which has the potential to revolutionise corpus construction. Recruited through social and broadcast media, roadshows and existing networks, Welsh speakers will record and upload their own data via a mobile app, and even contribute to data coding. This approach promises representative language across genres, language varieties (regional and social) and contexts. Traditional data collection will supplement the crowdsourcing, ensuring a representative balance of data as specified in the project targets.

For more information on the project, visit: http://sites.cardiff.ac.uk/corcencc

Project Members

The project will be led by Dawn Knight, at the Centre for Language and Communication Research, Cardiff University. The academic project team comprises:

  • Dawn Knight, Cardiff University (School of English, Communication and Philosophy)
  • Irena Spasic, Cardiff University (School of Computer Science and Informatics)
  • Jeremy Evas, Cardiff University (School of Welsh)
  • Tess Fitzpatrick, Swansea University (Department of English Language and Applied Linguistics)
  • Steve Morris, Swansea University (Department of Welsh)
  • Mark Stonelake, Swansea University (Academi Hywel Teifi)
  • Paul Rayson, Lancaster University (School of Computing and Communications)
  • Enlli Thomas, Bangor University (School of Education)

Other contributors and collaborators include computer programmers, Welsh language experts and a range of external stakeholders, including the Welsh Government, the National Assembly for Wales, the Welsh Joint Education Committee, Welsh for Adults, Gwasg y Lolfa, and the University of Wales Dictionary of the Welsh Language.
