About this project
CORPWS CENEDLAETHOL CYMRAEG CYFOES (THE NATIONAL CORPUS OF CONTEMPORARY WELSH): A COMMUNITY DRIVEN APPROACH TO LINGUISTIC CORPUS CONSTRUCTION
Professor Tess Fitzpatrick (Department of English Language and Applied Linguistics), Steve Morris (Department of Welsh) and Mark Stonelake (Academi Hywel Teifi) are Co-Investigators on this £1.8 million, 42-month ESRC/AHRC-funded research project. The multi-institution project is led by Cardiff University, with partner institutions including Swansea, Bangor and Lancaster Universities, and it will break new ground as both a language resource and a model of corpus construction. Attached to the project are two Research Assistants (one full-time and one part-time) at Swansea University, as well as others at the partner institutions.
This project will create a major corpus of the Welsh language: CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes: National Corpus of Contemporary Welsh). It will be the first ever large-scale corpus to represent spoken, written and electronically-mediated Welsh (compiling an initial data set of 10 million Welsh words) together with a functional design informed, from the outset, by representatives of all anticipated academic and community user groups. CorCenCC will provide societal, economic and academic benefits by:
- Facilitating uses of Welsh in public, commercial, educational and governmental settings.
- Redefining the scope, relevance and design infrastructure of corpus development methodology.
A corpus allows users to identify and explore language as it is actually used, rather than relying on intuition or prescriptive accounts of how it 'should' be used. This evidence-based approach is used by academic researchers, lexicographers, teachers, language learners, assessors, resource developers, policy makers, publishers, translators and others, and is essential to the development of technologies such as predictive text production, word processing tools, machine translation, voice recognition and web search tools. Welsh has had no comprehensive corpus facility able to meet all these requirements.
CorCenCC will capitalise on extensive community interest in sustaining and 'growing' Welsh through the novel integration of crowdsourcing, a powerful data collection method which has the potential to revolutionise corpus construction. Recruited through social and broadcast media, roadshows and existing networks, Welsh speakers will record and upload their own data via a mobile app, and even contribute to data coding. This approach promises representative language across genres, language varieties (regional and social) and contexts. Traditional data collection will supplement the crowdsourcing, ensuring a representative balance of data as specified in the project targets.
For more information on the project, visit: http://sites.cardiff.ac.uk/corcencc