CORLI is a consortium of Huma-Num, the French national infrastructure dedicated to the technical support and promotion of digital humanities. The goal of CORLI is to promote and provide tools and information for good and efficient research practices in corpus linguistics, especially on spoken language corpora. Because of the time required to collect and transcribe spoken language resources, their number is limited and thus corpora need to be interoperable and reusable in order to improve research on themes such as phonology, prosody, interaction, syntax, and textometry. To help researchers reach this goal, CORLI has designed a pair of tools: TEICORPO to assist in the conversion and use of spoken language corpora, and TEIMETA for metadata purposes. TEICORPO is based on the principle of an underlying common format, namely TEI XML as described in its specification for spoken language use (ISO 2016). This tool enables the conversion of transcriptions created with alignment software such as CLAN, Transcriber, Praat, or ELAN as well as common file formats (CSV, XLSX, TXT, or DOCX) and the TEI format, which plays the role of a lossless pivot format. Backward conversion is possible in many cases, with limitations inherent in the destination target format. TEICORPO can run the Treetagger part-of-speech tagger and the Stanford CoreNLP tools on TEI files and can export the resulting files to textometric tools such as TXM, Le Trameur, or Iramuteq, making it suitable for spoken language corpora editing as well as for various research purposes.

The CORLI Consortium and Corpus Linguistic Research

The CORLI consortium is a network of researchers and laboratories engaged in corpus linguistic research. CORLI is one of the consortia of Huma-Num, a large French infrastructure dedicated to the digital humanities. CORLI aims to promote and assist research in corpus linguistics and is based on a network of researchers and engineers working in this field. The general policy of the consortium is to build on existing practices, research tools, and material from members of the consortium, and to improve them to match the needs of all the people and laboratories interested in corpus linguistics. The main activities of the consortium are helping corpus providers to produce open science and shareable data, providing continuous education for advanced students as well as senior researchers, and sharing good practices about data and tools. CORLI is now an official CLARIN K Centre.

1.1 The Importance of Data Sharing

Data sharing is an absolute necessity for research and applications in spoken language, for two main reasons. First, creating spoken language corpora is very expensive. As a result, not only are existing data very frequently reused, but it is also often necessary to mix data coming from different origins when the goal is to produce very large datasets. Second, language production involves a very wide variety of speakers and contexts. Hence, in order to create interesting sociolinguistic data, it is very often efficient to mix corpora from different sources.
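The pivot-format principle described for TEICORPO, in which every source format is converted to one common representation and exported again from it, can be illustrated with a small sketch. This is not TEICORPO's code or API. The `Utterance` class, the four function names, the CSV column names, and the simplified XML layout (a `u` element carrying `who`, `start`, and `end` attributes) are all assumptions made for illustration, and the XML is only TEI-like: it is not validated against the ISO specification for spoken language.

```python
import csv
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical in-memory record for one time-aligned utterance."""
    speaker: str
    start: float
    end: float
    text: str


def csv_to_utterances(csv_text):
    """Import step: read a CSV transcription (assumed columns) into records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Utterance(row["speaker"], float(row["start"]), float(row["end"]), row["text"])
        for row in reader
    ]


def utterances_to_pivot(utterances):
    """Write records to a simplified, TEI-like XML string (the pivot)."""
    root = ET.Element("TEI")
    body = ET.SubElement(ET.SubElement(root, "text"), "body")
    for utt in utterances:
        element = ET.SubElement(
            body, "u", who=utt.speaker, start=repr(utt.start), end=repr(utt.end)
        )
        element.text = utt.text
    return ET.tostring(root, encoding="unicode")


def pivot_to_utterances(xml_text):
    """Read the pivot back into records, the first half of any export."""
    root = ET.fromstring(xml_text)
    return [
        Utterance(el.get("who"), float(el.get("start")), float(el.get("end")), el.text or "")
        for el in root.iter("u")
    ]


def utterances_to_csv(utterances):
    """Export step: write records back to the same CSV layout."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["speaker", "start", "end", "text"])
    for utt in utterances:
        writer.writerow([utt.speaker, utt.start, utt.end, utt.text])
    return out.getvalue()


# Made-up sample data, used only to exercise the round trip.
SAMPLE_CSV = "speaker,start,end,text\nMOT,0.0,1.5,hello there\nCHI,1.5,2.2,hi\n"

utterances = csv_to_utterances(SAMPLE_CSV)
pivot_xml = utterances_to_pivot(utterances)
restored = pivot_to_utterances(pivot_xml)
round_trip_csv = utterances_to_csv(restored)
```

With a pivot, supporting a new format needs one importer and one exporter rather than a converter to every other format. The round trip here is lossless only because the CSV layout and the pivot carry the same four fields. A target format that cannot represent something stored in the pivot would drop it on export, which is the kind of limitation the text mentions for backward conversion.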