As a first example we highlight the Polish Cued Speech Corpus of Hearing-Impaired Children, which was curated in April 2020 and published at https://talkbank.org/. The audio files are accessible only via The Language Archive (TLA) at the Max Planck Institute. All information and links can be found on the landing page of this corpus: https://phonbank.talkbank.org/access/Clinical/PCSC.html. See also this Tour de CLARIN interview with the Polish producers of the corpus, Katarzyna Klessa and Anita Lorenc.

As a next showcase we point to the rich corpora of speech from children and adults with language disorders collected in the VALID project (Klatter et al., 2014) and stored at TLA. Within VALID, four existing digital datasets were curated to make them available for scientific research in a CLARIN-compatible format. The datasets included are:

  • SLI RU-Kentalis database
  • Bilingual deaf children RU-Kentalis database
  • ADHD and SLI corpus UvA database
  • Deaf adults RU database

More information about these datasets can be found at https://validdata.org/clarin-project/datasets/. The datasets themselves can be found at TLA via this link.

Another showcase is the P-MoLL dataset, which is accessible via this link to all registered users of TLA. The P-MoLL project (Modalität von Lernervarietäten im Längsschnitt) was run at the Free University in Berlin by Prof. Norbert Dittmar from 1987 to 1992. It studied the acquisition of modality in German as a second language by untutored adult immigrants whose native language was Polish or Italian. The longitudinal data collection covers about two and a half years of the learners’ acquisition process. It contains their oral speech production from different elicitation tasks and free conversations with native speakers (Dittmar et al., 1990).

Another example of a well-documented dataset on second language learning is the LESLLA corpus. LESLLA stands for Literacy Education and Second Language Learning for Adults; see https://www.leslla.org/. The corpus contains speech of 15 low-educated learners of Dutch as a second language. All of them are women; 8 are Turkish and 7 Moroccan. (Turks and Moroccans are the two largest immigrant groups in the Netherlands.) At the time of the recordings, they were between 22 and 45 years old. Participants carried out five tasks, all of which involved spoken language but which ranged from strictly controlled to semi-spontaneous. An extensive description of the curated corpus can be found in Sanders, Van de Craats & De Lint (2014). This corpus is also accessible at TLA via this link.

Dittmar, N., Reich, A., Skiba, R., Schumacher, M., & Terborg, H. (1990). Die Erlernung modaler Konzepte des Deutschen durch erwachsene polnische Migranten: Eine empirische Längsschnittstudie. In: Informationen Deutsch als Fremdsprache: Info DaF 17(2), pp. 125–172.

Klatter, J., Van Hout, R., Van den Heuvel, H., Fikkert, P., Baker, A., De Jong, J., Wijnen, F., Sanders, E., & Trilsbeek, P. (2014). Vulnerability in Acquisition, Language Impairments in Dutch: Creating a VALID Data Archive. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), May 2014, pp. 1525–1531.

Sanders, E., Van de Craats, I., & De Lint, V. (2014). The Dutch LESLLA Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), May 2014, pp. 2715–2718.