Nijmegen Corpus of Casual Czech

Corpus Transcription

The corpus was orthographically transcribed by native speakers of Czech using the TRANSCRIBER software (Barras et al., 2001). All speech and all non-speech events, such as laughter and filled pauses, were transcribed orthographically in Common Czech, and the registration of pronunciation variation was kept to a minimum. The speech of each pair of naive speakers was transcribed in a two-tier annotation file, while the confederates, who had been recorded on a separate mono channel, were transcribed separately in a one-tier annotation file. The transcribed text is organized into chunks with an average duration of 2.37 seconds. The orthographic transcription of the corpus contains 361,977 word tokens, distributed over 68,426 chunks.
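
As a rough sanity check on the figures above, the following minimal Python sketch derives two summary values from them: the mean number of word tokens per chunk and the summed duration of all chunks. The function and constant names are illustrative assumptions and are not part of the corpus tooling. The summed duration adds up chunks across all annotation tiers, so it should not be read as the length of the recordings.

```python
# Figures quoted in the corpus description above.
WORD_TOKENS = 361_977
CHUNKS = 68_426
MEAN_CHUNK_SECONDS = 2.37


def tokens_per_chunk(tokens: int, chunks: int) -> float:
    """Mean number of word tokens per transcribed chunk."""
    return tokens / chunks


def total_chunk_hours(chunks: int, mean_seconds: float) -> float:
    """Summed duration of all chunks, in hours (chunk count times mean duration)."""
    return chunks * mean_seconds / 3600


if __name__ == "__main__":
    print(f"tokens per chunk: {tokens_per_chunk(WORD_TOKENS, CHUNKS):.2f}")
    print(f"summed chunk duration (h): {total_chunk_hours(CHUNKS, MEAN_CHUNK_SECONDS):.2f}")
```

With the figures quoted above, this gives roughly 5.3 word tokens per chunk, which fits short conversational chunks of about two seconds.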