OLAC Record: CAREGIVER Corpus

OLAC Record
oai:catalogue.elra.info:ELRA-S0410

Metadata

Title: CAREGIVER Corpus

Access Rights: Rights available for: nonCommercialUse

Coverage: United Kingdom

Date Available (W3CDTF): 2020-09-03

Date Issued (W3CDTF): 2020-09-03

Description: CAREGIVER is a multilingual speech corpus for modeling language acquisition, designed and recorded within the framework of the EU-funded Acquisition of Communication and Recognition Skills (ACORNS) project. The motivation behind the corpus and its design relies on current knowledge regarding infant language acquisition. Instead of recording infants and children, the voices of their primary and secondary caregivers were captured as read speech, in both infant-directed and adult-directed speech modes, across four languages. The challenges and methods applied to obtain prompts of similar complexity and semantics across languages, as well as the normalized recording procedures employed at the different locations, are covered in the reference publication listed below. An orthographic transcription is available for every utterance, and time-aligned word and phone annotations exist for some of the sub-corpora.

The distributed corpus deviates from this design in a couple of respects: Swedish is not provided, and for Dutch only year 2 recordings are available. The corpus contains nearly 66,000 utterance-based audio files spoken over a two-year period by 16 male and 14 female native speakers of Dutch, English, and Finnish.

Overview:

1) UK English:
Year 1:
- 4 speakers (2 males, 2 females)
- 1000 recordings per speaker
- orthographic transcriptions in .xml and speech recordings in .wav
Year 2:
- 10 speakers: the 4 speakers from year 1, with 2397 recordings per speaker, and 6 test speakers (3 males, 3 females), with 600 recordings per speaker
- orthographic transcriptions in .xml and speech recordings in .wav
- annotation: time stamps at word and phone levels obtained by forced alignment, and a list of errors in the word-level time stamps

2) Finnish:
Year 1:
- 4 speakers (2 males, 2 females)
- 2000 recordings per speaker
- orthographic transcriptions in .xml and speech recordings in .wav
Year 2:
- 10 speakers: the 4 speakers from year 1, with 2397 recordings per speaker, and 6 test speakers (3 males, 3 females), with 600 recordings per speaker
- orthographic transcriptions in .xml and speech recordings in .wav

3) Dutch:
Year 2:
- 10 speakers: 4 speakers recorded twice (2 males, 2 females) and 6 test speakers (4 males, 2 females) recorded in one session
- orthographic transcriptions in .cor and speech recordings in .wav
- annotation: time stamps at sentence level only

To be cited as the reference for the corpus:
Altosaar, T., Bosch, L. ten, Aimetti, G., Koniaris, Chr., Demuynck, K., Heuvel, H. van den (2010): A Speech Corpus for Modeling Language Acquisition: CAREGIVER. Proceedings LREC2010, Malta, pp. 1062-1068. http://www.lrec-conf.org/proceedings/lrec2010/pdf/597_Paper.pdf
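The speaker and recording figures in the Overview can be cross-checked against the totals given in the Description. The following is a minimal sketch of that arithmetic: the numbers are copied from this record, while the names (`SPEAKERS`, `RECORDINGS`, `speaker_totals`, `known_recordings`) are illustrative only. The record gives no per-speaker recording count for Dutch, so the Dutch figure computed at the end is only a rough inference from "nearly 66,000", not a number stated by the record.

```python
# (males, females) per speaker group, as listed in the Overview.
# "core" = the 4 speakers recorded in both years (Dutch: recorded twice).
SPEAKERS = {
    "English": {"core": (2, 2), "test": (3, 3)},
    "Finnish": {"core": (2, 2), "test": (3, 3)},
    "Dutch": {"core": (2, 2), "test": (4, 2)},
}

# (number of speakers, recordings per speaker) for each stated session.
# Dutch is omitted because the record states no per-speaker count for it.
RECORDINGS = {
    "English": [(4, 1000), (4, 2397), (6, 600)],
    "Finnish": [(4, 2000), (4, 2397), (6, 600)],
}


def speaker_totals(speakers=SPEAKERS):
    """Return (males, females) summed over all languages and groups."""
    males = sum(m for groups in speakers.values() for m, _ in groups.values())
    females = sum(f for groups in speakers.values() for _, f in groups.values())
    return males, females


def known_recordings(recordings=RECORDINGS):
    """Return the number of recordings per language where counts are stated."""
    return {
        lang: sum(n_speakers * per_speaker for n_speakers, per_speaker in rows)
        for lang, rows in recordings.items()
    }


if __name__ == "__main__":
    males, females = speaker_totals()
    print(f"speakers: {males} male, {females} female")

    known = known_recordings()
    for lang, count in known.items():
        print(f"{lang}: {count} recordings")

    # Rough inference only: treats "nearly 66,000" as an upper bound.
    dutch_upper_bound = 66_000 - sum(known.values())
    print(f"Dutch (inferred upper bound): about {dutch_upper_bound}")
```

The speaker sums reproduce the 16 male and 14 female speakers stated in the Description; English and Finnish together account for 38,376 of the "nearly 66,000" audio files.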

Identifier: ELRA-S0410

ISLRN: 072-357-063-759-1

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0410/

Language: Dutch; Flemish

Finnish

English

Language (ISO639): nld

fin

eng

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0410

DateStamp: 2020-09-03

GetRecord: OAI-PMH request for simple DC format
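The two GetRecord entries in this record correspond to OAI-PMH `GetRecord` requests for the identifier above. The sketch below shows how such a request URL is typically assembled. The base URL is a placeholder, since this record does not state the archive's OAI-PMH endpoint; `oai_dc` is the metadata prefix that the OAI-PMH specification defines for simple Dublin Core, and `olac` is assumed here as the prefix for the OLAC format. The function name `get_record_url` is illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint (assumption): the record does not give the real base URL.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-S0410"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL.

    GetRecord takes two required arguments besides the verb:
    `identifier` and `metadataPrefix`.
    """
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    for prefix in ("olac", "oai_dc"):  # "olac" is an assumed prefix name
        url = get_record_url(OAI_IDENTIFIER, prefix)
        print(url)
        # Decode the query again to confirm the arguments round-trip.
        print(parse_qs(urlsplit(url).query))
```

Using `urlencode` percent-encodes the colons in the identifier, so the identifier can be passed as it appears in the record.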

Search Info
Citation: n.a. 2020. ELRA (European Language Resources Association).
Terms: area_Europe country_FI country_GB country_NL dcmi_Sound iso639_eng iso639_fin iso639_nld olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0410
Up-to-date as of: Wed Oct 1 0:57:24 EDT 2025
