OLAC Record: The CMU Kids Corpus

OLAC Record
oai:www.ldc.upenn.edu:LDC97S63

Metadata

Title: The CMU Kids Corpus

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Eskenazi, Maxine, Jack Mostow, and David Graff. The CMU Kids Corpus LDC97S63. Web Download. Philadelphia: Linguistic Data Consortium, 1997.

Contributor: Eskenazi, Maxine

Mostow, Jack

Graff, David

Date (W3CDTF): 1997

Description: *Introduction* This database consists of sentences read aloud by children. It was originally designed to provide a training set of children's speech for the SPHINX II automatic speech recognizer, for use in the LISTEN project at Carnegie Mellon University.
*Data* The children range in age from six to eleven (see details below) and were in first through third grades (the 11-year-old was in 6th grade) at the time of recording. There were 24 male and 52 female speakers. Although the girls outnumber the boys, we feel that the small difference in vocal tract length between the two at this age should make the effect of this imbalance negligible. There are 5,180 utterances in all.
The speakers come from two separate populations. Since the LISTEN reading coach needed good examples of reading aloud, it was decided that the majority of the speakers should be "good" readers. These readers were enrolled in either the Chatham College Summer Camp or the Mount Lebanon Extended Day Summer Fun program in Pittsburgh, and were recorded on-site in the summer of 1995. This set will hereafter be called SUM95. There are 44 speakers and 3,333 utterances in this set.
The LISTEN system also needed examples of errorful reading and dialectal variants. The readers who supplied this type of speech come from a school with a high population of children who are at risk of growing up to be poor readers and who could therefore benefit from any reading tutor or other system built upon this database. They come from Fort Pitt School in Pittsburgh and were recorded in April 1996. This subset will be referred to as FP. There are 32 speakers and 1,847 utterances in this set.
The list of speakers, the set each speaker is in, and the number of sentences per speaker can be found in the "tables" directory, in the file named "speaker.tbl."
It should be noted that although there will be some dialectal variation in the speech of the SUM95 subset, the speech of the FP subset gives a very good representation of the dialects of the children who may be targeted by the LISTEN system. However, the user should be aware that the speakers' dialect partly reflects what is locally called "Pittsburghese."
*Updates* There are no updates at this time.
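The speaker and utterance counts quoted in the description can be cross-checked against each other. The following is a minimal sketch in Python that uses only the figures stated above; the variable names are illustrative and are not part of the corpus distribution, and the per-speaker means are derived values, not figures taken from "speaker.tbl."

```python
# Counts as quoted in the description (not read from the corpus files).
subsets = {
    "SUM95": {"speakers": 44, "utterances": 3333},
    "FP": {"speakers": 32, "utterances": 1847},
}
by_sex = {"male": 24, "female": 52}

# Totals across the two subsets.
total_speakers = sum(s["speakers"] for s in subsets.values())
total_utterances = sum(s["utterances"] for s in subsets.values())

# Derived value: mean number of utterances per speaker in each subset.
mean_per_speaker = {
    name: s["utterances"] / s["speakers"] for name, s in subsets.items()
}

print("speakers:", total_speakers, "(by sex:", sum(by_sex.values()), ")")
print("utterances:", total_utterances)
for name, mean in mean_per_speaker.items():
    print(f"{name}: {mean:.2f} utterances per speaker on average")
```

The subset totals agree with the 76 speakers implied by the male and female counts and with the stated 5,180 utterances.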

Format: Sampling Rate: 16000

Format: Sampling Format: 1-channel pcm

Identifier: LDC97S63

https://catalog.ldc.upenn.edu/LDC97S63

ISBN: 1-58563-120-5

ISLRN: 566-795-587-797-8

DOI: 10.35111/b4v0-ff65

Language: English

Language (ISO639): eng

License: CMU Kids Corpus - Individual Agreement: https://catalog.ldc.upenn.edu/license/cmu-kids-individual-agreement.pdf

License: CMU Kids Corpus - Organization Agreement: https://catalog.ldc.upenn.edu/license/cmu-kids-organization-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC97S63

Rights Holder: The text presented to the children was obtained from Weekly Reader stories. Weekly Reader is a four-page color reading supplement given out to children in many classrooms. Special reprint permission granted by Weekly Reader (R), published by Weekly Reader Corporation. Copyright (c) 1994, 1995 by Weekly Reader Corporation. All Rights Reserved.

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC97S63

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
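The GetRecord entries above correspond to OAI-PMH requests for this record in two metadata formats. As a rough sketch, such a request URL could be assembled in Python as shown below. The base URL is a placeholder assumption, since the record does not state the repository's actual endpoint; `get_record_url` is a hypothetical helper name. The `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol, with `oai_dc` as the simple DC prefix and `olac` assumed as the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint; substitute the repository's real base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC97S63"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# One URL per format listed in the record.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

The helper only constructs the URL; fetching and parsing the XML response is left out because the endpoint here is not a real one.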

Search Info
Citation: Eskenazi, Maxine; Mostow, Jack; Graff, David. 1997. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC97S63
Up-to-date as of: Wed Oct 29 7:00:44 EDT 2025
