OLAC Record: MyST Children's Conversational Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2021S05

Metadata

Title: MyST Children's Conversational Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Pradhan, Sameer, Ronald Cole, and Wayne Ward. MyST Children's Conversational Speech LDC2021S05. Web Download. Philadelphia: Linguistic Data Consortium, 2021

Contributor: Pradhan, Sameer

Cole, Ronald Allan

Ward, Wayne

Date (W3CDTF): 2021

Date Issued (W3CDTF): 2021-06-15

Description: *Introduction* MyST (My Science Tutor) Children's Conversational Speech was developed by Boulder Learning Inc. It comprises approximately 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary. Data was collected in two phases between 2008 and 2017. In both phases, spoken dialogs with the virtual tutor were aligned to classroom instruction using the Full Option Science System (FOSS), a research-based science curriculum for grades K-8. The eight FOSS science modules represented in this data set consisted of an average of 16 small-group classroom science investigations. Following the investigations, students conversed with the virtual science tutor for 15-20 minutes. The tutor asked open-ended questions about media presented on-screen, and students produced spoken answers.

*Data* Speech data was collected in 10,496 sessions for a total of 227,567 utterances. Approximately 45% of those utterances (102,433) were transcribed. All data collected in Phase I was transcribed using rich transcription guidelines; data collected in Phase II was partially transcribed using a reduced version of those guidelines. The transcription guidelines are included in this release. Data is divided into development, test, and train partitions for use with ASR systems. Speech is presented in single-channel, 16 kHz, 16-bit FLAC-compressed WAV format. Transcripts are UTF-8 encoded plain text.

*Samples* Please view this student answer audio sample (FLAC) and transcript sample (TXT).

*Updates* None at this time.

Extent: Corpus size: 30595048 KB

Format: Sampling Rate: 16000

Sampling Format: pcm

Identifier: LDC2021S05

https://catalog.ldc.upenn.edu/LDC2021S05

ISBN: 1-58563-967-2

ISLRN: 848-818-101-134-5

DOI: 10.35111/cyxy-p432

Language: English

Language (ISO639): eng

License: MyST Children’s Conversational Speech Agreement: https://catalog.ldc.upenn.edu/license/myst-childrens-conversational-speech-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2021S05

Rights Holder: Portions © 2021 Boulder Learning Inc., © 2021 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Text

Type (OLAC): lexicon

primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2021S05

DateStamp: 2025-06-17

GetRecord: OAI-PMH request for simple DC format
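
The two GetRecord entries in this record correspond to OAI-PMH requests that differ only in their metadata format. The following is a minimal sketch of how such request URLs could be assembled, not the archive's own tooling. The base URL is a hypothetical placeholder (the real endpoint is not given in this record), and `get_record_url` is an illustrative helper name. The `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol; `oai_dc` is the protocol's standard prefix for simple Dublin Core, and `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the archive's actual OAI-PMH endpoint.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2021S05"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,           # colons are percent-encoded
        "metadataPrefix": metadata_prefix,  # e.g. "olac" or "oai_dc"
    })
    return f"{base_url}?{query}"


olac_url = get_record_url(OAI_ID, "olac")   # OLAC format (assumed prefix)
dc_url = get_record_url(OAI_ID, "oai_dc")   # simple DC format
```

Fetching either URL from a real endpoint would return an XML response; parsing that response is outside the scope of this sketch.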

Search Info
Citation: Pradhan, Sameer; Cole, Ronald Allan; Ward, Wayne. 2021. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound dcmi_Text iso639_eng olac_lexicon olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2021S05
Up-to-date as of: Thu Sep 18 1:01:55 EDT 2025
