OLAC Record: West Point Heroico Spanish Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2006S37

Metadata

Title: West Point Heroico Spanish Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Morgan, John. West Point Heroico Spanish Speech LDC2006S37. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Morgan, John

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-10-25

Description:

*Introduction*

West Point Heroico Spanish Speech was developed by the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) and contains approximately 19,000 audio files of prompted Spanish speech with associated transcripts. This corpus was designed and collected by staff and faculty of DFL and CTELL to develop acoustic models for speech recognition systems. The U.S. government uses these systems to provide speech-recognition-enhanced language learning courseware to government linguists and students enrolled in various government language programs. Additionally, parts of this corpus were designed to model question/answer dialogues for use in domain-specific speech-to-speech translation systems.

The corpus consists of two subcorpora: one collected in September 2001 at El Heroico Colegio Militar (HEROICO), the Mexican Military Academy in Mexico City, and the other collected at the United States Military Academy (USMA), also known as West Point, at different times since 1997. The USMA subcorpus includes data from non-native speakers and data collected through a throat microphone.

*Data*

Two kinds of prompt scripts were used, one to elicit read speech and one to elicit free-response answers to questions. The scripts used to record read speech have a total of 724 distinct sentences: 205 short, simple sentences used in typical language learning scenarios and 519 sentences extracted from lecture notes used at USMA in a military readings course. The script used to elicit free-response answers contains 143 questions. The corpus includes .txt files of all the read sentences, questions, and transcriptions of subjects' answers. The files are separated by recording location and named accordingly.

Speech data was collected at HEROICO using Pentium 450 MHz laptop computers running Windows 2000, with a 16-bit data size and a sampling rate of 22,050 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was then played back for review, allowing the utterance to be re-recorded. A member of the data collection team was on hand during the recording session to verify recordings and provide technical assistance in case of malfunctioning equipment.

The data from USMA was collected using several different microphones and formats. Most of the data was recorded on Pentium computers running Linux through a Shure SM10 head-mounted microphone. Entropics ESPS programs were used in most cases, especially when both head-mounted and throat microphones were used.

*Samples*

For an example of the data in this corpus, please listen to this audio sample (WAV).

*Updates*

None at this time.

Extent: Corpus size: 2775534 KB

Format: Sampling Rate: 22050

Format: Sampling Format: pcm

Identifier: LDC2006S37

Identifier: https://catalog.ldc.upenn.edu/LDC2006S37

Identifier: ISBN: 1-58563-391-7

Identifier: ISLRN: 331-222-724-302-4

Identifier: DOI: 10.35111/6nac-6589

Language: Spanish

Language (ISO639): spa

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006S37

Rights Holder: Portions © 2001 United States Military Academy, © 2006 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text
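
The Extent and Format fields above, together with the 16-bit sample size given in the Description, allow a rough estimate of how much audio the corpus holds. The sketch below is a back-of-the-envelope calculation, not a figure from the record: it assumes 1 KB = 1024 bytes, mono audio, and that the whole corpus size is uncompressed audio sampled at 22,050 Hz. Since the corpus also contains .txt files and the USMA data was recorded in several formats, the result is best read as an approximate upper bound.

```python
# Values taken from the record.
CORPUS_SIZE_KB = 2_775_534   # Extent: Corpus size
SAMPLE_RATE_HZ = 22_050      # Format: Sampling Rate
BYTES_PER_SAMPLE = 2         # 16-bit data size (Description)
APPROX_FILE_COUNT = 19_000   # "approximately 19,000 audio files"

# Assumptions NOT stated in the record.
BYTES_PER_KB = 1024          # assumed meaning of "KB"
CHANNELS = 1                 # assumed mono recordings


def estimate_duration_seconds(size_kb, rate_hz, bytes_per_sample,
                              channels=CHANNELS, bytes_per_kb=BYTES_PER_KB):
    """Seconds of uncompressed PCM audio that fit in size_kb."""
    bytes_per_second = rate_hz * bytes_per_sample * channels
    return size_kb * bytes_per_kb / bytes_per_second


total_seconds = estimate_duration_seconds(
    CORPUS_SIZE_KB, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE
)
total_hours = total_seconds / 3600
mean_seconds_per_file = total_seconds / APPROX_FILE_COUNT

print(f"Estimated audio: {total_hours:.1f} hours")
print(f"Mean length per file: {mean_seconds_per_file:.1f} seconds")
```

Under these assumptions the estimate comes to a little under 18 hours of audio, or a few seconds per file, which is plausible for prompted single-sentence utterances.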

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006S37

DateStamp: 2021-06-04

GetRecord: OAI-PMH request for simple DC format
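
The GetRecord entries above refer to standard OAI-PMH requests. As a minimal sketch, such a request is an HTTP GET with the query parameters `verb`, `identifier`, and `metadataPrefix`. The base URL below is a placeholder, since this record does not state the repository endpoint; `oai_dc` is the prefix OAI-PMH defines for simple Dublin Core, and `olac` is assumed here to be the prefix the repository uses for the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2006S37"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefix for the OLAC format.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
# Standard OAI-PMH prefix for simple Dublin Core.
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Replacing `BASE_URL` with the repository's actual endpoint would yield the request URLs; `urlencode` percent-encodes the colons in the identifier, which OAI-PMH servers are expected to decode.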

Search Info
Citation: Morgan, John. 2006. Linguistic Data Consortium.
Terms: area_Europe country_ES dcmi_Sound dcmi_Text iso639_spa olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006S37
Up-to-date as of: Wed Oct 29 7:00:49 EDT 2025
