OLAC Record: West Point Croatian Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2005S28

Metadata

Title: West Point Croatian Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: LaRocca, Stephen A., Christine Tomei, and Milan Sokolich. West Point Croatian Speech LDC2005S28. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: LaRocca, Stephen A.

Tomei, Christine

Sokolich, Milan

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-10-15

Description: *Introduction* West Point Croatian Speech was developed by the Department of Foreign Languages (DFL) and Center for Technology Enhanced Language Learning (CTELL) and contains approximately 21 hours of read and free response Croatian speech. The corpus was collected by staff and faculty of DFL and CTELL to develop acoustic models for speech recognition systems. The US government uses these systems to provide speech-recognition-enhanced language learning courseware to government linguists and students enrolled in various government language programs. In addition, parts of this corpus were designed to model question-answer dialogues for use in domain-specific speech-to-speech translation systems. It consists of two subcorpora collected in 2000 and 2001 in Zagreb, Croatia. Informants were recruited from the English department at the University of Zagreb and the Croatian Military Academy. The 2000 subcorpus consists entirely of read speech, while the 2001 subcorpus includes free response answers to questions in addition to read speech.

*Data* The read speech in the two subcorpora was elicited from two different prompt scripts. The scripts used to record read speech contain a total of 6,329 distinct sentences. Each informant in 2000 attempted to read 100 sentences from a total of 200 carefully designed sentences written by Dr. Christine Tomei. Informants in 2001 read short text passages extracted from Croatian-language webpages. The script used to elicit free response answers contains 143 questions. Each speaker in the 2001 subcorpus attempted to record 105 utterances by reading 75 sentences and giving 35 free response answers to 35 questions. These recordings were transcribed by Milan Sokolich, who also wrote a pronunciation lexicon that includes grammatical tags. Speech data was collected using Pentium 450 MHz laptop computers running Windows 2000 with a 16-bit data size and sampling rate of 22,050 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was then played back for review, allowing the utterance to be re-recorded. A member of the data collection team was on hand during the recording session to verify recordings and provide technical assistance in case of malfunctioning equipment.

*Samples* For an example of the data in this corpus, please listen to this sample (WAV).

*Updates* None at this time.

Format: Sampling Rate: 22050

Sampling Format: pcm
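
The Format values above (22,050 Hz sampling rate, PCM, with the 16-bit sample size given in the Description) can be checked against a local audio file. The following is a minimal sketch using Python's standard `wave` module. It assumes the audio is available as RIFF WAV files, which this record does not confirm for the full distribution; the function names `describe_wav` and `matches_record` are illustrative and not part of any LDC tooling.

```python
import wave

EXPECTED_RATE_HZ = 22050    # "Sampling Rate: 22050" in the Format field
EXPECTED_SAMPLE_BYTES = 2   # 16-bit samples, per the Description


def describe_wav(path):
    """Read the header of a PCM WAV file and return its basic properties."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "rate_hz": rate,
            "sample_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            # Duration follows from frame count divided by frames per second.
            "seconds": frames / rate if rate else 0.0,
        }


def matches_record(info):
    """True when rate and sample size agree with the values in this record."""
    return (
        info["rate_hz"] == EXPECTED_RATE_HZ
        and info["sample_bytes"] == EXPECTED_SAMPLE_BYTES
    )
```

Calling `matches_record(describe_wav(path))` on each file is one way to catch resampled or otherwise converted copies before they are used for acoustic model training. The channel count is reported but not checked, because the record does not state it.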

Identifier: LDC2005S28

https://catalog.ldc.upenn.edu/LDC2005S28

ISBN: 1-58563-359-3

ISLRN: 531-836-688-808-6

DOI: 10.35111/e542-fj42

Language: Croatian

Language (ISO639): hrv

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005S28

Rights Holder: Portions © 2000-2001 United States Military Academy, © 2005 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Text

Type (OLAC): lexicon

primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005S28

DateStamp: 2022-01-20

GetRecord: OAI-PMH request for simple DC format
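
The GetRecord entries in this record refer to OAI-PMH requests, which are ordinary HTTP requests carrying `verb`, `identifier`, and `metadataPrefix` parameters. The sketch below shows how such request URLs could be built for this record's identifier. The base URL is a placeholder, since the record does not state the repository endpoint. The prefix `oai_dc` is the one OAI-PMH defines for simple Dublin Core, and `olac` is assumed here as the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): this record does not give the base URL.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2005S28"  # OaiIdentifier above


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record and one format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


olac_url = get_record_url(OAI_IDENTIFIER, "olac")     # OLAC format
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")     # simple DC format
```

`urlencode` percent-encodes the colons in the identifier, which keeps the query string valid. Substituting the repository's actual base URL for the placeholder is the only change needed before sending the request.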

Search Info
Citation: LaRocca, Stephen A.; Tomei, Christine; Sokolich, Milan. 2005. Linguistic Data Consortium.
Terms: area_Europe country_HR dcmi_Sound dcmi_Text iso639_hrv olac_lexicon olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005S28
Up-to-date as of: Wed Oct 29 7:00:52 EDT 2025
