OLAC Record

Title:West Point Croatian Speech
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:LaRocca, Stephen A., Christine Tomei, and Milan Sokolich. West Point Croatian Speech LDC2005S28. Web Download. Philadelphia: Linguistic Data Consortium, 2005
Contributor:LaRocca, Stephen A.
Tomei, Christine
Sokolich, Milan
Date (W3CDTF):2005
Date Issued (W3CDTF):2005-10-15
Description:*Introduction* West Point Croatian Speech was developed by the Department of Foreign Languages (DFL) and Center for Technology Enhanced Language Learning (CTELL) and contains approximately 21 hours of read and free-response Croatian speech. The corpus was collected by staff and faculty of DFL and CTELL to develop acoustic models for speech recognition systems. The US government uses these systems to provide speech-recognition-enhanced language learning courseware to government linguists and students enrolled in various government language programs. In addition, parts of this corpus were designed to model question-answer dialogues for use in domain-specific speech-to-speech translation systems. It consists of two subcorpora collected in 2000 and 2001 in Zagreb, Croatia. Informants were recruited from the English department at the University of Zagreb and the Croatian Military Academy. The 2000 subcorpus consists entirely of read speech, while the 2001 subcorpus includes free-response answers to questions in addition to read speech. *Data* The read speech in the two subcorpora was elicited from two different prompt scripts. The scripts used to record read speech contain a total of 6,329 distinct sentences. Each informant in 2000 attempted to read 100 sentences from a total of 200 carefully designed sentences written by Dr. Christine Tomei. Informants in 2001 read short text passages extracted from Croatian language webpages. The script used to elicit free-response answers contains 143 questions. Each speaker in the 2001 subcorpus attempted to record 105 utterances by reading 75 sentences and giving 35 free-response answers to 35 questions. These recordings were transcribed by Milan Sokolich, who also wrote a pronunciation lexicon that includes grammatical tags. Speech data was recorded with a 16-bit sample size and a sampling rate of 22,050 Hz on 450 MHz Pentium laptop computers running Windows 2000. 
The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was then played back for review so that the utterance could be re-recorded. A member of the data collection team was on hand during the recording session to verify recordings and provide technical assistance in case of malfunctioning equipment. *Samples* For an example of the data in this corpus, please listen to this sample (WAV). *Updates* None at this time.
Format:Sampling Rate: 22050
Sampling Format: pcm
ISBN: 1-58563-359-3
ISLRN: 531-836-688-808-6
DOI: 10.35111/e542-fj42
Language (ISO639):hrv
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2005S28
Rights Holder: Portions © 2000-2001 United States Military Academy, © 2005 Trustees of the University of Pennsylvania
Type (DCMI):Sound
Type (OLAC):lexicon
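
The audio parameters given in this record (a 22,050 Hz sampling rate and PCM encoding in the Format field, and a 16-bit sample size in the Description) can be checked against a local audio file. The following is a minimal sketch only. It assumes the file is a RIFF WAV file readable by Python's standard wave module; the record does not state the container format or channel count of the corpus files, and the function name matches_record_format is hypothetical.

```python
import wave

EXPECTED_RATE_HZ = 22050    # "Sampling Rate: 22050" in the Format field
EXPECTED_SAMPLE_BYTES = 2   # 16-bit samples, as stated in the Description


def matches_record_format(path):
    """Return True if the WAV file at `path` has the sampling rate and
    sample width listed in this record.

    Hypothetical helper: it checks only rate and sample width, because
    the record does not specify the number of channels.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_BYTES
        )
```

The wave module raises wave.Error for files it cannot parse, so a caller may want to treat that exception as a format mismatch as well.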


Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2005S28
DateStamp:  2022-01-20
GetRecord:  OAI-PMH request for simple DC format
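
The GetRecord entries in this record correspond to OAI-PMH requests for the identifier above. As a rough sketch, such a request URL can be assembled from the identifier and a metadata prefix. The base URL below is a hypothetical placeholder, since this record does not give the repository's OAI-PMH endpoint. The prefixes "olac" and "oai_dc" are assumed to be the ones the repository uses for the OLAC and simple DC formats, and get_record_url is an illustrative name.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the record does not state the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2005S28"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (illustrative helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for OLAC format, "oai_dc" for simple Dublin Core.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
```

Substituting the repository's actual endpoint for BASE_URL would be needed before either URL could be fetched.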

Search Info

Citation: LaRocca, Stephen A.; Tomei, Christine; Sokolich, Milan. 2005. Linguistic Data Consortium.
Terms: area_Europe country_HR dcmi_Sound dcmi_Text iso639_hrv olac_lexicon olac_primary_text

Up-to-date as of: Tue May 7 7:24:45 EDT 2024