OLAC Record oai:www.ldc.upenn.edu:LDC2005S15 |
Metadata | ||
Title: | HKUST Mandarin Telephone Speech, Part 1 | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Fung, Pascale, Shudong Huang, and David Graff. HKUST Mandarin Telephone Speech, Part 1 LDC2005S15. Web Download. Philadelphia: Linguistic Data Consortium, 2005 | |
Contributor: | Fung, Pascale | |
Huang, Shudong | ||
Graff, David | ||
Date (W3CDTF): | 2005 | |
Date Issued (W3CDTF): | 2005-07-15 | |
Description: | *Introduction* HKUST Mandarin Telephone Speech, Part 1 was developed by Hong Kong University of Science and Technology (HKUST) and contains approximately 149 hours of conversational telephone speech (CTS) in Mandarin. Standard Mandarin is the official language of education but is not the native dialect in many regions of China, so speakers may or may not have a regional accent when speaking Mandarin. Subjects' birthplaces were therefore divided into Mandarin-dominant and non-Mandarin-dominant regions, and all calls were audited and classified as either standard or accented, without further distinctions.

In 2004, HKUST was contracted to collect and transcribe 200 hours of Mandarin Chinese CTS from Mandarin speakers in mainland China under the DARPA EARS framework. The first 50 hours of speech and transcripts were released in June 2004 to the EARS community for the RT-04 NIST evaluation. NIST partitioned the remaining 150 hours of the collection into training, development, and evaluation sets. This release contains the training and development sets. The corresponding transcripts for these speech files are available in HKUST Mandarin Telephone Transcript Data, Part 1 (LDC2005T32).

*Data* Subjects were recruited in several cities across mainland China. Most subjects did not previously know each other. To encourage more meaningful conversation, topics similar to those in Fisher English were designed. Each call was initiated by an automated operator that called the two participants at the scheduled time. Subjects were asked demographic questions (gender, age, native language/dialect, birthplace, education, occupation, phone type, etc.) before they were bridged for normal conversation. Their answers to selected demographic questions are part of the call list files in the corpus. Subjects were allowed to talk for up to 10 minutes. With a few exceptions, calls are of the maximum length.
Although subjects were allowed to make up to three calls, every subject in this release made just one call, with one exception: PIN 10683 and PIN 10686 belong to a single individual.

The quantities and gender distribution of the calls by set are:

Set | Calls | Hours | Males | Females
Training | 873 | 144.7 | 948 | 797
Development | 24 | 3.9 | 24 | 24
Totals | 897 | 148.6 | 972 | 821

Each call side was recorded in a separate .wav file as 8-bit a-law samples at 8 kHz. The two sides were later multiplexed into SPHERE format with the a-law encoding preserved. Where one side was shorter than the other, the shorter side was padded with silence. In the release, each recorded call has a file name of the form date_time_Apin_Bpin.sph, and the corresponding transcript in LDC2005T32 has the same name with a .txt extension.

*Samples* For an example of the data in this corpus, please listen to these audio samples: WAV or MP3.

*Updates* None at this time. | |
Format: | Sampling Rate: 8000 | |
Sampling Format: alaw | ||
Identifier: | LDC2005S15 | |
https://catalog.ldc.upenn.edu/LDC2005S15 | ||
ISBN: 1-58563-351-8 | ||
ISLRN: 964-004-555-226-5 | ||
DOI: 10.35111/rffd-da17 | ||
Language: | Mandarin Chinese | |
Language (ISO639): | cmn | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2005S15 | |
Rights Holder: | © 2005 Trustees of the University of Pennsylvania | |
Type (DCMI): | Sound | |
Type (OLAC): | primary_text | |
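The file naming scheme in the Description (date_time_Apin_Bpin.sph, with the transcript under the same name and a .txt extension) can be illustrated with a minimal sketch. It assumes the four fields are separated by underscores and that the PIN fields carry a literal "A" or "B" prefix. The record does not spell out the exact date, time, or PIN formats, so the sample name below is hypothetical, and `CallName`, `parse_call_name`, and `transcript_name` are illustrative names, not part of the corpus tooling.

```python
from typing import NamedTuple


class CallName(NamedTuple):
    """Fields recovered from a call file name (all kept as strings)."""
    date: str
    time: str
    pin_a: str
    pin_b: str


def parse_call_name(file_name: str) -> CallName:
    """Split a name of the form date_time_Apin_Bpin.sph (or .txt) into fields.

    Assumption: fields are underscore-separated and the PIN fields start
    with a literal 'A' or 'B' marking the call side.
    """
    stem, dot, ext = file_name.rpartition(".")
    if not dot or ext not in ("sph", "txt"):
        raise ValueError(f"unexpected extension in {file_name!r}")
    parts = stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields in {file_name!r}")
    date, time, side_a, side_b = parts
    if not side_a.startswith("A") or not side_b.startswith("B"):
        raise ValueError(f"missing A/B side prefix in {file_name!r}")
    return CallName(date, time, side_a[1:], side_b[1:])


def transcript_name(speech_file_name: str) -> str:
    """Return the transcript name: same stem, .txt extension."""
    stem, _, _ = speech_file_name.rpartition(".")
    return stem + ".txt"


# Hypothetical file name, made up for illustration only.
example = "20040101_120000_A000001_B000002.sph"
print(parse_call_name(example))
print(transcript_name(example))
```

Keeping the fields as strings avoids guessing at the date and time layout, which would need to be checked against the corpus documentation.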
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2005S15 | |
DateStamp: | 2022-01-20 | |
GetRecord: | OAI-PMH request for simple DC format | |
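The GetRecord entries above refer to OAI-PMH requests, which are ordinary HTTP queries carrying the `verb`, `identifier`, and `metadataPrefix` arguments. As a rough sketch, such a URL could be assembled as follows. The base URL is a placeholder, since the record does not state the repository endpoint. `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core, while `olac` as the prefix for the OLAC format is an assumption to confirm with the repository.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str,
                   metadata_prefix: str = "oai_dc",
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record identifier."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple DC request for this record's OAI identifier.
print(get_record_url("oai:www.ldc.upenn.edu:LDC2005S15"))
# OLAC-format request, assuming the repository uses the 'olac' prefix.
print(get_record_url("oai:www.ldc.upenn.edu:LDC2005S15", metadata_prefix="olac"))
```

`urlencode` percent-encodes the colons in the identifier, so the resulting URL is safe to pass to any HTTP client.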
Search Info | ||
Citation: | Fung, Pascale; Huang, Shudong; Graff, David. 2005. Linguistic Data Consortium. | |
Terms: | area_Asia country_CN dcmi_Sound iso639_cmn olac_primary_text |