OLAC Record: HKUST Mandarin Telephone Transcript Data, Part 1

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T32

Metadata

Title: HKUST Mandarin Telephone Transcript Data, Part 1

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Fung, Pascale, Shudong Huang, and David Graff. HKUST Mandarin Telephone Transcript Data, Part 1 LDC2005T32. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Fung, Pascale; Huang, Shudong; Graff, David

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-07-15

Description: *Introduction* HKUST Mandarin Telephone Transcript Data, Part 1 was developed by the Hong Kong University of Science and Technology (HKUST) and contains transcripts for 897 telephone conversations in Mandarin Chinese.

In 2004 HKUST was contracted to collect and transcribe 200 hours of Mandarin Chinese conversational telephone speech from Mandarin speakers in mainland China under the DARPA EARS framework. The first 50 hours of speech and transcripts were released in June 2004 to the EARS community for the NIST RT-04 evaluation. NIST partitioned the remaining 150 hours into training, development, and evaluation sets. This release contains the training and development sets, with 873 and 24 calls, respectively.

Subjects were recruited in several cities across mainland China. Most subjects did not previously know each other. To encourage more meaningful conversation, topics similar to those in Fisher English were designed. For every call, an operator phoned the two participants at the scheduled time to initiate the call. Subjects were asked demographic questions before being bridged for normal conversation, and their demographic responses were recorded as separate files. Subjects' birthplaces were divided into Mandarin-dominant and non-Mandarin-dominant regions, and all calls were audited and classified as either standard or accented, without further distinctions. Selected demographics are provided as a tab-delimited, plain-text file.

Subjects were allowed to talk for up to 10 minutes. With few exceptions, calls run the maximum length. Although subjects were allowed to make up to three calls, every subject in this release made only one call, with one exception: PIN 10683 and PIN 10686 belong to a single individual. The corresponding speech files for these transcripts are available in HKUST Mandarin Telephone Speech, Part 1 (LDC2005S15).

*Data* Each call side was recorded in a separate .wav file as 8-bit a-law samples at 8 kHz. The two sides were later multiplexed into a single SPHERE file with the a-law encoding preserved. Where one side was shorter than the other, the shorter side was padded with silence. In the release, the file name of each recorded call has the format "date_time_Apin_Bpin.sph", and the corresponding transcript has the same name with a .txt extension.

All calls were fully transcribed from beginning to end. Transcripts use standard simplified Chinese characters encoded in GBK (CP936). Speech is segmented at natural boundaries wherever possible, and each segment is no more than 10 seconds long. HKUST formulated its transcription guidelines based on LDC's RT-03 transcription guidelines. For more information, refer to "trans-guidelines.pdf" included in the release.

The transcripts provided by HKUST were XML-formatted, with each side of a call in a separate file. LDC multiplexed the two sides into a single file with turns interleaved in temporal order (based on their initial time stamps) and converted the result to the LDC format. All transcripts were checked against RT-04 formatting standards. The Chinese text is not segmented into words, though occasional white space appears within some turns.

*Samples* For an example of the data in this corpus, see the transcript sample (TXT).

*Updates* None at this time.
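The file-naming and encoding conventions described under *Data* can be illustrated with a short Python sketch. It is a minimal example, not part of the release: it assumes the four underscore-separated fields implied by "date_time_Apin_Bpin", treats the A and B prefixes as literal side markers, and leaves the date, time, and PIN fields as uninterpreted strings because the record does not specify their exact layout. The names `CallName`, `parse_call_name`, and `decode_transcript` are hypothetical, as is the file name used in the usage note.

```python
from pathlib import Path
from typing import NamedTuple


class CallName(NamedTuple):
    """Fields recovered from a "date_time_Apin_Bpin" file name (hypothetical type)."""
    date: str
    time: str
    pin_a: str
    pin_b: str


def parse_call_name(filename: str) -> CallName:
    """Split a call or transcript file name into its four fields.

    Assumption: fields are separated by underscores, and the two PIN
    fields carry a literal "A" or "B" prefix marking the call side.
    """
    stem = Path(filename).stem  # drops ".sph" or ".txt"
    parts = stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields: {filename!r}")
    date, time, side_a, side_b = parts
    if not side_a.startswith("A") or not side_b.startswith("B"):
        raise ValueError(f"missing A/B side markers: {filename!r}")
    # Strip the side marker and keep the PIN as a string.
    return CallName(date, time, side_a[1:], side_b[1:])


def decode_transcript(raw: bytes) -> str:
    """Decode transcript bytes, which the record states are GBK (CP936)."""
    return raw.decode("gbk")
```

For instance, `parse_call_name("20040101_120000_A01234_B05678.txt")` yields the fields `"20040101"`, `"120000"`, `"01234"`, and `"05678"`. A transcript could then be read with `decode_transcript(Path(name).read_bytes())`.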

Extent: Corpus size: 11264 KB

Identifier: LDC2005T32

https://catalog.ldc.upenn.edu/LDC2005T32

ISBN: 1-58563-352-6

ISLRN: 254-896-342-130-6

DOI: 10.35111/t0sj-tn09

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005T32

Rights Holder: Portions © 2005 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T32

DateStamp: 2021-07-19

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Fung, Pascale; Huang, Shudong; Graff, David. 2005. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T32
Up-to-date as of: Wed Oct 29 7:00:51 EDT 2025
