OLAC Record: Mandarin Chinese Phonetic Segmentation and Tone

OLAC Record
oai:www.ldc.upenn.edu:LDC2015S05

Metadata

Title: Mandarin Chinese Phonetic Segmentation and Tone

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Yuan, Jiahong, Neville Ryant, and Mark Liberman. Mandarin Chinese Phonetic Segmentation and Tone LDC2015S05. Web Download. Philadelphia: Linguistic Data Consortium, 2015

Contributor: Yuan, Jiahong

Ryant, Neville

Liberman, Mark

Date (W3CDTF): 2015

Date Issued (W3CDTF): 2015-04-20

Description: *Introduction* Mandarin Chinese Phonetic Segmentation and Tone was developed by the Linguistic Data Consortium (LDC) and contains 7,849 Mandarin Chinese "utterances" with their phonetic segmentation and tone labels, separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA. The ability to use large speech corpora for research in phonetics, sociolinguistics and psychology, among other fields, depends on the availability of phonetic segmentation and transcriptions. This corpus was developed to investigate the use of phone boundary models in forced alignment of Mandarin Chinese. Using embedded tone modeling (an approach also used to incorporate tones in automatic speech recognition), forced-alignment performance was compared between tone-dependent and tone-independent models.

*Data* Utterances were defined as the time-stamped between-pause units in the transcribed news recordings. Utterances with background noise, music, unidentified speakers or accented speakers were excluded. A test set was built from 300 utterances randomly selected from six speakers (50 utterances per speaker). The remaining 7,549 utterances formed the training set. The utterances in the test set were manually labeled and segmented into initials and finals in Pinyin, a Roman alphabet system for transcribing Chinese characters. Tones were marked on the finals: Tone1 through Tone4, plus Tone0 for the neutral tone. Sandhi Tone3 was labeled as Tone2. The training set was automatically segmented and transcribed using the LDC forced aligner, a Hidden Markov Model (HMM) aligner trained on the same utterances (Yuan et al. 2014). On the test set, 93.1% of the aligner's phone boundaries fell within 20 ms of the manual segmentation. The quality of the phonetic transcription and tone labels of the training set was evaluated by checking 100 utterances randomly selected from it. The 100 utterances contained 1,252 syllables: 15 syllables had mistaken tone transcriptions, two syllables had mistaken transcriptions of the final, and no syllables had transcription errors on the initial. Each utterance has three associated files: a FLAC-compressed WAV file, a word transcript file, and a phonetic boundaries and label file.

*Samples* Please view this audio sample, transcript sample and phonetic labels sample.

*Acknowledgement* This work was supported in part by National Science Foundation Grant No. IIS-0964556.

*Updates* None at this time.

*Additional Licensing Instructions* This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.
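The two evaluation figures in the description (boundary agreement within 20 ms, and error counts over 1,252 checked syllables) can be illustrated with a minimal Python sketch. It is not the LDC scoring procedure, which this record does not specify. The function name `boundary_agreement`, the one-to-one pairing of manual and automatic boundaries, and the sample times are all assumptions made for illustration; only the 20 ms tolerance and the syllable counts come from the description.

```python
from typing import Sequence


def boundary_agreement(
    manual: Sequence[float],
    automatic: Sequence[float],
    tolerance: float = 0.020,  # 20 ms, the tolerance quoted in the description
) -> float:
    """Return the fraction of boundaries placed within `tolerance` seconds.

    Assumption: both sequences list the same phone boundaries in the same
    order, so boundary i in `manual` corresponds to boundary i in `automatic`.
    """
    if len(manual) != len(automatic):
        raise ValueError("boundary lists must have the same length")
    if not manual:
        raise ValueError("boundary lists must not be empty")
    hits = sum(1 for m, a in zip(manual, automatic) if abs(m - a) <= tolerance)
    return hits / len(manual)


# Hypothetical boundary times in seconds (not taken from the corpus).
manual_times = [0.10, 0.25, 0.40, 0.62]
aligner_times = [0.11, 0.24, 0.45, 0.62]
agreement = boundary_agreement(manual_times, aligner_times)

# Error rates implied by the counts in the description.
syllables_checked = 1252
tone_error_pct = round(100 * 15 / syllables_checked, 2)
final_error_pct = round(100 * 2 / syllables_checked, 2)

print(f"boundary agreement: {agreement:.2%}")
print(f"tone errors: {tone_error_pct}%  final errors: {final_error_pct}%")
```

In the hypothetical example, three of the four boundaries fall within the tolerance. The counts in the description work out to roughly 1.2% of checked syllables with a tone error and about 0.16% with an error on the final.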

Extent: Corpus size: 667448 KB

Identifier: LDC2015S05

https://catalog.ldc.upenn.edu/LDC2015S05

ISBN: 1-58563-710-6

ISLRN: 567-512-470-543-8

DOI: 10.35111/djnc-2014

Language: Mandarin Chinese

Language (ISO639): cmn

License: Mandarin Chinese Phonetic Segmentation and Tone User Agreement: https://catalog.ldc.upenn.edu/license/mandarin-chinese-phonetic-segmentation-and-tone-user-agreement.pdf

Medium: Distribution: Web Download

Provenance: Collected by the Linguistic Data Consortium (LDC) in Philadelphia, PA, USA.

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2015S05

Rights Holder: Portions © 1997 China Central TV, © 1997 MultiCultural Broadcasting Corporation, © 1997, 1998, 2007, 2015 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2015S05

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Yuan, Jiahong; Ryant, Neville; Liberman, Mark. 2015. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Sound dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2015S05
Up-to-date as of: Fri Aug 8 0:28:30 EDT 2025
