OLAC Record: CALLHOME Mandarin Chinese Transcripts

OLAC Record
oai:www.ldc.upenn.edu:LDC2008T17

Metadata

Title: CALLHOME Mandarin Chinese Transcripts - XML version

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: McEnery, Tony, and Richard Xiao. CALLHOME Mandarin Chinese Transcripts - XML version LDC2008T17. Web Download. Philadelphia: Linguistic Data Consortium, 2008

Contributor: McEnery, Tony

Contributor: Xiao, Richard

Date (W3CDTF): 2008

Date Issued (W3CDTF): 2008-09-15

Description: *Introduction* CALLHOME Mandarin Chinese Transcripts - XML Version, Linguistic Data Consortium (LDC) catalog number LDC2008T17 and ISBN 1-58563-485-9, was developed at Lancaster University, United Kingdom. LDC's CALLHOME Mandarin Chinese collection includes telephone speech, associated transcripts and a lexicon. CALLHOME Mandarin Chinese Speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese. All calls, which lasted up to thirty minutes, originated in North America and were placed to locations overseas; most participants called family members or close friends. CALLHOME Mandarin Chinese Transcripts covers a contiguous five- or ten-minute segment from each of the telephone speech files. The transcripts are in tab-delimited format with GB2312 encoding, are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. CALLHOME Mandarin Chinese Lexicon comprises over 40,000 words from twenty CALLHOME Mandarin transcripts. CALLHOME Mandarin Chinese Transcripts - XML Version, the latest addition to this collection, presents the entire original corpus of 120 transcripts in XML format with UTF-8 encoding, retokenization and part-of-speech (POS) tagging. The retokenization and POS information were supplied using the Chinese Lexical Analysis System (ICTCLAS) developed by the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. ICTCLAS aims to incorporate Chinese word segmentation, POS tagging, disambiguation and unknown-word recognition into a single theoretical framework using multi-layered hierarchical hidden Markov models. In addition to the original applications for Mandarin Chinese CALLHOME data (e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts - XML Version will be useful in the grammatical study of spoken Mandarin.
*Data* This XML corpus retains all of the linguistic analyses (e.g., timestamps, spoken features and proper nouns) from the original transcripts release, but the mnemonics used in the original release were migrated into XML markup following a set of mapping rules. All analyses in the original release were retained at some cost to tokenization and part-of-speech tagging accuracy (e.g., some mnemonics encoding spoken features may split a word, which can affect the tagging accuracy). However, the results of the automated processing were substantially post-edited. For example, four aspect markers in Chinese (-le, -guo, -zhe and zai) were disambiguated and corrected by hand, and all of the classifiers (also called "measure words") were re-tagged using a more fine-grained annotation scheme developed in the Lancaster University project. In addition, a large number of obvious typographical errors in the original release were corrected in the process of post-editing. Number of unique words: 6,895. Total number of words: 300,767.
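
The description notes that the original transcripts are tab-delimited GB2312 text, while the XML version is UTF-8. As a rough illustration of that encoding difference only, the following Python sketch re-encodes GB2312 bytes as UTF-8 and splits a line on tabs. The function names and the sample line (its columns and values) are hypothetical and are not taken from the corpus; the actual field layout is defined by the LDC documentation.

```python
from typing import List


def gb2312_to_utf8(raw: bytes, source_encoding: str = "gb2312") -> bytes:
    """Decode bytes from the source encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


def split_fields(line: str) -> List[str]:
    """Split one tab-delimited line into its fields, dropping the line ending."""
    return line.rstrip("\r\n").split("\t")


# Hypothetical sample line (start time, end time, speaker, text); the real
# column layout of the CALLHOME transcripts is NOT asserted here.
sample = "0.00\t1.50\tA\t你好\n".encode("gb2312")

utf8_bytes = gb2312_to_utf8(sample)
fields = split_fields(utf8_bytes.decode("utf-8"))
print(fields)
```

If a file contains characters outside strict GB2312, passing `source_encoding="gb18030"` (a superset of GB2312) may be more tolerant; whether that is needed for this corpus is not stated in the record.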

Extent: Corpus size: 8714 KB

Identifier: LDC2008T17

Identifier: https://catalog.ldc.upenn.edu/LDC2008T17

ISBN: 1-58563-485-9

ISLRN: 741-988-462-570-4

DOI: 10.35111/c4fh-9d64

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2008T17

Rights Holder: Portions © 2004-2008 Lancaster University, © 1996, 2008 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2008T17

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
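
The GetRecord entries above refer to OAI-PMH requests whose link targets are not preserved in this record. As a minimal sketch of how such a request is typically formed, the Python snippet below builds a GetRecord URL from the standard OAI-PMH parameters (`verb`, `identifier`, `metadataPrefix`). The base URL is a placeholder assumption, not the archive's actual endpoint, and the helper name is hypothetical; `oai_dc` is the standard prefix for simple Dublin Core and `olac` is the prefix used for the OLAC format.

```python
from urllib.parse import urlencode


def build_getrecord_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint: an assumption for illustration, NOT the real base URL.
BASE_URL = "https://example.org/oai"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2008T17"

dc_url = build_getrecord_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
olac_url = build_getrecord_url(BASE_URL, OAI_IDENTIFIER, "olac")
print(dc_url)
print(olac_url)
```

The identifier is percent-encoded by `urlencode`, so the colons in the OAI identifier appear as `%3A` in the resulting query string.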

Search Info
Citation: McEnery, Tony; Xiao, Richard. 2008. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008T17
Up-to-date as of: Wed Oct 29 7:01:04 EDT 2025
