OLAC Record

Title:CALLHOME Mandarin Chinese Transcripts - XML version
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:McEnery, Tony, and Richard Xiao. CALLHOME Mandarin Chinese Transcripts - XML version LDC2008T17. Web Download. Philadelphia: Linguistic Data Consortium, 2008
Contributor:McEnery, Tony
Contributor:Xiao, Richard
Date (W3CDTF):2008
Date Issued (W3CDTF):2008-09-15
Description:*Introduction* CALLHOME Mandarin Chinese Transcripts - XML Version, Linguistic Data Consortium (LDC) catalog number LDC2008T17 and ISBN 1-58563-485-9, was developed at Lancaster University, United Kingdom. LDC's CALLHOME Mandarin Chinese collection includes telephone speech, associated transcripts and a lexicon. CALLHOME Mandarin Chinese Speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese. All calls, which lasted up to thirty minutes, originated in North America and were placed to locations overseas; most participants called family members or close friends. CALLHOME Mandarin Chinese Transcripts covers a contiguous five- or ten-minute segment from each of the telephone speech files. The transcripts are in tab-delimited format with GB2312 encoding, are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. CALLHOME Mandarin Chinese Lexicon comprises over 40,000 words from twenty CALLHOME Mandarin transcripts. CALLHOME Mandarin Chinese Transcripts - XML Version, the latest addition to this collection, presents the entire original corpus of 120 transcripts in XML format with UTF-8 encoding, retokenization and part-of-speech (POS) tagging. The retokenization and POS information were supplied using the Chinese Lexical Analysis System (ICTCLAS) developed by the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. ICTCLAS aims to incorporate Chinese word segmentation, POS tagging, disambiguation and unknown-word recognition into a single theoretical framework using multi-layered hierarchical hidden Markov models. In addition to the original applications for Mandarin Chinese CALLHOME data (e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts - XML Version will be useful in the grammatical study of spoken Mandarin.
*Data* This XML corpus retains all of the linguistic analyses (e.g., timestamps, spoken features and proper nouns) from the original transcripts release, but the mnemonics used in the original release were migrated into XML markup following a set of mapping rules. All analyses in the original release were retained at the cost of some tokenization and part-of-speech tagging accuracy (e.g., some mnemonics encoding spoken features may split a word, which can affect the tagging accuracy). However, the results of the automated processing were substantially post-edited. For example, four aspect markers in Chinese (-le, -guo, -zhe and zai) were disambiguated and corrected by hand, and all of the classifiers (also called "measure words") were re-tagged using a more fine-grained annotation scheme developed in the Lancaster University project. In addition, a large number of obvious typographical errors in the original release were corrected in the process of post-editing. Number of unique words: 6,895. Total number of words: 300,767.
Extent:Corpus size: 8714 KB
ISBN: 1-58563-485-9
ISLRN: 741-988-462-570-4
DOI: 10.35111/c4fh-9d64
Language:Mandarin Chinese
Language (ISO639):cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2008T17
Rights Holder:Portions © 2004-2008 Lancaster University, © 1996, 2008 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text


Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2008T17
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: McEnery, Tony; Xiao, Richard. 2008. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

Up-to-date as of: Tue May 7 7:24:55 EDT 2024