OLAC Record

Title:Chinese Treebank 5.0
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Palmer, Martha, et al. Chinese Treebank 5.0 LDC2005T01. Web Download. Philadelphia: Linguistic Data Consortium, 2005
Contributor:Palmer, Martha
Chiou, Fu-Dong
Xue, Nianwen
Lee, Tsan-Kuang
Date (W3CDTF):2005
Date Issued (W3CDTF):2005-01-15
Description:*Introduction*
Chinese Treebank 5.0 was produced by the Linguistic Data Consortium (LDC); its catalog number is LDC2005T01 and its ISBN is 1-58563-323-2.
The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000, and a corrected version was released in 2001 as Chinese Treebank 2.0. Another updated version was released in 2004 as Chinese Treebank 4.0. More information about the project is available on the Chinese Treebank website.
The content used in this corpus comes from the following newswire sources:
698 articles: Xinhua (1994-1998)
55 articles: Information Services Department of HKSAR (1997)
132 articles: Sinorama magazine, Taiwan (1996-1998 & 2000-2001)
*Data*
Chinese Treebank 5.0 contains 507,222 words, 824,983 Hanzi, 18,782 sentences, and 890 data files. All files are GB encoded. The format of Chinese Treebank 5.0 is the same as that of the Penn English Treebank.
All files have been annotated at least twice. The first pass was done by one annotator, and the resulting files were checked by a second annotator (second pass). Some files were also double-blind annotated and then adjudicated to create gold standard files.
The corpus provides four versions of the files: bracketed, raw, segmented and postagged. The raw, segmented and postagged versions are generated from the bracketed version and so do not reflect the previous annotation stages. The bracketed files are named chtb_nnnn.fid, where nnnn is a sequential file number.
*Samples*
To see an example of a gold standard file, please examine this sample.
*Updates*
The 5.1 update contains corrections to errors found in the earlier version. Specifically, sentences that had more than one top-level node have been modified. Additionally, some GB-encoded white spaces have been converted to ASCII.
The 5.1 package is available as an additional download to all those who have licensed CTB5.0.
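The file conventions described above (chtb_nnnn.fid names, GB encoding, Penn Treebank-style bracketing) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the `gb18030` codec is chosen only because it is a superset of the GB2312 family (the record says just "GB encoded"), and the sample tree is made up rather than taken from the corpus. The parser also assumes the text contains nothing but brackets, labels and words; any other markup in the real files would have to be stripped first.

```python
import re

def bracketed_name(n):
    """Build a file name following the chtb_nnnn.fid pattern."""
    return f"chtb_{n:04d}.fid"

def decode_gb(raw):
    """Decode GB-encoded bytes (assumption: gb18030 covers the files' encoding)."""
    return raw.decode("gb18030")

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def parse_trees(text):
    """Parse Penn-style bracketed text into nested lists, one per top-level tree."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced closing bracket")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced opening bracket")
    return stack[0]

def tagged_words(node):
    """Yield (tag, word) pairs from preterminal nodes such as ['NR', '上海']."""
    if len(node) == 2 and all(isinstance(x, str) for x in node):
        yield (node[0], node[1])
        return
    for child in node:
        if isinstance(child, list):
            yield from tagged_words(child)

# Made-up example tree, not taken from the corpus.
sample = "( (IP (NP (NR 上海)) (VP (VV 发展))) )"
trees = parse_trees(sample)
print(bracketed_name(1), list(tagged_words(trees)))
```

Note that any empty-category nodes would be yielded like ordinary words by `tagged_words`; filtering them is left out of this sketch.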
ISBN: 1-58563-323-2
ISLRN: 426-628-131-806-1
Language:Mandarin Chinese
Language (ISO639):cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2005T01
Rights Holder: Portions © 1994-1998 Xinhua News Agency, © 1996-2001 Sinorama Magazine, © 1997 The Government of the Hong Kong Special Administrative Region, © 2001, 2004, 2005 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text


Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2005T01
DateStamp:  2020-03-06
GetRecord:  OAI-PMH request for simple DC format
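
The GetRecord links above correspond to OAI-PMH requests. As a rough sketch, such a request URL can be assembled from the OaiIdentifier in this record. The base URL below is a hypothetical placeholder, since this record does not state the repository endpoint, and the helper name is likewise made up. `oai_dc` is the metadata prefix that OAI-PMH defines for simple Dublin Core; the `olac` prefix for the OLAC format is an assumption here.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

def get_record_url(identifier, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"

# Simple DC request for this record; pass metadata_prefix="olac" (assumed) for OLAC format.
print(get_record_url("oai:www.ldc.upenn.edu:LDC2005T01"))
```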

Search Info

Citation: Palmer, Martha; Chiou, Fu-Dong; Xue, Nianwen; Lee, Tsan-Kuang. 2005. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

Up-to-date as of: Sun Aug 2 15:58:54 EDT 2020