OLAC Record
oai:www.ldc.upenn.edu:LDC2010T07

Metadata
Title:Chinese Treebank 7.0
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Xue, Nianwen, et al. Chinese Treebank 7.0 LDC2010T07. Web Download. Philadelphia: Linguistic Data Consortium, 2010
Contributor:Xue, Nianwen
Jiang, Zixin
Zhong, Xiuhong
Palmer, Martha
Xia, Fei
Chiou, Fu-Dong
Chang, Meiyu
Date (W3CDTF):2010
Date Issued (W3CDTF):2010-11-16
Description:*Introduction*
Chinese Treebank 7.0, Linguistic Data Consortium (LDC) catalog number LDC2010T07 and ISBN 1-58563-542-1, consists of over one million words of annotated and parsed text from Chinese newswire, magazine news, various broadcast news and broadcast conversation programs, web newsgroups and weblogs.
The Chinese Treebank project began at the University of Pennsylvania in 1998, continued at the University of Colorado and is now at Brandeis University. The project's goal is to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus.
The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated words from Xinhua News Agency (Xinhua) newswire. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11), which consisted of approximately 100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, the LDC published the 500,000-word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 adds new annotated newswire data, broadcast material and web text to this effort.
*Data*
This release consists of 2,448 text files, 51,447 sentences, 1,196,329 words and 1,931,381 hanzi (Chinese characters). The data is provided in UTF-8 encoding, and the annotation uses Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging and bracketing guidelines. The data is provided in four formats: raw text, word-segmented, word-segmented and POS-tagged, and syntactically bracketed.
Chinese Treebank 7.0 includes text from the following genres and sources:
Genre                     # words  Sources
Newswire                  260,164  Agence France Presse, China News Service, Guangming Daily, Peoples Daily, Xinhua News Agency
News Magazine             256,305  Sinorama
Broadcast News            287,442  China Broadcasting System, China Central TV, China National Radio, China Television System, New Tang Dynasty TV, Phoenix TV, Voice of America
Broadcast Conversation    184,161  Anhui TV, China Central TV, CNN, MSNBC, New Tang Dynasty TV, Phoenix TV
Newsgroups, Weblogs       208,257
Total                   1,196,329
*Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-0022. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
*Samples*
For an example of the data in this corpus, please review the sample file.
*Updates*
No updates have been issued as of this time.
Contact: ldc@ldc.upenn.edu
© 2010 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.
Extent:Corpus size: 153600 KB
Identifier:LDC2010T07
https://catalog.ldc.upenn.edu/LDC2010T07
ISBN: 1-58563-542-1
ISLRN: 156-627-429-482-3
DOI: 10.35111/gs8j-tv18
Language:Mandarin Chinese
Language (ISO639):cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2010T07
Rights Holder: Portions © 2006 Agence France Presse, © 2006 Anhui TV, © 2005 Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System, © 2000-2001, 2005-2006 China Central TV, © 2000-2001 China National Radio, © 2006 Chinanews.com, © 2000-2001 China Television System, © 2006 Guangming Daily, © 2006 National Broadcasting Company, Inc., © 2006 New Tang Dynasty TV, © 2006 Peoples Daily Online, © 2005-2006 Phoenix TV, © 1999-2001 Sinorama Magazine, © 1996-1998, 2006 Xinhua News Agency, © 2001, 2004, 2005, 2007, 2009, 2010 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text
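
The Description above says the annotation uses Penn Treebank-style labeled brackets. As a rough illustration of what reading such a bracketing can look like, the Python sketch below turns one bracketed string into nested lists and collects its (POS tag, word) pairs. The sample sentence, its labels and all function names are made up for illustration and are not taken from the corpus; the actual file layout is defined by the guidelines shipped with the release.

```python
def tokenize(text):
    """Split a bracketed string into '(' , ')' and label/word tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Build nested lists from the token stream using an explicit stack."""
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])          # open a new constituent
        elif tok == ")":
            node = stack.pop()        # close it and attach to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(tok)     # a label or a word
    return stack[0]


def leaves(tree):
    """Yield (pos_tag, word) pairs from preterminal nodes like ['NN', 'word']."""
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        yield (tree[0], tree[1])
    else:
        for child in tree:
            if isinstance(child, list):
                yield from leaves(child)


# Hypothetical example string, not a sentence from the corpus.
SAMPLE = "(IP (NP-SBJ (NR 中国)) (VP (VV 发展) (NP-OBJ (NN 经济))))"

tagged = list(leaves(parse(tokenize(SAMPLE))))
print(tagged)
```

The stack-based parser assumes balanced brackets and whitespace-free words; a production reader would also need to handle file headers and any empty-category markers the release may contain.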

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2010T07
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
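
The GetRecord entries above refer to OAI-PMH requests for this record. A minimal Python sketch of how such a request URL is assembled follows. The verb, identifier and metadataPrefix parameters come from the OAI-PMH protocol, and the identifier is the OaiIdentifier listed above. The base URL is a placeholder assumption, since the record does not state the repository endpoint, and the "olac" prefix is assumed to be the one the repository uses for OLAC-format records.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2010T07"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Return an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "oai_dc" is the simple Dublin Core prefix every OAI-PMH repository supports;
# "olac" is assumed here for the OLAC format.
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
print(dc_url)
print(olac_url)
```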

Search Info

Citation: Xue, Nianwen; Jiang, Zixin; Zhong, Xiuhong; Palmer, Martha; Xia, Fei; Chiou, Fu-Dong; Chang, Meiyu. 2010. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010T07
Up-to-date as of: Thu Oct 24 7:30:28 EDT 2024