OLAC Record: Chinese Treebank 8.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2013T21

Metadata

Title: Chinese Treebank 8.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Xue, Nianwen, et al. Chinese Treebank 8.0 LDC2013T21. Web Download. Philadelphia: Linguistic Data Consortium, 2013

Contributor: Xue, Nianwen

Zhang, Xiuhong

Jiang, Zixin

Palmer, Martha

Xia, Fei

Chiou, Fu-Dong

Chang, Meiyu

Date (W3CDTF): 2013

Date Issued (W3CDTF): 2013-11-15

Description: *Introduction* Chinese Treebank 8.0 consists of approximately 1.5 million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups and weblogs. The Chinese Treebank project began at the University of Pennsylvania in 1998, continued at the University of Colorado and then moved to Brandeis University. The project goal is to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus.

The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated words from Xinhua News Agency newswire. A corrected version was released in 2001 as Chinese Treebank 2.0 (LDC2001T11), again with approximately 100,000 words. LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, LDC published the 500,000-word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 (LDC2010T08), released in 2010, added new annotated newswire data, broadcast material and web text, bringing the total to approximately one million words. Chinese Treebank 8.0 adds new annotated data from newswire, magazine articles and government documents.

*Data* There are 3,007 text files in this release, containing 71,369 sentences, 1,620,561 words and 2,589,848 characters (hanzi or foreign). The data is provided in UTF-8 encoding, and the annotation uses Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the segmentation, POS-tagging and bracketing guidelines included in this release. The data is provided in four formats: raw text, word-segmented, POS-tagged and syntactically bracketed. All files were automatically verified and manually checked.

*Samples* Samples are available in each format: POS Tagged, Raw Text, Word Segmented, Syntactically Bracketed.

*Sponsorship* This work was supported in part by the Defense Advanced Research Projects Agency GALE Program Grant No. HR0011-06-0022 and BOLT Program No. HR0011-11-C-0145. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

*Updates* None at this time.

Extent: Corpus size: 94208 KB

Identifier: LDC2013T21

https://catalog.ldc.upenn.edu/LDC2013T21

ISBN: 1-58563-661-4

ISLRN: 860-172-183-494-4

DOI: 10.35111/wygn-4f57

Language: Mandarin Chinese

Chinese

Language (ISO639): cmn

zho

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2013T21

Rights Holder: Portions © 2006 Agence France Presse, © 2006 Anhui TV, © 2005 Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System, © 2000-2001, 2005-2006 China Central TV, © 2000-2001 China National Radio, © 2006 Chinanews.com, © 2000-2001 China Television System, © 2006 Guangming Daily, © 2006 National Broadcasting Company, Inc., © 2006 New Tang Dynasty TV, © 2006 Peoples Daily Online, © 2005-2006 Phoenix TV, © 1996-2001 Sinorama Magazine, © 1997 The Government of the Hong Kong Special Administrative Region, © 1994-1998, 2006 Xinhua News Agency, © 2001, 2004, 2005, 2007, 2009, 2010, 2013 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2013T21

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
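
The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. As a rough illustration, such a request URL could be assembled as in the sketch below. The base URL is a hypothetical placeholder (this record does not state the endpoint), and the prefixes `olac` and `oai_dc` are assumed to correspond to the "OLAC format" and "simple DC format" named above.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real OAI-PMH endpoint is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str, base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# The identifier is taken from the OaiIdentifier field of this record.
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2013T21", "olac")
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2013T21", "oai_dc")
```

Note that `urlencode` percent-encodes the colons in the identifier, which is what a query string requires.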

Search Info
Citation: Xue, Nianwen; Zhang, Xiuhong; Jiang, Zixin; Palmer, Martha; Xia, Fei; Chiou, Fu-Dong; Chang, Meiyu. 2013. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn iso639_zho olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2013T21
Up-to-date as of: Wed Oct 29 7:00:26 EDT 2025
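
The Description above states that the syntactic annotation uses Penn Treebank-style labeled brackets. The sketch below shows one minimal way to read such a bracketing into a nested structure and recover (word, POS tag) pairs. The sample sentence is an invented toy example, not text from the corpus, and the function names are illustrative. Files in the actual release may contain additional markup (for example document wrappers or empty categories) that this sketch does not handle.

```python
def tokenize(text):
    """Split a bracketed string into '(' , ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(text):
    """Parse one bracketed tree into (label, children) tuples.

    Children are either nested (label, children) tuples or word strings.
    """
    tokens = tokenize(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return node()


def leaves(tree):
    """Return (word, POS tag) pairs in sentence order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))
        else:
            pairs.extend(leaves(child))
    return pairs


# Invented toy bracketing for illustration only.
sample = "(IP (NP (NN 经济)) (VP (VV 发展)))"
tree = parse(sample)
print(leaves(tree))
```

Joining the words from `leaves` with spaces gives a word-segmented view, and pairing each word with its tag gives a POS-tagged view, which is how the four formats listed in the Description relate to one another.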
