OLAC Record: Chinese Treebank 6.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2007T36

Metadata

Title: Chinese Treebank 6.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Palmer, Martha, et al. Chinese Treebank 6.0 LDC2007T36. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Palmer, Martha

Xue, Nianwen

Xia, Fei

Chiou, Fu-Dong

Jiang, Zixin

Chang, Meiyu

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-12-20

Description:

*Introduction*

This file contains documentation for Chinese Treebank 6.0, Linguistic Data Consortium (LDC) catalog number LDC2007T36 and ISBN 1-58563-450-6.

The Chinese Treebank project began at the University of Pennsylvania in 1998 and continues at Penn and the University of Colorado. Chinese Treebank 6.0 is the latest version produced from this effort, consisting of 780,000 words (over 1.28 million Chinese characters) that are segmented, part-of-speech tagged and fully bracketed. The data sources include newswire from Xinhua News Agency, articles from Sinorama Magazine, news from the website of the Hong Kong Special Administrative Region and transcripts from various broadcast news programs.

The LDC published Chinese Treebank 1.0 in 2000; it was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11) and consisted of approximately 100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, the LDC published the 500,000 word Chinese Treebank 5.0 (LDC2005T01). For information about Chinese Treebank methodology and guidelines, consult the attached documentation files and the Chinese Treebank Project website.

This release encompasses 2,036 text files, containing 28,295 sentences, 781,351 words and 1,285,149 hanzi (Chinese characters). The data is provided in two encodings, GBK and UTF-8, and the annotation has Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging and bracketing guidelines. The data is provided in four different formats: raw text, word segmented, word segmented and POS-tagged, and syntactically bracketed.

*Samples*

For an example of the data in this publication, please examine this sample of the bracketed data.
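As a rough illustration of how the four formats named in the description relate to one another, the Python sketch below reads one Penn Treebank-style bracketed tree and derives the other three views from it. The sample sentence is invented for this example and is not taken from the corpus. The function names `parse_bracketed` and `tagged_words` are hypothetical. The `word_TAG` rendering of the POS-tagged view is an assumption about presentation, so the enclosed guidelines remain the reference for the actual file layout. The sketch also assumes a single well-formed tree with no empty outer label.

```python
# Illustrative sentence only; NOT an excerpt from Chinese Treebank 6.0.
SAMPLE = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"


def parse_bracketed(text):
    """Parse one labeled-bracket tree into nested (label, children) tuples."""
    # Pad the parentheses with spaces so that split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])   # word under a POS tag
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens after the tree")
    return tree


def tagged_words(tree):
    """Return (word, tag) pairs in sentence order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
        else:
            pairs.append((child, label))
    return pairs


tree = parse_bracketed(SAMPLE)
pairs = tagged_words(tree)

raw = "".join(word for word, _ in pairs)                  # raw text
segmented = " ".join(word for word, _ in pairs)           # word segmented
tagged = " ".join("%s_%s" % (w, t) for w, t in pairs)     # segmented + POS

print(raw)
print(segmented)
print(tagged)
```

Starting from the bracketed form is a convenience for the sketch: it is the richest of the four views, so the segmented and tagged views fall out by flattening the tree, and the raw text by joining the words without spaces.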

Extent: Corpus size: 118784 KB

Identifier: LDC2007T36

https://catalog.ldc.upenn.edu/LDC2007T36

ISBN: 1-58563-450-6

ISLRN: 616-484-921-813-1

DOI: 10.35111/bfb8-gt03

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007T36

Rights Holder: Portions © 2000-2001 China Broadcasting System, © 2000-2001 China Central TV, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 1997 The Government of the Hong Kong Special Administrative Region, © 1996-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, © 2001, 2004, 2005, 2007 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007T36

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info

Citation: Palmer, Martha; Xue, Nianwen; Xia, Fei; Chiou, Fu-Dong; Jiang, Zixin; Chang, Meiyu. 2007. Linguistic Data Consortium.

Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007T36
Up-to-date as of: Wed Oct 29 7:01:00 EDT 2025
