OLAC Record: Tagged Chinese Gigaword

OLAC Record
oai:www.ldc.upenn.edu:LDC2007T03

Metadata

Title: Tagged Chinese Gigaword

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Huang, Chu-Ren. Tagged Chinese Gigaword LDC2007T03. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Huang, Chu-Ren

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-06-20

Description:

*Introduction*

Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan, is the part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition LDC2005T14. It contains all of the data in Chinese Gigaword Second Edition -- from Central News Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags.

To avoid problems or confusion that could result from differences in character-set specifications in the source data, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the readme file, all characters in the text are either single-byte ASCII or multi-byte Chinese.

All sources have been categorized into four distinct "types":

* story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
* advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
* other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.

*Data*

The table below lists the number of files, their compressed and uncompressed sizes, the number of words, and the number of documents, divided by source. #Files = number of files. Rzip-MB = compressed size in megabytes. Totl-MB = uncompressed size in megabytes. K-wrds = number of words in thousands. #DOCs = number of documents.

Source    #Files  Rzip-MB  Totl-MB   K-wrds    #DOCs
CNA_CMN      168      994     7363   792195  1769953
XIN_CMN      168      615     4535   471110   992261
ZBN_CMN       10       40      223    28066    41418
TOTAL        346     1648    12121  1291371  2803632

The following tables present the quantity of "K-wrds" and "#DOCs", divided by source and DOC type:

                  #DOCs   K-wrds
type="advis":
  CNA_CMN          8160      751
  XIN_CMN          6553      711
  ZBN_CMN             0        0
  TOTAL           14713     1462
type="multi":
  CNA_CMN         30552    23429
  XIN_CMN         11329     7516
  ZBN_CMN            55       41
  TOTAL           41936    30986
type="other":
  CNA_CMN        100758    40258
  XIN_CMN         31255     9999
  ZBN_CMN           279      130
  TOTAL          132292    50387
type="story":
  CNA_CMN       1630483   727748
  XIN_CMN        943132   452878
  ZBN_CMN         41084    27898
  TOTAL         2614691  1208524

The performance of the CKIP segmentation and POS tagging system was tested in Bakeoff 2005 and Bakeoff 2006. The test results are as follows:

              Doc#  RefWord#  TestWord#  MatchWord#  Recall (%)  Precision (%)  F-Score (%)
Bakeoff 2005   190    116509     116443      112091        96.2           96.3         96.2
Bakeoff 2006   148     90405      90327       87332        96.6           96.7         96.6

Note:
Recall = MatchWord# / RefWord#
Precision = MatchWord# / TestWord#
F-Score = 2 * Recall * Precision / (Recall + Precision)

*Samples*

For an example of the data contained in this corpus, please view this screen capture (jpg) of the annotated text.
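
The recall, precision, and F-score definitions given in the note to the Bakeoff results can be checked directly against the reported word counts. The following is a minimal Python sketch of that arithmetic; the function name `segmentation_scores` and the dictionary layout are illustrative assumptions, not part of the corpus documentation or the CKIP system.

```python
def segmentation_scores(ref_words, test_words, match_words):
    """Return (recall, precision, f_score) as percentages rounded to one decimal.

    Uses the definitions from the record's note:
      Recall    = MatchWord# / RefWord#
      Precision = MatchWord# / TestWord#
      F-Score   = 2 * Recall * Precision / (Recall + Precision)
    """
    recall = match_words / ref_words
    precision = match_words / test_words
    f_score = 2 * recall * precision / (recall + precision)
    return tuple(round(100 * value, 1) for value in (recall, precision, f_score))


# Counts copied from the Bakeoff table: (RefWord#, TestWord#, MatchWord#).
bakeoff_counts = {
    "Bakeoff 2005": (116509, 116443, 112091),
    "Bakeoff 2006": (90405, 90327, 87332),
}

for name, counts in bakeoff_counts.items():
    print(name, segmentation_scores(*counts))
```

Under these definitions the computed values agree with the percentages listed in the table for both evaluations.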

Extent: Corpus size: 2527068 KB

Identifier: LDC2007T03

https://catalog.ldc.upenn.edu/LDC2007T03

ISBN: 1-58563-409-3

ISLRN: 614-675-002-053-4

DOI: 10.35111/ckna-1h68

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007T03

Rights Holder: Portions © 2005-2007 Academia Sinica, © 1991-2004 Central News Agency (Taiwan), © 2000-2003 SPH AsiaOne, Ltd., © 1990-2004 Xinhua News Agency, © 2005, 2007 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007T03

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
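
The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. As a rough sketch, such a request is an HTTP GET with `verb`, `identifier`, and `metadataPrefix` query parameters. In the Python below, the base URL is a placeholder (the record does not state the actual endpoint), the helper name `get_record_url` is hypothetical, and `olac` as the prefix for the OLAC format is an assumption; `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption, to be replaced with the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2007T03"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record and one metadata format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


olac_url = get_record_url(OAI_IDENTIFIER, "olac")   # OLAC format (prefix assumed)
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")   # simple DC format

print(olac_url)
print(dc_url)
```

The sketch only constructs the URLs; fetching and parsing the XML response is left out because the endpoint is not given here.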

Search Info
Citation: Huang, Chu-Ren. 2007. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007T03
Up-to-date as of: Wed Oct 29 7:00:56 EDT 2025
