OLAC Record: Chinese Gigaword

OLAC Record
oai:www.ldc.upenn.edu:LDC2003T09

Metadata

Title: Chinese Gigaword

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David, and Ke Chen. Chinese Gigaword LDC2003T09. Web Download. Philadelphia: Linguistic Data Consortium, 2003

Contributor: Graff, David

Contributor: Chen, Ke

Date (W3CDTF): 2003

Date Issued (W3CDTF): 2003-05-22

Description: *Introduction*

Chinese Gigaword was produced by the Linguistic Data Consortium (LDC) and contains approximately 1 billion words of Mandarin Chinese news text. This is a comprehensive archive of newswire text data that has been acquired from Chinese news sources by the LDC over several years.

Two distinct international sources of Chinese newswire are represented here:

* Central News Agency of Taiwan (CNA)
* Xinhua News Agency of Beijing (XIN)

Some of the Xinhua content in this collection has been published previously by LDC in other, older corpora, particularly Mandarin Chinese News Text (LDC95T13), TREC Mandarin (LDC2000T52), and the various TDT Multilanguage Text corpora. But all of the CNA data and a significant amount of Xinhua material are being released here for the first time.

*Data*

There are 286 files, totaling approximately 1.5 GB in compressed form (4 GB uncompressed). The table below presents the following categories of information: source of the data, number of files per source, K-words (thousands of words), and number of documents. The K-words figures count Chinese characters, since there is no notion of "space-separated word tokens" in Chinese.

Source   #Files   K-words     #DOCs
CNA      144      735,499     1,649,492
XIN      142      382,881     817,348
TOTAL    286      1,118,380   2,466,840

The original data archives received by LDC from Xinhua were encoded in GB-2312, whereas those from CNA were encoded in Big-5. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the README file, all characters in the text are either single-byte ASCII or multi-byte Chinese.

All text data are presented in SGML form, using a very simple, minimal markup structure. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file provided in the corpus.

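The encoding conversion and character-based counting described above can be illustrated with a minimal Python sketch. It is not LDC's actual tooling: the function names are hypothetical, the codec names `gb2312` and `big5` are Python's standard codecs for those encodings, and treating "Chinese character" as the CJK Unified Ideographs block (U+4E00 to U+9FFF) is an assumption made only for this example.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def count_chinese_chars(text: str) -> int:
    """Count characters in the CJK Unified Ideographs block (an assumed definition)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


# Illustrative input only: short strings encoded the way each source delivered data.
xinhua_raw = "新华社北京".encode("gb2312")  # Xinhua archives arrived as GB-2312
cna_raw = "中央社".encode("big5")           # CNA archives arrived as Big-5

xinhua_utf8 = to_utf8(xinhua_raw, "gb2312")
cna_utf8 = to_utf8(cna_raw, "big5")

# After conversion, both sources decode with the same codec.
print(xinhua_utf8.decode("utf-8"), cna_utf8.decode("utf-8"))
print(count_chinese_chars("新华社 ABC 北京"))
```

Counting characters rather than whitespace-separated tokens mirrors the reason the record gives for its K-words figures: Chinese text has no space-delimited words.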
Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs). All sources have received a uniform treatment in terms of quality control, and every DOC has been categorized into one of four distinct "types":

* story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
* advis: These are DOCs that the news service addresses to news editors; they are not intended for publication to the "end users."
* other: These DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on.

The general strategy for categorizing DOCs into these four classes was, for each source, to discover the most frequent clues in the text stream that correlated with the three "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."

*Samples*

For an example of the data in this corpus, please view this sample (TXT).

*Updates*

There are no updates at this time.
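The categorization strategy described under *Data* (look for known clues of the three "non-story" types and fall back to "story" when none is found) can be sketched as follows. This is only an illustration of that default-to-story logic: the clue patterns and the function name `classify_doc` are hypothetical, since the real clues were source-specific and are not listed in this record.

```python
import re

# Hypothetical clue patterns, one per "non-story" type. The actual clues used
# for CNA and Xinhua are not given in this record.
CLUES = [
    ("advis", re.compile(r"\badvisory\b|\bto editors\b", re.IGNORECASE)),
    ("multi", re.compile(r"\bnews briefs\b|\bsummaries of today", re.IGNORECASE)),
    ("other", re.compile(r"\bsports scores\b|\bstock prices\b|\btemperatures\b",
                         re.IGNORECASE)),
]


def classify_doc(text: str) -> str:
    """Return the first non-story type whose clue matches, else "story"."""
    for doc_type, pattern in CLUES:
        if pattern.search(text):
            return doc_type
    # No known clue was in evidence, so the DOC is treated as a story.
    return "story"


if __name__ == "__main__":
    for sample in (
        "News briefs in finance: markets opened higher.",
        "Advisory to editors: the following item is embargoed.",
        "World temperatures: Beijing 12, Taipei 25.",
        "The premier met visiting delegates on Tuesday.",
    ):
        print(classify_doc(sample), "<-", sample)
```

Making "story" the fallback means an unrecognized DOC is never left unclassified, at the cost of some non-story items being labeled "story" when their clues are unknown.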

Extent: Corpus size: 1572864 KB

Identifier: LDC2003T09

Identifier: https://catalog.ldc.upenn.edu/LDC2003T09

ISBN: 1-58563-230-9

ISLRN: 251-875-847-656-5

DOI: 10.35111/n069-0642

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2003T09

Rights Holder: Portions © 1991-2002 Central News Agency of Taiwan, © 1990-2002 Xinhua News Agency

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2003T09

DateStamp: 2024-09-13

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Graff, David; Chen, Ke. 2003. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2003T09
Up-to-date as of: Wed Oct 29 7:00:16 EDT 2025
