OLAC Record: Chinese Gigaword Second Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T14

Metadata

Title: Chinese Gigaword Second Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David, et al. Chinese Gigaword Second Edition LDC2005T14. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Graff, David

Chen, Ke

Kong, Junbo

Maeda, Kazuaki

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-08-17

Description: *Introduction*

Chinese Gigaword Second Edition, produced by the Linguistic Data Consortium (LDC), is a comprehensive archive of Chinese newswire text totalling approximately 1.3 billion words, acquired by LDC over several years. This edition includes the entire contents of the first release, Chinese Gigaword (LDC2003T09), plus data collected after that release: Xinhua from October 2002 through December 2004 and CNA from January 2003 through December 2004. A limited number of articles from a new newspaper source, Lianhe Zaobao, have also been added.

*Data*

The table below lists the three distinct international sources of Chinese newswire in this edition, with the number of documents and K-words (thousands of words) for each:

Source                         Abbreviation   Documents     K-words
Central News Agency, Taiwan    cna_cmn        1,769,952     792,195
Xinhua News Agency             xin_cmn          992,261     471,110
Zaobao Newspaper               zbn_cmn           41,418      28,066
Totals                                        2,803,632   1,291,371

The seven-character abbreviations represent both the source name and the language ID ("cmn" for Mandarin Chinese). The data are distributed as zipped, SGML-formatted text files, each containing multiple documents. Documents fall into four categories:

* Story: a report composed of paragraphs and full sentences; the most common type
* Multi: unrelated "blurbs" covering several news items
* Advis: advisories directed at news editors and not intended for publication or a general audience
* Other: material intended for publication but not written as paragraphs or sentences, such as lists of sports scores, stock prices, or temperatures around the world

*Samples*

For an example of the data in this corpus, please view this sample (SGML).

*Updates*

None at this time.
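The document categories named in the Description (Story, Multi, Advis, Other) could be tallied from an unzipped SGML file along the following lines. This is only a sketch: the record does not describe the markup, so the `<DOC id="..." type="...">` opening tag, the `count_doc_types` helper, and the sample text are all assumptions made for illustration and should be checked against the corpus documentation.

```python
import re
from collections import Counter

# ASSUMPTION: each document opens with a tag like <DOC id="..." type="story">.
# The record itself does not specify the markup; verify against the corpus docs.
DOC_TAG = re.compile(r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>', re.IGNORECASE)


def count_doc_types(sgml_text):
    """Return a Counter mapping lowercased document type to number of documents."""
    return Counter(match.group(2).lower() for match in DOC_TAG.finditer(sgml_text))


# Invented sample text, not taken from the corpus.
sample = """<DOC id="sample.0001" type="story">
<TEXT>
<P>...</P>
</TEXT>
</DOC>
<DOC id="sample.0002" type="advis">
<TEXT>
<P>...</P>
</TEXT>
</DOC>
<DOC id="sample.0003" type="story">
<TEXT>
<P>...</P>
</TEXT>
</DOC>
"""

counts = count_doc_types(sample)
for doc_type, n in sorted(counts.items()):
    print(doc_type, n)
```

A regular expression is used rather than an XML parser because SGML files of this kind are often not well-formed XML; if the actual files differ from the assumed tag layout, only `DOC_TAG` would need to change.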

Identifier: LDC2005T14

https://catalog.ldc.upenn.edu/LDC2005T14

ISBN: 1-58563-353-4

ISLRN: 292-607-460-859-8

DOI: 10.35111/vr0r-sb06

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005T14

Rights Holder: Portions © 1991-2004 Central News Agency, Taiwan, © 1990-2004 Xinhua News Agency, © 2000-2003 SPH AsiaOne, Ltd., © 2005 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T14

DateStamp: 2021-11-05

GetRecord: OAI-PMH request for simple DC format
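The GetRecord entries above refer to OAI-PMH requests, which are ordinary HTTP GET URLs carrying `verb`, `identifier`, and `metadataPrefix` parameters. The sketch below builds such a URL for this record. The base URL is a placeholder, since the record does not give the repository endpoint, and the `olac` prefix for the OLAC format is an assumption; `oai_dc` is the prefix OAI-PMH defines for simple Dublin Core. The helper name `get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# PLACEHOLDER: the record does not state the repository's OAI-PMH endpoint.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2005T14"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record and one format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" is assumed to be the prefix for the OLAC format; "oai_dc" is simple DC.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```

Fetching either URL from the real endpoint should return an XML response wrapping the record in the requested metadata format.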

Search Info
Citation: Graff, David; Chen, Ke; Kong, Junbo; Maeda, Kazuaki. 2005. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T14
Up-to-date as of: Wed Oct 29 7:00:27 EDT 2025
