OLAC Record: Chinese Gigaword Third Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2007T38

Metadata

Title: Chinese Gigaword Third Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David. Chinese Gigaword Third Edition LDC2007T38. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Graff, David

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-08-17

Description:

*Introduction*

Chinese Gigaword Third Edition is a comprehensive archive of newswire text data that has been acquired over several years by the LDC. This edition includes all of the contents of Chinese Gigaword Second Edition (LDC2005T14) as well as new data collected after the publication of that edition. In addition, an archive of articles from a new newswire source (Agence France Presse) has been added in the third edition. The four distinct international sources of Chinese newswire included in this edition are the following:

* Agence France Presse (afp_cmn)
* Central News Agency, Taiwan (cna_cmn)
* Xinhua News Agency (xin_cmn)
* Zaobao Newspaper (zbn_cmn)

The seven-letter codes in the parentheses above are used for the directory names and data files for each source, and are also used (in ALL_CAPS) as part of the unique DOC "id" string assigned to each news article.

*Data*

The original data archives received by the LDC from Agence France Presse, Xinhua News Agency and Zaobao were encoded in GB-2312, whereas those from Central News Agency (CNA) were encoded in Big-5. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding.

*New in the Third Edition*

* Over six years' worth of articles (October 2000 through December 2006) from Agence France Presse are being released for the first time.
* Two years' worth of new articles (January 2005 through December 2006) have been added to the Xinhua data set.
* Nearly two years' worth of content was added to the CNA data set. There was a gap in the LDC's collection from this source during 2006: no CNA Chinese content was collected between July 27 and December 17, 2006, inclusive, so there are no data files for August through November of that year, and the December data file is about half its expected size.
* A small set of older stories (October through December 1998) has been added from Zaobao; these were previously published by LDC as part of TDT3 Multilanguage Text Version 2.0 (LDC2001T58) and are being included in Gigaword for the first time.

*Samples*

Please examine this sample (JPEG) for an example of the data in this corpus.

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
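
The source codes and the encoding conversion mentioned in the description can be illustrated with a minimal Python sketch. It is not the LDC's actual conversion procedure: the function names `to_utf8` and `doc_id_prefix` are hypothetical, the codec names are Python's standard `gb2312` and `big5`, and only the mapping of sources to original encodings is taken from the description. The full layout of a DOC "id" string is not given in this record, so only the upper-cased source code is derived.

```python
# Original encoding of each source archive, as stated in the description.
# Keys are the seven-letter source codes used for directories and data files.
SOURCE_ENCODINGS = {
    "afp_cmn": "gb2312",  # Agence France Presse
    "cna_cmn": "big5",    # Central News Agency, Taiwan
    "xin_cmn": "gb2312",  # Xinhua News Agency
    "zbn_cmn": "gb2312",  # Zaobao Newspaper
}


def to_utf8(raw: bytes, source_code: str) -> bytes:
    """Decode bytes from the source's original encoding and re-encode as UTF-8.

    Hypothetical helper: decoding is strict, so undecodable bytes raise
    UnicodeDecodeError instead of being silently replaced.
    """
    text = raw.decode(SOURCE_ENCODINGS[source_code])
    return text.encode("utf-8")


def doc_id_prefix(source_code: str) -> str:
    """Return the ALL_CAPS form of a source code, as used inside DOC ids."""
    return source_code.upper()


if __name__ == "__main__":
    sample = "新闻".encode("gb2312")  # stand-in for bytes read from an archive
    print(to_utf8(sample, "xin_cmn").decode("utf-8"))
    print(doc_id_prefix("xin_cmn"))
```

Because the released corpus is already UTF-8, a conversion step like this would only apply to the original archives, not to the distributed files.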

Extent: Corpus size: 2097152 KB

Identifier: LDC2007T38

https://catalog.ldc.upenn.edu/LDC2007T38

ISBN: 1-58563-455-7

ISLRN: 222-703-436-942-5

DOI: 10.35111/w035-1a74

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007T38

Rights Holder: Portions © 2000-2006 Agence France Presse, © 1991-2006 Central News Agency (Taiwan), © 1998, 2000-2003 SPH AsiaOne, Ltd., © 1990-2006 Xinhua News Agency, © 2003, 2005, 2007 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007T38

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
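
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is assembled, the Python below builds the query string from the OaiIdentifier. The base URL `https://example.org/oai` is a placeholder assumption, since the actual endpoint is not stated in this record, and `get_record_url` is a hypothetical helper name. The `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol, with `olac` selecting the OLAC format and `oai_dc` selecting simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): replace with the archive's real base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2007T38"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(get_record_url(BASE_URL, OAI_IDENTIFIER, "olac"))    # OLAC format
    print(get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc"))  # simple DC format
```

The colons in the identifier are percent-encoded by `urlencode`, which keeps the identifier intact as a single query parameter.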

Search Info
Citation: Graff, David. 2007. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007T38
Up-to-date as of: Wed Oct 29 7:01:00 EDT 2025
