OLAC Record: English Gigaword Second Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T12

Metadata

Title: English Gigaword Second Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David, et al. English Gigaword Second Edition LDC2005T12. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Graff, David

Kong, Junbo

Chen, Ke

Maeda, Kazuaki

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-07-15

Description: *Introduction*

English Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC) and contains approximately 2.3 billion words of English newswire text that LDC acquired over several years. This edition includes all of the contents of the first edition, English Gigaword (LDC2003T05), as well as new data from July 2002 through December 2004 from all four sources in the first edition and from a new source, the Central News Agency of Taiwan, English Service. This second edition also adds a three-letter language code to the source abbreviations and makes minor formatting improvements (mostly line-wrapping).

*Data*

The table below lists the five distinct international sources of English newswire included in this release, with the breakdown of their contents in numbers of documents and K-words (thousands of words):

Source                                          Abbreviation  Documents    K-words
Agence France-Presse, English Service           afp_eng       1,202,139    337,792
Associated Press Worldstream, English Service   apw_eng       1,975,456    736,518
Central News Agency of Taiwan, English Service  cna_eng          57,999     15,039
The New York Times Newswire Service             nyt_eng       1,446,256  1,026,533
The Xinhua News Agency, English Service         xin_eng       1,017,150    201,346
Totals                                                        5,699,000  2,317,228

All the data is organized into zipped files. All text data is presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace.

Documents are sorted into four different types:

* Story: a report composed of paragraphs and full sentences; the most common type
* Multi: unrelated "blurbs" of several news items
* Advis: advisories directed at news editors and not intended for publication or a general audience
* Other: intended for publication but not made up of paragraphs or sentences, such as lists of sports scores, stock prices, or temperatures around the world

*Samples*

For an example of the data in this corpus, please view this sample (SGML).

*Updates*

None at this time.
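
As a rough illustration of how the four document types described above might be tallied, here is a minimal Python sketch. It assumes each document is wrapped in a `<DOC id="..." type="...">` ... `</DOC>` element with a lowercase type value (story, multi, advis, other); this record does not spell out the markup, so treat the tag and attribute names as assumptions to check against the corpus documentation. The sample text and its `XXX_ENG` identifiers are invented for the example and are not corpus data.

```python
import re
from collections import Counter

# Assumed markup (not confirmed by this record): one <DOC id="..." type="..."> per document.
DOC_RE = re.compile(
    r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>(.*?)</DOC>',
    re.DOTALL,
)


def iter_docs(sgml_text):
    """Yield (doc_id, doc_type, body) for every <DOC> element found in the text."""
    for match in DOC_RE.finditer(sgml_text):
        yield match.group(1), match.group(2).lower(), match.group(3)


def count_types(sgml_text):
    """Count how many documents of each type appear in the text."""
    return Counter(doc_type for _, doc_type, _ in iter_docs(sgml_text))


# Hypothetical sample: identifiers and text are made up for illustration.
SAMPLE = """<DOC id="XXX_ENG_20040101.0001" type="story">
<HEADLINE>
Hypothetical headline
</HEADLINE>
<TEXT>
<P>
Hypothetical paragraph.
</P>
</TEXT>
</DOC>
<DOC id="XXX_ENG_20040101.0002" type="advis">
<TEXT>
Hypothetical advisory.
</TEXT>
</DOC>
"""

if __name__ == "__main__":
    for doc_type, n in sorted(count_types(SAMPLE).items()):
        print(doc_type, n)
```

A regular expression is used rather than an XML parser because SGML of this kind is not guaranteed to be well-formed XML; for the full corpus, the files would first need to be decompressed and read one at a time.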

Identifier: LDC2005T12

https://catalog.ldc.upenn.edu/LDC2005T12

ISBN: 1-58563-350-X

ISLRN: 274-788-133-216-1

DOI: 10.35111/stcf-4x49

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005T12

Rights Holder: Portions © 1994-1997 and 2001-2004 Agence France-Presse, © 1994-2004 Associated Press, © 1997-2004 Central News Agency of Taiwan, © 1994-2004 New York Times, © 1995-2004 Xinhua News Agency, © 2005 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T12

DateStamp: 2021-11-12

GetRecord: OAI-PMH request for simple DC format
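
The GetRecord entries above correspond to OAI-PMH requests, which take a `verb`, an `identifier`, and a `metadataPrefix` as query parameters. The sketch below only builds such a request URL and does not fetch anything. The base URL is a hypothetical placeholder, since the archive's actual OAI-PMH endpoint is not given in this record, and the prefix `olac` for the OLAC format is an assumption (`oai_dc` is the standard prefix for simple Dublin Core).

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2005T12"
    print(get_record_url(oai_id, "olac"))    # OLAC format (assumed prefix)
    print(get_record_url(oai_id, "oai_dc"))  # simple DC format
```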

Search Info
Citation: Graff, David; Kong, Junbo; Chen, Ke; Maeda, Kazuaki. 2005. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T12
Up-to-date as of: Wed Oct 29 7:00:26 EDT 2025
