OLAC Record: English Gigaword

OLAC Record
oai:www.ldc.upenn.edu:LDC2003T05

Metadata

Title: English Gigaword

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David, and Christopher Cieri. English Gigaword LDC2003T05. Web Download. Philadelphia: Linguistic Data Consortium, 2003

Contributor: Graff, David

Contributor: Cieri, Christopher

Date (W3CDTF): 2003

Date Issued (W3CDTF): 2003-01-28

Description: *Introduction*

English Gigaword was produced by the Linguistic Data Consortium (LDC) and contains approximately 1.8 billion words of English news text. It is a comprehensive archive of English newswire text that LDC has acquired over several years. Four distinct international sources of English newswire are represented:

* Agence France-Presse English Service (AFE)
* Associated Press Worldstream English Service (APW)
* The New York Times Newswire Service (NYT)
* The Xinhua News Agency English Service (XIE)

*Data*

Much of the content in this collection has been published previously by LDC in other, older corpora, particularly the North American News Text Corpus (LDC95T21), the North American News Text Supplement (LDC98T30), the various TDT corpora, and the AQUAINT Corpus of English News Text (LDC2002T31). A significant amount of material, however, is released here for the first time: all of the Agence France-Presse content, the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from February 2001 forward.

The file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file that is provided as part of this publication. There are 314 files, totaling approximately 4 GB in compressed form (12 GB uncompressed).

The table below lists, for each source, the number of files, K-words (thousands of words), and number of documents.

Source   #Files     K-words      #DOCs
AFE          44     170,969    656,269
APW          91     539,665  1,477,466
NYT          96     914,159  1,298,498
XIE          83     131,711    679,007
TOTAL       314   1,756,504  4,111,240

For this release, all sources have received uniform quality-control treatment, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types":

* story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
* advis: These are DOCs that the news service addresses to news editors; they are not intended for publication to the "end users."
* other: These DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on.

*Samples*

Please view the following sample: Text Sample

*Updates*

There are no updates available at this time.
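The description says that the files are gzip-compressed SGML and that each DOC unit carries one of four types. The sketch below shows one way to tally those types while streaming a file. It is not part of the publication. It assumes that each document opens with a tag of the form `<DOC id="..." type="story">`; the record itself does not spell out the attribute names, so check them against the DTD shipped with the corpus. The function names are illustrative.

```python
import gzip
import re
from collections import Counter

# Assumption: a document opens with <DOC ... type="..."> on a single line.
# The attribute name "type" is not stated in the record; verify it in the DTD.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"', re.IGNORECASE)


def count_doc_types(lines):
    """Tally DOC types from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        match = DOC_OPEN.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def count_doc_types_in_file(path):
    """Stream one gzip-compressed file; the text is described as printable ASCII."""
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        return count_doc_types(handle)
```

Reading line by line keeps memory use flat, which matters for a collection described as roughly 12 GB uncompressed. A regular expression is enough for counting; for anything that depends on document structure, a real SGML parser such as the nsgmls utility named above is the safer choice.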

Extent: Corpus size: 4089446 KB

Identifier: LDC2003T05

https://catalog.ldc.upenn.edu/LDC2003T05

ISBN: 1-58563-260-0

ISLRN: 953-543-425-922-6

DOI: 10.35111/0z6y-q265

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2003T05

Rights Holder: Portions © 1994-1997 and 2001-2002 Agence France-Presse, © 1994-2002 Associated Press, © 1994-2002 New York Times, © 1995-2001 Xinhua News Agency, © 2002 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2003T05

DateStamp: 2024-09-17

GetRecord: OAI-PMH request for simple DC format
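The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. As a minimal sketch, such a request URL can be assembled from the identifier listed under OAI Info. The base URL below is a hypothetical placeholder, since the record does not state the repository endpoint. `oai_dc` is the simple Dublin Core prefix defined by OAI-PMH; `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Identifier taken from the OaiIdentifier field of this record.
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2003T05", "olac")
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2003T05", "oai_dc")
```

`urlencode` percent-encodes the colons in the identifier, so the resulting URL is safe to pass to any HTTP client once `BASE_URL` is replaced with the real endpoint.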

Search Info
Citation: Graff, David; Cieri, Christopher. 2003. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2003T05
Up-to-date as of: Wed Oct 29 7:00:15 EDT 2025
