OLAC Record: English Gigaword Third Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2007T07

Metadata

Title: English Gigaword Third Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David, et al. English Gigaword Third Edition LDC2007T07. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Graff, David

Kong, Junbo

Chen, Ke

Maeda, Kazuaki

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-05-17

Description: *Introduction*

The English Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the third edition of the English Gigaword Corpus.

This edition includes all of the contents of the previous edition (LDC2005T12) as well as new data from the same five sources presented there, covering the 24-month period of January 2005 through December 2006. A sixth data source (the Los Angeles Times/Washington Post newswire service) has also been added in this edition.

The six distinct international sources of English newswire included in this edition are the following:

* Agence France-Presse, English Service (afp_eng)
* Associated Press Worldstream, English Service (apw_eng)
* Central News Agency of Taiwan, English Service (cna_eng)
* Los Angeles Times/Washington Post Newswire Service (ltw_eng)
* New York Times Newswire Service (nyt_eng)
* Xinhua News Agency, English Service (xin_eng)

The seven-character codes in the parentheses above consist of a three-character source name abbreviation and the three-character language code ("eng"), separated by an underscore ("_") character. The three-letter language code conforms to LDC's internal convention based on the new ISO 639-3 standard. The seven-character codes are used both in the names of the directories where the data files are found and in the prefix that appears at the beginning of every data file name.

As with other Gigaword releases, some of the content in this corpus has been published previously by the LDC in a variety of other, older corpora, particularly the North American News text corpora, the various TDT corpora, and the AQUAINT text corpus, as well as earlier editions of Gigaword English.

*New in the Third Edition*

* New newswire data from January 2005 to December 2006 has been added for all five of the newswire sources that were represented in the previous edition.
* A new source, the Los Angeles Times/Washington Post newswire service, has been added.
* A small handful of corrections to older APW data have been made to remove a few non-English stories, clean up some character "noise", and rectify the encoding for a few non-ASCII characters.
* The CNA content introduced in Gigaword English 2nd Edition has been completely updated to repair data corruptions caused by occasional character encoding problems; as a result of the update, there may be differences in the inventory and/or ID strings of DOC elements in this portion of the corpus, relative to the previous edition. (The nature of the encoding problems is explained below under "SOURCE SPECIFIC PROPERTIES".)
* Many of the files (141 out of 722) include a small number of UTF-8 "wide" characters, typically accented letters found in proper names and borrowed words (some sources also use special punctuation marks, non-breaking spaces, etc.).

Apart from the replacement/update of all CNA files, the data content of the 2nd edition has been included in the present release without modification.

*Samples*

For an example of the data in this corpus, please review this text file.

*Update*

The New York Times newswire text archive in this corpus contains some articles in Spanish. A scan of the 149 monthly data files under "nyt_eng" yielded 2517 DOC elements with the 'type="story"' attribute where the story content was in Spanish. The scan also disclosed 421 DOC elements with the 'type="story"' attribute where the text content was in fact not a news story. Two files added to the online documentation for this corpus identify those occurrences:

* other.file-doc.map
* spanish.file-doc.map

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
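The source codes listed in the Description (three-character source abbreviation, underscore, three-character language code) can be split mechanically. The following is a minimal sketch, assuming only that a directory or file name starts with such a code; the helper name `parse_source_code` and the file name `nyt_eng_200501.gz` are hypothetical illustrations and are not taken from the corpus documentation.

```python
import re

# Source abbreviations and names as listed in the record's Description.
SOURCES = {
    "afp": "Agence France-Presse, English Service",
    "apw": "Associated Press Worldstream, English Service",
    "cna": "Central News Agency of Taiwan, English Service",
    "ltw": "Los Angeles Times/Washington Post Newswire Service",
    "nyt": "New York Times Newswire Service",
    "xin": "Xinhua News Agency, English Service",
}

# Three-character source, an underscore, then a three-character language code.
CODE_RE = re.compile(r"^([a-z]{3})_([a-z]{3})")


def parse_source_code(name):
    """Return (source, language, source_name) for a name that starts with a code."""
    match = CODE_RE.match(name)
    if match is None:
        raise ValueError(f"no source code prefix in {name!r}")
    source, language = match.groups()
    if source not in SOURCES:
        raise ValueError(f"unknown source abbreviation {source!r}")
    return source, language, SOURCES[source]


if __name__ == "__main__":
    # Hypothetical file name, used only to exercise the parser.
    print(parse_source_code("nyt_eng_200501.gz"))
```

Because the same code serves as both the directory name and the file-name prefix, one function can handle either; anything after the seventh character is ignored here.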

Identifier: LDC2007T07

https://catalog.ldc.upenn.edu/LDC2007T07

ISBN: 1-58563-416-6

ISLRN: 336-874-552-847-5

DOI: 10.35111/k4mz-9k30

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007T07

Rights Holder: Portions © 1994-2006 Agence France Presse, © 1994-2006 The Associated Press, © 1997-2006 Central News Agency (Taiwan), © 1994-1998, 2003-2006 Los Angeles Times-Washington Post News Service, Inc., © 1994-2006 New York Times, © 1995-2006 Xinhua News Agency, © 2007 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007T07

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
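An OAI-PMH GetRecord request is an HTTP query carrying the `verb`, `identifier`, and `metadataPrefix` arguments. The sketch below shows how such a URL could be assembled for this record's OaiIdentifier. The base URL `https://example.org/oai` is a placeholder, not the actual endpoint behind the links above, and `get_record_url` is a hypothetical helper; `oai_dc` is the standard prefix for simple Dublin Core, while the OLAC-format request would use a different prefix (assumed to be `olac`).

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder base URL; substitute the repository's real OAI-PMH endpoint.
url = get_record_url("https://example.org/oai", "oai:www.ldc.upenn.edu:LDC2007T07")
print(url)
```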

Search Info
Citation: Graff, David; Kong, Junbo; Chen, Ke; Maeda, Kazuaki. 2007. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007T07
Up-to-date as of: Wed Oct 29 7:00:58 EDT 2025
