OLAC Record
oai:www.ldc.upenn.edu:LDC2003T05

Metadata
Title:English Gigaword
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Graff, David, and Christopher Cieri. English Gigaword LDC2003T05. Web Download. Philadelphia: Linguistic Data Consortium, 2003
Contributor:Graff, David
Cieri, Christopher
Date (W3CDTF):2003
Date Issued (W3CDTF):2003-01-28
Description:*Introduction*
English Gigaword was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T05 and ISBN 1-58563-260-0, and is distributed on DVD. It is a comprehensive archive of English newswire text data that the LDC has acquired over several years. Four distinct international sources of English newswire are represented here:
Agence France Presse English Service (afe)
Associated Press Worldstream English Service (apw)
The New York Times Newswire Service (nyt)
The Xinhua News Agency English Service (xie)

*Data*
Much of the content in this collection has been published previously by the LDC in a variety of other, older corpora, particularly the North American News text corpora (LDC95T21, LDC98T30), the various TDT corpora and the AQUAINT text corpus (LDC2002T31). However, a significant amount of material is being released here for the first time: all of the Agence France Presse content, the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from February 2001 forward.

Each data file name consists of the three-letter source prefix, followed by a six-digit date (representing the year and month during which the file contents were delivered by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). Each file therefore contains all the usable data received by LDC for the given month from the given news source.

All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file which is provided as part of this publication. Please follow this link for a sample file.
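The file naming scheme described above can be sketched in Python. This is only an illustration, not part of the LDC distribution: the function and class names are hypothetical, the example name "afe199405.gz" is made up for the test, and the sketch assumes the six digits are ordered as four-digit year followed by two-digit month.

```python
import re
from typing import NamedTuple

# Three-letter prefixes listed in the description above.
SOURCES = {
    "afe": "Agence France Presse English Service",
    "apw": "Associated Press Worldstream English Service",
    "nyt": "The New York Times Newswire Service",
    "xie": "The Xinhua News Agency English Service",
}

# Assumption: the six-digit date is YYYYMM (year first, then month).
_NAME_RE = re.compile(r"^([a-z]{3})(\d{4})(\d{2})\.gz$")


class DataFile(NamedTuple):
    """Hypothetical container for the parts of a data file name."""
    source: str
    year: int
    month: int


def parse_data_file_name(name: str) -> DataFile:
    """Split a name such as 'afe199405.gz' into source prefix, year and month."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized data file name: {name!r}")
    prefix = match.group(1)
    year = int(match.group(2))
    month = int(match.group(3))
    if prefix not in SOURCES:
        raise ValueError(f"unknown source prefix: {prefix!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return DataFile(prefix, year, month)
```

Under these assumptions, parse_data_file_name("nyt200102.gz") would yield source "nyt", year 2001 and month 2.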
The markup structure, common to all data files, can be summarized as follows:
The Headline element is optional; not all DOCs have one.
The Dateline element is optional; not all DOCs have one.
Paragraph tags are only used if the "type" attribute of the DOC is "story".
Note that all data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.

For this release, all sources have received uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types". The classification is indicated by the type="string" attribute that is included in each opening DOC tag. The four types are: story, multi, advis and other.

Statistics regarding the quantities of data for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are not compressed (i.e. nearly 12 gigabytes in total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.

Source  #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFE         44      417     1216   170969   656269
APW         91     1213     3647   539665  1477466
NYT         96     2104     5906   914159  1298498
XIE         83      320      940   131711   679007
TOTAL      314     4054    11709  1756504  4111240

*Updates*
There are no updates available at this time.
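As a rough illustration of how the token counts and DOC type categories described above could be computed, here is a minimal Python sketch. It is not the procedure LDC used: the function names are hypothetical, tags are stripped with a simple regular expression rather than an SGML parser, and the exact attribute layout of the opening DOC tag is assumed only to contain type="..." somewhere inside it.

```python
import gzip
import re
from collections import Counter

# Assumption: every SGML tag fits the simple pattern <...> with no nested ">".
TAG_RE = re.compile(r"<[^>]+>")
# Picks up the type="..." attribute from each opening DOC tag.
DOC_TYPE_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]*)"')


def read_data_file(path: str) -> str:
    """Decompress one gzip-compressed data file and return its ASCII text."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return handle.read()


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens after all tags are removed."""
    return len(TAG_RE.sub(" ", sgml_text).split())


def count_doc_types(sgml_text: str) -> Counter:
    """Tally DOC units by the value of their "type" attribute."""
    return Counter(DOC_TYPE_RE.findall(sgml_text))
```

Applied to the text of a whole data file, count_tokens would give a raw token count (the table above reports such counts in thousands), and count_doc_types would give the number of DOC units per type, for example how many are "story" versus "advis".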
Extent:Corpus size: 4089446 KB
Identifier:LDC2003T05
https://catalog.ldc.upenn.edu/LDC2003T05
ISBN: 1-58563-260-0
ISLRN: 953-543-425-922-6
DOI: 10.35111/0z6y-q265
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2003T05
Rights Holder:Portions © 1994-1997 and 2001-2002 Agence France-Presse, © 1994-2002 Associated Press, © 1994-2002 New York Times, © 1995-2001 Xinhua News Agency, © 2002 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2003T05
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Graff, David; Cieri, Christopher. 2003. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2003T05
Up-to-date as of: Mon Mar 25 7:19:39 EDT 2024